Transsion Multilingual Speech Recognition System for the MLC-SLM 2025 Challenge

2508.14916v1

English Title#

Transsion Multilingual Speech Recognition System for MLC-SLM 2025 Challenge

English Abstract#

This paper presents the architecture and performance of a novel multilingual automatic speech recognition (ASR) system developed by the Transsion Speech Team for Track 1 of the MLC-SLM 2025 Challenge. The proposed system comprises three key components: 1) a frozen Whisper-large-v3 based speech encoder, which leverages large-scale pretraining to ensure robust acoustic feature extraction; 2) a trainable adaptor module that uses Linear-ReLU-Linear transformations to align speech and text representations; and 3) a frozen Qwen2.5-7B-Instruct large language model (LLM) integrated with trainable LoRA for optimized contextual linguistic decoding. By systematically combining pretrained models with task-specific fine-tuning, the system achieved a word/character error rate (WER/CER) of 9.83% across 11 languages on the evaluation set and ranked third among global participants.
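The abstract reports a combined WER/CER figure without spelling out how it is scored. As a minimal sketch, assuming the standard definition (Levenshtein edit distance between reference and hypothesis, divided by the reference length), the two metrics could be computed as below. The function names, the whitespace tokenization for WER, and the space-stripping for CER are illustrative assumptions; the challenge's actual text normalization and scoring script are not described in this page.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference tokens to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate; assumes whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate; assumes spaces are ignored."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


if __name__ == "__main__":
    print(wer("a b c d", "a x c"))
    print(cer("こんにちは", "こんにちわ"))
```

WER is typically used for languages with whitespace-delimited words, and CER for languages such as Japanese where word boundaries are not marked, which is presumably why the abstract reports a single mixed WER/CER number across the 11 languages.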
