An Empirical Study on How Video-LLMs Answer Video Questions

2508.15360v1

Title (translated from Japanese)

An Empirical Study on How Video-LLMs Answer Video Questions

English Title

An Empirical Study on How Video-LLMs Answer Video Questions

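Abstract (translated from Japanese)

Leveraging large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have become markedly better at answering video questions. Most existing work, however, focuses on improving performance and pays limited attention to understanding the models' internal mechanisms. This paper aims to fill that gap through a systematic empirical study. To interpret existing Video-LLMs, we adopt attention knockout as the primary analysis tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. We then apply these three knockouts to different numbers of layers (a window of layers). By carefully controlling the layer window and the knockout type, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) the global setting shows that video information extraction occurs mainly in the early layers, forming a clear two-stage process in which lower layers focus on perceptual encoding and higher layers handle abstract reasoning; (2) in the fine-grained setting, certain intermediate layers have a disproportionate impact on video question answering, acting as critical outliers, while most other layers contribute little; (3) in both settings, spatial-temporal modeling relies more on language-guided retrieval than on intra-frame and inter-frame self-attention among video tokens, even though the latter is computationally expensive. Finally, we show that these insights can be used to reduce attention computation in Video-LLMs. To our knowledge, this is the first study to systematically reveal how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research.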

English Abstract

Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. To interpret existing Video-LLMs, we adopt attention knockouts as our primary analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. Then, we apply these three knockouts to different numbers of layers (a window of layers). By carefully controlling the window of layers and the types of knockouts, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) The global setting indicates that video information extraction primarily occurs in early layers, forming a clear two-stage process: lower layers focus on perceptual encoding, while higher layers handle abstract reasoning; (2) In the fine-grained setting, certain intermediate layers exert an outsized impact on video question answering, acting as critical outliers, whereas most other layers contribute minimally; (3) In both settings, we observe that spatial-temporal modeling relies more on language-guided retrieval than on intra- and inter-frame self-attention among video tokens, despite the latter's high computational cost. Finally, we demonstrate that these insights can be leveraged to reduce attention computation in Video-LLMs. To our knowledge, this is the first work to systematically uncover how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research.
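
The abstract names three attention knockouts but does not spell out their masks, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that Video Temporal Knockout cuts attention between video tokens of different frames, Video Spatial Knockout cuts attention between distinct video tokens of the same frame, and Language-to-Video Knockout stops language tokens from attending to video tokens. The function names `build_knockout_mask` and `window_layers`, the `frame_ids` encoding, and the choice to keep each token's attention to itself are all hypothetical choices made for illustration.

```python
from typing import List, Optional

KINDS = ("temporal", "spatial", "language_to_video")


def build_knockout_mask(frame_ids: List[Optional[int]], kind: str) -> List[List[bool]]:
    """Return allowed[q][k]: may query token q attend to key token k?

    frame_ids[i] is the frame index of token i if it is a video token,
    or None if it is a language token (assumed encoding).
    """
    if kind not in KINDS:
        raise ValueError(f"unknown knockout kind: {kind}")
    n = len(frame_ids)
    # Start from an ordinary causal mask: a token sees itself and earlier tokens.
    allowed = [[k <= q for k in range(n)] for q in range(n)]
    for q in range(n):
        for k in range(q):  # strictly earlier keys; the diagonal is kept
            q_video = frame_ids[q] is not None
            k_video = frame_ids[k] is not None
            if kind == "temporal":
                # Assumed: block video-to-video attention across frames.
                cut = q_video and k_video and frame_ids[q] != frame_ids[k]
            elif kind == "spatial":
                # Assumed: block video-to-video attention within one frame.
                cut = q_video and k_video and frame_ids[q] == frame_ids[k]
            else:
                # Assumed: block language queries from reading video keys.
                cut = (not q_video) and k_video
            if cut:
                allowed[q][k] = False
    return allowed


def window_layers(start: int, size: int, num_layers: int) -> List[int]:
    """Indices of the layers a knockout is applied to (clipped to the model depth)."""
    return list(range(start, min(start + size, num_layers)))
```

For example, with `frame_ids = [0, 0, 1, 1, None, None]` (two frames of two video tokens followed by two question tokens), the temporal variant blocks token 2 from attending to token 0, while the language-to-video variant blocks token 4 from attending to any of tokens 0 to 3. In a study of this kind, such a mask would be applied only in the layers returned by `window_layers`, with the ordinary causal mask used everywhere else.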

Article Page

An Empirical Study on How Video-LLMs Answer Video Questions

Get the PDF

View the Chinese PDF - 2508.15360v1
