Memory-Anchored Multimodal Reasoning for Explainable Video Forensics

2508.14581v1

English Title#

Memory-Anchored Multimodal Reasoning for Explainable Video Forensics

English Abstract#

We address multimodal deepfake detection that requires both robustness and interpretability by proposing FakeHunter, a unified framework that combines memory-guided retrieval, a structured Observation-Thought-Action reasoning loop, and adaptive forensic tool invocation. Visual representations from a Contrastive Language-Image Pretraining (CLIP) model and audio representations from a Contrastive Language-Audio Pretraining (CLAP) model retrieve semantically aligned authentic exemplars from a large-scale memory, providing contextual anchors that guide iterative localization and explanation of suspected manipulations. When internal confidence is low, the framework selectively triggers fine-grained analyses such as spatial region zoom and mel spectrogram inspection to gather discriminative evidence instead of relying on opaque marginal scores. We also release X-AVFake, a comprehensive audio-visual forgery benchmark with fine-grained annotations of manipulation type, affected region or entity, reasoning category, and explanatory justification, designed to stress contextual grounding and explanation fidelity. Extensive experiments show that FakeHunter surpasses strong multimodal baselines, and ablation studies confirm that both contextual retrieval and selective tool activation are indispensable for improved robustness and explanatory precision.
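The abstract describes two mechanisms that a small example can make concrete: retrieving the most similar authentic exemplars from a memory of embeddings, and invoking forensic tools only when confidence is low. The following is a minimal sketch of that idea, not the authors' implementation. The toy 3-dimensional vectors stand in for CLIP/CLAP embeddings, and the function names, tool names, and the 0.7 threshold are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, memory, k=2):
    """Return the k (name, embedding) memory entries most similar to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query, item[1]), reverse=True)
    return ranked[:k]


def choose_actions(confidence, threshold=0.7):
    """Hypothetical gate: call fine-grained tools only when confidence is below threshold."""
    if confidence >= threshold:
        return ["answer"]
    return ["zoom_region", "inspect_mel_spectrogram", "answer"]


# Toy memory of authentic exemplars; real embeddings would come from CLIP/CLAP encoders.
memory = [
    ("real_clip_a", [1.0, 0.0, 0.0]),
    ("real_clip_b", [0.9, 0.1, 0.0]),
    ("real_clip_c", [0.0, 1.0, 0.0]),
]
query = [1.0, 0.05, 0.0]

anchors = [name for name, _ in retrieve(query, memory, k=2)]
print(anchors)              # names of the two nearest authentic exemplars
print(choose_actions(0.4))  # low confidence, so tool calls precede the answer
```

In this sketch the retrieved exemplars would serve as the contextual anchors for the reasoning loop, and the gate decides whether a given step answers directly or gathers more evidence first. A real system would replace the linear scan with an approximate nearest-neighbor index once the memory grows large.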

Article Page#

Memory-Anchored Multimodal Reasoning for Explainable Video Forensics

PDF Access#

View Chinese PDF - 2508.14581v1
