Title#
RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores
Abstract#
Evaluating automatically generated radiology reports remains a fundamental challenge due to the lack of clinically grounded, interpretable, and fine-grained metrics. Existing methods either produce coarse overall scores or rely on opaque black-box models, limiting their usefulness in real-world clinical workflows. We introduce RadReason, a novel evaluation framework for radiology reports that not only outputs fine-grained sub-scores across six clinically defined error types, but also produces human-readable justifications that explain the rationale behind each score. Our method builds on Group Relative Policy Optimization and incorporates two key innovations: (1) Sub-score Dynamic Weighting, which adaptively prioritizes clinically challenging error types based on live F1 statistics; and (2) Majority-Guided Advantage Scaling, which adjusts policy gradient updates based on prompt difficulty derived from sub-score agreement. Together, these components enable more stable optimization and better alignment with expert clinical judgment. Experiments on the ReXVal benchmark show that RadReason surpasses all prior offline metrics and achieves parity with GPT-4-based evaluations, while remaining explainable, cost-efficient, and suitable for clinical deployment. Code will be released upon publication.
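The abstract names the two components but gives no formulas, so the following Python sketch is only one possible reading, not the authors' implementation. The group-relative normalization (reward minus group mean, divided by group standard deviation) follows the usual description of Group Relative Policy Optimization. Everything else is an assumption made for illustration: the `1 - F1` weighting rule, the `2 - agreement` scaling factor, its direction (larger updates for harder prompts), and all function and key names.

```python
from collections import Counter
from statistics import mean, pstdev


def dynamic_weights(f1_by_type, eps=1e-6):
    """Assumed rule: weight each error type by (1 - F1), normalized to sum to 1.

    Error types with a lower running F1 (harder ones) receive larger weights.
    """
    raw = {name: (1.0 - f1) + eps for name, f1 in f1_by_type.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def majority_agreement(sub_scores):
    """Fraction of sampled outputs that match the most common sub-score."""
    if not sub_scores:
        raise ValueError("sub_scores must not be empty")
    top_count = Counter(sub_scores).most_common(1)[0][1]
    return top_count / len(sub_scores)


def scale_advantages(advantages, agreement):
    """Assumed rule: lower agreement (harder prompt) gives a larger factor."""
    factor = 2.0 - agreement  # hypothetical choice, ranges over [1, 2)
    return [a * factor for a in advantages]
```

In use, `dynamic_weights` would be recomputed as the running F1 per error type changes during training, and `majority_agreement` would be evaluated over the sub-scores of all outputs sampled for one prompt before calling `scale_advantages`. How RadReason actually combines these quantities is not stated in the abstract.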