Title#
Multiscale Video Transformers for Class Agnostic Segmentation in Autonomous Driving
Abstract#
Ensuring safety in autonomous driving is a complex challenge requiring handling unknown objects and unforeseen driving scenarios. We develop multiscale video transformers capable of detecting unknown objects using only motion cues. Video semantic and panoptic segmentation often relies on known classes seen during training, overlooking novel categories. Recent visual grounding with large language models is computationally expensive, especially for pixel-level output. We propose an efficient video transformer trained end-to-end for class-agnostic segmentation without optical flow. Our method uses multi-stage multiscale query-memory decoding and a scale-specific random drop-token to ensure efficiency and accuracy, maintaining detailed spatiotemporal features with a shared, learnable memory module. Unlike conventional decoders that compress features, our memory-centric design preserves high-resolution information at multiple scales. We evaluate on DAVIS'16, KITTI, and Cityscapes. Our method consistently outperforms multiscale baselines while being efficient in GPU memory and run-time, demonstrating a promising direction for real-time, robust dense prediction in safety-critical robotics.
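The abstract describes a scale-specific random drop-token used during multi-stage query-memory decoding, but no code is given here. Below is a minimal, hypothetical PyTorch sketch of one plausible reading of that idea: randomly subsampling the flattened spatiotemporal tokens at a single pyramid scale before the learned queries cross-attend to them. The class name, tensor shapes, and drop ratio are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch, NOT the authors' code: per-scale random token dropping
# before query-to-token cross-attention, as one way to realize the
# "scale-specific random drop-token" efficiency idea from the abstract.
import torch
import torch.nn as nn


class DropTokenCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, drop_ratio: float = 0.5):
        super().__init__()
        # Fraction of tokens randomly discarded at this scale (assumed value;
        # finer, larger scales would presumably use a higher ratio).
        self.drop_ratio = drop_ratio
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # queries: (B, Q, C) learned query/memory vectors
        # tokens:  (B, N, C) flattened T*H*W features at one pyramid scale
        if self.training and self.drop_ratio > 0:
            n_keep = max(1, int(tokens.shape[1] * (1.0 - self.drop_ratio)))
            keep = torch.randperm(tokens.shape[1], device=tokens.device)[:n_keep]
            tokens = tokens[:, keep]  # randomly subsample tokens at this scale only
        updated, _ = self.attn(queries, tokens, tokens)
        return updated


# Toy usage with made-up shapes: 2 clips, 16 queries, 8x64x64 tokens at a fine scale.
stage = DropTokenCrossAttention(dim=256, num_heads=8, drop_ratio=0.5)
stage.train()
q = torch.randn(2, 16, 256)
feats = torch.randn(2, 8 * 64 * 64, 256)
out = stage(q, feats)  # (2, 16, 256)
```

In this sketch the subsampling is applied only in training mode, so inference attends over all tokens; whether the paper drops tokens at inference as well is not stated in the abstract.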
Article Page#
PDF Access#