Title#
Multiscale Video Transformers for Class Agnostic Segmentation in Autonomous Driving
Abstract#
Ensuring safety in autonomous driving is a complex challenge that requires handling unknown objects and unforeseen driving scenarios. We develop multiscale video transformers capable of detecting unknown objects using only motion cues. Video semantic and panoptic segmentation often rely on known classes seen during training, overlooking novel categories. Recent visual grounding with large language models is computationally expensive, especially for pixel-level output. We propose an efficient video transformer trained end-to-end for class-agnostic segmentation without optical flow. Our method uses multi-stage multiscale query-memory decoding and a scale-specific random drop-token to ensure efficiency and accuracy, maintaining detailed spatiotemporal features with a shared, learnable memory module. Unlike conventional decoders that compress features, our memory-centric design preserves high-resolution information at multiple scales. We evaluate on DAVIS'16, KITTI, and Cityscapes. Our method consistently outperforms multiscale baselines while being efficient in GPU memory and runtime, demonstrating a promising direction for real-time, robust dense prediction in safety-critical robotics.
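The abstract names a scale-specific random drop-token but does not specify it. As a rough illustration only, the PyTorch sketch below shows one plausible reading: randomly keeping a fraction of tokens per scale during training, with finer scales (which carry more tokens) assigned lower keep ratios to cap attention cost. The function name `drop_tokens`, the `keep_ratio` values, and the tensor layout are all hypothetical assumptions, not the authors' implementation.

```python
import torch

def drop_tokens(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Randomly keep a subset of tokens along the sequence dimension.

    tokens: (B, N, C) flattened spatiotemporal tokens for one scale.
    keep_ratio: fraction of tokens retained at this scale.
    """
    B, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    # Independent random selection per batch element.
    scores = torch.rand(B, N, device=tokens.device)
    keep_idx = scores.topk(n_keep, dim=1).indices        # (B, n_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, C)  # (B, n_keep, C)
    return tokens.gather(1, keep_idx)

# Hypothetical per-scale keep ratios: finer scales drop more tokens.
keep_ratios = {"1/8": 0.5, "1/16": 0.75, "1/32": 1.0}
feats = torch.randn(2, 4096, 256)              # e.g. tokens from a 1/8-scale map
reduced = drop_tokens(feats, keep_ratios["1/8"])  # -> (2, 2048, 256)
```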
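Likewise, the multi-stage multiscale query-memory decoding is only named in the abstract. The sketch below is a guess at its general shape under stated assumptions: a fixed set of learnable queries cross-attends, stage by stage, to uncompressed token memories at different scales instead of to a single fused feature map, which matches the stated memory-centric goal of preserving high-resolution information. The class name `QueryMemoryDecoder`, the stage count, the cycling order over scales, and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class QueryMemoryDecoder(nn.Module):
    """Minimal multi-stage query-memory decoder sketch (not the paper's code)."""

    def __init__(self, dim: int = 256, num_queries: int = 100, num_stages: int = 6):
        super().__init__()
        # Learnable object queries shared across all stages.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.stages = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(num_stages)
        )

    def forward(self, memory_per_scale: list[torch.Tensor]) -> torch.Tensor:
        # memory_per_scale: list of (B, N_s, dim) token tensors, one per scale,
        # kept at full resolution rather than compressed into one map.
        B = memory_per_scale[0].shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        for i, attn in enumerate(self.stages):
            mem = memory_per_scale[i % len(memory_per_scale)]
            out, _ = attn(q, mem, mem)  # queries attend to one scale's memory
            q = q + out                 # residual update of the queries
        return q                        # (B, num_queries, dim), e.g. for mask heads
```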
Article Page#
PDF Access#