ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

Xueyun Tian; Wei Li; Bingbing Xu; Heng Dong; Yuanzhuo Wang; Huawei Shen

arXiv:2601.10323 · cs.CV · January 16, 2026

TL;DR

ROMA is a real-time omni-multimodal assistant that processes continuous audio, video, and text streams and supports both reactive interaction (answering when asked) and proactive interaction (deciding on its own when to respond).

Contribution

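ROMA introduces a unified framework for real-time streaming audio-video understanding. It combines synchronized multimodal units that align dense audio with discrete video frames, a lightweight speak head that decides when to respond separately from generating the response, and a benchmark suite.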

Findings

01 Achieves state-of-the-art performance on proactive tasks

02 Delivers competitive performance on reactive tasks

03 Robustness validated across 12 benchmarks

Abstract

Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and…
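
The two mechanisms named in the abstract, synchronized multimodal units and a speak head that decides when to respond, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the unit duration, the feature format, or the speak head architecture, so `MultimodalUnit`, `build_units`, `speak_decisions`, and the 0.5 threshold are hypothetical names and values, not the authors' implementation. In a real system the speak scores would come from a learned head; here they are passed in directly.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MultimodalUnit:
    """One time-aligned chunk: a video frame plus the audio covering the same interval."""
    start_s: float
    frame: object
    audio: List[float]


def build_units(frames: Sequence[object], audio: Sequence[float],
                fps: float, sample_rate: int) -> List[MultimodalUnit]:
    """Pair each discrete video frame with the dense audio samples spanning its interval."""
    units = []
    for i, frame in enumerate(frames):
        lo = round(i * sample_rate / fps)        # first audio sample of this interval
        hi = round((i + 1) * sample_rate / fps)  # one past the last audio sample
        units.append(MultimodalUnit(start_s=i / fps, frame=frame, audio=list(audio[lo:hi])))
    return units


def speak_decisions(scores: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Hypothetical gate: trigger a response only at units whose score reaches the threshold.

    The decision is kept separate from text generation, mirroring the decoupling
    the abstract describes; the scoring model itself is not sketched here.
    """
    return [s >= threshold for s in scores]


# Toy stream: 3 frames at 1 fps, audio at 4 samples per second.
units = build_units(["f0", "f1", "f2"], list(range(12)), fps=1, sample_rate=4)
decisions = speak_decisions([0.1, 0.9, 0.5])
```

In this toy stream each unit carries one frame and the four audio samples from the same second, which is one simple way to reconcile the granularity mismatch between the two modalities. The gate runs once per unit, so the decision to respond can be made without invoking generation.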

Code & Models

Models

EurekaTian/ROMA (model, 10 downloads, 1 like)

Datasets

EurekaTian/ROMA_proactive (dataset, 2.4k downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Speech and dialogue systems