ChronusOmni: Improving Time Awareness of Omni Large Language Models
Yijing Chen, Yihan Wu, Kaisi Guan, Yuchen Ren, Yuyue Wang, Ruihua Song, Liyun Ru

TL;DR
ChronusOmni is a novel omni large language model that enhances temporal awareness across audio and visual modalities, improving explicit and implicit temporal grounding in videos through unified modeling and reinforcement learning.
Contribution
The paper introduces ChronusOmni, a model with interleaved timestamp tokens and reinforcement learning, plus a new dataset, advancing cross-modal temporal reasoning in large language models.
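As a rough illustration of what "interleaved timestamp tokens" could mean, the sketch below places a textual time marker before each time step's visual and audio chunks in a single sequence. This summary does not give the paper's actual token format, time granularity, or modality ordering, so the `<t=...s>` marker, the function names, and the one-chunk-per-step layout are all assumptions made for illustration only.

```python
from typing import List


def timestamp_token(seconds: float) -> str:
    """Format a time in seconds as a hypothetical text marker, e.g. '<t=2.0s>'."""
    return f"<t={seconds:.1f}s>"


def interleave_timestamps(
    visual: List[str], audio: List[str], step_seconds: float = 1.0
) -> List[str]:
    """Build one sequence: timestamp marker, visual chunk, audio chunk, per step.

    Assumption: visual[i] and audio[i] cover the same time step i.
    """
    if len(visual) != len(audio):
        raise ValueError("visual and audio must have the same number of steps")
    sequence: List[str] = []
    for i, (v, a) in enumerate(zip(visual, audio)):
        sequence.append(timestamp_token(i * step_seconds))  # time marker first
        sequence.append(v)  # placeholder for this step's visual tokens
        sequence.append(a)  # placeholder for this step's audio tokens
    return sequence


if __name__ == "__main__":
    print(interleave_timestamps(["<v0>", "<v1>"], ["<a0>", "<a1>"], step_seconds=2.0))
    # ['<t=0.0s>', '<v0>', '<a0>', '<t=2.0s>', '<v1>', '<a1>']
```

Keeping both modalities under a shared time marker is one simple way to let a model relate what is seen to what is heard at the same moment; the real model may organize its input differently.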
Findings
Achieves over 30% improvement on the ChronusAV dataset.
Sets new state-of-the-art on multiple temporal grounding benchmarks.
Demonstrates strong cross-modal temporal reasoning capabilities.
Abstract
Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality and overlook implicit temporal grounding across modalities (for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs), despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
