DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

Bo-Cheng Chiu; Jen-Jee Chen; Yu-Chee Tseng; Feng-Chi Chen; An-Zi Yen

arXiv:2506.11558 · cs.CV · January 29, 2026

TL;DR

DaMO is a data-efficient Video LLM for fine-grained temporal reasoning and multimodal understanding. It combines a hierarchical dual-stream architecture with a structured training procedure, and it outperforms prior methods on temporal grounding and video QA benchmarks.

Contribution

Introduces DaMO, a novel data-efficient Video LLM with a hierarchical dual-stream architecture and a four-stage training process for improved temporal reasoning.

Findings

01. Outperforms prior methods on temporal grounding tasks

02. Achieves superior results on video QA benchmarks

03. Demonstrates effective multimodal and temporal understanding

Abstract

Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured…

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling