Chrono: A Simple Blueprint for Representing Time in MLLMs

Hector Rodriguez; Boris Meinardus; Anil Batra; Anna Rohrbach; Marcus Rohrbach

arXiv:2406.18113 · cs.CV · January 1, 2026 · 3 cites

TL;DR

Chrono is a simple yet effective universal blueprint for representing time in multimodal large language models, significantly improving temporal localization and grounded video question answering across various benchmarks.

Contribution

Introduces Chrono, a universal sequence blueprint that enhances temporal understanding in image-text pretrained MLLMs without complex architectures.

Findings

01. Achieves state-of-the-art results in moment retrieval on multiple benchmarks.

02. Improves grounded video question answering performance.

03. Works across different MLLM architectures and sizes.

Abstract

The recent success of Large Language Models (LLMs) has prompted the extension to the multimodal domain, developing image-text Multimodal LLMs (MLLMs) and then video-text models. In this work, we investigate the challenge of contextual and temporal comprehension in video-language models by exploring the task of temporal localization in videos. To address this problem, prior works have developed complex task-specific architectures, novel modules to embed time into MLLMs, or leveraged additional input signals such as video transcripts to best encode contextual and temporal information. We find that most of these efforts are surpassed by a much simpler design. We introduce Chrono, a universal sequence blueprint that can be applied to any image-text pretrained MLLM. In extensive experiments spanning different MLLM architectures and sizes, finetuning and zero-shot settings, we demonstrate new…
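
To make the temporal localization task in the abstract concrete, the sketch below shows how a predicted time window is commonly scored against an annotated one using temporal intersection-over-union (IoU), with recall at an IoU threshold as the summary number. This is an illustrative assumption about evaluation, not something taken from the paper: the names `temporal_iou` and `recall_at_iou`, the `(start, end)` window format in seconds, and the 0.5 threshold are all hypothetical choices made for this example.

```python
from typing import Sequence, Tuple

# A time window in a video, as (start_seconds, end_seconds).
# This representation is an assumption made for illustration.
Window = Tuple[float, float]


def temporal_iou(pred: Window, gt: Window) -> float:
    """Intersection-over-union of two 1-D time windows."""
    # Overlap is zero when the windows do not intersect.
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(
    preds: Sequence[Window],
    gts: Sequence[Window],
    threshold: float = 0.5,  # hypothetical default, not taken from the paper
) -> float:
    """Fraction of queries whose top prediction reaches the IoU threshold."""
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Made-up windows, one prediction per query.
    predictions = [(10.0, 20.0), (0.0, 4.0)]
    annotations = [(12.0, 20.0), (10.0, 14.0)]
    print(recall_at_iou(predictions, annotations))
```

In this sketch, a prediction counts as correct only if it overlaps the annotated window tightly enough, so a model that finds the right event but misjudges its boundaries is still penalized.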

Code & Models

Repositories

sudo-Boris/mr-Blip (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: BLIP (Bootstrapping Language-Image Pre-training)