Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models
Keunwoo Peter Yu, Joyce Chai

TL;DR
This paper introduces a new benchmark, TGLG, for evaluating real-time vision-language models on their ability to generate temporally aligned language responses to streaming visual input, supported by datasets, a new metric, and a novel model.
Contribution
The paper proposes TGLG, a benchmark for real-time, temporally-grounded language generation, along with datasets, a new evaluation metric TRACE, and a time-synchronized model VLM-TSI.
Findings
VLM-TSI outperforms baseline models in TGLG tasks.
Performance on TGLG remains modest, indicating the challenge of real-time alignment.
The benchmark and model facilitate future research in real-time vision-language systems.
Abstract
Vision-language models (VLMs) have shown remarkable progress in offline tasks such as image captioning and video question answering. However, real-time interactive environments impose new demands on VLMs, requiring them to generate utterances that are not only semantically accurate but also precisely timed. We identify two core capabilities necessary for such settings, perceptual updating and contingency awareness, and propose a new benchmark task, Temporally-Grounded Language Generation (TGLG), to evaluate them. TGLG requires models to generate utterances in response to streaming video such that both content and timing align with dynamic visual input. To support this benchmark, we curate evaluation datasets from sports broadcasting and egocentric human interaction domains, and introduce a new metric, TRACE, to evaluate TGLG by jointly…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Research on an interesting direction even within the domain of spatio-temporal vision-language learning: time-synchronized processing in real-time scenarios.
- Substantial improvements over the baseline.
- No human validation study on how the proposed metric aligns with human preferences.
- I'm confused about this work's purpose. It seems that the data resources already exist, and the VLM-TSI method actually proposes finetuning a suitable model on these resources. According to my current understanding, the task is actually not novel, and the methodology is actually to finetune a proper model on time-synchronized interleaved video-language data. The only actual novelty is the proposed metric, which is no
1. The motivation of incorporating the capabilities of perceptual updating and contingency awareness into real-time video-LLMs is practical and intuitively reasonable.
2. The proposed benchmark, metric and method in this work are shown to be effective under real-time settings.
1. As the authors mention in the manuscript, existing turn-based video-LLMs respond to the environment with overly high latency, which is a major obstacle to handling real-time settings. However, these video-LLMs are generally large and have a huge number of parameters, which naturally makes them unsuitable for real-time response. What if the turn-based video-LLMs are optimized to have fewer parameters and faster response speed, for example, turn-based models cou
- The paper targets an important and timely problem of evaluating real-time multimodal reasoning in VLMs. As such, it contributes to an emerging and rapidly evolving research direction.
- The use of two complementary datasets with distinct characteristics (third-person sports vs. first-person human interaction) is a thoughtful design choice that enhances the benchmark’s generality. The inclusion of cross-dataset evaluation (HoloAssist is used just for testing) further strengthens its robustness.
- As a benchmark paper, the experimental evaluation is rather limited. The authors only assess two versions of VideoLLM-Online, omitting several strong recent baselines such as Stream-VLM (Panchal et al., 2024), FlashVStream (Zhang et al., 2024), Dispider (Qian et al., 2025), StreamChat (Xiong et al., 2025), and StreamChat (Liu et al., 2025). Including or at least discussing results from these models would significantly strengthen the empirical validation.
- The related work section does not suf
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis
Methods: ALIGN
