EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

Minjoon Jung; Junbin Xiao; Junghyun Kim; Byoung-Tak Zhang; Angela Yao

arXiv:2510.26113·cs.CV·October 31, 2025

3 Reviews

TL;DR

This paper introduces EgoExo-Con, a benchmark for evaluating view-invariant temporal understanding in videos, and proposes View-GRPO, a reinforcement learning method to improve consistency across different viewpoints.

Contribution

The paper presents the EgoExo-Con benchmark and the View-GRPO framework, advancing both the evaluation and the training of models for view-invariant temporal understanding in videos.

Findings

01

Existing Video-LLMs struggle with view-invariant consistency.

02

Naive finetuning on both views improves consistency but often underperforms single-view training.

03

View-GRPO outperforms naive methods in cross-view consistency.

Abstract

Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performances. (2) When naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework…
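To make the distinction between correctness and cross-view consistency concrete, here is a minimal Python sketch for the Temporal Grounding task. It is an illustration under stated assumptions, not the paper's evaluation code: the function names `temporal_iou` and `cross_view_scores`, the IoU threshold of 0.5, and the definition of a "consistent" pair as one where the prediction is correct in both the egocentric and the exocentric view are all assumptions introduced here. It also assumes that each synchronized pair shares one ground-truth span.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def cross_view_scores(
    ego_preds: List[Span],
    exo_preds: List[Span],
    gts: List[Span],
    iou_threshold: float = 0.5,  # hypothetical threshold, not from the paper
) -> Dict[str, float]:
    """Per-view accuracy plus the fraction of pairs correct in BOTH views.

    Assumption: a pair is 'consistent' only if both views pass the threshold.
    """
    if not (len(ego_preds) == len(exo_preds) == len(gts)) or not gts:
        raise ValueError("inputs must be non-empty and of equal length")
    ego_ok = [temporal_iou(p, g) >= iou_threshold for p, g in zip(ego_preds, gts)]
    exo_ok = [temporal_iou(p, g) >= iou_threshold for p, g in zip(exo_preds, gts)]
    both_ok = [a and b for a, b in zip(ego_ok, exo_ok)]
    n = len(gts)
    return {
        "ego_acc": sum(ego_ok) / n,
        "exo_acc": sum(exo_ok) / n,
        "consistency": sum(both_ok) / n,
    }


# Toy data: three synchronized pairs with shared ground-truth spans.
gts = [(0.0, 10.0), (5.0, 15.0), (8.0, 10.0)]
ego_preds = [(0.0, 10.0), (5.0, 15.0), (0.0, 2.0)]
exo_preds = [(0.0, 10.0), (20.0, 30.0), (0.0, 2.0)]
scores = cross_view_scores(ego_preds, exo_preds, gts)
print(scores)
```

Under this definition the consistency score can never exceed the lower of the two single-view accuracies, which is one way to read the abstract's observation that consistency results fall well below single-view performance.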

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

**[S1] Writing and motivation** - The paper is well-written and easy to follow. The motivation is clearly presented.

**[S2] Experiments** - The paper covers an extensive range of experiments, including evaluation of both open- and closed-source models, detailed per-subset results (CharadesEgo, LEMMA, EgoExo-4D), and fine-tuning analyses.

Weaknesses

**[W1] Dataset and task clarification**

- EgoExo-Con combines pre-existing datasets, so its domain diversity still depends on those sources. The paper claims “comprehensive” coverage, but 491 pairs is small compared to modern multimodal benchmarks.
- Evaluating temporal consistency across viewpoints is critical in view-invariant video understanding. Traditionally, cross-view temporal consistency has been evaluated through cross-view phase progression or Kendall's $\tau$. Additionally, cross-view

Reviewer 02 · Rating 2 · Confidence 5

Strengths

1. The paper is clearly written.
2. The proposed EgoExo-Con evaluation set can be useful for this area of research.
3. The proposed reinforcement learning approach enhances view-invariant comprehension in Video-LLMs.

Weaknesses

1. The proposed evaluation set, EgoExo-Con, contains only 491 items, which is quite small; it is hard to say whether it simply identifies a hard subset for current Video-LLMs.
2. The closed-source and human performance are reported on a randomly sampled subset, which may be sensitive to sample selection and cannot be fairly compared with the open-source models and the proposed model.
3. In Figure 1 (b), I don't think it is appropriate to expect the model to understand "put a knife" from the provided exo vi

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. **Useful problem and benchmark.** The proposed problem of cross-view consistency is interesting and has several practical applications, such as learning from human demonstrations. The EgoExo-Con benchmark, although not large, is thoroughly constructed and is useful for measuring this consistency.
2. **Strong baseline.** The proposed View-GRPO is simple, well-formulated and shows strong performance on the proposed benchmark.
3. **Presentation.** The paper is very well written and was eas

Weaknesses

1. Instead of enforcing cross-view consistency only through the final answer, the View-GRPO method relies on matching reasoning traces for cross-view consistency. This means that consistency is enforced in the language space and not directly via visual correspondence. In some narrow cases this can be a limitation, for example where certain visual elements are hard to describe in language.
2. There is still a lot of room for improvement in terms of consistency scores (Tab 3). While this h
