VUDG: A Dataset for Video Understanding Domain Generalization

Ziyi Wang; Zhi Gao; Boxuan Yu; Zirui Dai; Yuxiang Song; Qingyuan Lu; Jin Chen; Xinxiao Wu

arXiv:2505.24346 · cs.CV · June 2, 2025

Open Access · 3 Reviews

TL;DR

VUDG is a dataset for evaluating the robustness of video understanding models across diverse domain shifts. It reveals vulnerabilities in current models and is intended to guide future research in domain generalization.

Contribution

The paper presents VUDG, a new dataset with videos from 11 domains for assessing domain generalization in video understanding, and analyzes model performance under domain shifts.

Findings

1. Most models degrade in performance under domain shifts.
2. State-of-the-art LVLMs show significant robustness gaps.
3. VUDG serves as a benchmark for future domain generalization research.
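
The first two findings concern how accuracy changes from one domain to another. A minimal sketch of one way to quantify that is shown here: per-domain accuracy plus the gap between the best and worst domain. The record fields (`domain`, `prediction`, `answer`), the domain names, and the gap metric are assumptions made for illustration; they are not taken from the paper's evaluation protocol.

```python
from collections import defaultdict


def per_domain_accuracy(records):
    """Return {domain: accuracy} for a list of prediction records.

    Each record is a dict with hypothetical keys 'domain',
    'prediction', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["domain"]] += 1
        correct[r["domain"]] += int(r["prediction"] == r["answer"])
    return {d: correct[d] / total[d] for d in total}


def best_worst_gap(acc_by_domain):
    """Difference between the highest and lowest per-domain accuracy."""
    values = list(acc_by_domain.values())
    return max(values) - min(values)


# Toy records with made-up domain names, for illustration only.
records = [
    {"domain": "daytime", "prediction": "A", "answer": "A"},
    {"domain": "daytime", "prediction": "B", "answer": "B"},
    {"domain": "night", "prediction": "A", "answer": "C"},
    {"domain": "night", "prediction": "D", "answer": "D"},
]

acc = per_domain_accuracy(records)
gap = best_worst_gap(acc)
```

A large gap suggests the model is sensitive to the shift between those domains; a small one is not proof of robustness, since all domains could be equally hard.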

Abstract

Video understanding has made remarkable progress in recent years, largely driven by advances in deep models and the availability of large-scale annotated datasets. However, existing works typically ignore the inherent domain shifts encountered in real-world video applications, leaving domain generalization (DG) in video understanding underexplored. Hence, we propose Video Understanding Domain Generalization (VUDG), a novel dataset designed specifically for evaluating the DG performance in video understanding. VUDG contains videos from 11 distinct domains that cover three types of domain shifts, and maintains semantic similarity across different domains to ensure fair and meaningful evaluation. We propose a multi-expert progressive annotation framework to annotate each video with both multiple-choice and open-ended question-answer pairs. Extensive experiments on 9 representative large…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

Innovative Benchmark Design: The paper introduces a novel benchmark specifically crafted to test how well video understanding models generalize across domain shifts, filling an important gap in current evaluation practices.

Thorough and Systematic Evaluation: It provides a comprehensive assessment of multiple domain shift scenarios (style, viewpoint, weather, lighting) using both traditional VideoQA models and modern LVLMs, offering a well-rounded perspective on robustness.

Weaknesses

Insufficient Distractor Analysis: While the paper reports aggregate counts and durations (Fig. 4–5), it provides no fine-grained QA/distractor analyses, e.g., answer-length distributions, length-matching of options, distractor similarity, and so on. For example, the prompt instructs that distractors be the same length, but in my experience LLMs may ignore such an instruction completely, and there are no statistics to verify whether that is the case. Further investigation into questions and answer options would stren

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The proposed benchmark is the first large-scale benchmark designed to measure domain generalization in VLMs, consisting of a dedicated training/testing set and clear protocols for evaluation.

The choice of domains and corresponding videos is high quality: the domains selected in VUDG are broad and diverse, and a large portion of the videos used for testing are newly collected by the authors, reducing the risk of data leakage with existing VLM training data.

Weaknesses

The reviewer understands that the primary motivation behind VUDG is to unify the semantic space across domains, and this intuitively makes sense to the reviewer. However, it is not scientifically shown why existing benchmarks fail to assess domain generalization due to the lack of cross-domain semantic alignment.
* What evidence exists to motivate the community to evaluate the domain generalization ability of their models with VUDG, instead of a benchmark like Video-MME that also contains multiple vid

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. I believe the paper offers a useful dataset (once it is published online), which should contribute to the community.
2. The paper also offers comprehensive details of how the dataset is created.

Weaknesses

1. This may be a minor concern, but since the dataset only offers limited semantics, the evaluation capability may also be limited. It would be nice if the paper could provide some discussion on this point (e.g., whether the set of actions is sufficient for evaluating general performance, and why).
2. As domain generalization can involve fine-tuning, it is desirable to evaluate possible bias in QA pairs and semantics. This may be done by training a language model with only QA pairs and some tok

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging