MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
Han Zhang, Wanting Jiang, Tomasz Kornuta, Tian Zheng, Vidya Murali

TL;DR
MAVEN is a multi-stage, agent-driven annotation pipeline that creates high-quality, structured video reasoning data, enabling improved training of vision-language models across diverse video domains.
Contribution
The paper introduces MAVEN, a novel multi-stage pipeline that synthesizes detailed video annotations with agent-driven domain adaptation and iterative quality refinement.
Findings
MAVEN successfully labeled over 5,300 traffic videos.
Fine-tuning Cosmos-Reason2-8B with MAVEN data outperforms some existing models.
Agentic annotation improves model performance on CCTV and accident datasets.
Abstract
Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations capturing not only what happened, but when, where, why, and with what consequence, at a scale manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces, organized around a designated Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; this explicit intermediate serves as the sole input to downstream Q&A generation across multiple task formats. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering.…
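The data flow described in the abstract (three caption levels merged into one MSTED, which is then the only input to Q&A generation) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `MSTED` dataclass, the functions `synthesize_msted` and `generate_qa`, the caption-level keys, and the task names. The generators are plain callables standing in for whatever model calls the authors use.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class MSTED:
    """Hypothetical container for the intermediate event description."""
    event_of_focus: str
    captions: Dict[str, str]  # caption level name -> caption text

    def as_text(self) -> str:
        # Flatten to a single text block; this is all that downstream tasks see.
        lines = [f"Event of Focus: {self.event_of_focus}"]
        lines += [f"[{level}] {text}" for level, text in self.captions.items()]
        return "\n".join(lines)


def synthesize_msted(event_of_focus: str, captions: Dict[str, str]) -> MSTED:
    # The abstract says three complementary caption levels; the level
    # names themselves are not given, so the keys are placeholders.
    if len(captions) != 3:
        raise ValueError("expected exactly three caption levels")
    return MSTED(event_of_focus, dict(captions))


def generate_qa(
    msted: MSTED, task_generators: Dict[str, Callable[[str], str]]
) -> Dict[str, str]:
    # Each task format receives only the MSTED text, never the raw video
    # or the individual captions.
    context = msted.as_text()
    return {task: gen(context) for task, gen in task_generators.items()}
```

Because every generator reads the same text, a change to the intermediate description propagates to all task formats at once, which is one plausible reason for routing everything through a single explicit intermediate.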
