MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

Han Zhang; Wanting Jiang; Tomasz Kornuta; Tian Zheng; Vidya Murali

arXiv:2605.21917·cs.CV·May 22, 2026

TL;DR

MAVEN is a multi-stage, agent-driven annotation pipeline that creates high-quality, structured video reasoning data, enabling improved training of vision-language models across diverse video domains.

Contribution

The paper introduces MAVEN, a novel multi-stage pipeline that synthesizes detailed video annotations with agent-driven domain adaptation and iterative quality refinement.

Findings

01

MAVEN successfully labeled over 5,300 traffic videos.

02

Fine-tuning Cosmos-Reason2-8B with MAVEN data outperforms some existing models.

03

Agentic annotation improves model performance on CCTV and accident datasets.

Abstract

Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations capturing not only what happened, but when, where, why, and with what consequence, at a scale manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces, organized around a designated Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; this explicit intermediate serves as the sole input to downstream Q&A generation across multiple task formats. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering.…
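The data flow described in the abstract (three caption levels merged into one MSTED, which then serves as the only input to Q&A generation) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class names, function names, and task-format labels are hypothetical, and the model calls a real pipeline would presumably make are replaced by simple string operations. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MSTED:
    """Hypothetical container for a Multi-Scale Spatio-Temporal Event Description."""
    event_of_focus: str
    text: str


@dataclass
class QAItem:
    """Hypothetical training record: one question with a CoT trace and an answer."""
    task_format: str
    question: str
    reasoning: str  # stands in for a Chain-of-Thought reasoning trace
    answer: str


def synthesize_msted(captions: List[str], event_of_focus: str) -> MSTED:
    # Assumption: the abstract only says three complementary caption levels
    # are combined. A real pipeline would likely call a model here; this stub
    # just joins the captions so the data flow stays visible.
    if len(captions) != 3:
        raise ValueError("expected exactly three caption levels")
    return MSTED(event_of_focus=event_of_focus, text="\n".join(captions))


def generate_qa(msted: MSTED, task_formats: List[str]) -> List[QAItem]:
    # The MSTED is the sole input: neither the raw video nor the original
    # captions are passed to this stage.
    items = []
    for fmt in task_formats:
        items.append(
            QAItem(
                task_format=fmt,
                question=f"[{fmt}] What happened during: {msted.event_of_focus}?",
                reasoning=f"Derived from the event description: {msted.text}",
                answer=msted.event_of_focus,
            )
        )
    return items
```

Routing every task format through the same intermediate object mirrors the abstract's point that the MSTED is an explicit, inspectable step between captioning and Q&A generation.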
