VideoAVE: A Multi-Attribute Video-to-Text Attribute Value Extraction Dataset and Benchmark Models

Ming Cheng; Tong Wu; Jiazhen Hu; Jiaying Gong; Hoda Eldardiry

arXiv:2508.11801 · cs.CV · August 19, 2025

Open Access

TL;DR

VideoAVE introduces a comprehensive video-to-text attribute value extraction dataset for e-commerce, along with benchmark models, highlighting the challenges and potential for future advancements in video-based attribute extraction.

Contribution

We present the first large-scale, multi-domain video AVE dataset with a filtering system and establish benchmark evaluations for state-of-the-art models.

Findings

01. VideoAVE covers 14 domains and 172 attributes.

02. Video AVE remains a challenging task, especially in open settings.

03. Current models have room for improvement in leveraging temporal information.

Abstract

Attribute Value Extraction (AVE) is important for structuring product information in e-commerce. However, existing AVE datasets are primarily limited to text-to-text or image-to-text settings, lacking support for product videos, diverse attribute coverage, and public availability. To address these gaps, we introduce VideoAVE, the first publicly available video-to-text e-commerce AVE dataset, spanning 14 domains and covering 172 unique attributes. To ensure data quality, we propose a post-hoc CLIP-based Mixture of Experts filtering system (CLIP-MoE) that removes mismatched video-product pairs, resulting in a refined dataset of 224k training samples and 25k evaluation samples. To evaluate the usability of the dataset, we further establish a comprehensive benchmark by evaluating several state-of-the-art video vision-language models (VLMs) under both attribute-conditioned value…
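
The abstract does not describe how CLIP-MoE scores a video-product pair, so the following is only a rough illustration of post-hoc mismatch filtering, not the authors' method. It assumes that each expert has already produced a similarity score in [0, 1] for a pair, that the scores are combined by a weighted average, and that pairs below a fixed threshold are dropped. The names `gated_score` and `filter_pairs`, the gate weights, and the threshold are all hypothetical.

```python
from typing import Dict, List, Sequence


def gated_score(expert_scores: Sequence[float], gate_weights: Sequence[float]) -> float:
    """Weighted average of per-expert similarity scores (assumed combination rule)."""
    if len(expert_scores) != len(gate_weights):
        raise ValueError("need exactly one gate weight per expert score")
    total = sum(gate_weights)
    if total <= 0:
        raise ValueError("gate weights must sum to a positive value")
    return sum(s * w for s, w in zip(expert_scores, gate_weights)) / total


def filter_pairs(
    pairs: List[Dict], gate_weights: Sequence[float], threshold: float
) -> List[Dict]:
    """Keep only pairs whose combined score reaches the (hypothetical) threshold."""
    return [p for p in pairs if gated_score(p["scores"], gate_weights) >= threshold]


# Toy data: scores are made up for illustration, not taken from the paper.
pairs = [
    {"id": "video_a", "scores": [0.8, 0.6]},  # experts agree the pair matches
    {"id": "video_b", "scores": [0.1, 0.3]},  # likely a mismatched pair
]
kept = filter_pairs(pairs, gate_weights=[1.0, 1.0], threshold=0.5)
print([p["id"] for p in kept])
```

In this toy run only `video_a` survives, since its averaged score is above the threshold while `video_b` falls below it. A real filter would compute the expert scores from CLIP embeddings of video frames and product text, and would need the gating and threshold choices from the paper itself.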

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling