DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions

Xiaoyu Lin; Aniket Ghorpade; Hansheng Zhu; Justin Qiu; Dea Rrozhani; Monica Lama; Mick Yang; Zixuan Bian; Ruohan Ren; Alan B. Hong; Jiatao Gu; Chris Callison-Burch

arXiv:2511.12452 · cs.CV · November 18, 2025

Open Access

TL;DR

DenseAnnotate is an audio-driven online platform for collecting dense, multilingual annotations of images and 3D scenes at scale, with the aim of improving training data for multimodal models.

Contribution

The paper introduces DenseAnnotate, a novel speech-based annotation system that efficiently creates detailed, multilingual annotations for images and 3D assets, addressing limitations of traditional text-based methods.

Findings

01. Created a large multimodal dataset with dense annotations in 20 languages.

02. Models trained on this dataset show significant improvements in cultural and spatial understanding.

03. Demonstrated the platform's effectiveness across diverse domains and data types.

Abstract

With the rapid adoption of multimodal large language models (MLLMs) across diverse applications, there is a pressing need for task-centered, high-quality training data. A key limitation of current training datasets is their reliance on sparse annotations mined from the Internet or entered via manual typing that capture only a fraction of an image's visual content. Dense annotations are more valuable but remain scarce. Traditional text-based annotation pipelines are poorly suited for creating dense annotations: typing limits expressiveness, slows annotation speed, and underrepresents nuanced visual features, especially in specialized areas such as multicultural imagery and 3D asset annotation. In this paper, we present DenseAnnotate, an audio-driven online annotation platform that enables efficient creation of dense, fine-grained annotations for images and 3D assets. Annotators narrate…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques