VIVECaption: A Split Approach to Caption Quality Improvement

Varun Ananth; Baqiao Liu; Haoran Cai

arXiv:2603.07401·cs.CV·March 10, 2026

Open Access

TL;DR

VIVECaption introduces a systematic two-sided approach to enhance caption quality for text-to-image and text-to-video models, focusing on dataset creation and model alignment to improve caption-image alignment and downstream performance.

Contribution

The paper presents a novel methodology combining dataset stratification and model finetuning to systematically improve caption quality in generative models.

Findings

01. Finetuning with a character detection model improves caption-image alignment.

02. A comprehensive taxonomy of caption evaluation metrics clarifies their use-cases.

03. Structured caption formats enhance downstream parsing and utilization.
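
To illustrate why a structured caption format can make downstream parsing easier, here is a minimal sketch. The JSON layout and the field names (`subjects`, `setting`, `action`) are hypothetical choices made for this example; the summary above does not specify the format the paper actually uses.

```python
import json

# Hypothetical schema: these field names are illustrative assumptions,
# not the caption format described in the paper.
REQUIRED_FIELDS = ("subjects", "setting", "action")


def parse_structured_caption(raw: str) -> dict:
    """Parse a JSON caption and check that the assumed fields are present."""
    caption = json.loads(raw)
    if not isinstance(caption, dict):
        raise ValueError("caption must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in caption]
    if missing:
        raise ValueError(f"caption is missing fields: {missing}")
    return caption


def to_prompt(caption: dict) -> str:
    """Flatten the parsed caption into a single text prompt."""
    subjects = ", ".join(caption["subjects"])
    return f"{subjects} {caption['action']} in {caption['setting']}"


example = (
    '{"subjects": ["a knight", "a dragon"], '
    '"setting": "a ruined castle", '
    '"action": "facing each other"}'
)
parsed = parse_structured_caption(example)
prompt = to_prompt(parsed)
```

Because each field is addressed by name, a consumer can validate a caption, drop or reorder fields, or build a prompt without resorting to free-text heuristics.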

Abstract

Caption quality has emerged as a critical bottleneck in training high-quality text-to-image (T2I) and text-to-video (T2V) generative models. While visual language models (VLMs) are commonly deployed to generate captions from visual data, they suffer from hallucinations, poor compositional reasoning, and limited fine-grained understanding, resulting in misaligned image-caption pairs that degrade downstream model performance. This technical report introduces VIVECaption, a systematic two-sided approach to caption quality improvement. We first establish a comprehensive taxonomy of caption evaluation metrics, distinguishing between "universal" and "instance-grounded" metrics, with the ultimate goal of showcasing the use-cases and tradeoffs between different caption quality metrics. We then use this language to describe our two-sided approach to caption quality improvement: (1) a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Subtitles and Audiovisual Media