VIVECaption: A Split Approach to Caption Quality Improvement
Varun Ananth, Baqiao Liu, Haoran Cai

TL;DR
VIVECaption is a systematic two-sided approach to improving caption quality for text-to-image and text-to-video models. It pairs dataset creation with model alignment to improve caption-image alignment and downstream performance.
Contribution
The paper presents a methodology that combines dataset stratification and model finetuning to systematically improve the quality of captions used to train generative models.
Findings
Finetuning with a character detection model improves caption-image alignment.
A comprehensive taxonomy of caption evaluation metrics clarifies their use-cases.
Structured caption formats enhance downstream parsing and utilization.
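To illustrate the last finding, here is a minimal sketch of why a structured caption is easier to consume downstream than free-form text. The JSON layout and the field names (`scene`, `characters`, `attributes`, `style`) are hypothetical, chosen only for illustration; they are not the caption format used in the paper.

```python
import json
from typing import List

# Hypothetical structured caption. The field names are illustrative
# assumptions, not the schema described in the paper.
structured_caption = json.dumps({
    "scene": "a rainy city street at night",
    "characters": [
        {"name": "woman", "attributes": ["red coat", "umbrella"]},
        {"name": "dog", "attributes": ["small", "white"]},
    ],
    "style": "photorealistic",
})


def character_names(caption_json: str) -> List[str]:
    """Return the character names listed in a JSON-formatted caption."""
    caption = json.loads(caption_json)
    # A missing "characters" field is treated as an empty list.
    return [entry["name"] for entry in caption.get("characters", [])]


print(character_names(structured_caption))
```

With a free-form caption, the same lookup would require entity extraction over natural language; with a structured one, it is a dictionary access.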
Abstract
Caption quality has emerged as a critical bottleneck in training high-quality text-to-image (T2I) and text-to-video (T2V) generative models. While visual language models (VLMs) are commonly deployed to generate captions from visual data, they suffer from hallucinations, poor compositional reasoning, and limited fine-grained understanding, resulting in misaligned image-caption pairs that degrade downstream model performance. This technical report introduces VIVECaption, a systematic two-sided approach to caption quality improvement. We first establish a comprehensive taxonomy of caption evaluation metrics, distinguishing between "universal" and "instance-grounded" metrics, with the ultimate goal of showcasing the use-cases and tradeoffs between different caption quality metrics. We then use this language to describe our two-sided approach to caption quality improvement: (1) a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Subtitles and Audiovisual Media
