Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui

TL;DR
This paper introduces VisualFactChecker, a training-free pipeline that generates detailed, high-fidelity captions for images and 3D objects by combining multiple models and fact-checking, outperforming existing methods.
Contribution
The paper presents a novel, flexible, training-free captioning pipeline that integrates multiple models and fact-checking to improve caption detail and fidelity.
Findings
VFC outperforms state-of-the-art captioning methods on the COCO dataset (2D images) and the Objaverse dataset (3D objects).
VFC achieves comparable captioning quality to GPT-4V with significantly smaller models.
Comprehensive evaluations using four metrics, including CLIP-Score, demonstrate VFC's effectiveness.
Abstract
Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning, where an LLM generates the final caption by summarizing caption proposals and the fact check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for…
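The three steps named in the abstract (proposal, verification, captioning) can be illustrated with a toy sketch. This is not the authors' implementation: every function name, the fixed object vocabulary, and the stub captioners and detector are hypothetical stand-ins, and the final step uses a string template where the paper uses an LLM.

```python
from typing import Callable, Dict, List

# Toy object vocabulary (assumption); a real system would let an LLM pick out
# the objects mentioned in each caption.
VOCAB = {"dog", "cat", "frisbee", "grass"}


def propose(image_id: str, captioners: List[Callable[[str], str]]) -> List[str]:
    """Step 1 (proposal): collect one candidate caption per captioning model."""
    return [captioner(image_id) for captioner in captioners]


def extract_objects(caption: str) -> List[str]:
    """Find vocabulary objects mentioned in a caption (stand-in for an LLM)."""
    words = caption.lower().replace(",", " ").replace(".", " ").split()
    return sorted(VOCAB.intersection(words))


def verify(image_id: str, proposals: List[str],
           detector: Callable[[str, str], bool]) -> Dict[str, bool]:
    """Step 2 (verification): check each mentioned object with a detector tool."""
    candidates = sorted({obj for p in proposals for obj in extract_objects(p)})
    return {obj: detector(image_id, obj) for obj in candidates}


def write_caption(checks: Dict[str, bool], instruction: str) -> str:
    """Step 3 (captioning): keep only verified content (stand-in for an LLM)."""
    confirmed = [obj for obj, ok in checks.items() if ok]
    dropped = [obj for obj, ok in checks.items() if not ok]
    return (f"[{instruction}] confirmed: {', '.join(confirmed)}; "
            f"dropped: {', '.join(dropped)}")


def visual_fact_check(image_id, captioners, detector, instruction):
    proposals = propose(image_id, captioners)
    checks = verify(image_id, proposals, detector)
    return write_caption(checks, instruction)


# Hypothetical stand-ins for real models.
def captioner_a(image_id: str) -> str:
    return "A dog catches a frisbee on grass."


def captioner_b(image_id: str) -> str:
    return "A dog chasing a cat."  # "cat" plays the role of a hallucination


def toy_detector(image_id: str, obj: str) -> bool:
    return obj in {"dog", "frisbee", "grass"}


caption = visual_fact_check(
    "img_001", [captioner_a, captioner_b], toy_detector, "one sentence"
)
print(caption)
```

In this toy run the second captioner mentions a cat that the detector does not confirm, so the final step drops it. Keeping verification as a separate stage means captioners, detection or VQA tools, and the summarizing model can be swapped independently, which matches the training-free design the abstract describes.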
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Natural Language Processing Techniques
