Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Yunhao Ge; Xiaohui Zeng; Jacob Samuel Huffman; Tsung-Yi Lin; Ming-Yu Liu; Yin Cui

arXiv:2404.19752·cs.CV·May 1, 2024

TL;DR

This paper introduces VisualFactChecker (VFC), a training-free pipeline that generates detailed, high-fidelity captions for 2D images and 3D objects. It combines multiple captioning models with LLM-driven fact-checking and outperforms state-of-the-art captioning methods on COCO and Objaverse.

Contribution

The paper presents a novel, flexible, training-free captioning pipeline that integrates multiple models and fact-checking to improve caption detail and fidelity.

Findings

01. VFC outperforms state-of-the-art captioning methods on the COCO and Objaverse datasets.

02. VFC achieves captioning quality comparable to GPT-4V while using significantly smaller models.

03. Comprehensive evaluations demonstrate VFC's effectiveness across multiple metrics.

Abstract

Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning, where an LLM generates the final caption by summarizing caption proposals and the fact check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for…
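The three steps described in the abstract (proposal, verification, captioning) can be sketched as a small pipeline. This is only an illustrative sketch, not the authors' implementation: every name here (`propose`, `verify`, `write_caption`, `run_vfc_sketch`, `FactCheck`) is hypothetical, the captioners and detector are toy stand-ins for real image-to-text and object detection models, and the two functions marked as LLM stand-ins use simple string matching where VFC would call a large language model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases; real models would take image data, not an id.
Captioner = Callable[[str], str]        # image id -> proposed caption
Detector = Callable[[str, str], bool]   # (image id, object name) -> present?


@dataclass
class FactCheck:
    obj: str
    present: bool


def propose(image: str, captioners: List[Captioner]) -> List[str]:
    """Step 1 (proposal): collect one caption per captioning model."""
    return [captioner(image) for captioner in captioners]


def extract_objects(captions: List[str], vocabulary: List[str]) -> List[str]:
    """LLM stand-in: list vocabulary objects mentioned in any proposal."""
    return [
        obj for obj in vocabulary
        if any(obj in caption.lower() for caption in captions)
    ]


def verify(image: str, captions: List[str], vocabulary: List[str],
           detector: Detector) -> List[FactCheck]:
    """Step 2 (verification): check each mentioned object with a tool."""
    objects = extract_objects(captions, vocabulary)
    return [FactCheck(obj, detector(image, obj)) for obj in objects]


def write_caption(checks: List[FactCheck]) -> str:
    """Step 3 (captioning), LLM stand-in: keep only verified objects."""
    kept = [check.obj for check in checks if check.present]
    if not kept:
        return "An image."
    return "An image containing " + ", ".join(kept) + "."


def run_vfc_sketch(image: str, captioners: List[Captioner],
                   vocabulary: List[str], detector: Detector) -> str:
    proposals = propose(image, captioners)
    checks = verify(image, proposals, vocabulary, detector)
    return write_caption(checks)


# Toy models: the second captioner hallucinates a cat.
toy_captioners = [
    lambda image: "A dog sitting on a sofa",
    lambda image: "A dog and a cat on a sofa",
]
toy_scene = {"dog", "sofa"}  # objects the toy detector "sees"
toy_detector = lambda image, obj: obj in toy_scene

caption = run_vfc_sketch("img-0", toy_captioners,
                         ["dog", "cat", "sofa"], toy_detector)
print(caption)  # An image containing dog, sofa.
```

In this toy run the hallucinated "cat" appears in a proposal but fails the detector check, so it is dropped from the final caption. Keeping each stage behind a plain callable mirrors the training-free design the abstract describes, where individual models can be swapped without retraining.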

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Natural Language Processing Techniques