Generating Accurate and Detailed Captions for High-Resolution Images

Hankyeol Lee; Gawon Seo; Kyounggyu Lee; Dogun Kim; Kyungwoo Song; Jiyoung Jung

arXiv:2510.27164·cs.CV·November 3, 2025

TL;DR

This paper introduces a multi-stage pipeline that combines vision-language models, large language models, and object detection to generate more accurate and detailed captions, with fewer hallucinations, for high-resolution images.

Contribution

The paper presents a multi-stage process that refines captions by combining object detection with language models, adding detail and reducing hallucinations in captions for high-resolution images.

Findings

01. Enhanced captions with more detailed object descriptions

02. Reduced hallucinations in generated captions

03. Improved evaluation scores on high-resolution image datasets

Abstract

Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection…
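The stages named in the abstract up to the point where it is cut off can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function `run_caption_stages`, the `CaptionDraft` container, and the four callables passed in are all hypothetical names, and the sketch assumes that the LLM reads key objects from the initial caption and that the detector returns the set of labels it found. Whatever the pipeline does after verification is not covered, because the abstract text is truncated there.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Set


@dataclass
class CaptionDraft:
    """Intermediate results of the stages described in the abstract."""
    caption: str                   # initial caption from the VLM
    key_objects: List[str]         # objects the LLM identified
    candidate_objects: List[str]   # co-occurring objects the LLM proposed
    verified_objects: List[str]    # candidates confirmed by the detector


def run_caption_stages(
    image: Any,
    caption_with_vlm: Callable[[Any], str],
    extract_key_objects: Callable[[str], List[str]],
    predict_cooccurring: Callable[[List[str]], List[str]],
    detect_objects: Callable[[Any, List[str]], Set[str]],
) -> CaptionDraft:
    """Run the stages in order; every callable is a placeholder (assumption)."""
    # Stage 1: the VLM produces an initial caption for the image.
    caption = caption_with_vlm(image)

    # Stage 2: the LLM identifies key objects (assumed: from the caption text).
    key_objects = extract_key_objects(caption)

    # Stage 3: the LLM proposes objects likely to co-occur with the key objects.
    # Objects that are already key objects are dropped from the candidates.
    candidates = [
        obj for obj in predict_cooccurring(key_objects)
        if obj not in key_objects
    ]

    # Stage 4: the detector checks which candidates are present in the image.
    detected = detect_objects(image, candidates)
    verified = [obj for obj in candidates if obj in detected]

    return CaptionDraft(caption, key_objects, candidates, verified)
```

Passing the models in as callables keeps the sketch independent of any particular VLM, LLM, or detector, so each stage can be swapped or stubbed out when testing.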

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning