Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception

Yanpeng Sun; Jing Hao; Ke Zhu; Jiang-Jiang Liu; Yuxiang Zhao; Xiaofan Li; Na Zhao; Zechao Li; Jingdong Wang

arXiv:2412.14233·cs.CV·January 28, 2026

TL;DR

This paper introduces EDC, a method that uses off-the-shelf visual specialists to add detailed visual attributes to image captions, making the captions more descriptive and improving the visual understanding of models trained on them.

Contribution

The paper proposes a novel approach called EDC that integrates rich visual attributes from trained specialists into captions, improving multimodal perception beyond existing captioning methods.

Findings

1. Enhanced caption quality with detailed attributes
2. Improved performance on visual understanding tasks
3. Better reasoning capabilities in multimodal models

Abstract

Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect images and language. Existing methods for generating such captions often rely on distilling them from pretrained LMMs, constructing them from publicly available internet images, or even generating them through human annotation. However, these strategies can fall short in precision and granularity, particularly for complex visual reasoning tasks. In this paper, we propose to leverage off-the-shelf visual specialists, which were initially trained on annotated images for tasks other than image captioning, to enhance image captions. Our approach, named EDC, explores low-level and fine-grained object attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines the attributes into…
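
As an illustration of the idea in the abstract, the following Python sketch merges per-object attributes and pairwise relations into a base caption. Everything in it is hypothetical: the `ObjectAttributes` and `Relation` schemas, the `build_caption` function, and the fixed sentence templates are assumptions made for illustration, not the authors' implementation. The abstract excerpt here is truncated before it says how EDC actually combines the attributes. In a real pipeline the attribute values would come from visual specialists (for example depth estimators, emotion classifiers, or HOI detectors); here they are hard-coded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ObjectAttributes:
    """Attributes for one object (hypothetical schema, not from the paper)."""
    name: str                       # fine-grained category, e.g. "golden retriever"
    depth: Optional[str] = None     # coarse depth label, e.g. "foreground"
    emotion: Optional[str] = None   # e.g. "happy"; only set when applicable


@dataclass
class Relation:
    """A subject-predicate-object relation, e.g. an HOI triple (hypothetical)."""
    subject: str
    predicate: str
    obj: str


def build_caption(base_caption: str,
                  objects: List[ObjectAttributes],
                  relations: List[Relation]) -> str:
    """Append attribute and relation sentences to a base caption.

    Uses fixed templates as a stand-in for whatever combination
    step the actual method applies.
    """
    sentences = [base_caption.rstrip(". ") + "."]

    for o in objects:
        details = []
        if o.depth:
            details.append(f"in the {o.depth}")
        if o.emotion:
            details.append(f"appearing {o.emotion}")
        if details:  # skip objects that carry no extra attributes
            sentences.append(f"The {o.name} is " + " and ".join(details) + ".")

    for r in relations:
        sentences.append(f"The {r.subject} is {r.predicate} the {r.obj}.")

    return " ".join(sentences)


if __name__ == "__main__":
    objs = [
        ObjectAttributes("golden retriever", depth="foreground", emotion="happy"),
        ObjectAttributes("frisbee"),
    ]
    rels = [Relation("golden retriever", "catching", "frisbee")]
    print(build_caption("A dog plays in a park", objs, rels))
```

The template-based join keeps the sketch deterministic and easy to inspect; a language model could replace it where more fluent captions are needed.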

Code & Models

Repositories

syp2ysy/dce (official)

Datasets

syp115/EDC-1M (dataset, 113 downloads)

Taxonomy

Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Multimodal Machine Learning Applications