Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception
Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Na Zhao, Zechao Li, Jingdong Wang

TL;DR
This paper introduces EDC, a method that uses off-the-shelf visual specialists to add detailed visual attributes to image captions, making the captions more descriptive and improving the visual understanding of multimodal models trained on them.
Contribution
The paper proposes a novel approach called EDC that integrates rich visual attributes from trained specialists into captions, improving multimodal perception beyond existing captioning methods.
Findings
Enhanced caption quality with detailed attributes
Improved performance on visual understanding tasks
Better reasoning capabilities in multimodal models
Abstract
Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect image and language. Existing methods for generating such captions often rely on distilling them from pretrained LMMs, constructing them from publicly available internet images, or even generating them through human annotation. However, these strategies can fall short in precision and granularity, particularly for complex visual reasoning tasks. In this paper, we propose to leverage off-the-shelf visual specialists, which were originally trained on annotated images for tasks other than image captioning, to enhance image captions. Our approach, named EDC, explores object low-level and fine-grained attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines the attributes into…
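As a rough illustration of the idea in the abstract, the sketch below merges per-object attributes and object relations into a base caption. It is not the authors' pipeline: the abstract is truncated before it says how the attributes are combined, so plain string templates are used here as a stand-in. The names `ObjectAttributes`, `Relation`, `describe_object`, and `enrich_caption` are hypothetical, and the attribute values are assumed to come from separate specialist models that are not shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ObjectAttributes:
    """Hypothetical container for specialist outputs about one object."""
    name: str                            # coarse label, e.g. "dog"
    depth: Optional[str] = None          # e.g. "foreground" (assumed depth-model output)
    emotion: Optional[str] = None        # e.g. "smiling" (assumed emotion-model output)
    fine_category: Optional[str] = None  # e.g. "golden retriever"


@dataclass
class Relation:
    """Hypothetical relation triple, e.g. relative location or HOI."""
    subject: str
    predicate: str
    obj: str


def describe_object(o: ObjectAttributes) -> str:
    # Prefer the fine-grained category over the coarse label when available.
    text = o.fine_category or o.name
    if o.emotion:
        text = f"{o.emotion} {text}"
    if o.depth:
        text = f"{text} in the {o.depth}"
    return text


def enrich_caption(base_caption: str,
                   objects: List[ObjectAttributes],
                   relations: List[Relation]) -> str:
    # Template-based merge; article handling ("a" vs "an") is deliberately naive.
    sentences = [base_caption.rstrip(".") + "."]
    for o in objects:
        sentences.append(f"There is a {describe_object(o)}.")
    for r in relations:
        sentences.append(f"The {r.subject} is {r.predicate} the {r.obj}.")
    return " ".join(sentences)


objects = [
    ObjectAttributes(name="dog", depth="foreground", fine_category="golden retriever"),
    ObjectAttributes(name="woman", emotion="smiling"),
]
relations = [Relation("woman", "holding", "leash")]
print(enrich_caption("A woman walks a dog", objects, relations))
```

A template merge keeps the example short and deterministic; a real system would more plausibly pass the collected attributes to a language model for rewriting, but that step is not described in the text available here.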
Taxonomy
Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Multimodal Machine Learning Applications
