FlexCap: Describe Anything in Images in Controllable Detail
Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar

TL;DR
FlexCap is a controllable vision-language model that generates region-specific descriptions of images with adjustable detail levels, improving dense captioning and zero-shot visual question answering performance.
Contribution
We introduce FlexCap, a novel model capable of producing length-conditioned region descriptions, enabling controllable detail and enhancing various vision-language tasks.
Findings
Achieves strong dense captioning performance on Visual Genome.
Enables zero-shot VQA with state-of-the-art results.
Supports diverse applications like image labeling and visual dialog.
Abstract
We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions with varying lengths from captioned web images. We demonstrate FlexCap's effectiveness in several applications: first, it achieves strong performance in dense captioning tasks on the Visual Genome dataset. Second, we show how FlexCap's localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. Our experiments illustrate FlexCap's utility for tasks including image…
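The pipeline described in the abstract (length-conditioned captions for boxes, then localized descriptions passed to a large language model for VQA) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `RegionCaption` type, the `<LENGTH-n>` token format, and the prompt layout are hypothetical and are not taken from the paper. The captioner and the language model are not called at all; the code only shows how a length condition might be encoded and how region descriptions might be assembled into a text prompt.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RegionCaption:
    """One localized description. Hypothetical container, not from the paper."""
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    caption: str


def length_prefix(num_words: int) -> str:
    """Return a length-conditioning token.

    The '<LENGTH-n>' format is an assumption used only for illustration.
    """
    if num_words < 1:
        raise ValueError("num_words must be at least 1")
    return f"<LENGTH-{num_words}>"


def build_vqa_prompt(regions: List[RegionCaption], question: str) -> str:
    """Assemble region descriptions and a question into one text prompt.

    The layout is a guess at what a text-only LLM could consume.
    """
    lines = ["Image region descriptions:"]
    for region in regions:
        x0, y0, x1, y1 = region.box
        lines.append(f"- box [{x0}, {y0}, {x1}, {y1}]: {region.caption}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


# Example data written by hand; a real system would obtain the captions
# from a region captioner conditioned on length_prefix(n).
regions = [
    RegionCaption((10, 20, 110, 220), "dog"),
    RegionCaption((10, 20, 110, 220), "a brown dog sitting on green grass"),
]
prompt = build_vqa_prompt(regions, "What animal is shown?")
print(length_prefix(7))
print(prompt)
```

The same box appears twice with a short and a long caption to mirror the idea of controlling information density: a short length budget yields something like an object label, while a larger budget yields a fuller description.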
Peer Reviews
Decision: NeurIPS 2024 poster
1. The proposed dataset can promote community research on the visual-controllable captioning task, which is useful for the development of user-friendly vision-language models. 2. FlexCap is easy to follow, and its controllable captioning capability, with positional information and varying information density, is beneficial for downstream tasks like VQA. 3. The experiments are comprehensive, demonstrating the capabilities of region control and length control (Sec. 4.1, Sec. 4.3). The VQA results gen
1. The architecture of FlexCap and its training setup lack novelty, as it is a typical transformer-based captioning model. However, this does not lead me to reject this paper, as the contributions to the task and dataset are useful. 2. The authors should carefully consider their statements in the paper. While this paper achieves region and length control, there are many other controllable signals such as mask/point control in visuals and emotion/style control in text (as seen in Caption Anything [45
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
