Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags
Daiqing Qi, Handong Zhao, Zijun Wei, Sheng Li

TL;DR
This paper introduces TUNA, a retrieval-augmented method that enhances multimodal large language models with object-aware tags, significantly improving their ability to identify and describe objects in visual instructions without requiring larger models or more data.
Contribution
The paper proposes TUNA, a retrieval-augmented approach that enriches multimodal models with object tags, addressing object identification and description issues efficiently.
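As a rough illustration of what retrieval-augmented tagging can look like, the sketch below looks up the tags of the nearest stored image embeddings and prepends them to the instruction. It is a toy under stated assumptions, not the authors' implementation: the function names (`retrieve_tags`, `build_prompt`), the 3-dimensional embeddings, the datastore contents, and the prompt wording are all hypothetical, and the summary here does not specify how TUNA actually retrieves or injects tags.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_tags(query_embedding, datastore, k=2):
    """Return de-duplicated tags from the k entries most similar to the query.

    `datastore` is a hypothetical list of (embedding, tags) pairs.
    """
    ranked = sorted(
        datastore,
        key=lambda entry: cosine(query_embedding, entry[0]),
        reverse=True,
    )
    tags = []
    for _, entry_tags in ranked[:k]:
        for tag in entry_tags:
            if tag not in tags:
                tags.append(tag)
    return tags

def build_prompt(instruction, tags):
    """Prepend retrieved tags to the instruction (assumed prompt format)."""
    return "Possibly relevant objects: " + ", ".join(tags) + "\n" + instruction

# Toy datastore: made-up embeddings paired with object tags.
datastore = [
    ([1.0, 0.0, 0.0], ["dog", "frisbee"]),
    ([0.9, 0.1, 0.0], ["dog", "park"]),
    ([0.0, 1.0, 0.0], ["pizza", "plate"]),
]
query = [1.0, 0.05, 0.0]  # stands in for the embedding of the input image

tags = retrieve_tags(query, datastore, k=2)
prompt = build_prompt("Describe the image in detail.", tags)
print(prompt)
```

In this toy setup the retrieved tags act as a hint that the language model is free to use or ignore. A real system would swap in learned image embeddings and a much larger datastore.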
Findings
TUNA outperforms baselines on 12 benchmarks.
TUNA demonstrates strong zero-shot capabilities.
Enhancing image-to-text mapping improves object recognition.
Abstract
Despite recent advances in the general visual instruction-following ability of Multimodal Large Language Models (MLLMs), they still struggle with critical problems when required to provide a precise and detailed response to a visual instruction: (1) failure to identify novel objects or entities, (2) mention of non-existent objects, and (3) neglect of objects' attribute details. Intuitive solutions include improving the size and quality of data or using larger foundation models. They show effectiveness in mitigating these issues, but at the high cost of collecting a vast amount of new data and introducing a significantly larger model. Standing at the intersection of these approaches, we examine the three object-oriented problems from the perspective of the image-to-text mapping process by the multimodal connector. In this paper, we first identify the limitations of multimodal…
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling
