Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags

Daiqing Qi; Handong Zhao; Zijun Wei; Sheng Li

arXiv:2406.10839·cs.CV·November 13, 2024

Open Access

TL;DR

This paper introduces TUNA, a retrieval-augmented method that enhances multimodal large language models with object-aware tags, significantly improving their ability to identify and describe objects in visual instructions without requiring larger models or more data.

Contribution

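The paper proposes TUNA, a retrieval-augmented approach that supplies multimodal models with retrieved object tags. It targets three object-oriented failures: missing novel objects or entities, mentioning non-existent objects, and neglecting objects' attribute details, without relying on a larger model or more training data.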
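The general idea of retrieving tags and passing them to the model alongside the instruction can be sketched as below. This is only an illustration under stated assumptions, not the paper's implementation: the datastore layout, the cosine-similarity ranking, the vote-based tag aggregation, the prompt format, and the names `retrieve_tags` and `build_prompt` are all hypothetical, and the toy two-dimensional vectors stand in for real image embeddings.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_tags(query_emb, datastore, k=2, max_tags=5):
    """Return the most frequent tags among the k nearest datastore entries.

    `datastore` is a hypothetical list of (embedding, tags) pairs; the paper's
    actual datastore and retrieval procedure may differ.
    """
    ranked = sorted(datastore, key=lambda entry: cosine(query_emb, entry[0]), reverse=True)
    counts = Counter()
    for _, tags in ranked[:k]:
        counts.update(tags)
    return [tag for tag, _ in counts.most_common(max_tags)]


def build_prompt(instruction, tags):
    """Prepend retrieved tags to the instruction (assumed prompt format)."""
    return f"Tags: {', '.join(tags)}\n{instruction}"


# Toy datastore: each entry pairs a made-up embedding with object tags.
datastore = [
    ([1.0, 0.0], ["dog", "frisbee"]),
    ([0.9, 0.1], ["dog", "grass"]),
    ([0.0, 1.0], ["car", "street"]),
]

tags = retrieve_tags([1.0, 0.0], datastore, k=2)
prompt = build_prompt("Describe the image in detail.", tags)
```

In this toy run the two nearest entries both contain "dog", so it ranks first, and the unrelated "car" entry contributes nothing. Counting votes across neighbors is one simple way to favor tags that several similar images agree on.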

Findings

01 TUNA outperforms baselines on 12 benchmarks.

02 TUNA demonstrates strong zero-shot capabilities.

03 Enhancing image-to-text mapping improves object recognition.

Abstract

Despite recent advances in the general visual instruction-following ability of Multimodal Large Language Models (MLLMs), they still struggle with critical problems when required to provide a precise and detailed response to a visual instruction: (1) failure to identify novel objects or entities, (2) mention of non-existent objects, and (3) neglect of objects' attribute details. Intuitive solutions include improving the size and quality of data or using larger foundation models. These mitigate the issues effectively, but at the high cost of collecting a vast amount of new data and introducing a significantly larger model. Standing at the intersection of these approaches, we examine the three object-oriented problems from the perspective of the image-to-text mapping process performed by the multimodal connector. In this paper, we first identify the limitations of multimodal…

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling