Attributes as Semantic Units between Natural Language and Visual Recognition
Marcus Rohrbach

TL;DR
This paper explores how attributes serve as semantic units bridging natural language and visual recognition, enabling improved interaction, recognition of new categories, image captioning, grounding language in visuals, and answering questions about images.
Contribution
It discusses how attributes act as semantic units for exchanging information between language and vision, supporting recognition of novel visual categories, description generation for images and video, language grounding, and visual question answering.
Findings
Attributes enable recognition of novel visual categories.
Attributes improve image and video captioning.
Attributes facilitate natural language grounding and question answering.
Abstract
Impressive progress has been made in the fields of computer vision and natural language processing. However, it remains a challenge to find the best point of interaction for these very different modalities. In this chapter, we discuss how attributes allow us to exchange information between the two modalities and in this way lead to an interaction on a semantic level. Specifically, we discuss how attributes allow using knowledge mined from language resources for recognizing novel visual categories, how we can generate sentence descriptions of images and video, how we can ground natural language in visual content, and finally, how we can answer natural language questions about images.
Taxonomy
Topics: Multimodal Machine Learning Applications
