Attributes as Semantic Units between Natural Language and Visual Recognition

Marcus Rohrbach

arXiv:1604.03249·cs.CV·April 13, 2016

Open Access

TL;DR

This paper explores how attributes serve as semantic units bridging natural language and visual recognition, enabling improved interaction, recognition of new categories, image captioning, grounding language in visuals, and answering questions about images.

Contribution

The chapter discusses attributes as semantic units that let natural language and visual recognition exchange information, supporting novel-category recognition, captioning, language grounding, and visual question answering.

Findings

01. Attributes enable recognition of novel visual categories.

02. Attributes improve image and video captioning.

03. Attributes facilitate natural language grounding and question answering.

Abstract

Impressive progress has been made in the fields of computer vision and natural language processing. However, it remains a challenge to find the best point of interaction for these very different modalities. In this chapter, we discuss how attributes allow us to exchange information between the two modalities and in this way lead to an interaction on a semantic level. Specifically, we discuss how attributes allow using knowledge mined from language resources for recognizing novel visual categories, how we can generate sentence descriptions of images and video, how we can ground natural language in visual content, and finally, how we can answer natural language questions about images.

Taxonomy

Topics: Multimodal Machine Learning Applications