Goal-driven text descriptions for images

Ruotian Luo

arXiv:2108.12575·cs.CV·August 31, 2021·1 cites

TL;DR

This thesis explores methods for generating goal-driven text descriptions of images, covering referring expressions, discriminative captions, diverse outputs, length control, and informative tags, with the aim of improving an AI agent's ability to communicate about visual content in natural language.

Contribution

It introduces several novel techniques for improving image captioning and description generation, focusing on discriminability, diversity, controllability, and informativeness.

Findings

01. Discriminative referring expressions improve object identification.

02. Enhanced caption discriminability leads to more descriptive outputs.

03. Training strategies impact caption diversity and quality.

Abstract

A big part of achieving Artificial General Intelligence (AGI) is to build a machine that can see and listen like humans. Much work has focused on designing models for image classification, video classification, object detection, pose estimation, speech recognition, etc., and has achieved significant progress in recent years thanks to deep learning. However, understanding the world is not enough. An AI agent also needs to know how to talk, especially how to communicate with a human. While perception (vision, for example) is more common across animal species, the use of complicated language is unique to humans and is one of the most important aspects of intelligence. In this thesis, we focus on generating textual output given visual input. In Chapter 3, we focus on generating the referring expression, a text description for an object in the image so that a receiver can infer which object…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques