Goal-driven text descriptions for images
Ruotian Luo

TL;DR
This thesis explores methods for generating goal-driven textual descriptions of images, including referring expressions, discriminative captions, diverse outputs, length control, and informative tags, with the aim of improving how AI systems communicate about visual content in natural language.
Contribution
It introduces several novel techniques for improving image captioning and description generation, focusing on discriminability, diversity, controllability, and informativeness.
Findings
Discriminative referring expressions improve object identification.
Enhanced caption discriminability leads to more descriptive outputs.
Training strategies impact caption diversity and quality.
Abstract
A big part of achieving Artificial General Intelligence (AGI) is to build a machine that can see and listen like humans. Much work has focused on designing models for image classification, video classification, object detection, pose estimation, speech recognition, etc., and has achieved significant progress in recent years thanks to deep learning. However, understanding the world is not enough. An AI agent also needs to know how to talk, especially how to communicate with a human. While perception (vision, for example) is more common across animal species, the use of complicated language is unique to humans and is one of the most important aspects of intelligence. In this thesis, we focus on generating textual output given visual input. In Chapter 3, we focus on generating the referring expression, a text description for an object in the image so that a receiver can infer which object…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
