Learning Multiple Object States from Actions via Large Language Models

Masatoshi Tateno; Takuma Yagi; Ryosuke Furuta; Yoichi Sato

arXiv:2405.01090 · cs.CV · November 8, 2024

TL;DR

This paper introduces a multi-label classification approach for recognizing multiple concurrent object states in videos, leveraging large language models to generate pseudo-labels from narrations and past states, and presents a new dataset for evaluation.

Contribution

The paper proposes a method that uses LLMs to generate pseudo-labels for multiple object states from narrations while taking past states into account, and introduces the MOST dataset for evaluating this task.

Findings

01. A model trained on LLM-generated pseudo-labels outperforms vision-language baselines.

02. Incorporating past-state context improves object state recognition accuracy.

03. The MOST dataset provides multi-label object state annotations for evaluation.

Abstract

Recognizing the states of objects in a video is crucial in understanding the scene beyond actions and objects. For instance, an egg can be raw, cracked, and whisked while cooking an omelet, and these states can coexist simultaneously (an egg can be both raw and whisked). However, most existing research assumes a single object state change (e.g., uncracked -> cracked), overlooking the coexisting nature of multiple object states and the influence of past states on the current state. We formulate object state recognition as a multi-label classification task that explicitly handles multiple states. We then propose to learn multiple object states from narrated videos by leveraging large language models (LLMs) to generate pseudo-labels from the transcribed narrations, capturing the influence of past states. The challenge is that narrations mostly describe human actions in the video but rarely…
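The multi-label formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the state vocabulary `STATES`, the helper names, and the 0.5 decision threshold are all assumptions made for illustration. The point it shows is that each state gets its own independent sigmoid score, so states such as "raw" and "whisked" can be active at the same time, unlike a softmax over mutually exclusive classes.

```python
import math

# Hypothetical state vocabulary for one object (an egg); illustrative only,
# not taken from the paper or the MOST dataset.
STATES = ["raw", "cracked", "whisked", "cooked"]


def multi_hot(active_states, vocabulary=STATES):
    """Encode a set of coexisting states as a 0/1 vector over the vocabulary."""
    active = set(active_states)
    unknown = active - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown states: {sorted(unknown)}")
    return [1.0 if s in active else 0.0 for s in vocabulary]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy, with one independent sigmoid per state."""
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(targets)


def predict_states(logits, vocabulary=STATES, threshold=0.5):
    """Return every state whose sigmoid score reaches the threshold."""
    return [s for s, z in zip(vocabulary, logits) if sigmoid(z) >= threshold]


# A pseudo-label for one frame: the egg is both raw and whisked.
target = multi_hot(["raw", "whisked"])
# Made-up per-frame logits standing in for a video model's output.
logits = [2.0, -3.0, 1.5, -2.0]
loss = multilabel_bce(logits, target)
predicted = predict_states(logits)
```

Here `target` plays the role of an LLM-generated pseudo-label and `logits` the role of a per-frame model output; in practice both would come from a real labeling pipeline and a trained network.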

Taxonomy

Topics: Topic Modeling