Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

Razvan-George Pasca; Alexey Gavryushin; Muhammad Hamza; Yen-Ling Kuo; Kaichun Mo; Luc Van Gool; Otmar Hilliges; Xi Wang

arXiv:2301.09209 · cs.CV · March 12, 2024 · 1 citation

Open Access

TL;DR

This paper introduces TransFusion, a multimodal transformer architecture that uses language-based summaries of past actions to improve object interaction prediction in egocentric videos, outperforming previous methods.

Contribution

The paper presents a novel multimodal transformer model that leverages pre-trained language models to incorporate action context for better interaction anticipation.

Findings

01. TransFusion outperforms state-of-the-art methods by 40.4% (in relative terms) in overall mAP on the Ego4D test set.

02. Using language-based context summaries improves prediction accuracy.

03. The approach demonstrates the benefit of integrating language and vision for egocentric video understanding.

Abstract

We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects, coined action context. We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. TransFusion leverages pre-trained image captioning and vision-language models to extract the action context from past video frames. This action context together with the next video frame is processed by the multimodal fusion module to forecast the next object interaction. Our model enables more efficient end-to-end learning. The large pre-trained language models add common sense and a generalisation capability. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our multimodal fusion model. They also highlight the benefits of using…
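
The data flow described in the abstract (caption the past frames, summarize them into an action context, then fuse that context with the next frame) can be sketched roughly as follows. This is a minimal illustration with stand-in stubs, not the authors' implementation: every name here (`summarize_action_context`, `anticipate`, `toy_captioner`, `toy_fusion`, `Prediction`) is hypothetical, frames are represented as plain strings, and the step that drops repeated captions is an assumption made only to keep the summary short.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-in: a real system would pass image arrays, not strings.
Frame = str


@dataclass
class Prediction:
    """Toy output: the object and the action expected next."""
    noun: str
    verb: str


def summarize_action_context(
    past_frames: Sequence[Frame],
    captioner: Callable[[Frame], str],
) -> str:
    """Caption each past frame and join the captions into one text summary."""
    deduped: List[str] = []
    for frame in past_frames:
        caption = captioner(frame)
        # Assumption: skip consecutive duplicate captions to shorten the summary.
        if not deduped or deduped[-1] != caption:
            deduped.append(caption)
    return "; ".join(deduped)


def anticipate(
    past_frames: Sequence[Frame],
    next_frame: Frame,
    captioner: Callable[[Frame], str],
    fusion_model: Callable[[str, Frame], Prediction],
) -> Prediction:
    """Feed the language summary plus the next frame to a fusion model."""
    context = summarize_action_context(past_frames, captioner)
    return fusion_model(context, next_frame)


def toy_captioner(frame: Frame) -> str:
    # Stub for a pre-trained captioning model.
    return f"person handles {frame}"


def toy_fusion(context: str, frame: Frame) -> Prediction:
    # Stub for the multimodal fusion module: a fixed rule, no learning involved.
    return Prediction(noun=frame, verb="take")


context = summarize_action_context(["knife", "knife", "onion"], toy_captioner)
prediction = anticipate(["knife", "knife", "onion"], "pan", toy_captioner, toy_fusion)
```

Passing the captioner and fusion model as callables keeps the sketch independent of any particular pre-trained model; in practice both stubs would be replaced by learned components.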

Taxonomy

Topics: Speech and dialogue systems

Methods: Test