Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
Razvan-George Pasca, Alexey Gavryushin, Muhammad Hamza, Yen-Ling Kuo, Kaichun Mo, Luc Van Gool, Otmar Hilliges, Xi Wang

TL;DR
This paper introduces TransFusion, a multimodal transformer architecture that uses language-based summaries of past actions to improve object interaction prediction in egocentric videos, outperforming previous methods.
Contribution
The paper presents a novel multimodal transformer model that leverages pre-trained language models to incorporate action context for better interaction anticipation.
Findings
TransFusion outperforms state-of-the-art methods by 40.4% (relative improvement) in overall mAP on the Ego4D test set.
Using language-based context summaries improves prediction accuracy.
The approach demonstrates the benefit of integrating language and vision for egocentric video understanding.
Abstract
We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects, coined action context. We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. TransFusion leverages pre-trained image captioning and vision-language models to extract the action context from past video frames. This action context together with the next video frame is processed by the multimodal fusion module to forecast the next object interaction. Our model enables more efficient end-to-end learning. The large pre-trained language models add common sense and a generalisation capability. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our multimodal fusion model. They also highlight the benefits of using…
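The data flow described in the abstract (summarize past actions as text, then fuse that text with the next frame) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the random word vectors standing in for a pre-trained language model, the fixed attention query, and the hand-made patch features are all hypothetical, and the prediction head is omitted.

```python
import math
import random


def summarize_action_context(past_actions):
    """Join per-frame (verb, noun) pairs into one sentence, dropping consecutive repeats."""
    phrases = []
    for verb, noun in past_actions:
        phrase = f"{verb} {noun}"
        if not phrases or phrases[-1] != phrase:
            phrases.append(phrase)
    return ", then ".join(phrases)


def embed_tokens(text, dim=8):
    """Hypothetical stand-in for a pre-trained language model: one fixed vector per word."""
    vectors = []
    for word in text.replace(",", "").split():
        rng = random.Random(word)  # string seed gives the same vector for the same word
        vectors.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return vectors


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tokens):
    """Single-head scaled dot-product attention of one query over a token sequence."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    fused = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(len(query))]
    return fused, weights


def fuse(context_text, frame_patches, dim=8):
    """Concatenate language tokens and visual tokens, then pool them with attention."""
    tokens = embed_tokens(context_text, dim) + frame_patches
    query = [1.0 / math.sqrt(dim)] * dim  # assumption: fixed query instead of a learned one
    return attend(query, tokens)


# Hypothetical past actions, as a captioning model might report them per frame.
past = [("take", "knife"), ("take", "knife"), ("cut", "onion")]
context = summarize_action_context(past)

# Hypothetical visual features for four patches of the next frame.
patches = [[0.1 * (i + j) for j in range(8)] for i in range(4)]

fused, weights = fuse(context, patches)
print(context)
print(len(weights), "attention weights over text and image tokens")
```

In a real system the fused vector would feed a learned head that forecasts the next object interaction; the sketch only shows how a short text summary and image tokens can share one attention step.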
Taxonomy
Topics: Speech and dialogue systems
Methods: Test
