Action Without Interaction: Probing the Physical Foundations of Video LMMs via Contact-Release Detection
Daniel Harari, Michael Sidorov, Chen Shterental, Liel David, Abrham Kahsay Gebreselasie, Muhammad Haris Khan

TL;DR
This paper investigates whether large multi-modal models truly understand physical interactions in videos by testing their ability to detect contact and release events, revealing a gap between semantic recognition and physical grounding.
Contribution
Introduces a large-scale dataset of more than 20K annotated contact and release events on Something-Something-V2 videos, and uses it to evaluate the physical reasoning of state-of-the-art LMMs on video.
Findings
Models reliably identify objects and actions but fail to locate interaction start/end frames.
Models exhibit shortcut learning, recognizing patterns without understanding physical primitives.
There is a disconnect between semantic success and physical grounding in current LMMs.
Abstract
Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail the objects, the surroundings and the dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first-of-its-kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked SoTA LMMs, including GPT, Gemini and Qwen, to locate…
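As a rough illustration of the "when and where" task described above, the sketch below scores a model's predicted event frame and location against a human annotation. It is not the paper's evaluation protocol: the `EventAnnotation` record, the pixel coordinates, and the `tol_frames` tolerance are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class EventAnnotation:
    """Hypothetical record for one annotated contact or release event."""
    frame: int   # annotated frame index of the event
    x: float     # annotated event location in pixels (assumed format)
    y: float


def temporal_error(pred_frame: int, ann: EventAnnotation) -> int:
    """Absolute distance, in frames, between prediction and annotation."""
    return abs(pred_frame - ann.frame)


def spatial_error(pred_xy: tuple[float, float], ann: EventAnnotation) -> float:
    """Euclidean distance, in pixels, between predicted and annotated location."""
    return math.hypot(pred_xy[0] - ann.x, pred_xy[1] - ann.y)


def temporal_accuracy(pred_frames: list[int],
                      anns: list[EventAnnotation],
                      tol_frames: int = 2) -> float:
    """Fraction of predictions within tol_frames of the annotated frame.

    tol_frames is an assumed tolerance, not a value taken from the paper.
    """
    if len(pred_frames) != len(anns):
        raise ValueError("predictions and annotations must align one-to-one")
    if not anns:
        return 0.0
    hits = sum(temporal_error(p, a) <= tol_frames
               for p, a in zip(pred_frames, anns))
    return hits / len(anns)
```

A model that names the right object and action but places the contact frame far from the annotated one would score well on a captioning-style check and poorly on `temporal_accuracy`, which is the kind of gap the study probes.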
