MERGE: Guided Vision-Language Models for Multi-Actor Event Reasoning and Grounding in Human-Robot Interaction

Joerg Deigmoeller; Nakul Agarwal; Stephan Hasler; Daniel Tanneberg; Anna Belardinelli; Reza Ghoddoosian; Chao Wang; Felix Ocker; Fan Zhang; Behzad Dariush; Michael Gienger

arXiv:2603.18988 · cs.RO · March 20, 2026

Open Access

TL;DR

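MERGE pairs vision-language models with a lightweight perception pipeline to ground actors, objects, and events in multi-actor human-robot interactions. The pipeline invokes the VLM only when the scene changes, which the authors report roughly doubles the grounding score over baselines and cuts run time by a factor of 4.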

Contribution

The paper introduces MERGE, a guided vision-language model system with a perception pipeline for efficient, consistent multi-actor event grounding in dynamic human-robot interactions, and provides a new benchmark dataset.

Findings

01. Grounding score improved by a factor of 2 over baselines
02. Run time reduced by a factor of 4
03. Effective multi-actor event reasoning demonstrated

Abstract

We introduce MERGE, a system for situational grounding of actors, objects, and events in dynamic human-robot group interactions. Effective collaboration in such settings requires consistent situational awareness, built on persistent representations of people and objects and an episodic abstraction of events. MERGE achieves this by uniquely identifying physical instances of actors (humans or robots) and objects and structuring them into actor-action-object relations, ensuring temporal consistency across interactions. Central to MERGE is the integration of Vision-Language Models (VLMs) guided with a perception pipeline: a lightweight streaming module continuously processes visual input to detect changes and selectively invokes the VLM only when necessary. This decoupled design preserves the reasoning power and zero-shot generalization of VLMs while improving efficiency, avoiding both the…
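The decoupled design described in the abstract (a lightweight streaming module that detects changes and calls the VLM only when needed) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Relation` record, the `GatedReasoner` class, the mean-absolute-difference change measure, the threshold value, and the stand-in `fake_vlm` function are all hypothetical and chosen only to show the gating idea and the actor-action-object structure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Relation:
    """One actor-action-object relation with persistent instance IDs (hypothetical schema)."""
    actor: str
    action: str
    obj: str


Frame = Sequence[float]  # stand-in for per-frame visual features


def frame_change(prev: Frame, curr: Frame) -> float:
    """Mean absolute difference between two equally sized feature vectors."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)


class GatedReasoner:
    """Calls the expensive VLM only when the scene differs enough from the last analyzed frame."""

    def __init__(self, vlm: Callable[[Frame], List[Relation]], threshold: float):
        self.vlm = vlm                      # assumed interface: frame in, relations out
        self.threshold = threshold          # assumed tuning parameter
        self.reference: Optional[Frame] = None
        self.episode: List[Relation] = []   # episodic log of grounded events
        self.vlm_calls = 0

    def step(self, frame: Frame) -> bool:
        """Process one frame; return True if the VLM was invoked."""
        if self.reference is not None and frame_change(self.reference, frame) < self.threshold:
            return False                    # scene unchanged: skip the VLM
        self.reference = frame
        self.vlm_calls += 1
        self.episode.extend(self.vlm(frame))
        return True


def fake_vlm(frame: Frame) -> List[Relation]:
    """Placeholder for a real VLM call; reports an event only for 'active' frames."""
    return [Relation("human_1", "picks_up", "cup_1")] if frame[0] > 0.5 else []


reasoner = GatedReasoner(fake_vlm, threshold=0.2)
stream = [[0.0, 0.0], [0.05, 0.0], [1.0, 1.0], [1.0, 0.95]]
invoked = [reasoner.step(f) for f in stream]
print(invoked, reasoner.vlm_calls, reasoner.episode)
```

In this toy stream the VLM runs on two of four frames; the other two are skipped because they barely differ from the last analyzed frame. A real system would replace the feature difference with detector or tracker output and the placeholder with an actual VLM query.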

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Human Pose and Action Recognition