EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding

Yuejiao Su; Xinshen Zhang; Zhen Ye; Lei Yao; Lap-Pui Chau; Yi Wang

arXiv:2605.14742 · cs.CV · May 15, 2026

TL;DR

EARL is a reinforcement learning framework that improves egocentric interaction reasoning and pixel grounding by explicitly transferring coarse semantics to detailed responses, outperforming previous methods.

Contribution

The paper introduces EARL, a novel two-stage egocentric analysis-guided reinforcement learning framework with a new feature synthesizer and reward design for better interaction reasoning and grounding.

Findings

01. EARL achieves 65.48% cIoU on Ego-IRGBench, surpassing previous RL methods.

02. EARL outperforms prior methods by 8.37% in pixel grounding accuracy.

03. EARL demonstrates strong transferability to unseen egocentric grounding scenarios.
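
The findings above report grounding quality as cIoU. As a minimal sketch, assuming cIoU here means the cumulative IoU commonly used in referring and reasoning segmentation (total intersection over total union across all samples), the metric can be computed as follows. The mask representation and the function names are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, Sequence, Tuple

# Assumption: a mask is a 2D grid of 0/1 values, 1 marking foreground pixels.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count foreground pixels in the intersection and union of two masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same shape")
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("masks must have the same shape")
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def cumulative_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Sum intersections and unions over all samples, then divide once."""
    total_inter = total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # An empty union (no foreground anywhere) is reported as 0.0 by convention here.
    return total_inter / total_union if total_union else 0.0


# Toy example: two (prediction, ground truth) pairs of 2x2 masks.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # intersection 1, union 2
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),  # intersection 4, union 4
]
score = cumulative_iou(pairs)  # (1 + 4) / (2 + 4)
```

Because the division happens once over the whole dataset, samples with large objects weigh more than samples with small ones. This is what distinguishes cumulative IoU from a per-sample average.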

Abstract

Understanding human-environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic…
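
The two-stage data flow described in the abstract can be sketched as below. This is only an illustration of the control flow, not the authors' implementation: every type, field, and function name is hypothetical, the descriptor is assumed to be a plain list of floats, and the stages are stand-in stubs rather than real models. The abstract is truncated before it says how the descriptor is used, so the sketch simply hands it to the second stage.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CoarseInterpretation:
    """Stage-one output (hypothetical container)."""
    description: str         # structured textual description of the interaction
    descriptor: List[float]  # global interaction descriptor; representation assumed


@dataclass
class FineResponse:
    """Stage-two output (hypothetical container)."""
    answer: str              # textual answer to the user query
    mask: List[List[int]]    # pixel-level binary mask


# Hypothetical stage signatures; the image type is left open on purpose.
CoarseStage = Callable[[object], CoarseInterpretation]
FineStage = Callable[[object, str, CoarseInterpretation], FineResponse]


def two_stage_parse(image: object, query: str,
                    coarse_stage: CoarseStage,
                    fine_stage: FineStage) -> FineResponse:
    """Run coarse interpretation first, then the query-oriented response."""
    coarse = coarse_stage(image)
    return fine_stage(image, query, coarse)


# Stand-in stubs so the sketch runs end to end; real stages would be models.
def toy_coarse_stage(image: object) -> CoarseInterpretation:
    return CoarseInterpretation(description="hand holds cup", descriptor=[0.0, 1.0])


def toy_fine_stage(image: object, query: str,
                   coarse: CoarseInterpretation) -> FineResponse:
    answer = f"Based on '{coarse.description}': answering '{query}'"
    return FineResponse(answer=answer, mask=[[0, 1], [0, 1]])


result = two_stage_parse(None, "What is the hand holding?",
                         toy_coarse_stage, toy_fine_stage)
```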
