EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding
Yuejiao Su, Xinshen Zhang, Zhen Ye, Lei Yao, Lap-Pui Chau, Yi Wang

TL;DR
EARL is a two-stage reinforcement learning framework for egocentric interaction reasoning and pixel grounding. It explicitly transfers coarse interaction semantics to query-oriented answering and grounding, and reaches 65.48% cIoU on Ego-IRGBench, ahead of previous RL methods.
Contribution
The paper introduces EARL, a novel two-stage egocentric analysis-guided reinforcement learning framework with a new feature synthesizer and reward design for better interaction reasoning and grounding.
Findings
EARL achieves 65.48% cIoU on Ego-IRGBench, surpassing previous RL methods.
EARL outperforms prior methods by 8.37% in pixel grounding accuracy.
EARL demonstrates strong transferability to unseen egocentric grounding scenarios.
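The cIoU figure above can be made concrete with a small sketch. It assumes that cIoU here means cumulative IoU as commonly used in referring segmentation: intersections and unions are summed over the whole evaluation set before dividing. This page does not spell out the definition, and the function name `cumulative_iou` and the nested-list mask format are illustrative choices, not taken from the paper.

```python
from typing import Sequence

# Assumption: a binary mask is given as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def cumulative_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Cumulative IoU: summed intersection over summed union across all samples."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                total_inter += 1 if (p and g) else 0
                total_union += 1 if (p or g) else 0
    # Return 0.0 when no pixel is set in any mask, to avoid dividing by zero.
    return total_inter / total_union if total_union else 0.0


# Two toy 2x2 samples, each with intersection 1 and union 2.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [1, 0]]]
score = cumulative_iou(preds, gts)  # 2 / 4 = 0.5
```

Because the sums are pooled before the division, samples with large masks weigh more under this definition than under a per-image mean IoU.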
Abstract
Understanding human-environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic…
