ProAct: A Benchmark and Multimodal Framework for Structure-Aware Proactive Response

Xiaomeng Zhu; Fengming Zhu; Weijie Zhou; Ye Tian; Zhenlin Hu; Yufei Huang; Yuchun Guo; Xinyu Wu; Zhengyou Zhang; Fangzhen Lin; Xuantang Xiong

arXiv:2602.03430·cs.RO·February 5, 2026

TL;DR

This paper introduces ProAct-75, a comprehensive benchmark for training and evaluating proactive agents across various domains, and proposes ProAct-Helper, a multimodal framework that leverages task graphs and heuristic search for complex, structure-aware decision-making.

Contribution

The paper presents a new benchmark dataset with detailed annotations and task graphs, and a multimodal baseline model that improves proactive decision-making through structural reasoning and parallel action execution.

Findings

01

ProAct-Helper outperforms strong models in trigger detection (6.21% mF1 improvement)

02

ProAct-Helper reduces decision steps by 0.25 on average in online tasks

03

ProAct-Helper increases the parallel action rate by 15.58%

Abstract

While passive agents merely follow instructions, proactive agents align with higher-level objectives, such as assistance and safety, by continuously monitoring the environment to determine when and how to act. However, developing proactive agents is hindered by the lack of specialized resources. To address this, we introduce ProAct-75, a benchmark designed to train and evaluate proactive agents across diverse domains, including assistance, maintenance, and safety monitoring. Spanning 75 tasks, our dataset features 91,581 step-level annotations enriched with explicit task graphs. These graphs encode step dependencies and parallel execution possibilities, providing the structural grounding necessary for complex decision-making. Building on this benchmark, we propose ProAct-Helper, a reference baseline powered by a Multimodal Large Language Model (MLLM) that grounds decision-making in state…
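
To make the idea of a task graph concrete, the following is a minimal Python sketch of how step dependencies can expose parallel execution possibilities. It is an illustration only, not the ProAct-75 annotation format or the ProAct-Helper method: the `TaskGraph` type, the functions `ready_steps` and `parallel_schedule`, and the tea-making example are all hypothetical names and data introduced here.

```python
from typing import Dict, List, Set

# Hypothetical representation: each step maps to the set of steps it depends on.
TaskGraph = Dict[str, Set[str]]


def ready_steps(graph: TaskGraph, done: Set[str]) -> List[str]:
    """Return the unfinished steps whose prerequisites are all complete."""
    return sorted(
        step
        for step, deps in graph.items()
        if step not in done and deps <= done
    )


def parallel_schedule(graph: TaskGraph) -> List[List[str]]:
    """Group steps into waves; steps within a wave do not depend on each other."""
    done: Set[str] = set()
    waves: List[List[str]] = []
    while len(done) < len(graph):
        wave = ready_steps(graph, done)
        if not wave:
            # No step can start, so the graph has a cycle or an unknown prerequisite.
            raise ValueError("cycle or missing prerequisite in task graph")
        waves.append(wave)
        done.update(wave)
    return waves


# Illustrative task (not taken from the benchmark): making tea.
tea: TaskGraph = {
    "boil_water": set(),
    "get_cup": set(),
    "add_tea_bag": {"get_cup"},
    "pour_water": {"boil_water", "add_tea_bag"},
}

if __name__ == "__main__":
    for i, wave in enumerate(parallel_schedule(tea), start=1):
        print(i, wave)
```

In this sketch, any wave containing more than one step marks a point where two actions could proceed at the same time, which is the kind of structure a proactive helper could exploit when deciding what to do while the user is busy with another step.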

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)