TL;DR
HyperEyes is a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, and it is trained for both efficiency and accuracy with dual-grained reinforcement learning.
Contribution
It presents a novel dual-grained reinforcement learning framework and a new benchmark for evaluating search efficiency alongside accuracy.
Findings
HyperEyes-30B outperforms comparable agents by 9.9% in accuracy.
HyperEyes-30B also uses 5.3x fewer tool-call rounds on average.
The framework effectively balances search capability and inference efficiency.
Abstract
Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building…
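The "wider rather than longer" idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `GroundedQuery`, `search_tool`, and the round counting are hypothetical names introduced here, and the retrieval backend is a stub that returns a canned string. The sketch only contrasts one tool call per round with dispatching all independent grounded queries in a single round.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedQuery:
    """Hypothetical atomic action: an image region paired with a search string."""
    region: tuple[int, int, int, int]  # (x0, y0, x1, y1) crop box, assumed format
    text: str


async def search_tool(query: GroundedQuery) -> str:
    """Stand-in for a real retrieval backend (assumption: it is awaitable)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"result for {query.text!r} in region {query.region}"


async def sequential_search(queries):
    """Baseline: one tool call per round, so rounds grow with entity count."""
    results, rounds = [], 0
    for q in queries:
        results.append(await search_tool(q))
        rounds += 1
    return results, rounds


async def parallel_search(queries):
    """Parallel variant: all independent queries are dispatched in one round."""
    results = await asyncio.gather(*(search_tool(q) for q in queries))
    return list(results), 1


def run_demo():
    # Three independent entities in one image (made-up boxes and strings).
    queries = [
        GroundedQuery((0, 0, 50, 50), "logo on the left"),
        GroundedQuery((60, 0, 110, 50), "building in the center"),
        GroundedQuery((120, 0, 170, 50), "sign on the right"),
    ]
    seq_results, seq_rounds = asyncio.run(sequential_search(queries))
    par_results, par_rounds = asyncio.run(parallel_search(queries))
    return seq_results, seq_rounds, par_results, par_rounds
```

In this toy setting both strategies return the same results in the same order, since `asyncio.gather` preserves input order. Only the number of interaction rounds differs, which is the quantity the paper treats as a training objective.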
