TL;DR
This paper introduces POINTS-Seeker, a multimodal agentic search model built from scratch rather than retrofitted from a general LMM. It combines a dedicated Agentic Seeding training phase with V-Fold, an adaptive history-aware compression scheme, to keep visual reasoning reliable over long interactions.
Contribution
The paper presents Agentic Seeding, a dedicated training phase that lays the groundwork for agentic behaviors; V-Fold, a history-aware compression scheme for long-horizon interactions; and POINTS-Seeker-8B, a model that outperforms existing models on six benchmarks.
Findings
POINTS-Seeker-8B outperforms existing models across six benchmarks.
V-Fold addresses the long-horizon bottleneck in which a growing interaction history overwhelms the model's ability to locate ground-truth evidence.
Agentic Seeding lays the foundational precursors needed to elicit agentic behaviors in the model.
Abstract
While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase designed to weave the foundational precursors necessary for eliciting agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the increasing volume of interaction history overwhelms the model's ability to locate ground-truth evidence. To mitigate this, we…
