TL;DR
This paper introduces LMM-Searcher, a multimodal deep search framework for long-horizon search tasks. It offloads images to an external file system and refers to them in context by lightweight textual identifiers (UIDs), which keeps context size manageable over extended interactions while leaving the visual evidence available for later access.
Contribution
It proposes a novel file-based visual representation and a tailored fetch-image tool, along with a data synthesis pipeline, to enhance long-horizon multimodal search capabilities.
Findings
Achieves state-of-the-art results on long-horizon benchmarks like MM-BrowseComp.
Successfully scales to 100-turn search horizons.
Demonstrates strong generalizability across different base models.
Abstract
Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception.…
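The offloading idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `VisualStore`, the `img_` UID format, the `[image:...]` placeholder syntax, and the helper `render_observation` are all hypothetical, chosen only to show how images might be written to disk, replaced by short textual identifiers in the agent's context, and reloaded on demand by a fetch-image style call.

```python
import hashlib
import tempfile
from pathlib import Path


class VisualStore:
    """Hypothetical store that keeps image bytes on disk, keyed by a short UID."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def offload(self, image_bytes: bytes) -> str:
        """Write the image to disk and return a lightweight textual identifier."""
        # Assumption: a content hash is used as the UID; the paper may differ.
        uid = "img_" + hashlib.sha256(image_bytes).hexdigest()[:12]
        (self.root / uid).write_bytes(image_bytes)
        return uid

    def fetch_image(self, uid: str) -> bytes:
        """Load an image on demand (a stand-in for the fetch-image tool)."""
        path = self.root / uid
        if not path.is_file():
            raise KeyError(f"unknown image UID: {uid}")
        return path.read_bytes()


def render_observation(text: str, images: list[bytes], store: VisualStore) -> str:
    """Replace raw images with UID placeholders so only text enters the context."""
    uids = [store.offload(img) for img in images]
    placeholders = " ".join(f"[image:{uid}]" for uid in uids)
    return f"{text} {placeholders}".strip()


store = VisualStore(Path(tempfile.mkdtemp()))
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 64  # placeholder bytes, not a real image
observation = render_observation("Search result page 1.", [fake_png], store)

# The agent sees only the placeholder and can fetch the pixels later if needed.
uid = observation.split("[image:")[1].rstrip("]")
restored = store.fetch_image(uid)
```

In this sketch the context grows by a few dozen characters per image regardless of image size, and the image is read back only when the agent decides it needs to look at it.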
