# Open Data Synthesis For Deep Research

**Authors:** Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu

arXiv: 2509.00375 · 2025-09-03

## TL;DR

This paper introduces InfoSeek, a scalable framework that synthesizes complex Deep Research tasks formalized as Hierarchical Constraint Satisfaction Problems (HCSPs). On BrowseComp-Plus, 3B models trained on InfoSeek data outperform much larger 32B models.

## Contribution

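The paper presents a synthetic data generation method for Deep Research tasks: a dual-agent system recursively builds a Research Tree from large-scale webpages, blurs intermediate nodes into valid sub-problems, and converts each tree into a natural language question that requires traversing the full hierarchy.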

## Key findings

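- Models trained on InfoSeek consistently outperform strong baselines.
- On BrowseComp-Plus, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), and perform comparably to stronger APIs (e.g., Gemini2.5-Pro).
- Preserved meta-information (intermediate steps, retrieval labels) supports advanced optimization strategies, including compound reward design and trajectory-level exploration.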
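As an illustration of what a compound reward built on retrieval labels might look like, here is a minimal sketch. It is not the paper's reward: the function name `compound_reward`, the exact-match answer check, the recall-based retrieval term, and the `retrieval_weight` value are all assumptions made for this example.

```python
from typing import Iterable, Set


def compound_reward(
    predicted: str,
    gold: str,
    retrieved_ids: Iterable[str],
    gold_ids: Set[str],
    retrieval_weight: float = 0.3,  # hypothetical weight, not taken from the paper
) -> float:
    """Blend answer correctness with recall over labeled evidence documents."""
    # Answer term: case-insensitive exact match (an assumed scoring rule).
    answer_score = 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

    # Retrieval term: fraction of labeled evidence documents that were retrieved.
    if gold_ids:
        recall = len(set(retrieved_ids) & gold_ids) / len(gold_ids)
    else:
        recall = 0.0

    return (1.0 - retrieval_weight) * answer_score + retrieval_weight * recall


# Correct answer, but only one of the two labeled documents was retrieved.
reward = compound_reward("Paris", "paris", ["d1", "d2"], {"d1", "d3"})
```

The point of the retrieval term is that a trajectory reaching the right answer without touching the labeled evidence scores lower than one that does, which is one plausible way to discourage shortcut reasoning.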

## Abstract

Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research: tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning, knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurring intermediate nodes into valid sub-problems, and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via rejection sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On the challenging BrowseComp-Plus benchmark, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration. We provide our code and datasets at https://github.com/VectorSpaceLab/InfoSeek.
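To make the difference between a hierarchical and a flat constraint formulation concrete, the following toy sketch resolves a two-level problem over a tiny in-memory knowledge base. Everything in it is hypothetical and for illustration only: the `KB` contents, the `Node` class, and the `solve` function are not taken from the InfoSeek code, and the paper's formal HCSP definition may differ in detail.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

# Hypothetical toy knowledge base: entity name -> attributes.
KB = {
    "alice": {"field": "physics", "born": 1950, "advisor": "carol"},
    "bob": {"field": "physics", "born": 1950, "advisor": "dave"},
    "carol": {"field": "math", "born": 1920, "advisor": None},
    "dave": {"field": "chemistry", "born": 1925, "advisor": None},
}


@dataclass
class Node:
    """One unknown entity: direct constraints plus child sub-problems."""
    constraints: List[Callable[[dict], bool]] = field(default_factory=list)
    # Relation name -> child node; the relation's value must be a solution of the child.
    children: Dict[str, "Node"] = field(default_factory=dict)


def solve(node: Node) -> Set[str]:
    """Return every entity satisfying the node's constraints and all its children."""
    # Resolve the sub-problems first (bottom-up over the hierarchy).
    child_solutions = {rel: solve(child) for rel, child in node.children.items()}
    result = set()
    for name, attrs in KB.items():
        if not all(check(attrs) for check in node.constraints):
            continue
        if all(attrs.get(rel) in sols for rel, sols in child_solutions.items()):
            result.add(name)
    return result


# Sub-problem: "a mathematician" (the advisor is itself an unknown entity).
advisor = Node(constraints=[lambda a: a["field"] == "math"])

# Root: "a physicist born in 1950 whose advisor is a mathematician".
root = Node(
    constraints=[lambda a: a["field"] == "physics", lambda a: a["born"] == 1950],
    children={"advisor": advisor},
)

flat_candidates = solve(Node(constraints=root.constraints))  # ignores the hierarchy
answer = solve(root)                                         # uses the full hierarchy
```

In this toy setup the root's direct constraints alone leave two candidates (`alice` and `bob`), and only resolving the advisor sub-problem narrows the answer to one. That is the property the abstract asks of synthesized questions: they should require traversing the full hierarchy rather than admit a shortcut.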

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00375/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00375/full.md

## References

60 references — full list in the complete paper: https://tomesphere.com/paper/2509.00375/full.md

---
Source: https://tomesphere.com/paper/2509.00375