Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

Sohyun An (1; 2); Shuibenyang Yuan (1); Hayeon Lee (1); Cho-Jui Hsieh (2); Alexander Min (1) ((1) Meta Superintelligence Labs; (2) UCLA)

arXiv:2604.12967·cs.AI·April 15, 2026

TL;DR

This paper introduces Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents. CCS rewards a search trajectory by how well the original question can be reconstructed from it, and applies information bottlenecks so that the reward signal stays meaningful.

Contribution

The paper proposes a cycle-consistency-based training method for search agents that requires no gold supervision, such as ground-truth answers, which is difficult to scale.

Findings

01. CCS achieves performance comparable to supervised methods on QA benchmarks.

02. Applying information bottlenecks reduces superficial lexical reliance in question reconstruction.

03. CCS outperforms prior unsupervised methods in search agent training.

Abstract

Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question's intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to…
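
The reconstruction idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the names `token_f1`, `cycle_consistency_reward`, and `toy_reconstruct` are hypothetical, the reconstructor is a trivial stand-in for what would presumably be a learned model, and token-overlap F1 is an assumed similarity measure chosen only to keep the example self-contained. The sketch does not model the information bottlenecks mentioned in the findings; a purely lexical score like this one is exactly the kind of signal those bottlenecks are said to guard against.

```python
from collections import Counter
from typing import Callable, List


def token_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1 between two strings (assumed similarity measure)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cycle_consistency_reward(
    question: str,
    trajectory: List[str],
    reconstruct: Callable[[List[str]], str],
    similarity: Callable[[str, str], float] = token_f1,
) -> float:
    """Score a trajectory by how well the question can be rebuilt from it.

    No gold answer is used: the only supervision is the question itself.
    """
    reconstructed = reconstruct(trajectory)
    return similarity(question, reconstructed)


def toy_reconstruct(trajectory: List[str]) -> str:
    """Hypothetical stand-in reconstructor: concatenates the trajectory steps."""
    return " ".join(trajectory)


question = "who wrote hamlet"
relevant = ["who wrote hamlet"]      # trajectory that preserves the intent
irrelevant = ["weather in paris"]    # trajectory that loses the intent

good = cycle_consistency_reward(question, relevant, toy_reconstruct)
bad = cycle_consistency_reward(question, irrelevant, toy_reconstruct)
print(good, bad)
```

In this toy setup the relevant trajectory scores higher than the irrelevant one, which is the ordering a policy-optimization reward would need. Swapping `reconstruct` and `similarity` for stronger components leaves the structure of the reward unchanged.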
