RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering

Yiming Zhang; Siyue Zhang; Junbo Zhao; Chen Zhao

arXiv:2602.17366·cs.CL·February 20, 2026

TL;DR

RPDR is a data augmentation framework that improves long-tail question answering by selecting easy-to-learn training data for dense retrievers, with gains reported on the long-tail benchmarks PopQA and EntityQuestions.

Contribution

Introduces RPDR, a novel data augmentation method using round-trip prediction for better training data selection in long-tail QA retrieval tasks.

Findings

01

RPDR outperforms existing retrievers such as BM25 and Contriever on long-tail benchmarks.

02

RPDR significantly improves retrieval accuracy on extremely long-tail categories.

03

Human analysis reveals strengths and limitations of the proposed approach.

Abstract

Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulties when generalizing to rare or niche knowledge. In this study, we introduce RPDR, a novel data augmentation framework that selects high-quality, easy-to-learn training data to enhance dense retrievers. Our approach is built around three core components: synthetic data generation, data selection with round-trip prediction to identify easy-to-learn instances, and retriever training with these instances. We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestions, demonstrating…
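The abstract does not spell out how round-trip prediction is computed, so the following is only an illustrative sketch under one plausible reading: a synthetic question generated from a passage counts as "easy to learn" if a retriever, given that question, ranks the source passage back within its top results. Every name here (SyntheticExample, overlap_score, round_trip_select, top_k) is hypothetical, and the token-overlap scorer is a toy stand-in for a real dense retriever, not the authors' method.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticExample:
    question: str    # synthetic question, assumed to be generated from a passage
    passage_id: str  # id of the passage the question was generated from


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(question, passage):
    """Toy relevance score: number of shared tokens (stand-in for a retriever)."""
    q_counts = Counter(tokenize(question))
    p_counts = Counter(tokenize(passage))
    return sum(min(count, p_counts[tok]) for tok, count in q_counts.items())


def round_trip_select(examples, corpus, top_k=1):
    """Keep examples whose source passage is ranked in the top_k for the question.

    Assumption: this "question -> retrieved passage -> matches source" check is
    one way to read round-trip prediction; the paper may define it differently.
    """
    kept = []
    for ex in examples:
        ranked = sorted(
            corpus,
            key=lambda pid: overlap_score(ex.question, corpus[pid]),
            reverse=True,
        )
        if ex.passage_id in ranked[:top_k]:
            kept.append(ex)
    return kept
```

In a real pipeline the selected examples would then serve as (question, positive passage) pairs for retriever training, which corresponds to the third component named in the abstract.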

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Expert finding and Q&A systems