Transforming Questions and Documents for Semantically Aligned Retrieval-Augmented Generation

Seokgi Lee

arXiv:2508.09755·cs.CL·August 14, 2025

TL;DR

This paper presents a novel retrieval-augmented generation framework for multihop question answering that decomposes questions with LLMs and uses question-generated document embeddings for improved retrieval accuracy.

Contribution

It introduces a new approach combining LLM-based question decomposition and question-generated document embeddings for enhanced multihop QA performance.

Findings

01

Improved RAG performance on multihop datasets

02

Effective question decomposition reduces ambiguity

03

Question-generated embeddings outperform raw document embeddings

Abstract

We introduce a novel retrieval-augmented generation (RAG) framework tailored for multihop question answering. First, our system uses a large language model (LLM) to decompose complex multihop questions into a sequence of single-hop subquestions that guide document retrieval. This decomposition mitigates the ambiguity inherent in multihop queries by clearly targeting distinct knowledge facets. Second, instead of embedding raw or chunked documents directly, we generate answerable questions from each document chunk using Qwen3-8B, embed these generated questions, and retrieve relevant chunks via question-question embedding similarity. During inference, the retrieved chunks are then fed along with the original question into the RAG pipeline. We evaluate on three multihop question datasets (MuSiQue, 2WikiMultihopQA, HotpotQA) from LongBench. Our method improves RAG performance compared to…
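
The question-question retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The bag-of-words `embed` function is a toy stand-in for a real embedding model, the `generated_questions` table is hand-written in place of Qwen3-8B output, and the subquestions are hand-written in place of LLM decomposition (the second one assumes the answer to the first has already been filled in). All names here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical index: chunk id -> questions that the chunk can answer.
# In the paper these questions are generated by an LLM; here they are hand-written.
generated_questions = {
    "chunk_a": ["Who directed the film Inception?", "When was Inception released?"],
    "chunk_b": ["Where was Christopher Nolan born?", "What nationality is Christopher Nolan?"],
    "chunk_c": ["What is the capital of France?"],
}

# Embed the generated questions, not the chunk text itself.
index = [
    (chunk_id, embed(question))
    for chunk_id, questions in generated_questions.items()
    for question in questions
]


def retrieve(subquestion, top_k=1):
    """Rank chunks by their best-matching generated question."""
    query_vec = embed(subquestion)
    best = {}
    for chunk_id, question_vec in index:
        score = cosine(query_vec, question_vec)
        best[chunk_id] = max(best.get(chunk_id, 0.0), score)
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:top_k]


# Hand-written single-hop subquestions for the multihop question
# "Where was the director of Inception born?" (assumed decomposition).
subquestions = ["Who directed Inception?", "Where was Christopher Nolan born?"]

retrieved = []
for sq in subquestions:
    for chunk_id in retrieve(sq):
        if chunk_id not in retrieved:
            retrieved.append(chunk_id)

print(retrieved)
```

In a full pipeline, the text of the retrieved chunks would then be passed to the generator together with the original multihop question, as the abstract describes.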
