Reference-Aligned Retrieval-Augmented Question Answering over Heterogeneous Proprietary Documents

Nayoung Choi; Grace Byun; Andrew Chung; Ellie S. Paek; Shinsun Lee; Jinho D. Choi

arXiv:2502.19596·cs.AI·August 28, 2025

TL;DR

This paper presents a privacy-preserving, reference-aligned retrieval-augmented question answering system tailored for proprietary, heterogeneous corporate documents, demonstrating improved factual accuracy and informativeness in the automotive domain.

Contribution

The paper introduces a RAG-QA framework that combines a structured data pipeline, an on-premise architecture, and reference linking to address the challenges of data heterogeneity, confidentiality, and traceability.
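To make the idea of reference linking concrete, here is a minimal sketch, not the paper's method. It assumes that each answer sentence can be traced to a source by simple word overlap with retrieved chunks. The `Chunk` type, the `link_references` function, the document IDs, and the sample texts are all hypothetical and invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    doc_id: str  # hypothetical source-document identifier
    text: str    # passage extracted from that document


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_references(answer: str, chunks: list) -> list:
    """Pair each answer sentence with the doc_id of the chunk sharing the most words.

    Word overlap is an assumed stand-in for whatever alignment the real system uses.
    Sentences with no overlap at all are linked to None.
    """
    linked = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        words = tokens(sentence)
        best: Optional[Chunk] = max(
            chunks, key=lambda c: len(words & tokens(c.text)), default=None
        )
        has_overlap = best is not None and bool(words & tokens(best.text))
        linked.append((sentence, best.doc_id if has_overlap else None))
    return linked


# Made-up example data, for illustration only.
chunks = [
    Chunk("report_A", "The frontal crash test was run at 56 km/h."),
    Chunk("slides_B", "Side impact results showed door intrusion within limits."),
]
answer = "The frontal test was run at 56 km/h. Door intrusion stayed within limits."
result = link_references(answer, chunks)
for sentence, doc_id in result:
    print(f"[{doc_id}] {sentence}")
```

Each sentence of the answer carries the identifier of the document it most plausibly came from, which is the traceability property the contribution describes. A production system would likely replace word overlap with a stronger alignment signal.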

Findings

01 Improved factual correctness (+1.79, +1.94)
02 Enhanced informativeness (+1.33, +1.16)
03 Increased helpfulness (+1.08, +1.67)

Abstract

Proprietary corporate documents contain rich domain-specific knowledge, but their overwhelming volume and disorganized structure make it difficult even for employees to access the right information when needed. For example, in the automotive industry, vehicle crash-collision tests, each costing hundreds of thousands of dollars, produce highly detailed documentation. However, retrieving relevant content during decision-making remains time-consuming due to the scale and complexity of the material. While Retrieval-Augmented Generation (RAG)-based Question Answering (QA) systems offer a promising solution, building an internal RAG-QA system poses several challenges: (1) handling heterogeneous multi-modal data sources, (2) preserving data confidentiality, and (3) enabling traceability between each piece of information in the generated answer and its original source document. To address…

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Natural Language Processing Techniques