Reference-Aligned Retrieval-Augmented Question Answering over Heterogeneous Proprietary Documents
Nayoung Choi, Grace Byun, Andrew Chung, Ellie S. Paek, Shinsun Lee, Jinho D. Choi

TL;DR
This paper presents a privacy-preserving, reference-aligned retrieval-augmented question answering system tailored for proprietary, heterogeneous corporate documents, demonstrating improved factual accuracy and informativeness in the automotive domain.
Contribution
The paper introduces a RAG-QA framework that combines a structured data pipeline, an on-premise architecture, and reference linking to address the challenges of data heterogeneity, confidentiality, and traceability.
Findings
Improved factual correctness (+1.79, +1.94)
Enhanced informativeness (+1.33, +1.16)
Increased helpfulness (+1.08, +1.67)
Abstract
Proprietary corporate documents contain rich domain-specific knowledge, but their overwhelming volume and disorganized structure make it difficult even for employees to access the right information when needed. For example, in the automotive industry, vehicle crash-collision tests, each costing hundreds of thousands of dollars, produce highly detailed documentation. However, retrieving relevant content during decision-making remains time-consuming due to the scale and complexity of the material. While Retrieval-Augmented Generation (RAG)-based Question Answering (QA) systems offer a promising solution, building an internal RAG-QA system poses several challenges: (1) handling heterogeneous multi-modal data sources, (2) preserving data confidentiality, and (3) enabling traceability between each piece of information in the generated answer and its original source document. To address…
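The traceability requirement in challenge (3) can be illustrated with a small sketch. This is not the authors' method, which the truncated abstract does not describe. It is a minimal stand-in that links each sentence of a generated answer to the retrieved chunk with the highest token overlap (Jaccard similarity). The names `Chunk`, `link_references`, and `min_overlap`, and the document identifiers, are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str  # hypothetical identifier of the source document
    text: str    # retrieved passage taken from that document


def tokens(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_references(answer, chunks, min_overlap=0.2):
    """Pair each answer sentence with the doc_id of its best-matching chunk.

    Assumption: Jaccard token overlap stands in for whatever alignment
    signal a real system would use. Sentences whose best score falls
    below min_overlap are left unlinked (None).
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    linked = []
    for sentence in sentences:
        sent_tokens = tokens(sentence)
        best_id, best_score = None, 0.0
        for chunk in chunks:
            chunk_tokens = tokens(chunk.text)
            union = sent_tokens | chunk_tokens
            score = len(sent_tokens & chunk_tokens) / len(union) if union else 0.0
            if score > best_score:
                best_id, best_score = chunk.doc_id, score
        linked.append((sentence, best_id if best_score >= min_overlap else None))
    return linked


if __name__ == "__main__":
    chunks = [
        Chunk("report_A", "The frontal crash test was run at 56 km/h."),
        Chunk("report_B", "The side airbag deployed within 12 ms."),
    ]
    answer = (
        "The frontal crash test was run at 56 km/h. "
        "The side airbag deployed within 12 ms. "
        "Weather was sunny."
    )
    for sentence, doc_id in link_references(answer, chunks):
        print(f"{sentence} -> {doc_id}")
```

In this toy example the third sentence has no supporting chunk, so it stays unlinked. A system could flag such sentences as unsupported instead of attaching a misleading reference.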
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Natural Language Processing Techniques
