NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

Ruisheng Cao; Hanchong Zhang; Tiancheng Huang; Zhangyi Kang; Yuxin Zhang; Liangtai Sun; Hanqi Li; Yuxun Miao; Shuai Fan; Lu Chen; Kai Yu

arXiv:2505.19754 · cs.CL · June 3, 2025

TL;DR

NeuSym-RAG introduces a hybrid neural-symbolic retrieval framework for PDF question answering, leveraging multi-view chunking and schema parsing to improve information retrieval from complex, structured academic papers.

Contribution

It presents a novel interactive retrieval method combining neural and symbolic approaches with multi-view PDF structuring, enhancing QA performance on academic documents.

Findings

01. Outperforms vector-based RAG and structured baselines on PDF QA datasets.

02. Effectively utilizes multiple views and structured data for improved retrieval.

03. Demonstrates stable and superior performance across diverse datasets.

Abstract

The increasing number of academic papers poses significant challenges for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single-view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi-view chunking and schema-based parsing, NeuSym-RAG organizes semi-structured PDF content into both the relational database and vectorstore, enabling LLM agents to iteratively gather context until sufficient to generate answers. Experiments on three full PDF-based QA…
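
The hybrid idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a toy under stated assumptions, in which the `CHUNKS` data, the table layout, the bag-of-words `embed` function (a stand-in for a real neural encoder), and the scripted action list (a stand-in for an LLM agent's decisions) are all hypothetical. It shows the same content stored in both a relational database and a vector index, with context gathered by mixing SQL queries and similarity search.

```python
import math
import sqlite3
from collections import Counter

# Hypothetical toy content as (section, view, text); not taken from the paper.
CHUNKS = [
    ("Introduction", "section", "retrieval augmented generation for question answering over academic papers"),
    ("Experiments", "section", "we evaluate on three datasets and report accuracy"),
    ("Experiments", "table", "table of accuracy scores for each baseline and dataset"),
]


def embed(text):
    """Bag-of-words counts: an assumed stand-in for a real neural text encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_stores(chunks):
    """Store every chunk in both a relational table and an in-memory vector index."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, section TEXT, view TEXT, text TEXT)")
    vectors = {}
    for i, (section, view, text) in enumerate(chunks):
        db.execute("INSERT INTO chunks VALUES (?, ?, ?, ?)", (i, section, view, text))
        vectors[i] = embed(text)
    return db, vectors


def symbolic_retrieve(db, sql, params=()):
    """Symbolic retrieval: run a SQL query against the relational store."""
    return db.execute(sql, params).fetchall()


def neural_retrieve(db, vectors, query, top_k=1):
    """Neural-style retrieval: rank chunks by similarity to the query."""
    q = embed(query)
    ranked = sorted(vectors, key=lambda i: cosine(q, vectors[i]), reverse=True)[:top_k]
    return [
        db.execute("SELECT id, section, view, text FROM chunks WHERE id = ?", (i,)).fetchone()
        for i in ranked
    ]


def gather_context(db, vectors, actions):
    """Run a scripted list of actions; an LLM agent would choose these one at a time."""
    context = []
    for kind, payload in actions:
        if kind == "sql":
            context.extend(symbolic_retrieve(db, *payload))
        elif kind == "vector":
            context.extend(neural_retrieve(db, vectors, payload))
        elif kind == "answer":
            break  # the agent judges the gathered context sufficient
    return context
```

In this sketch, a structural question ("which chunks are tables?") is naturally a SQL filter on the `view` column, while a fuzzy question is better served by similarity search. The loop in `gather_context` only hints at how an agent might interleave the two before answering.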

Code & Models

Repositories

x-lance/neusym-rag (Official)

Datasets

OpenDFM/AirQA-Real
dataset · 203 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Softmax · WordPiece · Weight Decay · Multi-Head Attention · Layer Normalization · Byte Pair Encoding