Enhancing Retrieval-Augmented Generation with Topic-Enriched Embeddings: A Hybrid Approach Integrating Traditional NLP Techniques
Rodrigo Kataishi

TL;DR
This paper introduces topic-enriched embeddings that combine term-based signals and topic structure with contextual sentence embeddings to improve document retrieval in RAG systems, especially in complex, overlapping-topic corpora.
Contribution
It proposes a hybrid embedding method integrating TF-IDF, topic modeling, and contextual encodings to enhance retrieval accuracy and efficiency in knowledge-intensive tasks.
Findings
Improved semantic clustering and retrieval precision on legal texts.
Reduced computational burden compared to purely contextual methods.
Consistent gains in retrieval metrics across experiments.
Abstract
Retrieval-augmented generation (RAG) systems rely on accurate document retrieval to ground large language models (LLMs) in external knowledge, yet retrieval quality often degrades in corpora where topics overlap and thematic variation is high. This work proposes topic-enriched embeddings that integrate term-based signals and topic structure with contextual sentence embeddings. The approach combines TF-IDF with topic modeling and dimensionality reduction, using Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) to encode latent topical organization, and fuses these representations with a compact contextual encoder (all-MiniLM). By jointly capturing term-level and topic-level semantics, topic-enriched embeddings improve semantic clustering, increase retrieval precision, and reduce computational burden relative to purely contextual baselines. Experiments on a legal-text…
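The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the topical and contextual representations are combined, so the weighted concatenation below is an assumption, not the paper's method. The names `fuse`, `rank`, and `alpha` are hypothetical, and the toy vectors stand in for real all-MiniLM sentence embeddings and LSA/LDA topic vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(contextual, topical, alpha=0.5):
    """Concatenate two L2-normalized blocks with weights sqrt(alpha), sqrt(1 - alpha).

    ASSUMPTION: weighted concatenation is one plausible fusion scheme; the
    source does not specify the one it uses. With these weights, the dot
    product of two fused vectors equals
    alpha * cos(contextual) + (1 - alpha) * cos(topical).
    """
    c = l2_normalize(contextual)
    t = l2_normalize(topical)
    wc, wt = math.sqrt(alpha), math.sqrt(1.0 - alpha)
    return [wc * x for x in c] + [wt * x for x in t]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank(query_vec, doc_vecs):
    """Return document indices sorted by descending similarity to the query."""
    scores = [dot(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins (hypothetical): two documents that are nearly identical in the
# contextual space but belong to different topics.
doc_contextual = [[1.0, 0.0], [1.0, 0.1]]
doc_topical = [[1.0, 0.0], [0.0, 1.0]]
query_contextual = [1.0, 0.05]
query_topical = [0.0, 1.0]

fused_docs = [fuse(c, t) for c, t in zip(doc_contextual, doc_topical)]
fused_query = fuse(query_contextual, query_topical)
ranking = rank(fused_query, fused_docs)
```

In this toy case the contextual vectors barely separate the two documents, while the topical block clearly favors the second one, which is the situation the abstract attributes to overlapping-topic corpora. Adjusting `alpha` shifts how much each signal contributes to the final score.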
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Multimodal Machine Learning Applications
