A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts
Jhon Rayo, Raul de la Rosa, Mario Garrido

TL;DR
This paper presents a hybrid information retrieval system combining lexical and semantic search techniques, enhanced with LLMs for answer generation, to improve regulatory text comprehension and retrieval accuracy.
Contribution
It introduces a novel hybrid retrieval approach integrating sentence transformers with BM25 and LLMs within a RAG framework for regulatory texts.
Findings
Significant improvement in Recall@10 and MAP@10 over standalone methods
Effective combination of lexical and semantic search for regulatory document retrieval
Open sharing of models and methodology to support future research
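For readers unfamiliar with the two metrics named in the findings, the following is a minimal sketch of how Recall@10 and MAP@10 are commonly computed. The function names are illustrative, and the normalization of average precision by min(number of relevant documents, k) is one common convention assumed here; the paper's own evaluation code may differ.

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def average_precision_at_k(ranked, relevant, k=10):
    """Average of precision values at each rank where a relevant doc appears.

    Assumption: normalized by min(len(relevant), k), one common convention.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(len(relevant), k)


def map_at_k(runs, k=10):
    """Mean of AP@k over (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Hypothetical example: one query, four retrieved ids, three relevant ids.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "x"}
recall = recall_at_k(ranked, relevant)
ap = average_precision_at_k(ranked, relevant)
```

Recall@10 rewards finding relevant passages anywhere in the top ten, while MAP@10 also rewards placing them near the top, which is why the two are usually reported together.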
Abstract
Regulatory texts are inherently long and complex, presenting significant challenges for information retrieval systems in supporting regulatory officers with compliance tasks. This paper introduces a hybrid information retrieval system that combines lexical and semantic search techniques to extract relevant information from large regulatory corpora. The system integrates a fine-tuned sentence transformer model with the traditional BM25 algorithm to achieve both semantic precision and lexical coverage. To generate accurate and comprehensive responses, retrieved passages are synthesized using Large Language Models (LLMs) within a Retrieval Augmented Generation (RAG) framework. Experimental results demonstrate that the hybrid system significantly outperforms standalone lexical and semantic approaches, with notable improvements in Recall@10 and MAP@10. By openly sharing our fine-tuned model…
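The combination of lexical and semantic scores described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the whitespace tokenizer, the min-max normalization, the weighted-sum fusion, and the weight `alpha` are all assumptions, and the semantic scores are hand-written placeholder values standing in for cosine similarities that a sentence transformer would produce.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the standard BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(values):
    """Rescale scores to [0, 1] so lexical and semantic scores are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(lexical, semantic, alpha=0.5):
    """Return document indices sorted by a weighted sum of normalized scores.

    Assumption: alpha weights the semantic score, (1 - alpha) the lexical one.
    """
    lex, sem = min_max(lexical), min_max(semantic)
    fused = [alpha * s + (1 - alpha) * l for l, s in zip(lex, sem)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


docs = [
    "banks must report suspicious transactions",
    "the weather is nice today",
    "reporting obligations for financial institutions",
]
query = "report suspicious transactions"
lexical = bm25_scores(query, docs)
semantic = [0.8, 0.1, 0.7]  # placeholder similarities, not real model output
ranking = hybrid_rank(lexical, semantic)
```

In this toy case BM25 gives the third document a score of zero because "reporting" does not match "report" exactly, while the placeholder semantic score still ranks it above the unrelated document. That gap between exact-term matching and meaning-based matching is the motivation for combining the two signals.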
Taxonomy
Topics: Advanced Text Analysis Techniques
