FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking

Denys Katerenchuk; Pablo Duboue; Keelan Evanini; David Gondek; Nithin Govindugari; Olivier Allauzen; Joshua Baptiste; David J More; Joshua Schechter

arXiv:2605.05482 · cs.AI · May 8, 2026

TL;DR

This paper introduces FinRAG-12B, a grounded, domain-specific language model for banking that achieves high accuracy, reliable citations, and safe refusal, optimized for real-world deployment with significant performance and cost benefits.

Contribution

The paper presents a data-efficient training pipeline, a calibrated refusal mechanism, and an end-to-end deployment methodology for grounded banking LLMs; the resulting model outperforms GPT-4.1 on citation grounding.

Findings

01

FinRAG-12B outperforms GPT-4.1 in citation grounding accuracy.

02

Training on 22% unanswerable examples yields a calibrated 12% "I don't know" rate.

03

Deployment at financial institutions improves query resolution by 7.1 percentage points.

Abstract

Large language models (LLMs) are rapidly being adopted across various domains. However, their adoption in the banking industry faces resistance due to demands for high accuracy, regulatory compliance, and the need for verifiable and grounded responses. We present a unified, data-efficient framework for training grounded domain-specific LLMs that optimizes answer quality, citation grounding, and calibrated refusal under real-world deployment constraints. First, we describe a data generation pipeline that combines LLM-as-a-Judge filtering, citation annotation, and curriculum learning with only 143M tokens. The resulting 12B model achieves high answer quality, outperforming GPT-4.1 on citation grounding, with a modest citation tradeoff versus the untuned base. Second, we propose a calibrated refusal mechanism: training on 22% unanswerable examples yields a 12% "I don't know" rate, substantially…
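
The 22% figure in the abstract describes the share of unanswerable examples in the training data. The sketch below only illustrates the arithmetic of assembling such a mix; it is not the authors' pipeline, which the text above does not specify. The function name `build_training_mix`, the record fields (`question`, `answer`, `target`), and the refusal string `REFUSAL_TEXT` are all hypothetical choices made for this example.

```python
import random

# Hypothetical refusal target; the paper's actual refusal wording is not given here.
REFUSAL_TEXT = "I don't know"


def build_training_mix(answerable, unanswerable, unanswerable_fraction=0.22, seed=0):
    """Combine answerable and unanswerable examples so that roughly
    `unanswerable_fraction` of the final mix is unanswerable."""
    if not 0.0 <= unanswerable_fraction < 1.0:
        raise ValueError("unanswerable_fraction must be in [0, 1)")

    rng = random.Random(seed)

    # Solve n_u / (n_a + n_u) = f for n_u, keeping every answerable example.
    n_answerable = len(answerable)
    n_unanswerable = round(
        unanswerable_fraction * n_answerable / (1.0 - unanswerable_fraction)
    )
    if n_unanswerable > len(unanswerable):
        raise ValueError("not enough unanswerable examples for the requested fraction")

    sampled = rng.sample(unanswerable, n_unanswerable)

    # Answerable examples keep their answer as the target;
    # unanswerable ones are trained toward the refusal string.
    mix = [dict(ex, target=ex["answer"]) for ex in answerable]
    mix += [dict(ex, target=REFUSAL_TEXT) for ex in sampled]

    rng.shuffle(mix)
    return mix
```

With 78 answerable examples and a pool of at least 22 unanswerable ones, this produces a 100-example mix in which 22 examples carry the refusal target. Note that the abstract reports a 12% refusal rate at inference from a 22% training share, so the training fraction and the resulting refusal rate should not be assumed equal.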
