Adaptation of Embedding Models to Financial Filings via LLM Distillation

Eliot Brenner; Dominic Seyler; Manjunath Hegde; Andrei Simion; Koustuv Dasgupta; Bing Xiang

arXiv:2512.08088 · cs.CL · December 10, 2025

TL;DR

This paper presents a scalable, cost-effective pipeline that adapts a general-purpose retrieval embedding model to the financial domain by iteratively training it on LLM-judged relevance, improving MRR@5 by 27.7% and mean DCG@5 by 44.6% on average across 14 financial filing types.

Contribution

The paper introduces an iterative distillation method that uses LLM relevance judgments to improve financial retrieval embeddings without extensive human annotation.

Findings

01. 27.7% improvement in MRR@5
02. 44.6% improvement in mean DCG@5
03. Improved NDCG on 3 of 4 document classes in FinanceBench
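
For readers unfamiliar with the ranking metrics named in these findings, the following is a minimal sketch of MRR@k, DCG@k, and NDCG@k using their standard textbook definitions. It assumes a linear gain and a log2 rank discount; the paper's exact gain formulation and relevance scale are not stated on this page, so the function names and toy relevance lists below are illustrative only.

```python
import math
from typing import Sequence


def mrr_at_k(ranked_relevance: Sequence[Sequence[float]], k: int = 5) -> float:
    """Mean reciprocal rank of the first relevant result within the top k.

    Each inner sequence holds relevance labels in ranked order for one query;
    any label greater than 0 counts as relevant (an assumption of this sketch).
    """
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel > 0:
                total += 1.0 / rank
                break  # only the first relevant hit contributes
    return total / len(ranked_relevance)


def dcg_at_k(rels: Sequence[float], k: int = 5) -> float:
    """Discounted cumulative gain with linear gain and log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(rels: Sequence[float], k: int = 5) -> float:
    """DCG@k divided by the DCG@k of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


# Toy labels, not data from the paper.
queries = [[0, 1, 0], [1, 0, 0]]
print(mrr_at_k(queries, k=5))
print(dcg_at_k([1, 0, 1], k=5))
print(ndcg_at_k([1, 0, 1], k=5))
```

A relative improvement such as the ones reported above would then be computed as (adapted - baseline) / baseline on the chosen metric.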

Abstract

Despite advances in generative large language models (LLMs), practical application of specialized conversational AI agents remains constrained by computation costs, latency requirements, and the need for precise domain-specific relevance measures. While existing embedding models address the first two constraints, they underperform on information retrieval in specialized domains like finance. This paper introduces a scalable pipeline that trains specialized models from an unlabeled corpus using a general-purpose retrieval embedding model as a foundation. Our method yields an average of 27.7% improvement in MRR@5, 44.6% improvement in mean DCG@5 across 14 financial filing types measured over 21,800 query-document pairs, and improved NDCG on 3 of 4 document classes in FinanceBench. We adapt retrieval embeddings (bi-encoder) for RAG, not LLM generators, using LLM-judged…
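
As a rough illustration of what bi-encoder retrieval means in this setting, the sketch below ranks document chunks against a query by comparing independently computed embedding vectors. It is not the authors' implementation: the cosine-similarity scoring, the `rank_chunks` helper, and the toy two-dimensional vectors (standing in for real encoder outputs) are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top-k (chunk_id, score) pairs, highest similarity first.

    In a bi-encoder, the query and each chunk are embedded separately, so
    chunk vectors can be precomputed and only the query is encoded at search time.
    """
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical precomputed embeddings; real ones would come from the embedding model.
chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(rank_chunks([1.0, 0.0], chunks, k=2))
```

Domain adaptation of the kind described in the abstract changes the vectors the encoder produces, not this scoring step, which is why the retrieval cost and latency stay the same after fine-tuning.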

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare