FinLoRA: Benchmarking LoRA Methods for Fine-Tuning LLMs on Financial Datasets
Dannong Wang, Jaisal Patel, Daochen Zha, Steve Y. Yang, Xiao-Yang Liu

TL;DR
FinLoRA benchmarks LoRA fine-tuning methods on diverse financial datasets, demonstrating significant performance improvements and providing open-source tools to enhance financial AI applications.
Contribution
The paper introduces the FinLoRA project, creating a comprehensive benchmark for LoRA methods on financial tasks, including new datasets and extensive evaluation of multiple models and techniques.
Findings
LoRA methods improved performance by 36% on average over base models.
The benchmark covers 19 financial datasets, including four novel XBRL analysis datasets, and reports accuracy, F1, and BERTScore.
Open-source resources enable broader adoption of efficient fine-tuning in finance.
Abstract
Low-rank adaptation (LoRA) methods show great potential for scaling pre-trained general-purpose Large Language Models (LLMs) to hundreds or thousands of use scenarios. However, their efficacy in high-stakes domains like finance is rarely explored, e.g., passing CFA exams and analyzing SEC filings. In this paper, we present the open-source FinLoRA project that benchmarks LoRA methods on both general and highly professional financial tasks. First, we curated 19 datasets covering diverse financial applications; in particular, we created four novel XBRL analysis datasets based on 150 SEC filings. Second, we evaluated five LoRA methods and five base LLMs. Finally, we provide extensive experimental results in terms of accuracy, F1, and BERTScore and report computational cost in terms of time and GPU memory during fine-tuning and inference stages. We find that LoRA methods achieved substantial…
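For readers unfamiliar with the technique being benchmarked, the following is a minimal sketch of the standard LoRA update, assuming the usual formulation h = W x + (alpha / r) B A x with the pre-trained weight W frozen. It is not taken from the FinLoRA codebase. The function names, the toy dimensions, and the 4096 x 4096 layer size are illustrative assumptions; r = 8 is the rank that the reviews below say the paper used.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute h = W x + (alpha / r) * B (A x).

    W: frozen d_out x d_in weight; A: r x d_in; B: d_out x r.
    Only A and B would be trained; W stays fixed.
    """
    r = len(A)  # the adapter rank is the number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    return d_out * d_in, r * (d_in + d_out)

# Toy example with hypothetical sizes.
random.seed(0)
d_out, d_in, r = 4, 6, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # B starts at zero, so the adapter is initially a no-op
x = [1.0] * d_in

# With B = 0 the adapted layer reproduces the frozen layer exactly.
assert lora_forward(W, A, B, x, alpha=16) == matvec(W, x)

# Parameter count for one illustrative 4096 x 4096 projection at rank 8.
full, lora = trainable_params(4096, 4096, 8)
print(full, lora)  # 16777216 65536
```

In this illustrative case the adapter trains 65,536 parameters instead of 16,777,216 for the same matrix, which is the source of the memory savings that LoRA benchmarks typically measure.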
Peer Reviews
Decision: Submitted to ICLR 2026
1. This paper contributes four new datasets for XBRL analysis tasks, which supplement existing data resources. 2. The experiments examine five LoRA variants on 19 datasets and yield many observations. 3. The paper identifies four angles for analyzing LoRA methods in financial fine-tuning tasks. The conclusion that LoRA does not exhibit catastrophic forgetting after fine-tuning on these datasets is interesting.
1. As a benchmark paper, it doesn’t contain novel algorithms. However, only 4 out of 19 datasets are newly curated as a substantial contribution. Only 4 different algorithms are evaluated, which may not be enough for a strong benchmark paper. 2. Fine-tuning is mainly conducted on Llama-3.1-8B, which limits generalization. The comparison conclusions may not be reliable. Using FL only with Gemini for comparison is possibly unfair. 3. Averaging accuracy and BERTScore F1 is strange. Error bars
Comprehensive Benchmarking: The scale of the evaluation is a significant strength, encompassing 19 datasets, 5 base models, and 5 LoRA methods. This provides a rich, multi-faceted view of the landscape. Novel and Valuable Datasets: The introduction of four XBRL analysis datasets fills a genuine gap in the literature. The tasks (tag/value extraction, formula construction/calculation) are well-chosen to test sophisticated financial reasoning and will be a valuable resource for future research. P
Lack of Statistical Rigor: As noted, the most significant weakness is the presentation of results without any measure of variance or statistical significance. This is a critical omission for a benchmark paper claiming to compare methods. Insufficient Hyperparameter Investigation: The comparison of LoRA variants is not equitable because it does not optimize hyperparameters for each method. The fixed, low-rank (r=8) setup particularly disadvantages rsLoRA, which is designed for higher ranks. The
**Novel XBRL datasets address a real gap**: The four new XBRL analysis datasets (tag extraction, value extraction, formula construction, formula calculation) fill a genuine void in financial NLP benchmarking. XBRL is the de facto standard for SEC filings, yet dedicated datasets for XBRL analysis tasks are scarce. The construction methodology using filtered XBRL segments with context IDs is sound, and the datasets could enable future research in automated financial report analysis.
**Misuse of the term “benchmark.”** The paper calls itself a benchmark but only evaluates LoRA fine-tuning variants. A real benchmark must be method-agnostic, supporting any model type (base, instruction-tuned, fully fine-tuned, PEFT variants, and proprietary frontier models). This is not a benchmark but a *LoRA comparison study*. You benchmark *tasks*, not *methods*. True benchmarks like MMLU, GSM8K, HumanEval, or GPQA allow any model to be tested. Restricting evaluation to LoRA makes this unusab
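One review above argues that a fixed rank of r = 8 disadvantages rsLoRA. The sketch below illustrates the mechanism behind that concern, assuming the standard definitions: LoRA scales the adapter update by alpha / r, while rsLoRA scales it by alpha / sqrt(r). The value alpha = 16 and the ranks shown are hypothetical choices for illustration, not settings reported by the paper, and the code makes no claim about the paper's results.

```python
import math

def lora_scale(alpha, r):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / r

def rslora_scale(alpha, r):
    """Rank-stabilized LoRA (rsLoRA) scaling factor."""
    return alpha / math.sqrt(r)

alpha = 16  # hypothetical value chosen for illustration
for r in (8, 64, 256):
    ratio = rslora_scale(alpha, r) / lora_scale(alpha, r)  # equals sqrt(r)
    print(r, lora_scale(alpha, r), round(rslora_scale(alpha, r), 3), round(ratio, 2))
```

The ratio between the two factors is sqrt(r), so the methods differ by a factor of about 2.83 at r = 8 and by a factor of 16 at r = 256. Under these assumptions, a comparison run at a single low rank observes the two scalings where they are closest, which is consistent with the reviewer's request for a rank sweep.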
Taxonomy
Topics: Stock Market Forecasting Methods · Financial Distress and Bankruptcy Prediction
Methods: Balanced Selection
