Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain

Yuma Okochi; Fabio Milentiansen Sim; Tomoyasu Okada

arXiv:2603.01353 · cs.LG · March 3, 2026

TL;DR

This paper presents a method for creating large-scale synthetic instruction datasets with reasoning traces to enhance domain-specific LLMs, demonstrated in the Japanese financial domain with performance improvements over baseline models on financial benchmarks.

Contribution

A general method for generating high-quality synthetic instruction data with reasoning traces for any domain, starting from domain-specific vocabulary and validated through a case study in Japanese finance.

Findings

01 Performance improvements on financial benchmarks

02 Impact of reasoning trace length on model performance

03 Open-sourced models and datasets

Abstract

In adapting LLMs to specific domains, achieving both domain expertise and reasoning ability remains an urgent challenge. This study proposes a general method for constructing high-quality synthetic instruction data for any domain, starting from domain-specific vocabulary. As a demonstration, we applied this method to the financial domain and constructed a large-scale instruction dataset totaling approximately 9.5 billion tokens with Chain-of-Thought reasoning traces. Evaluation results confirmed performance improvements over baseline models on financial benchmarks, demonstrating the effectiveness of our approach. We also report findings on the impact of reasoning trace length on performance and its limitations. Lastly, we open-source our models and datasets at https://huggingface.co/nri-ai.
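The abstract describes building instruction data "starting from domain-specific vocabulary" and attaching Chain-of-Thought reasoning traces. The sketch below illustrates one way such a vocabulary-seeded pipeline could be organized; it is not the authors' implementation. The prompt wording, the record fields, and the names `InstructionRecord`, `build_prompts`, `synthesize`, and `stub_generate` are all assumptions introduced for illustration, and the generator is a stub standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass
class InstructionRecord:
    """One synthetic example (hypothetical schema, not the paper's)."""
    term: str         # seed vocabulary term
    instruction: str  # generated question or task
    reasoning: str    # Chain-of-Thought trace
    answer: str       # final answer


# Hypothetical prompt wording; the paper's actual prompts are not given here.
PROMPT_TEMPLATE = (
    "You are an expert in {domain}. Write one question about the term "
    "'{term}', reason step by step, then give a final answer."
)


def build_prompts(terms: Iterable[str], domain: str) -> Iterator[Tuple[str, str]]:
    """Yield (term, prompt) pairs, skipping blank and duplicate terms."""
    seen = set()
    for raw in terms:
        term = raw.strip()
        if not term or term in seen:
            continue
        seen.add(term)
        yield term, PROMPT_TEMPLATE.format(domain=domain, term=term)


def synthesize(
    terms: Iterable[str],
    domain: str,
    generate: Callable[[str], Dict[str, str]],
) -> List[InstructionRecord]:
    """Turn seed terms into records using a caller-supplied generator."""
    records = []
    for term, prompt in build_prompts(terms, domain):
        out = generate(prompt)
        records.append(
            InstructionRecord(
                term=term,
                instruction=out["instruction"],
                reasoning=out["reasoning"],
                answer=out["answer"],
            )
        )
    return records


def stub_generate(prompt: str) -> Dict[str, str]:
    """Stand-in for an LLM call; returns fixed placeholder text."""
    return {
        "instruction": "placeholder question",
        "reasoning": "placeholder step-by-step trace",
        "answer": "placeholder answer",
    }


# Example seed vocabulary (illustrative terms only).
records = synthesize(["国債", " 国債 ", "", "為替"], "Japanese finance", stub_generate)
```

Passing the generator in as a callable keeps the vocabulary handling testable without a model; in practice `stub_generate` would be replaced by a function that calls an LLM and parses its output into the three fields.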

Code & Models

Models

Datasets

nri-ai/nri-fin-reasoning (dataset, 251 downloads)

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Natural Language Processing Techniques