Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain
Yuma Okochi, Fabio Milentiansen Sim, Tomoyasu Okada

TL;DR
This paper presents a method for creating large-scale synthetic instruction datasets with reasoning traces to enhance domain-specific LLMs, demonstrated in the Japanese financial domain with performance improvements over baseline models on financial benchmarks.
Contribution
A novel approach for generating high-quality synthetic instruction data with reasoning traces for any domain, validated through a case study in finance.
Findings
Performance improvements on financial benchmarks
Impact of reasoning trace length on model performance
Open-sourced models and datasets
Abstract
In adapting LLMs to specific domains, achieving both domain expertise and reasoning ability remains an urgent challenge. This study proposes a general method for constructing high-quality synthetic instruction data for any domain, starting from domain-specific vocabulary. As a demonstration, we applied this method to the financial domain and constructed a large-scale instruction dataset totaling approximately 9.5 billion tokens with Chain-of-Thought reasoning traces. Evaluation results confirmed performance improvements over baseline models on financial benchmarks, demonstrating the effectiveness of our approach. We also report findings on the impact of reasoning trace length on performance and its limitations. Lastly, we open-source our models and datasets at https://huggingface.co/nri-ai.
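The abstract says only that the data is built starting from domain-specific vocabulary and that each example carries a Chain-of-Thought trace; it does not give the pipeline itself. The sketch below is one way such a loop could be organized, not the authors' implementation. The names `InstructionRecord`, `build_prompt`, `synthesize`, and `stub_generate`, the prompt wording, and the rule of dropping examples with an empty trace are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class InstructionRecord:
    term: str         # seed vocabulary term the example was built from
    instruction: str  # generated question or task
    reasoning: str    # Chain-of-Thought trace
    answer: str       # final answer


def build_prompt(term: str, domain: str) -> str:
    """Hypothetical prompt template; the paper's actual prompts are not shown here."""
    return (
        f"You are an expert in {domain}. Write one question about '{term}', "
        "reason through it step by step, and then state the final answer."
    )


def synthesize(
    vocabulary: Iterable[str],
    domain: str,
    generate: Callable[[str], Dict[str, str]],
) -> List[InstructionRecord]:
    """Turn each vocabulary term into one record, skipping outputs with no trace."""
    records: List[InstructionRecord] = []
    for term in vocabulary:
        out = generate(build_prompt(term, domain))
        reasoning = out.get("reasoning", "").strip()
        if not reasoning:
            # Assumed filter: an example without a reasoning trace is discarded.
            continue
        records.append(
            InstructionRecord(
                term=term,
                instruction=out.get("instruction", ""),
                reasoning=reasoning,
                answer=out.get("answer", ""),
            )
        )
    return records


def stub_generate(prompt: str) -> Dict[str, str]:
    """Stand-in for a real LLM call; it only echoes the quoted term in the prompt."""
    term = prompt.split("'")[1]
    return {
        "instruction": f"What is meant by {term}?",
        "reasoning": f"Step 1: define {term}. Step 2: relate it to practice.",
        "answer": f"A short definition of {term}.",
    }


if __name__ == "__main__":
    for record in synthesize(["yield curve", "duration"], "finance", stub_generate):
        print(record.term, "|", record.instruction)
```

In practice `stub_generate` would be replaced by a call to a generator model, and the stored `reasoning` field is where a trace-length analysis like the one the abstract mentions could be applied.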
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Natural Language Processing Techniques
