TULIP: Adapting Open-Source Large Language Models for Underrepresented Languages and Specialized Financial Tasks
İrem Demirtaş, Burak Payzun, Seçil Arslan

TL;DR
This paper introduces TULIP, a pipeline for adapting open-source large language models to financial Turkish, improving their domain-specific and language capabilities for privacy-sensitive applications.
Contribution
It presents a novel five-stage pipeline that adapts Llama 3.1 8B and Qwen 2.5 7B to financial Turkish tasks, covering both domain and language adaptation.
Findings
Enhanced model performance on financial Turkish tasks
Effective domain and language adaptation demonstrated
Pipeline enables smaller models to handle specialized tasks
Abstract
Thanks to the growing popularity of large language models over the years, there is great potential for their applications in finance. Despite the exceptional performance of larger proprietary models, which are presented as black-box solutions through APIs, smaller models that can be hosted on-premise present opportunities for adaptability and privacy. Especially in cases where the management of sensitive information and application of domain knowledge is important, like finance, enhancing the capabilities of smaller models becomes crucial, notably for underrepresented languages. In this work, we introduce TULIP models, which adapt Llama 3.1 8B and Qwen 2.5 7B for domain and language adaptation, focusing on financial Turkish use cases. The five-stage development pipeline involves data collection, continual pre-training (CPT), benchmark design, synthetic data generation and supervised…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational Physics and Python Applications
