Scope is all you need: Transforming LLMs for HPC Code
Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, and Gal Oren

TL;DR
This paper introduces Tokompiler, a domain-specific tokenizer for HPC code, enabling smaller, more efficient LLMs like SPT-Code and PolyCoder to outperform traditional models in HPC code understanding and completion tasks.
Contribution
The paper presents Tokompiler, a novel tokenizer designed for HPC code, and demonstrates its effectiveness in training smaller, domain-specific LLMs that outperform larger, general-purpose models.
Findings
Tokompiler improves code completion accuracy.
Domain-specific LLMs outperform general-purpose models.
Perplexity scores reduced to ~1 with Tokompiler.
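For readers unfamiliar with the metric, the following is a minimal sketch of the standard perplexity definition (the exponential of the mean negative log-probability a model assigns to the true next tokens). It is not the paper's evaluation code, and the function name and probability values are hypothetical, chosen only to show why a score near 1 indicates near-certain predictions.

```python
import math


def perplexity(token_probs):
    """Standard perplexity: exp of the mean negative log-probability.

    token_probs holds the probability the model assigned to each
    correct next token (hypothetical values, not results from the paper).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that is almost always sure of the next token scores close to 1.
confident = perplexity([0.99, 0.98, 0.995, 0.97])

# A model choosing uniformly among 4 options scores about 4.
uncertain = perplexity([0.25, 0.25, 0.25, 0.25])

print(f"confident model: {confident:.3f}")
print(f"uncertain model: {uncertain:.3f}")
```

Perplexity is computed per token, so scores obtained with different tokenizers or vocabularies are not necessarily directly comparable.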
Abstract
With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size (e.g., billions of parameters) and demand expensive compute resources for training. We found this design choice confusing - why do we need large LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question design choices made by existing LLMs by developing smaller LLMs for specific domains - we call them domain-specific LLMs. Specifically, we start off with HPC as a domain and propose a novel tokenizer named Tokompiler, designed specifically for preprocessing code in HPC and compilation-centric tasks.…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
