Scope is all you need: Transforming LLMs for HPC Code

Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, and Gal Oren

arXiv:2308.09440 · cs.CL · October 2, 2023 · 2 cites



TL;DR

This paper introduces Tokompiler, a domain-specific tokenizer for HPC code, enabling smaller, more efficient LLMs like SPT-Code and Polycoder to outperform traditional models in HPC code understanding and completion tasks.

Contribution

The paper presents Tokompiler, a novel tokenizer designed for HPC code, and demonstrates its effectiveness in training smaller, domain-specific LLMs that outperform larger, general-purpose models.

Findings

01. Tokompiler improves code completion accuracy.

02. Domain-specific LLMs outperform general-purpose models.

03. Perplexity scores reduced to ~1 with Tokompiler.
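
To make the perplexity finding concrete, here is a minimal sketch assuming the standard definition of perplexity (the exponential of the mean negative log-probability per token). The function name and the probability values are hypothetical illustrations, not the paper's evaluation code or data. A perplexity near 1 means the model assigns probability close to 1 to each correct next token.

```python
import math


def perplexity(token_probs):
    """Return exp(mean negative log-probability) over the given tokens.

    token_probs: probabilities a model assigned to each correct token.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities, NOT taken from the paper.
confident = [0.99, 0.98, 0.99, 0.97]  # model is almost always right
uncertain = [0.25, 0.25, 0.25, 0.25]  # model guesses among 4 equal options

print(perplexity(confident))  # close to 1
print(perplexity(uncertain))  # about 4, the effective number of choices
```

Under this definition, perplexity can be read as the effective number of equally likely candidates the model is choosing between at each step, which is why a score of ~1 indicates near-certain predictions.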

Abstract

With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size (e.g., billions of parameters) and demand expensive compute resources for training. We found this design choice confusing - why do we need large LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question design choices made by existing LLMs by developing smaller LLMs for specific domains - we call them domain-specific LLMs. Specifically, we start off with HPC as a domain and propose a novel tokenizer named Tokompiler, designed specifically for preprocessing code in HPC and compilation-centric tasks.…


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research