FLAME: A small language model for spreadsheet formulas
Harshit Joshi, Abishai Ebenezer, José Cambronero, Sumit Gulwani, Aditya Kanade, Vu Le, Ivan Radiček, Gust Verbruggen

TL;DR
FLAME is a compact transformer model trained solely on Excel formulas, achieving high performance in formula repair, completion, and retrieval tasks despite its small size and limited training data.
Contribution
Introduces FLAME, a small, domain-specific language model for spreadsheets that outperforms larger models in formula-related tasks by leveraging domain insights and specialized training.
Findings
FLAME outperforms larger models in 10 of 14 tasks.
FLAME surpasses models like Codex and CodeT5 in formula repair and completion.
FLAME exceeds other models in formula retrieval accuracy.
Abstract
Spreadsheets are a vital tool for end-user data management. Using large language models for formula authoring assistance in these environments can be difficult, as these models are expensive to train and challenging to deploy due to their size (up to billions of parameters). We present FLAME, a transformer-based model trained exclusively on Excel formulas that leverages domain insights to achieve competitive performance while being substantially smaller (60M parameters) and training on two orders of magnitude less data. We curate a training dataset using sketch deduplication, introduce an Excel-specific formula tokenizer, and use domain-specific versions of masked span prediction and noisy auto-encoding as pre-training objectives. We evaluate FLAME on formula repair, formula completion, and similarity-based formula retrieval. FLAME can outperform much larger models, such as the Davinci…
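The abstract mentions curating training data with sketch deduplication. As a rough illustration of the idea only, the sketch below assumes that a formula's "sketch" is obtained by replacing string literals, cell references, and numeric constants with placeholders, and that only the first formula per sketch is kept. The regular expression, the placeholder names (`<STR>`, `<REF>`, `<NUM>`), and the function names are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical token patterns (an assumption, not the paper's tokenizer).
# Strings are tried first so that digits inside quotes are left alone.
_TOKEN = re.compile(
    r'''
    (?P<str>"(?:[^"]|"")*")                  # string literal; a doubled quote is an escape
  | (?P<ref>(?<![A-Za-z0-9_.])               # cell reference such as A1, $B$2 or A1:B10
        \$?[A-Za-z]{1,3}\$?\d+
        (?::\$?[A-Za-z]{1,3}\$?\d+)?
        (?![A-Za-z0-9_(]))                   # not a function name such as LOG10(
  | (?P<num>(?<![A-Za-z0-9_.])\d+(?:\.\d+)?) # numeric constant
    ''',
    re.VERBOSE,
)


def sketch(formula: str) -> str:
    """Replace constants and references with placeholders."""
    def repl(match: re.Match) -> str:
        if match.group("str") is not None:
            return "<STR>"
        if match.group("ref") is not None:
            return "<REF>"
        return "<NUM>"
    return _TOKEN.sub(repl, formula)


def dedupe_by_sketch(formulas):
    """Keep the first formula seen for each distinct sketch."""
    seen = set()
    kept = []
    for formula in formulas:
        key = sketch(formula)
        if key not in seen:
            seen.add(key)
            kept.append(formula)
    return kept


corpus = ['=SUM(A1:A10)+5', '=SUM(B2:B20)+7', '=LOG10(C3)']
unique = dedupe_by_sketch(corpus)
```

Here the first two formulas differ only in their range and constant, so they share a sketch and only the first survives. A regex is a simplification; a production pipeline would more likely work from a real formula parser.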
Taxonomy
Topics: Spreadsheets and End-User Computing · Data Quality and Management · Information Retrieval and Search Behavior
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · CodeBERT · Linear Layer · Byte Pair Encoding · Inverse Square Root Schedule · SentencePiece · Adafactor · Residual Connection
