Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

Atharva Naik; Darsh Agrawal; Hong Sng; Clayton Marr; Kexun Zhang; Nathaniel R Robinson; Kalvin Chang; Rebecca Byrnes; Aravind Mysore; Carolyn Rose; David R Mortensen

arXiv:2501.16524·cs.CL·January 29, 2025

Open Access

TL;DR

This paper introduces a novel approach to automate sound law induction in historical linguistics using large language models trained with synthetic data, achieving state-of-the-art results.

Contribution

It formulates sound law induction as programming by examples with LLMs, proposing synthetic data generation methods and creating a new SOTA open-source model.

Findings

01

+6% pass rate over previous models

02

Synthetic data improves LLM performance in SLI

03

Open-source model facilitates future research

Abstract

Historical linguists have long written "programs" that convert reconstructed words in an ancestor language into their attested descendants via ordered string rewrite functions (called sound laws). However, writing these programs is time-consuming, motivating the development of automated Sound Law Induction (SLI), which we formulate in this paper as Programming by Examples (PBE) with Large Language Models (LLMs). While LLMs have been effective for code generation, recent work has shown that PBE is challenging but improvable by fine-tuning, especially with training data drawn from the same distribution as evaluation data. We create a conceptual framework of what constitutes a "similar distribution" for SLI and propose four kinds of synthetic data generation methods with varying amounts of inductive bias to investigate what leads to the best performance. Based on the results…
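To make the idea of sound laws as ordered string rewrite functions concrete, here is a minimal Python sketch. It is not the authors' implementation: the rules, the word pairs, and the helper names `apply_laws` and `pass_rate` are all hypothetical, and the scoring function is only one plausible reading of "pass rate" (the share of ancestor forms that the program maps exactly onto the attested descendant).

```python
import re

# Hypothetical toy sound laws, written as (regex pattern, replacement).
# They are applied in list order, so earlier rules feed later ones.
TOY_LAWS = [
    (r"p(?=[aeiou])", "f"),  # p > f before a vowel
    (r"a$", ""),             # word-final a is lost
]

# Hypothetical (ancestor form, attested descendant) example pairs.
TOY_PAIRS = [("pata", "fat"), ("tapa", "taf"), ("apta", "apt")]


def apply_laws(word, laws):
    """Apply each rewrite rule to the word, in order."""
    for pattern, replacement in laws:
        word = re.sub(pattern, replacement, word)
    return word


def pass_rate(pairs, laws):
    """Fraction of pairs whose derived form equals the attested form."""
    hits = sum(apply_laws(source, laws) == target for source, target in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    print(pass_rate(TOY_PAIRS, TOY_LAWS))                  # rules in the order given
    print(pass_rate(TOY_PAIRS, list(reversed(TOY_LAWS))))  # same rules, reversed order
```

In this toy setup the ordering matters: with the rules reversed, the final vowel of "tapa" is deleted before the p > f rule can see it, so the derived form no longer matches. A PBE system for SLI has to recover both the rules and their order from the input and output pairs alone.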

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling