Inductive Linguistic Reasoning with Large Language Models
Raghav Ramji, Keshav Ramji

TL;DR
This paper explores how large language models can improve multilingual linguistic reasoning by using analogical prompting to generate effective in-context demonstrations, significantly enhancing their performance on linguistic puzzles and Olympiad tasks.
Contribution
It introduces a novel two-stage analogical prompting method that automatically induces auxiliary demonstrations, improving LLM reasoning on low-resource languages and linguistic tasks.
Findings
Analogical prompting boosts GPT-4o performance by up to 8.1%.
The method generalizes well across different linguistic tasks and difficulty levels.
Analogical demonstrations, whether self-generated or from weaker models, enhance reasoning capabilities.
Abstract
Evaluating large language models (LLMs) on their linguistic reasoning capabilities is an important task to understand the gaps in their skills that may surface during large-scale adoption. In this work, we investigate the abilities of such models to perform abstract multilingual reasoning through the lens of linguistic puzzles on extremely low-resource languages. As these translation tasks involve inductive and deductive reasoning from reference instances, we examine whether diverse auxiliary demonstrations can be automatically induced from seed exemplars, through analogical prompting. We employ a two-stage procedure, first generating analogical exemplars with a language model, and then applying them in-context along with provided target language exemplars. Our results on the modeLing dataset show that analogical prompting is effective in eliciting models' knowledge of language grammar…
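The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names (`build_stage1_prompt`, `build_stage2_prompt`, `analogical_translate`), and the `n_aux` parameter are all assumptions, and the language model is represented by a plain callable so that no particular API is implied.

```python
from typing import Callable, List, Tuple

# A seed exemplar is a (source sentence, translation) pair from the puzzle.
Exemplar = Tuple[str, str]
# Hypothetical stand-in for any LLM call: prompt text in, completion text out.
LLM = Callable[[str], str]


def format_exemplars(pairs: List[Exemplar]) -> str:
    """Render exemplar pairs one per line as 'source -> translation'."""
    return "\n".join(f"{src} -> {tgt}" for src, tgt in pairs)


def build_stage1_prompt(seed: List[Exemplar], n_aux: int = 3) -> str:
    """Stage 1: ask a model to induce auxiliary (analogical) demonstrations.

    The wording here is an assumption, not the prompt used in the paper.
    """
    return (
        "Here are translation pairs from a linguistic puzzle:\n"
        f"{format_exemplars(seed)}\n\n"
        f"Write {n_aux} analogous translation puzzles, with solutions, "
        "that illustrate similar grammatical patterns."
    )


def build_stage2_prompt(seed: List[Exemplar], analogical: str, query: str) -> str:
    """Stage 2: place the generated exemplars in-context with the seed exemplars."""
    return (
        "Analogical demonstrations:\n"
        f"{analogical}\n\n"
        "Target-language exemplars:\n"
        f"{format_exemplars(seed)}\n\n"
        f"Translate: {query}"
    )


def analogical_translate(
    seed: List[Exemplar], query: str, generator: LLM, solver: LLM
) -> str:
    """Run both stages; generator and solver may be the same or different models."""
    analogical = generator(build_stage1_prompt(seed))
    return solver(build_stage2_prompt(seed, analogical, query))
```

Keeping `generator` and `solver` as separate arguments mirrors the setting noted in the findings, where the demonstrations may be self-generated or come from a different (possibly weaker) model.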
Peer Reviews
Decision: Submitted to ICLR 2025
**Originality** The paper introduces an innovative approach to evaluating linguistic reasoning in LLMs through analogical prompting. It applies this method to extremely low-resource languages and further evaluates generating exemplars through a different LLM, increasing overall performance.
**Quality** The paper presents experimentation across multiple models and prompting strategies.
**Clarity** The paper is well-structured, with clear explanations of each experimental setup, metric, and f
1. Section 4 mentions that each response was manually evaluated to provide exact match scores, but this evaluation process lacks details. Specifically, there’s no mention of how many responses were reviewed, how many LLMs were involved, the number of evaluators, or their inter-annotator agreement. Without this, it’s challenging to assess the reliability of the manual evaluation.
2. Section 5.2 mentions other linguistic reasoning datasets, yet these were not utilized in the experiments. Incorpor
The two-stage analogical prompt is interesting and suggests that models might leverage information about related but better-represented languages to solve the given linguistic problems in the test set. There is also an interesting difference between larger models like Llama 405B or GPT-4o and smaller models: the analogical exemplars work for the larger models but not the smaller ones, pointing to an ability of the larger models to adapt the analogical examples to the given linguistic problem
The paper's main weakness is the disconnect between the empirical investigations, which seem sound enough, and the conclusion drawn from them: "In summary, our results suggest that the ability of the model to deduce from inductively learned rules is the key performance driver." In other parts of the paper, the rules referred to here would seem to be grammar rules. There is little in the results to suggest that any grammar rules have really been learned or what the form o
1. There is limited work on solving Linguistics Olympiad problems. This paper's methodology is valuable as a benchmark for future studies.
2. The study presents comprehensive experiments across various models and prompting techniques, with a clear presentation of results.
1. The paper's contribution is primarily empirical, with limited conceptual innovation. The approach of using analogical prompting to boost performance is not very inspiring, as it mainly involves augmenting prompts with self-generated information [1].
2. The authors tested their method only on machine translation tasks, overlooking other question formats in IOL, such as multiple-choice and cloze questions. A more suitable benchmark than modeLing would be [2] or [3].
3. It is widely known that
Taxonomy
Topics: Topic Modeling
