Quantifying Generalization Complexity for Large Language Models

Zhenting Qi; Hongyin Luo; Xuliang Huang; Zhuokai Zhao; Yibo Jiang; Xiangjun Fan; Himabindu Lakkaraju; James Glass

arXiv:2410.01769·cs.CL·October 4, 2024

TL;DR

This paper introduces Scylla, a framework that quantitatively evaluates the generalization abilities of large language models by comparing their performance on in-distribution and out-of-distribution data across tasks of increasing complexity, revealing a critical complexity at which reliance on memorization peaks.

Contribution

The paper presents Scylla, a novel dynamic evaluation framework that disentangles generalization from memorization and introduces the concept of critical complexity to assess LLMs' capabilities.

Findings

01

Identifies a non-monotonic relationship between task complexity and the performance gap between ID and OOD data.

02

Defines the concept of critical complexity where reliance on memorization peaks.

03

Shows larger models can handle more complex tasks before over-relying on memorization.
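The findings above can be illustrated with a minimal sketch. It assumes the performance gap is simply ID accuracy minus OOD accuracy at each complexity level, and that critical complexity is the level where that gap is largest. The function names, the integer level labels 1 to 5, and the accuracy numbers are hypothetical placeholders, not values or code from the paper or its repository.

```python
from typing import Dict, Tuple


def generalization_gap(
    id_acc: Dict[int, float], ood_acc: Dict[int, float]
) -> Dict[int, float]:
    """Return the ID-minus-OOD accuracy gap for each complexity level."""
    if id_acc.keys() != ood_acc.keys():
        raise ValueError("ID and OOD results must cover the same complexity levels")
    # Sort levels so the resulting dict iterates from lowest to highest complexity.
    return {level: id_acc[level] - ood_acc[level] for level in sorted(id_acc)}


def critical_complexity(gap: Dict[int, float]) -> Tuple[int, float]:
    """Return the level with the largest gap and the gap at that level.

    On ties, the first level in the dict's iteration order wins.
    """
    level = max(gap, key=gap.get)
    return level, gap[level]


# Hypothetical accuracies for one model over five complexity levels
# (illustrative only; not results reported in the paper).
example_id = {1: 0.95, 2: 0.90, 3: 0.80, 4: 0.50, 5: 0.20}
example_ood = {1: 0.93, 2: 0.80, 3: 0.55, 4: 0.40, 5: 0.18}

example_gap = generalization_gap(example_id, example_ood)
peak_level, peak_gap = critical_complexity(example_gap)
print(f"critical complexity: level {peak_level}, gap {peak_gap:.2f}")
```

With these made-up numbers the gap rises and then falls, peaking at level 3. The third finding would correspond to that peak moving to a higher level when the same computation is run on a larger model.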

Abstract

While large language models (LLMs) have shown exceptional capabilities in understanding complex queries and performing sophisticated tasks, their generalization abilities are often deeply entangled with memorization, necessitating more precise evaluation. To address this challenge, we introduce Scylla, a dynamic evaluation framework that quantitatively measures the generalization abilities of LLMs. Scylla disentangles generalization from memorization via assessing model performance on both in-distribution (ID) and out-of-distribution (OOD) data through 20 tasks across 5 levels of complexity. Through extensive experiments, we uncover a non-monotonic relationship between task complexity and the performance gap between ID and OOD data, which we term the generalization valley. Specifically, this phenomenon reveals a critical threshold - referred to as critical complexity - where reliance on…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Novel evaluation framework that considers complexity scaling, generalization, and memorization.
2. Comprehensive evaluation across many modern LLMs.
3. Clear methodology for testing generalization capabilities.
4. Interesting findings about model scaling and complexity relationships.

Weaknesses

1. ID/OOD Definition Issues (Major Issue): The paper's method of determining ID data by asking the model to generate examples could be problematic. By using model-generated examples to determine what constitutes ID data and simply using complementary numbers for OOD, the paper creates a potentially circular and oversimplified definition of distribution shifts. This approach lacks validation against actual training distributions and may not capture meaningful distribution shifts.
2. Other Issu

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- New Evaluation Framework (SCYLLA). They present a new evaluation framework that can be used to evaluate the generalization performance of LLMs.
- They uncover and measure the Generalization Valley Phenomenon. According to the experiments conducted on several models, there seems to be a consistent gap between their ID and OOD evaluations. This gap has a peak (critical complexity) that shifts to the right as model size increases in open-source models.
- They conduct a large set

Weaknesses

- Dependence on approximate ID data. Because the pre-training data cannot be accessed, it is hard to estimate in-distribution data. The approach presented is a workaround that focuses only on number generation. They do not present the estimated distribution for the models or how it differs from their out-of-distribution data.
- As far as I understand, they focus on tasks that require mathematical operations. They do not show a possible generalization to other kinds of tasks. Adding other

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The idea that the evaluation dataset should be generated dynamically is interesting and can be useful for improving the accuracy of the evaluation.
- The performance gap between ID and OOD data is also an interesting and novel indicator.

Weaknesses

- The paragraph at Line 300 is not very clear. "From these generated responses, we extract test examples and designate them as ID test data". How could you *extract* test examples? I thought all the generated examples form a candidate pool for test data. What are "the individual components within these examples"? Given that you generate no fewer than 10k examples, why are only 256 selected in the end?
- Can "ID/OOD data" selected by one model, i.e., mistral-7b, be regarded as ID/OOD data for anot

Code & Models

Repositories

zhentingqi/scylla
pytorch · Official

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Cosine Annealing · Byte Pair Encoding · LLaMA · Softmax · Dropout · Attention Dropout