Test-Time Training on Nearest Neighbors for Large Language Models

Moritz Hardt; Yu Sun

arXiv:2305.18466 · cs.CL · February 6, 2024 · 1 citation

TL;DR

This paper introduces a test-time training method for large language models: for each test input, it retrieves nearest neighbors from a large index and fine-tunes the model on them, improving performance without lengthening the input context.

Contribution

It presents a novel test-time training approach using nearest neighbors, demonstrating substantial performance gains across multiple language modeling tasks.

Findings

1. Retrieving and training on as few as 20 neighbors, each for a single gradient iteration, improves performance across more than 20 language modeling tasks in the Pile.
2. Test-time training narrows the gap between small and large language models.
3. A large, high-quality index is crucial for effectiveness.

Abstract

Many recent efforts augment language models with retrieval, by adding retrieved data to the input context. For this approach to succeed, the retrieved data must be added at both training and test time. Moreover, as input length grows linearly with the size of retrieved data, cost in computation and memory grows quadratically for modern Transformers. To avoid these complications, we simply fine-tune the model on retrieved data at test time, using its standard training setup. We build a large-scale distributed index based on text embeddings of the Pile dataset. For each test input, our system retrieves its neighbors and fine-tunes the model on their text. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than 20 language modeling tasks in the Pile. For example, test-time training with nearest…
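The retrieve-then-fine-tune loop described in the abstract can be sketched as follows. This is only a toy illustration under strong assumptions, not the authors' implementation: character-count vectors stand in for the text embeddings, a unigram softmax model stands in for the language model, and all names (`embed`, `retrieve`, `gradient_step`, `ttt_nn_eval`) are hypothetical. Training nearest-first and resetting to the base parameters after each test input are also assumptions of this sketch.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def embed(text):
    """Toy embedding (assumption): L2-normalized character counts."""
    counts = Counter(ch for ch in text if ch in ALPHABET)
    vec = [counts.get(ch, 0) for ch in ALPHABET]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query, index, k):
    """Return the k index texts most cosine-similar to the query, nearest first."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(doc))), doc) for doc in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def probs(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def bits_per_char(logits, text):
    """Average negative log2-likelihood of the text under the unigram model."""
    p = probs(logits)
    chars = [ch for ch in text if ch in ALPHABET]
    return -sum(math.log2(p[ALPHABET.index(ch)]) for ch in chars) / len(chars)

def gradient_step(logits, text, lr):
    """One gradient step on the mean cross-entropy of a unigram softmax model."""
    p = probs(logits)
    chars = [ch for ch in text if ch in ALPHABET]
    freq = Counter(chars)
    n = len(chars)
    # d(loss)/d(logit_i) = p_i - empirical frequency of character i
    return [l - lr * (p[i] - freq.get(ch, 0) / n)
            for i, (l, ch) in enumerate(zip(logits, ALPHABET))]

def ttt_nn_eval(base_logits, query, index, k=2, lr=1.0):
    """Fine-tune a copy of the model on k neighbors, then score the query."""
    logits = list(base_logits)  # work on a copy; the base model stays untouched
    for neighbor in retrieve(query, index, k):
        logits = gradient_step(logits, neighbor, lr)  # one step per neighbor
    return bits_per_char(logits, query)

# Hypothetical toy data: two "domains" in the index, one query.
INDEX = ["aaaa bbbb aaaa", "zzzz yyyy zzzz", "abab abab baba", "xyzx yzxy zzyy"]
QUERY = "abba abba"
BASE_LOGITS = [0.0] * len(ALPHABET)  # untrained model: uniform distribution

print("base bits/char:", bits_per_char(BASE_LOGITS, QUERY))
print("TTT-NN bits/char:", ttt_nn_eval(BASE_LOGITS, QUERY, INDEX, k=2))
```

Because the update is applied to a copy of the parameters, each test input starts from the same base model, and the input itself is never lengthened by the retrieved text.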

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

* The method is not too complicated, and could likely be reproduced.
* In some ways, the evaluation was very impressive. Quite large scale, showing benefits with an index that spans the whole Pile across many domains.
* The baselines of kNN and in-context prompting also seemed relevant/strong.

Weaknesses

There were some weaknesses. I think this paper still could have value, but I would be more confident in recommending that the paper be accepted if the following could be addressed:

1. There are some clarity issues with the paper. For instance, it was not very clear to me if retrieval is done after every token or at some other cadence.
2. There is a discussion of inference speed, but it is not very concrete. Could inference throughput be added to Table 1?
3. While bits/byte based LLM evaluatio

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. The paper is well-written and easy to follow. The idea is clearly described and empirically validated.
2. The empirical results on various Pile benchmarks show the usefulness of the approach.

Weaknesses

1. While the idea is neat and simple to implement, as shown in Figure 9, the training cost for each neighbor is high, which limits its usefulness in real-time applications.
2. While the results on LM perplexity are useful, it would be interesting to see how this compares on an end-to-end task such as code generation. Few-shot prompt tuning (with or without retrieval-augmented learning) is a popular paradigm used with bigger LLMs. It would be interesting to see the compar

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. The organization of this paper is well-structured, making it easy to read and comprehend.
2. This paper presents a simple test-time training approach on nearest neighbors (TTT-NN), which significantly improves performance across more than twenty language modeling tasks in the Pile benchmark with minimal fine-tuning.
3. Test-time training effectively increases the capacity of a model, showcasing its potential to narrow the performance gap between smaller and larger models, and offering a valua

Weaknesses

1. Why not use the PQ (Product Quantization) Index, which can significantly reduce storage overhead and thus avoid the cost of distributed retrieval? Although the vectors after PQ are approximations of the original vectors, recent works such as "[KNN-MT](https://openreview.net/forum?id=7wCBOfJ8hJM)" have demonstrated better performance using this approach.
2. Retrieval plus k*seq_len gradient updates may reduce inference speed. How much of a difference is there between the inference speed of

Code & Models

Repositories

socialfoundations/tttlm (Official)

Videos

Learning at test time in LLMs [Jonas Hübotter]· youtube

Test-Time Training on Nearest Neighbors for Large Language Models· slideslive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Cosine Annealing · Weight Decay · Linear Layer · Byte Pair Encoding · Discriminative Fine-Tuning · Multi-Head Attention · Linear Warmup With Cosine Annealing · Adam