Evaluating Large Language Models for Generalization and Robustness via Data Compression
Yucheng Li, Yunhao Guo, Frank Guerin, Chenghua Lin

TL;DR
This paper introduces an evaluation method for large language models based on lossless data compression. It assesses generalization and robustness across diverse data sources and time periods, addressing limitations of existing benchmarks.
Contribution
It proposes a compression-based evaluation framework that measures how well models generalize to data from after their training cutoff and how robust they remain over time, using test data spanning 83 months (2017 to 2023) and experiments on 14 large language models.
Findings
Models' compression performance degrades on data from after their training cutoff, indicating limited generalization.
Mistral and Llama-2 show a good balance of performance and robustness.
Models perform better on arXiv papers than on news and code data.
Abstract
Existing methods for evaluating large language models face challenges such as data contamination, sensitivity to prompts, and the high cost of benchmark creation. To address these, we propose a lossless data compression-based evaluation approach that tests how models' predictive abilities generalize after their training cutoff. Specifically, we collect comprehensive test data spanning 83 months from 2017 to 2023 and split the data into training and testing periods according to models' training data cutoff. We measure: 1) the compression performance on the testing period as a measure of generalization on unseen data; and 2) the performance gap between the training and testing period as a measure of robustness. Our experiments test 14 representative large language models with various sizes on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data. We find that…
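The two measurements described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that compression rate means compressed size divided by raw size (lower is better), and that the compressed size is the ideal code length an arithmetic coder would reach from the model's next-token probabilities, i.e. the sum of -log2 p over tokens. The names `Sample`, `compression_rate`, and `evaluate` are hypothetical, as is the choice to pool bits and bytes across each period before dividing.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Sample:
    """One test document (hypothetical container, not from the paper)."""
    month: date          # publication month of the document
    token_probs: list    # model probability assigned to each actual token
    raw_bytes: int       # size of the uncompressed text in bytes


def compression_rate(samples):
    """Compressed bits / raw bits over a set of samples (lower is better).

    Assumption: compressed size is the ideal arithmetic-coding length,
    sum(-log2 p), pooled over all samples in the set.
    """
    if not samples:
        raise ValueError("no samples in this period")
    bits = sum(-math.log2(p) for s in samples for p in s.token_probs)
    raw_bits = 8 * sum(s.raw_bytes for s in samples)
    return bits / raw_bits


def evaluate(samples, cutoff):
    """Split samples at the model's training cutoff and score both periods.

    Returns (generalization, robustness_gap):
      generalization  = compression rate on the testing period (after cutoff)
      robustness_gap  = testing-period rate minus training-period rate
    """
    train = [s for s in samples if s.month <= cutoff]
    test = [s for s in samples if s.month > cutoff]
    train_rate = compression_rate(train)
    test_rate = compression_rate(test)
    return test_rate, test_rate - train_rate


if __name__ == "__main__":
    # Toy numbers only; they are not results from the paper.
    data = [
        Sample(date(2022, 1, 1), [0.5] * 8, 4),   # before cutoff: 8 bits / 32 bits
        Sample(date(2023, 6, 1), [0.25] * 8, 4),  # after cutoff: 16 bits / 32 bits
    ]
    generalization, gap = evaluate(data, cutoff=date(2022, 12, 31))
    print(generalization, gap)
```

Under these assumptions, a smaller testing-period rate would indicate better generalization, and a gap closer to zero would indicate a more robust model.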
Taxonomy
Topics: Natural Language Processing Techniques
