Chronologically Consistent Large Language Models
Songrun He, Linying Lv, Asaf Manela, Jimmy Wu

TL;DR
This paper introduces ChronoBERT and ChronoGPT, large language models trained only on data available at each point in time, reducing lookahead bias and improving the credibility of social science and finance applications.
Contribution
The authors develop a novel training framework for chronologically consistent large language models that effectively mitigate lookahead bias in social science and finance tasks.
Findings
ChronoBERT and ChronoGPT match or outperform widely used models such as BERT on NLP benchmarks.
In finance applications, the models' real-time outputs achieve Sharpe ratios comparable to those of larger models.
The framework supports more credible backtests and predictions in the social sciences.
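For readers unfamiliar with the Sharpe ratio mentioned in the findings, the following is a minimal sketch of the textbook definition: mean excess return divided by its standard deviation, scaled to an annual figure. The function name `annualized_sharpe`, the monthly default for `periods_per_year`, and the sample numbers are illustrative assumptions; this is not the paper's evaluation code.

```python
import math
import statistics


def annualized_sharpe(excess_returns, periods_per_year=12):
    """Annualized Sharpe ratio of a series of per-period excess returns.

    Assumes returns are already net of the risk-free rate and that
    periods are equally spaced (monthly by default).
    """
    mean = statistics.fmean(excess_returns)
    stdev = statistics.stdev(excess_returns)  # sample standard deviation
    if stdev == 0:
        raise ValueError("Sharpe ratio is undefined for zero volatility")
    return mean / stdev * math.sqrt(periods_per_year)


# Hypothetical monthly excess returns of a text-based trading strategy.
monthly_excess = [0.01, 0.03]
sharpe = annualized_sharpe(monthly_excess)
```

A backtest built on a chronologically consistent model would compute such a ratio only from signals generated with information available at each trading date.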
Abstract
Large language models are increasingly used in social sciences, but their training data can introduce lookahead bias and training leakage. A good chronologically consistent language model requires efficient use of training data to maintain accuracy despite time-restricted data. Here, we overcome this challenge by training a suite of chronologically consistent large language models, ChronoBERT and ChronoGPT, which incorporate only the text data that would have been available at each point in time. Despite this strict temporal constraint, our models achieve strong performance on natural language processing benchmarks, outperforming or matching widely used models (e.g., BERT), and remain competitive with larger open-weight models. Lookahead bias is model- and application-specific because even if a chronologically consistent language model has poorer language comprehension, a regression or…
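The core constraint described in the abstract, using only text that would have been available at each point in time, can be sketched as a date filter over a timestamped corpus. This is a minimal illustration under stated assumptions, not the authors' data pipeline: the `Document` type, the `corpus_as_of` and `build_vintages` helpers, and the sample documents and cutoff dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    text: str
    published: date  # date the text became publicly available (assumed known)


def corpus_as_of(docs, cutoff):
    """Return only the documents published on or before `cutoff`."""
    return [d for d in docs if d.published <= cutoff]


def build_vintages(docs, cutoffs):
    """Map each cutoff date to the corpus available at that date.

    Each vintage would feed the training run of one model checkpoint,
    so no checkpoint ever sees text from after its own cutoff.
    """
    return {c: corpus_as_of(docs, c) for c in sorted(cutoffs)}


# Hypothetical toy corpus and cutoffs.
docs = [
    Document("Central bank holds rates steady.", date(1999, 6, 30)),
    Document("Technology shares slide.", date(2000, 4, 14)),
    Document("Markets reopen after closure.", date(2001, 9, 17)),
]
cutoffs = [date(1999, 12, 31), date(2000, 12, 31)]
vintages = build_vintages(docs, cutoffs)
```

Each later vintage is a superset of the earlier ones, so a model for a given cutoff can in principle continue training from the previous vintage's model. Whether that is done, and how timestamps are verified, are design choices this sketch leaves open.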
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Lookahead
