LokiLM: Technical Report
Justin Kiefel, Shrey Shah

TL;DR
LokiLM is a 1.4B parameter language model trained on 500B tokens, achieving strong reasoning performance and state-of-the-art results among small models, but it suffers from hallucinations and truthfulness issues.
Contribution
The paper introduces LokiLM, a 1.4B parameter language model trained with multi-teacher knowledge distillation and high-quality data, achieving benchmark results competitive with larger models trained on significantly more tokens.
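As a rough illustration of what multi-teacher knowledge distillation can look like, the sketch below mixes the softened output distributions of several teachers and penalizes the student's KL divergence from that mixture. This is a generic textbook-style formulation, not the method used for LokiLM: the summary does not say how teachers are combined, and the function names, uniform default weights, and temperature value are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def multi_teacher_kd_loss(student_logits, teacher_logits_list,
                          weights=None, temperature=2.0):
    """KL(teacher mixture || student) for one token position.

    Assumption: teachers are combined by a weighted average of their
    softened probabilities (uniform weights unless given).
    """
    n_teachers = len(teacher_logits_list)
    if weights is None:
        weights = [1.0 / n_teachers] * n_teachers
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    # Weighted average of teacher distributions is the soft target.
    target = [
        sum(w * probs[i] for w, probs in zip(weights, teacher_probs))
        for i in range(len(student_logits))
    ]
    student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(target, student) if t > 0)
    # The temperature**2 factor is the conventional gradient rescaling.
    return kl * temperature ** 2
```

With a single teacher whose logits match the student's, the loss is zero; it grows as the student's distribution drifts from the teacher mixture.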
Findings
Strong performance in natural language reasoning tasks
Achieves state-of-the-art results among models with 1.5B parameters or less
Exhibits significant hallucinations and poor truthfulness scores
Abstract
In this work, we introduce LokiLM, a 1.4B parameter large language model trained on 500B tokens. Our model performs strongly in natural language reasoning tasks and achieves state-of-the-art performance among models with 1.5B parameters or less. LokiLM is trained using multi-teacher knowledge distillation and high-quality training data to achieve benchmark results competitive with larger models trained on significantly more tokens. We support these findings by introducing steps to avoid benchmark contamination and overfitting throughout our development process. Despite its promising performance, LokiLM exhibits a concerning amount of hallucinations and scores poorly on the TruthfulQA benchmark, so we do not release the model publicly.
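The abstract does not describe the contamination-avoidance steps themselves. One common approach, shown here purely as a hypothetical sketch, is to drop any training document that shares a long word n-gram with a benchmark item. The function names, whitespace tokenization, and the 8-gram threshold are assumptions for illustration and are not taken from the paper.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in the text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(document, benchmark_items, n=8):
    """True if the document shares at least one n-gram with any benchmark item."""
    doc_ngrams = ngrams(document, n)
    return any(doc_ngrams & ngrams(item, n) for item in benchmark_items)

def filter_training_docs(documents, benchmark_items, n=8):
    """Keep only documents with no n-gram overlap against the benchmarks."""
    return [d for d in documents if not is_contaminated(d, benchmark_items, n)]
```

A shorter n catches paraphrased leaks at the cost of more false positives; a longer n only catches near-verbatim copies.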
Peer Reviews
Decision·Submitted to ICLR 2025
- Presents a recipe which has promising performance given the model size and number of steps over several standard knowledge & reasoning-based benchmarks.
- Unless I missed it the description of what the training data consists of is actually very vague: "The training data for LokiLM primarily consists of web-scraped content, supplemented with a small portion of machine-generated text in the early stages of training". Given that the whole paper is about how the training data construction changes performance this seems like a major weakness in terms of learnings from the paper. For example, I also wasn't sure how the distillation / machine-generate
The paper is well motivated for improving the performance of small language models. The approach covers major techniques in pretraining regarding architecture, data and evaluation. The writing is mostly clear.
My biggest concern is that the paper lacks scientific insights. The report is mostly a description of how the training was done, and what the evaluation results look like, without controlled experiments to understand the impacts of different factors. Specifically, the training is mostly a kitchen-sink approach combining different ingredients, which is totally valid. But I'd expect the report to present some ablation experiments to understand the key design choices in architectures and data filtering,
This paper details the creation of their strong, smaller LM, and its performance on popular benchmarks. The paper makes it clear that they were very careful in their model architecture design, data curation and filtering, and efforts to avoid benchmark contamination, which makes their model's performance quite impressive, especially for how small it is. While it falls short in some areas (namely TruthfulQA and some qualitative analysis of social biases), they make a good effort to try to explain
While the model's performance is quite impressive, and I appreciate the authors' reasoning on why it falls short in some areas, I think not releasing the model (and potentially not releasing the pretraining data/code?) is a very strong weakness. While the model's TruthfulQA score is lower than one would hope, I do not believe this is a compelling reason for not releasing the model. Instead, if the model were to be released (especially in conjunction with the pretraining data and/or data filterin
Taxonomy
Topics: Plasma Diagnostics and Applications · Parallel Computing and Optimization Techniques
Methods: Knowledge Distillation
