LokiLM: Technical Report

Justin Kiefel; Shrey Shah

arXiv:2407.07370·cs.CL·July 11, 2024


Open Access · 3 Reviews

TL;DR

LokiLM is a 1.4B parameter language model trained on 500B tokens, achieving strong reasoning performance and state-of-the-art results among small models, but it suffers from hallucinations and truthfulness issues.

Contribution

The paper introduces LokiLM, a 1.4B parameter language model trained with multi-teacher knowledge distillation and high-quality data, achieving benchmark results competitive with larger models trained on significantly more tokens.

Findings

01

Strong performance in natural language reasoning tasks

02

Achieves state-of-the-art results among models with 1.5B parameters or less

03

Exhibits significant hallucinations and poor truthfulness scores

Abstract

In this work, we introduce LokiLM, a 1.4B parameter large language model trained on 500B tokens. Our model performs strongly in natural language reasoning tasks and achieves state-of-the-art performance among models with 1.5B parameters or less. LokiLM is trained using multi-teacher knowledge distillation and high-quality training data to achieve benchmark results competitive with larger models trained on significantly more tokens. We support these findings by introducing steps to avoid benchmark contamination and overfitting throughout our development process. Despite its promising performance, LokiLM exhibits a concerning amount of hallucinations and scores poorly on the TruthfulQA benchmark, so we do not release the model publicly.
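The abstract names multi-teacher knowledge distillation but does not say how the teachers are combined. As a minimal sketch of one common formulation (an assumption, not necessarily the authors' method), the student can be trained to match a weighted average of the teachers' temperature-softened output distributions. The function names, the uniform default weights, and the temperature value below are all illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def multi_teacher_kd_loss(student_logits, teacher_logits_list,
                          temperature=2.0, weights=None):
    """KL(target || student), where target is a weighted mixture of
    the teachers' softened distributions for a single token position.

    Assumption: teachers are combined by averaging probabilities;
    the source does not specify the combination rule.
    """
    n_teachers = len(teacher_logits_list)
    if weights is None:
        weights = [1.0 / n_teachers] * n_teachers  # uniform by default

    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    vocab = len(student_logits)
    target = [
        sum(w * p[i] for w, p in zip(weights, teacher_probs))
        for i in range(vocab)
    ]
    student = softmax(student_logits, temperature)

    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(target, student)
        if t > 0.0
    )
    # The T^2 factor is the conventional rescaling for softened targets.
    return kl * temperature ** 2


if __name__ == "__main__":
    student = [0.5, 1.0, 0.2]
    teachers = [[2.0, 0.5, 0.1], [0.3, 2.5, 0.0]]
    print(multi_teacher_kd_loss(student, teachers))
```

In practice a term like this is usually added to the ordinary next-token cross-entropy loss, with the mixing coefficient treated as a hyperparameter.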

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 1 · Confidence 4

Strengths

- Presents a recipe which has promising performance given the model size and number of steps over several standard knowledge & reasoning-based benchmarks.

Weaknesses

- Unless I missed it, the description of what the training data consists of is very vague: "The training data for LokiLM primarily consists of web-scraped content, supplemented with a small portion of machine-generated text in the early stages of training". Given that the whole paper is about how the construction of the training data changes performance, this seems like a major weakness in terms of what can be learned from the paper. For example, I also wasn't sure how the distillation / machine-generate

Reviewer 02 · Rating 3 · Confidence 5

Strengths

The paper is well motivated in aiming to improve the performance of small language models. The approach covers major pretraining techniques across architecture, data, and evaluation. The writing is mostly clear.

Weaknesses

My biggest concern is that the paper lacks scientific insight. The report is mostly a description of how the training was done and what the evaluation results look like, without controlled experiments to understand the impact of different factors. Specifically, the training is mostly a kitchen-sink combination of different ingredients, which is totally valid. But I'd expect the report to present some ablation experiments to understand the key design choices in architectures and data filtering,

Reviewer 03 · Rating 6 · Confidence 4

Strengths

This paper details the creation of their strong, smaller LM, and its performance on popular benchmarks. The paper makes it clear that they were very careful in their model architecture design, data curation and filtering, and efforts to avoid benchmark contamination, which makes their model's performance quite impressive, especially for how small it is. While it falls short in some areas (namely TruthfulQA and some qualitative analysis of social biases), they make a good effort to try to explain

Weaknesses

While the model's performance is quite impressive, and I appreciate the authors' reasoning on why it falls short in some areas, I think not releasing the model (and potentially not releasing the pretraining data/code?) is a very strong weakness. While the model's TruthfulQA score is lower than one would hope, I do not believe this is a compelling reason for not releasing the model. Instead, if the model were to be released (especially in conjunction with the pretraining data and/or data filterin


Taxonomy

Topics: Plasma Diagnostics and Applications · Parallel Computing and Optimization Techniques

Methods: Knowledge Distillation