BabyBear: Cheap inference triage for expensive language models
Leila Khalili, Yao You, John Bohannon

TL;DR
BabyBear introduces a cascading inference framework for NLP that reduces computational costs by early exiting with high-confidence predictions, achieving over 50% cost savings while maintaining accuracy.
Contribution
It adapts model cascading and inference triage to NLP, enabling significant cost reductions in large-scale NLP tasks with minimal accuracy loss.
Findings
Over 50% reduction in compute cost for classification tasks.
33% compute savings in named entity recognition while maintaining a high F1 score.
Cheap, fast models handle a high proportion of the inference load.
Abstract
Transformer language models provide superior accuracy over previous models, but they are computationally and environmentally expensive. Borrowing the concept of model cascading from computer vision, we introduce BabyBear, a framework for cascading models for natural language processing (NLP) tasks to minimize cost. The core strategy is inference triage, exiting early when the least expensive model in the cascade achieves a sufficiently high-confidence prediction. We test BabyBear on several open-source datasets related to document classification and entity recognition. We find that for common NLP tasks a high proportion of the inference load can be accomplished with cheap, fast models that have learned by observing a deep learning model. This allows us to reduce the compute cost of large-scale classification jobs by more than 50% while retaining overall accuracy. For named entity…
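Example: inference triage in a two-model cascade
The early-exit idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the 0.9 confidence threshold, and the two toy models are hypothetical stand-ins chosen for illustration; this summary does not specify how BabyBear measures confidence or sets its threshold.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a model maps a text to (label, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def triage(
    texts: List[str],
    cheap_model: Model,
    expensive_model: Model,
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> List[Tuple[str, str]]:
    """Return (label, source) per text, where source is 'cheap' or 'expensive'."""
    results = []
    for text in texts:
        label, confidence = cheap_model(text)
        if confidence >= threshold:
            # Early exit: the cheap model is confident enough.
            results.append((label, "cheap"))
        else:
            # Fall through to the expensive model.
            label, _ = expensive_model(text)
            results.append((label, "expensive"))
    return results


def expensive_call_fraction(results: List[Tuple[str, str]]) -> float:
    """Fraction of inputs that still required the expensive model."""
    if not results:
        return 0.0
    return sum(1 for _, source in results if source == "expensive") / len(results)


# Toy stand-ins for a cheap classifier and a transformer (illustrative only).
def toy_cheap_model(text: str) -> Tuple[str, float]:
    if "great" in text:
        return ("pos", 0.95)
    if "awful" in text:
        return ("neg", 0.95)
    return ("pos", 0.5)  # unsure, so the cascade should defer


def toy_expensive_model(text: str) -> Tuple[str, float]:
    return ("neg", 0.99) if "not" in text else ("pos", 0.99)


if __name__ == "__main__":
    demo = triage(
        ["great movie", "awful plot", "not my thing"],
        toy_cheap_model,
        toy_expensive_model,
    )
    print(demo, expensive_call_fraction(demo))
```

The fraction of expensive calls is only a proxy for compute savings, since the cheap model still runs on every input. Raising the threshold sends more inputs to the expensive model, which trades cost for accuracy.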
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
