How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation

Muskaan Chopra; Lorenz Sparrenberg; Sarthak Khanna; Rafet Sifa

arXiv:2511.09748 · cs.CL · November 14, 2025

TL;DR

This paper investigates how small a language model can be while still reliably detecting critical translation errors on edge devices. It focuses on English-German translation and benchmarks models of around one billion parameters.

Contribution

It introduces a standardized framework for evaluating compact LLMs for on-device critical error detection in machine translation, and identifies Gemma-3-1B as offering the best balance of quality and efficiency.

Findings

1. Gemma-3-1B reaches MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, with low latency on a MacBook Pro.
2. Qwen-3-1.7B attains a higher MCC, but at increased compute cost.
3. Ultra-small models are usable with few-shot calibration, but they miss some error types.

Abstract

Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms…
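The abstract names majority voting, logit-bias calibration, and the metrics MCC and F1-ERR/F1-NOT without showing how they fit together. The following is a minimal Python sketch of these pieces, not the authors' implementation. The label strings "ERR" and "NOT" are inferred from the metric names; the function names, the tie-breaking rule, and the form of the bias (a constant added to the ERR logit) are assumptions made for illustration.

```python
import math
from collections import Counter


def biased_decision(logit_err, logit_not, bias=0.0):
    """Pick a label after adding a constant bias to the ERR logit.

    Assumption: calibration is modeled here as a single additive offset.
    """
    return "ERR" if logit_err + bias > logit_not else "NOT"


def majority_vote(votes):
    """Return the most frequent label; ties fall back to "ERR" (an assumption)."""
    counts = Counter(votes)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else "ERR"


def confusion(gold, pred, positive):
    """Count (tp, tn, fp, fn), treating `positive` as the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    return tp, tn, fp, fn


def mcc(gold, pred, positive="ERR"):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    tp, tn, fp, fn = confusion(gold, pred, positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1(gold, pred, positive):
    """F1 for one class: F1-ERR with positive="ERR", F1-NOT with "NOT"."""
    tp, _, fp, fn = confusion(gold, pred, positive)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Toy data, invented for illustration only.
gold = ["ERR", "ERR", "NOT", "NOT"]
runs = [
    ["ERR", "NOT", "NOT", "NOT"],
    ["ERR", "NOT", "NOT", "ERR"],
    ["ERR", "NOT", "NOT", "NOT"],
]
# Vote per example across the three runs.
pred = [majority_vote(list(votes)) for votes in zip(*runs)]
print(pred, round(mcc(gold, pred), 3), round(f1(gold, pred, "ERR"), 3))
```

MCC is reported alongside the per-class F1 scores because it uses all four cells of the confusion matrix, so a model that labels everything as one class scores 0.0 here even when one of its F1 values looks high.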

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Artificial Intelligence in Healthcare and Education