Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

Arian Eamaz, Farhang Yeganegi, and Mojtaba Soltanalian

arXiv:2605.02853·cs.LG·May 5, 2026

TL;DR

This paper introduces a layer-wise peeling framework for monitoring transformer training. Its fine-grained, per-layer diagnostics reveal inefficiencies that aggregate metrics hide and indicate which layers are under-optimized.

Contribution

The paper constructs lightweight, layer-specific reference solutions that serve as achievable baselines for detailed training analysis, and the approach applies even to quantized and binarized models.

Findings

01

Layer-wise bounds can match or surpass trained models at various training stages.

02

The method exposes hidden inefficiencies not visible in aggregate loss curves.

03

The method remains effective under quantization and binarization, where it also reveals optimization opportunities.

Abstract

Understanding whether deep neural networks are effectively optimized remains challenging, as training occurs in highly nonconvex landscapes and standard metrics provide limited visibility into layer-wise learning quality. This challenge is particularly acute for transformer-based language models, where training is expensive, models are often reused in frozen form, and poorly optimized layers can silently degrade performance. We propose a layer-wise peeling framework for monitoring training dynamics, in which each transformer layer is locally optimized against intermediate representations of the trained model. By constructing lightweight, layer-specific reference solutions and projecting layers onto multiple intermediate outputs via different permutations, we obtain achievable baselines that enable fine-grained diagnosis of under-optimized layers. Experiments on decoder-only transformer…
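
The idea of comparing a layer against a locally optimized reference can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes a single linear layer, a squared-error objective, and a closed-form least-squares fit as the "lightweight reference solution". The names `reference_gap`, `w_trained`, and the factor-of-two threshold are hypothetical and chosen only for illustration.

```python
import numpy as np

def reference_gap(h_in, h_target, w_trained):
    """Compare a layer as trained with a locally optimized reference.

    h_in:      (n, d_in) activations entering the layer
    h_target:  (n, d_out) intermediate representation the layer should match
    w_trained: (d_in, d_out) weights of the layer as currently trained
    Assumption: a linear layer stands in for a transformer block.
    """
    # Loss of the layer as it was actually trained.
    trained_loss = np.mean((h_in @ w_trained - h_target) ** 2)
    # Reference solution: least-squares fit of the same layer shape.
    w_ref, *_ = np.linalg.lstsq(h_in, h_target, rcond=None)
    ref_loss = np.mean((h_in @ w_ref - h_target) ** 2)
    return trained_loss, ref_loss

# Synthetic data standing in for intermediate representations.
rng = np.random.default_rng(0)
h_in = rng.normal(size=(256, 8))
w_true = rng.normal(size=(8, 4))
h_target = h_in @ w_true + 0.01 * rng.normal(size=(256, 4))

# A deliberately under-optimized layer: the true weights plus large noise.
w_trained = w_true + 0.5 * rng.normal(size=(8, 4))

trained_loss, ref_loss = reference_gap(h_in, h_target, w_trained)

# Hypothetical rule: flag the layer if the reference is at least 2x better.
under_optimized = trained_loss > 2 * ref_loss
print(f"trained={trained_loss:.4f} reference={ref_loss:.6f} flagged={under_optimized}")
```

Because the reference is fit to a single layer in isolation, it is cheap to compute, and a large gap between the two losses points at a specific layer rather than at the model as a whole. An aggregate loss curve cannot provide that per-layer signal.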
