Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning
Md Rifat Arefin, Gopeshh Subbaraj, Nicolas Gontier, Yann LeCun, Irina Rish, Ravid Shwartz-Ziv, Christopher Pal

TL;DR
Seq-VCR introduces a regularization method to prevent representation collapse in Transformer intermediate layers, significantly improving reasoning performance without explicit chain-of-thought supervision.
Contribution
The paper proposes Seq-VCR, a novel regularization technique that enhances intermediate layer entropy, boosting reasoning abilities in decoder-only Transformers.
Findings
Achieves 99.5% exact match accuracy on 5x5 integer multiplication, surpassing models of the same size and GPT-4 with five-shot CoT prompting.
Improves performance on arithmetic and longest increasing subsequence (LIS) datasets.
Prevents representation collapse, enhancing reasoning capabilities.
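As a rough illustration of what "representation collapse" means in practice, the sketch below computes a spectral entropy over a batch of hidden states. This is only one common proxy and an assumption here: the function name `spectral_entropy`, the toy data, and the choice of Shannon entropy over normalized squared singular values are illustrative and may differ from the metric used in the paper.

```python
import numpy as np

def spectral_entropy(x, eps=1e-12):
    """Shannon entropy of the normalized covariance spectrum of x, shape (n, d).

    Hypothetical helper for illustration; not taken from the paper's code.
    Values near 0 mean the representations lie along a single direction.
    """
    x = np.asarray(x, dtype=np.float64)
    centered = x - x.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to covariance eigenvalues.
    power = np.linalg.svd(centered, compute_uv=False) ** 2
    p = power / (power.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
# Toy "healthy" representations: 8 roughly independent dimensions.
spread = rng.standard_normal((256, 8))
# Toy "collapsed" representations: every dimension is a copy of one signal.
collapsed = np.tile(rng.standard_normal((256, 1)), (1, 8))

entropy_spread = spectral_entropy(spread)
entropy_collapsed = spectral_entropy(collapsed)
```

In this toy setting the collapsed batch has entropy close to zero, while the spread batch approaches the maximum of log(8) for eight dimensions.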
Abstract
Decoder-only Transformers often struggle with complex reasoning tasks, particularly arithmetic reasoning requiring multiple sequential operations. In this work, we identify representation collapse in the model's intermediate layers as a key factor limiting their reasoning capabilities. To address this, we propose Sequential Variance-Covariance Regularization (Seq-VCR), which enhances the entropy of intermediate representations and prevents collapse. Combined with dummy pause tokens as substitutes for chain-of-thought (CoT) tokens, our method significantly improves performance in arithmetic reasoning problems. In the challenging 5x5 integer multiplication task, our approach achieves 99.5% exact match accuracy, outperforming models of the same size and GPT-4 with five-shot CoT prompting. We also demonstrate superior results on arithmetic…
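To make the idea of variance-covariance regularization more concrete, here is a minimal sketch of a VICReg-style penalty applied to a flattened batch of intermediate representations. It is an assumption-laden illustration, not the authors' loss: the function name, the weights `var_weight` and `cov_weight`, the target standard deviation of 1, and flattening the sequence into a single (n, d) matrix are all hypothetical choices, and the exact Seq-VCR formulation in the paper may differ.

```python
import numpy as np

def variance_covariance_penalty(x, var_weight=1.0, cov_weight=1.0, eps=1e-4):
    """VICReg-style penalty on representations x of shape (n, d).

    Hypothetical sketch: the hinge target of 1 and the weights are assumptions.
    """
    x = np.asarray(x, dtype=np.float64)
    n, d = x.shape
    centered = x - x.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (n - 1)  # (d, d) sample covariance
    # Variance term: hinge that penalizes dimensions whose std falls below 1.
    std = np.sqrt(np.diag(cov) + eps)
    var_term = np.mean(np.maximum(0.0, 1.0 - std))
    # Covariance term: penalize squared off-diagonal entries (redundancy).
    off_diag = cov - np.diag(np.diag(cov))
    cov_term = np.sum(off_diag ** 2) / d
    return float(var_weight * var_term + cov_weight * cov_term)

rng = np.random.default_rng(0)
# Toy data: well-spread features versus low-variance copies of one signal.
spread = rng.standard_normal((256, 8))
collapsed = 0.1 * np.tile(rng.standard_normal((256, 1)), (1, 8))

penalty_spread = variance_covariance_penalty(spread)
penalty_collapsed = variance_covariance_penalty(collapsed)
```

In a training loop, a term like this would typically be added to the next-token loss with a small coefficient, so that gradient descent is discouraged from shrinking or duplicating feature dimensions in the regularized layer.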
Taxonomy
Topics: AI-based Problem Solving and Planning
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Dropout · Absolute Position Encodings
