Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning

Md Rifat Arefin; Gopeshh Subbaraj; Nicolas Gontier; Yann LeCun; Irina Rish; Ravid Shwartz-Ziv; Christopher Pal

arXiv:2411.02344 · cs.LG · March 21, 2025

TL;DR

Seq-VCR introduces a regularization method to prevent representation collapse in Transformer intermediate layers, significantly improving reasoning performance without explicit chain-of-thought supervision.

Contribution

The paper proposes Seq-VCR, a novel regularization technique that enhances intermediate layer entropy, boosting reasoning abilities in decoder-only Transformers.
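Seq-VCR stands for Sequential Variance-Covariance Regularization. As a rough illustration of what a variance-covariance penalty on intermediate representations can look like, here is a minimal sketch in the VICReg style. It is not the authors' implementation: the function name, the weights `var_weight` and `cov_weight`, the unit-standard-deviation target, and the choice to treat the input as a plain list of hidden-state vectors are all assumptions made for illustration, and the paper's exact formulation (including how it aggregates over sequence positions) may differ.

```python
import math


def variance_covariance_penalty(reps, var_weight=1.0, cov_weight=1.0, eps=1e-4):
    """VICReg-style penalty on a batch of representation vectors.

    reps: list of n vectors (lists of floats), each of dimension d, n >= 2.
    All names and default weights here are illustrative assumptions.
    """
    n, d = len(reps), len(reps[0])
    if n < 2:
        raise ValueError("need at least two vectors to estimate covariance")

    # Center each feature dimension across the batch.
    means = [sum(r[j] for r in reps) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in reps]

    # Unbiased sample covariance matrix (d x d).
    cov = [
        [sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(d)]
        for j in range(d)
    ]

    # Variance term: hinge that pushes each dimension's std toward >= 1,
    # which discourages all vectors from collapsing to the same point.
    var_term = sum(max(0.0, 1.0 - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d

    # Covariance term: squared off-diagonal entries, which discourages
    # redundant (correlated) feature dimensions.
    cov_term = sum(cov[j][k] ** 2 for j in range(d) for k in range(d) if j != k) / d

    return var_weight * var_term + cov_weight * cov_term


# A fully collapsed batch is penalized; a spread-out, decorrelated one is not.
collapsed = [[1.0, 1.0] for _ in range(4)]
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
print(variance_covariance_penalty(collapsed) > variance_covariance_penalty(spread))
```

In a training loop, a penalty of this kind would typically be added to the next-token loss with a small coefficient and computed on the hidden states of a chosen intermediate layer; which layer and which coefficient to use are not stated in this summary.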

Findings

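01

Achieves 99.5% exact-match accuracy on 5x5 integer multiplication, outperforming same-size models (0%) and GPT-4 with five-shot CoT prompting (44%).

02

Improves performance on arithmetic and longest increasing subsequence (LIS) datasets.

03

Prevents representation collapse in intermediate layers, which the paper identifies as a key factor limiting reasoning capabilities.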

Abstract

Decoder-only Transformers often struggle with complex reasoning tasks, particularly arithmetic reasoning requiring multiple sequential operations. In this work, we identify representation collapse in the model's intermediate layers as a key factor limiting their reasoning capabilities. To address this, we propose Sequential Variance-Covariance Regularization (Seq-VCR), which enhances the entropy of intermediate representations and prevents collapse. Combined with dummy pause tokens as substitutes for chain-of-thought (CoT) tokens, our method significantly improves performance in arithmetic reasoning problems. In the challenging $5 \times 5$ integer multiplication task, our approach achieves $99.5\%$ exact match accuracy, outperforming models of the same size (which yield $0\%$ accuracy) and GPT-4 with five-shot CoT prompting ($44\%$). We also demonstrate superior results on arithmetic…
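The abstract mentions dummy pause tokens used in place of chain-of-thought tokens. A minimal sketch of how such a training sequence might be assembled follows. The token string `<pause>`, the function names, the number of pause tokens, and the choice to compute the loss only on answer tokens are all hypothetical and are not taken from the paper or its repository.

```python
PAUSE = "<pause>"  # hypothetical token string; the real vocabulary entry may differ


def add_pause_tokens(question_tokens, answer_tokens, num_pause=2):
    """Place num_pause dummy tokens between the question and the answer."""
    return list(question_tokens) + [PAUSE] * num_pause + list(answer_tokens)


def answer_loss_mask(question_tokens, answer_tokens, num_pause=2):
    """1 where the loss would be computed (answer tokens), 0 elsewhere.

    Masking the question and pause positions is an assumption of this sketch.
    """
    return [0] * (len(question_tokens) + num_pause) + [1] * len(answer_tokens)


# Example: the multiplication 12 * 34 = 408, tokenized digit by digit.
question = ["1", "2", "*", "3", "4", "="]
answer = ["4", "0", "8"]
sequence = add_pause_tokens(question, answer, num_pause=2)
mask = answer_loss_mask(question, answer, num_pause=2)
print(sequence)
print(mask)
```

The idea is that the pause positions give the model extra forward-pass computation before it must emit the answer, without requiring any supervised intermediate reasoning steps.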

Code & Models

Repositories

rarefin/seq_vcr (Official)

Taxonomy

Topics: AI-based Problem Solving and Planning

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Dropout · Absolute Position Encodings