Semformer: Transformer Language Models with Semantic Planning
Yongjing Yin, Junran Ding, Kai Song, Yue Zhang

TL;DR
Semformer adds explicit semantic planning to the training of Transformer language models, mitigating the shortcut learning that teacher forcing can induce. It reports near-perfect performance on a minimal planning task (graph path-finding) as well as improved perplexity and in-context learning results.
Contribution
The paper proposes a novel training method for Transformer language models that explicitly models the semantic planning of the response, addressing shortcut learning and improving results on graph path-finding, perplexity, and in-context learning.
Findings
Near-perfect performance on a graph path-finding task
Effective mitigation of shortcut learning
Improved perplexity and in-context learning results
Abstract
Next-token prediction serves as the dominant component in current neural language models. During the training phase, the model employs teacher forcing, which predicts tokens based on all preceding ground truth tokens. However, this approach has been found to create shortcuts, utilizing the revealed prefix to spuriously fit future tokens, potentially compromising the accuracy of the next-token predictor. In this paper, we introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of response. Specifically, we incorporate a sequence of planning tokens into the prefix, guiding the planning token representations to predict the latent semantic representations of the response, which are induced by an autoencoder. In a minimal planning task (i.e., graph path-finding), our model exhibits near-perfect performance and effectively…
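The training signal described in the abstract can be sketched as a next-token loss plus a term that pulls the planning-token representations toward latent targets. This is a minimal illustration only, not the authors' implementation: the mean-squared-error form of the planning term, the weight `alpha`, and all function names and array shapes are assumptions, and random arrays stand in for the language model outputs and the autoencoder latents.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean next-token cross-entropy. logits: (T, V), targets: (T,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def planning_loss(plan_states, latent_targets):
    """Assumed regression term: MSE between planning-token states and
    latent targets, both of shape (K, D)."""
    return float(((plan_states - latent_targets) ** 2).mean())

def combined_loss(logits, targets, plan_states, latent_targets, alpha=1.0):
    """Next-token loss plus the planning term, weighted by alpha (assumed)."""
    return cross_entropy(logits, targets) + alpha * planning_loss(plan_states, latent_targets)

# Toy stand-ins: T response tokens, vocabulary V, K planning tokens, latent size D.
rng = np.random.default_rng(0)
T, V, K, D = 5, 11, 3, 4
logits = rng.normal(size=(T, V))          # would come from the language model
targets = rng.integers(0, V, size=T)      # ground-truth response tokens
plan_states = rng.normal(size=(K, D))     # hidden states at the planning tokens
latent_targets = rng.normal(size=(K, D))  # would come from the autoencoder

print(combined_loss(logits, targets, plan_states, latent_targets))
```

With `alpha=0.0` the sketch reduces to ordinary teacher-forced next-token training, which makes the planning term easy to ablate in a toy setting.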
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Dropout
