TL;DR
NRGPT is a minimal, energy-based modification of GPT that treats inference as exploration of an energy landscape. It is evaluated on the Shakespeare dataset, ListOPS tasks, and OpenWebText language modeling.
Contribution
It unifies GPT with energy-based models through a minimal modification, providing a new perspective on inference as energy landscape exploration.
Findings
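NRGPT performs well on simple language modeling (Shakespeare dataset), algebraic ListOPS tasks, and OpenWebText language modeling.
Under certain circumstances, the inference step becomes gradient descent on the energy landscape, although this does not necessarily yield the best-performing models.
NRGPT may be more resistant to overfitting during long training.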
Abstract
Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling (EBM) is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting to unify it with the EBM framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain circumstances this exploration becomes gradient descent, although this does not necessarily lead to the best-performing models. We demonstrate that our model performs well for simple language (Shakespeare dataset), algebraic ListOPS tasks, and richer settings such as OpenWebText language modeling. We also observe that our models may be more resistant to overfitting, doing so only…
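To make the "inference as gradient descent on an energy landscape" idea concrete, here is a minimal NumPy sketch. It is not the NRGPT energy function, which this summary does not specify. It assumes a toy energy (a quadratic term minus a log-sum-exp over overlaps with stored vectors) chosen only because its gradient is easy to write by hand. The names `energy`, `energy_grad`, `descend`, and `memories` are hypothetical.

```python
import numpy as np

def energy(x, memories, beta=1.0):
    """Toy energy (an assumption, not NRGPT's): 0.5*|x|^2 - (1/beta)*logsumexp(beta * M x)."""
    scores = beta * (memories @ x)
    m = scores.max()  # subtract the max for numerical stability
    lse = (m + np.log(np.exp(scores - m).sum())) / beta
    return 0.5 * (x @ x) - lse

def energy_grad(x, memories, beta=1.0):
    """Analytic gradient of the toy energy: x - softmax(beta * M x) @ M."""
    scores = beta * (memories @ x)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return x - weights @ memories

def descend(x0, memories, step=0.1, n_steps=50, beta=1.0):
    """Plain gradient descent on the state x, recording the energy after every step."""
    x = x0.copy()
    trace = [energy(x, memories, beta)]
    for _ in range(n_steps):
        x = x - step * energy_grad(x, memories, beta)
        trace.append(energy(x, memories, beta))
    return x, np.array(trace)

rng = np.random.default_rng(0)
memories = rng.normal(size=(8, 16))
memories /= np.linalg.norm(memories, axis=1, keepdims=True)  # unit-norm rows
x0 = rng.normal(size=16)  # stands in for a token state
x_final, trace = descend(x0, memories)
```

With unit-norm rows in `memories`, `beta=1.0`, and a small step size, each update should lower the toy energy, so `trace` is non-increasing. In this picture a "token" is the state vector `x`, and inference is the repeated update that moves it downhill.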
