Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at Scale
Xiang Hu, Pengyu Ji, Qingyang Zhu, Wei Wu, Kewei Tu

TL;DR
GPST is an unsupervised, scalable syntactic language model that jointly learns to generate sentences and parse trees, outperforming previous models in language understanding, generation, and grammar induction.
Contribution
Introduces GPST, a novel unsupervised structured transformer model that enables parallel training and surpasses prior models in multiple NLP tasks.
Findings
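Outperforms GPT-2 on language understanding and language generation tasks
Achieves significantly better grammar induction results than previous models
Trains faster than existing SLMs because training is parallel rather than sequential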
Abstract
A syntactic language model (SLM) incrementally generates a sentence with its syntactic tree in a left-to-right manner. We present Generative Pretrained Structured Transformers (GPST), an unsupervised SLM at scale capable of being pre-trained from scratch on raw texts with high parallelism. GPST circumvents the limitations of previous SLMs such as relying on gold trees and sequential training. It consists of two components, a usual SLM supervised by a uni-directional language modeling loss, and an additional composition model, which induces syntactic parse trees and computes constituent representations, supervised by a bi-directional language modeling loss. We propose a representation surrogate to enable joint parallel training of the two models in a hard-EM fashion. We pre-train GPST on OpenWebText, a corpus with billions of tokens, and demonstrate the superiority of GPST over GPT-2…
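As a rough illustration of the left-to-right generation the abstract describes, the sketch below replays a sequence of generate and compose actions on a stack to recover a sentence together with its binary tree. It is a toy under stated assumptions, not GPST's implementation: the `COMP` action name, the `build_tree` and `leaves` helpers, and the example sentence are all hypothetical, and no neural model is involved.

```python
from typing import List, Tuple, Union

# A tree is either a token (leaf) or a pair of subtrees (binary constituent).
Tree = Union[str, Tuple["Tree", "Tree"]]

COMP = "COMP"  # hypothetical name for the "compose top two constituents" action


def build_tree(actions: List[str]) -> Tree:
    """Replay generate/compose actions left to right on a stack."""
    stack: List[Tree] = []
    for action in actions:
        if action == COMP:
            if len(stack) < 2:
                raise ValueError("COMP needs two constituents on the stack")
            right = stack.pop()
            left = stack.pop()
            stack.append((left, right))  # merge into one constituent
        else:
            stack.append(action)  # any other action generates that token
    if len(stack) != 1:
        raise ValueError("actions do not reduce to a single tree")
    return stack[0]


def leaves(tree: Tree) -> List[str]:
    """Read the generated sentence back off the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


actions = ["the", "cat", COMP, "sleeps", COMP]
tree = build_tree(actions)
print(tree)          # nested tuples encoding the bracketing
print(leaves(tree))  # the tokens in generation order
```

In an actual SLM the next action would be chosen by the model conditioned on the stack contents; here the action sequence is simply given, which keeps the example focused on how tokens and tree structure are produced in a single left-to-right pass.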
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Softmax · Residual Connection · Weight Decay · Linear Layer · Dense Connections · Adam · Dropout · Multi-Head Attention
