DEPTH: Discourse Education through Pre-Training Hierarchically
Zachary Bamberger, Ofek Glick, Chaim Baskin, Yonatan Belinkov

TL;DR
DEPTH is a hierarchical pre-training approach for language models that enhances discourse understanding by learning sentence-level representations with novel objectives, leading to improved performance on discourse and NLU tasks.
Contribution
The paper introduces DEPTH, a discourse-oriented pre-training method that combines hierarchical sentence representations with two objectives, Sentence Un-Shuffling and Span-Corruption, to improve the discourse capabilities of language models.
Findings
DEPTH outperforms T5 in span-corruption loss.
DEPTH learns faster and performs better than T5 on discourse tasks.
The discourse-oriented objective has minimal impact on other NLU capabilities.
Abstract
Language Models (LMs) struggle with linguistic understanding at the discourse level, even though discourse patterns such as coherence, cohesion, and narrative flow are prevalent in their pre-training data. To improve the discourse capabilities of LMs already at the pre-training stage, we introduce DEPTH, an encoder-decoder model that learns latent representations for sentences using a discourse-oriented pre-training objective. DEPTH combines hierarchical sentence representations with two objectives: (1) Sentence Un-Shuffling, and (2) Span-Corruption. Our approach trains the model to represent both sub-word-level and sentence-level dependencies over a pre-training corpus. When trained either from scratch or continuing from a pre-trained T5 checkpoint, DEPTH learns semantic and discourse-level representations faster than T5, outperforming it in span-corruption loss despite the additional…
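To make the two objectives named in the abstract more concrete, here is a minimal toy sketch of how training pairs for each could be built. It is an illustration under assumptions, not the authors' implementation: the function names `shuffle_sentences` and `corrupt_spans` are hypothetical, tokens are whitespace-split words rather than sub-words, the corrupted spans are chosen by hand rather than sampled, and the `<extra_id_k>` sentinels follow the convention of publicly released T5 tokenizers. How DEPTH actually encodes sentence positions and combines the two losses is not specified in the text above.

```python
import random


def shuffle_sentences(sentences, rng):
    """Shuffle sentences and return the order that restores the original text.

    target_order[i] is the slot in the shuffled list that holds the
    sentence that was originally i-th.
    """
    order = list(range(len(sentences)))
    rng.shuffle(order)  # order[j] = original index of the sentence at slot j
    shuffled = [sentences[i] for i in order]
    target_order = [order.index(i) for i in range(len(sentences))]
    return shuffled, target_order


def corrupt_spans(tokens, spans):
    """Replace each half-open (start, end) span with a sentinel token.

    `spans` must be sorted and non-overlapping. The target lists each
    sentinel followed by the tokens it replaced.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets


if __name__ == "__main__":
    rng = random.Random(0)
    sentences = ["She opened the door.", "The room was dark.", "She found the switch."]
    shuffled, target_order = shuffle_sentences(sentences, rng)
    # Reading the shuffled sentences in target_order recovers the original text.
    print([shuffled[j] for j in target_order] == sentences)

    inputs, targets = corrupt_spans("the cat sat on the mat".split(), [(1, 2), (4, 6)])
    print(inputs)   # ['the', '<extra_id_0>', 'sat', 'on', '<extra_id_1>']
    print(targets)  # ['<extra_id_0>', 'cat', '<extra_id_1>', 'the', 'mat']
```

In this sketch the un-shuffling target is a permutation over sentences while the span-corruption target is a token sequence, which mirrors the abstract's distinction between sentence-level and sub-word-level dependencies.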
Taxonomy
Topics: Discourse Analysis in Language Studies · EFL/ESL Teaching and Learning
Methods: Gated Linear Unit · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Multi-Head Attention · Dense Connections · Attention Dropout · Adafactor · SentencePiece
