TL;DR
This paper empirically assesses how bidirectional encoders (BiLSTMs and Transformers, including BERT) perform on natural language understanding tasks when they must produce output incrementally from partial input, and explores ways to adapt them to that setting.
Contribution
The paper shows that bidirectional encoders can be used incrementally while retaining most of their non-incremental quality, and it proposes training and testing adaptations that improve their incremental performance.
Findings
Bidirectional models retain most non-incremental quality when used incrementally.
BERT, which performs best in the non-incremental setting, is affected more by incremental access than the other models.
Training and testing modifications can mitigate performance drops in incremental settings.
Abstract
While humans process language incrementally, the best language encoders currently used in NLP do not. Both bidirectional LSTMs and Transformers assume that the sequence that is to be encoded is available in full, to be processed either forwards and backwards (BiLSTMs) or as a whole (Transformers). We investigate how they behave under incremental interfaces, when partial output must be provided based on partial input seen up to a certain time step, which may happen in interactive systems. We test five models on various NLU datasets and compare their performance using three incremental evaluation metrics. The results support the possibility of using bidirectional encoders in incremental mode while retaining most of their non-incremental quality. The "omni-directional" BERT model, which achieves better non-incremental performance, is impacted more by the incremental access. This can be…
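The incremental interface described in the abstract can be pictured as re-running a non-incremental model on every growing prefix of the input and observing how earlier outputs change. The sketch below is only an illustration under stated assumptions: `toy_bidirectional_tagger` is a hypothetical stand-in for a bidirectional encoder (its label for a token depends on the following token), and `count_revisions` is a simple illustrative measure, not one of the paper's three metrics.

```python
from typing import Callable, List, Sequence

# A tagger maps a full token sequence to one label per token.
Tagger = Callable[[Sequence[str]], List[str]]


def toy_bidirectional_tagger(tokens: Sequence[str]) -> List[str]:
    """Hypothetical stand-in for a non-incremental encoder.

    The label of "book" depends on the NEXT token, so the tagger
    needs right context that a prefix may not contain yet.
    """
    labels = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "book" and nxt in {"a", "the"}:
            labels.append("VERB")
        elif tok == "book":
            labels.append("NOUN")
        else:
            labels.append("OTHER")
    return labels


def incremental_outputs(tagger: Tagger, tokens: Sequence[str]) -> List[List[str]]:
    """Re-run the tagger on each prefix tokens[:1], tokens[:2], ..."""
    return [tagger(tokens[:t]) for t in range(1, len(tokens) + 1)]


def count_revisions(outputs: List[List[str]]) -> int:
    """Count labels that change between consecutive prefix outputs.

    Only positions present in both outputs are compared; a newly
    added label for the newest token is not a revision.
    """
    revisions = 0
    for prev, curr in zip(outputs, outputs[1:]):
        revisions += sum(1 for a, b in zip(prev, curr) if a != b)
    return revisions


if __name__ == "__main__":
    sentence = ["book", "a", "flight"]
    outputs = incremental_outputs(toy_bidirectional_tagger, sentence)
    for step, labels in enumerate(outputs, start=1):
        print(step, sentence[:step], labels)
    print("revisions:", count_revisions(outputs))
```

In this toy run the label for "book" is revised once, after the next token arrives. This is the kind of instability an incremental evaluation has to account for when the underlying encoder expects the full sequence.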
Taxonomy
Methods: Linear Layer · Cosine Annealing · WordPiece · Adam · Byte Pair Encoding · Softmax · Multi-Head Attention · Layer Normalization · Dense Connections · Linear Warmup With Cosine Annealing
