Active Learning for Sequence Tagging with Deep Pre-trained Models and Bayesian Uncertainty Estimates

Artem Shelmanov, Dmitri Puzyrev, Lyubov Kupriyanova, Denis Belyakov, Daniil Larionov, Nikita Khromov, Olga Kozlova, Ekaterina Artemova, Dmitry V. Dylov, and Alexander Panchenko

arXiv:2101.08133·cs.CL·February 19, 2021

TL;DR

This paper combines active learning with deep pre-trained models and Bayesian uncertainty estimates to reduce the annotation effort for sequence tagging. It also shows that a distilled model can replace a full-size Transformer when acquiring instances, which lowers the computational cost.

Contribution

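It presents what the authors describe as the first thorough empirical study of Bayesian uncertainty estimation methods and Monte Carlo dropout options for deep pre-trained models in active learning for sequence tagging, and it shows that a distilled Transformer can substitute for a full-size one during instance acquisition.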

Findings

01

Bayesian methods improve active learning efficiency

02

A distilled Transformer can replace a full-size model for instance acquisition, with better computational performance

03

Optimal uncertainty estimation techniques vary by model type

Abstract

Annotating training data for sequence tagging of texts is usually very time-consuming. Recent advances in transfer learning for natural language processing in conjunction with active learning open the possibility to significantly reduce the necessary annotation budget. We are the first to thoroughly investigate this powerful combination for the sequence tagging task. We conduct an extensive empirical study of various Bayesian uncertainty estimation methods and Monte Carlo dropout options for deep pre-trained models in the active learning framework and find the best combinations for different types of models. Besides, we also demonstrate that to acquire instances during active learning, a full-size Transformer can be substituted with a distilled version, which yields better computational performance and reduces obstacles for applying deep active learning in practice.
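The acquisition step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy linear tagger in place of a pre-trained Transformer, uses Monte Carlo dropout (dropout left on at inference) to draw several stochastic predictions per sentence, and ranks unlabeled sentences by a BALD-style score (entropy of the averaged prediction minus the average entropy), averaged over tokens. The abstract does not say which uncertainty scores the paper evaluates, so the choice of BALD, along with every name and parameter below (`stochastic_forward`, `bald_score`, `select_for_annotation`, `drop_p`, `n_passes`), is an assumption made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over one token's tag logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability tags contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def stochastic_forward(sentence, weights, drop_p, rng):
    """One pass of a toy linear tagger with dropout kept ON at inference.

    sentence: list of token feature vectors (hypothetical stand-in for encoder output)
    weights:  matrix [n_features][n_tags] (hypothetical stand-in for the tagging head)
    """
    keep = 1.0 - drop_p  # drop_p must be < 1
    n_tags = len(weights[0])
    tag_probs = []
    for token in sentence:
        # Fresh dropout mask per token; kept units are rescaled (inverted dropout).
        masked = [x / keep if rng.random() < keep else 0.0 for x in token]
        logits = [sum(m * row[t] for m, row in zip(masked, weights))
                  for t in range(n_tags)]
        tag_probs.append(softmax(logits))
    return tag_probs


def bald_score(sentence, weights, n_passes=10, drop_p=0.2, rng=None):
    """Mean per-token BALD score of a non-empty sentence."""
    if rng is None:
        rng = random.Random(0)
    passes = [stochastic_forward(sentence, weights, drop_p, rng)
              for _ in range(n_passes)]
    n_tags = len(weights[0])
    total = 0.0
    for i in range(len(sentence)):
        # Average the tag distribution for token i over all stochastic passes.
        mean_p = [sum(p[i][t] for p in passes) / n_passes for t in range(n_tags)]
        expected_h = sum(entropy(p[i]) for p in passes) / n_passes
        total += entropy(mean_p) - expected_h
    return total / len(sentence)


def select_for_annotation(pool, weights, k, seed=0):
    """Return indices of the k pool sentences with the highest BALD score."""
    rng = random.Random(seed)
    scores = [bald_score(s, weights, rng=rng) for s in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy data: 4 features per token, 3 tags, 5 unlabeled sentences of 3 to 6 tokens.
data_rng = random.Random(42)
weights = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]
pool = [[[data_rng.gauss(0, 1) for _ in range(4)]
         for _ in range(data_rng.randint(3, 6))] for _ in range(5)]
chosen = select_for_annotation(pool, weights, k=2)
```

In a full active learning loop, the sentences at the `chosen` indices would be sent to annotators, added to the training set, and the model retrained before the next round. The substitution mentioned in the abstract would correspond to computing the scores with a distilled model rather than the full-size one.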

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Monte Carlo Dropout · Softmax · Multi-Head Attention · Dense Connections · Layer Normalization · Residual Connection · Attention Is All You Need