A Comparison of LSTM and BERT for Small Corpus
Aysu Ezen-Can

TL;DR
This paper compares a bidirectional LSTM with a pre-trained BERT model on a small intent classification dataset. The simpler LSTM achieves higher accuracy and trains faster, which suggests that model selection should account for the task and the amount of available data.
Contribution
The paper provides empirical evidence that a traditional LSTM model can outperform BERT on a small dataset, challenging the assumption that larger pre-trained models are always superior when data is scarce.
Findings
LSTM outperforms BERT on small intent classification datasets.
LSTM models train faster than BERT in small data settings.
Model choice should consider task and data characteristics.
Abstract
Recent advancements in the NLP field showed that transfer learning helps with achieving state-of-the-art results for new tasks by tuning pre-trained models instead of starting from scratch. Transformers have made a significant improvement in creating new state-of-the-art results for many NLP tasks including but not limited to text classification, text generation, and sequence labeling. Most of these success stories were based on large datasets. In this paper we focus on a real-life scenario that scientists in academia and industry face frequently: given a small dataset, can we use a large pre-trained model like BERT and get better results than simple models? To answer this question, we use a small dataset for intent classification collected for building chatbots and compare the performance of a simple bidirectional LSTM model with a pre-trained BERT model. Our experimental results show…
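To give a rough sense of the size gap between a "simple" model and a "large pre-trained" one, the sketch below counts trainable parameters for a small bidirectional LSTM classifier and for a BERT encoder. It is only an illustration under stated assumptions: the BiLSTM sizes (vocabulary, embedding width, hidden width, number of intents) are hypothetical and not taken from the paper, the LSTM count uses the textbook formulation with one bias vector per gate, and the BERT figures are the standard BERT-base uncased configuration, which this summary does not confirm as the variant the author used.

```python
def lstm_params(input_dim, hidden_dim):
    """Parameters of one LSTM direction: 4 gates, each with input weights,
    recurrent weights, and a single bias vector (textbook formulation)."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


def bilstm_classifier_params(vocab_size, embed_dim, hidden_dim, num_classes):
    """Embedding table + forward and backward LSTM + linear output layer."""
    embedding = vocab_size * embed_dim
    recurrent = 2 * lstm_params(embed_dim, hidden_dim)
    # The output layer reads the concatenated forward/backward states.
    output = 2 * hidden_dim * num_classes + num_classes
    return embedding + recurrent + output


def bert_base_params(vocab_size=30522, max_pos=512, hidden=768,
                     layers=12, ffn=3072):
    """Encoder parameters, assuming the BERT-base uncased configuration."""
    layer_norm = 2 * hidden  # scale and shift vectors
    # Token, position, and 2 segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_pos + 2) * hidden + layer_norm
    # Query, key, value, and output projections, followed by a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden), followed by a LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


# Hypothetical BiLSTM sizes, chosen only for illustration.
small = bilstm_classifier_params(vocab_size=5000, embed_dim=100,
                                 hidden_dim=128, num_classes=10)
large = bert_base_params()
print(f"BiLSTM classifier: {small:,} parameters")
print(f"BERT-base encoder: {large:,} parameters")
print(f"ratio: about {large / small:.0f}x")
```

Parameter count alone does not determine accuracy, but with sizes like these the pre-trained model has roughly two orders of magnitude more weights to fine-tune, which is consistent with the reported difference in training time on small data.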
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · Softmax · Sigmoid Activation · Layer Normalization · Tanh Activation · Long Short-Term Memory · Weight Decay · Dropout · Linear Warmup With Linear Decay · Dense Connections
