TL;DR
This paper enhances COVID-19 tweet classification by combining RoBERTa ensembles with handcrafted features, demonstrating improved performance on noisy data outside the pre-training domain.
Contribution
It introduces an ensemble approach that combines corpus-level information and a handcrafted feature with a fine-tuned transformer to improve text classification on noisy, domain-specific data.
Findings
Adding corpus-level information and a handcrafted feature to the fine-tuned transformer can improve classification results.
Handcrafted features contribute to handling noisy data.
The submission scored within 2 points of the top-performing team in WNUT-2020 Task 2.
Abstract
This paper presents our submission to Task 2 of the Workshop on Noisy User-generated Text. We explore improving the performance of a pre-trained transformer-based language model fine-tuned for text classification through an ensemble implementation that makes use of corpus-level information and a handcrafted feature. We test the effectiveness of including these features in accommodating the challenges of a noisy data set centred on a specific subject outside the remit of the pre-training data. We show that including additional features can improve classification results, and we achieve a score within 2 points of the top-performing team.
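The abstract does not say how the transformer output, the corpus-level information, and the handcrafted feature are combined. As a rough illustration only, the sketch below shows one possible late-fusion scheme: the transformer's class probability is converted to a logit and added to two extra signals with fixed weights. Everything here is an assumption made for illustration, not the authors' method: the function names, the token log-odds used as the corpus-level signal, the "contains a digit" handcrafted feature, and the weights.

```python
import math


def logit(p, eps=1e-6):
    """Convert a probability to log-odds, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Map a real-valued score back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def has_number(text):
    """Hypothetical handcrafted feature: 1.0 if the tweet contains a digit."""
    return 1.0 if any(ch.isdigit() for ch in text) else 0.0


def corpus_score(text, positive_counts, negative_counts):
    """Hypothetical corpus-level feature: mean add-one-smoothed log-odds of
    each token appearing in positive vs. negative training tweets."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    total = 0.0
    for tok in tokens:
        pos = positive_counts.get(tok, 0) + 1
        neg = negative_counts.get(tok, 0) + 1
        total += math.log(pos / neg)
    return total / len(tokens)


def fuse(transformer_prob, text, positive_counts, negative_counts,
         weights=(1.0, 0.5, 0.5), bias=0.0):
    """Late fusion: weighted sum of the transformer logit and the two extra
    features, squashed back to a probability. The weights are placeholders;
    in practice they would be fitted on held-out data."""
    z = (weights[0] * logit(transformer_prob)
         + weights[1] * corpus_score(text, positive_counts, negative_counts)
         + weights[2] * has_number(text)
         + bias)
    return sigmoid(z)


# Example: the transformer is undecided (0.5), the extra features tip the score.
positive_counts = {"cases": 9}   # toy token counts, not real data
negative_counts = {"cases": 1}
fused = fuse(0.5, "3 new cases", positive_counts, negative_counts)
```

In this toy example the transformer alone is undecided, and the two additional signals push the fused probability above 0.5. A learned meta-classifier over the same inputs would be a natural alternative to fixed weights.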
