CXP949 at WNUT-2020 Task 2: Extracting Informative COVID-19 Tweets -- RoBERTa Ensembles and The Continued Relevance of Handcrafted Features

Calum Perrio; Harish Tayyar Madabushi

arXiv:2010.07988·cs.CL·October 19, 2020

TL;DR

This paper enhances COVID-19 tweet classification by combining RoBERTa ensembles with handcrafted features, demonstrating improved performance on noisy data outside the pre-training domain.

Contribution

The paper introduces an ensemble approach that incorporates corpus-level information and handcrafted features to improve transformer-based text classification on noisy, domain-specific data.

Findings

01

Including additional features in the ensemble can improve classification results.

02

Handcrafted features contribute to handling noisy data.

03

The submission scored within 2 points of the top-performing team in WNUT-2020 Task 2.

Abstract

This paper presents our submission to Task 2 of the Workshop on Noisy User-generated Text. We explore improving the performance of a pre-trained transformer-based language model fine-tuned for text classification through an ensemble implementation that makes use of corpus level information and a handcrafted feature. We test the effectiveness of including the aforementioned features in accommodating the challenges of a noisy data set centred on a specific subject outside the remit of the pre-training data. We show that inclusion of additional features can improve classification results and achieve a score within 2 points of the top performing team.
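
The general idea of combining several fine-tuned classifiers with a handcrafted feature can be illustrated with a minimal sketch. This is not the authors' implementation: the feature (`has_number`), the weighting scheme, the `feature_weight` value, the threshold, and the label strings are all assumptions chosen for illustration, and the per-model probabilities are passed in as plain numbers rather than produced by RoBERTa.

```python
from statistics import mean
from typing import Sequence


def has_number(tweet: str) -> float:
    """Hypothetical handcrafted feature: 1.0 if the tweet contains a digit."""
    return 1.0 if any(ch.isdigit() for ch in tweet) else 0.0


def ensemble_score(model_probs: Sequence[float], feature: float,
                   feature_weight: float = 0.1) -> float:
    """Average the per-model probabilities (soft voting), then shift the
    result up or down by the handcrafted feature. The weight is an assumed
    value, not one taken from the paper."""
    avg = mean(model_probs)
    score = avg + feature_weight * (feature - 0.5)
    # Keep the combined score inside the valid probability range.
    return min(1.0, max(0.0, score))


def classify(tweet: str, model_probs: Sequence[float],
             threshold: float = 0.5) -> str:
    """Label a tweet from the combined score (label names are assumed)."""
    score = ensemble_score(model_probs, has_number(tweet))
    return "INFORMATIVE" if score >= threshold else "UNINFORMATIVE"


if __name__ == "__main__":
    # Made-up probabilities standing in for three fine-tuned models.
    borderline = [0.48, 0.50, 0.47]
    print(classify("12 new cases confirmed in Leeds", borderline))
    print(classify("stay safe everyone", borderline))
```

In this sketch the handcrafted feature only matters when the averaged model output is close to the threshold, which is one plausible way a simple signal could help on borderline, noisy inputs.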

Code & Models

Repositories

CalumPerrio/WNUT-2020 (PyTorch, official)