An Empirical Study of Sections in Classifying Disease Outbreak Reports

Son Doan; Mike Conway; Nigel Collier

arXiv:1911.09319·cs.CL·November 22, 2019

TL;DR

This study examines how different sections of news articles affect the accuracy of classifying disease outbreak reports, highlighting the importance of section weighting and of specific sections such as the headline.

Contribution

It provides empirical insights into the significance of article sections and section weighting for improving disease report classification accuracy.

Findings

01

Headlines and leading sentences yield a high F-score compared to other parts of the article.

02

Using the full text (all sections) with a bag-of-words representation achieves the highest recall.

03

Section weighting can help to improve classification accuracy.

Abstract

Identifying articles that relate to infectious diseases is a necessary step for any automatic bio-surveillance system that monitors news articles from the Internet. Unlike scientific articles, which are available in a strongly structured form, news articles are usually loosely structured. In this chapter, we investigate the importance of each section and the effect of section weighting on the performance of text classification. The experimental results show that (1) classification models using the headline and leading sentence achieve high performance in terms of F-score compared to other parts of the article; (2) using all sections with a bag-of-words representation (full text) achieves the highest recall; and (3) section weighting information can help to improve accuracy.
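
To make the idea of section weighting concrete, the following is a minimal sketch of a section-weighted bag-of-words representation. It is an illustration only, not the authors' implementation: the abstract does not state the weighting scheme, so the section names, the weight values in `SECTION_WEIGHTS`, the tokenizer, and the function `weighted_bag_of_words` are all assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical weights: the abstract does not give the values used in the study.
SECTION_WEIGHTS = {"headline": 3.0, "lead": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def weighted_bag_of_words(sections, weights=SECTION_WEIGHTS):
    """Sum per-section term counts, scaling each section by its weight.

    Sections missing from `weights` default to a weight of 1.0, so an
    empty `weights` dict reduces to a plain full-text bag of words.
    """
    features = Counter()
    for name, text in sections.items():
        weight = weights.get(name, 1.0)
        for token, count in Counter(tokenize(text)).items():
            features[token] += weight * count
    return features


# Toy article, invented for illustration.
article = {
    "headline": "Cholera outbreak reported",
    "lead": "Officials confirmed a cholera outbreak on Monday.",
    "body": "The outbreak follows heavy flooding in the region.",
}
features = weighted_bag_of_words(article)
```

The resulting term weights could then be passed to any standard text classifier. Under this scheme a term in the headline contributes more to the feature vector than the same term in the body, which is one plausible way to encode the observation that the headline and leading sentence are especially informative.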
