Comparative Study of Long Document Classification

Vedangi Wagh; Snehal Khandve; Isha Joshi; Apurva Wani; Geetanjali Kale; Raviraj Joshi

arXiv:2111.00702 · cs.CL · February 22, 2022

TL;DR

This study systematically compares various machine learning and transformer-based models for long document classification across multiple datasets, highlighting that simpler models often perform competitively with BERT, especially when computational resources are limited.

Contribution

It provides an exhaustive benchmarking of traditional and modern models on standard long document datasets, emphasizing the effectiveness of simple algorithms.

Findings

01. Basic algorithms perform competitively with BERT on most datasets.

02. BERT models perform consistently well across datasets.

03. Simple models like BiLSTM + Max are effective for long document classification.
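
The "+ Max" in "BiLSTM + Max" is read here as max-over-time pooling: taking the element-wise maximum of the per-token outputs so that a document of any length collapses to one fixed-size vector. That reading is an assumption, since this page does not spell it out. The sketch below shows only the pooling step in plain Python. The `max_over_time` helper and the `token_outputs` values are hypothetical stand-ins; in a real model the rows would come from a BiLSTM layer.

```python
from typing import List


def max_over_time(hidden_states: List[List[float]]) -> List[float]:
    """Element-wise maximum over the sequence (time) axis.

    hidden_states holds one feature vector per token; the result has one
    value per feature, regardless of how many tokens the document has.
    """
    if not hidden_states:
        raise ValueError("need at least one time step")
    # zip(*rows) regroups the values by feature, so max() runs per feature.
    return [max(column) for column in zip(*hidden_states)]


# Hypothetical per-token outputs for a 4-token document, 3 features each.
# These numbers are made up for illustration, not taken from the paper.
token_outputs = [
    [0.1, -0.4, 0.3],
    [0.7, 0.2, -0.1],
    [-0.2, 0.9, 0.0],
    [0.5, 0.1, 0.6],
]

doc_vector = max_over_time(token_outputs)
print(doc_vector)  # [0.7, 0.9, 0.6]
```

Because the output size depends only on the feature dimension, the same classifier layer can sit on top of documents with very different lengths, which is one plausible reason such a simple design suits long documents.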

Abstract

The amount of information stored in the form of documents on the internet has been increasing rapidly. Thus, it has become necessary to organize and maintain these documents in an optimal manner. Text classification algorithms study the complex relationships between words in a text and try to interpret the semantics of the document. These algorithms have evolved significantly in the past few years, progressing from simple machine learning algorithms to transformer-based architectures. However, existing literature has analyzed different approaches on different datasets, making it difficult to compare the performance of machine learning algorithms. In this work, we revisit long document classification using standard machine learning approaches. We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets.…

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Residual Connection · Layer Normalization · Dense Connections