Improving the Efficiency of Long Document Classification using Sentence Ranking Approach

Prathamesh Kokate; Mitali Sarnaik; Manavi Khopade; Raviraj Joshi

arXiv:2506.07248 · cs.CL · June 24, 2025

TL;DR

This paper introduces a TF-IDF-based sentence ranking method that selects the key sentences of a long document, preserving near-original classification accuracy while reducing input size and inference time, which makes it practical for real-world applications.

Contribution

The paper proposes a sentence ranking approach that combines TF-IDF scores with sentence length to make long document classification more efficient while keeping accuracy close to the full-document baseline.

Findings

01. Achieves over 50% reduction in input size
02. Reduces inference latency by 43%
03. Maintains near-baseline classification accuracy

Abstract

Long document classification poses challenges due to the computational limitations of transformer-based models, particularly BERT, which are constrained by fixed input lengths and quadratic attention complexity. Moreover, using the full document for classification is often redundant, as only a subset of sentences typically carries the necessary information. To address this, we propose a TF-IDF-based sentence ranking method that improves efficiency by selecting the most informative content. Our approach explores fixed-count and percentage-based sentence selection, along with an enhanced scoring strategy combining normalized TF-IDF scores and sentence length. Evaluated on the MahaNews LDC dataset of long Marathi news articles, the method consistently outperforms baselines such as first, last, and random sentence selection. With MahaBERT-v2, we achieve near-identical classification…
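The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the sentence splitter, the whitespace tokenizer, the smoothed IDF formula, the max-normalization, and the `alpha` weight that blends TF-IDF with sentence length are all assumptions made for illustration, as are the function names `split_sentences`, `score_sentences`, and `select_sentences`. It supports both a fixed count (`k`) and a percentage (`fraction`) of sentences, and returns the kept sentences in their original order.

```python
import math
import re
from collections import Counter

PUNCT = ".,!?;:\"'()\u0964"  # includes the Devanagari danda


def split_sentences(text):
    """Naive splitter on ., !, ? and the danda (assumed, not the paper's)."""
    parts = re.split(r"(?<=[.!?\u0964])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Whitespace tokenizer with punctuation stripped (assumed)."""
    tokens = (w.strip(PUNCT) for w in sentence.lower().split())
    return [t for t in tokens if t]


def score_sentences(sentences):
    """Return (tfidf_scores, lengths), treating each sentence as a document."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    tfidf = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF; one common variant, chosen here as an assumption.
        tfidf.append(sum(c * (math.log((1 + n) / (1 + df[t])) + 1)
                         for t, c in tf.items()))
    return tfidf, [len(t) for t in tokenized]


def select_sentences(text, k=None, fraction=None, alpha=0.5):
    """Keep the top-k (or top fraction) sentences by a blended score."""
    if k is None and fraction is None:
        raise ValueError("provide either k or fraction")
    sentences = split_sentences(text)
    if not sentences:
        return []
    tfidf, lengths = score_sentences(sentences)
    max_t = max(tfidf) or 1.0
    max_l = max(lengths) or 1
    # Hypothetical blend of normalized TF-IDF and normalized length.
    combined = [alpha * t / max_t + (1 - alpha) * l / max_l
                for t, l in zip(tfidf, lengths)]
    if k is None:
        k = max(1, math.ceil(fraction * len(sentences)))
    k = min(k, len(sentences))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: combined[i], reverse=True)
    # Restore document order so the classifier sees coherent text.
    return [sentences[i] for i in sorted(ranked[:k])]
```

The selected sentences would then be concatenated and passed to the classifier in place of the full article. Baselines such as first-k or last-k selection correspond to slicing the sentence list directly (`sentences[:k]` or `sentences[-k:]`), without any scoring.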

Taxonomy

Topics: Text and Document Classification Technologies · Topic Modeling · Sentiment Analysis and Opinion Mining