Improving the Efficiency of Long Document Classification using Sentence Ranking Approach
Prathamesh Kokate, Mitali Sarnaik, Manavi Khopade, Raviraj Joshi

TL;DR
This paper introduces a TF-IDF-based sentence ranking method that selects the most informative sentences in long documents, preserving near-baseline classification accuracy while reducing input size and speeding up inference.
Contribution
The paper proposes a sentence ranking approach that combines normalized TF-IDF scores with sentence length to make long document classification more efficient without sacrificing accuracy.
Findings
Achieves over 50% reduction in input size
Reduces inference latency by 43%
Maintains near-baseline classification accuracy
Abstract
Long document classification poses challenges due to the computational limitations of transformer-based models, particularly BERT, which are constrained by fixed input lengths and quadratic attention complexity. Moreover, using the full document for classification is often redundant, as only a subset of sentences typically carries the necessary information. To address this, we propose a TF-IDF-based sentence ranking method that improves efficiency by selecting the most informative content. Our approach explores fixed-count and percentage-based sentence selection, along with an enhanced scoring strategy combining normalized TF-IDF scores and sentence length. Evaluated on the MahaNews LDC dataset of long Marathi news articles, the method consistently outperforms baselines such as first, last, and random sentence selection. With MahaBERT-v2, we achieve near-identical classification…
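The ranking-and-selection idea in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how TF-IDF is computed or how it is combined with sentence length, so the sketch treats each sentence as its own "document" for IDF, uses mean TF-IDF per token, min-max normalizes both signals, and mixes them with a hypothetical weight `alpha`. The function names and the naive sentence splitter are likewise illustrative.

```python
import math
import re
from collections import Counter


def split_sentences(document):
    # Naive split after ., !, ? or the Devanagari danda (assumed; not the paper's splitter).
    parts = re.split(r"(?<=[.!?\u0964])\s+", document.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    # Whitespace tokenization keeps Devanagari combining marks attached to their words.
    tokens = (t.strip(".,!?;:\"'()\u0964").lower() for t in sentence.split())
    return [t for t in tokens if t]


def _min_max(values):
    # Rescale to [0, 1]; a constant list maps to all zeros.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_sentences(sentences, alpha=0.5):
    # alpha is a hypothetical mixing weight between TF-IDF and length.
    if not sentences:
        return []
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Document frequency, with each sentence treated as a "document" (assumption).
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tfidf = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = sum(c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in counts.items())
        tfidf.append(total / len(tokens) if tokens else 0.0)
    lengths = [float(len(tokens)) for tokens in tokenized]
    return [alpha * t + (1 - alpha) * l for t, l in zip(_min_max(tfidf), _min_max(lengths))]


def select_top_k(sentences, scores, k):
    # Fixed-count selection: keep the k best sentences, restored to document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]


def select_top_percent(sentences, scores, percent):
    # Percentage-based selection: keep the top `percent` of sentences (at least one).
    k = max(1, math.ceil(len(sentences) * percent / 100))
    return select_top_k(sentences, scores, k)


doc = (
    "The match ended. The home team scored three goals in the final minutes "
    "of the championship match. It rained. Fans celebrated the championship "
    "victory across the city late into the night."
)
sentences = split_sentences(doc)
scores = score_sentences(sentences)
shortened = " ".join(select_top_percent(sentences, scores, 50))
print(shortened)
```

The shortened text would then be passed to the classifier in place of the full article. Restoring the selected sentences to their original order is a design choice made here to keep the input readable; the abstract does not state how the selected sentences are ordered.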
Taxonomy
Topics: Text and Document Classification Technologies · Topic Modeling · Sentiment Analysis and Opinion Mining
