BERT for Long Documents: A Case Study of Automated ICD Coding
Arash Afkanpour, Shabir Adeel, Hansenclever Bassani, Arkady Epshteyn, Hongbo Fan, Isaac Jones, Mahan Malihi, Adrian Nauth, Raj Sinha, Sanjana Woonna, Shiva Zamani, Elli Kanal, Mikhail Fomitchev, Donny Cheung

TL;DR
This paper introduces a simple, scalable method for applying existing transformer models such as BERT to long documents. The method significantly improves on previously reported transformer results in automated ICD coding and outperforms one of the prominent CNN-based methods.
Contribution
The paper proposes a simple, scalable approach to adapt BERT for long texts, demonstrating improved performance in ICD coding tasks over prior transformer-based studies.
Findings
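With the proposed method, a transformer-based model outperforms one of the prominent CNN-based methods in ICD coding.
The proposed method significantly improves on the results previously reported for transformer models in ICD coding.
Existing transformer models such as BERT can process long text with a simple, scalable adaptation.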
Abstract
Transformer models have achieved great success across many NLP problems. However, previous studies in automated ICD coding concluded that these models fail to outperform some of the earlier solutions such as CNN-based models. In this paper we challenge this conclusion. We present a simple and scalable method to process long text with the existing transformer models such as BERT. We show that this method significantly improves the previous results reported for transformer models in ICD coding, and is able to outperform one of the prominent CNN-based methods.
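The abstract does not spell out how long text is fed to BERT. One common way to work around a fixed input length, shown here only as an illustration and not as the authors' method, is to split the token sequence into overlapping windows, score each window separately, and pool the per-window label scores into document-level scores. The sketch below assumes this chunk-and-pool scheme; the function names, the window length of 512, the stride of 256, and the choice of max pooling are all hypothetical.

```python
from typing import List, Sequence


def split_into_chunks(
    token_ids: Sequence[int], chunk_len: int = 512, stride: int = 256
) -> List[List[int]]:
    """Split a long token sequence into overlapping fixed-length windows.

    chunk_len and stride are illustrative defaults, not values from the paper.
    """
    if chunk_len <= 0 or stride <= 0:
        raise ValueError("chunk_len and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + chunk_len]))
        # Stop once the current window reaches the end of the document.
        if start + chunk_len >= len(token_ids):
            break
        start += stride
    return chunks


def max_pool_scores(chunk_scores: Sequence[Sequence[float]]) -> List[float]:
    """Combine per-chunk label scores into document-level scores.

    Takes the maximum score per label across chunks (an assumed pooling
    choice), so a code counts as present if any chunk supports it.
    """
    return [max(label_scores) for label_scores in zip(*chunk_scores)]


if __name__ == "__main__":
    windows = split_into_chunks(list(range(10)), chunk_len=4, stride=2)
    print(len(windows), windows[-1])
    # Stand-in scores for two chunks and two labels; a real model would supply these.
    print(max_pool_scores([[0.1, 0.9], [0.4, 0.2]]))
```

In a multi-label setting such as ICD coding, each window would be encoded by the model and mapped to one score per code before pooling; mean pooling or a learned attention layer over windows are alternative aggregation choices.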
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Music and Audio Processing
Methods: Attention Is All You Need · Layer Normalization · Residual Connection · Dropout · Softmax · WordPiece · Linear Warmup With Linear Decay · Weight Decay · Attention Dropout
