Segmented Harmonic Loss: Handling Class-Imbalanced Multi-Label Clinical Data for Medical Coding with Large Language Models

Surjya Ray; Pratik Mehta; Hongen Zhang; Ada Chaman; Jian Wang; Chung-Jen Ho; Michael Chiou; Tashfeen Suleman

arXiv:2310.04595·cs.CL·October 10, 2023

TL;DR

This paper introduces Segmented Harmonic Loss, a novel loss function designed to improve multi-label medical coding with large language models by addressing class imbalance and noise, leading to significant performance improvements.

Contribution

We propose Segmented Harmonic Loss and a segmentation algorithm to handle class imbalance and noise in medical coding datasets for LLMs, achieving notable accuracy gains.

Findings

01. Achieved over 10% F1 score improvement on noisy, long-tailed datasets.

02. Demonstrated effectiveness of the loss function on MIMIC III and IV datasets.

03. Showed LLMs outperform state-of-the-art methods with the proposed approach.

Abstract

The precipitous rise and adoption of Large Language Models (LLMs) have shattered expectations with the fastest adoption rate of any consumer-facing technology in history. Healthcare, a field that traditionally uses NLP techniques, was bound to be affected by this meteoric rise. In this paper, we gauge the extent of the impact by evaluating the performance of LLMs for the task of medical coding on real-life noisy data. We conducted several experiments on MIMIC III and IV datasets with encoder-based LLMs, such as BERT. Furthermore, we developed Segmented Harmonic Loss, a new loss function to address the extreme class imbalance that we found to prevail in most medical data in a multi-label scenario by segmenting and decoupling co-occurring classes of the dataset with a new segmentation algorithm. We also devised a technique based on embedding similarity to tackle noisy data. Our…

Taxonomy

Topics: Machine Learning in Healthcare · Medical Coding and Health Information · AI in cancer detection

Methods: Linear Layer · Dropout · Weight Decay · Multi-Head Attention · Softmax · Attention Is All You Need · Linear Warmup With Linear Decay · WordPiece · Attention Dropout