Complete Evidence Extraction with Model Ensembles: A Case Study on Medical Coding

Katharina Beckh; Sven Heuser; Stefan Rüping

arXiv:2511.07055·cs.CL·May 12, 2026

TL;DR

This paper explores using model ensembles to extract complete evidence in medical coding, significantly improving evidence recall by aggregating multiple models' token-level evidence.

Contribution

It introduces a novel ensemble approach based on the Rashomon effect to enhance evidence completeness in high-stakes decision support systems.

Findings

01 Rashomon ensembles significantly increase evidence recall, with only a small token overhead over individual models.

02 Ensembles of only three models already outperform the best single model.

03 Ensembles recover evidence that individual models miss.

Abstract

High-stakes decisions informed by decision support systems require explicit evidence. While prior work focuses on short sufficient evidence, regulatory compliance and medical billing call for complete evidence: all relevant input tokens that support a decision. We formulate complete evidence extraction as a task and study it in a medical coding setting. Motivated by the Rashomon effect, we aggregate token-level evidence from multiple language models to increase evidence completeness. We perform a case study using existing equally-performing models, feature attributions, and a dataset with human-annotated evidence. Our results show that Rashomon ensembles significantly increase evidence recall while incurring only a small token overhead over individual models. Ensembles of only three models already outperform the best single model and recover information that individual models miss.
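
The aggregation idea in the abstract can be illustrated with a minimal sketch. It assumes that each model's feature attributions have already been reduced to a set of token indices marked as evidence, and that the ensemble simply takes the union of those sets; the abstract does not specify the aggregation rule, so the union is an assumption made here for illustration. The function names `ensemble_evidence` and `evidence_recall` and the toy token indices are hypothetical and do not come from the paper.

```python
from typing import Iterable, Set


def ensemble_evidence(per_model_evidence: Iterable[Set[int]]) -> Set[int]:
    """Combine token-level evidence by taking the union over all models.

    Assumption: union aggregation; the paper may use a different rule.
    """
    combined: Set[int] = set()
    for evidence in per_model_evidence:
        combined |= evidence
    return combined


def evidence_recall(predicted: Set[int], gold: Set[int]) -> float:
    """Fraction of human-annotated evidence tokens that were recovered."""
    if not gold:
        raise ValueError("gold evidence must not be empty")
    return len(predicted & gold) / len(gold)


# Toy data (hypothetical): token indices annotated as evidence by a human,
# and the evidence tokens selected by three different models.
gold = {2, 3, 4, 7, 8}
model_a = {2, 3}
model_b = {3, 4, 9}
model_c = {7}

ensemble = ensemble_evidence([model_a, model_b, model_c])

for name, evidence in [("A", model_a), ("B", model_b), ("C", model_c)]:
    print(f"model {name}: recall={evidence_recall(evidence, gold):.2f}, "
          f"tokens={len(evidence)}")
print(f"ensemble: recall={evidence_recall(ensemble, gold):.2f}, "
      f"tokens={len(ensemble)}")
```

In this toy case each model recovers a different part of the annotated evidence, so the union reaches a higher recall than any single model while selecting only a few more tokens than the largest individual set. Whether real models behave this way depends on how much their attributions actually differ.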
