LLMs can construct powerful representations and streamline sample-efficient supervised learning
Ilker Demirel, Lawrence Shi, Zeshan Hussain, David Sontag

TL;DR
This paper introduces an LLM-based agentic pipeline that creates interpretive rubrics to improve input representation and sample efficiency in supervised learning, especially for complex multimodal data.
Contribution
The paper presents a method in which LLMs generate global and local rubrics that standardize and interpret inputs, improving performance and operational efficiency on clinical tasks.
Findings
Across 15 clinical tasks from the EHRSHOT benchmark, rubric-based approaches significantly outperform count-feature models, naive models, and large pretrained clinical models.
The approach improves interpretability and auditability of input representations.
Operational advantages include cost-effectiveness and ease of scaling.
Abstract
As real-world datasets become more complex and heterogeneous, supervised learning is often bottlenecked by input representation design. Modeling multimodal data, such as time-series, free text, and structured records, often requires non-trivial domain expertise. We propose an agentic pipeline to streamline this process. First, an LLM analyzes a small but diverse subset of text-serialized input examples in-context to synthesize a global rubric, which acts as a programmatic specification for extracting and organizing evidence. This rubric is then used to transform naive text-serializations of inputs into a more standardized format for downstream models. We also describe local rubrics, which are task-conditioned interpretive summaries generated by an LLM. Across 15 clinical tasks from the EHRSHOT benchmark, our rubric approaches significantly outperform count-feature models, naive LLM…
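The two-stage flow described in the abstract (pick a small, diverse subset, have an LLM synthesize a global rubric from it, then use the rubric to standardize each input) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable type, the Jaccard-based diversity heuristic, the prompt wording, and every function name are hypothetical, and the rubric is applied here through a second LLM call, which may differ from how the paper applies it.

```python
from typing import Callable, List

# Hypothetical interface: prompt text in, completion text out.
LLM = Callable[[str], str]


def jaccard_distance(a: str, b: str) -> float:
    """Token-set Jaccard distance; a stand-in diversity measure (assumption)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def select_diverse_subset(examples: List[str], k: int) -> List[str]:
    """Greedy farthest-point selection of up to k examples (assumed heuristic)."""
    if not examples or k <= 0:
        return []
    chosen = [examples[0]]
    while len(chosen) < min(k, len(examples)):
        remaining = [e for e in examples if e not in chosen]
        if not remaining:  # only duplicates of already chosen examples are left
            break
        # Pick the example whose nearest chosen neighbor is farthest away.
        best = max(
            remaining,
            key=lambda e: min(jaccard_distance(e, c) for c in chosen),
        )
        chosen.append(best)
    return chosen


def synthesize_global_rubric(llm: LLM, examples: List[str], task: str) -> str:
    """Ask the LLM for one rubric covering all in-context examples."""
    prompt = (
        f"Task: {task}\n"
        "Below are text-serialized input examples.\n"
        "Write a rubric specifying which evidence to extract and how to organize it.\n\n"
        + "\n---\n".join(examples)
    )
    return llm(prompt)


def apply_global_rubric(llm: LLM, rubric: str, serialized_input: str) -> str:
    """Rewrite one naive serialization into the rubric's standardized format."""
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Input:\n{serialized_input}\n\n"
        "Rewrite the input so that it follows the rubric."
    )
    return llm(prompt)


def generate_local_rubric(llm: LLM, serialized_input: str, task: str) -> str:
    """Per-input, task-conditioned interpretive summary (the 'local rubric')."""
    prompt = (
        f"Task: {task}\n\n"
        f"Input:\n{serialized_input}\n\n"
        "Summarize the evidence in this input that is relevant to the task."
    )
    return llm(prompt)
```

In this sketch the global rubric is produced once per task and reused for every input, whereas a local rubric costs one LLM call per input. The standardized text (or the local summary) would then be passed to whatever downstream supervised model is being trained.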
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Biomedical Text Mining and Ontologies
