ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models
Benjamin Feuer, Yurong Liu, Chinmay Hegde, Juliana Freire

TL;DR
ArcheType is a zero-shot framework that uses large language models for semantic column type annotation (CTA). It outperforms existing methods and carries over to domain-specific and cross-dataset settings.
Contribution
The paper introduces ArcheType, a zero-shot CTA framework built on LLMs that combines context sampling, prompt serialization, model querying, and label remapping, and reports state-of-the-art zero-shot results.
Findings
ArcheType achieves new state-of-the-art zero-shot CTA performance.
It outperforms fine-tuned models on the SOTAB benchmark.
The method is effective across multiple domain-specific benchmarks.
Abstract
Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type and incur large run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains.…
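The four stages named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the sampling rule (random distinct values), the prompt wording, the remapping rules (exact match, then substring, then closest string), and every function name here are assumptions chosen for illustration. The model is passed in as a plain callable, `query_model`, so the sketch runs without any LLM library.

```python
import random
from difflib import get_close_matches


def sample_context(column, k=5, seed=0):
    """Context sampling (assumed rule): up to k distinct, non-empty values."""
    unique = sorted({str(v).strip() for v in column if str(v).strip()})
    if len(unique) <= k:
        return unique
    return random.Random(seed).sample(unique, k)


def serialize_prompt(samples, labels):
    """Prompt serialization (hypothetical wording, not the paper's prompt)."""
    return (
        "Column values: " + ", ".join(samples) + "\n"
        "Choose exactly one type from: " + ", ".join(labels) + "\n"
        "Type:"
    )


def remap_label(raw, labels):
    """Label remapping: force a free-form answer into the allowed label set."""
    cleaned = raw.strip().lower()
    by_lower = {label.lower(): label for label in labels}
    # 1) Exact match after normalization.
    if cleaned in by_lower:
        return by_lower[cleaned]
    # 2) The answer contains an allowed label somewhere in its text.
    for lowered, label in by_lower.items():
        if lowered in cleaned:
            return label
    # 3) Fall back to the most similar label (cutoff=0.0 always returns one).
    closest = get_close_matches(cleaned, list(by_lower), n=1, cutoff=0.0)
    return by_lower[closest[0]]


def annotate_column(column, labels, query_model):
    """Run the pipeline: sample, serialize, query the model, remap."""
    prompt = serialize_prompt(sample_context(column), labels)
    return remap_label(query_model(prompt), labels)
```

A usage example with a stand-in model: `annotate_column(["France", "Peru", "Kenya"], ["country", "person name", "date"], lambda prompt: " Country.\n")` returns `"country"`, because the remapping step normalizes the noisy answer back to an allowed label. Because the label set is only an argument at call time, it can change between calls without retraining, which is the property the abstract contrasts with types fixed at training time.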
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
