TL;DR
This paper introduces Columbo, a large language model (LLM)-based approach for expanding abbreviated column names in tabular data. It improves accuracy over previous methods and is demonstrated in a real-world environmental sciences application.
Contribution
The paper presents new datasets with real-world abbreviations, proposes synonym-aware accuracy measures, and develops Columbo, a novel LLM-based method that significantly outperforms existing solutions.
Findings
Columbo outperforms NameGuess by 4-29% across datasets.
New datasets with real-world abbreviations improve evaluation relevance.
Synonym-aware measures better capture true expansion accuracy.
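To make the idea of a synonym-aware measure concrete, here is a minimal Python sketch. It is not the paper's metric: the synonym groups, the function names, and the token-by-token comparison are all assumptions chosen for illustration. It only shows why plain exact match can undercount correct expansions such as "employee pay" for the gold label "employee salary".

```python
# Hypothetical synonym groups for illustration only; the paper's actual
# synonym resources and matching rules are not described in this summary.
SYNONYM_GROUPS = [
    {"salary", "pay", "wage"},
    {"identifier", "id"},
]


def canonical(token: str) -> str:
    """Map a token to a fixed representative of its synonym group, or itself."""
    token = token.lower()
    for group in SYNONYM_GROUPS:
        if token in group:
            return min(group)  # any deterministic representative works
    return token


def exact_match(pred: str, gold: str) -> bool:
    """Strict measure: the token sequences must be identical (case-insensitive)."""
    return pred.lower().split() == gold.lower().split()


def synonym_aware_match(pred: str, gold: str) -> bool:
    """Relaxed measure: tokens match if they fall in the same synonym group."""
    return [canonical(t) for t in pred.split()] == [canonical(t) for t in gold.split()]


def accuracy(pairs, match) -> float:
    """Fraction of (prediction, gold) pairs accepted by the given match function."""
    if not pairs:
        return 0.0
    return sum(match(p, g) for p, g in pairs) / len(pairs)


# Toy (prediction, gold) pairs for the abbreviation "esal".
pairs = [
    ("employee pay", "employee salary"),     # correct up to a synonym
    ("employee salary", "employee salary"),  # exact match
    ("emergency sale", "employee salary"),   # wrong expansion
]

exact_acc = accuracy(pairs, exact_match)
synonym_acc = accuracy(pairs, synonym_aware_match)
```

On these toy pairs, exact match accepts one of three predictions while the synonym-aware variant accepts two, which is the kind of undercounting the findings refer to.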
Abstract
Expanding the abbreviated column names of tables, such as "esal" to "employee salary", is critical for many downstream NLP tasks for tabular data, such as NL2SQL, table QA, and keyword search. This problem arises in enterprises, domain sciences, government agencies, and more. In this paper, we make three contributions that significantly advance the state of the art. First, we show that the synthetic public data used by prior work has major limitations, and we introduce four new datasets in enterprise/science domains, with real-world abbreviations. Second, we show that the accuracy measures used by prior work seriously undercount correct expansions, and we propose new synonym-aware measures that reflect true accuracy much more faithfully. Finally, we develop Columbo, a powerful LLM-based solution that exploits context, rules, chain-of-thought reasoning, and token-level analysis. Extensive…