CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting
David Maria Schmidt, Raoul Schubert, Philipp Cimiano

TL;DR
This paper introduces CompoST, a benchmark that evaluates how well large language models can map complex, compositional questions to SPARQL queries, and it reveals significant limitations in how systematically they interpret such questions.
Contribution
The paper presents a controlled benchmark of three datasets of varying difficulty, generated from graph patterns in DBpedia and verbalized with Lemon lexica, to test whether LLMs can interpret structurally complex questions once they have seen the atomic building blocks.
Findings
Performance drops significantly with increased question complexity
Even with full input information, F1 scores remain low (~0.57)
LLMs have limited ability to interpret complex questions compositionally
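To make the F1 figure above concrete, here is a minimal sketch of an answer-set F1 in Python. It assumes that a predicted query and a gold query are compared by the sets of results they return; the paper's exact evaluation protocol is not given in this summary, so the function name `answer_set_f1` and the handling of empty sets are illustrative assumptions.

```python
def answer_set_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 between the result sets of a predicted and a gold query.

    Assumption (not from the paper): two empty result sets count as a
    perfect match, and zero overlap yields 0.0.
    """
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# One of two predicted answers is correct, and one of two gold answers is found.
print(answer_set_f1({"a", "b"}, {"b", "c"}))
```

Under this definition, a query that returns half correct answers and misses half of the gold answers scores 0.5.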
Abstract
Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts. Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries. An open question is how systematic this interpretation process is. Toward this question, in this paper, we propose a benchmark for investigating to what extent the abilities of LLMs to interpret questions are actually compositional. For this, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization. Our datasets are created in a very controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks. This allows…
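As a rough illustration of what composing atomic building blocks can look like, the sketch below chains single triple patterns into one SPARQL query. The helper `chain_query` and the example URIs are hypothetical and are not taken from the benchmark; the actual graph patterns used in CompoST may differ.

```python
def chain_query(start_entity: str, properties: list[str]) -> str:
    """Build a SPARQL SELECT query that follows a chain of properties.

    Each property adds one atomic triple pattern; the object variable of
    one pattern becomes the subject of the next (hypothetical helper).
    """
    if not properties:
        raise ValueError("at least one property is required")
    patterns = []
    subject = f"<{start_entity}>"
    for index, prop in enumerate(properties):
        obj = f"?v{index}"
        patterns.append(f"  {subject} <{prop}> {obj} .")
        subject = obj  # the next pattern starts where this one ended
    body = "\n".join(patterns)
    return f"SELECT DISTINCT {subject} WHERE {{\n{body}\n}}"


# Illustrative URIs only: one atomic pattern, then a two-step composition.
atomic = chain_query(
    "http://dbpedia.org/resource/Germany",
    ["http://dbpedia.org/ontology/capital"],
)
composed = chain_query(
    "http://dbpedia.org/resource/Germany",
    ["http://dbpedia.org/ontology/capital", "http://dbpedia.org/ontology/mayor"],
)
print(composed)
```

The composed query reuses the atomic pattern unchanged and adds one more hop, which is the kind of structural step a compositional interpreter would need to handle.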
