SetLexSem Challenge: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models
Bardiya Akhbari, Manish Gawali, Nicholas A. Dronen

TL;DR
The paper introduces the SetLexSem Challenge, a synthetic benchmark to evaluate the robustness of large language models in performing set operations under lexical and semantic variations, revealing significant robustness issues.
Contribution
The paper presents SetLexSem, a new benchmark that systematically tests whether LLMs' performance on set operations stays invariant under lexical and semantic variations in the operands, and it highlights where that invariance breaks down.
Findings
LLMs show poor robustness to variations in operations and operands.
LLMs exhibit specific failure modes with semantic groupings of sets.
Measuring robustness to frequency and length variations is challenging.
Abstract
Set theory is foundational to mathematics and, when sets are finite, to reasoning about the world. An intelligent system should perform set operations consistently, regardless of superficial variations in the operands. Initially designed for semantically-oriented NLP tasks, large language models (LLMs) are now being evaluated on algorithmic tasks. Because sets are comprised of arbitrary symbols (e.g. numbers, words), they provide an opportunity to test, systematically, the invariance of LLMs' algorithmic abilities under simple lexical or semantic variations. To this end, we present the SetLexSem Challenge, a synthetic benchmark that evaluates the performance of LLMs on set operations. SetLexSem assesses the robustness of LLMs' instruction-following abilities under various conditions, focusing on the set operations and the nature and construction of the set members. Evaluating seven LLMs…
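To make the setup concrete, the following is a minimal sketch of how a synthetic set-operation instance could be generated and scored. It is not the authors' code: the vocabularies, the prompt wording, the four operations listed, and the names `make_instance` and `is_correct` are all assumptions chosen for illustration.

```python
import random

# Hypothetical operand vocabularies (assumed; not taken from the benchmark).
NUMBERS = [str(n) for n in range(100)]
WORDS = ["apple", "river", "stone", "cloud", "piano",
         "tiger", "lemon", "chair", "maple", "ocean"]

# Ground-truth implementations of the set operations under test (assumed list).
OPERATIONS = {
    "union": lambda a, b: a | b,
    "intersection": lambda a, b: a & b,
    "difference": lambda a, b: a - b,
    "symmetric difference": lambda a, b: a ^ b,
}


def make_instance(vocab, operation, set_size, rng):
    """Sample two sets from vocab and build a prompt plus its exact answer."""
    a = set(rng.sample(vocab, set_size))
    b = set(rng.sample(vocab, set_size))
    prompt = (
        f"A = {sorted(a)}\n"
        f"B = {sorted(b)}\n"
        f"Return the {operation} of A and B as a list."
    )
    return {"prompt": prompt, "answer": OPERATIONS[operation](a, b)}


def is_correct(predicted, answer):
    """Order-insensitive exact match between a model's list and the true set."""
    return set(predicted) == answer


if __name__ == "__main__":
    rng = random.Random(0)
    # Same operation, two operand types: a robust model should score alike on both.
    for name, vocab in [("numbers", NUMBERS), ("words", WORDS)]:
        inst = make_instance(vocab, "intersection", 4, rng)
        print(name, inst["prompt"], sorted(inst["answer"]), sep="\n")
```

Holding the operation fixed while swapping only the vocabulary isolates the superficial variation, so any accuracy gap between the two conditions reflects sensitivity to the operands rather than to the task.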
Taxonomy
Topics: Semantic Web and Ontologies · Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
