Semantic-KG: Using Knowledge Graphs to Construct Benchmarks for Measuring Semantic Similarity
Qiyao Wei, Edward Morrell, Lea Goetz, Mihaela van der Schaar

TL;DR
This paper introduces a knowledge-graph-based method for generating domain-specific benchmarks that evaluate semantic similarity methods for LLM outputs. It addresses limitations of existing benchmarks and shows that method performance depends on both the domain and the type of semantic variation.
Contribution
The paper presents a novel KG-based approach for creating semantic similarity benchmarks across multiple domains, reducing reliance on subjective human judgment.
Findings
Semantic variation sub-types affect similarity method performance
Domain influences the effectiveness of similarity measures
No single method outperforms others across all settings
Abstract
Evaluating the open-form textual responses generated by Large Language Models (LLMs) typically requires measuring the semantic similarity of the response to a (human generated) reference. However, there is evidence that current semantic similarity methods may capture syntactic or lexical forms over semantic content. While benchmarks exist for semantic equivalence, they often suffer from high generation costs due to reliance on subjective human judgment, limited availability for domain-specific applications, and unclear definitions of equivalence. This paper introduces a novel method for generating benchmarks to evaluate semantic similarity methods for LLM outputs, specifically addressing these limitations. Our approach leverages knowledge graphs (KGs) to generate pairs of natural-language statements that are semantically similar or dissimilar, with dissimilar pairs categorized into one…
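As a rough illustration of the idea in the abstract, the sketch below builds labeled statement pairs from a toy knowledge graph. Everything in it is an assumption made for illustration, not the authors' pipeline: the triples, the template-based `verbalize` function, the use of sentence reordering to produce a "similar" pair, and the single `perturb_object` edit used to produce a "dissimilar" pair. The paper's own categories of dissimilarity are not reproduced here.

```python
import random

# Toy knowledge graph as (subject, relation, object) triples (hypothetical data).
TRIPLES = [
    ("Paris", "is_located_in", "France"),
    ("France", "borders", "Spain"),
    ("Madrid", "is_located_in", "Spain"),
]


def verbalize(triples):
    """Render triples as plain sentences; a simple template stand-in for real text generation."""
    return " ".join(f"{s} {r.replace('_', ' ')} {o}." for s, r, o in triples)


def perturb_object(triples, rng):
    """Return a copy in which one triple's object is replaced by a different entity."""
    triples = list(triples)
    idx = rng.randrange(len(triples))
    s, r, o = triples[idx]
    # Candidate replacements: every entity in the graph except the triple's own endpoints.
    candidates = sorted({e for t in triples for e in (t[0], t[2])} - {s, o})
    triples[idx] = (s, r, rng.choice(candidates))
    return triples


def make_pair(triples, rng, similar):
    """Return (reference, variant, label), with label 1 for similar and 0 for dissimilar."""
    reference = verbalize(triples)
    if similar:
        # Same facts, possibly in a different order: the meaning is unchanged.
        shuffled = list(triples)
        rng.shuffle(shuffled)
        return reference, verbalize(shuffled), 1
    # One fact altered in the underlying graph: the meaning changes.
    return reference, verbalize(perturb_object(triples, rng)), 0


if __name__ == "__main__":
    rng = random.Random(0)
    for similar in (True, False):
        print(make_pair(TRIPLES, rng, similar))
```

Because the label comes from the graph edit rather than from an annotator, a similarity method can be scored by checking whether it ranks label-1 pairs above label-0 pairs.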
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Artificial Intelligence in Healthcare and Education
