Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking
Eyhab Al-Masri

TL;DR
This paper introduces a benchmarking framework to measure and analyze the divergence among large language models in API discovery and ranking, revealing domain-dependent stability and potential safety risks.
Contribution
The study provides a systematic method to quantify inter-LLM divergence across multiple domains, highlighting stability in structured tasks and instability in open-ended tasks.
Findings
Moderate overall agreement among models (Average Overlap, AO, about 0.50; Kendall's tau about 0.45)
Structured tasks (Weather, Speech-to-Text) are more stable; open-ended tasks (Sentiment Analysis) diverge substantially more
Consensus can mask instability, posing safety risks in multi-agent systems
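To make the pairwise agreement scores in the findings concrete, here is a minimal Python sketch of three of the metrics the paper names: Jaccard similarity, Average Overlap, and Kendall's tau. It is an illustration under stated assumptions, not the paper's implementation. The API identifiers are hypothetical placeholders, and restricting tau to the items both lists share is one common convention that the paper may not follow.

```python
from itertools import combinations


def jaccard(a, b):
    """Set overlap of two API lists, ignoring order."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def average_overlap(a, b, depth=None):
    """Mean of |top-d(a) & top-d(b)| / d for d = 1..depth."""
    k = depth if depth is not None else min(len(a), len(b))
    if k == 0:
        return 1.0
    total = 0.0
    for d in range(1, k + 1):
        total += len(set(a[:d]) & set(b[:d])) / d
    return total / k


def kendall_tau(a, b):
    """Kendall's tau over the items both rankings share (assumption)."""
    pos_b = {item: i for i, item in enumerate(b)}
    shared = [item for item in a if item in pos_b]
    n = len(shared)
    if n < 2:
        return 0.0
    concordant = discordant = 0
    # Each pair (x, y) is ordered x-before-y in ranking a.
    for x, y in combinations(shared, 2):
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical top-4 API rankings returned by two models for one task.
model_a = ["api_1", "api_2", "api_3", "api_4"]
model_b = ["api_2", "api_1", "api_4", "api_5"]

scores = {
    "jaccard": jaccard(model_a, model_b),
    "average_overlap": average_overlap(model_a, model_b),
    "kendall_tau": kendall_tau(model_a, model_b),
}
```

For these two placeholder lists the scores are 0.6 (Jaccard), about 0.604 (Average Overlap), and about 0.333 (tau). The two models share most of their APIs, yet the rank-sensitive scores come out differently from the set-based one, which is why set-based and rank-based metrics are reported side by side.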
Abstract
Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We present a unified benchmarking framework to quantify inter-LLM divergence, defined as the extent to which models differ in API discovery and ranking under identical tasks. Across 15 canonical API domains and 5 major model families, we measure pairwise and group-level agreement using set-, rank-, and consensus-based metrics including Average Overlap, Jaccard similarity, Rank-Biased Overlap, Kendall's tau, Kendall's W, and Cronbach's alpha. Results show moderate overall alignment (AO about 0.50, tau about 0.45) but strong domain dependence: structured tasks (Weather, Speech-to-Text) are stable, while open-ended tasks (Sentiment Analysis) exhibit substantially higher divergence. Volatility…
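The group-level side of the framework can be sketched with Kendall's W, the coefficient of concordance listed among the abstract's metrics. The sketch below uses the textbook formula without tie correction and assumes every model ranks the same set of APIs. The function name, the input format, and the example API identifiers are illustrative assumptions, not details taken from the paper.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance (no tie correction).

    rankings: list of ordered item lists, one per model, all over
    the same item set (an assumption of this sketch).
    Returns a value in [0, 1]; 1 means identical rankings.
    """
    m = len(rankings)
    items = sorted(rankings[0])
    n = len(items)
    if m < 2 or n < 2:
        raise ValueError("need at least 2 rankings of at least 2 items")
    for r in rankings:
        if sorted(r) != items:
            raise ValueError("all rankings must cover the same items")

    # Sum of the 1-based ranks each item receives across models.
    rank_sums = {item: 0 for item in items}
    for r in rankings:
        for position, item in enumerate(r, start=1):
            rank_sums[item] += position

    mean_rank_sum = m * (n + 1) / 2
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums.values())
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical rankings from three models over the same three APIs.
full_agreement = [["api_1", "api_2", "api_3"]] * 3
rotated = [
    ["api_1", "api_2", "api_3"],
    ["api_2", "api_3", "api_1"],
    ["api_3", "api_1", "api_2"],
]

w_agree = kendalls_w(full_agreement)
w_rotated = kendalls_w(rotated)
```

Identical rankings give W = 1, while the rotated example gives W = 0 because every API receives the same rank sum. A single group-level number like this summarizes the whole panel, so it is best read alongside the pairwise scores rather than in place of them.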
