Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking

Eyhab Al-Masri

arXiv:2604.22760·cs.IR·April 28, 2026

TL;DR

This paper introduces a benchmarking framework to measure and analyze the divergence among large language models in API discovery and ranking, revealing domain-dependent stability and potential safety risks.

Contribution

The study provides a systematic method to quantify inter-LLM divergence across 15 canonical API domains and 5 major model families, showing that structured tasks are stable while open-ended tasks are not.

Findings

01

Moderate overall agreement among models (Average Overlap about 0.50, Kendall's tau about 0.45)

02

Structured tasks (Weather, Speech-to-Text) are stable; open-ended tasks (Sentiment Analysis) diverge substantially more

03

Consensus can mask instability, posing safety risks in multi-agent systems
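
Group-level consensus of the kind the third finding refers to is often summarized with Kendall's W (coefficient of concordance), one of the metrics the paper names. The following is a minimal sketch of the textbook formula, not the paper's implementation: it assumes every model ranks the same set of items with no ties, and the function name `kendalls_w` and the placeholder item names are hypothetical.

```python
def kendalls_w(rankings):
    """Kendall's W for m complete, tie-free rankings of the same n items.

    rankings: list of lists, each an ordering (best first) of the same items.
    Returns a value in [0, 1]; 1 means all rankings are identical.
    """
    m = len(rankings)
    items = rankings[0]
    n = len(items)
    if m < 2 or n < 2:
        raise ValueError("need at least two rankings of at least two items")
    if any(len(r) != n or set(r) != set(items) for r in rankings):
        raise ValueError("all rankings must order the same items (assumption)")

    # Sum the 1-based rank each item receives across all rankings.
    rank_sums = {item: 0 for item in items}
    for ranking in rankings:
        for position, item in enumerate(ranking, start=1):
            rank_sums[item] += position

    # S is the squared deviation of the rank sums from their mean.
    mean_rank_sum = m * (n + 1) / 2
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums.values())
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical rankings from three models over three placeholder APIs.
example_rankings = [
    ["api_a", "api_b", "api_c"],
    ["api_a", "api_c", "api_b"],
    ["api_b", "api_a", "api_c"],
]
w = kendalls_w(example_rankings)
```

Because W aggregates over all models, it is worth inspecting alongside pairwise scores: a single summary value does not show which pairs of models disagree.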

Abstract

Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We present a unified benchmarking framework to quantify inter-LLM divergence, defined as the extent to which models differ in API discovery and ranking under identical tasks. Across 15 canonical API domains and 5 major model families, we measure pairwise and group-level agreement using set-, rank-, and consensus-based metrics including Average Overlap, Jaccard similarity, Rank-Biased Overlap, Kendall's tau, Kendall's W, and Cronbach's alpha. Results show moderate overall alignment (AO about 0.50, tau about 0.45) but strong domain dependence: structured tasks (Weather, Speech-to-Text) are stable, while open-ended tasks (Sentiment Analysis) exhibit substantially higher divergence. Volatility…
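
To make the set- and rank-based metrics concrete, here is a minimal sketch of Jaccard similarity and Kendall's tau for one pair of models. It is an illustration under stated assumptions, not the paper's code: the rankings are hypothetical placeholders, tau is computed as tau-a over only the items both lists share, and ties are assumed absent. How the paper treats items that appear in only one list is not stated in the abstract.

```python
from itertools import combinations


def jaccard(a, b):
    """Set overlap between two API lists, ignoring order."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def kendall_tau(a, b):
    """Kendall's tau-a restricted to items present in both lists (assumption)."""
    in_b = set(b)
    shared = [item for item in a if item in in_b]
    if len(shared) < 2:
        return float("nan")  # undefined with fewer than two shared items
    position_in_b = {item: i for i, item in enumerate(b)}

    concordant = discordant = 0
    # `shared` follows a's order, so x always precedes y in a.
    for x, y in combinations(shared, 2):
        if position_in_b[x] < position_in_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical top-4 API rankings returned by two models for the same task.
model_a = ["api_1", "api_2", "api_3", "api_4"]
model_b = ["api_2", "api_1", "api_4", "api_5"]

overlap = jaccard(model_a, model_b)   # 3 shared of 5 distinct items
tau = kendall_tau(model_a, model_b)   # 2 concordant pairs, 1 discordant
```

Jaccard captures whether the two models discover the same APIs, while tau captures whether they order the shared ones the same way, so the two can diverge for the same pair of lists.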
