Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems
Raad Khraishi, Iman Zafar, Katie Myles, Greig A Cowan

TL;DR
This paper investigates how switching models mid-conversation in multi-turn LLM systems causes performance drift, quantifies its effects, and proposes metrics for monitoring and mitigating this issue.
Contribution
It introduces a switch-matrix benchmark to measure performance drift due to model handoffs and analyzes compatibility patterns across models and benchmarks.
Findings
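Even a single-turn handoff produces statistically significant, directional performance changes: from -8 to +13 percentage points in Multi-IF strict success rate and about ±4 absolute F1 on CoQA.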
Some models degrade or improve regardless of dialogue history, indicating compatibility patterns.
Switch-induced drift can explain up to 70% of variance in performance across benchmarks.
Abstract
Deployed multi-turn LLM systems routinely switch models mid-interaction due to upgrades, cross-provider routing, and fallbacks. Such handoffs create a context mismatch: the model generating later turns must condition on a dialogue prefix authored by a different model, potentially inducing silent performance drift. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. Across CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent and statistically significant, directional effects and may swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and +/- 4 absolute F1 on CoQA, comparable to the no-switch gap between common model tiers (e.g.,…
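The paired episode-level bootstrap comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `paired_bootstrap_ci`, the percentile method, the resample count, and the example scores are all assumptions made for illustration. The sketch also assumes that each episode has one score from the switched run (prefix model for early turns, suffix model for the final turn) and one from a no-switch run on the same episode.

```python
import random
from statistics import mean


def paired_bootstrap_ci(switched, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-episode difference.

    switched[i] and baseline[i] must be scores for the same episode i, so
    episodes are resampled as pairs (hypothetical helper, not from the paper).
    """
    if len(switched) != len(baseline) or not switched:
        raise ValueError("need two equal-length, non-empty score lists")

    # Per-episode drift: switched run minus no-switch run.
    diffs = [s - b for s, b in zip(switched, baseline)]
    n = len(diffs)
    rng = random.Random(seed)

    # Resample episodes with replacement and record the mean drift each time.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))

    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


# Hypothetical per-episode success indicators (1.0 = success), for illustration only.
switched_scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
baseline_scores = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
drift, ci_low, ci_high = paired_bootstrap_ci(switched_scores, baseline_scores)
print(f"drift={drift:+.3f}, 95% CI=({ci_low:+.3f}, {ci_high:+.3f})")
```

Under this sketch, a handoff would be flagged as having a directional effect when the interval excludes zero. Pairing on episodes removes between-episode difficulty from the comparison, which is the usual reason for preferring a paired design.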
Taxonomy
Topics: Software-Defined Networks and 5G · IPv6, Mobility, Handover, Networks, Security · Distributed systems and fault tolerance
