Routesplain: Towards Faithful and Intervenable Routing for Software-related Tasks
Adam Štorek, Vikas Upadhyay, Marianne Menglin Liu, Daniel W. Peterson, Anshul Mittal, Sujeeth Bharadwaj, Fahad Shah, Dan Roth

TL;DR
Routesplain is a novel LLM routing method for software tasks that uses interpretable concepts to improve accuracy, reduce costs, and provide transparent rationales, outperforming existing black-box approaches.
Contribution
It introduces the first interpretable routing approach for software-related tasks, enabling concept-based routing and intervention for improved performance and transparency.
Findings
Routesplain outperforms individual models in accuracy and cost.
It matches or surpasses black-box routing baselines.
Concept-level intervention reveals improvement opportunities.
Abstract
LLMs now tackle a wide range of software-related tasks, yet we show that their performance varies markedly both across and within these tasks. Routing user queries to the appropriate LLMs can therefore help improve response quality while reducing cost. Prior work, however, has focused mainly on general-purpose LLM routing via black-box models. We introduce Routesplain, the first LLM router for software-related tasks, including multilingual code generation and repair, input/output prediction, and computer science QA. Unlike existing routing approaches, Routesplain first extracts human-interpretable concepts from each query (e.g., task, domain, reasoning complexity) and only routes based on these concepts, thereby providing intelligible, faithful rationales. We evaluate Routesplain on 16 state-of-the-art LLMs across eight software-related tasks; Routesplain outperforms individual models…
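The two-stage idea in the abstract (extract concepts, then route on the concepts alone) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: the `Concepts` fields, the keyword-free `extract_concepts` stub, the model names, the accuracy and cost numbers in `STATS`, and the `cost_weight` trade-off are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Concepts:
    # Hypothetical concept set, loosely following the abstract's examples.
    task: str
    domain: str
    complexity: str  # "low" or "high" in this toy example


# Made-up statistics: (task, complexity) -> {model: (accuracy, relative cost)}.
STATS = {
    ("code_generation", "low"): {"small-model": (0.80, 1.0), "large-model": (0.85, 10.0)},
    ("code_generation", "high"): {"small-model": (0.30, 1.0), "large-model": (0.70, 10.0)},
}


def extract_concepts(query: str) -> Concepts:
    """Toy stand-in for a learned concept predictor (assumption)."""
    complexity = "high" if len(query.split()) > 12 else "low"
    return Concepts(task="code_generation", domain="general", complexity=complexity)


def route(concepts: Concepts, cost_weight: float = 0.01) -> tuple[str, str]:
    """Pick a model from the concepts only; the raw query is never seen here."""
    candidates = STATS[(concepts.task, concepts.complexity)]
    best = max(candidates, key=lambda m: candidates[m][0] - cost_weight * candidates[m][1])
    rationale = f"task={concepts.task}, complexity={concepts.complexity} -> {best}"
    return best, rationale


query = "Write a function that reverses a string"
predicted = extract_concepts(query)
model, why = route(predicted)

# Intervention: a human overrides one predicted concept and the router re-decides.
corrected = replace(predicted, complexity="high")
model_after, why_after = route(corrected)
```

Because `route` only receives a `Concepts` value, the rationale string is a complete account of what the decision depended on, and editing a concept is enough to change the routing outcome.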
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Routing queries to appropriate models to maximize performance and minimize cost is an important research problem.
- The proposed method achieved encouraging results, outperforming individual models.
- Suboptimal router formulation: a major drawback of RouteSplain is that while the router understands the queries, it does not understand the strengths and weaknesses of each component model. Thus, the trained router in RouteSplain is just a more advanced version of a count-based solution that counts how many times a model is selected for each combination of concepts and retrieves that count during inference. This might explain why most routing strategies have similar result curves in Figure 5 Left. A
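For reference, the count-based solution this reviewer compares against could look like the following sketch. It is an assumed reading of the reviewer's description, not code from the paper; the record format, the model names, and the `default` fallback are hypothetical.

```python
from collections import Counter, defaultdict


def fit_count_router(records):
    """Count how often each model was the best choice per concept combination.

    records: iterable of (concept_tuple, best_model) pairs (assumed format).
    """
    counts = defaultdict(Counter)
    for concepts, best_model in records:
        counts[concepts][best_model] += 1
    return counts


def route_by_counts(counts, concepts, default="fallback-model"):
    """Return the most frequently selected model for this concept combination."""
    if concepts not in counts:
        return default  # unseen combination: fall back to a fixed model
    return counts[concepts].most_common(1)[0][0]


training = [
    (("code_repair", "python"), "model-a"),
    (("code_repair", "python"), "model-a"),
    (("code_repair", "python"), "model-b"),
    (("cs_qa", "networks"), "model-b"),
]
table = fit_count_router(training)
choice = route_by_counts(table, ("code_repair", "python"))
unseen = route_by_counts(table, ("io_prediction", "java"))
```

Such a table has no notion of why a model wins for a given combination, which is the limitation the reviewer is pointing at.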
- This paper successfully migrates the routing approach to the programming language scenario.
- The evaluation comprehensively covers 8 software tasks and 16 mainstream LLMs.
- It assesses performance variations of different models across these 8 tasks.
- The paper claims that the system is “interpretable” and “intervenable,” but these claims are largely conceptual. The so-called “interpretability” provided by the system appears to consist merely of a list of concept labels used in its decision-making. Furthermore, the described interventions do not meaningfully improve model performance and primarily serve to correct misclassifications made by the concept classifier. If the core purpose of the intervention is simply to rectify the system’s own er
1. Timely problem. The paper tackles a timely problem. Intelligently routing to use a cost-effective model can save a lot of money without sacrificing much performance.
2. Strong empirical evidence as motivation. In section 3, the paper performs an extensive study to show the inter-task and intra-task performance variance among models, highlighting the necessity of routing for software tasks.
3. Novel approach. The use of the concept bottleneck model (CBM) is novel and appropriate for the routin
1. Ambiguous or imprecise definition of “complexity”. The paper defines “complexity” as the fraction of models that failed on a given input. This definition is somewhat circular (i.e., a query is complex because the strong models are required, and the strong models are required because it is complex).
2. Scalability of the concept space. The set of concepts is manually defined and tied directly to the labels available in the evaluation datasets. It remains unclear whether manual labeling and
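The "complexity" definition quoted in this review (the fraction of models that failed on a given input) can be written out as a short sketch. The function name, the outcome format, and the model names are assumptions for illustration; the paper may bucket or post-process this value differently.

```python
def complexity(outcomes):
    """Fraction of models that failed on one query.

    outcomes: dict mapping model name -> True if that model solved the query
    (assumed format).
    """
    if not outcomes:
        raise ValueError("need at least one model outcome")
    failures = sum(1 for solved in outcomes.values() if not solved)
    return failures / len(outcomes)


# Hypothetical results for one query across four models.
score = complexity({"model-a": True, "model-b": False, "model-c": False, "model-d": False})
```

The label depends entirely on how the evaluated model pool behaves, so adding or removing models changes a query's complexity, which is one way to see the circularity the reviewer describes.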
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Software Testing and Debugging Techniques
