Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents
Wei Ma, Zhi Chen, Jingxu Gu, Tianling Li, Shangqing Liu, Lingxiao Jiang

TL;DR
This study analyzes how the behavioral signals of LLM-based software engineering agents vary across frameworks and finds that the same signal can carry opposite meanings depending on the framework.
Contribution
The paper provides a large-scale empirical analysis showing that framework differences significantly affect behavioral signals, which challenges the generalizability of findings from single-framework studies.
Findings
Framework swapping causes large behavioral differences.
Behavioral signals often have opposite implications across frameworks.
Framework identity explains more variance in behavior than LLM family does.
Abstract
Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates: that a test step follows a code modification, that error cascades are short, or that trajectories are compact. Each rule is typically derived from a single framework, and whether it transfers, in sign as well as magnitude, to structurally different agent designs has not been directly tested. We address this at ecosystem scale: 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework (e.g., SWE-Agent, OpenHands) that supplies its tools and workflow. We separate framework effects from LLM effects by holding each layer fixed in turn, then measure one behavior-outcome effect per configuration and examine how those effects agree or disagree. Swapping the…
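The per-configuration measurement described in the abstract can be illustrated with a minimal sketch. It assumes the behavior-outcome effect is a plain difference in resolution rates between runs that show a signal and runs that do not; the excerpt does not state the paper's actual effect measure, so this is a stand-in. The configuration names, the toy runs, and the helpers `per_config_effect` and `sign_agreement` are all hypothetical.

```python
from collections import defaultdict


def per_config_effect(runs):
    """Map each configuration to (resolution rate with signal) - (rate without).

    `runs` is an iterable of (config, has_signal, resolved) tuples.
    The rate difference is an assumed effect measure, not the paper's.
    """
    # counts[config][0] covers runs without the signal, [1] runs with it;
    # each cell is [resolved_count, total_count].
    counts = defaultdict(lambda: [[0, 0], [0, 0]])
    for config, has_signal, resolved in runs:
        cell = counts[config][int(has_signal)]
        cell[0] += int(resolved)
        cell[1] += 1

    effects = {}
    for config, (without, with_signal) in counts.items():
        if without[1] == 0 or with_signal[1] == 0:
            continue  # effect is undefined when one group is empty
        effects[config] = with_signal[0] / with_signal[1] - without[0] / without[1]
    return effects


def sign_agreement(effects):
    """Fraction of configurations whose effect sign matches the majority sign.

    Zero effects are ignored; returns None if no nonzero effect exists.
    """
    signs = [1 if e > 0 else -1 for e in effects.values() if e != 0]
    if not signs:
        return None
    positive = signs.count(1)
    return max(positive, len(signs) - positive) / len(signs)


# Toy data (hypothetical): the signal could be "a test step follows a code edit".
toy_runs = (
    [("config_A", True, True)] * 3 + [("config_A", True, False)] * 1
    + [("config_A", False, True)] * 1 + [("config_A", False, False)] * 3
    + [("config_B", True, True)] * 1 + [("config_B", True, False)] * 3
    + [("config_B", False, True)] * 3 + [("config_B", False, False)] * 1
)

effects = per_config_effect(toy_runs)
agreement = sign_agreement(effects)
```

In this toy data the signal is associated with a higher resolution rate in one configuration and a lower one in the other, so the two effects have opposite signs and the agreement score sits at its minimum of 0.5. That is the kind of sign disagreement the abstract asks about, shown here on invented numbers only.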
