What Do Agents Learn from Trajectory-SFT: Semantics or Interfaces?

Weizheng Gu; Chengze Li; Zhuohao Yu; Mengyuan Sun; Zhibang Yang; Wei Wang; Hongrui Jia; Shikun Zhang; Wei Ye

arXiv:2602.01611·cs.LG·February 3, 2026

TL;DR

This paper introduces PIPE, a protocol-level evaluation augmentation that diagnoses whether language model agents rely on task semantics or on interface-specific patterns. It finds that trajectory-SFT often amplifies interface shortcutting: trained agents degrade sharply when the interface is minimally rewritten.

Contribution

The paper proposes PIPE for diagnosing interface reliance and introduces Interface Reliance (IR). It shows that trajectory-SFT amplifies interface shortcutting and that the resulting training dynamics depend on the environment.

Findings

01. Trajectory-SFT amplifies interface shortcutting.

02. Agents degrade under minimal interface rewrites.

03. Interface reliance varies with environment and training dynamics.

Abstract

Large language models are increasingly evaluated as interactive agents, yet standard agent benchmarks conflate two qualitatively distinct sources of success: semantic tool-use and interface-specific interaction pattern memorization. Because both mechanisms can yield identical task success on the original interface, benchmark scores alone are not identifiable evidence of environment-invariant capability. We propose PIPE, a protocol-level evaluation augmentation for diagnosing interface reliance by minimally rewriting environment interfaces while preserving task semantics and execution behavior. Across 16 environments from AgentBench and AgentGym and a range of open-source and API-based agents, PIPE reveals that trajectory-SFT substantially amplifies interface shortcutting: trained agents degrade sharply under minimal interface rewrites, while non-trajectory-trained models remain largely…
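
To make the idea of a minimal interface rewrite concrete, here is a small Python sketch. It is not the authors' PIPE implementation: `ToyEnv`, `RewrittenInterface`, and the action names are hypothetical, chosen only for illustration. The wrapper renames the actions an agent sees through a one-to-one mapping and translates them back before execution, so task semantics and execution behavior stay the same while the surface interface changes.

```python
from typing import Dict, List


class ToyEnv:
    """Hypothetical stand-in environment with two actions (not from the paper)."""

    def __init__(self) -> None:
        self.inventory: List[str] = []

    def step(self, action: str, arg: str) -> str:
        if action == "pick_up":
            self.inventory.append(arg)
            return f"You pick up the {arg}."
        if action == "look":
            carried = ", ".join(self.inventory) or "nothing"
            return f"You are carrying: {carried}."
        return "Invalid action."


class RewrittenInterface:
    """Exposes the same environment under renamed actions.

    Assumption: a one-to-one renaming of action names is one possible
    'minimal rewrite'; the paper may use different or additional rewrites.
    """

    def __init__(self, env: ToyEnv, rename: Dict[str, str]) -> None:
        if len(set(rename.values())) != len(rename):
            raise ValueError("rename must be one-to-one")
        self.env = env
        # Map each new (agent-facing) name back to the original action name.
        self._to_original = {new: old for old, new in rename.items()}

    def step(self, action: str, arg: str) -> str:
        original = self._to_original.get(action)
        if original is None:
            # Original names are no longer part of the interface.
            return "Invalid action."
        return self.env.step(original, arg)


if __name__ == "__main__":
    wrapped = RewrittenInterface(ToyEnv(), {"pick_up": "acquire", "look": "inspect"})
    print(wrapped.step("acquire", "key"))  # same effect as pick_up on the original
    print(wrapped.step("pick_up", "key"))  # a memorized original name now fails
```

In this sketch, an agent that reads the rewritten action list and reasons about it behaves exactly as before, while an agent that replays memorized action strings from the original interface gets "Invalid action." and loses task progress.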

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Explainable Artificial Intelligence (XAI)