Mechanistic Interpretability for Neural TSP Solvers
Reuben Narad, Leonard Boussioux, Michael Wagner

TL;DR
This paper applies mechanistic interpretability techniques to Transformer-based neural TSP solvers and finds that they develop geometric features, such as boundary detectors and cluster-sensitive responses, that help explain how the models make decisions.
Contribution
The paper presents what the authors describe as the first application of activation-based interpretability methods to neural TSP models, uncovering geometric features that the network learns without explicit supervision.
Findings
Neural TSP solvers develop boundary and cluster features.
Geometric structures emerge naturally in the model.
The analysis provides insight into the internal representations of neural TSP solvers.
Abstract
Neural networks have advanced combinatorial optimization, with Transformer-based solvers achieving near-optimal solutions on the Traveling Salesman Problem (TSP) in milliseconds. However, these models operate as black boxes, providing no insight into the geometric patterns they learn or the heuristics they employ during tour construction. We address this opacity by applying sparse autoencoders (SAEs), a mechanistic interpretability technique, to a Transformer-based TSP solver, representing the first application of activation-based interpretability methods to operations research models. We train a pointer network with reinforcement learning on 100-node instances, then fit an SAE to the encoder's residual stream to discover an overcomplete dictionary of interpretable features. Our analysis reveals that the solver naturally develops features mirroring fundamental TSP concepts: boundary…
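To make the SAE step in the abstract concrete, here is a minimal sketch of a sparse autoencoder with an overcomplete dictionary, written in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the activation width (128), dictionary size (512), L1 coefficient, and the function names `init_sae`, `sae_forward`, and `sae_loss` are all hypothetical, and random data stands in for the encoder's residual-stream activations.

```python
import numpy as np

def init_sae(d_model, d_dict, rng):
    """Initialise a one-hidden-layer SAE; d_dict > d_model makes it overcomplete."""
    w_enc = rng.normal(0.0, 0.1, size=(d_model, d_dict))
    w_dec = rng.normal(0.0, 0.1, size=(d_dict, d_model))
    # Normalise each decoder row so every dictionary direction has unit norm.
    w_dec /= np.linalg.norm(w_dec, axis=1, keepdims=True)
    return {
        "w_enc": w_enc,
        "b_enc": np.zeros(d_dict),
        "w_dec": w_dec,
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params, x):
    """Map activations x of shape (n, d_model) to (features, reconstruction)."""
    pre = (x - params["b_dec"]) @ params["w_enc"] + params["b_enc"]
    features = np.maximum(pre, 0.0)  # ReLU keeps feature activations non-negative
    recon = features @ params["w_dec"] + params["b_dec"]
    return features, recon

def sae_loss(params, x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    features, recon = sae_forward(params, x)
    mse = np.mean(np.sum((x - recon) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return mse + l1_coeff * sparsity

rng = np.random.default_rng(0)
# Assumption: one 100-node instance with 128-dimensional per-node activations.
acts = rng.normal(size=(100, 128))
params = init_sae(d_model=128, d_dict=512, rng=rng)
features, recon = sae_forward(params, acts)
loss = sae_loss(params, acts)
```

In practice the parameters would be trained by gradient descent on `sae_loss` over activations collected from many instances. Each column of `features` can then be inspected against node coordinates to see which geometric patterns it responds to.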
