Network of Theseus (like the ship)
Vighnesh Subramaniam, Colin Conwell, Boris Katz, Andrei Barbu, Brian Cheung

TL;DR
Network of Theseus (NoT) is a method for progressively converting a trained (or even untrained) neural network into a different architecture for deployment while largely preserving its performance, so that the architecture used for training no longer has to be the one used at inference.
Contribution
NoT introduces a progressive, representational similarity-based approach to convert one neural network architecture into another while maintaining functionality.
Findings
Successfully converts CNNs to MLPs with preserved accuracy
Transforms GPT-2 into RNNs without performance loss
Decouples training and deployment architectures for flexibility
Abstract
A standard assumption in deep learning is that the inductive bias introduced by a neural network architecture must persist from training through inference. The architecture you train with is the architecture you deploy. This assumption constrains the community from selecting architectures that may have desirable efficiency or design properties due to difficulties with optimization. We challenge this assumption with Network of Theseus (NoT), a method for progressively converting a trained, or even untrained, guide network architecture part-by-part into an entirely different target network architecture while preserving the performance of the guide network. At each stage, components in the guide network architecture are incrementally replaced with target architecture modules and aligned via representational similarity metrics. This procedure largely preserves the functionality of the guide…
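The part-by-part replacement described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the guide is a small stack of tanh layers with fixed random weights, the hypothetical `TargetModule` is a random-feature ReLU block, each target module is fitted by least squares rather than by gradient-based alignment, and linear CKA stands in for whichever representational similarity metric the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN, N_LAYERS = 8, 256, 3  # illustrative sizes, not from the paper


def linear_cka(X, Y):
    """Linear CKA between activation matrices of shape (n_samples, n_features)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den


# Guide network: tanh layers with fixed random weights (stand-in for a trained model).
guide_weights = [rng.normal(size=(DIM, DIM)) / np.sqrt(DIM) for _ in range(N_LAYERS)]


def guide_layer(i, x):
    return np.tanh(x @ guide_weights[i])


class TargetModule:
    """Hypothetical target block: fixed random ReLU features plus a fitted linear readout."""

    def __init__(self):
        self.R = rng.normal(size=(DIM, HIDDEN)) / np.sqrt(DIM)
        self.b = rng.normal(size=HIDDEN)
        self.B = np.zeros((HIDDEN, DIM))

    def features(self, x):
        return np.maximum(x @ self.R + self.b, 0.0)

    def fit(self, x, y):
        # Least squares is a simplification of similarity-based alignment.
        self.B, *_ = np.linalg.lstsq(self.features(x), y, rcond=None)

    def __call__(self, x):
        return self.features(x) @ self.B


def hybrid_forward(x, targets):
    """Replaced layers run as target modules; the remaining layers run as guide layers."""
    for i in range(N_LAYERS):
        x = targets[i](x) if i < len(targets) else guide_layer(i, x)
    return x


x_train = rng.normal(size=(2000, DIM))
x_test = rng.normal(size=(500, DIM))
guide_out = hybrid_forward(x_test, [])  # pure guide network

targets, similarities = [], []
for k in range(N_LAYERS):
    # Input to layer k as seen by the hybrid (layers 0..k-1 already replaced).
    h_in = x_train
    for t in targets:
        h_in = t(h_in)
    # Guide activations after layer k, used as the alignment target.
    g_out = x_train
    for i in range(k + 1):
        g_out = guide_layer(i, g_out)
    module = TargetModule()
    module.fit(h_in, g_out)
    targets.append(module)
    # Similarity between guide outputs and hybrid outputs on held-out inputs.
    similarities.append(linear_cka(guide_out, hybrid_forward(x_test, targets)))

print([round(float(s), 3) for s in similarities])
```

After the final stage the hybrid contains no guide layers, and `similarities` records how closely its held-out outputs track the guide's at each replacement step.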
Peer Reviews
Decision: Submitted to ICLR 2026
1. Unlike prior works, NoT is neither limited to structurally similar architectures nor reliant on identical computational patterns (e.g., attention to linear attention); it can handle radical family shifts, e.g., ResNet to MLP or GPT-2 to RNN.
2. The progressive replacement strategy is quite elegant and well-justified. The "Ship of Theseus" metaphor effectively communicates the core idea of this work. The proposed method addresses a limitation in the current neural architecture research, the tig
1. The main baselines are naive replacement and training from scratch, but there are no strong comparisons to SoTA methods, such as those in progressive distillation, model stitching, and neural architecture search, which also attempt cross-architecture or sub-graph-level transfer. Some relevant works are briefly discussed and cited in the paper, but there is no direct comparison in the main results.
2. The tuning of hyper-parameters for the D-MNN metric, such as temperature choice,
1. The core idea, progressively replacing layers while maintaining representational similarity, is intuitive yet practically effective. Although philosophically framed as a "Ship of Theseus," the method essentially defines a similarity loss to align the outputs of the replaced and original layers. This part-by-part replacement and similarity alignment strategy, while conceptually simple, is implemented systematically for the first time. Despite limited algorithmic novelty, the approach is simple
1. While the idea is straightforward and effective, it lacks strong novelty. Overall, the contribution lies more in the systematic implementation and empirical exploration of an intuitive idea than in a fundamentally new conceptual innovation. The four replacement schedules (progressive, sequential, independent, joint) are systematic but predictable; 'progressive' being superior is not surprising. The experiments are solid but could have explored broader domains, such as speech or multimodal set
The method was clearly explained and the experiments were relevant.
The paper could more concisely summarise its contributions in the introduction. It was clear what the method did but less clear what was novel. The paper would be improved by further comparison to other methods. The paper says “Doing the same with distillation would lead to much worse results.” Why isn’t this just demonstrated empirically in a direct comparison with the proposed method? The results only compare the guide accuracy, NoT accuracy, and from-scratch baseline accuracy. Why was th
Taxonomy
Topics: Advanced Neural Network Applications · Neural Networks and Reservoir Computing · Stochastic Gradient Optimization Techniques
