Network of Theseus (like the ship)

Vighnesh Subramaniam; Colin Conwell; Boris Katz; Andrei Barbu; Brian Cheung

arXiv:2512.04198·cs.LG·December 5, 2025

Open Access · 3 Reviews

TL;DR

Network of Theseus (NoT) progressively converts a trained, or even untrained, guide network into an entirely different target architecture while largely preserving the guide's performance, so the architecture used for training no longer has to be the one that is deployed.

Contribution

NoT introduces a progressive, representational similarity-based approach to convert one neural network architecture into another while maintaining functionality.
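To make "representational similarity" concrete, here is a minimal sketch of one widely used metric, linear centered kernel alignment (CKA). This is an illustrative stand-in and is not claimed to be the metric the paper uses; the names `linear_cka`, `guide_acts`, `rotated`, and `unrelated` are hypothetical, and the activations are random data rather than outputs of a real network.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activation matrices x (n, d1) and y (n, d2)."""
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)

# Hypothetical activations of one guide-network layer on 64 inputs.
guide_acts = rng.normal(size=(64, 16))

# The same representation expressed in a rotated basis.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
rotated = guide_acts @ rotation

# An unrelated representation of the same 64 inputs.
unrelated = rng.normal(size=(64, 16))

same_score = linear_cka(guide_acts, rotated)
other_score = linear_cka(guide_acts, unrelated)
print(f"rotated copy: {same_score:.3f}, unrelated: {other_score:.3f}")
```

Linear CKA is invariant to orthogonal transformations of the features, so the rotated copy scores essentially 1 while the unrelated activations score lower. A metric with this kind of invariance lets a replacement module match what a guide layer represents without copying its exact coordinates.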

Findings

01 · Successfully converts CNNs to MLPs with preserved accuracy

02 · Transforms GPT-2 into RNNs without performance loss

03 · Decouples training and deployment architectures for flexibility

Abstract

A standard assumption in deep learning is that the inductive bias introduced by a neural network architecture must persist from training through inference. The architecture you train with is the architecture you deploy. This assumption constrains the community from selecting architectures that may have desirable efficiency or design properties due to difficulties with optimization. We challenge this assumption with Network of Theseus (NoT), a method for progressively converting a trained, or even untrained, guide network architecture part-by-part into an entirely different target network architecture while preserving the performance of the guide network. At each stage, components in the guide network architecture are incrementally replaced with target architecture modules and aligned via representational similarity metrics. This procedure largely preserves the functionality of the guide…
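The part-by-part procedure described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the guide is a stack of random tanh blocks rather than a trained network, the target modules are affine maps, replacement runs front to back, and alignment uses a plain least-squares fit to the guide's activations instead of a representational similarity metric. All names (`guide_block`, `fit_target_module`, `target_module`, `forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 8, 3
x = rng.normal(size=(256, width))

# Hypothetical guide network: fixed tanh blocks standing in for trained layers.
guide_weights = [rng.normal(scale=0.5, size=(width, width)) for _ in range(depth)]

def guide_block(h, w):
    return np.tanh(h @ w)

def fit_target_module(h_in, h_out):
    """Fit an affine module (a different module family) by least squares."""
    design = np.hstack([h_in, np.ones((h_in.shape[0], 1))])
    params, *_ = np.linalg.lstsq(design, h_out, rcond=None)
    return params

def target_module(h, params):
    return np.hstack([h, np.ones((h.shape[0], 1))]) @ params

def forward(h, modules):
    for kind, p in modules:
        h = guide_block(h, p) if kind == "guide" else target_module(h, p)
    return h

guide_modules = [("guide", w) for w in guide_weights]
guide_out = forward(x, guide_modules)

# Start from the guide and replace one block per stage.
modules = list(guide_modules)
errors = []
for k in range(depth):
    h_in = forward(x, modules[:k])             # input from the already converted prefix
    h_ref = forward(x, guide_modules[:k + 1])  # guide activation after block k
    modules[k] = ("target", fit_target_module(h_in, h_ref))
    # Output gap between the current hybrid network and the guide.
    errors.append(float(np.mean((forward(x, modules) - guide_out) ** 2)))

# Reference point: error of always predicting the guide's mean output.
baseline = float(np.mean((guide_out - guide_out.mean(axis=0)) ** 2))
print("per-stage output error:", errors, "mean-prediction baseline:", baseline)
```

At every stage the network is a working hybrid: converted modules at the front, original guide blocks behind them. After the last stage no guide block remains. The affine target here is far too weak to match a nonlinear guide exactly, so the sketch shows only the replacement schedule, not the accuracy the paper reports.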

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. Unlike prior work, NoT is not limited to structurally similar architectures or to identical computational patterns (e.g., attention to linear attention); it can handle radical family shifts, e.g., ResNet to MLP or GPT-2 to RNN.

2. The progressive replacement strategy is elegant and well justified. The "Ship of Theseus" metaphor effectively communicates the core idea of this work. The proposed method addresses a limitation in current neural architecture research, the tig

Weaknesses

1. The main baselines are naive replacement and training from scratch; there are no strong comparisons to SoTA methods such as progressive distillation, model stitching, and neural architecture search, which also attempt cross-architecture or sub-graph-level transfer. Some relevant works are briefly discussed and cited in the paper, but there is no direct comparison in the main results.

2. The tuning of hyper-parameters for the D-MNN metric, such as temperature choice,

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The core idea, progressively replacing layers while maintaining representational similarity, is intuitive yet practically effective. Although philosophically framed as a "Ship of Theseus," the method essentially defines a similarity loss to align the outputs of the replaced and original layers. This part-by-part replacement and similarity alignment strategy, while conceptually simple, is implemented systematically for the first time. Despite limited algorithmic novelty, the approach is simple

Weaknesses

1. While the idea is straightforward and effective, it lacks strong novelty. Overall, the contribution lies more in the systematic implementation and empirical exploration of an intuitive idea than in a fundamentally new conceptual innovation. The four replacement schedules (progressive, sequential, independent, joint) are systematic but predictable; 'progressive' being superior is not surprising. The experiments are solid but could have explored broader domains, such as speech or multimodal set

Reviewer 03 · Rating 4 · Confidence 2

Strengths

The method was clearly explained and the experiments were relevant.

Weaknesses

The paper could more concisely summarise its contributions in the introduction. It was clear what the method did but less clear what was novel. The paper would be improved by further comparison to other methods. The paper says “Doing the same with distillation would lead to much worse results.” Why isn’t this just demonstrated empirically in a direct comparison with the proposed method? The results only compare the guide accuracy, NoT accuracy, and from-scratch baseline accuracy. Why was th

Taxonomy

Topics: Advanced Neural Network Applications · Neural Networks and Reservoir Computing · Stochastic Gradient Optimization Techniques