TL;DR
This paper introduces Distributed Neural Architectures (DNAs), which are flexible, trainable neural networks that adapt their structure dynamically for vision and language tasks, achieving competitive performance and interpretability.
Contribution
The paper proposes a novel framework for training adaptable, content-dependent neural architectures that generalize sparse methods and learn to optimize computation and communication patterns.
Findings
DNAs are competitive with dense models in vision and language tasks.
Compute efficiency and parameter sharing are learned from data.
Emergent specialization and interpretable compute allocation are observed.
Abstract
We introduce and train distributed neural architectures (DNA) in vision and language domains. DNAs are initialized with a proto-architecture that consists of (transformer, MLP, attention, etc.) modules and routers. Any token (or patch) can traverse any series of modules in any order. DNAs are a natural generalization of the sparse methods such as Mixture-of-Experts, Mixture-of-Depths, parameter sharing, etc. Computation and communication patterns of DNA modules are learnt end-to-end during training and depend on the content and context of each token (or patch). These patterns can be shaped by further requirements added to the optimization objective such as compute/memory efficiency or load balancing. We empirically show that (i) trained DNAs are competitive with the dense baselines in both domains and (ii) compute efficiency/parameter sharing can be learnt from data. Next, we analyze…
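The routing idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the top-1 selection rule, the residual linear "modules", the identity (skip) module, and every name below (`make_linear_module`, `make_router`, `route_tokens`, `load_fractions`) are assumptions made for illustration. Weights are random rather than trained, so the paths carry no meaning; the sketch only shows how a per-token router can pick which module a token visits at each step, and how per-module load can be measured for a load-balancing term.

```python
import math
import random


def make_linear_module(dim, rng):
    """Hypothetical stand-in for a transformer/MLP/attention module: x + W x."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def module(x):
        return [x[i] + sum(w[i][j] * x[j] for j in range(dim)) for i in range(dim)]

    return module


def identity_module(x):
    """Skip module: returns the token unchanged (assumed here, costs no compute)."""
    return list(x)


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_router(dim, n_modules, rng):
    """Hypothetical linear router: maps a token to a distribution over modules."""
    w = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_modules)]

    def router(x):
        return softmax([sum(row[j] * x[j] for j in range(dim)) for row in w])

    return router


def route_tokens(tokens, modules, routers):
    """Each token takes len(routers) steps; each step's router picks one module (top-1)."""
    outputs, paths = [], []
    for x in tokens:
        path = []
        for router in routers:
            probs = router(x)  # content-dependent choice
            k = max(range(len(modules)), key=lambda i: probs[i])
            x = modules[k](x)
            path.append(k)
        outputs.append(x)
        paths.append(path)
    return outputs, paths


def load_fractions(paths, n_modules):
    """Fraction of all routing decisions that went to each module."""
    counts = [0] * n_modules
    for path in paths:
        for k in path:
            counts[k] += 1
    total = sum(counts)
    return [c / total for c in counts]


rng = random.Random(0)
DIM, N_MODULES, N_STEPS, N_TOKENS = 4, 3, 5, 8
modules = [make_linear_module(DIM, rng) for _ in range(N_MODULES - 1)] + [identity_module]
routers = [make_router(DIM, N_MODULES, rng) for _ in range(N_STEPS)]
tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_TOKENS)]

outputs, paths = route_tokens(tokens, modules, routers)
loads = load_fractions(paths, N_MODULES)
print("paths:", paths)
print("load per module:", loads)
```

In a trained model the hard `max` would need a differentiable treatment (for example, weighting module outputs by router probabilities), and a penalty on uneven `loads` is one way a load-balancing requirement could enter the objective; both are left out of this sketch.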
Peer Reviews
Decision·Submitted to ICLR 2026
- This is an innovative work. The authors challenge the fixed architecture pipeline and propose a generic and flexible modular architecture.
- The authors come up with visualization designs to analyze the routing patterns of modular networks.
- Weak results: Evaluation results on vision tasks underperform the dense models, while the language tasks also show mixed results against the dense counterpart.
- Doubts on interpretability: As the authors also discovered, a randomly initialized model also shows some degree of "clustering". This casts doubt on the reliability of inspecting the patches and reading meaning from them. It may simply be because the patches start out close to each other, and their representations remain close throughout the l
- The paper proposes a novel and interesting paradigm for training neural networks. As pointed out in the paper, the main value of the paper is in proposing the paradigm and showing that it is feasible to train performant models in this paradigm. While they do not offer SoTA performance or practical wall-clock time efficiency with current hardware, it is an important research direction.
- The proposed DNA architecture could easily have ended up much less performant than the well-optimized Trans
- Even with the current Transformer architecture, we achieve a high degree of sparsity via MoEs and efficient attention layers such as Mamba or sliding window attention. In addition, we can also achieve contextual sparsity in principle with methods such as early exit (Confident Adaptive Language Modeling, Schuster et al., 2022). It is unclear whether there is evidence that approaches like DNA can achieve a much sparser structure than these known methods.
- Proposes a framework that generalizes most of the conditional computation approaches used in training large models.
- Extensive analysis of the routing paths of tokens to support the claim that routing is interpretable.
- Results match the dense baseline while being sparsely activated.
- Experiments are comprehensive across vision and language modalities.
- Can you provide the FLOPs taken by the proposed method? It is hard to compare with the baseline without a FLOP comparison, as their performances are similar. Include it in the table that presents the results for each domain.
- Learning routing is a hard problem faced in MoEs and MoDs. The proposed method doesn’t address it at all, which makes the framework not useful at the current stage.
- What’s the motivation for including skip identity modules?
- Why does Top-2 DNA models always have skip modules
