NN-Former: Rethinking Graph Structure in Neural Architecture Representation
Ruihan Xu, Haokui Zhang, Yaowei Wang, Wei Zeng, Shiliang Zhang

TL;DR
This paper introduces NN-Former, a novel neural architecture predictor that combines GNNs and transformers, emphasizing sibling node relationships to improve accuracy and latency predictions for neural network topologies.
Contribution
The paper proposes a new predictor leveraging sibling-aware topology modeling, combining GNNs and transformers with novel token and channel mixers for better neural architecture representation.
Findings
Achieves improved accuracy in architecture performance prediction
Provides better latency estimation for neural networks
Offers insights into DAG topology learning
Abstract
The growing use of deep learning necessitates efficient network design and deployment, making neural predictors vital for estimating attributes such as accuracy and latency. Recently, Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. However, each method has its disadvantages. GNNs lack the capability to represent complicated features, while transformers generalize poorly as the depth of the architecture grows. To mitigate these issues, we rethink neural architecture topology and show that sibling nodes are pivotal yet overlooked in previous research. We thus propose a novel predictor leveraging the strengths of GNNs and transformers to learn the enhanced topology. We introduce a novel token mixer that considers siblings, and a new channel mixer named bidirectional graph isomorphism feed-forward…
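The sibling relationship mentioned in the abstract can be illustrated with a small sketch. It assumes that two distinct nodes of a DAG count as siblings when they share a parent or share a child; that definition, the function name `sibling_mask`, and the 4-node example graph are all assumptions made for illustration, not the authors' implementation.

```python
def sibling_mask(adj):
    """Return an n x n boolean matrix marking sibling pairs in a DAG.

    adj[i][j] is truthy when there is an edge i -> j. Two distinct nodes
    are treated as siblings if they share a parent or share a child
    (an assumed definition for this sketch).
    """
    n = len(adj)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            if i == k:
                continue  # a node is not its own sibling
            shares_parent = any(adj[p][i] and adj[p][k] for p in range(n))
            shares_child = any(adj[i][c] and adj[k][c] for c in range(n))
            mask[i][k] = shares_parent or shares_child
    return mask


# Hypothetical 4-node cell with edges 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
adj = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
mask = sibling_mask(adj)
```

In this example nodes 1 and 2 are the only sibling pair, since both receive input from node 0 and both feed node 3. A mask like this could be combined with the adjacency matrix to restrict which token pairs attend to each other, which is one plausible reading of a "token mixer that considers siblings".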
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The idea of attending to sibling nodes in a DAG is new compared with existing DAG transformers. - NN-Former shows better performance in accuracy/latency prediction tasks on NAS benchmarks, and the authors did ablation studies to validate the effectiveness of individual components.
- The paper claims that existing DAG transformers (even with attention masks and position encoding) have poor generalization but doesn’t provide supportive analysis. Why do they have poor generalization, and why might sibling attention help? A more in-depth analysis would make the proposed design choice more convincing. - I think there is some redundancy in architecture design. The GC branch of the BGIFFN module aggregates information from adjacent neighbors, but the ASMA module has already done
1. The model effectively integrates different types of graph information, which is a reasonable approach for this task. 2. The experiments conducted demonstrate that the proposed method outperforms previous approaches on several popular benchmarks.
1. The novelty of the approach is limited. Integrating graph information with transformers for predicting neural network performance is not new, and this work only introduces incremental modifications based on prior research. 2. The authors claim that previous methods suffer from poor generalization as the depth of the architecture increases; however, this claim is not supported by any evaluations. Additionally, there is no evidence provided that the proposed method resolves this issue. 3. Perfo
1. Introducing sibling nodes as a structural consideration in DAGs is interesting and potentially valuable for capturing architectural relationships, offering a fresh perspective on topology in neural network prediction with some theoretical foundation. 2. Combining GNN and transformer leveraging both global and local features to enhance performance. 3. Demonstrating competitive performance across multiple datasets. 4. Detailed methodology and implementation to facilitate reproducibility.
1. Although the paper claims that sibling nodes improve generalization and representation, it lacks comparisons with other advanced graph representation models that do not use sibling relationships but achieve high performance (e.g., models utilizing multi-hop or high-order adjacency information). 2. No discussion of computational overhead or comparison between the trade-offs of the presented method vs existing ones (e.g., FLOPS/latency of the predictor). 3. The BGIFFN module is designed to agg
S1. The paper contains experiments on various benchmarks: NAS-Bench-101, NAS-Bench-201, and NNLQ. S2. The empirical results in Tables 1 and 2 show that the proposed method clearly outperforms the baseline methods when the number of training samples is small. S3. The paper provides the necessary ablation studies showing the effectiveness of sibling attention masks (Table 6) and BGIFFN (Table 7). S4. The idea of mixing information from sibling nodes for latency/accuracy prediction seems intuitive.
W1. Why does the proposed method perform worse for ResNet and VGG (Table 4)? Since the proposed method performs well on architectures like EfficientNet, MNasNet, and MobileNet, which share similar combinations of CNN types in their blocks, it seems that the proposed method is specialized for these architectures. If this is the case, it suggests that the method may not have good generalizability across different architectures, and we could expect poor performance even with ResNet or VGG's other v
Taxonomy
Topics: Advanced Graph Neural Networks · Explainable Artificial Intelligence (XAI) · Big Data and Digital Economy
