Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information
Yuan Cheng, Yu Huang, Zhe Xiong, Yingbin Liang, Vincent Y. F. Tan

TL;DR
This paper introduces a new information-theoretic metric, KG-MI, enabling transformers to learn and recover complex DAG structures with provable guarantees, extending previous tree-based results to more general graphs.
Contribution
The work proposes KG-MI combined with multi-head attention to provably learn DAGs, providing convergence proofs and structure recovery guarantees for transformer models.
Findings
Gradient ascent on the transformer's training objective converges to the global optimum for sequences generated by a DAG.
Attention scores at convergence reflect the true DAG structure.
Experimental results support the theoretical claims.
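As an illustration of the second finding, the sketch below shows one way parent sets could be read off per-head attention scores. It is not the paper's procedure: the function name `parents_from_attention`, the score layout `attn[h][i][j]`, the assumption that nodes are indexed in topological order, and the rule that each head contributes its single highest-scoring earlier node are all hypothetical choices made for this example, and the scores are made-up numbers.

```python
from typing import Dict, List, Set


def parents_from_attention(attn: List[List[List[float]]]) -> Dict[int, Set[int]]:
    """Read a candidate parent set for each node from per-head attention scores.

    attn[h][i][j] is a hypothetical attention score that head h assigns
    from node i to node j. Nodes are assumed to be in topological order,
    so only earlier nodes (j < i) are considered as parents.
    """
    num_nodes = len(attn[0])
    parents: Dict[int, Set[int]] = {i: set() for i in range(num_nodes)}
    for head in attn:
        for i in range(1, num_nodes):
            # Pick the earlier node with the highest score for this head.
            best = max(range(i), key=lambda j: head[i][j])
            parents[i].add(best)
    return parents


# Made-up scores for 4 nodes and 2 heads (illustrative only).
example_attn = [
    [  # head 0
        [1.0, 0.0, 0.0, 0.0],
        [0.9, 0.1, 0.0, 0.0],
        [0.8, 0.1, 0.1, 0.0],
        [0.1, 0.7, 0.2, 0.0],
    ],
    [  # head 1
        [1.0, 0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.1, 0.9, 0.0, 0.0],
        [0.1, 0.2, 0.7, 0.0],
    ],
]

recovered = parents_from_attention(example_attn)
print(recovered)
```

Because only earlier nodes are eligible as parents, the recovered graph is acyclic by construction. When two heads select the same node, the set simply holds one parent, which is how a node with fewer parents than heads would appear under these assumptions.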
Abstract
Uncovering hidden graph structures underlying real-world data is a critical challenge with broad applications across scientific domains. Recently, transformer-based models leveraging the attention mechanism have demonstrated strong empirical success in capturing complex dependencies within graphs. However, the theoretical understanding of their training dynamics has been limited to tree-like graphs, where each node depends on a single parent. Extending provable guarantees to more general directed acyclic graphs (DAGs) -- which involve multiple parents per node -- remains challenging, primarily due to the difficulty in designing training objectives that enable different attention heads to separately learn multiple different parent relationships. In this work, we address this problem by introducing a novel information-theoretic metric: the kernel-guided mutual information (KG-MI), based…
