Decoupling Knowledge and Reasoning in Transformers: A Modular Architecture with Generalized Cross-Attention
Zhenyu Guo, Wenguang Chen

TL;DR
This paper proposes a modular Transformer architecture that explicitly separates knowledge retrieval from reasoning, using a generalized cross-attention mechanism to improve interpretability, adaptability, and scalability in neural models.
Contribution
It introduces a novel modular Transformer design with a generalized cross-attention mechanism that decouples knowledge and reasoning, supported by a mathematical derivation relating FFNs to this framework.
Findings
Demonstrates that FFNs are a special case of generalized cross-attention.
Provides a theoretical foundation for understanding and improving Transformer interpretability.
Lays groundwork for future scalable and adaptable Transformer architectures.
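The first finding above can be illustrated with a small numerical check. The sketch below is not the paper's derivation, which this summary does not reproduce. It uses the familiar key-value reading of an FFN and assumes ReLU as the activation. The columns of the first weight matrix act as keys, the rows of the second act as values, and the activation takes the place of softmax. The function name `generalized_cross_attention`, its signature, and the dimensions are all hypothetical choices made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise ReLU on a matrix."""
    return [[max(0.0, v) for v in row] for row in m]


def ffn(x, w1, b1, w2, b2):
    """Standard Transformer FFN: ReLU(x W1 + b1) W2 + b2."""
    hidden = [[v + b for v, b in zip(row, b1)] for row in matmul(x, w1)]
    out = matmul(relu(hidden), w2)
    return [[v + b for v, b in zip(row, b2)] for row in out]


def generalized_cross_attention(q, keys, values, key_bias, out_bias, activation):
    """Hypothetical attention-style retrieval: activation(q K^T + key_bias) V + out_bias.

    keys has one row per memory slot; values has one row per memory slot.
    """
    keys_t = [list(col) for col in zip(*keys)]
    scores = [[s + b for s, b in zip(row, key_bias)] for row in matmul(q, keys_t)]
    weights = activation(scores)  # ReLU here, where standard attention uses softmax
    out = matmul(weights, values)
    return [[v + b for v, b in zip(row, out_bias)] for row in out]


# Assumed toy sizes, chosen only for the demonstration.
random.seed(0)
d_model, d_ff, n_tokens = 4, 8, 3


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


x = rand_matrix(n_tokens, d_model)
w1, b1 = rand_matrix(d_model, d_ff), [random.gauss(0.0, 1.0) for _ in range(d_ff)]
w2, b2 = rand_matrix(d_ff, d_model), [random.gauss(0.0, 1.0) for _ in range(d_model)]

# Reinterpret the FFN weights as a key-value store with d_ff slots.
keys = [list(col) for col in zip(*w1)]  # each column of W1 becomes one key
values = w2                             # each row of W2 is the matching value

ffn_out = ffn(x, w1, b1, w2, b2)
gca_out = generalized_cross_attention(x, keys, values, b1, b2, relu)

max_diff = max(
    abs(a - b) for row_a, row_b in zip(ffn_out, gca_out) for a, b in zip(row_a, row_b)
)
print("max abs difference:", max_diff)
```

In this reading the keys and values are fixed parameters private to one layer. The architecture described in the abstract instead attends to a globally shared knowledge base through layer-specific transformations, which this sketch does not attempt to model.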
Abstract
Transformers have achieved remarkable success across diverse domains, but their monolithic architecture presents challenges in interpretability, adaptability, and scalability. This paper introduces a novel modular Transformer architecture that explicitly decouples knowledge and reasoning through a generalized cross-attention mechanism to a globally shared knowledge base with layer-specific transformations, specifically designed for effective knowledge retrieval. Critically, we provide a rigorous mathematical derivation demonstrating that the Feed-Forward Network (FFN) in a standard Transformer is a specialized case (a closure) of this generalized cross-attention, revealing its role in implicit knowledge retrieval and validating our design. This theoretical framework provides a new lens for understanding FFNs and lays the foundation for future research exploring enhanced…
Taxonomy
Topics: Fuzzy Logic and Control Systems · Rough Sets and Fuzzy Logic · Neural Networks and Applications
Methods: Attention Is All You Need · Byte Pair Encoding · Linear Layer · Softmax · Dense Connections · Balanced Selection · Absolute Position Encodings · Dropout · Adam · Residual Connection
