The Topology of Multimodal Fusion: Why Current Architectures Fail at Creative Cognition

Xiujiang Tan (Guangzhou Academy of Fine Arts, Guangzhou, China)

arXiv:2604.04465·cs.AI·May 5, 2026

TL;DR

This paper argues that current multimodal AI architectures share a topological limitation, modal separability, and proposes a geometric framework and benchmarks aimed at creative cognition.

Contribution

The paper introduces a topological perspective on multimodal fusion, formalizes it mathematically, and proposes implementations and benchmarks for creative AI.

Findings

01. Identifies modal separability as a key topological limitation.

02. Proposes fiber bundle and Yang-Mills curvature formalism for multimodal fusion.

03. Introduces UOO implementation and new benchmarks for topological isomorphism.

Abstract

This paper identifies a structural limitation in current multimodal AI architectures that is topological rather than parametric. Contrastive alignment (CLIP), cross-attention fusion (GPT-4V/Gemini), and diffusion-based generation share a common geometric prior -- modal separability -- which we term contact topology. The argument rests on three pillars with philosophy as the generative center. The philosophical pillar reinterprets Wittgenstein's saying/showing distinction as a problem rather than a conclusion: where Wittgenstein chose silence, the Chinese craft epistemology tradition responded with xiang (operative schema) -- the third state emerging when saying and showing interpenetrate. A cruciform framework (dao/qi × saying/showing) positions xiang at the intersection, executing dual huacai (transformation-and-cutting) along both axes. This generates a dual-layer dynamics: chuanghua…
