The Topology of Multimodal Fusion: Why Current Architectures Fail at Creative Cognition
Xiujiang Tan (Guangzhou Academy of Fine Arts, Guangzhou, China)

TL;DR
This paper argues that the limitations of current multimodal AI architectures are topological rather than parametric, and proposes a geometric framework and benchmarks aimed at improving creative cognition.
Contribution
It introduces a topological perspective on multimodal fusion, formalizes it mathematically, and proposes new implementations and benchmarks for advancing creative AI.
Findings
Identifies modal separability as a key topological limitation.
Proposes fiber bundle and Yang-Mills curvature formalism for multimodal fusion.
Introduces UOO implementation and new benchmarks for topological isomorphism.
Abstract
This paper identifies a structural limitation in current multimodal AI architectures that is topological rather than parametric. Contrastive alignment (CLIP), cross-attention fusion (GPT-4V/Gemini), and diffusion-based generation share a common geometric prior, modal separability, which we term contact topology. The argument rests on three pillars, with philosophy as the generative center. The philosophical pillar reinterprets Wittgenstein's saying/showing distinction as a problem rather than a conclusion: where Wittgenstein chose silence, the Chinese craft epistemology tradition responded with xiang (operative schema), the third state that emerges when saying and showing interpenetrate. A cruciform framework (dao/qi × saying/showing) positions xiang at the intersection, executing dual huacai (transformation-and-cutting) along both axes. This generates a dual-layer dynamics: chuanghua…
