Why Relational Graphs Will Save the Next Generation of Vision Foundation Models?
Fatemeh Ziaeetabar

TL;DR
This paper argues that integrating explicit relational graphs into vision foundation models enhances their reasoning, robustness, and efficiency, especially for tasks requiring understanding of entities, roles, and relations.
Contribution
The paper proposes dynamic relational graphs as an explicit interface in vision models and reports improved performance and interpretability across several vision tasks.
Findings
Augmenting vision foundation models (FMs) with graph modules improves semantic fidelity.
FM-graph hybrids are more robust to out-of-distribution inputs.
These hybrids also offer better memory and hardware efficiency.
Abstract
Vision foundation models (FMs) have become the predominant architecture in computer vision, providing highly transferable representations learned from large-scale, multimodal corpora. Nonetheless, they exhibit persistent limitations on tasks that require explicit reasoning over entities, roles, and spatio-temporal relations. Such relational competence is indispensable for fine-grained human activity recognition, egocentric video understanding, and multimodal medical image analysis, where spatial, temporal, and semantic dependencies are decisive for performance. We advance the position that next-generation FMs should incorporate explicit relational interfaces, instantiated as dynamic relational graphs (graphs whose topology and edge semantics are inferred from the input and task context). We illustrate this position with cross-domain evidence from recent systems in human manipulation…
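To make the notion of a dynamic relational graph concrete, here is a minimal sketch in which both the topology and the edge weights are computed from the input features rather than fixed in advance. It is an illustration under stated assumptions, not the paper's method: the function names `build_dynamic_graph` and `message_pass`, the cosine-similarity k-nearest-neighbour rule, and the softmax edge weighting are all hypothetical choices, and the entity features stand in for whatever token or region embeddings a foundation model would provide.

```python
import numpy as np


def build_dynamic_graph(features, k=2):
    """Infer a k-nearest-neighbour graph from entity features.

    Hypothetical helper: assumes non-zero feature rows and k < number of entities.
    """
    # Cosine similarity between every pair of entities.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-loops

    # Topology: each node keeps edges to its k most similar neighbours.
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    adjacency = np.zeros_like(sim)
    rows = np.arange(features.shape[0])[:, None]
    adjacency[rows, neighbours] = 1.0

    # Edge weights: softmax of similarity over the kept neighbours only.
    scores = np.where(adjacency > 0, sim, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return adjacency, weights


def message_pass(features, weights):
    """One residual round of neighbour aggregation over the inferred graph."""
    return features + weights @ features


# Stand-in entity embeddings: 5 entities with 8-dimensional features.
rng = np.random.default_rng(0)
entity_features = rng.normal(size=(5, 8))
adjacency, weights = build_dynamic_graph(entity_features, k=2)
updated = message_pass(entity_features, weights)
```

Because the adjacency is recomputed for each input, two different images or clips yield different graphs. A task-conditioned variant would additionally feed a task embedding into the scoring step, which this sketch omits.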
