Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models

Baao Xie; Qiuyu Chen; Yunnan Wang; Zequn Zhang; Xin Jin; Wenjun Zeng

arXiv:2407.18999 · cs.CV · July 30, 2024 · 1 citation

TL;DR

This paper introduces a novel graph-based framework that leverages multimodal large language models to achieve unsupervised disentangled representation learning, effectively capturing correlated factors in complex data.

Contribution

The paper proposes a bidirectional weighted graph approach that combines a $\beta$-VAE with MLLMs to improve disentanglement and interpretability in unsupervised learning.

Findings

01. Superior disentanglement performance demonstrated

02. Enhanced interpretability and generalizability achieved

03. Effective modeling of correlated factors in data

Abstract

Disentangled representation learning (DRL) aims to identify and decompose underlying factors behind observations, thus facilitating data perception and generation. However, current DRL approaches often rely on the unrealistic assumption that semantic factors are statistically independent. In reality, these factors may exhibit correlations, which off-the-shelf solutions have yet to properly address. To tackle this challenge, we introduce a bidirectional weighted graph-based framework to learn factorized attributes and their interrelations within complex data. Specifically, we propose a $\beta$-VAE-based module to extract factors as the initial nodes of the graph, and leverage a multimodal large language model (MLLM) to discover and rank latent correlations, thereby updating the weighted edges. By integrating these complementary modules, our model successfully achieves fine-grained,…
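
As a rough illustration of the two ingredients the abstract names, the sketch below pairs the standard $\beta$-VAE objective (reconstruction error plus $\beta$ times the KL divergence to a standard normal prior) with a directed weighted adjacency matrix whose edges are nudged toward externally supplied correlation scores. Everything beyond the $\beta$-VAE loss is an assumption made for illustration: the function names, the moving-average update rule, the `rate` parameter, and the example scores (which stand in for whatever an MLLM would return) are hypothetical and are not taken from the paper.

```python
import math


def kl_diag_gaussian(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def beta_vae_loss(recon_error, mu, logvar, beta=4.0):
    """Standard beta-VAE loss: reconstruction error + beta * KL term."""
    return recon_error + beta * kl_diag_gaussian(mu, logvar)


def init_graph(num_factors):
    """Directed weighted graph as an adjacency matrix; weights[i][j] is edge i -> j."""
    return [[0.0] * num_factors for _ in range(num_factors)]


def update_edges(weights, scores, rate=0.5):
    """Move each directed edge toward a score in [0, 1].

    Assumption: a simple exponential moving average; the paper's actual
    update rule is not stated in the abstract.
    """
    for (i, j), score in scores.items():
        if i == j:
            continue  # no self-loops in this sketch
        weights[i][j] = (1.0 - rate) * weights[i][j] + rate * score
    return weights


# Three latent factors act as the initial nodes.
graph = init_graph(3)

# Hypothetical correlation scores, standing in for an MLLM's ranking.
# Note that (0, 1) and (1, 0) differ: the graph is bidirectional, not symmetric.
scores = {(0, 1): 0.8, (1, 0): 0.2, (1, 2): 0.6}
update_edges(graph, scores)

# With mu = 0 and logvar = 0 the posterior equals the prior, so the KL term vanishes.
loss = beta_vae_loss(recon_error=1.5, mu=[0.0, 0.0, 0.0], logvar=[0.0, 0.0, 0.0])
```

Storing the two directions of each edge separately is what lets the weight from factor 0 to factor 1 differ from the weight in the opposite direction.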



Taxonomy

Topics: Topic Modeling