OCC-MLLM-Alpha: Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning

Shuxin Yang; Xinhan Di

arXiv:2410.01861·cs.CV·October 4, 2024

TL;DR

This paper introduces OCC-MLLM-Alpha, a multi-modal large language model framework with a self-supervised test-time learning strategy and 3D generation support. It reports a 16.92% improvement over state-of-the-art VLMs on the SOMVideo dataset for occluded object understanding.

Contribution

It presents a novel multi-modal framework with self-supervised learning and 3D generation, addressing the gap in occluded object comprehension in large-scale models.

Findings

01. 16.92% improvement over state-of-the-art models on the SOMVideo dataset

02. Enhanced understanding of occluded objects in multi-modal models

03. Introduction of a self-supervised test-time learning strategy

Abstract

Existing large-scale visual-language multi-modal models leave a gap in the understanding of occluded objects. Current state-of-the-art multi-modal models, which rely on universal visual encoders and supervised learning strategies, fail to describe occluded objects satisfactorily. We therefore introduce a multi-modal large language framework and a corresponding self-supervised learning strategy supported by 3D generation. We begin our experiments by comparing against state-of-the-art models on the large-scale SOMVideo dataset [18]. Initial results show an improvement of 16.92% over state-of-the-art VLMs.

Taxonomy

Topics: Multimodal Machine Learning Applications