General Object Foundation Model for Images and Videos at Scale

Junfeng Wu; Yi Jiang; Qihao Liu; Zehuan Yuan; Xiang Bai; Song Bai

arXiv:2312.09158·cs.CV·December 15, 2023·5 cites

TL;DR

GLEE is a versatile, large-scale object foundation model for images and videos that unifies detection, segmentation, tracking, and identification, excelling in zero-shot transfer and multi-modal tasks.

Contribution

This work introduces GLEE, a unified multi-task object model trained on diverse data, enabling zero-shot generalization and integration with large language models.

Findings

01 Achieves state-of-the-art zero-shot performance on multiple benchmarks.

02 Handles diverse object perception tasks simultaneously.

03 Demonstrates strong generalization after training on over five million images from diverse benchmarks.

Abstract

We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling the model to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility…
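To make the pairing of an image encoder with a text encoder more concrete, here is a minimal sketch of one common way such a pairing can label objects with arbitrary category names: each object embedding is matched to the text embedding with the highest cosine similarity. This is an illustrative assumption, not GLEE's actual implementation; the function names (`cosine`, `label_objects`) and the toy three-dimensional vectors are hypothetical and stand in for real encoder outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def label_objects(object_embeddings, text_embeddings):
    """Assign each object the category whose text embedding is most similar.

    object_embeddings: list of vectors (hypothetical image-encoder outputs).
    text_embeddings: dict mapping category name -> vector
                     (hypothetical text-encoder outputs).
    """
    labels = []
    for obj in object_embeddings:
        best = max(text_embeddings, key=lambda name: cosine(obj, text_embeddings[name]))
        labels.append(best)
    return labels

# Toy stand-ins for encoder outputs (assumed values, not real model features).
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
object_embeddings = [
    [0.9, 0.1, 0.0],   # closest to "cat"
    [0.1, 0.2, 0.95],  # closest to "car"
]

labels = label_objects(object_embeddings, text_embeddings)
print(labels)
```

Because the category list is just a dictionary of text embeddings, new categories can be added at inference time without retraining anything in this sketch, which is the general idea behind zero-shot transfer to unseen vocabularies.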

Code & Models

Repositories

FoundationVision/GLEE (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques