NExT-Chat: An LMM for Chat, Detection and Segmentation
Ao Zhang, Yuan Yao, Wei Ji, Zhiyuan Liu, Tat-Seng Chua

TL;DR
NExT-Chat is a multimodal large language model that introduces a novel pix2emb method for object location modeling, enabling improved visual understanding and multi-task performance in visual grounding, captioning, and reasoning.
Contribution
The paper proposes the pix2emb paradigm for flexible object location modeling and trains NExT-Chat, a multimodal model capable of handling diverse visual tasks with enhanced accuracy.
Findings
NExT-Chat outperforms existing models on key visual tasks.
The pix2emb method enables flexible use of different location formats.
Comprehensive experiments validate the effectiveness of NExT-Chat.
Abstract
The development of large language models (LLMs) has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs). To enhance visual comprehension, recent studies have equipped LMMs with region-level understanding capabilities by representing object bounding box coordinates as a series of text sequences (pix2seq). In this paper, we introduce a novel paradigm for object location modeling called the pix2emb method, in which the LMM outputs location embeddings that are then decoded by different decoders. This paradigm allows us to use different location formats (such as bounding boxes and masks) in multimodal conversations. Leveraging the proposed pix2emb method, we train an LMM named NExT-Chat and demonstrate its capability of handling multiple tasks like visual grounding, region captioning, and grounded…
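The contrast between pix2seq and pix2emb described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the embedding size, the single linear layers, the random weights, the 4x4 mask grid, and all function names below are assumptions chosen only to show the idea that one location embedding can be passed to different decoders (one producing a box, one producing a mask), whereas pix2seq writes the coordinates out as text.

```python
import math
import random

BINS = 1000  # assumed quantization resolution for the pix2seq-style baseline


def pix2seq_encode(box, bins=BINS):
    """pix2seq style: write normalized (x1, y1, x2, y2) as a string of integer bins."""
    return " ".join(str(min(bins - 1, int(v * bins))) for v in box)


def make_linear(rng, n_in, n_out):
    """Random weights standing in for a trained decoder layer (illustrative only)."""
    weight = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


def linear(x, params):
    """Apply y = W x + b using plain Python lists."""
    weight, bias = params
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decode_box(embedding, params):
    """Hypothetical box decoder: embedding -> (cx, cy, w, h) -> clamped corner box."""
    cx, cy, w, h = (sigmoid(z) for z in linear(embedding, params))
    return (max(0.0, cx - w / 2), max(0.0, cy - h / 2),
            min(1.0, cx + w / 2), min(1.0, cy + h / 2))


def decode_mask(embedding, params, height, width):
    """Hypothetical mask decoder: embedding -> per-pixel logits -> binary grid."""
    logits = linear(embedding, params)
    return [[1 if logits[r * width + c] > 0 else 0 for c in range(width)]
            for r in range(height)]


rng = random.Random(0)
EMB_DIM, H, W = 8, 4, 4  # assumed sizes, far smaller than any real model
box_params = make_linear(rng, EMB_DIM, 4)
mask_params = make_linear(rng, EMB_DIM, H * W)

# Stand-in for the location embedding the LMM would output.
location_embedding = [rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]

# The same embedding is handed to two different decoders.
box = decode_box(location_embedding, box_params)
mask = decode_mask(location_embedding, mask_params, H, W)

print("pix2seq text:", pix2seq_encode((0.25, 0.125, 0.75, 0.5)))
print("pix2emb box:", box)
print("pix2emb mask rows:", len(mask))
```

In this sketch the location format is decided by the decoder rather than by the text vocabulary, which is the flexibility the abstract attributes to pix2emb. A real system would use trained decoders instead of the random weights used here.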
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
