SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition

Hao Tan; Zichang Tan; Jun Li; Jun Wan; Zhen Lei; Stan Z. Li

arXiv:2407.20920·cs.CV·July 31, 2024

TL;DR

This paper introduces SSPA, a framework that improves multi-label image recognition by drawing on large language model knowledge through split-and-synthesize prompting and gated dual-modal alignments, and reports state-of-the-art results on nine datasets across three domains.

Contribution

The paper proposes a new SSPA framework combining in-context learning, split-and-synthesize prompting, and gated dual-modal alignments to improve multi-label image recognition performance.

Findings

01 Achieves state-of-the-art results on nine datasets across three domains.

02 Effectively models label semantics and visual features separately and jointly.

03 Demonstrates improved interpretability and generalizability of the model.

Abstract

Multi-label image recognition is a fundamental task in computer vision. Recently, Vision-Language Models (VLMs) have made notable advancements in this area. However, previous methods fail to effectively leverage the rich knowledge in language models and often incorporate label semantics into visual features unidirectionally. To overcome these problems, we propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of VLMs. Specifically, we develop an in-context learning approach to associate the inherent knowledge from LLMs. Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually and then aggregate them carefully through the quaternion network. Moreover, we present Gated Dual-Modal Alignments (GDMA) to bidirectionally interact visual and linguistic…
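
The abstract says the two knowledge sources are aggregated "through the quaternion network" but gives no further detail here. As background only, the operation that quaternion networks are generally built on is the Hamilton product, which mixes four components with shared, sign-structured weights. The sketch below shows that product in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the name `hamilton_product` and the tuple layout `(r, i, j, k)` are hypothetical choices made for this example, and how SSPA maps its prompt features onto quaternion components is not specified in the text above.

```python
# Illustrative sketch only; NOT the SSPA authors' code.
# Assumption: a quaternion is stored as a tuple (r, i, j, k).
from typing import Tuple

Quaternion = Tuple[float, float, float, float]


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Return the Hamilton product p * q (non-commutative)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


if __name__ == "__main__":
    i = (0, 1, 0, 0)
    j = (0, 0, 1, 0)
    print(hamilton_product(i, j))  # i * j = k
    print(hamilton_product(j, i))  # j * i = -k
```

Because every output component depends on all four input components, a layer built from this product can exchange information between the parts it combines while using fewer free parameters than an unconstrained linear map of the same size.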

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques