Unified Coding for Both Human Perception and Generalized Machine Analytics with CLIP Supervision
Kangsheng Yin, Quan Liu, Xuelin Shen, Yulin He, Wenhan Yang, Shiqi Wang

TL;DR
This paper introduces UG-ICM, a unified image coding model that leverages CLIP supervision and adaptive decoding to support both human perception and machine analytics with a single, self-supervised bitstream, enhancing generalization and versatility.
Contribution
It proposes a novel unified coding framework using CLIP-based supervision and conditional decoding, enabling support for both human and machine vision tasks without task-specific training.
Findings
Achieves significant improvements in unseen machine analytics tasks.
Provides perceptually satisfying images for human viewers.
Supports dual-purpose decoding with a single bitstream.
Abstract
Image compression models have long struggled with adaptability and generalization, as the decoded bitstream typically serves only human or machine needs and fails to preserve information for unseen visual tasks. This paper therefore introduces supervision obtained from multimodal pre-training models and incorporates adaptive multi-objective optimization to support both human visual perception and machine vision simultaneously with a single bitstream, a framework denoted Unified and Generalized Image Coding for Machine (UG-ICM). Specifically, to remove the reliance of compression models on downstream task supervision, we introduce Contrastive Language-Image Pre-training (CLIP) models into the training constraint for improved generalization. Global-to-instance-wise CLIP supervision is applied to help obtain hierarchical semantics that make models more…
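The excerpt above does not give the paper's loss formulation, so the following is only a minimal sketch of what a global-to-instance-wise embedding-consistency term could look like. Everything in it is an assumption for illustration: the function names, the weights `w_global` and `w_instance`, the box format, and especially `make_standin_encoder`, which is a fixed random projection standing in for a frozen CLIP image encoder so the example runs with NumPy alone.

```python
import numpy as np


def cosine_distance(a, b, eps=1e-8):
    """Return 1 - cosine similarity between two 1-D embedding vectors."""
    a = a / (np.linalg.norm(a) + eps)
    b = b / (np.linalg.norm(b) + eps)
    return 1.0 - float(np.dot(a, b))


def make_standin_encoder(dim=64, grid=8, seed=0):
    """Hypothetical stand-in for a frozen CLIP image encoder.

    It subsamples an H x W x 3 image onto a grid x grid lattice and applies
    a fixed random linear projection. A real setup would call an actual
    CLIP image encoder here instead.
    """
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((grid * grid * 3, dim))

    def encode(img):
        h, w, _ = img.shape
        ys = np.linspace(0, h - 1, grid).astype(int)
        xs = np.linspace(0, w - 1, grid).astype(int)
        small = img[np.ix_(ys, xs)]  # shape: (grid, grid, 3)
        return small.reshape(-1) @ proj

    return encode


def clip_supervision_loss(original, decoded, boxes, encode,
                          w_global=1.0, w_instance=1.0):
    """Assumed form: global embedding distance plus mean per-instance distance.

    boxes is a list of (y0, x0, y1, x1) crops marking object instances.
    """
    global_term = cosine_distance(encode(original), encode(decoded))
    instance_terms = [
        cosine_distance(encode(original[y0:y1, x0:x1]),
                        encode(decoded[y0:y1, x0:x1]))
        for (y0, x0, y1, x1) in boxes
    ]
    instance_term = float(np.mean(instance_terms)) if instance_terms else 0.0
    return w_global * global_term + w_instance * instance_term


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    original = rng.random((32, 32, 3))
    decoded = original + 0.1 * rng.standard_normal(original.shape)
    boxes = [(0, 0, 16, 16), (8, 8, 32, 32)]  # illustrative instance crops
    encode = make_standin_encoder()
    print(clip_supervision_loss(original, decoded, boxes, encode))
```

In this sketch the global term compares whole-image embeddings and the instance term compares embeddings of matching crops, so a decoded image that preserves scene-level semantics but distorts individual objects would still be penalized. Whether UG-ICM combines the two levels in this way is not stated in the excerpt.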
Taxonomy
Topics: Neural Networks and Applications · Digital Image Processing Techniques · CCD and CMOS Imaging Sensors
Methods: Contrastive Language-Image Pre-training
