MCM: Multi-layer Concept Map for Efficient Concept Learning from Masked Images
Yuwei Sun, Lu Mi, Ippei Fujisawa, Ruiqiao Mei, Jimin Chen, Siyu Zhu, Ryota Kanai

TL;DR
This paper introduces MCM, an efficient multi-layer concept learning method using masked images with Transformer models, reducing computational costs and enabling targeted image generation through concept token editing.
Contribution
It proposes a novel asymmetric architecture that correlates encoder and decoder layers for concept learning from masked images, a first in vision tasks.
Findings
Reduces training data by over 25% while improving concept prediction.
Enables targeted image generation by editing concept tokens.
Provides flexible image reconstruction by adjusting mask ratios.
Abstract
Masking strategies commonly employed in natural language processing are still underexplored in vision tasks such as concept learning, where conventional methods typically rely on full images. However, using masked images diversifies perceptual inputs, potentially offering significant advantages in concept learning with large-scale Transformer models. To this end, we propose Multi-layer Concept Map (MCM), the first work to devise an efficient concept learning method based on masked images. In particular, we introduce an asymmetric concept learning architecture by establishing correlations between different encoder and decoder layers, updating concept tokens using backward gradients from reconstruction tasks. The learned concept tokens at various levels of granularity help either reconstruct the masked image patches by filling in gaps or guide the reconstruction results in a direction…
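As a rough illustration of the masked-input setup the abstract describes, the sketch below splits an image's patch indices into visible and masked sets and passes only the visible patches on to an encoder, in the MAE style that the reviews compare MCM to. It is a minimal sketch, not the authors' implementation: the function names `random_patch_mask` and `encoder_inputs`, the 196-patch grid, and the 0.75 mask ratio are all assumptions chosen for the example.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Randomly split patch indices into (visible, masked) lists.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


def encoder_inputs(patches, visible):
    """Keep only the visible patches; the masked ones never reach the encoder."""
    return [patches[i] for i in visible]


# Assumed setup: a 14x14 patch grid (196 patches) and a 0.75 mask ratio.
patches = [f"patch_{i}" for i in range(196)]
visible, masked = random_patch_mask(len(patches), mask_ratio=0.75, seed=0)
tokens = encoder_inputs(patches, visible)
print(len(tokens), len(masked))  # 49 147
```

Under these assumptions the encoder processes 49 of the 196 patches. Changing `mask_ratio` changes how much of the image is left for the decoder to reconstruct, which is the knob the Findings refer to when they mention adjusting mask ratios.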
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is well written and easy to follow.
- The model trained on CelebA shows both quantitative and qualitative improvements over baselines.
- The asymmetric architecture makes training efficient.
- The novelty is somewhat incremental, as it mainly integrates known components into a single framework.
- The method is only evaluated on CelebA, which is a relatively simple and small dataset; it is unclear whether the approach generalizes to more complex or non-face domains.
- The model needs a pretrained CLIP to obtain the concept embeddings, so the gains may reflect knowledge distilled from CLIP rather than the effect of the proposed method itself.
- According to the ablation the proposed looses,
1. MCM is an interesting way to generate counterfactual predictions.
2. The proposed method learns strong concept representations.
3. MCM enables flexible test-time control over edit strength via mask ratio, which is an appealing property.
1. The architecture largely resembles prior cross-attention masked autoencoders; the main semantic capability is imported from CLIP embeddings rather than emerging from reconstruction. As such, the methodological novelty appears incremental.
2. While MCM “does not require binary concept labels for training” (L358), it uses CLIP embeddings as targets derived from those exact binary labels - an almost equivalent form of supervision.
3. The experimental section is limited to only the CELEB-A datase
1. This paper has a solid motivation. The authors tackle concept learning from masked images, which is underexplored, and propose MCM as an explicit solution.
2. MCM benefits from an asymmetric encoder-decoder design, similar to MAE, where only the visible patches are passed through the encoder. This design promotes efficiency.
3. The work evaluates across multiple MCM sizes and performs comprehensive ablations that isolate each proposed component.
4. Empirical results show good tradeoff in
1. All experiments are on a single dataset, CelebA, with only 11 attributes. Without empirical results on datasets with other attributes, it is unclear whether the method generalizes beyond faces or to richer concept taxonomies.
2. The disentanglement loss and the weighted concept loss are strictly tied to the predefined list of concepts. This limits continual learning and expansion of the concept set.
3. The authors include the disentanglement loss to forcibly disable self-attention among concept tokens, which can hi
MCM is demonstrated on the CelebA dataset for predicting face attribute concepts and reconstructing masked images, which enables image editing by manipulating concept tokens. The authors highlight architectural novelty and improved efficiency as the main contributions.
The paper suffers from several critical weaknesses in novelty, experimental validation, and clarity of motivation, as detailed below.
1. The architectural contributions are not truly novel. The asymmetric encoder–decoder with a lightweight decoder for masked image modeling is directly inspired by MAE; thus, MCM’s use of a mask-based asymmetric architecture is an application of known methods rather than a new invention. Further, introducing learnable concept tokens is positioned as a key novelty
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
