Gaussian Masked Autoencoders
Jathushan Rajasegaran, Xinlei Chen, Rulilong Li, Christoph Feichtenhofer, Jitendra Malik, Shiry Ginosar

TL;DR
GMAE introduces a novel self-supervised learning framework that combines semantic abstraction with explicit spatial understanding using Gaussian primitives, enabling zero-shot spatial tasks while maintaining high-level semantics.
Contribution
It is the first to incorporate Gaussian primitives into image representation learning beyond single-scene reconstructions, enhancing spatial awareness in self-supervised models.
Findings
Enables zero-shot spatial tasks like segmentation and edge detection.
Preserves high-level semantic representations similar to MAE.
Introduces Gaussian splatting for image rendering in self-supervised learning.
Abstract
This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. While reconstructive self-supervised learning frameworks such as MAE learn good semantic abstractions, they are not trained for explicit spatial awareness. Our approach, named Gaussian Masked Autoencoder, or GMAE, aims to learn semantic abstractions and spatial understanding jointly. Like MAE, it reconstructs the image end-to-end in the pixel space, but beyond MAE, it also introduces an intermediate, 3D Gaussian-based representation and renders images via splatting. We show that GMAE can enable various zero-shot learning capabilities of spatial understanding (e.g., figure-ground segmentation, image layering, edge detection, etc.) while preserving the high-level semantics of self-supervised representation quality from MAE. To our knowledge, we are the first to employ Gaussian primitives in an image representation…
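The rendering step the abstract describes (a set of Gaussians turned into pixels via splatting) can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes isotropic Gaussians already projected to 2D, normalized image coordinates, and a plain Python loop in place of a differentiable rasterizer. The function name `splat_gaussians` and all parameter names and values are hypothetical.

```python
import numpy as np

def splat_gaussians(centers, depths, scales, colors, opacities, height=32, width=32):
    """Render isotropic 2D Gaussians front to back with alpha compositing.

    Simplifying assumptions (not from the paper): Gaussians are already
    projected to the image plane and have a single scalar scale each.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    # Pixel centers in normalized [0, 1] coordinates.
    px = (xs + 0.5) / width
    py = (ys + 0.5) / height
    image = np.zeros((height, width, 3))
    transmittance = np.ones((height, width))  # light not yet absorbed per pixel
    # Nearest Gaussians (smallest depth) are composited first.
    for i in np.argsort(depths):
        d2 = (px - centers[i, 0]) ** 2 + (py - centers[i, 1]) ** 2
        alpha = opacities[i] * np.exp(-0.5 * d2 / scales[i] ** 2)
        image += (transmittance * alpha)[..., None] * colors[i]
        transmittance *= 1.0 - alpha
    # Return the RGB image and the accumulated per-pixel coverage.
    return image, 1.0 - transmittance


rng = np.random.default_rng(0)
n = 64  # hypothetical number of Gaussians
image, coverage = splat_gaussians(
    centers=rng.uniform(0, 1, (n, 2)),
    depths=rng.uniform(0, 1, n),
    scales=rng.uniform(0.02, 0.1, n),
    colors=rng.uniform(0, 1, (n, 3)),
    opacities=rng.uniform(0.2, 0.9, n),
)
```

In a sketch like this, the depth ordering determines which Gaussian occludes which, so thresholding on depth would split the Gaussians into nearer and farther groups. Whether GMAE derives its layering this way is not stated in the text above.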
Peer Reviews
Decision: Submitted to ICLR 2025
Originality: The proposed method of using 3D Gaussians as their intermediate representation is original and interesting. However, the related work section misses very related work and focuses on some more irrelevant topics (discussed more later).
Quality: The proposed representation seems to learn better reconstructions compared to MAE. However, beyond this, I personally do not agree with the proposed evaluations to show the benefits of this representation. No comparisons are made to any other
Related Work: The paper does not talk about any related work on using mid-level representations in vision beyond using learned "tokens". The authors misrepresent MAE as only training for pixel reconstruction. MAE has an ablation experiment where they also use tokens to explore the "best of both worlds" approach that the authors suggest they take. MAE-VQGAN proposed in Bar et al. 2022 is also a tokenized MAE learner. Other mid-level representations can be thought of that are similar to t
- The paper explores an interesting topic of adding additional inductive biases to self-supervised image representation learning techniques.
- The writing is clear and well-structured.
- The experiments section includes a wide variety of downstream applications and comparisons.
- As talked about in the Discussion section, the number of Gaussians used in GMAE is significantly lower than the quantities typically used in scene reconstruction applications, where Gaussian splatting is well-known. This is because each Gaussian corresponds to a unique token in the lightweight decoder, so increasing their number would cause considerable slowdowns.
- Minor typo on Line 503: Fig 12 → Fig 11
- I guess the main strength is some zero-shot capabilities, like foreground/background separation and edge detection.
- Despite the unconventional design, it does not lead to the loss of the main quality of self-supervised methods:
- The overall idea is quite unusual which, I believe, is a good quality of a scientific paper.
- Writing is very clear and the presentation quality is high.
- The method looks very unnatural and simply combines 2 popular ideas: 3D Gaussians and MAEs. There are no particular advantages or insights in combining them. I feel the benefits are marginal and not worth the complications of the design.
- Zero-shot capabilities are not convincing: there are easier ways to obtain them with a higher quality (e.g., generative methods or generative multi-plane images with similar layered representations).
- The main advantage I would hope to see is having some 3
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Processing and 3D Reconstruction
Methods: Masked autoencoder
