TL;DR
3D PixBrush is a novel method for automatic, image-guided local texture editing on 3D meshes, producing coherent and precise localizations and textures without user input.
Contribution
It introduces a localization-modulated image guidance technique and predicts localizations from scratch, enabling accurate, automatic local edits on 3D meshes.
Findings
Effective local texture synthesis on diverse meshes
Automatic localization without user input
Improved coherence and precision in local edits
Abstract
We present 3D PixBrush, a method for performing image-driven edits of local regions on 3D meshes. 3D PixBrush predicts a localization mask and a synthesized texture that faithfully portray the object in the reference image. Our predicted localizations are both globally coherent and locally precise. Globally, our method contextualizes the object in the reference image and automatically positions it onto the input mesh. Locally, our method produces masks that conform to the geometry of the reference image. Notably, our method does not require any user input (in the form of scribbles or bounding boxes) to achieve accurate localizations. Instead, our method predicts a localization mask on the 3D mesh from scratch. To achieve this, we propose a modification to the score distillation sampling technique which incorporates both the predicted localization and the reference image, referred to…
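The abstract says the guidance combines the predicted localization with the reference image, and the reviews below describe this as a mask applied inside the IP-Adapter cross-attention. The paper's exact formulation is not reproduced on this page, so the following is only a minimal sketch under stated assumptions: the text and image conditions use separate (decoupled) attention branches, and a per-location mask value in [0, 1] gates the image branch. The function names, the `image_scale` parameter, and the gating rule are all hypothetical and chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of float vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


def masked_image_cross_attention(queries, text_kv, image_kv, mask, image_scale=1.0):
    """Text branch plus a mask-gated image branch (illustrative assumption).

    queries:  one feature vector per spatial location
    text_kv:  (keys, values) derived from the text condition
    image_kv: (keys, values) derived from the reference image
    mask:     one value in [0, 1] per spatial location; 0 disables the
              image condition at that location, 1 applies it fully
    """
    text_out = attention(queries, *text_kv)
    image_out = attention(queries, *image_kv)
    return [
        [t + image_scale * m * i for t, i in zip(t_row, i_row)]
        for t_row, i_row, m in zip(text_out, image_out, mask)
    ]
```

In this sketch, a location with mask value 0 receives only the text-conditioned features, while a location with mask value 1 also receives the full image-conditioned features. That is one simple way to keep the reference image's influence confined to the predicted region.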
Peer Reviews
Decision: Submitted to ICLR 2026
S1. The storyline of this paper is clear and easy to follow.
S2. Editing the local texture of a mesh driven by a reference image is an interesting topic.
S3. The proposed framework is reasonable.
W1. In Section 3.4, the authors claim that the key to their method is the ability to supervise the local texture edits with image guidance. However, the use of SDS with image guidance has already been explored in the literature [1, 2]. The novelty of the method and its contributions are limited.
W2. The paper reports only the CLIP score as a quantitative result, which is not a comprehensive quantitative evaluation. All the ablation studies are demonstrated by qualitative examples. I hav
1. **Good design of editing pipeline**: I like the overall design of the editing pipeline, where the localization and texture predictions evolve through two branches, and the localization is then used in LMIG. Although it still needs some warmup steps for localization, it is already very different from prior methods, which usually adopt two or three separate editing stages.
2. **Novel design of the masked cross-attention**: Although the method is well-grounded on previous techniques, it still introduces smar
1. Although the design of the localization prediction is appreciated, it is limited to text prompts. Do the authors think this framework also supports a user-specified mask or localization?
2. **Computational Efficiency**: Optimization requires 4 hours per edit on an A40 GPU, with reasonable results after 1 hour. The paper lacks comparisons to faster alternatives or a discussion of acceleration.
3. **Limited Editing Operations Scale**: It seems like it only supports texture synthesis on the medium scale p
1. It further introduces a Localization Map–Integrated Guidance (LMIG) mechanism, which integrates a mask into the cross-attention module of the IP-Adapter to ensure that the generated texture aligns with the predicted localization map.
2. Extensive experiments demonstrate the effectiveness and superior performance of the proposed approach.
1. The novelty of this work is limited, as the IP-Adapter used for image conditioning is already widely adopted in the community, and the proposed LMIG merely incorporates a predicted mask into the attention map.
2. The paper employs the IP-Adapter as the image guidance mechanism, extending 3D PaintBrush from text-guided editing to image-guided texture generation. However, since 3D PaintBrush already predicts both the localization map and the local texture, the novelty of this extension is rather
