Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models
Ruisi Zhao, Haoren Zheng, Zongxin Yang, Hehe Fan, Yi Yang

TL;DR
Stroke3D introduces a novel two-stage framework that generates rigged 3D meshes from 2D strokes and text prompts, enabling more intuitive creation of animatable 3D assets.
Contribution
It presents the first method to generate rigged 3D meshes conditioned on user-drawn 2D strokes, combining skeleton generation and mesh synthesis with latent diffusion models.
Findings
Produces plausible skeletons and high-quality meshes
Enables intuitive creation of rigged 3D models from 2D inputs
Outperforms existing methods in structural control and quality
Abstract
Rigged 3D assets are fundamental to 3D deformation and animation. However, existing 3D generation methods face challenges in generating animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. To address these limitations, we introduce Stroke3D, a novel framework that directly generates rigged meshes from user inputs: 2D drawn strokes and a descriptive text prompt. Our approach pioneers a two-stage pipeline that separates the generation into: 1) Controllable Skeleton Generation: we employ the Skeletal Graph VAE (Sk-VAE) to encode the skeleton's graph structure into a latent space, where the Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding. The generation process is conditioned on both the text for semantics and the 2D strokes for explicit structural control, with the VAE's decoder reconstructing the final high-quality 3D…
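The conditioning described in the abstract, together with the classifier-free guidance mentioned in the reviews, can be illustrated with a minimal sketch. It assumes the standard single-scale guidance rule and treats text and strokes as one joint condition; the function name `cfg_combine`, the list-of-floats representation, and the guidance scale are hypothetical and are not taken from the paper, which may combine the two conditions differently.

```python
from typing import List


def cfg_combine(eps_uncond: List[float], eps_cond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance on a flattened noise prediction.

    eps_uncond: prediction with the condition dropped (assumed: no text, no strokes).
    eps_cond:   prediction with the condition present (assumed: text + strokes).
    scale:      guidance scale; 0 returns eps_uncond, 1 returns eps_cond.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    # Move from the unconditional prediction toward (and past) the conditional one.
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    guided = cfg_combine([0.0, 1.0], [1.0, 3.0], scale=2.0)
    print(guided)
```

A scale above 1 extrapolates beyond the conditional prediction, which is the usual way to trade sample diversity for closer adherence to the condition.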
Peer Reviews
Decision: ICLR 2026 Poster
1. Quality. The technical description is detailed and concrete, including the conditioning mechanism, classifier-free guidance, and the SKA-DPO objective. Training protocols and ablations (e.g., stroke guidance aiding convergence; the DPO margin study) are appropriate. Quantitatively, Stroke3D improves CD metrics and SKA scores over strong baselines. 2. Clarity. The end-to-end pipeline is clearly presented with informative figures (data preparation and overall architecture), and the contribution
1. Data dependence and coverage. Performance and generalization rely on curated sets (MagicArticulate, SKDream, and TextuRig). The paper acknowledges dataset limitations and sensitivity to rare concepts (e.g., plants before DPO). Scaling and coverage remain open issues. 2. Stroke simulation vs. real inputs. Structural conditioning is trained on perturbed 2D projections of 3D skeletons rather than large-scale human sketches. This domain gap may reduce robustness to messy real drawings; analysis of
1. The paper addresses an interesting and novel task: generating 3D articulated objects from a combination of 2D strokes and text descriptions. The problem itself is well-motivated and interesting. 2. The work improves upon existing baselines in several areas, including dataset curation, VAE design, and a post-hoc optimization stage. This results in a stronger baseline for future research in rigged 3D generation. 3. The paper is well-written, clearly structured, and easy to follow.
1. My main concern revolves around the paper's technical novelty. The proposed method appears to be a successful combination of existing technologies: a data curation step that refines datasets from prior work (UniRig, MagicArticulate); slight modifications to the existing SKDream model architecture; and the application of a post-training optimization phase, which is a common technique for refinement. Consequently, the contribution seems to lie more in effective engineering and integration t
1. First framework enabling rigged 3D generation directly from 2D strokes and text. 2. Outperforms baselines such as RigNet, UniRig, and SKDream, achieving better Chamfer Distance and SKA scores. 3. Introduces TextuRig, a dataset of textured and rigged 3D meshes with captions.
1. Other pipelines generate skeletons from 3D meshes, while this paper generates skeletons directly from 2D strokes. Therefore, comparing skeleton quality metrics between these methods is not entirely fair, since the input modalities differ significantly in both information richness and structural constraints. 2. The contributions of this paper are relatively limited. It mainly proposes a model that generates 3D skeletons from 2D strokes and introduces a modest extension of the SKDream dataset,
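Several reviews above refer to Chamfer Distance (CD) as a skeleton-quality metric. As a rough illustration of what such a metric measures, here is a minimal sketch of one common variant: the symmetric, mean, unsquared Chamfer Distance between two 3D point sets. The exact variant used in the paper (squared or unsquared, joint-to-joint or joint-to-bone) is not stated in this summary, and the function name `chamfer_distance` and the sample joint coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric mean Chamfer Distance between two non-empty 3D point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def one_sided(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Average, over src, of the distance to the nearest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(a, b) + one_sided(b, a)


if __name__ == "__main__":
    # Hypothetical predicted and reference joint positions.
    predicted = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.1, 0.0)]
    reference = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
    print(chamfer_distance(predicted, reference))
```

This brute-force version is quadratic in the number of points, which is adequate for skeletons with tens or hundreds of joints; larger point sets would normally use a spatial index.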
Taxonomy
Topics: 3D Shape Modeling and Analysis · Human Motion and Animation · Computer Graphics and Visualization Techniques
