LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation
Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo

TL;DR
LEGO-Eval is a new evaluation framework that improves the assessment of 3D scene synthesis guided by detailed instructions, revealing current methods' limitations and emphasizing the need for better grounded scene generation.
Contribution
The paper introduces LEGO-Eval, a tool for more accurate alignment assessment of 3D scenes with instructions, and LEGO-Bench, a benchmark of detailed environment instructions.
Findings
LEGO-Eval outperforms existing VLM-based judges by 0.41 F1 score.
Current methods achieve at most 10% success in fully aligning scenes with instructions.
Benchmark reveals significant gaps in current 3D scene synthesis approaches.
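The two headline numbers above can be made concrete with a small sketch. It assumes that the judge's F1 is computed from binary aligned / not-aligned verdicts compared against human labels, and that "fully aligning" means every constraint of an instruction is satisfied. The function names are hypothetical and are not taken from the paper's code.

```python
from typing import List


def f1_score(pred: List[bool], gold: List[bool]) -> float:
    """F1 of binary judge verdicts (pred) against human labels (gold)."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0  # avoids division by zero when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def holistic_success_rate(scenes: List[List[bool]]) -> float:
    """Fraction of scenes whose constraints are ALL satisfied (assumed definition)."""
    if not scenes:
        return 0.0
    return sum(all(constraints) for constraints in scenes) / len(scenes)


def partial_success_rate(scenes: List[List[bool]]) -> float:
    """Fraction of individual constraints satisfied, pooled over all scenes."""
    flat = [c for constraints in scenes for c in constraints]
    return sum(flat) / len(flat) if flat else 0.0
```

Under these assumptions a generator can satisfy most individual constraints and still score low on the holistic rate, because a single violated constraint fails the whole scene.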
Abstract
Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess…
Peer Reviews
Decision·Submitted to ICLR 2026
Instead of relying on one AI model to just "look" at the scene, the paper introduces LEGO-EVAL, which acts more like a detective. It uses a set of specialized "tools" to check specific facts—one tool to find all the objects, another to check their color, and another to measure their spatial relationships. The authors created their own difficult test (called LEGO-BENCH) full of complex instructions. They proved their new "judge" (LEGO-EVAL) is far more accurate than older methods. Experiments a
The paper introduces a new test set called LEGO-BENCH, but it only contains 130 instructions. This is a very small number, which might not be enough to prove the necessity of making such a benchmark. In fact, there are many indoor scene synthesis benchmarks and it is not even worthwhile to start a new language-instruction synthesis from scratch. In LEGO-BENCH, the scenes used to test the evaluator were created "manually." This process is very slow, expensive, and hard to scale. Utilizing a sequ
1. LEGO-Eval’s tool-grounded pipeline drives a striking jump in F1 versus the usual VLM-as-judge baselines, showing that explicit grounding leads to better alignment verdicts.
2. LEGO-Bench is valuable: 130 instructions with roughly 1.2k hand-checked constraints covering both architectural makeup and object relations give the community a realistic, fine-grained stress test. The field of scene graphs, while tangential to this paper, _also_ incidentally lacks high-quality fine-grained annotations
1. The paper does not provide conclusive evidence (or even a brief discussion) for the claim that finer-grained text-scene alignment leads to real embodied gains. The paper does provide _preliminary_ evidence via the Holodeck refinement vignette (Fig. 7); however, there’s no “detect -> repair -> retrain” loop or even a pointer to existing sim-to-real failures. A minimal downstream study (or stronger citations) would make the story much more convincing.
2. LEGO-Eval leans on several Unity-facing
- Reframes 3D-scene evaluation as a tool-augmented reasoning task. Combining constraint extraction, planning, and multimodal tool calls for grounding is a novel and well-motivated contribution.
- The pipeline and tool taxonomy are well-explained. Figures and examples make the method intuitive.
- Strong experiments with fair baselines (e.g., CLIPScore, SceneEval). Clear metrics, ablations, and human alignment analyses.
- Simulator dependency: LEGO-EVAL assumes access to the scene graph and Unity backend. This may not be available in many real settings, such as scenes built from photorealistic assets.
- Scene limitation: LEGO-BENCH is limited to indoor scenes. Broader or more varied data would strengthen claims.
- Failure analysis: It's unclear which constraint types cause most errors for baselines.
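The tool-grounded checking that the reviews describe (one tool to find objects, another to check color, another to measure spatial relationships) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `SceneObject` fields, the tool names, and the distance thresholds are hypothetical, and the actual LEGO-Eval pipeline extracts constraints and plans tool calls against a Unity-backed scene rather than a Python list.

```python
from dataclasses import dataclass
from math import dist
from typing import Callable, Dict, List, Tuple


@dataclass
class SceneObject:
    category: str
    color: str
    position: Tuple[float, float, float]  # hypothetical (x, y, z) in metres


Scene = List[SceneObject]


# --- hypothetical "tools": each answers one narrow, checkable question ---
def find_objects(scene: Scene, category: str) -> List[SceneObject]:
    return [o for o in scene if o.category == category]


def check_color(scene: Scene, category: str, color: str) -> bool:
    return any(o.color == color for o in find_objects(scene, category))


def check_near(scene: Scene, cat_a: str, cat_b: str, max_dist: float) -> bool:
    return any(
        dist(a.position, b.position) <= max_dist
        for a in find_objects(scene, cat_a)
        for b in find_objects(scene, cat_b)
    )


def evaluate(scene: Scene, constraints: Dict[str, Callable[[Scene], bool]]) -> Dict[str, bool]:
    """Run every constraint check and return a per-constraint verdict."""
    return {name: check(scene) for name, check in constraints.items()}


demo_scene = [
    SceneObject("sofa", "red", (0.0, 0.0, 0.0)),
    SceneObject("table", "brown", (1.0, 0.0, 0.0)),
    SceneObject("lamp", "white", (5.0, 0.0, 0.0)),
]
demo_constraints = {
    "a red sofa": lambda s: check_color(s, "sofa", "red"),
    "table within 1.5 m of sofa": lambda s: check_near(s, "table", "sofa", 1.5),
    "lamp within 1 m of sofa": lambda s: check_near(s, "lamp", "sofa", 1.0),
}
verdicts = evaluate(demo_scene, demo_constraints)
scene_aligned = all(verdicts.values())  # aligned only if every constraint holds
```

Per-constraint verdicts of this kind would also support the failure analysis requested above, since they show which constraint types fail most often.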
Taxonomy
Topics: 3D Shape Modeling and Analysis · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
