LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

Gyeom Hwangbo; Hyungjoo Chae; Minseok Kang; Hyeonjong Ju; Soohyun Oh; Jinyoung Yeo

arXiv:2511.03001·cs.CL·January 29, 2026

Open Access · 1 Dataset · 3 Reviews

TL;DR

LEGO-Eval is a new evaluation framework that improves the assessment of 3D scene synthesis guided by detailed instructions, revealing current methods' limitations and emphasizing the need for better grounded scene generation.

Contribution

The paper introduces LEGO-Eval, an evaluation framework that more accurately assesses how well generated 3D scenes align with their instructions, and LEGO-Bench, a benchmark of detailed environment instructions.

Findings

01

LEGO-Eval outperforms existing VLM-based judges by 0.41 in F1 score.

02

Current methods achieve at most 10% success in fully aligning scenes with instructions.

03

Benchmark reveals significant gaps in current 3D scene synthesis approaches.
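
The two headline numbers in the findings (an F1 gap between judges, and a success rate for fully aligned scenes) can be made concrete with a minimal sketch. It assumes a judge emits one binary satisfied/unsatisfied verdict per constraint and that a scene counts as fully aligned only when every one of its constraints is satisfied. The function names and toy data are hypothetical, and the paper's exact metric definitions may differ.

```python
from typing import Dict, List


def f1_score(predicted: List[bool], gold: List[bool]) -> float:
    """F1 of a judge's binary verdicts against human labels (positive = satisfied)."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def holistic_success_rate(verdicts: Dict[str, List[bool]]) -> float:
    """Fraction of scenes in which every constraint is satisfied (assumed definition)."""
    if not verdicts:
        return 0.0
    return sum(all(v) for v in verdicts.values()) / len(verdicts)


# Toy data, not taken from the paper.
judge_verdicts = [True, True, False, False]
human_labels = [True, False, True, False]
scene_verdicts = {"scene_a": [True, True], "scene_b": [True, False]}

judge_f1 = f1_score(judge_verdicts, human_labels)
success = holistic_success_rate(scene_verdicts)
```

In this toy case the judge has one true positive, one false positive, and one false negative, so its F1 is 0.5; one of the two scenes satisfies all of its constraints, so the holistic success rate is also 0.5. The all-or-nothing criterion is what makes a figure such as "at most 10%" plausible even when most individual constraints are met.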

Abstract

Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

Instead of relying on one AI model to just "look" at the scene, the paper introduces LEGO-EVAL, which acts more like a detective. It uses a set of specialized "tools" to check specific facts—one tool to find all the objects, another to check their color, and another to measure their spatial relationships. The authors created their own difficult test (called LEGO-BENCH) full of complex instructions. They proved their new "judge" (LEGO-EVAL) is far more accurate than older methods. Experiments a

Weaknesses

The paper introduces a new test set called LEGO-BENCH, but it only contains 130 instructions. This is a very small number, which might not be enough to prove the necessity of making such a benchmark. In fact, there are many indoor scene synthesis benchmarks and it is not even worthwhile to start a new language-instruction synthesis from scratch. In LEGO-BENCH, the scenes used to test the evaluator were created "manually." This process is very slow, expensive, and hard to scale. Utilizing a sequ

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. LEGO-Eval’s tool-grounded pipeline drives a striking jump in F1 versus the usual VLM-as-judge baselines, showing that explicit grounding leads to better alignment verdicts.
2. LEGO-Bench is valuable: 130 instructions with roughly 1.2k hand-checked constraints covering both architectural makeup and object relations give the community a realistic, fine-grained stress test. The field of scene graphs, while tangential to this paper, _also_ incidentally lacks high-quality fine-grained annotations

Weaknesses

1. The paper does not provide conclusive evidence for (or even a brief discussion of) the claim that finer-grained text-scene alignment leads to real embodied gains. The paper does provide _preliminary_ evidence via the Holodeck refinement vignette (Fig. 7); however, there’s no “detect -> repair -> retrain” loop or even a pointer to existing sim-to-real failures. A minimal downstream study (or stronger citations) would make the story much more convincing.
2. LEGO-Eval leans on several Unity-facing

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Reframes 3D-scene evaluation as a tool-augmented reasoning task. Combining constraint extraction, planning, and multimodal tool calls for grounding is a novel and well-motivated contribution.
- The pipeline and tool taxonomy are well-explained. Figures and examples make the method intuitive.
- Strong experiments with fair baselines (e.g., CLIPScore, SceneEval). Clear metrics, ablations, and human alignment analyses.

Weaknesses

- Simulator dependency: LEGO-EVAL assumes access to the scene graph and Unity backend. This may not be available in many real settings like photorealistic assets.
- Scene limitation: LEGO-BENCH is limited to indoor scenes. Broader or more varied data would strengthen claims.
- Failure analysis: It's unclear which constraint types cause most errors for baselines.
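
The tool-augmented idea the reviewers describe (extract constraints, then call a dedicated tool per constraint against scene data) can be illustrated with a toy sketch. It assumes the scene is available as a flat list of objects with a name, a color, and a 3D position, and that constraints have already been extracted into structured tuples. The tool names `find_objects`, `check_color`, and `check_near` are hypothetical and are not LEGO-Eval's actual tools or API.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SceneObject:
    name: str
    color: str
    position: Tuple[float, float, float]  # assumed to be in metres


def find_objects(scene: List[SceneObject], name: str) -> List[SceneObject]:
    """Hypothetical tool: return all objects with the given name."""
    return [obj for obj in scene if obj.name == name]


def check_color(scene: List[SceneObject], name: str, color: str) -> bool:
    """Hypothetical tool: does at least one object with this name have this color?"""
    return any(obj.color == color for obj in find_objects(scene, name))


def check_near(scene: List[SceneObject], a: str, b: str, max_dist: float) -> bool:
    """Hypothetical tool: is some `a` within `max_dist` of some `b`?"""
    return any(
        math.dist(x.position, y.position) <= max_dist
        for x in find_objects(scene, a)
        for y in find_objects(scene, b)
    )


# Map each constraint type to the tool that verifies it.
TOOLS: Dict[str, Callable[..., bool]] = {"color": check_color, "near": check_near}


def evaluate(scene: List[SceneObject], constraints: List[Tuple[str, tuple]]) -> List[bool]:
    """Return one grounded verdict per constraint by dispatching to its tool."""
    return [TOOLS[kind](scene, *args) for kind, args in constraints]


# Toy scene and constraints, not taken from LEGO-Bench.
scene = [
    SceneObject("sofa", "red", (0.0, 0.0, 0.0)),
    SceneObject("lamp", "white", (1.0, 0.0, 0.0)),
    SceneObject("table", "brown", (5.0, 0.0, 0.0)),
]
constraints = [
    ("color", ("sofa", "red")),
    ("near", ("lamp", "sofa", 1.5)),
    ("near", ("table", "sofa", 1.5)),
]
verdicts = evaluate(scene, constraints)
```

Each verdict here is derived from explicit scene data, not from a model's impression of a rendered image. That is the property the reviewers credit for the accuracy gain, and it is also the source of the simulator dependency raised in the weaknesses.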

Code & Models

Datasets

LEGO-Eval/LEGO_Bench
dataset· 68 dl

Taxonomy

Topics: 3D Shape Modeling and Analysis · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis