Structured Self-Consistency: A Multi-Task Evaluation of LLMs on VirtualHome

Jiaqi Xu; Tao Huang; Kai Zhang

arXiv:2602.00611 · cs.AI · February 4, 2026

TL;DR

This paper evaluates large language models on the VirtualHome benchmark for embodied AI tasks. It introduces Structured Self-Consistency to improve structured output quality and finds that different models have complementary strengths.

Contribution

The paper presents a comprehensive multi-task evaluation of LLMs in embodied AI and proposes Structured Self-Consistency (SSC), a decoding strategy that samples multiple outputs and applies domain-specific voting to improve output quality on structured generation tasks.

Findings

01 SSC improves model performance significantly.

02 OPENPANGU-7B excels in hierarchical planning.

03 QWEN2.5-7B performs better on action-level tasks.

Abstract

Embodied AI requires agents to understand goals, plan actions, and execute tasks in simulated environments. We present a comprehensive evaluation of Large Language Models (LLMs) on the VirtualHome benchmark using the Embodied Agent Interface (EAI) framework. We compare two representative 7B-parameter models, OPENPANGU-7B and QWEN2.5-7B, across four fundamental tasks: Goal Interpretation, Action Sequencing, Subgoal Decomposition, and Transition Modeling. We propose Structured Self-Consistency (SSC), an enhanced decoding strategy that leverages multiple sampling with domain-specific voting mechanisms to improve output quality for structured generation tasks. Experimental results demonstrate that SSC significantly enhances performance, with OPENPANGU-7B excelling at hierarchical planning while QWEN2.5-7B shows advantages in action-level tasks. Our analysis reveals complementary strengths…
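
The abstract describes SSC only as multiple sampling combined with domain-specific voting, so the following is a minimal sketch of that general idea and not the authors' implementation. It assumes a hypothetical plain-text plan format (one action per line, such as `WALK kitchen`), a hypothetical `parse_plan` normalizer, and simple majority voting over the parsed plans; the sampled strings stand in for outputs that would come from a model.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

Action = Tuple[str, ...]   # e.g. ("WALK", "kitchen"); format is an assumption
Plan = Tuple[Action, ...]


def parse_plan(text: str) -> Optional[Plan]:
    """Normalize a sampled output into a canonical, hashable plan.

    Hypothetical format: one action per line, verb first, then arguments.
    Returns None when no action can be parsed, so the sample gets no vote.
    """
    steps = []
    for line in text.strip().splitlines():
        tokens = line.split()
        if not tokens:
            continue
        steps.append((tokens[0].upper(), *[t.lower() for t in tokens[1:]]))
    return tuple(steps) if steps else None


def vote(samples: Iterable[str]) -> Optional[Plan]:
    """Return the most frequent parsed plan among the samples.

    Voting happens on the normalized structure, not the raw string, so
    samples that differ only in case or spacing count as the same answer.
    """
    parsed = [p for p in (parse_plan(s) for s in samples) if p is not None]
    if not parsed:
        return None
    return Counter(parsed).most_common(1)[0][0]


if __name__ == "__main__":
    # Stand-ins for several sampled model outputs on one task.
    sampled = [
        "walk Kitchen\ngrab cup",
        "WALK kitchen\nGRAB cup",
        "WALK bedroom",
    ]
    print(vote(sampled))
```

Voting on the parsed structure is the part that makes the scheme "structured": malformed samples are discarded before the vote, and surface variation does not split the count. A task-specific variant could replace `parse_plan` with a parser for goal predicates or transition effects.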

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Artificial Intelligence in Games