STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs

Sungeun An; Swanand Ravindra Kadhe; Shailja Thakur; Chad DeLuca; Hima Patel

arXiv:2604.18177 · cs.CL · April 22, 2026

TL;DR

The paper introduces STaD, a framework for systematically revealing compositional reasoning skill gaps in LLMs by generating scaffolded task variations, enabling scalable and detailed model analysis.

Contribution

STaD provides a novel method for controlled, incremental task variation to identify specific reasoning skill deficiencies in LLMs.

Findings

01 Identified multiple failure points across three reasoning benchmarks.

02 Revealed distinct skill gaps for each of the six models evaluated.

03 Enabled systematic, scalable probing of model reasoning while treating the LLM as a black box.

Abstract

Benchmarks are often used as a standard to understand LLM capabilities in different domains. However, aggregate benchmark scores provide limited insight into compositional skill gaps of LLMs and how to improve them. To make these weaknesses visible, we propose the Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark tasks based on the concept of scaffolding, which introduces structured, incremental support in a step-by-step manner. Rather than inspecting failures individually, this approach enables systematic and scalable probing of model behavior by identifying the specific reasoning skill compositions they lack. Treating the LLM as a black box, our experiments on six models of varying sizes reveal multiple failure points in three reasoning benchmarks and highlight each model's unique and distinct skill gaps.
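
To make the scaffolding idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a task can be written as a question plus an ordered list of intermediate reasoning steps, and that a scaffolded variant reveals the first k of those steps as hints. The `Task` schema, the function names, and `toy_model` (a stand-in for a black-box LLM call) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Task:
    """Hypothetical schema: a benchmark item with ordered intermediate steps."""
    question: str
    steps: List[str]
    answer: str


def scaffolded_variants(task: Task) -> List[str]:
    """Build prompts that reveal 0, 1, ..., len(steps) intermediate steps."""
    prompts = []
    for k in range(len(task.steps) + 1):
        hints = "".join(
            f"Step {i + 1}: {step}\n" for i, step in enumerate(task.steps[:k])
        )
        prompts.append(f"{task.question}\n{hints}Answer:")
    return prompts


def first_sufficient_scaffold(
    task: Task, model: Callable[[str], str]
) -> Optional[int]:
    """Return the smallest number of revealed steps at which the black-box
    model answers correctly, or None if it never does."""
    for k, prompt in enumerate(scaffolded_variants(task)):
        if model(prompt).strip() == task.answer:
            return k
    return None


def toy_model(prompt: str) -> str:
    """Assumed stand-in for an LLM: it only succeeds once the
    multiplication result has been supplied in the prompt."""
    return "17" if "3 * 4 = 12" in prompt else "unknown"


task = Task(
    question="What is 3 * 4 + 5?",
    steps=["3 * 4 = 12", "12 + 5 = 17"],
    answer="17",
)
level = first_sufficient_scaffold(task, toy_model)
print(level)
```

In this toy setup the model fails without hints and succeeds once the first step is revealed, which points to the first step as the missing skill. Aggregating such levels over many tasks would give a per-model profile of where support is needed, under the assumptions stated above.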

Code & Models

Datasets

ibm-research/STaD
dataset · 984 downloads
