Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence
Bhavik Agarwal, Ishan Joshi, Viktoria Rojkova

TL;DR
This paper presents a reinforcement learning approach to improve large language models' ability to strictly follow predefined schemas, using a resource-efficient pipeline that combines synthetic data and custom rewards.
Contribution
It introduces a novel reinforcement learning pipeline with synthetic reasoning data and custom reward functions to enhance schema adherence in LLMs, building on the DeepSeek R1 framework.
Findings
The model effectively enforces schema consistency in text generation.
Training is resource-efficient, requiring approximately 20 hours on an 8xH100 GPU cluster.
The model outperforms comparable models in real-world schema adherence tasks.
Abstract
In this paper, we address the challenge of enforcing strict schema adherence in large language model (LLM) generation by leveraging LLM reasoning capabilities. Building on the DeepSeek R1 reinforcement learning framework, our approach trains the structured reasoning skills of a 1.5B-parameter model through a novel pipeline that combines synthetic reasoning dataset construction with custom reward functions under Group Relative Policy Optimization (GRPO). Specifically, we first perform R1 reinforcement learning on a 20K-sample unstructured-to-structured dataset, mirroring the original DeepSeek R1 methods, to establish core reasoning abilities. Subsequently, we perform supervised fine-tuning on a separate 10K-sample reasoning dataset, focusing on refining schema adherence for downstream tasks. Despite the relatively modest training scope, requiring approximately 20 hours on an 8xH100 GPU…
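To make the idea of a custom schema reward under GRPO concrete, here is a minimal Python sketch. It is not the authors' reward function, which the abstract does not specify. The schema `SCHEMA`, the scoring rule in `schema_reward`, and the helper `group_relative_advantages` are all hypothetical, chosen only for illustration. The one part taken from GRPO itself is the normalization of each reward against the mean and standard deviation of its sampled group.

```python
import json
from statistics import mean, pstdev

# Hypothetical target schema (an assumption): field name -> expected Python type.
SCHEMA = {"name": str, "age": int}


def schema_reward(completion: str, schema: dict) -> float:
    """Illustrative reward in [0, 1] for how well a completion matches the schema.

    Returns 0.0 if the text is not a JSON object. Otherwise returns the number
    of schema fields present with exactly the expected type, divided by the
    number of schema fields plus any unexpected extra keys.
    """
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    # Strict type match, so that e.g. a JSON boolean does not pass as an int.
    matched = sum(1 for key, typ in schema.items()
                  if key in obj and type(obj[key]) is typ)
    extra = len(set(obj) - set(schema))
    return matched / (len(schema) + extra)


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize rewards within one sampled group, as GRPO does:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: score a group of sampled completions for one prompt.
group = ['{"name": "Ada", "age": 36}', '{"name": "Ada"}', "not json"]
rewards = [schema_reward(c, SCHEMA) for c in group]
advantages = group_relative_advantages(rewards)
```

In this sketch, completions that satisfy the schema more fully than the rest of their group receive positive advantages, and malformed outputs receive negative ones. A real pipeline would likely validate against a full JSON Schema and combine several reward terms.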
