StructuredRAG: JSON Response Formatting with Large Language Models
Connor Shorten, Charles Pierse, Thomas Benjamin Smith, Erika Cardenas, Akanksha Sharma, John Trengrove, Bob van Luijt

TL;DR
StructuredRAG introduces a benchmark to evaluate LLMs' ability to generate JSON responses. It reveals high variability in performance, influenced by task complexity, and motivates further research into improving structured output reliability.
Contribution
This work presents StructuredRAG, a new benchmark with evaluation strategies, and provides insights into factors affecting LLMs' structured output generation performance.
Findings
Average success rate of 82.55% across tasks
High variance in performance from 0 to 100%
Task complexity impacts output accuracy
Abstract
The ability of Large Language Models (LLMs) to generate structured outputs, such as JSON, is crucial for their use in Compound AI Systems. However, evaluating and improving this capability remains challenging. In this work, we introduce StructuredRAG, a benchmark of six tasks designed to assess LLMs' proficiency in following response format instructions. We evaluate two state-of-the-art LLMs, Gemini 1.5 Pro and Llama 3 8B-instruct with 4-bit quantization, using two distinct prompting strategies. We introduce these prompting strategies as f-String and Follow the Format (FF) prompting. Across 24 experiments, we find an average success rate of 82.55%. We further find a high variance in performance across tasks, models, and prompting strategies, with success rates ranging from 0 to 100%. We find that Llama 3 8B-instruct often performs competitively with Gemini 1.5 Pro. We observe that task…
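To make the notion of a "success rate" for response formatting concrete, the following is a minimal sketch of how such a check might be scored. It is not the benchmark's own evaluation code, which this page does not describe. The function names `is_valid_response` and `success_rate`, the `{"answer": str}` schema, and the sample responses are all hypothetical, chosen only for illustration.

```python
import json


def is_valid_response(raw: str, expected_types: dict) -> bool:
    """Return True if `raw` parses as a JSON object whose expected keys
    are present and hold values of the expected Python types.

    Hypothetical helper: the checking rules here are an assumption,
    not the paper's published scoring procedure.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    for key, expected_type in expected_types.items():
        if key not in parsed:
            return False
        value = parsed[key]
        # bool is a subclass of int in Python, so reject it for int fields.
        if expected_type is int and isinstance(value, bool):
            return False
        if not isinstance(value, expected_type):
            return False
    return True


def success_rate(responses: list, expected_types: dict) -> float:
    """Percentage of responses that pass the format check."""
    if not responses:
        return 0.0
    passed = sum(is_valid_response(r, expected_types) for r in responses)
    return 100.0 * passed / len(responses)


# Illustrative (made-up) model outputs for a task expecting {"answer": <string>}.
sample_responses = [
    '{"answer": "Paris"}',   # valid
    'The answer is Paris.',  # not JSON
    '{"answer": 42}',        # wrong value type
    '{"answer": "Rome"}',    # valid
]
print(success_rate(sample_responses, {"answer": str}))  # 50.0
```

A strict check of this kind treats any surrounding prose or wrong value type as a failure, which is one way a success rate could fall to 0% on a harder task while reaching 100% on a simpler one.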
Taxonomy
Topics: Topic Modeling
