Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates
Avanika Narayan, Mayee F. Chen, Kush Bhatia, Christopher Ré

TL;DR
Cookbook introduces a scalable, privacy-preserving framework for generating instruction data using programmatic templates, significantly enhancing LLM performance across multiple tasks without manual data curation.
Contribution
The paper presents a novel method for automatically generating instruction datasets with templates, improving LLM capabilities while avoiding privacy and legal issues.
Findings
Fine-tuning with Cookbook data improves task accuracy by up to 52.7 points.
Mistral-7B fine-tuned with Cookbook data outperforms other models on average.
The approach enhances model adherence to task-specific templates.
Abstract
Fine-tuning large language models (LLMs) on instruction datasets is a common way to improve their generative capabilities. However, instruction datasets can be expensive and time-consuming to manually curate, and while LLM-generated data is less labor-intensive, it may violate user privacy agreements or terms of service of LLM providers. Therefore, we seek a way of constructing instruction datasets with samples that are not generated by humans or LLMs but still improve LLM generative capabilities. In this work, we introduce Cookbook, a framework that programmatically generates training data consisting of simple patterns over random tokens, resulting in a scalable, cost-effective approach that avoids legal and privacy issues. First, Cookbook uses a template -- a data generating Python function -- to produce training data that encourages the model to learn an explicit pattern-based rule…
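To make the idea of a template concrete, here is a minimal sketch of a data generating Python function that builds samples from a simple pattern over random tokens. It is an illustration only, not the authors' code: the key-value retrieval pattern, the `tok<i>` vocabulary, and the names `make_vocab`, `retrieval_template`, and `generate_dataset` are all assumptions chosen for this example.

```python
import random


def make_vocab(size=1000):
    # Hypothetical vocabulary of opaque tokens; real token sets may differ.
    return [f"tok{i}" for i in range(size)]


def retrieval_template(rng, vocab, num_pairs=4):
    """Build one sample: list random key-value pairs, then ask for one key's value.

    The rule the model is meant to pick up is explicit: the answer is the
    token paired with the queried key in the context.
    """
    tokens = rng.sample(vocab, 2 * num_pairs)  # distinct random tokens
    keys, values = tokens[:num_pairs], tokens[num_pairs:]
    query = rng.randrange(num_pairs)
    context = " ".join(f"{k} -> {v} ;" for k, v in zip(keys, values))
    prompt = f"Context: {context}\nQuestion: {keys[query]}\nAnswer:"
    return {"input": prompt, "output": values[query]}


def generate_dataset(n, seed=0):
    # A seeded generator keeps the dataset reproducible.
    rng = random.Random(seed)
    vocab = make_vocab()
    return [retrieval_template(rng, vocab) for _ in range(n)]


dataset = generate_dataset(3)
for sample in dataset:
    print(sample["input"])
    print(sample["output"])
    print("---")
```

Because the samples come from a function rather than from people or another model, the dataset can be made as large as needed by raising `n`, and no human-written or LLM-generated text enters the training data.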
Taxonomy
Topics: Open Education and E-Learning
