How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi

TL;DR
This paper systematically evaluates instruction-tuned open models of various sizes on multiple tasks, revealing strengths, limitations, and the need for improved data and models to match proprietary systems.
Contribution
The paper provides a comprehensive evaluation framework and introduces Tülu, its best-performing instruction-tuned model suite, while showing how different instruction datasets affect specific model skills.
Findings
Different datasets enhance specific skills but no single dataset excels across all tasks.
Model and human preferences do not fully align with benchmark-based evaluations.
The best model in any given evaluation reaches, on average, 87% of ChatGPT performance and 73% of GPT-4 performance.
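One plausible reading of the relative-performance figures above is: for each evaluation, take the best open model's score, divide it by the proprietary model's score, and average those ratios. The sketch below illustrates that arithmetic only. The function name `relative_performance`, the model and task names, and all scores are hypothetical placeholders, not values from the paper, and the paper's exact aggregation may differ.

```python
from statistics import mean


def relative_performance(open_scores, reference_scores):
    """Average over evaluations of (best open-model score / reference score), in percent.

    open_scores: {model_name: {task_name: score}}
    reference_scores: {task_name: score} for the proprietary reference model
    """
    ratios = []
    for task, reference in reference_scores.items():
        # Pick the best open model separately for each evaluation.
        best = max(scores[task] for scores in open_scores.values())
        ratios.append(best / reference)
    return 100 * mean(ratios)


# Hypothetical scores, for illustration only (not results from the paper).
open_scores = {
    "model_a": {"task_1": 40.0, "task_2": 30.0},
    "model_b": {"task_1": 35.0, "task_2": 45.0},
}
reference_scores = {"task_1": 50.0, "task_2": 60.0}

print(round(relative_performance(open_scores, reference_scores), 1))
```

Note that the best model is chosen per evaluation, so the averaged figure is an upper envelope over the model suite and no single open model necessarily reaches it across all tasks.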
Abstract
In this work we explore recent advances in instruction-tuning language models on a range of open instruction-following datasets. Despite recent claims that open models can be on par with state-of-the-art proprietary models, these claims are often accompanied by limited evaluation, making it difficult to compare models across the board and determine the utility of various resources. We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets ranging from manually curated (e.g., OpenAssistant) to synthetic and distilled (e.g., Alpaca) and systematically evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities through a collection of automatic, model-based, and human-based metrics. We further introduce Tülu, our best performing instruction-tuned model suite…
Code & Models
- 🤗 allenai/open-instruct-dolly-7b (model) · 21 dl
- 🤗 allenai/open-instruct-oasst1-7b (model) · 20 dl
- 🤗 allenai/open-instruct-flan-v2-7b (model) · 22 dl · ♡ 1
- 🤗 allenai/open-instruct-sni-7b (model) · 29 dl
- 🤗 allenai/open-instruct-cot-7b (model) · 24 dl · ♡ 1
- 🤗 allenai/open-instruct-sharegpt-7b (model) · 19 dl
- 🤗 allenai/open-instruct-baize-7b (model) · 23 dl
- 🤗 allenai/open-instruct-self-instruct-7b (model) · 27 dl
- 🤗 allenai/tulu-7b (model) · 43 dl · ♡ 9
- 🤗 allenai/open-instruct-gpt4-alpaca-7b (model) · 25 dl · ♡ 1
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Absolute Position Encodings
