Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning
Murong Yue, Jie Zhao, Min Zhang, Liang Du, Ziyu Yao

TL;DR
This paper introduces an LLM cascade approach that uses answer consistency and mixture of thought representations to reduce reasoning task costs while maintaining high performance, by selectively deploying stronger models only when needed.
Contribution
It proposes a novel cascade pipeline utilizing answer consistency and mixture of thought representations to efficiently allocate LLM resources for reasoning tasks.
Findings
Achieves performance comparable to using the stronger LLM alone at 40% of the cost.
Uses the answer consistency of the weaker LLM as a signal of question difficulty.
Demonstrates effectiveness on six reasoning benchmark datasets.
Abstract
Large language models (LLMs) such as GPT-4 have exhibited remarkable performance in a variety of tasks, but this strong performance often comes with the high expense of using paid API services. In this paper, we are motivated to study building an LLM cascade to save the cost of using LLMs, particularly for performing reasoning (e.g., mathematical, causal) tasks. Our cascade pipeline follows the intuition that simpler questions can be addressed by a weaker but more affordable LLM, whereas only the challenging questions necessitate the stronger and more expensive LLM. To realize this decision-making, we consider the "answer consistency" of the weaker LLM as a signal of the question difficulty and propose several methods for the answer sampling and consistency checking, including one leveraging a mixture of two thought representations (i.e., Chain-of-Thought and Program-of-Thought).…
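The routing idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`vote`, `cascade`), the sample count `k`, the agreement `threshold`, and the use of simple majority agreement as the consistency check are all assumptions made for the example. The three model arguments are hypothetical callables that take a question and return a final answer string; in practice they would wrap calls to a weaker LLM prompted with Chain-of-Thought, the same weaker LLM prompted with Program-of-Thought, and a stronger LLM.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: a sampler maps a question to one final answer string.
Sampler = Callable[[str], str]


def vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of samples that agree with it."""
    counts = Counter(answers)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(answers)


def cascade(
    question: str,
    weak_cot: Sampler,       # weaker LLM, Chain-of-Thought prompt (assumed wrapper)
    weak_pot: Sampler,       # weaker LLM, Program-of-Thought prompt (assumed wrapper)
    strong: Sampler,         # stronger, more expensive LLM (assumed wrapper)
    k: int = 3,              # samples per representation (illustrative value)
    threshold: float = 0.8,  # minimum agreement to accept the weak answer (illustrative)
) -> Tuple[str, str]:
    """Answer with the weaker LLM when its samples agree, else defer to the stronger LLM."""
    # Mix the two thought representations into one pool of sampled answers.
    answers = [weak_cot(question) for _ in range(k)]
    answers += [weak_pot(question) for _ in range(k)]

    answer, agreement = vote(answers)
    if agreement >= threshold:
        # Consistent answers are treated as a sign of an easy question.
        return answer, "weak"
    # Inconsistent answers are treated as a sign of a hard question.
    return strong(question), "strong"
```

With stub samplers that always agree, `cascade` returns the weak answer and never calls the stronger model; when the two representations disagree, the agreement drops below the threshold and the question is escalated. Raising `threshold` sends more questions to the stronger model, trading cost for accuracy.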
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Dropout · Attention Dropout · Dense Connections · Linear Layer
