FLM-101B: An Open LLM and How to Train It with $100K Budget
Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Xuying Meng, Siqi Fan, Peng Han, Jing Li, Li Du, Bowen Qin, Zheng Zhang, Aixin Sun, Yequan Wang

TL;DR
This paper introduces FLM-101B, a large language model trained with a progressive growth strategy under a $100K budget. It reaches 80% of the baselines' performance with only 10% of their floating-point operations, supporting more cost-effective and lower-carbon LLM training.
Contribution
The paper presents a progressive training method for LLMs beyond 100B parameters and demonstrates that such a model can be trained within a $100K budget at reduced computational cost.
Findings
FLM-101B reaches 80% of the baselines' performance.
It does so with only 10% of the baselines' floating-point operations.
The model is publicly available for community use.
Abstract
Large language models (LLMs) are considered important approaches towards foundational machine intelligence, achieving remarkable success in Natural Language Processing and multimodal tasks, among others. However, the carbon footprints and financial costs originating from heavy pre-training computation are a non-negligible issue. Progressive training methods, inspired by the neurogenesis process that grows neural structures, have shown potential to accelerate LLM pre-training. However, the algorithms, implementation, and practices for progressively training LLMs beyond 100B parameters remain underexplored. In this paper, we show that our model, namely FLM-101B, trained with our growth strategy under a budget of $100K, reaches 80% of the baselines' performance with only 10% of their floating-point operations. We believe that further studies on progressive training will benefit the…
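To illustrate why growing a model during training can cut compute, the sketch below compares a staged schedule against training the final size from the start. It assumes the common rule of thumb of roughly 6 floating-point operations per parameter per training token for dense transformers. The stage sizes and token counts are hypothetical placeholders, not the schedule used for FLM-101B, and the function names are introduced here only for illustration.

```python
# Illustrative sketch only. The stage sizes and token counts are made-up
# placeholders, not the actual FLM-101B training schedule.

def training_flops(params: float, tokens: float) -> float:
    """Rough training cost of a dense transformer: ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


def progressive_flops(stages: list[tuple[float, float]]) -> float:
    """Total cost when each stage trains a (smaller) model on its share of tokens."""
    return sum(training_flops(params, tokens) for params, tokens in stages)


# Hypothetical growth schedule: (parameter count, tokens seen at that size).
stages = [
    (10e9, 200e9),   # small model sees most of the data
    (50e9, 60e9),    # grown once
    (100e9, 40e9),   # grown to the final size
]

total_tokens = sum(tokens for _, tokens in stages)
final_params = stages[-1][0]

grown = progressive_flops(stages)
fixed = training_flops(final_params, total_tokens)  # final size from step 0
ratio = grown / fixed

print(f"progressive schedule: {grown:.2e} FLOPs")
print(f"fixed-size training:  {fixed:.2e} FLOPs")
print(f"cost ratio:           {ratio:.2f}")
```

The ratio here compares a grown model against the same final model trained at full size on the same tokens. The 10% figure in the abstract is measured against the baselines' floating-point operations, so this sketch shows the mechanism rather than reproducing that number.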
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Softmax · Linear Layer · Residual Connection · Weight Decay · Byte Pair Encoding · Dropout · Adam
