PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency
Preferred Elements: Kenshin Abe, Kaizaburo Chubachi, Yasuhiro Fujita, Yuta Hirokawa, Kentaro Imajo, Toshiki Kataoka, Hiroyoshi Komatsu, Hiroaki Mikami, Tsuguo Mogami, Shogo Murai, Kosuke Nakago, Daisuke Nishino, Toru Ogawa, Daisuke Okanohara, Yoshihiko Ozaki, Shotaro Sano

TL;DR
PLaMo-100B is a large-scale language model for Japanese, trained from scratch on 2 trillion tokens. It uses QK Normalization and Z-Loss for training stability and is refined with Supervised Fine-Tuning and Direct Preference Optimization, achieving results on Japanese tasks that are competitive with frontier models.
Contribution
The paper introduces PLaMo-100B, a language model built from scratch for Japanese proficiency, combining stability-oriented training techniques (QK Normalization, Z-Loss) with post-training (Supervised Fine-Tuning, Direct Preference Optimization).
Findings
Achieved competitive performance on Japanese-specific benchmarks.
Used QK Normalization and Z-Loss for training stability.
The base model is publicly available at https://huggingface.co/pfnet/plamo-100b.
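The two stability techniques named in the findings can be illustrated with a small sketch. This is not the PLaMo-100B implementation, which this page does not describe. It assumes the commonly described forms of both ideas: QK Normalization normalizes queries and keys before the attention dot product, and Z-Loss adds a penalty on the squared log of the softmax normalizer. The function names, the choice of RMS normalization without a learned gain, and the coefficient `z_coef=1e-4` are illustrative assumptions.

```python
import math

def rms_normalize(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (no learned gain, for brevity)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def qk_norm_scores(queries, keys):
    """Attention logits from normalized queries and keys.

    Normalizing first bounds each logit by sqrt(d), regardless of
    how large the raw query/key activations grow during training.
    """
    d = len(queries[0])
    qn = [rms_normalize(q) for q in queries]
    kn = [rms_normalize(k) for k in keys]
    return [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in kn]
            for q in qn]

def log_sum_exp(logits):
    """Numerically stable log of the softmax normalizer Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def cross_entropy_with_z_loss(logits, target, z_coef=1e-4):
    """Cross-entropy plus an auxiliary penalty z_coef * log(Z)^2.

    z_coef is an illustrative value, not taken from the source.
    Returns (total, cross_entropy, z_loss).
    """
    log_z = log_sum_exp(logits)
    ce = log_z - logits[target]
    z_loss = z_coef * log_z ** 2
    return ce + z_loss, ce, z_loss
```

Shifting all output logits by a constant leaves the cross-entropy unchanged but increases the Z-Loss term, which is why the penalty discourages logit drift.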
Abstract
We introduce PLaMo-100B, a large-scale language model designed for Japanese proficiency. The model was trained from scratch using 2 trillion tokens, with techniques such as QK Normalization and Z-Loss to ensure training stability. Post-training techniques, including Supervised Fine-Tuning and Direct Preference Optimization, were applied to refine the model's performance. Benchmark evaluations suggest that PLaMo-100B performs well, particularly in Japanese-specific tasks, achieving results that are competitive with frontier models like GPT-4. The base model is available at https://huggingface.co/pfnet/plamo-100b.
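As a rough illustration of the Direct Preference Optimization step mentioned in the abstract, the sketch below computes the standard DPO loss for one preference pair. It is not the authors' training code. The function name, its arguments (summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model), and `beta=0.1` are assumptions made for the example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) response pair.

    loss = -log(sigmoid(beta * margin)), where margin measures how much
    more the policy favors the chosen response than the reference does.
    beta is an illustrative value, not taken from the source.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference.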
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Balanced Selection · Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding
