Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
Mullosharaf K. Arabov

TL;DR
This guide covers the entire NLP pipeline, from tokenisation to reinforcement learning from human feedback, with practical implementation details and a consistent emphasis on low-resource languages and reproducibility.
Contribution
It introduces original contributions for Tajik and Tatar (subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks) and integrates low-resource language adaptation into modern NLP workflows with reproducible research practices.
Findings
Demonstrated effective subword tokenisation for Tajik and Tatar.
Provided benchmarks for transliteration in low-resource languages.
Showcased open-source implementations aligned with modern NLP techniques.
Abstract
This preprint presents a systematic, research-oriented practicum that guides the reader through the entire modern NLP pipeline: from tokenisation and vectorisation to fine-tuning of large language models, retrieval-augmented generation, and reinforcement learning from human feedback. A distinctive feature of the work is its consistent attention to low-resource and morphologically rich languages -- original contributions on Tajik and Tatar, including subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks, are woven throughout the twelve sessions, demonstrating how modern NLP can be adapted to data-scarce environments without sacrificing rigour. Each session combines concise theory with detailed implementation plans, formalised evaluation metrics, and transparent assessment criteria. The work is not a conventional textbook: it is designed as a reproducible…
