HITgram: A Platform for Experimenting with n-gram Language Models

Shibaranjani Dasgupta; Chandan Maity; Somdip Mukherjee; Rohan Singh; Diptendu Dutta; Debasish Jana

arXiv:2412.10717·cs.CL·December 17, 2024

TL;DR

HITgram is a lightweight platform for experimenting with n-gram language models, making fast, flexible language modeling practical in resource-constrained environments.

Contribution

It introduces a resource-efficient platform for n-gram models (unigrams to 4-grams) with context-sensitive weighting and Laplace smoothing, built for speed and scalability.

Findings

01

Achieves 50,000 tokens/second in experiments

02

Constructs 2-grams from a 320 MB corpus in 62 seconds

03

Constructs 4-grams from a 1 GB file in under 298 seconds on an 8 GB RAM system

Abstract

Large language models (LLMs) are powerful but resource-intensive, which limits their accessibility. HITgram addresses this gap by offering a lightweight platform for n-gram model experimentation, suited to resource-constrained environments. It supports unigrams to 4-grams and incorporates features such as context-sensitive weighting, Laplace smoothing, and dynamic corpus management to enhance prediction accuracy, even for unseen word sequences. Experiments demonstrate HITgram's efficiency: it achieves 50,000 tokens/second and generates 2-grams from a 320 MB corpus in 62 seconds. HITgram also scales efficiently, constructing 4-grams from a 1 GB file in under 298 seconds on an 8 GB RAM system. Planned enhancements include multilingual support, advanced smoothing, parallel processing, and model saving, further broadening its utility.
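To make the role of Laplace smoothing concrete, here is a minimal Python sketch of add-one smoothing for a bigram model. It is an illustration only and is not taken from the HITgram code base: the function names `build_ngrams` and `laplace_prob` and the toy corpus are assumptions chosen for this example.

```python
from collections import Counter

def build_ngrams(tokens, n):
    """Count all contiguous n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def laplace_prob(word, context, ngram_counts, context_counts, vocab_size):
    """Add-one estimate: (count(context + word) + 1) / (count(context) + V)."""
    numerator = ngram_counts[context + (word,)] + 1
    denominator = context_counts[context] + vocab_size
    return numerator / denominator

# Toy corpus (hypothetical); a real corpus would be read and tokenized from disk.
tokens = "the cat sat on the mat the cat ran".split()
vocab = set(tokens)

bigrams = build_ngrams(tokens, 2)   # counts of (w1, w2)
unigrams = build_ngrams(tokens, 1)  # counts of (w1,), used as context counts

# "the cat" occurs in the corpus; "the ran" never does.
p_seen = laplace_prob("cat", ("the",), bigrams, unigrams, len(vocab))
p_unseen = laplace_prob("ran", ("the",), bigrams, unigrams, len(vocab))

print(f"P(cat | the) = {p_seen:.3f}")
print(f"P(ran | the) = {p_unseen:.3f}")
```

The unseen bigram receives a small nonzero probability instead of zero, which is how smoothing lets an n-gram model score word sequences absent from the training corpus. The same functions extend to 3-grams and 4-grams by passing a longer context tuple and the matching lower-order counts.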

Code & Models

Repositories

chandan789maity/hitgram (official)

Taxonomy

Topics: Natural Language Processing Techniques