OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph
Michael J. Bommarito II

TL;DR
OpenGloss is a large synthetic encyclopedic dictionary and semantic knowledge graph for English. It was generated with a multi-agent LLM pipeline in under one week for under $1,000, and it is intended to support NLP and educational applications.
Contribution
The paper introduces a synthetically generated lexical resource that combines lexicographic definitions, encyclopedic content, and semantic relationships, produced by an automated pipeline rather than manual curation.
Findings
- Contains 537K senses across 150K lexemes: lexeme coverage on par with WordNet 3.1, with more than four times as many sense definitions
- Generated in under one week for less than $1,000
- Provides integrated definitions, usage examples, collocations, and encyclopedic content
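The headline figures above imply a few per-unit ratios that are easy to check by hand. The following is a minimal sketch in Python, assuming the rounded counts quoted on this page (537K senses, 150K lexemes) and treating $1,000 as a cost ceiling; the constant names are illustrative, and the results are bounds derived from rounded inputs, not numbers reported by the paper.

```python
# Rounded figures quoted in the findings; exact counts may differ.
SENSES = 537_000
LEXEMES = 150_000
COST_CEILING_USD = 1_000  # the resource cost *less than* this amount

# Average number of senses attached to each lexeme.
senses_per_lexeme = SENSES / LEXEMES

# Upper bounds on unit cost, since the true total is below the ceiling.
max_cost_per_sense = COST_CEILING_USD / SENSES
max_cost_per_lexeme = COST_CEILING_USD / LEXEMES

print(f"senses per lexeme:  {senses_per_lexeme:.2f}")
print(f"cost per sense:   < ${max_cost_per_sense:.4f}")
print(f"cost per lexeme:  < ${max_cost_per_lexeme:.4f}")
```

On these inputs, each lexeme carries between three and four senses on average, and each sense costs well under one cent to generate.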
Abstract
We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as…
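The abstract mentions schema-validated LLM outputs with automated quality assurance but does not show the schema on this page. As an illustration of the general idea only, here is a minimal sketch using the Python standard library. The `Sense` record, its field names, and the helpers `parse_sense` and `validate_batch` are hypothetical and are not taken from the OpenGloss pipeline.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sense:
    """Hypothetical sense record; field names are assumptions."""
    lexeme: str
    part_of_speech: str
    definition: str
    examples: Tuple[str, ...]


# Expected JSON type for each required field (assumed schema).
REQUIRED_FIELDS = {
    "lexeme": str,
    "part_of_speech": str,
    "definition": str,
    "examples": list,
}


def parse_sense(raw: str) -> Sense:
    """Parse one raw model output, raising ValueError if it violates the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"wrong type for field: {name}")
    if not data["definition"].strip():
        raise ValueError("definition must not be empty")
    if not all(isinstance(example, str) for example in data["examples"]):
        raise ValueError("examples must all be strings")
    return Sense(
        lexeme=data["lexeme"],
        part_of_speech=data["part_of_speech"],
        definition=data["definition"],
        examples=tuple(data["examples"]),
    )


def validate_batch(raw_outputs: List[str]) -> Tuple[List[Sense], List[Tuple[str, str]]]:
    """Split raw outputs into accepted records and (raw, reason) rejections."""
    accepted: List[Sense] = []
    rejected: List[Tuple[str, str]] = []
    for raw in raw_outputs:
        try:
            accepted.append(parse_sense(raw))
        except ValueError as exc:
            rejected.append((raw, str(exc)))
    return accepted, rejected
```

In a sketch like this, rejected outputs keep their failure reason, so they could be regenerated or inspected rather than silently dropped; whether the actual pipeline does this is not stated here.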
Code & Models
- 🤗 mjbommar/ogbert-tokenizer-8192 (model)
- 🤗 mjbommar/ogbert-tokenizer-16384 (model)
- 🤗 mjbommar/ogbert-tokenizer-32768 (model)
- 🤗 mjbommar/ogbert-v1-mlm (model)
- 🤗 mjbommar/ogbert-tokenizer-8k (model)
- 🤗 mjbommar/ogbert-2m-base (model, 2 downloads)
- 🤗 mjbommar/ogbert-2m-sentence (model, 1 download)
- 🤗 mjbommar/ogbert-110m-base (model)
- 🤗 mjbommar/ogbert-110m-sentence (model, 3 downloads, 1 like)
Taxonomy
Topics: Natural Language Processing Techniques · Second Language Acquisition and Learning · Linguistics and Terminology Studies
