Identifier Namespaces in Mathematical Notation
Alexey Grigorev

TL;DR
This paper proposes a novel method for automatically discovering identifier namespaces in mathematical notation by applying document clustering techniques to group identifiers, validated on source code and Wikipedia datasets.
Contribution
It introduces the first dataset and approach for automatic namespace discovery in mathematical notation, adapting document clustering methods to this new problem.
Findings
Partial recovery of namespaces from source code using identifiers
Effective extraction of namespaces from Wikipedia articles across languages
Hierarchical organization of namespaces using existing classification schemes
Abstract
In this thesis, we look at the problem of assigning each identifier of a document to a namespace. At the moment, no dataset exists in which all identifiers are grouped into namespaces, so we need to create such a dataset ourselves. To do that, we need to find groups of documents that use identifiers in the same way, which can be done with cluster analysis methods. We argue that documents can be represented by the identifiers they contain, and that this approach is similar to representing textual information in the Vector Space Model. Because of this, we can apply traditional document clustering techniques to namespace discovery. Because the problem is new, there is no gold standard dataset, and it is hard to evaluate the performance of our method. To overcome this, we first use Java source code as a dataset for our experiments, since it contains the namespace…
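The idea of representing documents by their identifiers and then clustering them can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the thesis's actual pipeline: the toy documents, the function names (`tfidf_vectors`, `cosine`, `cluster`), the smoothed TF-IDF weighting, and the simple single-link threshold clustering used in place of whichever clustering algorithms the thesis evaluates.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a list of identifier strings) to a sparse TF-IDF dict."""
    n = len(docs)
    # Document frequency: in how many documents each identifier occurs.
    df = Counter(ident for doc in docs for ident in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (log(n/df) + 1) so common identifiers keep a nonzero weight.
        vectors.append({t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def cluster(vectors, threshold=0.5):
    """Single-link clustering: merge documents whose similarity reaches the threshold.

    This is an illustrative stand-in, not the method used in the thesis.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    # Relabel cluster roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(vectors))]

# Hypothetical toy corpus: each document is the list of identifiers it uses.
docs = [
    ["E", "m", "c"],                 # physics-style notation
    ["E", "m", "p", "c"],
    ["X", "mu", "sigma"],            # statistics-style notation
    ["X", "sigma", "mu", "n"],
]
vectors = tfidf_vectors(docs)
labels = cluster(vectors, threshold=0.5)
print(labels)
```

In this toy corpus the two physics-style documents share most of their identifiers, as do the two statistics-style documents, while the two groups share none, so each resulting cluster can be read as a candidate namespace. The threshold of 0.5 is arbitrary and would need tuning on real data.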
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Text Analysis Techniques · Web Data Mining and Analysis
