Modeling Text with Decision Forests using Categorical-Set Splits
Mathieu Guillame-Bert, Sebastian Bruch, Petr Mitrichev, Petr Mikheev, Jan Pfeifer

TL;DR
This paper introduces a novel categorical-set split condition for decision forests, enabling direct modeling of textual features without prior transformation, and demonstrates its effectiveness on text classification tasks.
Contribution
The work presents a new categorical-set split condition and an efficient learning algorithm, allowing decision forests to directly handle text features.
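As a rough illustration of what evaluating such a condition could look like, the sketch below assumes the condition is true when an example's set of tokens shares at least one item with a set learned at the node. This summary does not spell out the exact form, so the intersection test, the function name `condition_is_true`, and the example mask are all assumptions made for illustration, not the authors' implementation.

```python
def condition_is_true(tokens, mask):
    """Hypothetical categorical-set condition: true if the example's
    tokens and the node's learned mask have a non-empty intersection."""
    return bool(set(tokens) & set(mask))


# Hypothetical mask that a node might have learned for a "complaint" class.
mask = {"refund", "broken"}

# Whitespace tokenization is used here only to keep the example small.
complaint = "my order arrived broken".split()
praise = "great product fast shipping".split()

print(condition_is_true(complaint, mask))  # shares "broken" with the mask
print(condition_is_true(praise, mask))     # shares nothing with the mask
```

Under this assumption the text feature is consumed as an unordered set of tokens, with no embedding or bag-of-words vector built beforehand.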
Findings
Effective on benchmark text classification datasets
Fast evaluation with extended QuickScorer inference
Bridges the gap for modeling textual features in decision forests
Abstract
Decision forest algorithms typically model data by learning a binary tree structure recursively where every node splits the feature space into two sub-regions, sending examples into the left or right branch as a result. In axis-aligned decision forests, the "decision" to route an input example is the result of the evaluation of a condition on a single dimension in the feature space. Such conditions are learned using efficient, often greedy algorithms that optimize a local loss function. For example, a node's condition may be a threshold function applied to a numerical feature, and its parameter may be learned by sweeping over the set of values available at that node and choosing a threshold that maximizes some measure of purity. Crucially, whether an algorithm exists to learn and evaluate conditions for a feature type determines whether a decision forest algorithm can model that feature…
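The threshold example in the abstract (sweep over the values available at a node and keep the threshold that maximizes purity) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: Gini impurity is assumed as the purity measure, candidate thresholds are assumed to be midpoints between consecutive distinct values, and the names `gini` and `best_threshold` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Sweep sorted values and return (threshold, impurity_reduction).

    Returns (None, 0.0) when no split reduces the impurity.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini([y for _, y in pairs])
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain


threshold, gain = best_threshold([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
print(threshold, gain)
```

The sketch recomputes impurities from scratch at each candidate for clarity; practical implementations typically update class counts incrementally during the sweep.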
Taxonomy
Topics: Data Mining Algorithms and Applications · Topic Modeling · Rough Sets and Fuzzy Logic
