# Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning

**Authors:** D. Paul Joseph, Viswanathan Perumal

PMC · DOI: 10.7717/peerj-cs.2608 · 2025-03-05

## TL;DR

This paper introduces the SDOT Forensic Classification System (SFCS), which improves forensic file classification accuracy and efficiency by adding a βk hyperparameter that guides topic modeling toward curated seed words.

## Contribution

The novel βk hyperparameter enhances seed word selection through semantic and contextual similarity evaluation.
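
To make the idea of similarity-based seed word selection concrete, here is a minimal sketch. It is not the paper's βk definition, which this summary does not spell out. It assumes that candidate words and known blacklisted keywords already have word vectors, and it keeps a candidate when its cosine similarity to any reference vector reaches a threshold. The function names, the toy 3-dimensional vectors, and the 0.7 threshold are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_seed_words(candidates, reference_vectors, threshold=0.7):
    """Keep candidates whose best similarity to any reference vector
    meets the threshold (an illustrative rule, not the paper's)."""
    seeds = []
    for word, vec in candidates.items():
        best = max(cosine_similarity(vec, ref) for ref in reference_vectors)
        if best >= threshold:
            seeds.append(word)
    return seeds

# Toy vectors, invented for illustration only.
candidates = {
    "transfer": [0.9, 0.1, 0.0],
    "laundering": [0.8, 0.3, 0.1],
    "picnic": [0.0, 0.1, 0.9],
}
reference_vectors = [[1.0, 0.2, 0.0]]  # stands in for a blacklisted keyword

seeds = select_seed_words(candidates, reference_vectors)
print(seeds)
```

In this toy setup the two words that point in roughly the same direction as the reference vector are kept and the unrelated word is dropped. A real pipeline would use trained embeddings, and the abstract indicates that βk also accounts for contextual similarity, which this sketch does not model.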

## Key findings

- SFCS removed 278k irrelevant files from the corpus and identified 5.6k suspicious files by extracting 700 blacklisted keywords.
- The model achieved 94.6% accuracy, 94.4% precision, and 96.8% recall.
- Classification runs within O(n log n) time complexity, compared with O(n log n * |V|) for models that integrate α, β, and βj.

## Abstract

In forensic topical modelling, the α parameter controls the distribution of topics in documents. However, low, high, or incorrect values of α lead to topic sparsity, model overfitting, and suboptimal topic distribution. To control the word distribution across topics, the β parameter is introduced. However, low, high, or inappropriate β values lead to sparse distribution, disjointed topics, and abundant highly probable words. The βj parameter, in conjunction with seed-guided words based on Term Frequency and Inverse Document Frequency, is introduced to address the issues. Nevertheless, the data often suffers from skewness or noise due to frequent co-occurrences of unrelated polysemic word pairs generated using Pointwise Mutual Information. By integrating α, β, and βj into file classification systems, classification models converge to local optima with O(n log n * |V|) time complexity. To combat these challenges, this research proposes the SDOT Forensic Classification System (SFCS) with a functional parameter βk that identifies seed words by evaluating semantic and contextual similarity of word vectors. As a result, the topic distribution (Θd) is compelled to model the curated seed words within the distribution, generating pertinent topics. Incorporating βk into SFCS allowed the proposed model to remove 278k irrelevant files from the corpus and identify 5.6k suspicious files by extracting 700 blacklisted keywords. Furthermore, this research implemented hyperparameter optimization and hyperplane maximization, resulting in a file classification accuracy of 94.6%, 94.4% precision, and 96.8% recall within O(n log n) complexity.
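
For readers unfamiliar with the two weighting schemes the abstract names, their standard textbook forms are given below. The exact variants used in the paper are not stated in this summary, so the notation is an assumption: $N$ is the number of documents, $\mathrm{tf}(t,d)$ is the count of term $t$ in document $d$, $\mathrm{df}(t)$ is the number of documents containing $t$, and $p(\cdot)$ denotes empirical probabilities estimated from the corpus.

$$
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\,\log\frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{PMI}(w_i,w_j) = \log\frac{p(w_i,w_j)}{p(w_i)\,p(w_j)}
$$

PMI is positive whenever two words co-occur more often than independence would predict, regardless of which sense of a polysemic word is involved. That is consistent with the abstract's point that PMI-generated word pairs can introduce noise.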

## Figures

50 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11935782/full.md

---
Source: https://tomesphere.com/paper/PMC11935782