# Minimizing unnecessary tax audits using multi-objective hyperparameter tuning of XGBoost with focal loss

**Authors:** Ivan P. Malashin, Igor S. Masich, Vadim S. Tynchenko, Andrei P. Gantimurov, Vladimir A. Nelyub, Aleksei S. Borodulin

PMC · DOI: 10.3389/frai.2025.1669191 · Frontiers in Artificial Intelligence · 2025-10-16

## TL;DR

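This paper uses machine learning to flag likely tax non-compliance among young companies (≤ 3 years old, fewer than 100 employees), with the goal of reducing unnecessary audits of compliant businesses while keeping the model accurate and interpretable.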

## Contribution

A multi-objective hyperparameter tuning approach for XGBoost based on NSGA-II, which maximizes ROC-AUC while minimizing the number of trees to keep the model interpretable, combined with focal loss to address class imbalance.
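
To make the two-objective trade-off concrete, here is a minimal sketch of the Pareto dominance check that underlies non-dominated sorting in NSGA-II. It is not the authors' implementation: the function names and the candidate values are hypothetical, and a full NSGA-II run also involves crowding distance, selection, crossover, and mutation, which are omitted here.

```python
# Each candidate is (roc_auc, n_trees): higher ROC-AUC is better, fewer trees is better.

def dominates(a, b):
    """Return True if candidate a is at least as good as b on both
    objectives and strictly better on at least one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical toy values, NOT results from the paper.
candidates = [(0.90, 50), (0.94, 300), (0.89, 80), (0.93, 120), (0.94, 400)]
front = pareto_front(candidates)
print(front)
```

Every configuration on the resulting front is a defensible choice; picking one means deciding how much ROC-AUC to give up for a smaller, easier-to-inspect ensemble.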

## Key findings

- The optimized XGBoost model achieved a ROC-AUC of 0.9417, compared with 0.9161 without optimization.
- SHAP analysis revealed key factors influencing non-compliance, aiding regulatory decision-making.
- The approach is intended to reduce unnecessary inspections and minimize disruption to compliant businesses by improving model accuracy and interpretability.

## Abstract

This study presents a machine learning (ML) approach for detecting non-compliance in companies' tax data. The dataset, consisting of over one million records, focuses on three key targets: invalid addresses, invalid director information, and invalid founder information. The analysis prioritizes young companies (≤ 3 years old) with fewer than 100 employees, thereby improving class distributions and model effectiveness. A combination of binary classification techniques was employed, including benchmarked supervised learning models (XGBoost, Random Forest), anomaly detection methods (LOF, Isolation Forest), and semi-supervised learning using deep neural networks (DNNs) with unlabeled data. Given its computational efficiency, XGBoost was selected as the primary model. However, class imbalance persisted even among young companies, necessitating the integration of focal loss to improve classification performance. To further enhance accuracy while maintaining model interpretability, NSGA-II (Non-dominated Sorting Genetic Algorithm II) was used for multi-objective hyperparameter optimization of XGBoost. The objectives were to maximize ROC-AUC for improved predictive performance and minimize the number of trees to enhance interpretability. The optimized model achieved a ROC-AUC of 0.9417, compared to 0.9161 without optimization, demonstrating the effectiveness of this approach. Additionally, SHAP analysis provided insights into key factors influencing non-compliance, supporting explainability and aiding regulatory decision-making. This methodology contributes to fair and efficient oversight by reducing unnecessary inspections, minimizing disruptions to compliant businesses, and improving the overall effectiveness of tax compliance monitoring.
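
The abstract does not give the exact focal loss formulation or its parameters, so the following is only a sketch of the standard binary focal loss, $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, and its gradient with respect to the raw score. The defaults `gamma=2.0` and `alpha=0.25` are the commonly used values from the focal loss literature, not values reported in this paper, and all function names are hypothetical. A custom XGBoost objective returns a gradient and a Hessian per sample; the Hessian below is approximated numerically for brevity.

```python
import math

EPS = 1e-12  # guards against log(0) for extreme scores


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _pt_alpha(z, y, alpha):
    """Probability and weight assigned to the true class."""
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, EPS), 1.0 - EPS)
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return p_t, alpha_t


def focal_loss(z, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for raw score z and label y in {0, 1}."""
    p_t, alpha_t = _pt_alpha(z, y, alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def focal_grad(z, y, gamma=2.0, alpha=0.25):
    """Analytic derivative of focal_loss with respect to z."""
    p_t, alpha_t = _pt_alpha(z, y, alpha)
    sign = 1.0 if y == 1 else -1.0
    return sign * alpha_t * (1.0 - p_t) ** gamma * (
        gamma * p_t * math.log(p_t) + p_t - 1.0
    )


def focal_objective(scores, labels, gamma=2.0, alpha=0.25, h=1e-4):
    """Per-sample (gradients, hessians); hessian via central difference."""
    grads = [focal_grad(z, y, gamma, alpha) for z, y in zip(scores, labels)]
    hess = [
        (focal_grad(z + h, y, gamma, alpha) - focal_grad(z - h, y, gamma, alpha))
        / (2.0 * h)
        for z, y in zip(scores, labels)
    ]
    return grads, hess


# A confidently correct positive contributes far less loss than a badly missed one.
print(focal_loss(3.0, 1), focal_loss(-3.0, 1))
```

With `gamma=0` the modulating factor disappears and the loss reduces to class-weighted cross-entropy, which is a convenient sanity check before using larger `gamma` values on an imbalanced target.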

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12571755/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12571755/full.md

## References

69 references — full list in the complete paper: https://tomesphere.com/paper/PMC12571755/full.md

---
Source: https://tomesphere.com/paper/PMC12571755