A Multi-level Analysis of Factors Associated with Student Performance: A Machine Learning Approach to the SAEB Microdata
Rodrigo Tertulino, Ricardo Almeida

TL;DR
This study employs a multi-level machine learning approach, particularly Random Forest, combined with Explainable AI to identify systemic factors influencing student performance in Brazil's basic education, highlighting the importance of school-level socioeconomic context.
Contribution
The paper introduces an integrated multi-source data model and applies XAI to reveal systemic influences on student performance, yielding policy-relevant insights.
Findings
Random Forest achieved 90.2% accuracy and 96.7% AUC.
School socioeconomic level is the most influential predictor.
Systemic factors outweigh individual characteristics in affecting performance.
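The two headline metrics measure different things: accuracy depends on a decision threshold, while AUC depends only on how well the scores rank proficient students above non-proficient ones. The following is a minimal sketch of both computations in plain Python. The labels and scores are made-up toy values, not SAEB data, and the function names `accuracy` and `roc_auc` are illustrative rather than taken from the paper.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Share of examples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels (1 = proficient) and classifier scores; illustrative values only.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

acc = accuracy(y_true, y_score)
auc = roc_auc(y_true, y_score)
```

Because AUC is threshold-free, a model can report a higher AUC than accuracy, as in the reported 96.7% versus 90.2%.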
Abstract
Identifying the factors that influence student performance in basic education is a central challenge for formulating effective public policies in Brazil. This study introduces a multi-level machine learning approach to classify the proficiency of 9th-grade and high school students using microdata from the System of Assessment of Basic Education (SAEB). Our model uniquely integrates four data sources: student socioeconomic characteristics, teacher professional profiles, school indicators, and principal management profiles. A comparative analysis of four ensemble algorithms confirmed the superiority of a Random Forest model, which achieved 90.2% accuracy and an Area Under the Curve (AUC) of 96.7%. To move beyond prediction, we applied Explainable AI (XAI) using SHAP, which revealed that the school's average socioeconomic level is the dominant predictor, demonstrating that systemic…
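To make the multi-level idea concrete, the sketch below shows one way student records could be enriched with school-level context, including a school-average socioeconomic feature. It is an assumption-laden illustration, not the authors' pipeline: the field names (`ses`, `infrastructure`, `teacher_years_exp`, `principal_years_exp`), the helper `build_rows`, and all values are hypothetical and do not correspond to actual SAEB columns.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; real SAEB microdata uses different columns.
students = [
    {"student_id": 1, "school_id": "A", "ses": 4.0},
    {"student_id": 2, "school_id": "A", "ses": 6.0},
    {"student_id": 3, "school_id": "B", "ses": 2.0},
]
schools = {"A": {"infrastructure": 0.8}, "B": {"infrastructure": 0.5}}
teachers = {"A": {"teacher_years_exp": 12.0}, "B": {"teacher_years_exp": 5.0}}
principals = {"A": {"principal_years_exp": 7.0}, "B": {"principal_years_exp": 3.0}}


def build_rows(students, schools, teachers, principals):
    """Attach school, teacher, and principal features to each student row."""
    # Aggregate the student-level index up to the school level.
    ses_by_school = defaultdict(list)
    for s in students:
        ses_by_school[s["school_id"]].append(s["ses"])
    school_mean_ses = {sid: mean(vals) for sid, vals in ses_by_school.items()}

    rows = []
    for s in students:
        sid = s["school_id"]
        row = dict(s)  # copy so the input records are not mutated
        row["school_mean_ses"] = school_mean_ses[sid]
        # Join each higher-level source on the shared school identifier.
        for source in (schools, teachers, principals):
            row.update(source.get(sid, {}))
        rows.append(row)
    return rows


rows = build_rows(students, schools, teachers, principals)
```

Each resulting row carries both individual and systemic features, which is what lets a feature-attribution method such as SHAP compare their relative influence on the predicted proficiency class.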
