ESAT: Environmental Source Apportionment Toolkit Python package

Deron Smith; Michael Cyterski; John M Johnston; Kurt Wolfe; Rajbir Parmar

PMC · DOI:10.21105/joss.07316·December 10, 2025

Summary

Source apportionment is an important tool in environmental science, where sample or sensor data are often the product of many, often unknown, contributing sources. It is used to understand the relative contributions of air pollution sources such as vehicle emissions, industrial activities, and dust (Bhandari et al., 2022), including their contributions to particulate matter pollution, and to identify the relative contributions of point and non-point sources in water bodies such as lakes, rivers, and estuaries (Jiang et al., 2019; Mamun & An, 2021). Using non-negative matrix factorization (NMF), source apportionment models estimate potential source profiles and contributions, providing a cost-efficient way to guide further strategic data collection or modeling.

Environmental Source Apportionment Toolkit (ESAT) is an open-source Python package that provides a flexible and transparent workflow for source apportionment using NMF algorithms, developed to replace the EPA’s Positive Matrix Factorization version 5 (PMF5) application (EPA, 2014; Pentti Paatero, 1999). ESAT recreates the source apportionment workflow of PMF5, including pre- and post-processing analytical tools, batch modeling, uncertainty estimation, and customized constraints. ESAT also offers a simulator for generating datasets from synthetic profiles and contributions, allowing for model output evaluation. The synthetic profiles can be randomly generated, taken from a pre-defined set of profiles, or be a combination of the two. The random synthetic contributions can follow specified curves and value ranges. By running ESAT on the synthetic datasets, users can assess how accurately ESAT finds a solution that recreates the original synthetic profiles and contributions.
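The idea behind such a simulator can be illustrated with a minimal NumPy sketch. This is not the ESAT simulator API: the function name `make_synthetic_dataset`, its parameters, the sinusoidal contribution curve, and the proportional noise model are all assumptions chosen for illustration.

```python
import numpy as np

def make_synthetic_dataset(n_samples=200, n_features=12, n_factors=3,
                           noise_fraction=0.05, seed=0):
    """Hypothetical helper: build V = W @ H plus noise from known factors."""
    rng = np.random.default_rng(seed)

    # Synthetic profiles: one non-negative row per factor, normalized to sum to 1.
    H = rng.random((n_factors, n_features))
    H /= H.sum(axis=1, keepdims=True)

    # Synthetic contributions: a sinusoidal curve per factor (random phase)
    # plus random variation, so every value stays non-negative.
    t = np.linspace(0.0, 2.0 * np.pi, n_samples)[:, None]
    phase = rng.uniform(0.0, 2.0 * np.pi, size=(1, n_factors))
    W = 5.0 * (1.0 + np.sin(t + phase)) + rng.random((n_samples, n_factors))

    V_true = W @ H
    # Assumed noise model: uncertainty proportional to the signal, with a floor.
    U = noise_fraction * V_true + 1e-3
    # Add Gaussian noise scaled by U and keep the data strictly positive.
    V = np.clip(V_true + rng.normal(0.0, U), 1e-12, None)
    return V, U, W, H

V, U, W_true, H_true = make_synthetic_dataset()
print(V.shape, U.shape, W_true.shape, H_true.shape)
```

Because `W_true` and `H_true` are known, the factors recovered by a model fitted to `V` and `U` can be compared against them directly.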

Statement of Need

The EPA’s PMF5, released in 2014, provides a widely used source apportionment modeling and analysis workflow, but it is no longer supported and relies on the proprietary Multilinear Engine v2 (ME2). ESAT has been developed as a replacement for PMF5 and has been designed for increased flexibility, documentation, and transparency.

The Python API and CLI of ESAT provide a programmatic interface that can recreate the PMF5 workflow and offer a flexible way to create source apportionment workflows and novel research applications. The matrix factorization algorithms in ESAT are written in Rust for runtime optimization. ESAT was developed for environmental research, though it is not limited to that domain, as matrix factorization is used in many different fields.

Algorithms

Source apportionment algorithms use a loss function to quantify the difference between the input data matrix (V) and the product of a factor contribution matrix (W) and a factor profile matrix (H), weighted by an uncertainty matrix (U) (Pentti Paatero & Tapper, 1994). The goal is to find factor matrices that best reproduce the input matrix, while constraining all, or most of, the factor elements to be non-negative. The solution, W and H, can be used to calculate the residuals and overall model loss. ESAT has two NMF algorithms for updating the profile and contribution matrices: least-squares NMF (LS-NMF) (Wang et al., 2006) and weighted-semi NMF (WS-NMF) (Ding et al., 2008; Melo & Wainer, 2012).

The loss function used in ESAT, and PMF5, is a variation of squared-error loss, where data uncertainty is taken into consideration (both in the loss function and in the matrix update equations):

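$$Q = \sum_{i=1}^{N} \sum_{j=1}^{M} \left[ \frac{V_{ij} - \sum_{k=1}^{K} W_{ik} H_{kj}}{U_{ij}} \right]^2$$

where $V$ is the input data matrix of samples (rows, $N$) by features (columns, $M$), $U$ is the uncertainty matrix of the input data matrix, $W$ is the factor contribution matrix of samples by factors ($K$), and $H$ is the factor profile matrix of factors by features.

The ESAT versions of the NMF algorithms convert the uncertainty $U$ into weights defined element-wise as $U_w = 1/U^2$. With $\circ$ denoting element-wise multiplication and the fractions taken element-wise, the update equations for LS-NMF then become:

$$H_{t+1} = H_t \circ \frac{W_t^T (V \circ U_w)}{W_t^T ((W_t H_t) \circ U_w)}$$

$$W_{t+1} = W_t \circ \frac{(V \circ U_w) H_{t+1}^T}{((W_t H_{t+1}) \circ U_w) H_{t+1}^T}$$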

while the update equations for WS-NMF are:

[eqn]

[eqn]

where $[eqn]$ and $[eqn]$ .
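The uncertainty-weighted loss and the LS-NMF multiplicative updates described in this section can be sketched in NumPy. This is a minimal illustration, not ESAT's Rust implementation: it assumes weights of $1/U^2$ and the standard weighted multiplicative updates of Wang et al. (2006), and the names `weighted_loss` and `ls_nmf`, the random initialization, and the small `eps` guard against division by zero are assumptions.

```python
import numpy as np

def weighted_loss(V, U, W, H):
    """Q = sum over all elements of ((V - W @ H) / U) ** 2."""
    return float(np.sum(((V - W @ H) / U) ** 2))

def ls_nmf(V, U, n_factors, max_iter=500, seed=0, eps=1e-12):
    """Illustrative LS-NMF: weighted multiplicative updates for V ~ W @ H."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = V.shape

    Uw = 1.0 / U**2      # element-wise weights derived from the uncertainty
    WV = Uw * V          # weighted data, constant across iterations

    # Random strictly positive starting point (an assumption of this sketch).
    W = rng.random((n_samples, n_factors)) + eps
    H = rng.random((n_factors, n_features)) + eps

    losses = []
    for _ in range(max_iter):
        # Update the profiles H with W held fixed.
        H *= (W.T @ WV) / (W.T @ (Uw * (W @ H)) + eps)
        # Update the contributions W using the freshly updated H.
        W *= (WV @ H.T) / ((Uw * (W @ H)) @ H.T + eps)
        losses.append(weighted_loss(V, U, W, H))
    return W, H, losses

# Small usage example on made-up non-negative data.
rng = np.random.default_rng(42)
V = rng.random((60, 4)) @ rng.random((4, 10)) + 0.01
U = 0.1 * V + 0.01
W, H, losses = ls_nmf(V, U, n_factors=4)
print(f"loss after first iteration: {losses[0]:.3f}, after last: {losses[-1]:.3f}")
```

Because the updates only multiply by non-negative ratios, W and H remain non-negative whenever V and the starting matrices are non-negative. The residuals follow as `V - W @ H`.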

Error Estimation

An important part of the source apportionment workflow is quantifying potential model error. ESAT offers the error estimation methods that were developed and made available in PMF5 (Brown et al., 2015; P. Paatero et al., 2014).

The displacement method (DISP) determines the amount that a source profile feature, a single value in the H matrix, must increase and decrease to cause targeted changes to the loss value. One or more features can be selected in the DISP uncertainty analysis. The bootstrap method (BS) uses block bootstrap resampling with replacement to create datasets with the original dimensions of the input, where the order of the samples has been modified in blocks of a specified size. The BS method then fits a new model to each bootstrap dataset, starting from the original initialization, to evaluate how the profiles and concentrations change as a result of sample reordering. The bootstrap-displacement method (BS-DISP) is the combination of the two techniques, where DISP is run for each bootstrap model on one or more features.

These error estimation methods address different uncertainty aspects: DISP targets rotational uncertainty, BS addresses random errors and sample variability, and BS-DISP provides the most comprehensive understanding of how the uncertainty impacts a source apportionment solution.
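The block resampling step of the BS method can be sketched as follows. This is an illustration only and not ESAT's implementation: the helper name `block_bootstrap_indices` is hypothetical, and the choice of randomly placed, possibly overlapping blocks is an assumption, as the exact block scheme in ESAT or PMF5 may differ.

```python
import numpy as np

def block_bootstrap_indices(n_samples, block_size, rng):
    """Hypothetical helper: draw row indices in contiguous blocks, with replacement."""
    # Ceiling division: enough blocks to cover the original number of samples.
    n_blocks = -(-n_samples // block_size)
    # Random block start positions; a start may be drawn more than once.
    starts = rng.integers(0, n_samples - block_size + 1, size=n_blocks)
    idx = np.concatenate([np.arange(s, s + block_size) for s in starts])
    # Trim so the resampled dataset keeps the original number of rows.
    return idx[:n_samples]

rng = np.random.default_rng(0)
V = rng.random((20, 5))      # made-up data matrix (samples by features)
U = 0.1 * V + 0.01           # made-up uncertainty matrix

idx = block_bootstrap_indices(V.shape[0], block_size=4, rng=rng)
V_bs, U_bs = V[idx], U[idx]  # resample data and uncertainty with the same rows
print(V_bs.shape, U_bs.shape)
```

Each bootstrap dataset would then be fitted again, and the resulting profiles compared with those of the base model.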

Bibliography


1. Bhandari S, Arub Z, Habib G, Apte JS, & Hildebrandt Ruiz L (2022). Source apportionment resolved by time of day for improved deconvolution of primary source contributions to air pollution. Atmospheric Measurement Techniques, 15(20), 6051–6074. doi:10.5194/amt-15-6051-2022
2. Brown SG, Eberly S, Paatero P, & Norris GA (2015). Methods for estimating uncertainty in PMF solutions: Examples with ambient air and water quality data and guidance on reporting PMF results. Science of the Total Environment, 518, 626–635. doi:10.1016/j.scitotenv.2015.01.022. PMID: 25776202
3. Ding CH, Li T, & Jordan MI (2008). Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1), 45–55. doi:10.1109/TPAMI.2008.277. PMID: 19926898
4. EPA, U. S. (2014). Positive Matrix Factorization Model for Environmental Data Analyses. https://www.epa.gov/air-research/positive-matrix-factorization-model-environmental-data-analyses
5. Jiang J, Khan AU, & Shi B (2019). Application of positive matrix factorization to identify potential sources of water quality deterioration of Huaihe River, China. Applied Water Science, 9(3), 63. doi:10.1007/s13201-019-0938-4
6. Mamun M, & An K-G (2021). Application of Multivariate Statistical Techniques and Water Quality Index for the Assessment of Water Quality and Apportionment of Pollution Sources in the Yeongsan River, South Korea. International Journal of Environmental Research and Public Health, 18(16). doi:10.3390/ijerph18168268. PMCID: PMC8392859. PMID: 34444013
7. Melo EV de, & Wainer J (2012). Semi-NMF and weighted semi-NMF algorithms comparison.
8. Paatero P (1999). The multilinear engine—a table-driven, least squares program for solving multilinear problems, including the n-way parallel factor analysis model. Journal of Computational and Graphical Statistics, 8(4), 854–888. doi:10.1080/10618600.1999.10474853