TL;DR
This study applies four supervised machine learning algorithms to classify uncertain blazar sources in the Fermi-LAT catalog using observational data, achieving high accuracy and robustness in distinguishing FSRQs from BL Lacs.
Contribution
It introduces a comparative analysis of machine learning methods for classifying Fermi blazar candidates of uncertain type based on multi-parameter observational data.
Findings
All four machine learning methods perform well, with validation accuracies between about 0.83 and 0.92 across the tested configurations.
The methods classify approximately 25% FSRQs and 75% BL Lacs among BCUs.
Mclust Gaussian Mixture Model achieves the highest accuracy (0.916) for the 4/5 training split with seed=123.
Abstract
In the third catalog of active galactic nuclei detected by the Fermi-LAT (3LAC) Clean Sample, there are 402 blazar candidates of uncertain type (BCU). Due to the limitations of astronomical observation or intrinsic properties, it is difficult to classify blazars using optical spectroscopy. The potential classification of BCUs using machine learning algorithms is essential. Based on the 3LAC Clean Sample, we collect 1420 Fermi blazars with 8 parameters of γ-ray photon spectral index, radio flux, flux density, curve significance, the integral photon flux in 100 to 300 MeV, 0.3 to 1 GeV, 10 to 100 GeV and variability index. Here, we apply 4 different supervised machine learning (SML) algorithms (Decision trees, Random forests, support vector machines and Mclust Gaussian finite mixture models) to evaluate the classification of BCUs based on the direct observational…
| Selected parameter | KS test D | KS test p | t-test t | t-test df | t-test p | Wilcoxon test W | Wilcoxon test p | RF importance coefficient |
|---|---|---|---|---|---|---|---|---|
| Spectral.Index | 0.627 | 0.0 | -27.096 | 955.17 | 2.10E-120 | 27783.0 | 7.42E-99 | 90.89 |
| flux.density | 0.627 | 0.0 | -5.034 | 425.25 | 7.11E-07 | 33001.5 | 9.89E-89 | 74.98 |
| Radio.flux.mJy. | 0.562 | 0.0 | -5.659 | 425.62 | 2.80E-08 | 39308.0 | 3.08E-77 | 55.55 |
| variability.index | 0.478 | 0.0 | -3.096 | 419.60 | 2.10E-03 | 52423.0 | 6.24E-56 | 45.65 |
| flux.100.300.mev | 0.472 | 0.0 | -5.318 | 430.82 | 1.69E-07 | 50263.0 | 3.37E-59 | 32.89 |
| flux.0p3.1.gev | 0.424 | 0.0 | -4.559 | 443.64 | 6.67E-06 | 59379.5 | 4.72E-46 | 24.04 |
| flux.10.100.gev | 0.405 | 0.0 | 5.874 | 818.82 | 6.17E-09 | 186303.0 | 2.40E-40 | 47.94 |
| curve.significance | 0.274 | 2.22E-16 | -9.604 | 555.30 | 2.63E-20 | 84223.5 | 8.37E-19 | 22.20 |
| Setup | Classifier | bll | fsrq | Sensitivity | Specificity | Positive Predictive Value | Negative Predictive Value | Accuracy |
|---|---|---|---|---|---|---|---|---|
| 4/5, 8 parameters, seed=123 | Mclust | 295 | 105 | 0.895 | 0.929 | 0.883 | 0.937 | 0.916 |
| 4/5, 8 parameters, seed=123 | Random Forest | 308 | 92 | 0.842 | 0.890 | 0.821 | 0.904 | 0.872 |
| 4/5, 8 parameters, seed=123 | rpart | 279 | 121 | 0.829 | 0.858 | 0.778 | 0.893 | 0.847 |
| 4/5, 8 parameters, seed=123 | svm | 301 | 99 | 0.868 | 0.913 | 0.857 | 0.921 | 0.897 |
| | Mclust(EDDA) | 273 | 127 | 0.921 | 0.874 | 0.814 | 0.949 | 0.892 |
| | Forest(10000) | 309 | 91 | 0.855 | 0.890 | 0.823 | 0.911 | 0.877 |
| | rpart (no pruned) | 293 | 107 | 0.776 | 0.882 | 0.797 | 0.868 | 0.842 |
| | svm (cost) | 299 | 101 | 0.829 | 0.921 | 0.863 | 0.900 | 0.887 |
| 2/3, 8 parameters, seed=123 | Mclust | 304 | 96 | 0.845 | 0.931 | 0.891 | 0.900 | 0.897 |
| 2/3, 8 parameters, seed=123 | Random Forest | 310 | 90 | 0.837 | 0.897 | 0.843 | 0.893 | 0.873 |
| 2/3, 8 parameters, seed=123 | rpart | 305 | 95 | 0.770 | 0.892 | 0.825 | 0.854 | 0.844 |
| 2/3, 8 parameters, seed=123 | svm | 301 | 99 | 0.859 | 0.911 | 0.866 | 0.907 | 0.891 |
| 2/3, 8 parameters, seed=321 | Mclust | 297 | 103 | 0.819 | 0.896 | 0.843 | 0.878 | 0.864 |
| 2/3, 8 parameters, seed=321 | Random Forest | 306 | 94 | 0.833 | 0.896 | 0.846 | 0.887 | 0.870 |
| 2/3, 8 parameters, seed=321 | rpart | 300 | 100 | 0.746 | 0.891 | 0.824 | 0.836 | 0.832 |
| 2/3, 8 parameters, seed=321 | svm | 299 | 101 | 0.877 | 0.900 | 0.858 | 0.914 | 0.891 |
| 2/3, 8 parameters, seed=1234 | Mclust | 287 | 113 | 0.864 | 0.859 | 0.812 | 0.900 | 0.861 |
| 2/3, 8 parameters, seed=1234 | Random Forest | 302 | 98 | 0.893 | 0.884 | 0.845 | 0.921 | 0.888 |
| 2/3, 8 parameters, seed=1234 | rpart | 291 | 109 | 0.871 | 0.854 | 0.808 | 0.904 | 0.861 |
| 2/3, 8 parameters, seed=1234 | svm | 292 | 108 | 0.864 | 0.864 | 0.818 | 0.900 | 0.864 |
| 4/5, 4 parameters, seed=123 | Mclust | 289 | 111 | 0.895 | 0.913 | 0.861 | 0.935 | 0.906 |
| 4/5, 4 parameters, seed=123 | Random Forest | 312 | 88 | 0.776 | 0.882 | 0.797 | 0.868 | 0.842 |
| 4/5, 4 parameters, seed=123 | rpart | 305 | 95 | 0.803 | 0.858 | 0.772 | 0.879 | 0.837 |
| 4/5, 4 parameters, seed=123 | svm | 305 | 95 | 0.868 | 0.898 | 0.835 | 0.919 | 0.887 |
| 4/5, 3 parameters, seed=123 | Mclust | 292 | 108 | 0.908 | 0.858 | 0.793 | 0.940 | 0.877 |
| 4/5, 3 parameters, seed=123 | Random Forest | 304 | 96 | 0.737 | 0.890 | 0.800 | 0.850 | 0.833 |
| 4/5, 3 parameters, seed=123 | rpart | 280 | 120 | 0.842 | 0.843 | 0.762 | 0.899 | 0.842 |
| 4/5, 3 parameters, seed=123 | svm | 301 | 99 | 0.855 | 0.890 | 0.823 | 0.911 | 0.877 |
| Chi16 | ||||||||
|---|---|---|---|---|---|---|---|---|
| 244 | 3 | |||||||
| bll | 295 | 253 | 277 | 274 | 218 | 47 | 228 | 244 |
| fsrqa | 0 | 42 | 18 | 21 | 76 | 4 | 36 | 22 |
| unc | 1 | 28 | 29 | |||||
| rate | 14.2% | 6.1% | 7.7% | 25.9% | 7.8% | 13.6% | 8.3% | |
| 96 | ||||||||
| bllb | 0 | 26 | 31 | 27 | 8 | 2 | 14 | 29 |
| fsrq | 105 | 79 | 74 | 78 | 97 | 7 | 83 | 54 |
| unc | 8 | 22 | ||||||
| rate | 24.8% | 29.5% | 25.7% | 7.6% | 22.2% | 14.4% | 34.9% | |
| 197 | 3 | |||||||
| bll | 246 | 207 | 46 | 213 | 224 | |||
| fsrqa | 0 | 38 | 3 | 8 | 5 | |||
| unc | 1 | 22 | 17 | |||||
| rate | 15.5% | 6.1% | 3.6% | 2.2% | ||||
| 57 | ||||||||
| bllb | 0 | 0 | 0 | 0 | 6 | |||
| fsrq | 64 | 64 | 7 | 63 | 46 | |||
| unc | 1 | 12 | ||||||
| rate | 0% | 0% | 0% | 11.5% | ||||
| 3FGL Name | log | log | log | log | log | log | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3FGL J0002.24152 | 1.121 | 2.089 | -13.135 | 0.200 | -8.275 | -9.247 | -10.587 | 1.751 | 100.00% | 0.00% | bll | bll | bll | bll | bll | unc | bll | |
| 3FGL J0003.25246 | 1.815 | 1.895 | -13.699 | 0.909 | -8.182 | -12.082 | -10.526 | 1.656 | 90.56% | 9.44% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0017.20643 | 1.973 | 2.116 | -12.955 | 0.948 | -9.233 | -9.206 | -10.979 | 1.573 | 99.88% | 0.12% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0019.15645 | 1.782 | 2.391 | -12.488 | 0.525 | -8.003 | -9.207 | -10.513 | 1.798 | 99.86% | 0.14% | bll | bll | bll | bll | fsrq | fsrq | bll | |
| 3FGL J0028.67507 | 1.909 | 2.342 | -12.298 | 0.407 | -7.975 | -8.724 | -11.472 | 1.577 | 0.93% | 99.07% | fsrq | bll | bll | bll | fsrq | bll | bll | |
| 3FGL J0030.21646 | 0.979 | 1.647 | -13.801 | 1.326 | -12.006 | -14.218 | -10.320 | 1.808 | 100.00% | 0.00% | bll | bll | bll | bll | bll | bll | bll | bll |
| 3FGL J0030.70209 | 2.473 | 2.378 | -11.547 | 3.137 | -7.836 | -8.406 | -14.839 | 2.545 | 0.00% | 100.00% | fsrq | fsrq | fsrq | fsrq | fsrq | fsrq | fsrq | |
| 3FGL J0031.30724 | 1.086 | 1.824 | -13.917 | 1.060 | -9.117 | -10.830 | -10.359 | 1.519 | 99.95% | 0.05% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0039.02218 | 2.069 | 1.715 | -14.096 | 2.687 | -11.200 | -9.461 | -10.684 | 1.563 | 96.61% | 3.39% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0039.14330 | 0.913 | 1.963 | -13.352 | 1.853 | -8.533 | -9.463 | -10.564 | 1.549 | 100.00% | 0.00% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0040.34049 | 1.683 | 1.132 | -15.375 | 1.377 | -8.680 | -9.480 | -10.426 | 1.481 | 100.00% | 0.00% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0040.52339 | 1.730 | 1.946 | -13.676 | 1.383 | -11.375 | -9.226 | -10.564 | 1.692 | 99.05% | 0.95% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0043.50444 | 1.475 | 1.735 | -14.170 | 0.023 | -8.524 | -9.726 | -10.372 | 1.605 | 100.00% | 0.00% | bll | bll | bll | bll | bll | bll | bll | bll |
| 3FGL J0043.71117 | 1.397 | 1.594 | -14.050 | 2.115 | -9.092 | -13.231 | -10.386 | 1.442 | 100.00% | 0.00% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0045.23704 | 2.518 | 2.543 | -11.319 | 0.526 | -7.845 | -8.529 | -11.132 | 2.240 | 3.19% | 96.81% | fsrq | fsrq | fsrq | fsrq | fsrq | fsrq | fsrq | |
| 3FGL J0049.45401 | 2.292 | 2.143 | -13.013 | 0.142 | -8.368 | -9.116 | -10.561 | 1.653 | 99.40% | 0.60% | bll | bll | bll | bll | fsrq | bll | bll | |
| 3FGL J0050.04458 | 2.526 | 2.528 | -12.023 | 0.547 | -8.269 | -9.061 | -14.818 | 1.836 | 0.61% | 99.39% | fsrq | fsrq | fsrq | fsrq | fsrq | fsrq | unc | |
| 3FGL J0051.26241 | 1.635 | 1.663 | -13.074 | 1.834 | -8.250 | -8.943 | -9.676 | 1.701 | 100.00% | 0.00% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0055.21213 | 2.420 | 2.397 | -12.466 | 0.604 | -8.388 | -8.907 | -10.827 | 1.838 | 21.42% | 78.58% | fsrq | bll | bll | fsrq | fsrq | fsrq | unc | |
| 3FGL J0103.71323 | 1.716 | 1.984 | -13.195 | 2.436 | -8.719 | -9.570 | -10.742 | 1.722 | 99.21% | 0.79% | bll | bll | bll | bll | bll | bll | bll | bll |
| 3FGL J0107.01208 | 1.778 | 2.180 | -12.943 | 0.964 | -8.401 | -9.149 | -11.097 | 1.514 | 88.44% | 11.56% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0116.22744 | 1.237 | 2.023 | -13.369 | 1.034 | -10.128 | -9.161 | -10.553 | 1.606 | 100.00% | 0.00% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0121.75154 | 0.928 | 1.984 | -13.406 | 0.437 | -8.289 | -9.269 | -10.613 | 1.586 | 100.00% | 0.00% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0127.20325 | 1.208 | 1.899 | -12.793 | 1.603 | -9.120 | -8.783 | -10.125 | 1.695 | 100.00% | 0.00% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0132.50802 | 2.488 | 1.753 | -13.863 | 1.681 | -11.932 | -12.006 | -10.425 | 1.517 | 80.33% | 19.67% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0133.25159 | 3.078 | 2.628 | -12.079 | 0.722 | -8.054 | -9.077 | -10.772 | 1.681 | 21.29% | 78.71% | fsrq | fsrq | fsrq | fsrq | fsrq | fsrq | bll | |
| 3FGL J0133.34324 | 2.179 | 2.301 | -12.602 | 1.617 | -8.572 | -8.777 | -10.996 | 1.720 | 52.25% | 47.75% | bll | bll | bll | bll | fsrq | unc | bll | |
| 3FGL J0134.52638 | 1.485 | 1.991 | -12.750 | 3.036 | -9.044 | -8.804 | -10.387 | 1.764 | 100.00% | 0.00% | bll | bll | bll | bll | bll | bll | bll | |
| 3FGL J0139.98735 | 1.063 | 1.891 | -13.833 | 1.268 | -8.273 | -9.975 | -10.342 | 1.624 | 99.98% | 0.02% | bll | bll | bll | bll | bll | bll | bll |
Evaluating the optical classification of Fermi BCUs using machine learning
School of Electrical Engineering, Liupanshui Normal University, Liupanshui, Guizhou, 553004, China
Guizhou Provincial Key Laboratory of Radio Astronomy and Data Processing
Jun-Hui Fan
Center for Astrophysics, Guangzhou University, Guangzhou 510006, China
Weiming Mao
Department of Physics, Yunnan Normal University, Kunming, Yunnan, 650092, China
School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
Jianchao Feng
School of Physics and Electronic Science, Guizhou Normal University, Guiyang, 550001, China
Guizhou Provincial Key Laboratory of Radio Astronomy and Data Processing
Yue Yin
School of Electrical Engineering, Liupanshui Normal University, Liupanshui, Guizhou, 553004, China
(Received October 17, 2018; Revised January 24, 2019; Accepted January 24, 2019)
Abstract
In the third catalog of active galactic nuclei detected by the Fermi-LAT (3LAC) Clean Sample, there are 402 blazar candidates of uncertain type (BCU). Due to the limitations of astronomical observation or intrinsic properties, it is difficult to classify blazars using optical spectroscopy. The potential classification of BCUs using machine learning algorithms is essential. Based on the 3LAC Clean Sample, we collect 1420 Fermi blazars with 8 parameters: γ-ray photon spectral index, radio flux, flux density, curve significance, the integral photon flux in 100 to 300 MeV, 0.3 to 1 GeV and 10 to 100 GeV, and variability index. Here, we apply 4 different supervised machine learning (SML) algorithms (Decision trees, Random forests, support vector machines and Mclust Gaussian finite mixture models) to evaluate the classification of BCUs based on the direct observational properties. All 4 methods perform exceedingly well, with high accuracy, and can effectively forecast the classification of BCUs. The evaluation shows that the results of these SML methods are valid and robust: about 1/4 of the 400 BCUs are FSRQs and 3/4 are BL Lacs, which is consistent with some other recent results. Although a number of factors influence the accuracy of SML, the results are stable at a fixed ratio of 1:3 between FSRQs and BL Lacs, which suggests that SML provides an effective method to evaluate the potential classification of BCUs. Among the 4 methods, Mclust Gaussian Mixture Modelling has the highest accuracy for our training sample (4/5, seed=123).
Keywords: BL Lacertae objects: general, gamma rays: galaxies, methods: statistical, quasars: general
Journal: ApJ
1 Introduction
Blazars are a peculiar sub-class of radio-loud active galactic nuclei (AGNs), whose broadband emission is dominated by non-thermal components produced in a relativistic jet pointed at a small viewing angle to the line of sight (Urry & Padovani, 1995). According to the features of their optical emission lines, blazars are traditionally sub-divided into two groups: flat spectrum radio quasars (FSRQs) and BL Lacertae objects (BL Lacs). BL Lacs have weak or no emission lines (e.g., the rest-frame equivalent width, EW, of the emission line is less than 5 Å), while FSRQs show stronger emission lines (EW > 5 Å) in their optical spectra (Urry & Padovani, 1995; Stocke et al., 1991; Stickel et al., 1991). The multi-wavelength spectral energy distributions (SEDs) of blazars, from the radio to the γ-ray bands, are dominated by non-thermal emission and normally exhibit a two-hump structure in the $\log\nu$-$\log\nu F_{\nu}$ space. The lower energy hump (peaking between the millimeter and soft X-ray wavebands) is normally attributed to the synchrotron emission produced by the non-thermal electrons in the jet, while the second hump (peaking in the MeV-GeV range) mainly comes from inverse Compton (IC) scattering. The location of the peak of the lower energy hump in the SED, $\nu^{\rm S}_{\rm peak}$, is used to classify the sources as low (LSP, e.g., $\nu^{\rm S}_{\rm peak} < 10^{14}$ Hz), intermediate (ISP, e.g., $10^{14}$ Hz $< \nu^{\rm S}_{\rm peak} < 10^{15}$ Hz) and high-synchrotron-peaked (HSP, e.g., $\nu^{\rm S}_{\rm peak} > 10^{15}$ Hz) blazars (Abdo et al., 2010a).
In 2015, the Fermi-LAT Third Source Catalog (3FGL) was publicly released (Acero et al., 2015). The 3FGL catalog includes 3033 γ-ray sources: 2192 high-latitude ($|b| > 10^{\circ}$) and 841 low-latitude ($|b| < 10^{\circ}$) γ-ray sources, most of which are blazars (Ackermann et al., 2015). Based on the 3FGL (Acero et al., 2015), the third catalog of AGNs detected by the Fermi-LAT (3LAC) was presented by Ackermann et al. (2015). The high-confidence clean sample of the 3LAC (3LAC Clean Sample), using the first four years of the Fermi-LAT data, lists 1444 γ-ray AGNs (Ackermann et al., 2015), which include 414 FSRQs (∼30%), 604 BL Lac objects (∼40%), 402 blazar candidates of uncertain type (BCU, ∼30%) and 24 non-blazar AGNs (∼2%).
Classified FSRQs and BL Lacs are sources whose optical classifications can be well identified from the literature and/or an optical spectrum in the 3FGL catalog (Ackermann et al., 2015; Acero et al., 2015). BCUs are sources whose counterparts have been established but whose optical classification as an FSRQ or a BL Lac has not been identified, because the optical spectrum is weak or lacking; they are recognized as blazar candidates from their synchrotron peak frequencies in the SED and/or from broadband emission that shows blazar-type characteristics with a flat radio spectrum (see Ackermann et al. 2015; Acero et al. 2015 for the details and references therein). Such a large sample of blazars provides a good chance to explore the nature of the γ-ray emission of blazars (e.g., Singal et al. 2012; Xiong & Zhang 2014; Singal 2015; Xiong et al. 2015a, b; Chen et al. 2016; Fan et al. 2016a, b; Ghisellini 2016; Lin & Fan 2016; Lin et al. 2017; Chen 2018; Kang et al. 2018; Lin & Fan 2018). In the 3LAC Clean Sample, about 30% of the blazars (the BCUs) have no optical classification. Evaluating the potential classification of the BCUs is a meaningful topic that has been extensively explored based on the Fermi source catalogs (e.g., see Hassan et al. 2013; Doert & Errando 2014; Chiaro et al. 2016; Einecke 2016; Saz Parkinson et al. 2016; Lefaucheur & Pita 2017; Salvetti et al. 2017; Yi et al. 2017 for the reviews and references therein).
At present, machine learning and data mining techniques are developing rapidly and have been widely used in astronomy and astrophysics (e.g., see the review in Ball & Brunner 2010; Feigelson & Babu 2012 and Way et al. 2012; also see Ackermann et al. 2012; Mirabal et al. 2012; Hassan et al. 2013; Doert & Errando 2014; Chiaro et al. 2016; Einecke 2016; Saz Parkinson et al. 2016; Lefaucheur & Pita 2017; Salvetti et al. 2017; Yi et al. 2017; Bai et al. 2018; Ma et al. 2018). Supervised machine learning (SML) is the most common technique; its aim is to build a classifier (or a decision rule) from observations of known classification, and then to assign an observation of unknown class membership to one of K known classes. In the 3LAC Clean Sample, about 70% of the sources have a known optical classification (414 FSRQs, ∼30%; 604 BL Lacs, ∼40%), whereas about 30% of the blazars (the BCUs) have none. Evaluating the potential classification of the BCUs using supervised machine learning is therefore worthwhile.
In this work, we employ 4 supervised machine learning algorithms (Decision trees (DT), Random forests (RF), support vector machines (SVMs) and Gaussian finite mixture models (Mclust)) to evaluate the potential classification of BCUs based only on the direct observational properties of the 3LAC Clean Sample. We describe the sample selection in Section 2, and the supervised machine learning techniques are introduced in Section 3. Section 4 reports the results of supervised machine learning. The discussion and conclusion are presented in Section 5.
2 Sample
From the 3FGL catalog (Acero et al., 2015) and the 3LAC Clean catalog (Ackermann et al., 2015), we select 1420 Fermi Clean blazars (including 414 FSRQs, 604 BL Lacs and 402 BCUs) with 37 variables. In order to select suitable parameters for supervised machine learning, and to build a usable supervised classifier, the differences between the distributions of these 37 parameters in the two subsamples (414 FSRQs and 604 BL Lacs) are assessed using two-sample tests (KS test, t-test and Wilcoxon test) (e.g., Acuner & Ryde 2018). Based on the two-sample test results (see Table 1), and excluding identical, similar, and related parameters, as well as parameters that are directly related to the classification (e.g., redshift), 8 parameters with the better test results (e.g., in the KS test) are selected in this work: the γ-ray photon spectral index, the logarithm of the radio flux, the logarithm of the flux density, the curve significance, the logarithms of the integral photon fluxes in 100 to 300 MeV, 0.3 to 1 GeV and 10 to 100 GeV, and the variability index. Some of these 8 parameters (e.g., spectral index and variability index) are also used in other recent works (e.g., Doert & Errando 2014, Chiaro et al. 2016 and Lefaucheur & Pita 2017). However, the focus of those works (e.g., aims and/or selected parameters and/or methods) differs from ours. For instance, Doert & Errando (2014) identified “AGN” or “non-AGN” among 576 unassociated sources of the 2FGL catalogue using neural network and random forest SML algorithms; Chiaro et al. (2016) identified BL Lacs and FSRQs among the BCUs in the 3FGL catalogue using a neural network SML algorithm; and Lefaucheur & Pita (2017) aimed, first, to identify blazar candidates among the 3FGL unassociated sources and, second, to classify the blazar candidates (those determined in their work and the BCUs already reported in the 3FGL catalogue) as BL Lacs or FSRQs using multivariate classifications. In contrast, our aim is to identify BL Lacs and FSRQs in the high-confidence clean sample of the 3LAC (3LAC Clean Sample) using 4 different SML algorithms (DT, RF, SVM and Mclust). All the available observational data for the 8 parameters are obtained directly from the 3LAC Website version (http://www.asdc.asi.it/fermi3lac/) and the LAT 4-year Point Source Catalog (https://heasarc.gsfc.nasa.gov/W3Browse/fermi/fermilpsc.html). After excluding 2 sources that have no radio data and 2 sources that are missing the curve significance, 1416 sources (413 FSRQs, 603 BL Lacs and 400 BCUs) are compiled in this work; the 400 BCUs are listed in Table 4.
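To make the parameter screening more concrete, the following is a minimal Python sketch of the two-sample KS statistic, the first column of Table 1. It is not the authors' code (the paper works in R); the function name `ks_statistic` and the toy values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so evaluating them at every pooled data point is sufficient.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical spectral indices for illustration only (not catalog values).
toy_bll = [1.7, 1.8, 1.9, 2.0, 2.1]
toy_fsrq = [2.2, 2.3, 2.4, 2.5, 2.6]

print(ks_statistic(toy_bll, toy_fsrq))  # fully separated samples
print(ks_statistic(toy_bll, toy_bll))   # identical samples
```

A larger D means the two class distributions are easier to tell apart in that parameter, which is why parameters with large D are attractive inputs for a classifier.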
3 Method
The fields of unsupervised and supervised machine learning provide many classification methods for predicting categorical outcomes, including logistic regression, decision trees, random forests, support vector machines, neural networks, Bayesian networks, Gaussian finite mixture models, and many others (e.g., see Feigelson & Babu 2012; Kabacoff 2015 for the reviews).
In supervised learning (e.g., see Feigelson & Babu 2012; Kabacoff 2015 for more detail), a dataset containing values for both the predictor variables and the outcome is divided into a training sample and a validation sample. The training sample is used to develop a predictive model, and the validation sample is used to verify its accuracy. This division of the data is essential for creating an effective model, since a separate validation sample is needed to make a realistic estimate of the effectiveness of the classification schemes developed on a training sample. Once an effective predictive model is created, it can be used to predict outcomes when only the predictor variables are known.
In this section, a brief introduction to DT, RF, SVMs and Mclust is provided. “Decision trees” aim to build a tree that can be used to classify new observations into one of two groups, by creating a set of binary splits on the predictor variables. They are popular in data-mining techniques (Utgoff 1989; Duda & Stork 2001), but very often, they tend to produce a large tree and suffer from overfitting (e.g., see Breiman et al. 1984; Duda & Stork 2001).
“Random forests” involve a large number of decision trees grown from a single training sample. The strategy is to enhance the classification by conducting votes among those many trees. The method is presented by Breiman (2001) and is applied to an astronomical dataset by Breiman et al. (2003). It is a highly effective method, producing classifications with better accuracy than other classification methods (e.g., Decision trees) (e.g., see Fernández-Delgado et al. 2014). Additionally, it can handle problems with many observations and variables, and it can handle cases where there are large amounts of missing data in the training set and where the number of variables is much greater than the number of observations. Another advantage of RF is that it produces OOB (out-of-bag) error rates and measures of variable importance. On the other hand, due to the large number of trees (500 by default), it is difficult to understand the classification rules and to communicate them.
“Support vector machines” are a group of SML models that can be used for classification and regression (Vapnik 1995, 2000). The mathematical idea behind them is to find the optimal hyperplane, or a set of hyperplanes, for separating classes in a high-dimensional space. This method produces accurate predictive models and is popular at present.
“mclust” (Scrucca et al. 2016) is a powerful R package for model-based clustering, classification, and density estimation, including discriminant analysis. It is based on finite Gaussian mixture modelling and provides several tools for finite mixture models, including functions for parameter estimation using the EM algorithm (e.g., Fraley & Raftery 2002 and Scrucca et al. 2016).
In order to estimate which approach is most accurate, and therefore to choose the best predictive solution, we define accuracy measures in the context of binary classification. A function for calculating these statistics (Kabacoff, 2015) is provided. The function takes a table containing the true outcome (rows) and the predicted outcome (columns) and returns the five accuracy measures. First, the numbers of true positives (outcomes where the model correctly predicts the positive class), true negatives (outcomes where the model correctly predicts the negative class), false positives (outcomes where the model incorrectly predicts the positive class) and false negatives (outcomes where the model incorrectly predicts the negative class) are extracted. Next, these counts are used to calculate the sensitivity, specificity, positive and negative predictive values, and the accuracy (Kabacoff, 2015).
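The five measures can be sketched in a few lines of Python. This is an illustrative re-implementation, not the Kabacoff (2015) R function itself. The confusion-matrix counts below are not quoted from the paper: they are inferred from the Mclust rates in Table 2 and the validation-set sizes given in Section 4 (76 FSRQs, 127 BL Lacs), under the assumption that FSRQ is the positive class.

```python
def performance(tp, fn, tn, fp):
    """Return the five accuracy measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positives among actual positives
        "specificity": tn / (tn + fp),   # true negatives among actual negatives
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Inferred counts (an assumption, see the note above): 68 of 76 FSRQs and
# 118 of 127 BL Lacs in the validation set are classified correctly.
measures = performance(tp=68, fn=8, tn=118, fp=9)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts the rounded values agree with the Mclust row for the 4/5 split with seed=123 in Table 2 (0.895, 0.929, 0.883, 0.937, 0.916), which supports reading FSRQ as the positive class.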
In this work, we use the rpart package in R to create decision trees, the randomForest package to fit random forests, the e1071 package to build support vector machines, and the mclust package to fit Gaussian mixture models, on top of the base R installation (http://www.r-project.org).
4 Results
The selected sample (1416 sources) includes 1016 sources (413 FSRQs and 603 BL Lacs) with known optical classification and 400 BCUs whose optical classification is not identified in the 3FGL catalog. In supervised learning, we randomly (random seed = 123) assign approximately 4/5 of the observations with known classification (603 BL Lacs and 413 FSRQs) to the training dataset, and the remaining ones to the validation dataset (test set), in the 8-dimensional parameter space described in Section 2. All 400 BCUs are treated as the sample for prediction (the forecast dataset). The training set has 813 blazars (476 BL Lacs, 337 FSRQs), and the validation set has 203 blazars (127 BL Lacs, 76 FSRQs). The training dataset is used to create classification schemes using a decision tree, a random forest, a support vector machine and Gaussian Mixture Modelling (Mclust). To simplify the calculation, the default settings of each of the 4 classification functions (from the rpart, randomForest, e1071 and mclust packages) are used in this work. The validation dataset is used to evaluate the effectiveness of these schemes. Applying an effective predictive model, developed on the training set, to the forecast dataset, one can predict outcomes (whether a BCU belongs to the BL Lacs or the FSRQs) in situations where only the predictor variables are known. The main R steps can be obtained from a public website (https://github.com/ksj7924/Kang2019ApJRcode).
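The seeded split described above can be sketched as follows in Python. This is a minimal illustration under stated assumptions, not the authors' R code: the helper `split_indices` is hypothetical, and because Python's random generator differs from R's, the same seed will not reproduce the per-class counts quoted in the text, only the set sizes.

```python
import random


def split_indices(n_items, train_fraction, seed):
    """Shuffle 0..n_items-1 with a fixed seed and cut the list in two."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_items)
    return indices[:n_train], indices[n_train:]


# 1016 blazars with known classification (603 BL Lacs + 413 FSRQs).
train, validation = split_indices(n_items=1016, train_fraction=4 / 5, seed=123)
print(len(train), len(validation))
```

Fixing the seed makes a given split repeatable, while changing it (for example to 321 or 1234) produces a different partition of the same sources, which is how the sensitivity to the training sample is probed in Section 5.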
In the discriminant analysis, we apply the MclustDA function (modelType = “MclustDA”, where each known class is modeled by a finite mixture of Gaussian distributions, with the number of components and the covariance matrix structures allowed to differ between classes; see, e.g., Fraley & Raftery 2002 and Scrucca et al. 2016) to the training dataset. Based on the training sample, the largest BIC (Bayesian Information Criterion) value of -6370.124 was obtained using the VEV model (assuming clusters with ellipsoidal distributions described by variable volumes, equal shapes and variable orientations) with a 4-component mixture distribution for the 476 BL Lacs, and the EVE model (assuming clusters with ellipsoidal distributions described by equal volumes, variable shapes and equal orientations) with a 4-component mixture distribution for the 337 FSRQs. A training error rate of 0.116 is also obtained.
In order to test the result of the supervised learning (Mclust discriminant analysis), the fitted model is applied to the test set, which gives a test error rate of 0.084. We also compute the classification error using cross-validation. A cross-validation error of 0.138 (approximately consistent with the training error rate of 0.116) is computed using the cvMclustDA() function, which by default uses nfold = 10 for a 10-fold cross-validation. The classification of the training dataset from Mclust is shown in Figure 1. In order to evaluate the utility of the classification scheme, the accuracy function described in Section 3 is applied and returns Sensitivity = 0.895, Specificity = 0.929, Positive Predictive Value = 0.883, Negative Predictive Value = 0.937 and Accuracy = 0.916 (see Table 2 and Figure 2). Using the fitted model to classify the prediction dataset (BCUs), we obtain 295 BL Lacs and 105 FSRQs (see Figure 3, Table 2 and the machine-readable supplementary material in Table 4) from the 400 BCUs (the 2 sources that have no radio data are excluded).
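The idea behind a 10-fold cross-validation error can be sketched in plain Python. This is a generic illustration, not a description of how cvMclustDA() works internally: the helpers `kfold_indices`, `cv_error` and the `always_bll` baseline are hypothetical, and the Gaussian mixture classifier itself is replaced by a trivial stand-in so that the example stays self-contained.

```python
import random


def kfold_indices(n_items, n_folds=10, seed=0):
    """Yield (train, test) index lists, one pair per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def cv_error(features, labels, fit_predict, n_folds=10, seed=0):
    """Fraction of held-out items misclassified, pooled over all folds."""
    errors = 0
    for train, test in kfold_indices(len(labels), n_folds, seed):
        predicted = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in test],
        )
        errors += sum(p != labels[i] for p, i in zip(predicted, test))
    return errors / len(labels)


def always_bll(train_x, train_y, test_x):
    """Stand-in classifier (an assumption): predict BL Lac for every source."""
    return ["bll"] * len(test_x)


# Class sizes of the training set quoted in Section 4; features are dummies.
labels = ["bll"] * 476 + ["fsrq"] * 337
features = [[0.0]] * len(labels)
print([len(test) for _, test in kfold_indices(len(labels))])
print(f"{cv_error(features, labels, always_bll):.3f}")
```

The stand-in misclassifies every FSRQ, so its error is 337/813, about 0.415. Any real classifier passed as `fit_predict` would be scored in the same way, on data it has not seen during fitting.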
In the discriminant analysis, applying the randomForest function (the default number of trees is 500) in the randomForest R package (Liaw & Wiener, 2002) to the training dataset, an OOB (out-of-bag) estimate of the error rate of 0.124 was obtained. Random forests also provide a natural measure of variable importance: the coefficients (see Table 1) are Spectral Index = 90.89, flux density = 74.98, Radio flux = 55.55, variability index = 45.65, flux 10-100 GeV = 47.94, flux 100-300 MeV = 32.89, flux 0.3-1 GeV = 24.04 and curve significance = 22.20, which suggests that Spectral Index is the most important variable and curve significance is the least important among the 8 selected parameters. Applying the predictive model obtained from the random forest to the validation sample gives Sensitivity = 0.842, Specificity = 0.890, Positive Predictive Value = 0.821, Negative Predictive Value = 0.904 and Accuracy = 0.872 (see Table 2 and Figure 2). Applying the random forest predictive model to the forecast dataset, we obtain 308 BL Lacs and 92 FSRQs (see Table 2 and the machine-readable supplementary material in Table 4) from the 400 BCUs.
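Since the importance coefficients are quoted in the order of Table 1 rather than by size, a short Python sketch can rank them explicitly. The values are copied from Table 1; the normalization to a share of the total is an illustrative choice made here, not something reported in the paper.

```python
# Random forest importance coefficients, as listed in the last column of Table 1.
importance = {
    "Spectral.Index": 90.89,
    "flux.density": 74.98,
    "Radio.flux.mJy.": 55.55,
    "variability.index": 45.65,
    "flux.100.300.mev": 32.89,
    "flux.0p3.1.gev": 24.04,
    "flux.10.100.gev": 47.94,
    "curve.significance": 22.20,
}

# Sort parameter names from most to least important.
ranked = sorted(importance, key=importance.get, reverse=True)
total = sum(importance.values())

for name in ranked:
    share = importance[name] / total
    print(f"{name:20s} {importance[name]:6.2f} {share:6.1%}")
```

In this ranking the 10 to 100 GeV flux sits in fourth place, ahead of the variability index, even though it appears lower in Table 1, which is sorted by the KS statistic.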
For the training sample, a decision tree is grown using the rpart function in the rpart R package (Therneau & Atkinson, 2018). However, the tree sometimes becomes too large and suffers from overfitting (e.g., Breiman et al. 1984; Duda & Stork 2001). To compensate for this, a pruning function of the package is used to prune back the tree, so that a tree of the desired size is obtained. Applying it to the validation sample gives Sensitivity = 0.829, Specificity = 0.858, Positive Predictive Value = 0.778, Negative Predictive Value = 0.893 and Accuracy = 0.847 (see Table 2 and Figure 2). Applying it to the forecast dataset, 279 BL Lacs and 121 FSRQs are obtained (see Table 2 and the machine-readable supplementary material in Table 4).
Finally, a support vector machine is also applied to the training sample, using the svm function in the e1071 R package (Meyer et al., 2018). Applying the optimal predictive model obtained from the SVM to the validation sample gives Sensitivity = 0.829, Specificity = 0.921, Positive Predictive Value = 0.863, Negative Predictive Value = 0.900 and Accuracy = 0.887 (see Table 2 and Figure 2). Applying the optimal predictive model to the forecast dataset, 301 BL Lacs and 99 FSRQs are obtained (see Table 2 and the machine-readable supplementary material in Table 4).
5 Discussions and Conclusions
In this work, we try to evaluate the potential classification of BCUs using supervised machine learning (discriminant analysis). We use 4 methods (DT, RF, SVMs and Mclust) to perform the discriminant analysis on 8 parameters (the γ-ray photon spectral index, radio flux, flux density, curve significance, the three integral photon fluxes and the variability index). All 4 classifiers perform exceedingly well and produce accurate and effective forecasts of the classification of BCUs. Comparing the results of these methods, Gaussian Mixture Modelling is the most promising (see Table 2) for our training sample (4/5, seed=123).
FSRQs have stronger emission lines (EW > 5 Å), while BL Lacs have weak (EW < 5 Å) or no emission lines (e.g., Urry & Padovani 1995); FSRQs show higher luminosity than BL Lacs (e.g., see Fossati et al. 1998; Ghisellini et al. 2011 and Ghisellini et al. 2017); and, based on the 3LAC catalogue, Ackermann et al. (2015) argued that FSRQs tend to have softer spectra, stronger variability and lower peak frequencies in both the synchrotron and IC components than BL Lacs, among many other differences. These distinctions suggest different physical origins in FSRQs and in BL Lacs (e.g., Bhattacharya et al. 2016; Fan et al. 2016b; Yang et al. 2018; Boula et al. 2019). The synchrotron peak frequency of FSRQs is significantly lower than that of BL Lacs (e.g., Fossati et al. 1998; Abdo et al. 2009, 2010b; Ackermann et al. 2011, 2015; Ghisellini et al. 2017), which implies that more electron populations lose their energy through synchrotron cooling in FSRQs. In this scenario, we could expect stronger radio emission in FSRQs. In the gamma-ray band, it is commonly believed that the gamma-ray radiation in BL Lacs originates from a pure synchrotron self-Compton (SSC) process (e.g., Mastichiadis & Kirk 1997; Krawczynski et al. 2004; Zheng & Zhang 2011; Zhang et al. 2014; Zheng et al. 2014; Chen 2017; Zheng et al. 2018), while that in FSRQs comes from SSC+EC (external Compton) processes (e.g., Sambruna et al. 1999; Böttcher & Chiang 2002; Chen & Bai 2011; Kang et al. 2014; Kang et al. 2016; Zheng & Yang 2016; Zheng et al. 2017). This indicates that, for FSRQs, a complex physical process can be expected in the Fermi energy bands. The Fermi energy spectrum of FSRQs could result from the superposition of several spectral components (e.g., Zheng & Kang 2013; Zheng et al. 2016; Kang 2017). Because the Fermi band of FSRQs is located at the intersection of the synchrotron self-Compton and external Compton components, more complex observational features could result (e.g., Abdo et al. 2009, 2010b; Ackermann et al. 2011, 2015). Other physical origins (e.g., the mass accretion rate onto the central black hole) are also discussed in recent works (e.g., Boula et al. 2019). The more fundamental physical differences between FSRQs and BL Lacs require further discussion in the future.
We check the results for different combinations of parameters. Based on the coefficient (a natural measure of variable importance) from the random forests supervised learning (see Column 9 in Table 1), we select the parameters with the higher coefficients (4 parameters: the spectral index, log flux density, log radio flux and log variability index; or 3 parameters: the spectral index, log flux density and log radio flux) and repeat the discriminant analysis with the 4 methods. We find that the predictive accuracy is lower than with 8 parameters. Nevertheless, Gaussian Mixture Modelling still tends to be more accurate than the other classification methods for the different combinations of variables tested (see Table 2, e.g., 8 parameters, 4 parameters and 3 parameters) for our training sample. In general, this implies that more parameters give a higher accuracy, although the trend is not stable across the different classification methods (see Table 2 for the details).
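The selection of the reduced parameter sets can be illustrated with a minimal sketch. It assumes only the four importance values listed for the selected parameters in the last column of Table 1; the values of the other four parameters are not reproduced here, and the helper name `top_k` is hypothetical rather than part of the analysis code used in this work.

```python
# Random-forest importance values listed in Table 1 for the selected parameters.
# The other four parameters are omitted because their values are not quoted here.
importance = {
    "Spectral.Index": 90.89,
    "flux.density": 74.98,
    "Radio.flux.mJy.": 55.55,
    "variability.index": 45.65,
}


def top_k(scores, k):
    """Return the k parameter names with the largest importance (hypothetical helper)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


four_parameters = top_k(importance, 4)
three_parameters = top_k(importance, 3)
print(four_parameters)
print(three_parameters)
```

Ranking by importance and truncating the list is only one possible selection rule; it reproduces the 4-parameter and 3-parameter sets described above.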
However, we should note that the predictive accuracy and results may be affected by the choice of the training and validation datasets. When we randomly (seed=123) assign approximately 2/3 of the blazars with known classification (603 BL Lacs and 413 FSRQs) to the training dataset (677 blazars: 399 BL Lacs and 278 FSRQs) and the remaining ones to the validation dataset (339 blazars: 204 BL Lacs and 135 FSRQs) in the 8-dimensional parameter space, as in Section 4, the predictive accuracy (Accuracy = 0.897) and results (304 BL Lacs and 96 FSRQs predicted from the 400 BCUs) differ slightly from those of the (4/5, seed=123) training and validation samples (see Table 2) in the Mclust discriminant analysis. The other methods show similar behaviour. For different random samples (e.g., random seed = 123, 321 or 1234), the results and accuracy also differ (see Table 2); the highest accuracy is obtained with the support vector machines (seed=321) or with the Random forests (seed=1234), respectively. These suggest that the results of the discriminant analysis (supervised learning) are significantly affected by the particular selection (e.g., seed=123, 321 or 1234) and the size (e.g., a change from 813 blazars (4/5, see Section 4) to 677 blazars (2/3)) of the training sample. That the support vector machines or the Random forests sometimes yield a higher accuracy is consistent with other works (e.g., Fernández-Delgado et al. 2014).
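The dependence on the seed and on the training fraction can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this work: the function name `split_sample` is hypothetical, and Python's random number generator will not reproduce the particular subsets drawn for Table 2, only the sample sizes and the fact that each seed selects a different subset.

```python
import random


def split_sample(n_sources, train_fraction, seed):
    """Shuffle source indices with a fixed seed and split them into two sets."""
    rng = random.Random(seed)  # local generator, reproducible for a given seed
    indices = list(range(n_sources))
    rng.shuffle(indices)
    n_train = round(n_sources * train_fraction)
    return indices[:n_train], indices[n_train:]


N_KNOWN = 1016  # 603 BL Lacs + 413 FSRQs with known classification

for fraction in (4 / 5, 2 / 3):
    for seed in (123, 321, 1234):
        train, valid = split_sample(N_KNOWN, fraction, seed)
        print(f"fraction={fraction:.2f} seed={seed}: "
              f"{len(train)} training, {len(valid)} validation")
```

The same seed always returns the same split, so a quoted accuracy is reproducible, while a different seed or fraction changes which sources the classifier is trained on.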
In addition, we should also note that the default settings of each of the 4 classification functions are used in Section 4. For each classification method, the choice of model and the setting of each parameter in the fitting function (e.g., “modelType = MclustDA or EDDA” (Eigenvalue Decomposition Discriminant Analysis; e.g., see Scrucca et al. 2016) in the Mclust function, “tree = 500 or 10000” in the random forests function, whether a pruning function is used in the decision trees, and “gamma=0.1 or 0.01” and “cost=1 or 1000” in the support vector machines function) can also affect the predictive models, accuracy and results (see the labels “EDDA”, “Forest”, “no pruned” and “cost” in Table 2). How to select the appropriate parameter settings needs to be addressed in the future and is beyond the scope of this work.
We compare the results of the Mclust algorithm with those of the other three SML algorithms. For BL Lacs, 253, 277 and 274 (on average about 91%) BL Lac candidates from DT, RF and SVMs, respectively, match the 295 Mclust BL Lac candidates (see Table 3); 42, 18 and 21 (on average about 9%) sources, respectively, are classified as FSRQs and so do not match the 295 BL Lac candidates. For FSRQs, 79, 74 and 78 (on average about 73%) FSRQ candidates match the Mclust results (105 FSRQ candidates), whereas 26, 31 and 27 (on average about 27%) sources, respectively, do not match the subset of 105 FSRQ candidates (see Table 3).
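The mean percentages quoted above follow directly from the counts. A minimal sketch of the arithmetic, using only the numbers given in the text (the function name `mean_fraction` is hypothetical):

```python
def mean_fraction(counts, total):
    """Mean of the counts, expressed as a percentage of the reference sample size."""
    return 100.0 * sum(counts) / len(counts) / total


# Counts quoted in the text for DT, RF and SVMs (in that order),
# compared with the 295 BL Lac and 105 FSRQ reference candidates.
bl_match = mean_fraction([253, 277, 274], 295)
bl_mismatch = mean_fraction([42, 18, 21], 295)
fsrq_match = mean_fraction([79, 74, 78], 105)
fsrq_mismatch = mean_fraction([26, 31, 27], 105)

print(f"BL Lacs: {bl_match:.1f}% match, {bl_mismatch:.1f}% mismatch")
print(f"FSRQs:   {fsrq_match:.1f}% match, {fsrq_mismatch:.1f}% mismatch")
```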
We also compare the results of the Mclust algorithm with other recent similar results (e.g., Chiaro et al. 2016; Massaro et al. 2016; Lefaucheur & Pita 2017; Yi et al. 2017). Cross-comparing with the results of Lefaucheur & Pita (2017), who used multivariate classifications, we find that, in the subset of 295 BL Lac candidates (Mclust method), 3 sources have no match and 28 sources have no clear classification in Lefaucheur & Pita (2017). The Mclust prediction agrees with Lefaucheur & Pita (2017) for 228 objects (about 77%) and disagrees for 36 (about 12%). For the subset of 105 FSRQ candidates, 83 objects agree and 14 objects disagree with the Mclust prediction, and 8 objects have no clear classification (Lefaucheur & Pita 2017). Similarly, in the subset of 295 BL Lac candidates, the Mclust prediction agrees with Chiaro et al. (2016), who used artificial neural network (ANN) machine-learning techniques, for 244 sources and disagrees for 22, while 29 objects have no clear classification (Chiaro et al. 2016). For the subset of 105 FSRQ candidates, only 54 objects agree and 29 objects disagree with the Mclust prediction, and 22 objects have no clear classification (Chiaro et al. 2016). The comparison with Yi et al. (2017), who performed a statistical analysis of the broadband spectral properties (e.g., spectral indices in the gamma-ray, X-ray, optical and radio bands) of blazars, shows similar results: 218 BL Lac candidates and 97 FSRQ candidates agree with the Mclust prediction (295 BL Lac and 105 FSRQ candidates), while 4 BL Lac and 8 FSRQ candidates disagree (see Table 3). Most of the SML results are consistent with those of Massaro et al. (2016), which are based on optical spectroscopic observations (47 BL Lacs, 92.2%, and 7 FSRQs, 77.8%).
However, a fraction of the sources (4 BL Lacs and 2 FSRQs) are misclassified by the SML. These results suggest a good overall agreement between these SML algorithms and other recent results. Nevertheless, SML algorithms probably lead to some misjudgments in evaluating the potential (optical) classification of blazars. Optical spectroscopic observation is still the most efficient and accurate way to determine the real nature of these sources.
When we combine the results of these 4 methods, 246 BL Lac and 64 FSRQ candidates are obtained (see Table 3). Although the number of candidates decreases, their reliability improves. The mismatch rate (e.g., rate = 4/(4+47) ≈ 7.8%, see Table 3) drops significantly, from about 25.9%, 7.8%, 13.6% and 8.3% to about 11.5%, 6.1%, 3.6% and 2.2% for BL Lacs, and from about 7.6%, 22.2%, 14.4% and 34.9% to about 0%, 0%, 0% and 11.5% for FSRQs, in comparison with the results of Yi et al. (2017), Massaro et al. (2016), Lefaucheur & Pita (2017) and Chiaro et al. (2016), respectively (see Table 3), which suggests that better results can be obtained by applying multiple methods simultaneously.
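As a minimal sketch of this combination step, assume that "combining" means keeping only the sources on which all 4 methods agree. The source names and labels below are hypothetical toy values, and the helper names `unanimous` and `mismatch_rate` are illustrative; only the rate formula and the Massaro et al. (2016) counts are taken from the text.

```python
def unanimous(labels_by_method):
    """Keep only the sources that receive the same label from every method."""
    methods = list(labels_by_method.values())
    consensus = {}
    for source in methods[0]:
        votes = {labels[source] for labels in methods}
        if len(votes) == 1:  # all methods agree on this source
            consensus[source] = votes.pop()
    return consensus


def mismatch_rate(n_disagree, n_agree):
    """Mismatch rate in per cent, as in rate = 4/(4+47)."""
    return 100.0 * n_disagree / (n_disagree + n_agree)


# Hypothetical toy labels for three sources (not taken from Table 3 or Table 4).
toy = {
    "DT":     {"BCU-1": "BL Lac", "BCU-2": "FSRQ", "BCU-3": "BL Lac"},
    "RF":     {"BCU-1": "BL Lac", "BCU-2": "FSRQ", "BCU-3": "FSRQ"},
    "SVM":    {"BCU-1": "BL Lac", "BCU-2": "FSRQ", "BCU-3": "BL Lac"},
    "Mclust": {"BCU-1": "BL Lac", "BCU-2": "FSRQ", "BCU-3": "BL Lac"},
}
combined = unanimous(toy)
print(combined)

# BL Lac comparison with Massaro et al. (2016): 4 mismatches, 47 matches.
print(f"{mismatch_rate(4, 47):.1f}%")
```

Requiring unanimity discards sources such as the toy "BCU-3", which is why the combined sample is smaller than any single-method sample.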
The discriminant analysis can return the probabilities that a BCU belongs to the BL Lac (B) or FSRQ (F) class (e.g., for the Mclust method, see Table 4 and the machine-readable supplementary material). However, it should be noted that the error of the supervised machine learning in this work is still large: the accuracy is less than 92%. This probably leads to some misjudgments, in which some FSRQs are falsely classified as BL Lacs and vice versa (see the discussion above). These results may be biased, may reflect only an apparent phenomenon, or may hide the essential difference between BL Lacs and FSRQs (e.g., see Blandford et al. 2018 for a review, and Boula et al. 2019), which needs further consideration. Although we did not conclusively determine the classifications (FSRQs or BL Lacs) of the BCUs using SML, the results may help with source selection in future spectroscopic and photometric observation campaigns aimed at diagnosing the optical classification of BCUs (e.g., see Yi et al. 2017; Massaro et al. 2013 for some discussion), or provide some clues for future spectroscopic and photometric studies.
Finally, it must be highlighted that, in this work, our results are obtained only by applying supervised machine learning to data from the Fermi catalogue; we add no external data (from other archives) and have not done any of the fittings ourselves. A limited sample (made up mostly of bright, firmly classified sources and excluding fainter sources) is used to diagnose the optical classifications of the BCUs in the 3LAC Clean Sample. Selection effects in the direct observational data, caused by the detection thresholds and energy bands of the instruments, may affect the source distributions and hence the results of the analysis in this work.
However, it should be pointed out that each of these classifiers (DT, RF, SVMs and Mclust) performed well on each of the accuracy measures. The SML results agree among the classifiers and with other recent results (Chiaro et al. 2016; Massaro et al. 2016; Lefaucheur & Pita 2017; Yi et al. 2017), which suggests that SML can provide an effective and easy method to evaluate the potential classification of BCUs. The evaluation results show that the approach (SML) is valid and robust (see Table 2). The ratio between FSRQs and BL Lacs predicted from the 400 BCUs is about 1:3 for any of the SML algorithms. We should also note that a ratio of about 1:4 (64 FSRQs and 246 BL Lacs) is obtained by combining the results of these 4 methods (see Table 3). Whether the true ratio is 1:3, 1:4 or some other value needs further verification. Gaussian Mixture Modelling tends to be more accurate than the other classification methods for the different combinations of variables tested (see Table 2, e.g., 8 parameters, 4 parameters and 3 parameters) for our training sample. Although a number of factors influence the accuracy of SML, this work provides some simple methods to distinguish BL Lacs from FSRQs among the BCUs, with the associated probabilities (see Table 4), based on the direct observational data. A better statistical approach that uses a larger and more complete sample (e.g., the upcoming 4LAC) is needed to further test and address the issue.
Acknowledgements
We thank the anonymous referee for very constructive and helpful comments and suggestions, which greatly helped us to improve our paper, and we thank Lv xin for help with language and writing. This work is partially supported by the National Natural Science Foundation of China (Grant Nos. 11763005, 11873043, 11847091, 11733001, 11622324, U1531245, and 11573009), the Science and Technology Foundation of Guizhou Province (QKHJC[2019]1290), the Research Foundation for Scientific Elitists of the Department of Education of Guizhou Province (QJHKYZ[2018]068), the Open Fund of Guizhou Provincial Key Laboratory of Radio Astronomy and Data Processing (KF201811), the Natural Science Foundation of the Department of Education of Guizhou Province (QJHKYZ[2015]455), the Physical Electronic Key Discipline of Guizhou Province (ZDXK201535), the Research Foundation for Advanced Talents of Liupanshui Normal University (LPSSYKYJJ201506), the Research Foundation of Liupanshui Normal University (LPSSY201401), the Cultivation Project of Master’s Degree of Liupanshui Normal University (LPSSYSSDPY201704), the Key Disciplines Construction Project of Liupanshui Normal University (LPSZDZY201803), the Physics Key Discipline of Liupanshui Normal University (LPSSYZDXK201801), and the Experimental Teaching Demonstration Center of Liupanshui Normal University (LPSSYsyjxsfzx201801).
References
- Abdo, A. A., Ackermann, M., Ajello, M., et al. 2009, ApJ, 700, 597, doi: 10.1088/0004-637X/700/1/597
- Abdo, A. A., Ackermann, M., Agudo, I., et al. 2010a, ApJ, 716, 30, doi: 10.1088/0004-637X/716/1/30
- Abdo, A. A., Ackermann, M., Ajello, M., et al. 2010b, ApJ, 715, 429, doi: 10.1088/0004-637X/715/1/429
- Acero, F., Ackermann, M., Ajello, M., et al. 2015, ApJS, 218, 23, doi: 10.1088/0067-0049/218/2/23
- Ackermann, M., Ajello, M., Allafort, A., et al. 2011, ApJ, 743, 171, doi: 10.1088/0004-637X/743/2/171
- Ackermann, M., Ajello, M., Allafort, A., et al. 2012, ApJ, 753, 83, doi: 10.1088/0004-637X/753/1/83
- Ackermann, M., Ajello, M., Atwood, W. B., et al. 2015, ApJ, 810, 14, doi: 10.1088/0004-637X/810/1/14
- Acuner, Z., & Ryde, F. 2018, MNRAS, 475, 1708, doi: 10.1093/mnras/stx3106
