Machine Learning Regression of stellar effective temperatures in the second $Gaia$ Data Release
Yu Bai, JiFeng Liu, ZhongRui Bai, Song Wang, and DongWei Fan

TL;DR
This study applies supervised machine learning to predict stellar effective temperatures in Gaia DR2, achieving higher precision than previous methods by utilizing a large combined spectroscopic dataset and Gaia parameters.
Contribution
It introduces a novel large-scale regression approach using combined spectroscopic data and Gaia parameters, improving temperature prediction accuracy for millions of stars.
Findings
Achieved a root-mean-squared error of 191 K in temperature predictions.
Predicted effective temperatures for over 132 million Gaia stars.
Demonstrated a new method for blind and external regression tests.
Abstract
This paper reports on the application of the supervised machine-learning algorithm to the stellar effective temperature regression for the second data release, based on the combination of the stars in four spectroscopic surveys: Large Sky Area Multi-Object Fiber Spectroscopic Telescope, Sloan Extension for Galactic Understanding and Exploration, the Apache Point Observatory Galactic Evolution Experiment and the RAdial Velocity Extension. This combination, about four million stars, enables us to construct one of the largest training sample for the regression, and further predict reliable stellar temperatures with a root-mean-squared error of 191 K. This result is more precise than that given by second data release that is based on about sixty thousands stars. After a series of data cleaning processes, the input features that feed the regressor are carefully selected from…
| Parameters | RMSE (K) |
|---|---|
| BP∗ , RP∗ | 407 |
| , BP, RP | 393 |
| , , , , , , , | 227 |
| , , , , BP, RP | |
| , , , , , | 226 |
| , , BP, RP | |
| , , , , , , , , | 198 |
| , , BP , RP | |
| , , , , BP , RP | 196 |
| , , , , , , | 194 |
| BP , RP | |
| , , , , , , | 193 |
| BP , RP | |
| , , , , BP , RP | 192 |
| , , , , , , | 191 |
| BP , RP |
| Source ID | Regressed |
|---|---|
| 2448780173659609728 | 5128 634 |
| 2448781208748235648 | 5463 69 |
| 2448689605685695488 | 5984 91 |
| 2448689777484387072 | 4333 396 |
| 2448783991887042176 | 4166 65 |
| 2448690258520723712 | 5062 55 |
| 2448690327240200576 | 5846 89 |
| 2448689811844125184 | 4328 385 |
| 2448784953959717376 | 5382 58 |
| 2448783991887042048 | 4888 539 |
| RMSE | |||
|---|---|---|---|
| (K) | (K) | (K) | |
| Cross Validation | 17 1 | 91 1 | 191 |
| SSPP | 58 2 | 87 2 | 179 |
| RAVE | 112 4 | 196 4 | 260 |
| APOGEE | 28 3 | 119 3 | 191 |
| Catalog | First subclass | Second subclass |
|---|---|---|
| SSPP | 666 (383) | 74,015 (184) |
| RAVE | 247 (510) | 1,245 (314) |
| APOGEE | 225 (406) | 22,444 (315) |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Machine Learning Regression of stellar effective temperatures in the second Data Release
Yu Bai11affiliationmark:
JiFeng Liu11affiliationmark: 22affiliationmark:
ZhongRui Bai11affiliationmark:
Song Wang11affiliationmark:
DongWei Fan33affiliationmark:
1 Key Laboratory of Optical Astronomy, National Astronomical Observatories, Chinese Academy of Sciences, 20A Datun Road, Chaoyang Distict, Beijing 100012, China; [email protected]
2 College of Astronomy and Space Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
3 National Astronomical Observatories, Chinese Academy of Sciences, 20A Datun Road, Chaoyang Distict, Beijing 100012, China
Abstract
This paper reports on the application of the supervised machine-learning algorithm to the stellar effective temperature regression for the second data release, based on the combination of the stars in four spectroscopic surveys: Large Sky Area Multi-Object Fiber Spectroscopic Telescope, Sloan Extension for Galactic Understanding and Exploration, the Apache Point Observatory Galactic Evolution Experiment and the RAdial Velocity Extension. This combination, about four million stars, enables us to construct one of the largest training sample for the regression, and further predict reliable stellar temperatures with a root-mean-squared error of 191 K. This result is more precise than that given by second data release that is based on about sixty thousands stars. After a series of data cleaning processes, the input features that feed the regressor are carefully selected from the parameters, including the colors, the 3D position and the proper motion. These parameters is used to predict effective temperatures for 132,739,323 valid stars in the second data release. We also present a new method for blind tests and a test for external regression without additional data. The machine-learning algorithm fed with the parameters only in one catalog provides us an effective approach to maximize sample size for prediction, and this methodology has a wide application prospect in future studies of astrophysics.
Subject headings:
stars: fundamental parameters — methods: data analysis — techniques: spectroscopic
1. Introduction
The ESA space mission is performing an all-sky astrometric, photometric and radial velocity survey at optical wavelength (Gaia Collaboration et al., 2016). The main objective of the mission is to survey more than one billion stars, in order to understand the structure, formation, and evolution of our Galaxy. The second data release ( DR2; Gaia Collaboration et al. 2018) includes a total of 1.69 billion sources with -band photometry based on 22 months of observations. Of these, 1.38 billion sources also have the integrated fluxes from the BP and RP spectrophotometers, which span 33006800 Å and 640010500 Å, respectively.
These three broad photometric bands have been used to infer stellar effective temperatures (), for all sources brighter than 17 mag with in the range 300010,000 K (Andrae et al., 2018). A machine learning algorithm, random forest (RF), has been applied to regress . The training data of the algorithm is a combination of five spectrum- or photometry-based catalogs with a total 65,200 stars. A typical accuracy of the regression is 324 K that is estimated from 50% hold-out validation, and no blind test is performed to quantify the performance of the regression and to avoid overfitting.
However, decoupling stellar temperatures and interstellar extinction is a complex problem, and more parameters than two colors is required to regress temperatures with good accuracy (Bai et al., 2019). Moreover, diversity of a sample in a parameter space has been proved to be an influential aspect, and has strong impact on overall performance of machine learning (Wang et al. 2009a, Wang et al. 2009b). The small size of training set in Andrae et al. (2018) could limit the diversity of the stellar sample and further cause regressed having high systematic deviation (e.g., Pelisoli et al. 2019; Sahlholdt et al. 2019).
The availability of spectrum-based stellar parameters for large numbers is now possible thanks to the observations of large Galactic spectral surveys. Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST; Luo et al. 2015) data release 5 (DR5) was available to domestic users in December of 2017, which includes over eight millions observations of stars 111See http://dr5.lamost.org/.. This archive data after six years’ accumulation is a treasure for various studies. One of the catalog mounted on the archive is A, F, G and K type stars catalog, in which the stellar parameters, , log and [Fe/H] are determined by the LAMOST stellar parameter pipeline (Wu et al., 2014).
Another large survey is Sloan Extension for Galactic Understanding and Exploration (SEGUE; Yanny et al. 2009). The spectra are processed through the SEGUE Stellar Parameter Pipeline (SSPP; Allende Prieto et al. 2008; Lee et al. 2008a; Lee et al. 2008b; Smolinski et al. 2011), which uses a number of methods to derive accurate estimates of stellar parameters, , log, [Fe/H], [/Fe] and [C/Fe].
Different from the upper two surveys that are in optical band, the Apache Point Observatory Galactic Evolution Experiment (APOGEE), as one of the programs in both SDSS-III and SDSS-IV, has collected high-resolution ( 22,500) high signal-to-noise (S/N 100) near-infrared (1.511.71 m) spectra of 277,000 stars (data release 15) across the Milky Way (Majewski et al., 2017). These stars are dominated by red giants selected from the Two Micron All Sky Survey. Their stellar parameters and chemical abundances are estimated by the APOGEE Stellar Parameters and Chemical Abundances Pipeline (ASPCAP; García Pérez et al. 2016).
These surveys aim mainly at stars located in the north hemisphere, while the RAdial Velocity Extension (RAVE) covers the south sky. It is designed to provide stellar parameters to complement missions that focus on obtaining radial velocities to study the motions of stars in the Milky Way s thin and thick disk and stellar halo (Steinmetz et al., 2006). Its pipeline processes the RAVE spectra and derives estimates of , log , and [Fe/H] (Kunder et al., 2017).
The large amount of spectroscopic data in these four catalogs provides us an opportunity to apply machine learning technology to regress effectively. In Section 2, we present validation samples and a method of data cleaning. Various input parameters are also explored to regress temperatures in the section. We apply the regressor and present a revised version of catalog for DR2 in Section 3. Blind tests and external regression tests are also provided. A discussion is given in Section 4.
2. Methodology
2.1. Validation Samples
The A, F, G and K type stars catalog of LAMOST DR5 includes the estimates of the stellar with the application of a correlation function interpolation (Du et al., 2012) and Université de Lyon spectroscopic analysis software (Koleva et al., 2009). These two approaches are based on distribution and morphology of absorption lines in normalized stellar spectra, independent from Galactic extinction. The temperatures are in the rang of 3460 8500 K with the uncertainty of 110 K (Gao et al., 2015). We extract 4,340,931 unique stars in the catalog, and cross match them to DR2 with a radius of 2 arcseconds, which yields 4,249,013 stars.
For SEGUE survey, we adopt estimated with the SSPP that is also based on distribution and morphology of stellar absorption lines. The temperatures range from 4000 9710 K with the typical uncertainty of 180 K. We perform a cross match with DR2, and obtain 1,037,433 stars.
The of APOGEE stars is estimated by ASPCAP, which searches a multi-dimensional grid for the best-matching synthetic spectrum (Mészáros et al., 2013). The temperatures are in the range of 3550 8200 K, with the typical uncertainty of 100 K. We cross match these stars with DR2, and obtain 275,019 stars.
The pipeline of RAVE is based on the combination of the MATrix Inversion for Spectral SynthEsis (MATISSE; Recio-Blanco et al. 2006) algorithm and the DEcision tree alGorithm for AStrophysics (DEGAS; Bijaoui et al. 2012). This pipeline is valid for stars with temperatures between 4000 K and 8000 K. The estimated errors in is approximately 250 K, and 100 K for spectra with S/N 50 (Kunder et al., 2017). The cross match with DR2 yields 518,812 stars.
We here only adopt the from spectroscopic surveys, since their stellar parameters are highly reliable (Mathur et al., 2017), compared to photometric surveys, e.g., Kepler Input Catalog. As a result, there are 6,080,277 matched stars in the four catalogs.
2.2. Data Cleaning
Andrae et al. (2018) applied various filters to remove bad data, and some of them are also adopted in our data cleaning processes. We remove the samples with 0 or 0.2. The samples with the high or negative relative uncertainties of the parallaxes may suffer large bias in the distance measurements (Luri et al., 2018), or could include large fraction of non-stellar objects (Bai et al., 2018). We also exclude the samples with \sigma(\mbox{T_{\mathrm{eff}}})/\mbox{T_{\mathrm{eff}}}> 0.05 to remove inaccurate estimates.
We plot color-color diagram in Figure 1, and select the region with number densities higher than 150 per 0.01 mag2. A logarithmic function is used to fit the colors of the sample in this region. The best fit function is = 1.79log10(+0.42)+0.71. We shift the function with 0.15 mag to select the samples with good photometry. This good-quality region is marked with the black solid lines in Figure 1. The region defined by the logarithmic function shows a better consistency with the stellar locus than the cuts in Andrae et al. (2018). As a result, the training sample contains 3,810,143 stars.
The distribution of these stars is shown in Figure 2, which is inhomogeneous. We give the impact of this on the prediction for DR2 in Section 3. The training sample is dominated by F, G, and K stars with 5000-6000 K, different from the distribution of the training sample in Andrae et al. (2018) that concentrates in five specific temperatures.
We present the differences between the in DR2 and the literature estimates in Figure 3. Some vertical concentrated regions are shown in LAMOST and SSPP panels. The stars in these regions have similar temperatures in DR2, but have different estimates in the spectrum-based catalogs. This implies that the temperatures given by DR2 are probably still coupled with Galactic extinction, since the regressor was built with two colors from a small size sample. These two colors couldn’t provide enough information to decouple the temperatures from the extinction (Davenport et al., 2014).
2.3. Input Parameters
The in DR2 was determined using two colors in photometric bands, and (Andrae et al., 2018). They tested randomised trees, support vector machine (SVM) and Gaussian processes. The algorithm of randomised trees showed very fast learning, and its results are as good as the other two algorithms.
We also adopt the random forest algorithm (RF; Breiman 2001) to build the regressor, but try different combinations of input parameters. The working theory of the RF is that it builds an ensemble of unpruned decision trees and merges them together to obtain a more accurate and stable prediction. The algorithm consists of many decision trees, and it outputs the class that is the mode of the class output by individual trees. The RF is often used when we have very large training data sets and a very large number of input variables. One big advantage of RF is fast learning from a very large number of data.
We here apply the 10 folded cross validations to test the performance of the regression, rather than the 50% hold validation that is used in Andrae et al. (2018). The cross validation partitions the sample into ten randomly chosen folds of roughly equal size. One fold is used to validate the regression that is trained using the remaining folds. This process is repeated ten times such that each fold is used exactly once for validation. The 10 folded cross validation can provide an overall assessment of the regression.
The root-mean-squared error (RMSE) is adopted to stand for the performance of the regressors (Table 1). We find that the regressor of eight input parameters, , , , , , , BP and RP, shows the best performance with RMSE of 191 K, while the regressor that is constructed with only two colors is the worst. The one-to-one correlation of the best regression is shown in the left panel in Figure 4. We bin the regressed with a step size of 100 K and fit the distribution of the corresponding test with a Gaussian function (the blue error bars in Figure 4), in order to estimate the uncertainty of the regression for different temperatures. The Gaussian fit to the total residuals is shown in the right panel, and the fitted offset () and the standard deviation () are listed in Table 3.
3. Result
We now use the criteria below to select the samples in DR2, which yields 132,739,323 stars.
[TABLE]
The algorithm of RF constructed with eight input parameters is applied to regress their , and the result is listed in Table 2.
The size of the catalog is a little smaller than that in Andrae et al. (2018), since we use more strict criteria. We compare our results with in DR2 in Figure 5. The is concentrated in some specific temperatures, 4000 K, 4500 K, 5000 K, 5500 K and 6000 K. These temperatures are consisted with the peaks in the distribution of the training set of the regressor. The inhomogeneous training set yielded output with similar distribution (see Fig. 5 and Fig. 18 in Andrae et al. 2018). Our distribution concentrates in two much broad peaks, 5000 K and 6000 K, implying better homogeneousness.
3.1. Blind Tests
Blind test is effective method to measure performance of a machine learning classifier or regressor (Bai et al., 2019). It evaluates prediction accuracy with data that are not in the training set, and provides validation that a regressor is working sufficiently to output reliable results.
In order to apply blind tests, we train sub-regressors with eight input parameters in three catalogs, and use the forth catalog to test these sub-regressors. The LAMOST DR5 is always included in the training set, since it accounts for 87% of the stars in our training set. We omit the testing stars that located outside the parameter spaces of the sub-regressors to avoid external regression. We present the results of the blind tests in Figure 6, and list the parameters of the Gaussian fit to the total residuals in Table 3.
The blind tests show that the offsets of the total residuals are below 112 K, and the standard deviations are less than 200 K. Lee et al. (2015) has applied the SSPP to LAMOST stars and compared the results to those from RAVE and APOGEE catalogs. The offsets of between different pipelines are from 36 to 73 K, and the standard deviations are from 79 to 172 K. This indicates that our regressor can output the stellar temperatures at similar accuracy to the results of spectrum-based pipelines.
3.2. External Regression
In order to test the stars that are located outside our criteria, we adopt the sub-regressors trained with three catalogs and use the stars in the forth catalog to apply external regression. The stars are divided into two subclasses, located outside the quality cuts in Figure 1, and with 0.2 0.4 (Table 4).
The result is shown in Figure 7. The is systematically overestimated for the first testing subclass, and their RMSEs are twice larger than those of the blind tests. The photometry that feed to the sub-regressors is probably worse than the photometry that located inside the quality cuts, and the sub-regressors could not predict with good accuracy.
For the second subclass, most of the stellar temperatures are also overestimated, since a large parallax relative uncertainty may refer to a complex transformation to determine a distance (Bailer-Jones et al., 2018). Such a transformation may bring noise to the sub-regressors and results in bad performances.
We don’t test the regression with outside the training label range of 37009700 K because of the inability of RF to extrapolate. Andrae et al. (2018) fed their regressor with stars that have outside the training interval, and those stars were assigned temperatures inside the training interval.
Therefore, it is suggested that all the criteria should be applied before regression in order to select good samples and further produce reliable .
4. Discussion
In this work, we have attempted to regress the effective temperatures for 132,739,323 stars in DR2 using machine learning algorithm. The regressor is trained with about four million stars in LAMOST, SSPP, RAVE and APOGEE catalogs, one of the largest training sample ever used for machine learning in astrophysics. We have tried several combinations of input parameters, and have applied cross validation to test the performances. The regressor with the smallest RMSE is built with , , , , , , BP and RP. The cross validation indicates that the typical accuracy of the regression is 191 K. In order to examine the performance of the regressor, we use the majority of the training set to build three sub-regressors, and apply the rest small fraction for blind testing. The testing results show similar performance to some spectrum-based piplines. In this section we would like to discuss the processes that haven’t been used in other machine-learning studies.
4.1. Feature Selection
In machine learning technology, feature selection is a process of selecting features in the data that are most useful or most relevant for the problem. The problem in the paper is regressing with parameters in DR2. We adopt the RMSE to indicate the relevance of the problem for the different subsets of the input parameters.
One of the most popular parameters is a stellar color, since a stellar temperature could be roughly described by a color. However, this description suffers from temperature-extinction coupling. When we try to use colors or magnitudes to regress , the performance is bad. This implies that the color parameters are relevant to our problem, but the problem couldn’t be fully described by these colors. The additional input parameters are required to provide information about the Galactic interstellar extinction.
Many works have been done to draw the 3D dusty map of the Milky Way (eg. Green et al. 2018). The extinction value is a function of the stellar location. When we add , and parallax to the parameter subsets, the performance becomes better. The RMSE is slightly smaller for the regressor with and input than and input, probably due to the transformation between equatorial and galactic coordinates. The algorithm need to find this potential transformation when build the regressor with and , which may add additional noise and result in a larger RMSE. When we use and instead of and to build the regressor, and become the most two important parameters (Figure 8). It implies that the information on Galactic extinction plays an important role in the regression.
The proper motion can also improve the performance of the regressor, and its importance is higher than those of colors (Figure 8), implying that its more relevant than colors in our regressing process. The proper motion could provide assistant information on stellar distance statistically, based on the fact that the systematic errors in distance would result in the correlations between the measured , , and velocity components (Schönrich et al., 2012; Wang et al., 2016). This implies that when we add the proper motion to the parameter subsets, the parallax could give more information about the reliability of the stellar distance.
4.2. Blind Tests for Sub-regressors
The training set for the regressor is dominated by the stars in the LAMOST catalog, over 87%, and the other three catalogs comprise 4% of SSPP, 5% of RAVE and 4% of APOGEE. We build three sub-regressors with a combination of three catalogs that are 95% of the training set, in order to apply blind tests and further to avoid potential overfitting. Each one of the sub-regressor could be used to predict the for the stars in DR2, while we use all four catalogs to train the final regressor in order to maximize the performance.
It has been proved that the performance of the regressor can be increased by adding more data to the training set (Banko & Brill, 2001). However, Pilászy & Tikk (2009) argued that more data not always help to improve the performance, and only good data can rather than noisy data. The consistency in the results of our blind tests indicates that the stars in SSPP, RAVE and APOGEE carry more signal than noise into the training set, and using four catalogs rather than three could raise the performance.
On one hand, the systematic error of the regression is mainly from the biases among different input catalogs (Figure 2). Such error doesn’t decrease, when we add more data to build the regressor. The residuals in Figure 6 show that the biases are constrained to and 200 for four spectrum-based catalogs. On the other hand, performing a regression on the combination of four catalogs can be just as biased as performing the regression on three of them. In these cases, it can be reasonable to use an averaging scheme, when there is enough samples in every bin of the grid. However, the training stars are dominated by F, G and K stars, and the sample sizes of high and low mass aren’t enough to smooth the fluctuation in the bins. We would take advantage of the averaging scheme to train a regressor more effectively, with the help of DR3 (next year) and LAMOST DR6 plus early version of DR7 (more than ten million spectra in this summer).
Therefore, it is reasonable that we use the majority of the training set to build the sub-regressor, and apply the rest small fraction for the blind test. This process can be applied in other machine-learning regression, when there isn’t additional data for a blind test.
This work was supported by the National Program on Key Research and Development Project (Grant No. 2016YFA0400804) and the National Natural Science Foundation of China (NSFC) through grants NSFC-11603038/11425313/11403056. This work presents results from the European Space Agency (ESA) space mission Gaia. Gaia data are being processed by the Gaia Data Processing and Analysis Consortium (DPAC). Funding for the DPAC is provided by national institutions, in particular the institutions participating in the Gaia MultiLateral Agreement (MLA). The Gaia mission website is https://www.cosmos.esa.int/gaia. The Gaia archive website is https://archives.esac.esa.int/gaia. The Guoshoujing Telescope (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope, LAMOST) is a National Major Scientific Project which is built by the Chinese Academy of Sciences, funded by the National Development and Reform Commission, and operated and managed by the National Astronomical Observatories, Chinese Academy of Sciences. Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and the Participating Institutions. SDSS-IV acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. The SDSS web site is http://www.sdss.org/. SDSS-IV is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, the Chilean Participation Group, the French Participation Group, Harvard-Smithsonian Center for Astrophysics, Instituto de Astrofísica de Canarias, The Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe (IPMU) / University of Tokyo, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Astrophysik (MPA Garching), Max-Planck-Institut für Extraterrestrische Physik (MPE), National Astronomical Observatories of China, New Mexico State University, New York University, University of Notre Dame, Observatário Nacional / MCTI, The Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, United Kingdom Participation Group, Universidad Nacional Autónoma de México, University of Arizona, University of Colorado Boulder, University of Oxford, University of Portsmouth, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Vanderbilt University, and Yale University.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Allende Prieto et al. (2008) Allende Prieto, C., Sivarani, T., Beers, T. C., et al. 2008, AJ, 136, 2070
- 2Andrae et al. (2018) Andrae, R., Fouesneau, M., Creevey, O., et al. 2018, A&A, 616, A 8
- 3Bai et al. (2018) Bai, Y., Liu, J.-F., & Wang, S. 2018, Research in Astronomy and Astrophysics, 18, 118
- 4Bai et al. (2019) Bai, Y., Liu, J.-F., Wang, S., & Yang, F. 2019, AJ, 157, 9
- 5Bailer-Jones et al. (2013) Bailer-Jones, C. A. L., Andrae, R., Arcay, B., et al. 2013, A&A, 559, A 74
- 6Bailer-Jones et al. (2018) Bailer-Jones, C. A. L., Rybizki, J., Fouesneau, M., Mantelet, G., & Andrae, R. 2018, AJ, 156, 58
- 7Banko & Brill (2001) Banko, M. & Brill, E. 2001, Scaling to very very large corpora for natural language disambiguation. In Proceedings of ACL-2001, pages 26 C 33.
- 8Bijaoui et al. (2012) Bijaoui, A., Recio-Blanco, A., de Laverny, P., & Ordenovic, C. 2012, St Met, 9, 55
