The Fundamental Relation between Halo Mass and Galaxy Group Properties
Zhong-yi Man, Ying-jie Peng, Jing-jing Shi, Xu Kong, Cheng-peng Zhang, Jing Dou, Ke-xin Guo

TL;DR
This paper investigates the relationship between galaxy group halo mass and observable properties, proposing a scenario based on star formation quenching and validating it with machine learning, leading to improved halo mass predictions.
Contribution
It introduces a simple evolutionary scenario for galaxy groups and uses machine learning to confirm it, significantly improving halo mass estimation accuracy from observable data.
Findings
RF regressor reduces halo mass prediction error by 50%
Scenario accurately describes the growth differences between blue and red groups
Enhanced halo mass estimates aid studies of galaxy-halo connection
Table 1
| Color | N (training) | N (validation) | MSE (validation) | N (test) | MSE (test) | Pearson r |
|---|---|---|---|---|---|---|
| Blue | 1305238 | 163154 | 0.003454 | 163154 | 0.003494 | 0.976 |
| Red | 459986 | 57498 | 0.0231 | 57498 | 0.02362 | 0.966 |

Table 2
| Groups | Central stellar mass | Total stellar mass | Richness | SFR_n | Color_n | Age_n | Compactness | Luminosity gap | B/T_n | OOB Score |
|---|---|---|---|---|---|---|---|---|---|---|
| Blue | 0.821% | 76.930% | 0.220% | 15.674% | 2.471% | 1.630% | 0.399% | 0.214% | 1.641% | 95.165% |
| Red | 1.191% | 65.700% | 16.800% | 1.564% | 1.743% | 7.763% | 0.731% | 0.585% | 3.921% | 93.089% |
Zhong-yi Man
Department of Astronomy, School of Physics, Peking University, Beijing, 100871, China
Kavli Institute for Astronomy and Astrophysics (KIAA), Peking University, Beijing, 100871, China
Department of Astronomy, Yale University, New Haven, CT 06520, USA
Ying-jie Peng
Kavli Institute for Astronomy and Astrophysics (KIAA), Peking University, Beijing, 100871, China
Jing-jing Shi
Kavli Institute for Astronomy and Astrophysics (KIAA), Peking University, Beijing, 100871, China
Xu Kong
Key Laboratory for Research in Galaxies and Cosmology, Department of Astronomy, University of Science and Technology of China, Hefei 230026, China
School of Astronomy and Space Sciences, University of Science and Technology of China, Hefei 230026, China
Cheng-peng Zhang
Department of Astronomy, School of Physics, Peking University, Beijing, 100871, China
Kavli Institute for Astronomy and Astrophysics (KIAA), Peking University, Beijing, 100871, China
Jing Dou
Department of Astronomy, School of Physics, Peking University, Beijing, 100871, China
Kavli Institute for Astronomy and Astrophysics (KIAA), Peking University, Beijing, 100871, China
Ke-xin Guo
Kavli Institute for Astronomy and Astrophysics (KIAA), Peking University, Beijing, 100871, China
International Centre for Radio Astronomy Research, University of Western Australia, Crawley, WA 6009, Australia
(Received 2019 Feb 24; Revised 2019 June 19; Accepted 2019 July 1)
Abstract
We explore the interrelationships between the galaxy group halo mass and various observable group properties. We propose a simple scenario that describes the evolution of the central galaxies and their host dark matter halos. Star formation quenching is one key process in this scenario, which leads to the different assembly histories of blue groups (groups with a blue central) and red groups (groups with a red central). For blue groups, both the central galaxy and the halo continue to grow their mass. For red groups, the central galaxy has been quenched and its stellar mass remains about constant, while its halo continues to grow by merging with smaller halos. From this simple scenario, we speculate about the driving properties that should strongly correlate with the group halo mass. We then apply a machine learning algorithm, the Random Forest (RF) regressor, to blue groups and red groups separately in the semianalytical model L-GALAXIES to explore these nonlinear multicorrelations and to verify the scenario as proposed above. Remarkably, the results given by the RF regressor are fully consistent with the prediction from our simple scenario and hence provide strong support for it. As a consequence, the group halo mass can be more accurately determined from observable galaxy properties by the RF regressor, with a 50% reduction in error. A halo mass more accurately determined in this way also enables more accurate investigations of the galaxy-halo connection and other important related issues, including galactic conformity and the effect of halo assembly bias on galaxy assembly.
Keywords: galaxies: evolution; galaxies: formation; galaxies: halos; methods: statistical
1 Introduction
In the context of the $\Lambda$CDM paradigm, the formation and evolution history of galaxies is closely correlated with the hierarchical growth of the dark matter halos in which galaxies reside. Studying the interrelationships between various properties of galaxies and their host dark matter halos can help us better understand the physics of galaxy formation, provide a basis for interpreting observations of the large-scale structure, constrain the cosmological parameters, and distinguish between dark matter models to probe the nature of dark matter (see Wechsler & Tinker 2018 for a comprehensive review on this topic).
Currently, there are several ways to obtain halo mass from observations. In galaxy clusters, the halo mass can be derived directly either from the line-of-sight (LOS) velocity dispersion of the member galaxies, through the virial theorem, or from the temperature and density of the hot intracluster medium, assuming hydrostatic equilibrium. The abundance matching (AM) technique (e.g. Kravtsov & Klypin 1999; Moustakas & Somerville 2002; Tasitsiomi et al. 2004; Vale et al. 2004; Yang et al. 2005; Conroy et al. 2006; Yang et al. 2007; Moster et al. 2010), as a simple and powerful tool, provides an indirect way to derive the halo mass of galaxy groups. The key assumption of AM is that the most massive central galaxy lives in the most massive dark matter halo, followed by the second most massive central galaxy living in the next most massive halo, and so forth. Based on this, given a halo mass function, one can in principle assign a halo mass to a central galaxy according to its stellar mass ranking. Weak gravitational lensing is another powerful tool for measuring the halo mass distribution (e.g. Mandelbaum et al. 2006; Luo et al. 2018).
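The rank-ordering assumption behind AM can be illustrated with a minimal sketch. It assumes that a finite list of halo masses has already been drawn from some halo mass function; the function name `abundance_match` and the toy numbers are hypothetical and are not taken from any of the cited works.

```python
def abundance_match(log_mstar, log_mhalo_pool):
    """Assign halo masses to galaxies by rank order.

    The most massive galaxy receives the most massive halo, the second
    most massive galaxy the second most massive halo, and so on.
    """
    if len(log_mstar) != len(log_mhalo_pool):
        raise ValueError("need exactly one halo mass per galaxy")
    # Galaxy indices sorted from most to least massive.
    galaxy_order = sorted(range(len(log_mstar)),
                          key=lambda i: log_mstar[i], reverse=True)
    # Halo masses sorted from most to least massive.
    ranked_halos = sorted(log_mhalo_pool, reverse=True)
    assigned = [0.0] * len(log_mstar)
    for rank, galaxy_index in enumerate(galaxy_order):
        assigned[galaxy_index] = ranked_halos[rank]
    return assigned


# Toy log10 masses, for illustration only.
log_mstar = [10.2, 11.0, 9.8, 10.6]
log_mhalo_pool = [11.7, 12.1, 12.8, 13.4]
matched = abundance_match(log_mstar, log_mhalo_pool)
print(matched)
```

By construction this mapping carries no scatter: two galaxies with the same stellar mass rank receive the same halo mass regardless of any other property, which is the limitation that additional observables are meant to address.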
Based on the assumption that the total stellar mass/luminosity of a galaxy group is correlated with its halo mass, Yang et al. (2005, 2007) assign a halo mass to each galaxy group identified by their group finder. As one of the most widely used Sloan Digital Sky Survey (SDSS) group catalogs, Yang et al. (2007) group catalogs have inspired a series of studies on correlations among galaxies, halos, and the large-scale environment (e.g. Peng et al. 2010, 2012; Wang et al. 2013; Lacerna et al. 2014; Peng & Maiolino 2014a; Balogh et al. 2016; Spindler et al. 2018; Wang et al. 2018; Graham et al. 2018; Dragomir et al. 2018). Using the halo mass estimated from the AM technique, the signature of halo assembly bias can be detected (Wang et al., 2008, 2013; Lacerna et al., 2014). For instance, the low specific star formation rate (sSFR) sample is found to be clustered more than the high sSFR sample at a given halo mass. However, a signature of assembly bias was not found in Lin et al. (2016). Lin et al. (2016) conclude that it is likely due to either the inaccurate mean relationship between total luminosity and halo mass, or the fact that the scatter in the estimated halo mass using the AM technique correlates with physical properties of the galaxies, such as sSFR and star formation history (SFH). Therefore, we may include galaxy observables (other than stellar mass) that may correlate with halo mass, and this could further minimize the scatter in the estimated halo mass, that is, allow us to derive a more accurate halo mass.
In this work, we will use machine learning (ML) techniques to analyze the underlying multicorrelations between various observable group properties and halo mass, and to predict the halo mass of galaxy groups. In recent years, ML has been used to predict halo mass in several studies (Ntampaka et al. 2015, 2016, 2018; Armitage et al. 2019; Ho et al. 2019; Calderon et al. 2019). For example, using the Support Distribution Machine, Ntampaka et al. (2015, 2016) constrained the dynamical mass of galaxy clusters from distributions of the LOS velocity dispersion of cluster members. The scatter in the mass prediction is significantly smaller than that of the standard mass-velocity dispersion scaling relation, even when clusters are contaminated with interloper galaxies. Armitage et al. (2019) employed a variety of ML algorithms to predict cluster mass based on a set of dynamical observables other than the LOS velocity dispersion. Most recently, Ho et al. (2019) used a deep learning method, Convolutional Neural Networks, to produce dynamical mass estimates of galaxy clusters.
However, most of these works target massive clusters and are mainly based on dynamical mass indicators, while our analysis is based on galaxy groups spanning a large mass range, with observables related to the SFH taken into account. We apply the Random Forest (RF) regressor, a powerful supervised ML algorithm, to a subsample of the group catalog retrieved from the semianalytic model (SAM) L-GALAXIES (Henriques et al., 2015). This model is built upon a cosmological $N$-body simulation, covering a relatively large mass range with good statistics, which makes it an ideal sample for our purpose. We expect that the blue and red groups, which follow different stellar-halo mass relations (SHMRs), may have different assembly histories. We will hence choose galaxy group observables based on the existing understanding of the galaxy-halo connection and perform the analysis separately for these two samples. We will use RF to identify the most important group properties in determining halo mass and look for analytical formulae of halo mass as a function of selected galaxy group properties.
The layout of the paper is as follows. In section 2, we will briefly introduce the SAM (L-GALAXIES), the mock catalog, and the galaxy group samples. In section 3, we will discuss in detail on the different SHMRs for blue and red centrals and how we select galaxy group properties to be used in our analysis. In section 4, we will introduce the RF regressor and the configuration of the algorithm. The results will be presented in section 5. We will summarize our main findings in section 6.
2 Data
2.1 Semianalytical Galaxy Formation Model
A semianalytical model is a computationally efficient way to model galaxy formation and evolution, by describing the various physical processes analytically and tracing the dark matter halo merger trees. In this work, we use the publicly released data of the latest Munich semianalytical model, L-GALAXIES (Henriques et al., 2015; http://gavo.mpa-garching.mpg.de/MyMillennium/). This model is built on the Millennium (Springel et al., 2005) and Millennium-II (Boylan-Kolchin et al., 2009) simulations rescaled to the Planck cosmology (Planck Collaboration XVI, 2014). Compared with previous Munich galaxy formation models (e.g. Guo et al. 2011, 2013), the model has made several changes, such as delaying the reincorporation of wind ejecta, lowering the gas density threshold for star formation, modifying the radio-mode feedback, and eliminating ram-pressure stripping of satellites in halos below a mass threshold. In addition, L-GALAXIES treats the observational errors with care: in the MCMC sampling of the model, Henriques et al. (2013) used multiple “good” determinations of each observational property and took the scatter among them (together with the quoted statistical errors) to indicate the likely systematic uncertainties.
The model employs the Markov Chain Monte Carlo (MCMC) method to adjust the parameter space to match observations. It has been carefully calibrated against observed stellar masses and passive fraction of galaxies within the redshift range of , to produce the observed evolution of stellar mass function (SMF) and the distribution of sSFR. We are using the mocks produced from L-GALAXIES because SAMs like L-GALAXIES still produce more accurate stellar mass function than hydro-simulations, which indicates a more accurate SHMR. Also, SAMs usually have a much larger volume (500 Mpc/h for L-GALAXIES) than typical hydro-simulations, leading to larger training samples, better statistics and less cosmic variance.
2.2 Galaxy Group Samples
The aim of the work is to predict the halo mass of the galaxy groups in L-GALAXIES using the observable group properties. We select all galaxies with stellar mass above in the snapshot with from the catalog based on the Millennium simulation. The final catalog consists of 3,632,259 galaxies, including 2,206,529 centrals and 1,425,730 satellites identified by the friends-of-friends group finder.
We separate galaxy groups into blue and red according to the color of the central galaxies. We adopt a widely used selection criterion (e.g. Darvish et al. 2016; Shen et al. 2017; Laigle et al. 2018) in which quiescent galaxies are defined by cuts in two colors (Ilbert et al., 2009; Williams et al., 2009). Both colors are in the rest frame with dust extinction included. One of the two colors is a good indicator of the current versus past star formation activity (Martin et al. 2007; Arnouts et al. 2007), and the two-color selection can effectively differentiate between dusty star-forming galaxies and quiescent galaxies.
3 Galaxy-Halo Connection
3.1 SHMR
The SHMRs for blue and red groups in our mock catalog are shown in the left panel of Figure 1. It is evident that the blue and red centrals follow different trends, as in previous works (e.g. More et al. 2011; Peng et al. 2012; Wang & White 2012; Rodríguez-Puebla et al. 2015; Mandelbaum et al. 2016; Zu & Mandelbaum 2016). At a given halo mass, blue centrals on average have larger stellar masses than red centrals. At a fixed stellar mass of the centrals, the red centrals are living, on average, in more massive halos with more satellites than the blue ones, and the red fraction of centrals increases with halo mass. As discussed in Peng et al. (2012) and Peng & Maiolino (2014b), these results can be explained by a simple scenario in which quenching is a result of stellar mass alone, as illustrated in Figure 2 below.
The cartoons in Figure 2 illustrate the different SHMRs for blue and red centrals. For blue groups (left panel), the central galaxies grow their stellar mass by star formation or mergers. Meanwhile, their host dark matter halos also continue to grow by merging with other halos. Therefore, the blue central galaxies are expected to move diagonally to the right and upward (indicated by the blue arrow). As a consequence of such coupled coevolution of the central and its host halo, a relatively tight relation between the stellar mass of the central and its host halo mass is expected.
For red groups, when the central is quenched at an early epoch (by certain physical mechanisms), its stellar mass remains about constant unless additional stellar mass is accreted through subsequent mergers, while its halo continues to grow by merging smaller halos, irrespective of the star formation status of the central galaxy. In other words, the growth of the halo has been decoupled from the growth of the central once the central was quenched, after which the red central has been moving horizontally to the right, as indicated by the red arrow. The triangular shading is due to the fact that not all of the quenched centrals will remain centrals during their horizontal evolution to the right (as marked by the red arrow). More massive red centrals will stand a higher chance of surviving as centrals following substantial growth in the mass of their parent halos, while less massive red centrals may become satellites, or may even disappear completely if they merge with larger galaxies, when their parent halos merge with more massive halos (with more massive galaxies). In both cases, they will no longer exist on this plot. Consequently, there are fewer low-mass red centrals that continue to evolve to the right than high-mass red centrals, which leads to the triangular red shading in the right panel of Figure 1.
Putting this all together, first of all, we see that the total stellar mass of a group is expected to correlate strongly with the group halo mass, since the total stellar mass is the best indicator of both the overall SFH of all group members and the merging history of the halo. Second, as above, the different assembly histories for blue and red groups can produce different SHMRs. For blue groups, in addition to the total stellar mass of the group, the properties that indicate the SFH of the centrals, such as star formation rate and color of the centrals, should also correlate strongly with group halo mass. For red groups, in addition to the total stellar mass of the group, the properties that can indicate the quenching epoch of the centrals (e.g., stellar age of the central) and the halo growth history (e.g., group richness) should also correlate strongly with group halo mass.
To investigate the complicated nonlinear and nonorthogonal multicorrelations between various observable galaxy properties and group halo mass, and to verify the simple scenario above, we employ ML techniques. As above, we will treat the blue groups and red groups separately in our following analysis. This is hence different from many previous studies (Ntampaka et al. 2015, 2016; Armitage et al. 2019; Calderon et al. 2019), which do not differentiate between blue groups and red groups.
We stress that our primary goal is to use the ML technique to verify the scenario as proposed above, which gives a simple description of the different coevolution histories for blue and red groups with their dark matter halos. Since the ML algorithms can also quantify the correlation between various observable galaxy properties and group halo mass, in return, the group halo mass can be more accurately predicted from observable galaxy properties.
3.2 Galaxy Group Properties
Figure 1 shows the correlation between group halo mass and the stellar mass of the central in the left panel and group halo mass and total stellar mass of the group in the right panel. As in the analysis in the previous section, the correlation in the right panel is evidently tighter than that in the left panel, indicating that the total stellar mass of the group plays a critical role in determining the group halo mass. Meanwhile, the scatter becomes progressively larger toward the low-mass end, which implies that other group properties may start to become important in determining the group halo mass in the low-mass regime.
As discussed in the previous section, galaxy group properties related to the assembly formation history of the centrals and their host halos should directly contribute to halo mass. We also include other key observable properties in our analysis as follows:
- Stellar mass of the central galaxy, used in the standard AM approach.
- Total stellar mass in the group, used in the AM of Yang et al. (2005, 2007).
- Group richness, defined as the total number of group members above a certain mass threshold (see Knobel et al. 2009). As discussed in the previous section, it is expected to correlate with halo mass because it characterizes the halo growth history. In addition, Peng et al. (2012) used richness to study the environmental effect of galaxy quenching and found it to be a good proxy of halo mass on group scales.
- SFR of the central galaxy (SFR). As discussed in the previous section, SFR and color can reveal the SFH of the central, thus correlating with group halo mass.
- Color of the central galaxy, which is a good indicator of the current versus past star formation activity (Arnouts et al. 2007; Martin et al. 2007).
- Stellar age of the central galaxy, weighted by the luminosity in a single band (Age). As discussed in the previous section, the stellar age of the central can indicate the quenching epoch of the red centrals, and may hence be correlated with halo mass (Lacerna et al., 2011). The stellar age is also used to investigate the galaxy assembly bias in SDSS (Lacerna et al., 2014).
- Compactness of the group in terms of the projected median distance of all satellites to the central galaxy. This might be taken as an observational manifestation of halo concentration, which is a promising secondary parameter of dark matter halos driving the galaxy-halo connection (Wechsler et al., 2006; Faltenbacher & White, 2010).
- Luminosity gap, defined as the single-band luminosity difference between the brightest and the second brightest galaxies within the group. It has been taken as a secondary halo mass indicator besides the luminosity or stellar mass of the central galaxy (More 2012; Hearin et al. 2013; Shen et al. 2014; Lu et al. 2015).
- Bulge-to-total mass ratio of the central galaxy (B/T). It is found to be correlated with the SFH of the galaxies (Cheung et al., 2012; Wake et al., 2012; Bluck et al., 2014) and has been measured for SDSS galaxies (Simard et al., 2011).
The virial mass of the halo in which the galaxy group resides is defined as the dark matter mass enclosed in the spherical volume within which the average density is $200\rho_{\rm crit}$, with $\rho_{\rm crit}$ being the critical density of the universe.
The L-GALAXIES model has been calibrated to reproduce the real values of group properties (section 2.1) that are basically available in observation. In particular, L-GALAXIES gives so far one of the most accurate fits of the SMF in the local universe, including SDSS (Baldry et al., 2008; Li & White, 2009) and the Galaxy And Mass Assembly (GAMA) survey (Baldry et al., 2012). The galaxy group and group richness in L-GALAXIES are derived via the friends-of-friends method, similar to those used in observation (e.g. Yang et al. 2005; Berlind et al. 2006; Yang et al. 2007). Compared to previous models (e.g. Guo et al. 2013; Henriques et al. 2013), the L-GALAXIES (Henriques et al., 2015) also has improvements in matching the distributions of color, SFR (Brinchmann et al., 2004; Salim et al., 2007), and stellar age (Gallazzi et al., 2005) at a fixed stellar mass.
However, there is always some systematic difference in the absolute mean values (e.g., of stellar mass, SFR, and color) between model predictions and observations, and the same holds for the scatter around the mean value. In order to reduce the potential bias introduced by the systematic difference in both the mean value and the scatter around the mean, for a given galaxy group property $X$, we use its renormalized dimensionless form $X_n$, defined as
$$X_n = \frac{X - \mu_X}{\sigma_X}, \qquad (1)$$
where $\mu_X$ and $\sigma_X$ are the mean and standard deviation of the parameter $X$.
By using this renormalized dimensionless form, we have assumed follows a normal distribution in both SAM and observations, and the difference between the two distributions depends only on these two parameters. While this assumption is not perfectly satisfied (e.g. for stellar age), the form we use could still reduce the biases from the offsets between SAM and observations in either the mean value or the scatter around the mean. In practice, we apply equation 1 to SFR, , age, and B/T ratio, whose absolute values are more likely to deviate from observation, while keeping the original values for , , richness, and given that the SMF and galaxy clustering are the most accurately predicted properties in SAMs. The following analyses are hence based on these renormalized input group observables.
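The renormalization can be sketched in a few lines. This is a minimal illustration, assuming the population standard deviation is intended (the text does not say whether the population or the sample estimator is used); the function name `renormalize` and the toy values are hypothetical.

```python
import statistics


def renormalize(values):
    """Return the dimensionless form (x - mean) / std of a property."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population std
    if sigma == 0:
        raise ValueError("property has zero scatter; cannot renormalize")
    return [(v - mu) / sigma for v in values]


# Toy "model" values and an "observed-like" copy with a different
# zero point and scatter (illustrative numbers only).
model_values = [0.5, 1.0, 1.5, 2.0, 2.5]
offset_values = [1.0 + 2.0 * v for v in model_values]

model_n = renormalize(model_values)
offset_n = renormalize(offset_values)
print(model_n)
print(offset_n)
```

The two renormalized lists coincide, which shows the intent of the transformation: an offset in the mean or a rescaling of the scatter between the SAM and the observations drops out, provided the two distributions otherwise share the same shape.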
4 Machine Learning
4.1 Random Forest Regressor
As the ML algorithm, we adopt the RF regressor (Breiman 2001) in the Python library scikit-learn (Pedregosa et al., 2011; http://scikit-learn.org). The RF algorithm is highly efficient, easy to use, and capable of dealing with multifeature data without requiring feature selection. The unit of RF is the decision tree, a tree-like model of decisions. The root node of the tree is split into different decision nodes, which are further split into more nodes. A final node that does not split anymore is a leaf (decision). An RF is a combination of randomly generated decision trees from the same data set. Each tree is trained individually using a random subset of features, which can mitigate the problem of overfitting. Using RF can increase the signal-to-noise ratio of the prediction since errors across different trees are likely to cancel each other out. Considering that the interrelationships between galaxy observables and halo mass are mostly nonlinear, RF could yield a better prediction than linear modeling methods, such as ordinary least squares and partial least squares (Robinson et al., 2017).
The RF can make accurate predictions after being well trained, but it is also important to interpret the “black-box-like” model to understand the contribution of each input parameter to any particular predictions. Feature importance is a popular approach to quantifying the contribution of different input features to the predicted variable (halo mass in our case). There are different ways to compute the feature importance. For instance, Palczewska et al. (2013) proposed three methods: median, cluster analysis, and log-likelihood. In scikit-learn, RF regressor ranks the relative importance of the input features based on the gini importance (Pedregosa et al., 2011), which is the most common approach. The gini importance of a feature is computed by measuring its efficiency in reducing variance when creating decision trees within RF regressor. The feature importance given in scikit-learn will provide a clear view of the prediction from the model and can be used to explore the correlations between halo mass and different group properties.
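How feature importance and the out-of-bag score are read off a fitted forest can be sketched with scikit-learn. The data below are synthetic stand-ins generated from an arbitrary formula, and the feature names and hyperparameter values are hypothetical; none of them reproduce the L-GALAXIES sample or the configuration used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_groups = 2000

# Synthetic inputs: one dominant feature, one weaker nonlinear
# feature, and one feature unrelated to the target.
log_mstar_tot = rng.uniform(9.5, 12.0, n_groups)
sfr_n = rng.normal(0.0, 1.0, n_groups)
unrelated = rng.normal(0.0, 1.0, n_groups)

# Arbitrary toy relation for the target, with small Gaussian noise.
log_mhalo = (1.2 * log_mstar_tot + 0.3 * np.tanh(sfr_n)
             + rng.normal(0.0, 0.05, n_groups))

features = np.column_stack([log_mstar_tot, sfr_n, unrelated])
feature_names = ["log_mstar_tot", "sfr_n", "unrelated"]

forest = RandomForestRegressor(
    n_estimators=100,     # number of trees
    max_features=2,       # features considered at each split
    min_samples_leaf=5,   # minimum leaf size
    oob_score=True,       # R^2 on samples left out of each bootstrap
    random_state=0,
)
forest.fit(features, log_mhalo)

# Impurity-based importances; they are normalized to sum to one.
importance = dict(zip(feature_names, forest.feature_importances_))
for name, value in importance.items():
    print(f"{name}: {100 * value:.2f}%")
print(f"OOB R^2: {forest.oob_score_:.3f}")
```

In this toy setup the dominant feature takes most of the importance and the unrelated feature takes almost none, which is the kind of ranking used below to compare group properties.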
4.2 Training, Validation, and Test Samples
We adjust the hyperparameters of the RF regressor to enhance its efficiency and accuracy. For instance, n_estimators is the number of trees in the forest. A larger value could lead to a better prediction but at a higher cost in computation time. max_features is the size of the random subsets of features to consider when splitting a node. Decreasing this size reduces the variance but also increases the bias. Other adjustable hyperparameters include the maximum depth of the trees, the minimum number of samples required to split a node, and the minimum size of a leaf. Although some of the hyperparameters have empirically optimal values, one still needs to cross-validate different combinations of them to enhance the performance of the RF regressor within a reasonable computation time.
In order to predict the halo mass from given galaxy group properties via the RF regressor, we randomly divide the blue and red group samples into training, validation, and test sets, as shown in Table 1. The training sample makes up 80% of the full sample, while the validation and test samples each make up 10%. The training sample is used to train the RF, and the validation sample is used to tune the hyperparameters of the RF routines so as to minimize the mean square error (MSE) between the true halo mass and the halo mass predicted by the RF model. The MSE is defined as
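$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\log M_{{\rm true},i} - \log M_{{\rm pred},i}\right)^2, \qquad (2)$$
where $M_{{\rm true},i}$ and $M_{{\rm pred},i}$ are the true and predicted halo masses of the $i$th group and $N$ is the number of groups in the sample.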
After we obtain the optimal set of hyperparameters for the RF regressor, we apply the regressor to the test sample, which is used to evaluate the performance of the trained algorithm. The MSEs of the test samples and the Pearson correlation coefficients between the true and predicted halo mass are shown in Table 1. It is evident that the MSEs of the test sets (which have not been used in model training and validation) are nearly identical to those of the validation sets.
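The bookkeeping of the 80/10/10 split and of the MSE can be sketched as follows. This is a minimal illustration in plain Python; the helper names, the random seed, and the rounding convention (validation and test each take one tenth, rounded down, and training takes the rest) are assumptions rather than the procedure actually used.

```python
import random


def split_sizes(n_total):
    """Sizes of the training, validation, and test sets."""
    n_val = n_total // 10
    n_test = n_total // 10
    return n_total - n_val - n_test, n_val, n_test


def split_80_10_10(items, seed=0):
    """Randomly divide items into training, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train, n_val, _ = split_sizes(len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def mean_square_error(true_values, predicted_values):
    """Mean of the squared differences between two equal-length lists."""
    if len(true_values) != len(predicted_values):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2
               for t, p in zip(true_values, predicted_values)) / len(true_values)


train, val, test = split_80_10_10(range(1000))
print(len(train), len(val), len(test))
print(split_sizes(1631546), split_sizes(574982))
print(mean_square_error([12.0, 13.0], [12.1, 12.8]))
```

With this rounding convention, 1,631,546 blue groups and 574,982 red groups give the same training, validation, and test counts as those listed for the two samples.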
5 Results
5.1 Predicting Halo Mass
The halo masses are predicted for the test samples by using the optimal RF regressor tuned based on the validation samples. We compare the predicted halo mass with the true halo mass in Figure 3 for blue groups (left panel) and red groups (right panel). The halo masses are recovered rather well via RF regressor: the 1:1 diagonal line in both panels (left for blue groups and right for red groups) goes through the ridge of all contour lines, indicating that there is little systematic difference between the predicted value and the true value of the halo mass. To quantify the distance to the 1:1 diagonal line () for each point, we adopt the parameter , as defined in Yang et al. (2007):
[TABLE]
where is the standard deviation of at a given
In the bottom panel of Figure 3, the scatter is on average as small as 0.1 dex for blue groups, and it becomes slightly larger for red groups but is still less than 0.2 dex, except at the very low mass end. The MSE is only 0.003494 for blue groups and 0.02362 for red groups. In the traditional AM approach, in which only the total stellar mass (or total luminosity) of the group is used to derive the halo mass, the scatter is 0.2 dex or up to 0.3 dex (e.g. Yang et al. 2007). The group total stellar mass is the only halo mass indicator employed in Yang et al. (2007), who use the AM method to match halo mass with the total stellar mass. To make a more direct comparison, we repeat our analysis using the total stellar mass as the only input parameter for the RF regressor. The results are shown in the left panel of Figure 4. In this analysis, as in the usual AM approach, we have not differentiated between blue and red groups. We also apply the usual AM approach to our full sample by ranking the groups by total stellar mass and then assigning the halo mass to each group according to this ranking only. The results are shown in the right panel of Figure 4. The scatter derived in the two panels is very similar, around 0.2 dex or up to 0.3 dex, which agrees well with Yang et al. (2007). The MSE is 0.04193 for the RF regressor and 0.03241 for AM, both of which are higher than for the RF regressor with more input observables. The similarity of the two panels indicates that the RF regressor is not superior to the usual AM approach when the total stellar mass is used as the only input parameter. This is expected because the advantage of RF lies in exploring the complicated correlations between multiple input parameters. Compared with these single-parameter results, the RF regressor using the full set of galaxy group properties (Figure 3) evidently produces a more accurate prediction of the halo mass than the usual AM approach, reducing the scatter by about 50%.
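A scatter of this kind can be estimated as sketched below. The sketch rests on two assumptions that are not confirmed by the text: the distance to the 1:1 line is taken as the perpendicular distance, (log M_pred - log M_true)/sqrt(2), and the standard deviation is computed in bins of predicted mass. The function name, the bin width, and the toy numbers are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def scatter_about_one_to_one(log_m_true, log_m_pred, bin_width=0.5):
    """Standard deviation of the perpendicular distance to the 1:1 line,
    computed in bins of predicted log mass (assumed binning choice)."""
    bins = defaultdict(list)
    for t, p in zip(log_m_true, log_m_pred):
        distance = (p - t) / math.sqrt(2.0)  # assumed definition
        bins[math.floor(p / bin_width)].append(distance)
    # Report only bins with at least two points, keyed by lower bin edge.
    return {index * bin_width: statistics.pstdev(values)
            for index, values in sorted(bins.items())
            if len(values) >= 2}


# Toy log10 masses, for illustration only.
log_m_true = [12.0, 12.0, 12.0, 12.0]
log_m_pred = [12.1, 12.3, 12.1, 12.3]
scatter = scatter_about_one_to_one(log_m_true, log_m_pred)
print(scatter)
```

With the perpendicular-distance convention the scatter is smaller by a factor of sqrt(2) than the standard deviation of the plain difference between predicted and true log mass, which matters when comparing it with an MSE defined on that difference.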
5.2 Importance of Group Properties
As discussed in section 4.1, one of the great features of the RF regressor is that it can calculate and rank the relative importance of each input parameter in determining the output. Table 2 shows the relative importance of each group property by using the training sample with our optimized RF regressor. Group properties with subscript “n” are the renormalized dimensionless parameters. The out-of-bag (OOB) scores in the last column characterize the overall accuracy of the regressor. The values for blue groups and red groups are both above 90%, suggesting that the model predictions are quite accurate and sample-independent.
Table LABEL:tbl-2 shows that the total stellar mass () is the driving input parameter for both blue and red groups (), while the stellar mass of the central () is trivial (). This is because the information carried by is already contained in which further includes an additional mass contribution from satellites. As discussed in section 3.1, this is expected as the total stellar mass is the best indicator of both the overall SFH of all group members and the merging history of the halo. The relation as shown in the right panel of Figure 1 apparently has a smaller scatter than other relations, for instance, the relation shown in the left panel of Figure 1. This is consistent with the basic assumption of the usual AM approach, where the halo mass is matched with the total stellar mass in the group only. However, the scatter of the relation is still significant, especially in the low-mass end. As in Table LABEL:tbl-2, the relative importance of the other parameters counts about 30% in total, which has been missed in the usual AM approach. Therefore, bringing in additional information (i.e. other galaxy properties) will produce a more accurate halo mass.
It is interesting to note that the second and the third most important parameters for blue and red groups are different. For blue groups, properties that characterize the star formation activity of the central galaxies (SFR and ) are most important next to the total stellar mass, while group richness and stellar age of the central galaxies are more important for red groups. This color dichotomy supports our scenario illustrated in section 3.2 that the blue and red centrals have distinct evolutionary histories. It is therefore necessary to make predictions separately for blue and red groups.
Remarkably, these results are fully consistent with the prediction from the simple scenario as discussed in section 3.1 and illustrated in Figure 2.
It becomes clear from the analysis above that one key to improving the prediction of the halo mass is to differentiate between blue and red groups. In other words, the key is to add quenching in the analysis. As discussed earlier, before quenching happens, we have the simple coupled coevolution between the blue centrals and their dark matter halos. When the central is quenched, its stellar mass remains largely constant unless additional stellar mass is accreted through subsequent mergers (and passive evolution of the stellar population), while its dark matter halo continues to grow regardless of the star formation status of the central; that is, the growth of the halo is decoupled from the growth of the central.
One may wonder why the blue and red groups have to be classified according to the color-color diagram, now that the colors of the centrals are already used as input parameters in the RF regressor. The reason is that although the RF regressor is a powerful ML algorithm, it may not always successfully identify the two distinct evolutionary paths as discussed above. In other words, it may not capture the quenching process by itself. Therefore, if we take quenching into account by differentiating between blue and red groups when performing the RF regression, we can presumably improve its performance and produce a more accurate prediction of the halo mass. Another potentially important hidden process is merger. If we could quantify the composition of stellar mass for a given central galaxy (e.g., how much of its stellar mass is from in situ star formation and how much is from mergers), we should be able to further improve the accuracy of the halo mass prediction by including it as a new input parameter. We will explore this in our subsequent work.
The analysis and discussion above also imply that a better understanding of the underlying physics, in particular these hidden correlations between multiple variables, will help to improve the performance of the ML algorithms. Further evidence is found in applying the ML technique to predict photometric redshifts from multiband photometry data (e.g. Kind & Brunner 2013). Using color (i.e. the difference between the magnitudes in different bands) usually produces more accurate redshifts than using magnitudes directly in the photometric redshift estimation.
5.3 Empirical Formulae
The analysis demonstrates the key relations between halo mass and galaxy group properties, which can be used to make a more accurate prediction of the group halo mass than the traditional AM approach. However, the RF regressor gives no explicit form of such underlying relations. In practice, analytical formulae will be more convenient and useful for halo mass prediction.
We use the ordinary least squares (OLS) regression model to empirically fit the true halo mass with the three most important parameters, as discussed above for blue and red groups. We use total stellar mass (), renormalized SFR, and renormalized NUV-r color for blue groups, and we use total stellar mass, group richness, and renormalized stellar age () for red groups.
To obtain the optimal fitting parameters, we use the full sample to fit our regression model for blue groups and red groups. Given the negative slope of the mass function of the central galaxies, most central galaxies are distributed in (see Figure 1). To obtain a relation that would work equally well across the entire explored mass range, we randomly choose an equal number of central galaxies in each stellar mass bin to generate a new random sample out of the original training sample. For blue groups, we end up having 126,000 galaxies in three stellar mass bins ranging from to . For red groups, we have 159,000 galaxies in three stellar mass bins ranging from to .
We then fit the halo mass with the three key variables suggested by the RF regressor. For each variable, we try different forms including logarithm, exponential, and polynomial (up to an order of 3) terms. A combination of them may fit the data better than the simple linear forms. We use the Schwarz information criterion (SIC) to evaluate the goodness of fit of different models. SIC is used instead of MSE because SIC imposes a strict penalty on the incorporation of additional variables. After searching over various possible models, we obtain the following models, which yield the minimum SIC values. We also perform the test for the coefficient of each variable and remove the variables whose coefficients are not significant at the 99.9% level ():
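The equal-number sampling per stellar mass bin described above can be sketched as follows. This is a minimal illustration in Python: the bin edges, the number drawn per bin, the mock mass distribution, and the helper name `sample_equal_per_bin` are all hypothetical and do not reproduce the values used in this work.

```python
import random


def sample_equal_per_bin(log_masses, bin_edges, n_per_bin, seed=0):
    """Return indices with the same number of objects drawn from each mass bin."""
    rng = random.Random(seed)
    selected = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [i for i, m in enumerate(log_masses) if lo <= m < hi]
        # Never request more objects than the bin actually contains.
        k = min(n_per_bin, len(in_bin))
        selected.extend(rng.sample(in_bin, k))
    return selected


# Mock with many more low-mass than high-mass centrals (illustrative only).
rng = random.Random(1)
log_mstar = [9.0 + rng.expovariate(2.0) for _ in range(20000)]
edges = [9.0, 10.0, 11.0, 12.0]  # hypothetical bin edges in log stellar mass

idx = sample_equal_per_bin(log_mstar, edges, n_per_bin=100)
counts = [sum(lo <= log_mstar[i] < hi for i in idx)
          for lo, hi in zip(edges[:-1], edges[1:])]
print(counts)
```

Drawing the same number from each bin prevents the abundant low-mass centrals from dominating the least-squares fit.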
[TABLE]
[TABLE]
where the units for halo mass () and total stellar mass () are . The renormalized quantities , and are dimensionless. Richness is in units of number of galaxies within a given group. The original units for and are and , respectively.
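The model selection step described above (trying several functional forms and keeping the one with the lowest SIC) can be illustrated with a small Python sketch. It assumes Gaussian residuals, for which the SIC can be written, up to an additive constant, as n ln(RSS/n) + k ln(n), with n data points, k fitted coefficients, and RSS the residual sum of squares. The single-variable polynomial models, the mock data, and the helper names are hypothetical and are not the fitting forms adopted in this work.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polynomial(x, y, degree):
    """OLS fit of y = sum_k c_k x^k via the normal equations.
    Returns (coefficients, residual sum of squares)."""
    rows = [[xi ** k for k in range(degree + 1)] for xi in x]
    p = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    coef = solve_linear(xtx, xty)
    rss = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, yi in zip(rows, y))
    return coef, rss


def sic(rss, n, k):
    """Schwarz criterion for Gaussian residuals, up to an additive constant."""
    return n * math.log(rss / n) + k * math.log(n)


# Mock data with a genuinely quadratic dependence (illustrative only).
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(500)]
y = [12.0 + 1.5 * xi + 0.8 * xi ** 2 + rng.gauss(0.0, 0.1) for xi in x]

rss_by_degree, scores = {}, {}
for degree in (1, 2, 3):
    _, rss = fit_polynomial(x, y, degree)
    rss_by_degree[degree] = rss
    scores[degree] = sic(rss, len(x), degree + 1)
best = min(scores, key=scores.get)
print(scores, "preferred degree:", best)
```

The k ln(n) term grows with every added coefficient, so an extra term is retained only if it lowers the residual sum of squares enough to offset that penalty; the MSE alone would always favor the more flexible model.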
To make a direct comparison with the results shown in Figure 3 where only test samples are used, we apply the above empirical formulae to the same test samples, and the results are shown in Figure 5. The top panels of Figure 5 show , the residual between the halo mass predicted by equations 4 and 5 and the true halo mass as a function of the stellar mass of the central galaxy in the random sample. The MSE is 0.00768 for blue groups and 0.0385 for red groups. The mean of residuals is almost zero in both panels, suggesting the scatter is independent of the stellar mass of centrals. This demonstrates that our fitting formulae can be applied to groups spanning a relatively large mass range with little systematic difference. On average, the blue groups have a smaller standard deviation than the red groups. By comparing the predicted halo mass with the true halo mass in the test samples using the two formulae (bottom panels of Figure 5), we find the errors in halo mass are reduced by about 50% from the usual AM approach, as shown in Figure 4. Although the errors seem comparable with the RF regressor (Figure 3), the MSEs yielded by the empirical formulae are evidently larger than those of the RF regressor. On the other hand, these simple analytical formulae provide a convenient way to assign halo mass to galaxy groups.
It should be noted that the total stellar mass of the group and the group richness as in equations 4 and 5 depend on the sample selection. In the above analysis, only galaxies with stellar mass greater than are selected and contribute to the group total stellar mass and richness. A different sample selection will obviously produce different values of and richness. Therefore, the coefficient of each parameter in equations 4 and 5 must be recalibrated and modified if a different sample selection is used.
Below, we repeat the above fitting process by using a different sample selection of . We use the OLS model with the same analytic forms as equations 4 and 5 to fit the new sample (), and the new formulae are shown below:
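The residual statistics discussed above (mean and standard deviation of the residual in bins of central stellar mass, plus the overall MSE) can be computed along the following lines. This Python sketch uses mock residuals with an assumed 0.1 dex scatter and hypothetical bin edges; it only illustrates the bookkeeping, not the actual test samples.

```python
import random
import statistics


def binned_residual_stats(x, residuals, bin_edges):
    """Mean and sample standard deviation of the residuals in each bin of x.
    Bins with fewer than two objects return None for both statistics."""
    stats = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [r for xi, r in zip(x, residuals) if lo <= xi < hi]
        if len(in_bin) < 2:
            stats.append((lo, hi, None, None))
            continue
        stats.append((lo, hi, statistics.fmean(in_bin), statistics.stdev(in_bin)))
    return stats


# Mock residuals (predicted minus true log halo mass), illustrative only.
rng = random.Random(2)
log_mcen = [rng.uniform(9.5, 11.5) for _ in range(4000)]
delta = [rng.gauss(0.0, 0.1) for _ in log_mcen]

edges = [9.5, 10.0, 10.5, 11.0, 11.5]  # hypothetical bin edges
table = binned_residual_stats(log_mcen, delta, edges)
mse = statistics.fmean(d * d for d in delta)

for lo, hi, mean, std in table:
    print(f"{lo:.1f}-{hi:.1f}: mean = {mean:+.3f} dex, std = {std:.3f} dex")
print(f"MSE = {mse:.4f}")
```

A mean close to zero in every bin indicates no systematic offset with central stellar mass, while the per-bin standard deviation gives the scatter that is compared between blue and red groups.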
[TABLE]
[TABLE]
where the units for halo mass () and total stellar mass () are . The renormalized quantities , and are dimensionless. Richness is in units of number of galaxies within a given group. The original units for and are and , respectively.
As in Figure 5, the comparisons between the true halo mass and predicted halo mass are shown in Figure 6. The trends of residuals are similar to those shown in Figure 5, but with a slightly larger . Also, the minimum halo mass that can be recovered by the fitting formulae is larger for the sample with . This is expected because when using a lower mass selection (e.g. ) more galaxies will be included in the sample. This will produce on average larger and larger richness, thus leading to a more accurate predicted halo mass. However, a lower mass selection will require deeper and longer observations.
All input parameters in the above equations are observable quantities. As discussed in section 3.2, there is always a systematic difference in the absolute mean values (e.g. stellar mass, SFR, color) between model predictions and observations. Similarly, the scatters around the mean values are also different. To reduce any bias introduced by the systematic difference in both the mean value and scatter around the mean, we have used the renormalized dimensionless forms , as in equation 1, in all our analyses. Therefore, in principle, equations 4-7 can be applied directly to real observational data.
Nevertheless, there are still complications and caveats. Clearly, the renormalized dimensionless form will reduce but cannot fully remove all systematic differences between observations and simulations, in both the mean value and scatter around the mean. The mean and scatter can also vary between different observations and surveys. Therefore, ideally one should first compare the values observed in a certain survey, including stellar mass, color, SFR, and stellar age, with the L-GALAXIES predictions. If there are no significant differences in the mean and scatter of these values, the above analytic equations can be applied to derive the halo mass safely. In our future work, we will apply our method to surveys such as SDSS, GAMA, and COSMOS to derive the group halo mass. We will also apply our approach to the latest hydrodynamical simulations like EAGLE (Crain et al. 2015; Schaye et al. 2015) and IllustrisTNG (Nelson et al. 2019), to see if we will get consistent results. These latest hydrodynamical simulations may produce more accurate properties such as SFR and metallicity and hence may provide a more accurate description of the galaxy-halo connection. On the other hand, SAMs like L-GALAXIES still produce more accurate SMFs than hydrodynamical simulations, which indicates a more accurate SHMR. Also, SAMs usually have a much larger volume than the hydrodynamical simulations, leading to larger training samples and less cosmic variance. This is also the reason why we start our analysis with L-GALAXIES in the first place.
Another important uncertainty comes from the group finder in observations. Group finders are designed to identify galaxy group members in the same dark matter halos based on their spatial distributions (e.g. using the friends-of-friends technique). However, none of these group finders can fully recover the true group membership for each galaxy. The purity and completeness of the recovered groups can never reach 100%. For instance, for any group catalog, even if carefully calibrated against mock catalogs in which the underlying dark matter distribution is known, overfragmentation and overmerging of groups still happen. This leads to misclassification of satellites and centrals and produces wrong group richness and total stellar mass in the group; that is, the input quantities used in our equations 4 and 5 could be wrong. Clearly, this is a common issue for all studies related to galaxy groups. We will try to quantify this effect in our future study.
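To illustrate why a renormalized dimensionless quantity is less sensitive to offsets in mean and scatter, the Python sketch below standardizes a property to zero mean and unit standard deviation. This standardized form is an assumption made purely for illustration; the renormalization actually used is the one defined by equation 1, which may differ. The function name `renormalize` and the numerical values are hypothetical.

```python
import statistics


def renormalize(values):
    """Standardize to zero mean and unit scatter (assumed form, for illustration)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical log SFR values from a model, and "observed" values that follow
# the same ranking but with a shifted mean and a compressed scatter.
model_sfr = [-0.5, 0.0, 0.5, 1.0]
observed_sfr = [0.6 * v + 0.5 for v in model_sfr]

z_model = renormalize(model_sfr)
z_observed = renormalize(observed_sfr)
print(z_model)
print(z_observed)
```

Under this assumed form, a purely linear mismatch between model and observation drops out after renormalization; mismatches in the shape of the distribution would not, which is consistent with the caveat that the systematic differences are reduced rather than removed.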
6 Summary
In this paper, we have investigated the key relationships between the group halo mass and various observable galaxy group properties using the semianalytical galaxy formation model L-GALAXIES. We first proposed a simple scenario (illustrated in Figure 2), which describes the evolution of the central galaxies and their host dark matter halos. Star formation quenching is one of the key processes in this scenario, which leads to the different assembly histories of blue groups (groups with a blue central) and red groups (groups with a red central). From this simple scenario, we speculated about the driving factors that should strongly correlate with the group halo mass. We then applied the RF regressor, an ML algorithm, to blue groups and red groups separately, to explore these nonlinear and nonorthogonal multicorrelations and to verify the scenario as proposed above. Remarkably, the results given by the RF regressor are fully consistent with the prediction from our simple scenario. As a consequence, the group halo mass can be more accurately determined from observable galaxy properties by the RF regressor.
The main results of the paper are summarized as follows:
(1) The total stellar mass of a group is expected to correlate most strongly with the group halo mass, because the total stellar mass is the best indicator of both the overall SFH of all group members and the merging history of the halo.
As illustrated in Figure 2, for blue groups, because both the central galaxy and its host halo will continue to grow their masses simultaneously, a relatively tight relation between the stellar mass of the central and its host halo mass is expected. Therefore, in addition to the total stellar mass of the group, the properties that indicate the SFH of the centrals, such as SFR and color of the centrals, should also correlate strongly with the group halo mass.
For red groups, when the central is quenched at some point, its stellar mass remains about constant unless additional stellar mass is accreted through subsequent mergers, while its halo continues to grow by merging smaller halos. In other words, the growth of the halo is decoupled from the growth of the central.
Therefore, in addition to the total stellar mass of the group, the properties that can indicate the quenching epoch of the centrals (e.g. stellar age of the central) and the halo growth history (e.g. group richness) should also correlate strongly with group halo mass. The distinct evolutions of blue and red groups, due to the quenching of the centrals, require that we must treat them separately in our analysis.
(2) By using the RF regressor, among the various group properties explored, we find that the total stellar mass of the group is the most important parameter for both blue and red groups, followed by the SFR and NUV-r color of the central galaxy for blue groups and group richness and stellar age of the central galaxy for red groups. This is fully consistent with the simple scenario proposed above and hence provides strong support for it.
Since the ML algorithm can also quantify the correlation between various observable galaxy properties and group halo mass, in return, the group halo mass can be more accurately predicted from observable galaxy properties. Compared to the traditional AM approach, the standard errors in the halo mass predicted by the RF regressor have been reduced by about 50%.
(3) The blue and red groups are classified according to the color-color diagram of the central galaxies. Although the color of the centrals has already been included as an input parameter in the RF regressor, running the RF regressor separately for blue and red groups can produce more accurate halo masses. The RF regressor is a powerful ML algorithm, yet it failed to capture the quenching process accurately by itself. Therefore, by taking quenching into account (i.e., differentiating between blue and red groups) when performing the RF regression, we have improved its performance and produced a more accurate prediction of the halo mass. Another potentially important hidden process is merger. If we could quantify the composition of stellar mass for a given central galaxy (e.g., how much of its stellar mass is from in situ star formation and how much is from mergers), we should be able to further improve the accuracy of the halo mass prediction by including it as a new input parameter. We will explore this in our subsequent work.
This implies that a better understanding of the underlying physics, in particular those hidden deep correlations between multiple variables, will help to improve the performance of the ML algorithms.
(4) Similar to other ML algorithms, RF regressor does not give an explicit form of the relation between group halo mass and group properties. We hence regress the halo mass on the key variables identified by RF regressor, and we derive the empirical relations that can be used to determine the halo mass analytically. Since the total stellar mass of the group and group richness that are used in these relations as input parameters depend on the sample selection, we proposed equations 4 and 5 for a sample of galaxies with stellar mass greater than ; and equations 6 and 7 for a sample of galaxies with stellar mass greater than . These simple analytical formulae provide a convenient way to assign halo mass to galaxy groups from observable group properties, with accuracy comparable to those determined directly from the RF regressor.
In our future work, we will include more observable properties of the galaxy groups, for instance, the structure, morphology, and dynamics of the group members. As mentioned at the end of section 5.2, another potentially important hidden process (besides quenching) is merger. If we could quantify it for a given central galaxy (e.g., how much of its stellar mass is from in-situ star formation and how much is from mergers), we should be able to further improve the accuracy of the halo mass prediction. We will also test our approach with the latest hydrodynamical simulations like EAGLE (Crain et al. 2015; Schaye et al. 2015) and IllustrisTNG (Nelson et al. 2019) to see if we will get consistent results.
Then we will apply the RF regressor to surveys such as SDSS, GAMA, and COSMOS to derive more accurate halo masses, which will enable more accurate investigations of the galaxy-halo connection and many other important related issues, including galactic conformity and the effect of halo assembly bias on galaxy assembly.
We are grateful to Frank C. van den Bosch, Zheng Zheng, and Qi Guo for the productive discussions and useful comments. We thank the anonymous referee for useful comments. We are particularly grateful to the L-GALAXIES project for making the data public. We also thank Aobo Li for helping polish the text. This work is supported by the National Natural Science Foundation of China grant No. 11773001 and National Key R&D Program of China grant 2016YFA0400702. J.S. acknowledges the support by the Peking University Boya Fellowship. X.K. acknowledges the support by the National Key R&D Program of China grant 2015CB857004 and 2017YFA0402600, and the National Natural Science Foundation of China grant No. 11320101002, No. 11421303, and No. 11433005. K.G. acknowledges the support from the Beijing Natural Science Foundation (Youth program) under grant No. 1184015.
References
- Armitage, T. J., Kay, S. T., Barnes, D. J., et al. 2019, ApJ, 484, 1526
- Arnouts, S., et al. 2007, A&A, 476, 137
- Balogh, M. L., McGee, S. L., Mok, A., et al. 2016, MNRAS, 456, 4364
- Berlind, A. A., et al. 2006, ApJS, 167, 1
- Baldry, I. K., Glazebrook, K., & Driver, S. P. 2008, MNRAS, 388, 945
- Baldry, I. K., et al. 2012, MNRAS, 421, 621
- Bluck, A. F. L., et al. 2014, MNRAS, 441, 599
- Boylan-Kolchin, M., Springel, V., White, S. D. M., Jenkins, A., & Lemson, G. 2009, MNRAS, 398, 1150
