Chasing Accreted Structures within Gaia DR2 using Deep Learning

Lina Necib; Bryan Ostdiek; Mariangela Lisanti; Timothy Cohen; Marat; Freytsis; Shea Garrison-Kimmel

arXiv:1907.07681·astro-ph.GA·February 15, 2022

Chasing Accreted Structures within Gaia DR2 using Deep Learning

Lina Necib, Bryan Ostdiek, Mariangela Lisanti, Timothy Cohen, Marat, Freytsis, Shea Garrison-Kimmel

PDF

TL;DR

This paper uses deep learning and clustering algorithms on Gaia DR2 data to identify known and new stellar streams, revealing insights into the Milky Way's merger history.

Contribution

It introduces a novel combination of deep neural networks and clustering methods to detect and analyze accreted stellar structures in Gaia data.

Findings

01

Identification of known structures like Gaia Enceladus and Helmi stream.

02

Discovery of a new large stream called Nyx with at least 90 stars.

03

Demonstration of machine learning effectiveness in galactic archaeology.

Abstract

In previous work, we developed a deep neural network classifier that only relies on phase-space information to obtain a catalog of accreted stars based on the second data release of Gaia (DR2). In this paper, we apply two clustering algorithms to identify velocity substructure within this catalog. We focus on the subset of stars with line-of-sight velocity measurements that fall in the range of Galactocentric radii $r \in [6.5, 9.5]$ kpc and vertical distances $∣ z ∣ < 3$ kpc. Known structures such as Gaia Enceladus and the Helmi stream are identified. The largest previously-unknown structure, Nyx, is a vast stream consisting of at least 90 stars in the region of interest. This study displays the power of the machine learning approach by not only successfully identifying known features, but also discovering new kinematic structures that may shed light on the merger history of the Milky…

Figures29

Click any figure to enlarge with its caption.

Tables2

Table 1. Table 1: Parameters and associated prior ranges for the halo, Gaia Enceladus, Nyx, and Nyx-2. All priors are taken to be linear. Note that the Nyx-2 component is only included when analyzing the canonical ( S > 0.85 𝑆 0.85 S>0.85 ) sample.

	Priors
	Halo	Enceladus	Nyx	Nyx-2
$μ_{r}$	$[- 70, 70]$	$[0, 250]$	$[- 70, 200]$	$[- 300, 200]$
$μ_{θ}$	$[- 70, 70]$	$[- 200, 200]$	$[- 70, 70]$	$[- 70, 70]$
$μ_{ϕ}$	$[- 70, 70]$	$[- 200, 200]$	$[0, 300]$	$[0, 300]$
$σ_{r, θ, ϕ}$	$[0, 400]$	$[0, 400]$	$[0, 400]$	$[0, 400]$
$ρ_{r θ, r ϕ, θ ϕ}$	$[- 1, 1]$	$[- 1, 1]$	$[- 1, 1]$	$[- 1, 1]$
$Q$	—	$[0, 1]$	$[0, 1]$	$[0, 1]$

Table 2. Table 2: Centers of the velocity clusters identified by DBSCAN after 100 iterations scanning over position and velocity errors of the stars. For each cluster, we provide the following: ID number, mean velocities in Galactocentric cylindrical coordinates, range of dispersions of stars that belong in the group through error resampling, number of stars, and its frequency across 100 reruns. We only list groups that have occurred more then 70 % percent 70 70\% of the time. The velocities of Group I overlaps with the Helmi stream [ 35 ] , while Groups II, III, and IV may be related to streams identified in [ 53 ] . The same groups appear when analyzing either the high-purity or the canonical sample.

ID	$v_{R}$ [km/s]	$v_{ϕ}$ [km/s]	$v_{z}$ [km/s]	$σ_{R}$ [km/s]	$σ_{ϕ}$ [km/s]	$σ_{z}$ [km/s]	$N_{stars}$	Frequency
I	21	141	-286	40-77	8-26	8-24	10-20	$100 %$
II	-154	-377	-63	63	34	18	3-19	$95 %$
III	-195	-237	164	41	24	24	3-17	$78 %$
IV	212	-229	161	9-35	7-25	1-37	7-17	$74 %$

Equations6

p_{k}\big{(}O_{i}\,\big{|}\,\theta\big{)}=\mathcal{N}\left(\mathbf{v}_{i}\,\middle|\,\boldsymbol{\mu}^{k},\boldsymbol{\Sigma}^{k}_{i}\right),

p_{k}\big{(}O_{i}\,\big{|}\,\theta\big{)}=\mathcal{N}\left(\mathbf{v}_{i}\,\middle|\,\boldsymbol{\mu}^{k},\boldsymbol{\Sigma}^{k}_{i}\right),

p_{e}\big{(}O_{i}\,\big{|}\,\theta\big{)}=\frac{1}{2}\,\Big{[}\,\mathcal{N}\Big{(}\mathbf{v}_{i}\Big{|}\,\boldsymbol{\mu}^{e},\boldsymbol{\Sigma}^{e}_{i}\Big{)}+\mathcal{N}\Big{(}\mathbf{v}_{i}\Big{|}\,\boldsymbol{\mu}^{\tilde{e}},\boldsymbol{\Sigma}^{e}_{i}\Big{)}\,\Big{]}\,,

p_{e}\big{(}O_{i}\,\big{|}\,\theta\big{)}=\frac{1}{2}\,\Big{[}\,\mathcal{N}\Big{(}\mathbf{v}_{i}\Big{|}\,\boldsymbol{\mu}^{e},\boldsymbol{\Sigma}^{e}_{i}\Big{)}+\mathcal{N}\Big{(}\mathbf{v}_{i}\Big{|}\,\boldsymbol{\mu}^{\tilde{e}},\boldsymbol{\Sigma}^{e}_{i}\Big{)}\,\Big{]}\,,

p\big{(}\{\mathcal{O}_{i}\}\big{|}\theta\big{)}=\prod_{i=1}^{N}\sum_{j=h,n,e}Q_{j}\,p_{j}\,\big{(}O_{i}\,\big{|}\,\theta\big{)}\,,

p\big{(}\{\mathcal{O}_{i}\}\big{|}\theta\big{)}=\prod_{i=1}^{N}\sum_{j=h,n,e}Q_{j}\,p_{j}\,\big{(}O_{i}\,\big{|}\,\theta\big{)}\,,

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Chasing Accreted Structures within Gaia DR2 using Deep Learning

Lina Necib

[email protected]

Walter Burke Institute for Theoretical Physics, California Institute of Technology, Pasadena, CA 91125, USA

Department of Physics and Astronomy, University of California Irvine, Irvine, CA 92697, USA

Bryan Ostdiek

Institute of Theoretical Science, Department of Physics, University of Oregon, Eugene, OR 97403, USA

[email protected]

Mariangela Lisanti

Department of Physics, Princeton University, Princeton, NJ 08544, USA

Timothy Cohen

Institute of Theoretical Science, Department of Physics, University of Oregon, Eugene, OR 97403, USA

Marat Freytsis

Raymond and Beverly Sackler School of Physics and Astronomy, Tel-Aviv University, Tel-Aviv 69978, Israel

School of Natural Sciences, Institute for Advanced Study, Princeton, NJ 08540, USA

Shea Garrison-Kimmel

TAPIR, California Institute of Technology, Pasadena, CA 91125, USA

Abstract

In previous work, we developed a deep neural network classifier that only relies on phase-space information to obtain a catalog of accreted stars based on the second data release of Gaia (DR2). In this paper, we apply two clustering algorithms to identify velocity substructure within this catalog. We focus on the subset of stars with line-of-sight velocity measurements that fall in the range of Galactocentric radii $r\in[6.5,9.5]\,\text{kpc}$ and vertical distances $|z|<3\,\text{kpc}$ . Known structures such as Gaia Enceladus and the Helmi stream are identified. The largest previously-unknown structure, Nyx, is a vast stream consisting of at least 90 stars in the region of interest. This study displays the power of the machine learning approach by not only successfully identifying known features, but also discovering new kinematic structures that may shed light on the merger history of the Milky Way.

††software: This analysis made use of Astropy [80], Galpy [12], Matplotlib [39], NumPy [96], and Scikit-Learn [76]. The neural network used for tagging the accreted stars was implemented in Keras [18] with the TensorFlow backend [1]. The network was trained using Adam [51] to minimize the binary cross entropy loss.

1 Introduction

The paradigm of hierarchical structure formation describes how galaxies grow in a Lambda Cold Dark Matter universe [101]. During this process, large galaxies like our Milky Way gain the majority of their mass by capturing and absorbing smaller satellites. As a satellite galaxy falls onto a host, it is torn apart by violent tidal forces that strip both its dark matter halo and stars. Debris from this process is left scattered about the host galaxy as a fossil remnant of the accretion events. Assuming that not enough time has passed for these objects to fully virialize, such remnants manifest as distinctive features in phase space: stellar clumps, streams, and clouds of tidal debris [46, 45, 14, 23, 87, 24]. If such signatures can be identified, one can use their properties to reconstruct aspects of the host galaxy’s evolution.

The study of spatial and kinematic substructure in the Milky Way’s stellar halo has a long history. Various works have focused on modeling its large-scale components, discovering that the stellar halo can be separated into (at least) an inner and outer component based on kinematics based on distinctions in spatial density profiles, stellar orbits, and spectroscopic metallicities [15, 16, 6] photometry (Hess diagrams) in combination with spatial density distributions [19], and photometric metallicity estimates in combination with kinematics inferred from proper motions alone [5, 4]. These works targeted the overall structure of the stellar halo. In parallel, other studies focused on identifying the substructures that built the accreted halo. The Sagittarius Stream provides the most stunning example of an ongoing merger [47, 44, 102, 40], as it traces several orbits of the infalling Sagittarius dwarf [42]. Numerous tidal streams, some of which may be associated with disrupting globular clusters, have also been discovered; see Newberg & Carlin [71] for a review. These include the GD-1 [29], Pal-5 [73], and Ophiuchus [10] streams. In the case of older mergers, velocity coherence is still preserved although spatial coherence may be lost [34, 60, 61]. A plethora of such kinematic substructure has been found throughout the stellar halo, e.g. ECHOS [92, 90, 91].

The advent of data from the Gaia satellite [13] has already revolutionized our ability to reconstruct the Milky Way’s history, given the unprecedented number of 5D and even 6D stellar phase-space measurements. To date, several additional stellar streams and clumps have been discovered using Gaia data [65, 53, 63, 54, 67, 64, 41, 43]. Gaia Enceladus (also referred to as the Gaia Sausage) is the largest of these new structures [8, 30], and is believed to be the remnant of a significant merger that occurred at an estimated redshift of $\sim 1$ –3 [66]. Today, Enceladus’ stellar debris is highly radial and more metal-rich than the rest of the stellar halo; additionally, it constitutes the majority of the accreted stellar fraction in the inner Milky Way [21, 68, 57, 98].

The focus of this paper is on searching for kinematic substructures in the subset of Gaia data that includes radial velocities. In particular, we define the region of interest (ROI) to be within spherical Galactocentric radii $r\in[6.5,9.5]\,\text{kpc}$ and vertical distances $|z|<3\,\text{kpc}$ . In this region, nearly 99% of all stars belong to the Milky Way’s disk, while the remaining $\lesssim 1\%$ belong to the halo and may have been accreted from mergers [48]. Identifying this small fraction of accreted stars is a daunting task given the considerable size of the stellar disk background population. A variety of strategies have been explored to achieve this goal, which include looking for structures in action-angle [104], energy-momentum [31], apocenter-pericenter-angular momentum [32], and/or chemo-dynamic [e.g. 72, 33, 78, 3] space.

The Helmi stream provided the first proof-of-principle that such techniques can be applied successfully [35]. First discovered as an overdensity of 13 stars in angular momentum space, the Helmi stream has since been confirmed by a variety of other observations [17, 83, 52, 94, 7]. Most recently, it has been studied in Gaia DR2 [53, 26] and nearly 600 more potential members were identified using cross matches with spectroscopic surveys [54]. Comparisons with simulations strongly suggest that it is associated with the disruption of a satellite galaxy [50].

The selection criteria typically used to identify accreted stars are conservative and usually reduce potential disk contamination at the expense of excluding a large fraction of stars. Ideally, one would want to refine the selection process to yield a high-purity sample of accreted stars without sacrificing the overall statistics. Attempts in this direction have been made using traditional regression and classification techniques [97, 63]. Our focus here, is on the application of deep neural networks to this problem. In Ostdiek et al. [75], we developed a scheme for using neural networks to successfully distinguish accreted stars from those born in the Milky Way. We validated this approach through extensive testing on simulated Gaia mock catalogs [89] based on the Latte suite [100] of the Fire simulations [37, 38]. Our method is applied to the subset of Gaia stars with small parallax errors $\delta\varpi/\varpi<0.10$ , and yields a catalog of likely accreted stars (publicly available in [74]). This work presents the first analysis of a subset of this catalog, focusing on the 4.8 million stars that additionally have radial velocity measurements.

We use several different clustering algorithms to identify velocity structures in this new catalog. We recover known structures such as the Helmi stream and Gaia Enceladus, clearly showing that the latter extends down to the Galactic midplane. In addition, we identify several new stream candidates in the ROI. One of these candidates, which we call Nyx, is a significant contributor in the region studied [70]. The other two stream candidates, which may be associated with overdensities in [53], are comprised of $\mathcal{O}(10)$ stars.

This paper is organized as follows. In Sec. 2, we review the methodology underlying the catalog obtained by [75], and explicitly demonstrate how neural network scores correlate with different stellar populations (e.g. thin/thick disk and halo). In Sec. 3, we perform a Gaussian mixture model analysis to characterize the three most significant structures in the sample: Gaia Enceladus, Nyx, and the remaining constituents of the stellar halo, which for the remainder of this paper we will refer to as “Halo.” We further analyze the data for non-Gaussian structures in Sec. 4, which is where we recover the Helmi stream, as well as two other stream candidates. The Appendix includes additional figures that validate the analysis procedure.

2 A New Catalog of Accreted Stars

In this section, we take a first look at the catalog of accreted stars. After briefly summarizing the methodology that underlies the catalog, we provide a number of distributions of the basic properties of the stars identified as accreted and in situ by the neural network. Intriguingly, one can already see hints of novel structures by looking at these distributions. We will explore these structures more quantitatively in Sec. 3.

2.1 Characterizing the Catalog

To take maximum advantage of the richness contained within Gaia DR2, we developed a novel approach utilizing deep neural networks to derive a catalog of accreted stars. Details and extensive cross-checks of the methodology are presented in Ostdiek et al. [75]; we briefly summarize the main points here. We train the neural network on a combination of simulated mock Gaia surveys and real measurements of Milky Way stars. The simulated data is from the Ananke mock surveys [89], based on the Latte simulation suite [100, 38]. As we have the full merger history for each simulated galaxy, we can identify stars that are truly accreted, which can be leveraged to both train and validate the networks. The final network is pre-trained on mock catalogs from three solar positions of the m12i simulated galaxy, using only 5D heliocentric kinematics as inputs (the line-of-sight velocity is not provided to the networks). We then perform transfer learning on stars within the RAVE DR5-Gaia DR2 cross-matched data [55] that are identified as accreted with high confidence, i.e., if they are metal-poor $\text{[Fe/H]}<-1.5$ and have $|z|>1.5\,\text{kpc}$ . Retraining the last layer of the neural network allows it to learn characteristics specific to the Milky Way.

The final network is applied to the subset of Gaia DR2 stars with parallax $\varpi>0$ and fractional error $\delta\varpi/\varpi<0.10$ . In this work, we further restrict our analysis to only include those stars in the resulting catalog that have measured Gaia line-of-sight velocities [49], and fall within spherical Galactocentric radii of $r\in[6.5,9.5]\,\text{kpc}$ and vertical distances $|z|<3\,\text{kpc}$ of the Galactic midplane. This results in a final sample size of 4,820,164 stars. Figure 1 shows the spatial extent of the sample analyzed here. Note that Gaia DR2 is essentially complete in the optical magnitude range $G\in[7,17]$ [27]. For $G\in[4,12]$ , the radial velocity subset of Gaia is about 60–80% complete as compared to the total data set. Therefore, we do not expect our sample to be spatially complete, and this is evident from close inspection of Fig. 1. For example, the distribution peaks at the Solar circle, and there are rays originating from the Solar position $(x,y,z)=(-8,0,0)\,\text{kpc}$ in the $x-y$ plane. There is also an asymmetry in the $y-z$ plane where the distribution looks almost diagonal, probably due to projection effects (see [49] for the completeness of the radial velocity subset of Gaia DR2). Although the stellar sample is not spatially complete, we do expect it to be kinematically unbiased because no selection cuts have been made on velocity. This is particularly important for the study presented here, as the techniques we use for identifying stellar substructure are based on features in velocity space alone.

The neural network gives each star a score that reflects its probability of being accreted, with $S=0$ indicating a maximum-confidence in situ star and $S=1$ a maximum-confidence accreted star. Figure 2 shows the Toomre plot in Galactocentric coordinates of the stars in Gaia DR2 that pass all the selection cuts discussed above. A gradient of scores is evident, ranging from stars associated with the thin disk ( $S\sim 0$ ) to stars associated with the stellar halo ( $S\sim 1$ ). For context, we additionally overlay the best-fit 3 $\sigma$ contours for the thin disk, thick disk, and halo stars derived from the model of Bensby et al. [9].

Figure 2 clearly shows that the network scores are highly correlated with the expected behavior of these three populations: the lowest scores are associated with the thin disk, while the larger scores are associated with halo stars. For the most part, thin disk stars have network scores $S\lesssim 0.05$ , while thick-disk stars have scores extending up to $S\sim 0.5$ 111Running the neural network on the Galaxia [93] simulation demonstrated that the thick-disk component is the hardest one to score as accreted or in situ. In actuality, the Milky Way’s thick disk might be a combination of the two; see [86] and references therein. — see Fig. S1 for Toomre plots of stars with scores $S<0.05$ and $S\in[0.3,0.5]$ . For values of $v_{y}>200\,\,\text{km}/\text{s}$ and $\sqrt{v_{x}^{2}+v_{z}^{2}}\sim 100$ – $200\,\,\text{km}/\text{s}$ , stars that appear to belong to the thick disk according to the Bensby et al. [9] model have very low network scores, close to 0. These stars are easily identified by the network as being in situ because their $v_{y}$ velocities are large and positive. Note that due to the focus of the training regimen, the network score reflects the probability that a star is accreted, and as such it does not have any information on the different components of the disk.222It is in principle possible to use multi-class classification to identify different stellar populations, but that is outside the scope of this work. It is therefore rather remarkable that the network scores do correlate with known stellar populations.

To create a concrete dataset of accreted stars to work with, we must choose a cut on the network scores. In [75], we developed a principled way of identifying this cut by analyzing the network performance on mock catalogs, where truth information is available. Our criteria was to minimize the difference between velocity distributions of the stars that are selected and the distributions of all of the truth-level accreted stars. In particular, we found that a cut of $S\gtrsim 0.75$ least biased the distributions of the accreted stars.333All stars in Gaia with a score greater than 0.75 are made available upon request from the corresponding authors. Here, we will use more restrictive cuts on the score to improve the purity of the sample and facilitate the Gaussian Mixture Model analysis described in Sec. 3.1. Our canonical sample will be defined with $S>0.85$ ; it contains 22,296 stars total. We additionally explore the impact of placing a much more restrictive cut of $S>0.95$ to create a high-purity sample (9,379 stars total). The high-purity sample minimizes disk contamination at the expense of biasing the velocity distributions, while the canonical sample reduces the purity by introducing some disk stars, but should produce less biased distributions.

2.2 Accreted Versus In situ Stars

We now explore the kinematic distributions of the canonical and high-purity samples. The top row of Fig. 3 shows the velocity distributions in spherical Galactocentric coordinates ( $v_{r}$ , $v_{\phi}$ , $v_{\theta}$ ) for three different cuts on the network score: $S<0.75$ (in situ), $S>0.85$ (canonical accreted), and $S>0.95$ (high-purity accreted). The narrow peak near $v_{\phi}\sim 200\,\,\text{km}/\text{s}$ for the stars with $S<0.75$ shows that the network clearly classifies stars whose rotations are consistent with the disk as in situ. The $v_{r}$ and $v_{\phi}$ distributions differ between the canonical and high-purity sample. While both are roughly bimodal in $v_{r}$ , the canonical sample has a smaller dispersion. The canonical sample has a stronger peak at $v_{\phi}\sim 150\,\,\text{km}/\text{s}$ , which may due to some disk contamination.

To further explore the differences between the canonical and the high-purity cuts, we provide the $v_{r}-v_{\phi}$ distributions of the resultant datasets in the bottom row of Fig. 3. The in situ sample ( $S<0.75$ ) is dominated by the stars at $v_{r}\sim 0\,\,\text{km}/\text{s}$ and $v_{\phi}\sim 200\,\,\text{km}/\text{s}$ as expected, although it also includes some accreted stars that happen to be scored below $S=0.75$ by the network. Both the high-purity and canonical samples show an elongated structure in the radial direction that spans $v_{r}\in[-400,400]\,\,\text{km}/\text{s}$ , with $v_{\phi}\sim 0\,\,\text{km}/\text{s}$ . This is Gaia Enceladus, further discussed in Sec. 3.2. When we implement the canonical cut, we find that the network carves out a nearly circular region in this parameter space, because any star that falls within this region is very likely to be in situ. It does however leave a distinct half-moon structure in this plane, particularly apparent in the middle panel of Fig. 3, second row. If we increase the cut to $S>0.95$ , so that we only plot the stars that the network labels as accreted with high confidence, we find that the half-moon structure dissolves into a localized overdensity at $v_{r}\sim 150\,\,\text{km}/\text{s}$ and $v_{\phi}\sim 140\,\,\text{km}/\text{s}$ . We call this new structure Nyx, and will describe it in more detail in Sec. 3.2. It is important to note that Nyx is present in the canonical sample as well, but is much easier to identify in the high-purity sample.

Figure 4 provides the Toomre plots for the canonical and high-purity cuts. The black-dashed lines are centered around $v_{\text{lsr}}=220\,\,\text{km}/\text{s}$ and extend to $|v-v_{\text{lsr}}|>100-220\,\,\text{km}/\text{s}$ . As with the $v_{r}-v_{\phi}$ distributions shown in Fig. 3, the network cuts out the region within the inner dashed circle. A common selection criterion for identifying accreted stars is to require that $|v-v_{\text{lsr}}|>180$ – $220\,\,\text{km}/\text{s}$ (see [78] for a review). By default, such a cut ignores stars that fall within (approximately) the outermost dashed circle in Fig. 4, biasing the resulting distribution of stars in velocity space. This consequently ignores any substructure that co-rotates with the disk. Our deep-learning–based catalog has been constructed to minimize such a bias, thereby enabling us to look for prograde structures. Several structures stand out in both the canonical and high-purity samples. The vertically extended overdensity centered at $v_{y}\sim 0\,\,\text{km}/\text{s}$ is Gaia Enceladus. Second, the overdensity centered at $v_{y}\sim 130\,\,\text{km}/\text{s}$ and $\sqrt{v_{x}^{2}+v_{z}^{2}}\sim 130\,\,\text{km}/\text{s}$ is Nyx. Nyx becomes more evident in the high-purity sample where the disk contamination is reduced.

We emphasize that the kinematic distribution of the stars that are easiest to label as accreted are not guaranteed to be representative of the distribution for all accreted stars. Therefore, moving forward, we will present results for both the canonical and high-purity samples, although we emphasize the possibility of biases in the distributions, especially as the cut on $S$ becomes more restrictive [75].

3 Dominant Kinematic Structures

In this section, we focus on characterizing the largest (in terms of overall star count) kinematic structures present in the sample of accreted stars. We will use a Gaussian mixture model approach to extract three distinct features, including the new stream Nyx.

3.1 A Structure Finding Algorithm

To obtain a quantitative estimate for the number of components in the high-purity accreted dataset (defined by the sample of Sec. 2.1 with $S>0.95$ ), we run a Gaussian mixture model using scikit-learn [76] on the selected stars in Galactocentric velocity space. This first pass ignores the impact of finite measurement errors on the stellar velocities. We allow for up to 9 Gaussian distributions, and evaluate how many are necessary using the Bayesian Information Criterion (BIC). The BIC decreases rapidly as the number of Gaussians is increased to 4, after which it stabilizes. Note that Gaia Enceladus is best characterized by two Gaussians with nearly identical means and dispersions in $v_{\phi}$ and $v_{\theta}$ that are centered at equal but opposite values of $v_{r}$ [68]. Hence, the fact that 4 Gaussians are required to minimize the BIC tells us that the dataset contains three dominant kinematic structures. We will refer to these three structures as: Enceladus, Nyx, and the Halo. We emphasize that these three components taken together comprise the totality of the accreted stellar halo in our model.444There is possible disk contamination even in the high-purity sample. However these stars are usually not representative of the in situ sample, and therefore we do not expect them to form a distribution that can be picked up by the Gaussian mixture model. In other words, the component labeled as “Halo” is comprised of accreted stars in the stellar halo (some of which might be in coherent structures) that are not resolved by the mixture modeling.

The simple mixture analysis that yielded the BICs on the high-purity sample is efficient, but does not account for the uncertainties in the measured parameters of the stars. The preliminary analysis informs the number of Gaussian distributions we include moving forward. To properly account for the errors, we perform our own dedicated study to derive the best-fit parameters of the velocity distributions of these separate components. We define the likelihood that a star $i$ belongs to either the Halo or Nyx by

[TABLE]

where $k=h$ (Halo) or $n$ (Nyx), $O_{i}=\{v_{r,i},v_{\theta,i},v_{\phi,i}\}$ are the velocities of star $i$ in spherical Galactocentric coordinates, and $\theta$ is the set of free model parameters. $\mathcal{N}$ denotes the multivariate normal distribution with mean $\boldsymbol{\mu}=(\mu_{r},\mu_{\theta},\mu_{\phi})$ and covariance $\boldsymbol{\Sigma}_{\text{true}}$ , which is a function of the dispersions $\sigma_{r,\theta,\phi}$ and the correlation coefficients $\rho_{r\theta,r\phi,\theta\phi}$ . The measurement errors are taken into account as $\boldsymbol{\Sigma}_{i}^{k}=\boldsymbol{\Sigma}_{\text{true}}^{k}+\boldsymbol{\Sigma}_{\text{err},i}$ , where $\boldsymbol{\Sigma}_{\text{err},i}$ varies from star to star but does not depend on the model parameters. The likelihood for Enceladus is the sum of two Gaussians with opposite means in the radial direction:

[TABLE]

where $\boldsymbol{\mu}^{\tilde{e}}=(-\mu_{r},\mu_{\theta},\mu_{\phi})^{e}$ . The total likelihood is therefore defined as

[TABLE]

where $Q_{j}$ is the fractional contribution of each component (constrained to add up to 1 over all $j$ ). This type of likelihood analysis is very similar to what was performed in Necib et al. [68], except that here we only cluster in kinematic space, and do not include metallicity information in the likelihood.

We also present results using the $S>0.85$ sample. In this case, the total likelihood in Eq. (3) includes an additional Gaussian, modeled following Eq. (1), intended to capture a second component of Nyx at negative radial velocity, which we refer to as Nyx-2.555Using the same technique of evaluating the BICs, we find that the number of preferred Gaussians stabilizes after 6. We choose to include 5 Gaussians (two of which correspond to Gaia Enceladus). Adding a sixth Gaussian leads to two overlapping distributions in velocity space, roughly at the location of Nyx.

We run emcee [25] to find the posterior distributions for the free parameters of the separate components. With nine free parameters for each population, and two additional parameters to quantify the relative fractions, this is a 29 parameter fit for the $S>0.95$ sample study. With 4 components, and 3 relative fractions, the $S>0.85$ sample requires 39 free parameters. The priors are linear, and their ranges are listed in Table 1. We use 200 walkers, 8000 steps for the burn-in stage, and 1000 steps for the analysis of the high-purity (canonical) samples. The relevant corner plots are provided in the Appendix as Figs. S2a–S2g.

3.2 Properties of Enceladus and Nyx

In this section, we investigate the properties of the dominant kinematic structures in the catalog. The top (bottom) row of Fig. 5 provides the 2D kinematic distributions of the stars in the $S>0.95$ ( 0.85) catalog in Galactocentric spherical coordinates. The lines show the 68% contours of the posterior distributions for the separate stellar components, modeled using the likelihood in Eq. (3). The corresponding 1D distributions for the posteriors are shown in Fig. 6.

Gaia Enceladus is the radial bi-modal population in Fig. 5 (blue line). By construction, the lobes of Enceladus have the same mean and dispersion in $v_{\theta}$ and $v_{\phi}$ and are located at equal, but opposite, values of radial velocity, see Eq. (2). The general properties of the Enceladus distribution are largely consistent with results from a previous study using the SDSS-Gaia cross match [68]. From the radial distributions shown in Fig. 6, we see that the peaks are located closer together in the canonical sample than in the high-purity sample. These differences may be a result of kinematic biases that are introduced as the score cut is increased. Additionally, as shown in Fig. S4, we find that stars with a high probability of being associated with Enceladus clearly extend down to the Galactic midplane. This corroborates the hypothesis that the Enceladus merger contributes to the local dark matter distribution [69, 68].

Nyx is the prograde group of stars characterized by its significant radial velocity (Fig. 5, green line). In particular, Nyx moves in the same direction as the Galactic disk, but its rotational velocity lags by $\sim 90\,\,\text{km}/\text{s}$ . Its radial velocity distribution has a mean value of $134\,\,\text{km}/\text{s}$ and dispersion of $\sim 67\,\,\text{km}/\text{s}$ . In the canonical sample, we find a corresponding structure, which we call Nyx-2, that has an average $v_{\phi}$ of $94\,\,\text{km}/\text{s}$ , compared to the Nyx value (in the same sample) of $125\,\,\text{km}/\text{s}$ , but equal and opposite $v_{r}$ of $-79\,\,\text{km}/\text{s}$ compared to $98\,\,\text{km}/\text{s}$ for Nyx.

The fraction of Nyx stars is 9% and 30% in the high-purity and canonical samples, respectively. Nyx-2 corresponds to 22% of the canonical sample. Enceladus is the dominant structure in both samples, comprising 77% (42%) of the high-purity (canonical) set. The relative fraction of Enceladus is reduced in the canonical sample primarily because of the presence of Nyx-2.

A complete discussion of the properties of Nyx and Nyx-2, and their potential origin, is presented in [70]. Briefly summarizing what is detailed there, Nyx is a coherent stellar stream on an eccentric orbit ( $e\sim 0.6$ ) whose distribution is highly unlikely to be drawn from the expected smooth distribution of the thick disk. Its behavior is consistent with prograde streams observed in simulations, which are created when a massive satellite is dragged into the disk plane by increased tidal friction [82, 56, 99, 2, 84, 85, 81, 59, 77]. The kinematics of Nyx-2 suggest that it is related to Nyx, and may be tidal debris from a separate passage of the same satellite. We approach this conclusion cautiously only because the network scores for Nyx-2 stars are generally lower than those for Nyx and its detection is thus not as robust.

To study the chemical abundances of each component, we use the cross match of stars provided by Sanders & Das [88], covering the spectroscopic catalogs APOGEE [62], LAMOST [22], RAVE [95], GALAH [20], Gaia-ESO [28], and SEGUE [103]. Out of the 9,379 (22,296) stars in the high-purity (canonical) sample, we find cross matches for 4,255 (10,309). We histogram each star’s metallicity $Z$ in Fig. 7. A star is associated with a given component if the mixture analysis finds that it has a probability greater than 95% of belonging to it in the high-purity or canonical samples. The Halo has a mean at $Z\sim-1.3$ in the high-purity sample, with large dispersion and a long tail towards more metal-poor stars. Fewer stars are associated with the Halo in the canonical sample as compared to the high-purity sample (less than 3% versus 10%), but the metallicities are largely consistent between the two. Enceladus also has a large dispersion in $Z$ , and peaks at $Z\sim-1$ in both the high-purity sample and the canonical samples. Nyx stars however have a much narrower $Z$ distribution, peaking at $Z\sim-0.6$ ( $Z\sim-0.4$ ) in the high-purity (canonical) sample. In both the canonical and high-purity samples, the distributions of the different components are consistent. Interestingly, the distribution of Nyx-2 in the canonical sample matches that of Nyx, strengthening the argument that they are debris from the same progenitor. Spectroscopic follow ups will be crucial to more deeply understand the origin of Nyx and Nyx-2.

In Fig. 8, we provide a scatter plot of the maximum vertical distance from the midplane, $z_{\text{max}}$ , versus the eccentricity, $e$ , of the different components as obtained from a simple calculation of their orbits. These parameters are evaluated when running the stellar orbits back in time 1 Gyr over 1000 steps, using the package gala [79]. To take the measurement errors into account, we resample each star 100 times from the position and velocity Gaussians, and average their eccentricities and $z_{\text{max}}$ . Figure 8 only shows 100 random stars from each component to not overwhelm the plot. Full distributions of the eccentricities, $z_{\text{max}}$ , apocenters, and pericenters of each component are provided in the Appendix, see Figs. S5a–S5c.

Enceladus stars have largely eccentric orbits ( $e>0.8$ ), confirming that the satellite progenitor was on a highly radial orbit [21]. Similarly, Halo stars have large eccentricities and extend up to $\sim 100\,\text{kpc}$ above the plane, on average higher than Enceladus. The 1D histograms of the eccentrities show that Enceladus stars are peaked towards larger eccentricities (Figs. S5b and S5c). Nyx stars extend to $z_{\text{max}}\sim 0.1$ – $4\,\text{kpc}$ above the plane at eccentricities $e\sim 0.3$ – $0.8$ — the latter are higher than what is expected of thick-disk stars [58].

4 More Subtle Kinematic Structures

In Sec. 3, we discussed the three most dominant structures found in the high-purity and canonical catalogs. To see if there is any evidence for non-Gaussian structures, we now use the algorithm DBSCAN666https://scikit-learn.org/stable/modules/clustering.html [76]. This clustering algorithm is designed to identify connected overdensities; we will apply it to the 3D velocities in Galactocentric cylindrical coordinates, not spherical as in Sec. 3, which will allow for easier comparison with previous studies. Note that DBSCAN does not require any assumptions regarding the shape of the underlying distributions, and therefore can identify non-Gaussian structures. This is particularly important when hunting for small stream-like features. DBSCAN depends on the minimum number of stars in a group, $n_{\text{min}}$ , and the compactness parameter, $\epsilon$ .

It is important to take into account the errors on the positions and velocities of the stars while using DBSCAN. To do so, we resample the stars over their errors and run DBSCAN 100 times, identifying the groups for each independent run. In Fig. 9, we show the resulting groups for $n_{\text{min}}=5$ and $\epsilon=0.4$ for one such realization, which returned nine groups. Note that one of these groups (shown in pale orange) is attributed to the combination of Gaia Enceladus and Nyx, as their distributions overlap and they are thus not separated by this algorithm.

Each realization of DBSCAN is prone to statistical fluctuations as we vary the stars within their measurement errors. To find the most robust structures, we save the centers of each of the groups identified in each the 100 resamplings, and then run DBSCAN over these 100 realizations to find clusters of group centers that are present in more than 70 realizations, with $\epsilon=0.3$ and $n_{\text{min}}=50$ . We find four stream candidates that pass these cuts for the high-purity sample, see Table 2. Interestingly, when running the same algorithm on the canonical sample, the same three groups pass these requirements.

The stream that is the most robust — in that it is found by all 100 realizations — is an overdensity at $(v_{R},v_{\phi},v_{z})=(21,141,-286)$ km/s, shown in the realization of Fig. 9 as the stars in yellow (labeled as Group I). This overdensity can be matched to the Helmi stream [35],777It also overlaps the S2 stream in [65], which is believed to be related to the Helmi stream. which has been reported in Gaia DR2 by several studies, including [26, 53]. We identify $\sim 20$ stars as belonging to this overdensity, all with $v_{z}<0\,\,\text{km}/\text{s}$ . We test the IDs of the core Helmi stream stars provided in [54] against our framework to find their associated network scores. Of the 40 IDs given, 37 pass our parallax error cut. Of these, 10 have $S<0.5$ and 27 have high scores of $S>0.5$ , with 22 having scores $S>0.9$ , meaning that the neural network is confident they are accreted. We do not reconstruct the second (smaller) known cluster of the Helmi stream at $v_{z}>0\,\,\text{km}/\text{s}$ .

The other three robust streams are located at $(v_{R},v_{\phi},v_{z})=(-154,-377,-63)\,\,\text{km}/\text{s}$ , $(v_{R},v_{\phi},v_{z})=(-195,-237,164)\,\,\text{km}/\text{s}$ and $(212,-229,161)\,\,\text{km}/\text{s}$ with small dispersions,888For groups II and III, we only provide the maximum dispersions. In some cases, the group has only two stars, making it difficult to define a meaningful velocity dispersion. as listed in Table 2. These structures fall near several velocity clusters identified in [53] and may be related.

5 Conclusions

Stars in the Milky Way galaxy can be divided into two components: those that were born within the Galaxy and those that were accreted. The phase-space distribution of accreted stars provides a crucial handle for understanding how the Galaxy evolved by revealing the imprints of satellite mergers. This approach requires distinguishing the population of accreted stars from their in situ counterparts, a task that becomes increasingly challenging near the Galactic midplane where disk stars comprise $\sim 99$ % of all stars. This motivated the work of [75], where we developed a deep neural network based approach that allows us to build a catalog of accreted stars from Gaia DR2 data. Although the network provides a catalog of all well-measured Gaia stars, regardless of whether or not they have a line-of-sight velocity measurement, this paper provides the first analysis of the 4.8 million star subset that includes the full 6D information, and that fall within $r\in[6.5,9.5]\,\text{kpc}$ and $|z|<3\,\text{kpc}$ .

Our primary goal is to identify and analyze structures in 3D velocity space. As a first step, we perform a Gaussian mixture analysis to break down the high-purity sample ( $S>0.95$ ) into its most significant contributions: Gaia Enceladus, Nyx, and the Halo. We find that Enceladus is highly radial and comprises the vast majority of accreted stars in this region of the sky. These results are consistent with previous studies of Enceladus, which characterized its properties farther from the disk plane [8, 30, 57, 68, 21]. Nyx is a new stream identified by this analysis. It is prograde and comprises nearly 9% of the high-purity sample, making it one of the most significant streams to be discovered to date near the Sun. Properties and a discussion of the potential origin of Nyx as a merging dwarf galaxy are explored in greater detail in [70]. The “Halo” is essentially the remaining group of accreted stars that cannot be further subdivided by the mixture analysis.

We also repeated the analysis on all accreted stars with network scores $S>0.85$ . This canonical sample is considerably larger in size than the high-purity one, but likely has more contamination from disk stars. We again recover Enceladus and Nyx in this sample. Additionally, we find evidence for another prograde stream, Nyx-2, with roughly the same rotational speed as Nyx and equal — but opposite — radial velocity. Nyx-2 comprises $\sim 22\%$ of the canonical sample, comparable to the Nyx fraction in this sample. Similarities in the kinematics and metallicities beetween Nyx and Nyx-2 suggest that they may be related to the same progenitor.

We also attempt to reconstruct non-Gaussian velocity structures using the DBSCAN algorithm. We locate four additional streams using this method, one of which coincides with the well-studied Helmi stream [35]. The Helmi stream is by far the most robust of the four, consisting of $\sim 20$ stars and identified over all repeated iterations that account for uncertaintities in the stellar velocity measurements. The other three candidate streams consist of fewer stars, and are recovered $\sim 70\%$ of the time over repeated iterations of DBSCAN. These streams are all retrograde, and may be associated with overdensities identified in [53].

The analysis presented here demonstrates the power derived from combining advancements in data quality, numerical simulations, and data analysis techniques. The fact that the catalog reproduces known structures such as Gaia Enceladus and the Helmi stream validates the utility of this approach. However, the science case extends much farther than simple validation. Indeed, the new catalog greatly improves our understanding of the stellar distribution in the ROI studied. In particular, it clearly demonstrates that Enceladus extends down into the Galactic plane. It also unearths evidence for a significant new stellar stream (Nyx), a potentially related counterpart (Nyx-2), and three other smaller candidate streams.

We have improved over previous approaches as a result of two main factors. The first is due to the statistical benefit of an increased overall size of the accreted stellar sample. Additionally, having used a deep network that is only trained on phase space allows us to derive a high-purity sample of accreted stars without imposing strong cuts on circular velocity or metallicity, as is typically done. This reduces the intrinsic bias that results from such cuts. We note that this paper has only scratched the surface, and in particular it will be very interesting to investigate what structures can be identified in the rest of our Gaia DR2 catalog that does not include line-of-sight velocities.

Understanding the mergers that contributed stellar debris in our neighborhood of the Milky Way has the potential to provide an empirical determination of the local dark matter distribution, which is also built up from mergers [36, 69], and as such, is expected to include remnant structures. The recent discovery of Enceladus, for example, motivates extending the Standard Halo Model of dark matter to (at least) a two-component model that includes both an isotropic halo and debris flow [68]. By clearly demonstrating that Enceladus extends into the disk plane, the results of this work confirm that dark matter debris from this merger likely contributes in the Solar Neighborhood. The discovery of Nyx near the Solar position may suggest the presence of a corresponding dark matter stream or disk. Coupling this catalog with cosmological simulations will be essential in refining our understanding of the local dark matter phase-space distribution, and its implications for direct detection experiments.

Note Added

As this work was being completed, the paper by Borsato et al. [11] became available. Using DBSCAN applied to the integrals of motion, Borsato et al. [11] found the groups of stars we also identify in Table 2, along with four more that do not pass our selection criteria.

Acknowledgments

We thank G. Brova, P. Hopkins, E. Kirby, R. Sanderson, and A. Wetzel for helpful discussions. This work was performed in part at Aspen Center for Physics, which is supported by National Science Foundation grant PHY-1607611. This research was supported by the Munich Institute for Astro- and Particle Physics (MIAPP) of the DFG cluster of excellence “Origin and Structure of the Universe.” This research was supported in part by the National Science Foundation under Grant No. NSF PHY-1748958.

LN is supported by the DOE under Award Number DESC0011632, the Sherman Fairchild fellowship, and the California Presidential fellowship. BO and TC are supported by the US Deptartment of Energy under grant number DE-SC0011640. ML is supported by the DOE under Award Number DESC0007968 and the Cottrell Scholar Program through the Research Corporation for Science Advancement. MF is supported by the Zuckerman STEM Leadership Program and in part by the DOE under grant number DE-SC0011640. SGK is supported by an Alfred P. Sloan Research Fellowship, NSF Collaborative Research Grant #1715847 and CAREER grant #1455342, and NASA grants NNX15AT06G, JPL 1589742, 17-ATP17-0214.

This work has made use of data from the European Space Agency (ESA) mission Gaia (http://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, http://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.

Bibliography104

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Abadi et al. [2015] Abadi, M., Agarwal, A., Barham, P., et al. 2015, Tensor Flow: Large-Scale Machine Learning on Heterogeneous Systems, https://www.tensorflow.org , , , software available from tensorflow.org
2Abadi et al. [2003] Abadi, M. G., Navarro, J. F., Steinmetz, M., & Eke, V. R. 2003, Ap J, 597, 21
3An & Beers [2020] An, D., & Beers, T. C. 2020, Ap J, 897, 39
4An et al. [2015] An, D., Beers, T. C., Santucci, R. M., et al. 2015, Ap J, 813, L 28
5An et al. [2013] An, D., Beers, T. C., Johnson, J. A., et al. 2013, Ap J, 763, 65
6Beers et al. [2012] Beers, T. C., Carollo, D., Ivezić, Ž., et al. 2012, Ap J, 746, 34
7Beers et al. [2017] Beers, T. C., Placco, V. M., Carollo, D., et al. 2017, The Astrophysical Journal, 835, 81
8Belokurov et al. [2018] Belokurov, V., Erkal, D., Evans, N. W., Koposov, S. E., & Deason, A. J. 2018, MNRAS, 478, 611