Fundamentals of effective cloud management for the new NASA Astrophysics Data System
Sergi Blanco-Cuaresma, Alberto Accomazzi, Michael J. Kurtz, Edwin Henneken, Carolyn S. Grant, Donna M. Thompson, Roman Chyla, Stephen McDonald, Golnaz Shapurian, Timothy W. Hostetler, Matthew R. Templeton, Kelly E. Lockhart, Kris Bukovi, Nathan Rapport

TL;DR
This paper discusses the design and implementation of a scalable, reliable, and resilient cloud management system for NASA's Astrophysics Data System, focusing on Kubernetes deployment challenges and solutions.
Contribution
It provides practical insights and strategies for deploying complex scientific data systems on cloud infrastructure using Kubernetes and container orchestration.
Findings
Kubernetes improves scalability and resilience of ADS
Identified challenges in automatic scaling and load balancing
Developed monitoring and CI/CD workflows for complex systems
Abstract
The new NASA Astrophysics Data System (ADS) is designed with a service-oriented architecture (SOA) that consists of multiple customized Apache Solr search engine instances plus a collection of microservices, containerized using Docker, and deployed in Amazon Web Services (AWS). For complex systems like the ADS, this loosely coupled architecture can lead to a more scalable, reliable and resilient system if some fundamental questions are addressed. After having experimented with different AWS environments and deployment methods, we decided in December 2017 to go with Kubernetes as our container orchestration platform. Defining the best strategy to properly set up Kubernetes has proven challenging: automatically scaling services and load balancing traffic can lead to errors whose origin is difficult to identify, monitoring and logging the activity that happens across multiple layers for a single request needs to be carefully addressed, and the best workflow for a Continuous Integration and Delivery (CI/CD) system is not self-evident. We present here how we tackle these challenges and our plans for the future.
Harvard-Smithsonian Center for Astrophysics, 60 Garden Street, Cambridge, MA 02138, USA
1 Introduction
The NASA Astrophysics Data System (ADS; Kurtz et al. 2000) is a key bibliographic service for astronomical research. ADS content has steadily increased since its early years (Grant et al. 2000), and it now contains more than 13 million records and 100 million citations, including software and data citations (Accomazzi 2015). After several iterations, its original architecture (Accomazzi et al. 2000) and user interface (Eichhorn et al. 2000) have evolved to address growing maintenance challenges and to adopt newer technologies that allow more advanced functionality (Chyla et al. 2015; Accomazzi et al. 2015, 2018).
The new ADS is designed with a service-oriented architecture (SOA), containerized using Docker (https://www.docker.com/), orchestrated by Kubernetes (https://kubernetes.io/) and deployed in Amazon Web Services (AWS; https://aws.amazon.com/). We have been using this platform for almost a year now, both in our development and production environments. However, when searching for Kubernetes in the full text of the astronomy collection in the new ADS, we currently find only nine results, and one of them is not related to the software platform. Among these results, only three present results or a product/service that used Kubernetes in production (Abbott et al. 2018; Araya et al. 2018; Farias et al. 2018). The rest only mention the software as an alternative or indicate that they are considering migrating their platform to it in the future. While the new ADS does not have full text for all records, these data indicate that the new ADS is using cutting-edge technology in production. The price to pay for being early adopters is the challenge of solving problems that nobody (or very few people) has faced yet, but sharing our experience will ease the path for others while ADS continues to lead the way in the astrophysical community.
2 The new architecture
The new ADS consists of multiple customized Apache Solr (https://lucene.apache.org/solr/) search engine instances plus a collection of microservices deployed in two different Kubernetes clusters (see Figure 1). This loosely coupled architecture allows us to have a more scalable, reliable and resilient system.
Based on our experience, managing Kubernetes clusters in production requires three things: a good strategy to monitor all the services from the exterior, logging of the internal events triggered by users’ requests, and a Continuous Integration and Delivery (CI/CD) workflow for deploying new software versions that minimizes service interruptions.
2.1 Monitoring
Making sure the whole system is healthy and responding to users’ requests is a priority. We developed a custom monitoring tool that emulates users’ behavior (e.g., executing searches, accessing libraries, exporting records, filtering results) and alerts us to unexpected results or errors via Slack (https://slack.com/). This emulation runs at a high cadence, on the order of every few minutes. Historical data are also accumulated, and daily reports are generated to measure trends and improvements that could be correlated with microservice updates or infrastructure changes.
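As an illustration of the approach described above, the following is a minimal sketch of a probe runner, not the ADS monitoring tool itself. The names `CheckResult`, `run_check`, `format_alert` and `post_to_slack` are hypothetical, and the use of a Slack incoming webhook (a JSON POST with a `text` field) is an assumption about how alerts might be delivered.

```python
import json
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    name: str        # label of the emulated user action, e.g. "search"
    ok: bool         # True if the action completed without raising
    seconds: float   # elapsed wall-clock time
    detail: str = "" # error description when the action failed


def run_check(name: str, action: Callable[[], None]) -> CheckResult:
    """Run one emulated user action and time it; any exception is a failure."""
    start = time.monotonic()
    try:
        action()
        return CheckResult(name, True, time.monotonic() - start)
    except Exception as exc:  # a failing probe must not stop the other probes
        return CheckResult(name, False, time.monotonic() - start, repr(exc))


def format_alert(results: List[CheckResult]) -> Optional[str]:
    """Build an alert message for the failed checks, or None if all passed."""
    failed = [r for r in results if not r.ok]
    if not failed:
        return None
    return "\n".join(
        f"{r.name} failed after {r.seconds:.2f}s: {r.detail}" for r in failed
    )


def post_to_slack(webhook_url: str, text: str) -> None:
    """Send the alert to a Slack incoming webhook (the URL is a placeholder)."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=10)
```

In such a design, each emulated action (search, library access, export) would be a small function that raises when the response is unexpected, and a scheduler would call `run_check` for each of them every few minutes, forwarding any message from `format_alert` to `post_to_slack`.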
2.2 Logging
Responding to a single user request may involve multiple microservices (e.g., libraries, Solr search service) and different data requests (e.g., bibcodes in a library, records in Solr). At the very first step, when the user request reaches the AWS application load balancer, a trace identifier is attached to the HTTP request, and we propagate it to every internal request it triggers inside our infrastructure. All the microservices write logs to stdout, including key information such as the trace identifier and the user’s account identifier. Logs are captured by Fluent Bit (https://fluentbit.io/) and distributed to Graylog (https://www.graylog.org/) and AWS CloudWatch via Fluentd (https://www.fluentd.org/).
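The propagation and logging steps can be sketched as follows. This is an illustrative example rather than ADS code: `outgoing_headers`, `JsonFormatter` and `make_logger` are hypothetical names, the JSON log layout is an assumption, and so is the choice of `X-Amzn-Trace-Id` (the header that AWS application load balancers add to requests) as the carrier of the trace identifier.

```python
import json
import logging
import sys
from typing import Dict, Mapping

# Header added by AWS application load balancers; assuming it carries the trace id.
TRACE_HEADER = "X-Amzn-Trace-Id"


def outgoing_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Copy the trace identifier from the incoming request to an internal one.

    Note: the lookup is case-sensitive; a real framework's header object
    usually handles case-insensitive names for you.
    """
    trace_id = incoming.get(TRACE_HEADER)
    return {TRACE_HEADER: trace_id} if trace_id else {}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line so a log shipper can parse it."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Both fields are supplied by the caller through `extra=`.
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
        })


def make_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout (or a given stream)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A microservice would call `make_logger("libraries")` once, pass `outgoing_headers(request_headers)` along with every internal HTTP call, and log with `logger.info("library fetched", extra={"trace_id": ..., "user_id": ...})`, so that all log lines belonging to one user request can be found by searching for a single identifier.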
2.3 Deploying
The deployment of new microservice releases is automatically managed by Keel (https://keel.sh/). Developers push new commits to GitHub (https://github.com/) and/or make releases, which triggers unit testing via Travis (https://travis-ci.org/) continuous integration and image building via Docker Hub (https://hub.docker.com/). When a new image is built, Keel deploys it directly to our development environment (for each pushed commit) or to our quality assurance environment (for each new release). Confirmation to deploy a release in production is provided via Slack, where Keel reports its operations and reacts to developers’ approvals.
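The promotion rules in this workflow can be summarized in a few lines of code. The sketch below only models the policy described above; it is not Keel's configuration format or API, and the names `Event` and `target_environments` as well as the environment labels are hypothetical.

```python
from enum import Enum
from typing import List


class Event(Enum):
    COMMIT = "commit"    # a commit pushed to the repository
    RELEASE = "release"  # a tagged release


def target_environments(event: Event, approved: bool = False) -> List[str]:
    """Return the environments a freshly built image should be deployed to.

    Commits go to development, releases go to quality assurance, and a
    release reaches production only after an explicit approval (given via
    Slack in the workflow described in the text).
    """
    if event is Event.COMMIT:
        return ["development"]
    targets = ["quality-assurance"]
    if approved:
        targets.append("production")
    return targets
```

Keeping the approval as an explicit input makes the production step a deliberate human decision, while the development and quality assurance deployments stay fully automatic.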
3 Future plans
Several microservices still require manual intervention to deploy new releases. Keel does not cover all our development cases, so we are working on a new custom tool to meet our needs (after having discarded other tools available on the market due to their complexity). We seek to fully automate the deployment process while ensuring traceability and easy rollbacks based on automatic functional tests from our monitoring tool. Additionally, to reduce the required resources and simplify operations, we will evaluate other engines for searching through our logs, such as Kibana via ElasticSearch (https://www.elastic.co/products/kibana), provided by AWS.
References
- Abbott, T. M. C., Abdalla, F. B., Allam, S., Amara, A., Annis, J., Asorey, J., Avila, S., Ballester, O., et al. 2018, arXiv e-prints, 1801.03181
- Accomazzi, A. 2015, in Science Operations 2015: Science Data Management - An ESO/ESA Workshop, 3
- Accomazzi, A., Eichhorn, G., Kurtz, M. J., Grant, C. S., & Murray, S. S. 2000, Astronomy and Astrophysics Supplement Series, 143, 85. astro-ph/0002105
- Accomazzi, A., Kurtz, M. J., Henneken, E., Grant, C. S., Thompson, D. M., Chyla, R., McDonald, S., Shaulis, T. J., Blanco-Cuaresma, S., Shapurian, G., Hostetler, T. W., Templeton, M. R., & Lockhart, K. E. 2018, in American Astronomical Society Meeting Abstracts #231, vol. 231 of American Astronomical Society Meeting Abstracts, 362.17
- Accomazzi, A., Kurtz, M. J., Henneken, E. A., Chyla, R., Luker, J., Grant, C. S., Thompson, D. M., Holachek, A., Dave, R., & Murray, S. S. 2015, in Open Science at the Frontiers of Librarianship, edited by A. Holl, S. Lesteven, D. Dietrich, & A. Gasperini, vol. 492, 189
- Araya, M., Osorio, M., Díaz, M., Ponce, C., Villanueva, M., Valenzuela, C., & Solar, M. 2018, Astronomy and Computing, 25, 110
- Chyla, R., Accomazzi, A., Holachek, A., Grant, C. S., Elliott, J., Henneken, E. A., Thompson, D. M., Kurtz, M. J., Murray, S. S., & Sudilovsky, V. 2015, in Astronomical Data Analysis Software and Systems XXIV (ADASS XXIV), edited by A. R. Taylor, & E. Rosolowsky, vol. 495, 401
- Eichhorn, G., Kurtz, M. J., Accomazzi, A., Grant, C. S., & Murray, S. S. 2000, Astronomy and Astrophysics Supplement Series, 143, 61. astro-ph/0002102
