
Petascale R&D Challenges

Achieving scalability and reliability in Rubin Observatory computing, storage, and network resources

The design of the data management system (DMS) architecture is shaped by the technology we expect to be available to implement it, from construction and commissioning through the principal 10-year survey period. That technology includes not only more powerful components but also entirely new system architectures and potentially disruptive technologies.

Most computing throughput improvements will come not from increased CPU clock speeds, as in the past, but from larger concentrations of CPUs/cores and advanced computing architectures. Solid-state technology may change storage and the way we physically organize data. Hardware failures will be routine for the Rubin Observatory data system because of the large number of CPUs and disk drives involved and the system's reliance on high-speed network connectivity. Creating a system sufficiently robust to operate around such failures is a major challenge. We need to predict the characteristics of CPU, network, and storage hardware, and of system software, well enough that our design remains appropriate. Further, we need to insulate the design as much as possible from underlying platform dependencies.
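
As a minimal illustration of operating around routine failures, the sketch below retries a failed unit of work on alternate nodes instead of aborting. The node names, failure model, and task are hypothetical; a real system would layer this on a resource manager with checkpointing.

    import random
    import time

    # Hypothetical pool of worker nodes; in a real DMS these would be
    # discovered from a resource manager rather than hard-coded.
    NODES = ["worker-01", "worker-02", "worker-03"]

    class NodeFailure(Exception):
        """Raised when a node fails mid-task (disk, CPU, or network fault)."""

    def process_ccd(node, ccd_id):
        # Stand-in for a pipeline task; fails ~20% of the time to mimic
        # the routine hardware faults expected at this scale.
        if random.random() < 0.2:
            raise NodeFailure(f"{node} failed while processing CCD {ccd_id}")
        return f"CCD {ccd_id} processed on {node}"

    def run_with_failover(ccd_id, max_attempts=3):
        """Retry a task on a different node after each failure."""
        candidates = random.sample(NODES, k=min(max_attempts, len(NODES)))
        for attempt, node in enumerate(candidates, start=1):
            try:
                return process_ccd(node, ccd_id)
            except NodeFailure as err:
                print(f"attempt {attempt}: {err}; retrying elsewhere")
                time.sleep(0.1)  # back off briefly before the next node
        raise RuntimeError(f"CCD {ccd_id} failed on all candidate nodes")

    print(run_with_failover(ccd_id=42))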

Reliability and performance issues for very large databases

Rubin Observatory's main data products from the 18,000-square-degree survey, which will accumulate roughly 2000 images of each patch of sky over ten years, will take the form of extremely large relational database tables (37 billion rows in the Object table; 350 billion rows in the Source table). The tables must be extensible, partitioned and indexed for high query performance, and replicated across multiple centers.

Queries in the time domain (Source table) are likely to be of equal importance to those in the spatial domain. Since spatial and temporal access patterns are traditionally optimized by different physical database organizations, it is unclear which choices will perform best for Rubin Observatory. Some intensive applications will also involve n-point correlations of object attributes over all objects. All these factors make database performance and reliability active research areas.
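
As one illustration of the partitioning problem, the sketch below maps sky positions to storage chunks so that a small spatial query touches only a few partitions rather than the full 37-billion-row table. The chunking scheme, chunk_id encoding, and function names are illustrative assumptions; the production partitioning design is considerably more sophisticated.

    import math

    NUM_DEC_BANDS = 18  # hypothetical: 10-degree declination bands

    def chunk_id(ra_deg, dec_deg):
        """Map a sky position to a partition ID.

        Bands of constant declination are subdivided in RA, with fewer
        RA divisions near the poles so chunks stay roughly equal in area.
        """
        band = min(int((dec_deg + 90.0) / 180.0 * NUM_DEC_BANDS), NUM_DEC_BANDS - 1)
        band_center = -90.0 + (band + 0.5) * (180.0 / NUM_DEC_BANDS)
        # Scale RA subdivisions by cos(dec) so chunk widths stay comparable.
        ra_divisions = max(1, int(36 * math.cos(math.radians(band_center))))
        ra_index = min(int(ra_deg / 360.0 * ra_divisions), ra_divisions - 1)
        return band * 100 + ra_index  # encode band and RA index together

    def chunks_for_box(ra_min, ra_max, dec_min, dec_max, step=1.0):
        """Conservatively list every chunk overlapping a rectangular query.

        A query planner would then dispatch the query only to the nodes
        holding these chunks instead of scanning the whole table.
        """
        ids = set()
        dec = dec_min
        while dec <= dec_max:
            ra = ra_min
            while ra <= ra_max:
                ids.add(chunk_id(ra % 360.0, dec))
                ra += step
            dec += step
        return sorted(ids)

    # A small box search near the equator touches only a few chunks.
    print(chunks_for_box(ra_min=149.0, ra_max=151.0, dec_min=1.0, dec_max=3.0))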

Efficient automated data quality assessment

LSST will produce large volumes of science data. The DMS produces products for scientific use both during observing (i.e., alerts and the supporting image and source data) and in daily and periodic reprocessing, with the periodic reprocessing also yielding released science products. Analysis of the nightly data will additionally provide insight into the health of the telescope/camera system. An automated data quality assessment system that efficiently searches the raw image data for outliers and unusual correlations must be developed; this will almost certainly involve aspects of machine learning.
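
As a minimal sketch of automated outlier screening, the following flags anomalous exposures from per-image summary statistics using a robust median/MAD test; the metrics and thresholds are hypothetical stand-ins for what the nightly pipelines would actually measure.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Hypothetical per-image quality metrics: (sky background, PSF width,
    # source count). Real metrics would come from the nightly pipelines.
    metrics = rng.normal(loc=[1000.0, 0.8, 5000.0],
                         scale=[50.0, 0.05, 300.0], size=(500, 3))
    metrics[123] = [1500.0, 0.8, 5000.0]   # inject a bad-sky image
    metrics[321] = [1000.0, 1.4, 2000.0]   # inject a bad-PSF image

    def robust_outliers(x, threshold=5.0):
        """Flag rows whose metrics deviate strongly from the median.

        Uses the median absolute deviation (MAD) rather than the standard
        deviation so the flagging is not itself corrupted by the outliers.
        """
        med = np.median(x, axis=0)
        mad = np.median(np.abs(x - med), axis=0)
        z = np.abs(x - med) / (1.4826 * mad)   # 1.4826: Gaussian MAD factor
        return np.where(z.max(axis=1) > threshold)[0]

    print(robust_outliers(metrics))   # expected: [123 321]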

Operational control and monitoring of the DMS

The DMS will be a complex distributed system with enormous data flows, operating 24 hours a day, seven days a week. It must be continuously monitored and controlled to ensure the proper functioning of all computing hardware, network connections, and software, including the data quality of the science pipelines. Most of the monitoring tasks, and some of the control tasks, must be highly automated, since the data volumes preclude human examination of all but a tiny fraction of the data.
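
A toy sketch of the automation idea follows: metrics are polled, compared against thresholds, and turned into automated responses without human inspection. The metric names, thresholds, and responses are all illustrative assumptions.

    # Hypothetical health checks; a production system would poll real
    # services and publish results to a monitoring bus, not print them.
    THRESHOLDS = {
        "disk_free_fraction": 0.10,     # alert below 10% free space
        "queue_depth": 10000,           # alert above this backlog
        "pipeline_failure_rate": 0.01,  # alert above 1% failed tasks
    }

    def read_metrics():
        """Stand-in for gathering live metrics from the distributed system."""
        return {"disk_free_fraction": 0.08,
                "queue_depth": 1200,
                "pipeline_failure_rate": 0.002}

    def evaluate(metrics):
        """Return automated alerts for any metric outside its threshold."""
        alerts = []
        if metrics["disk_free_fraction"] < THRESHOLDS["disk_free_fraction"]:
            alerts.append("LOW DISK: trigger automated cleanup / rebalance")
        if metrics["queue_depth"] > THRESHOLDS["queue_depth"]:
            alerts.append("BACKLOG: spawn additional workers")
        if metrics["pipeline_failure_rate"] > THRESHOLDS["pipeline_failure_rate"]:
            alerts.append("PIPELINE FAULTS: quarantine failing node, page on-call")
        return alerts

    for alert in evaluate(read_metrics()):
        print(alert)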

Achieving an acceptably low false transient alert rate

The science mission places high demands on Rubin Observatory's ability to rapidly and accurately detect and classify varying and transient objects while maintaining a low false alarm rate. Given the very high data volume produced by Rubin Observatory, the correspondingly large number of detections in each image (up to one million objects detected per image), and the likelihood of discovering entirely new classes of transients, Rubin Observatory will not be able to rely on traditional labor-intensive validation of detections, classifications, and alerts. To achieve the required levels of accuracy, new algorithms for detection and classification must be created, along with innovative automated techniques for alert filtering and validation.
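
The sketch below illustrates one common pattern for automated alert validation: a chain of inexpensive vetoes (signal-to-noise cut, known-artifact mask, machine-learned real/bogus score) applied before an alert is issued. The field names and thresholds are hypothetical, not Rubin Observatory's actual pipeline.

    from dataclasses import dataclass

    @dataclass
    class Detection:
        """Hypothetical difference-image detection; fields are illustrative."""
        snr: float              # signal-to-noise ratio of the detection
        near_bright_star: bool  # flag from a known-artifact mask
        real_bogus: float       # score in [0, 1] from a trained classifier

    def passes_filters(d, snr_min=5.0, rb_min=0.7):
        """Chain of cheap vetoes applied before an alert is issued.

        Each test removes a known class of false positives; the ordering
        puts the cheapest tests first so most bogus detections exit early.
        """
        if d.snr < snr_min:
            return False           # too faint to trust
        if d.near_bright_star:
            return False           # likely a diffraction-spike artifact
        if d.real_bogus < rb_min:
            return False           # machine-learned score says "bogus"
        return True

    candidates = [
        Detection(snr=12.0, near_bright_star=False, real_bogus=0.95),
        Detection(snr=6.0, near_bright_star=True, real_bogus=0.90),
        Detection(snr=4.2, near_bright_star=False, real_bogus=0.99),
    ]
    print([passes_filters(d) for d in candidates])  # [True, False, False]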

Efficiently detecting and determining orbits of solar system objects

One of Rubin Observatory's science missions is to catalog the population of solar system objects, with a particular focus on potentially hazardous objects. Due to the depth of Rubin Observatory images, about 300 solar system objects per square degree will be detected near the ecliptic. Rubin Observatory's path across the sky is not optimized solely for tracking solar system objects, so this dense swarm of objects must be reliably tracked through considerable gaps in time. Scalable algorithms that minimize incorrect associations between detections at different times must be developed.
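
As a toy illustration of the linking problem, the sketch below extends two-detection tracks by linear extrapolation and keeps only those confirmed by a later night, which is the basic mechanism for suppressing incorrect associations. Real linking must handle millions of detections and true orbital motion; all data here are invented for the example.

    import math

    # Hypothetical detections: (night, ra_deg, dec_deg). A real linker works
    # with millions of detections and full orbital dynamics, not straight lines.
    night1 = [(0, 10.000, 5.000), (0, 10.500, 5.200)]
    night2 = [(1, 10.030, 5.010), (1, 10.470, 5.190)]
    night4 = [(3, 10.090, 5.030), (3, 10.410, 5.170)]

    def link(track, detections, tolerance_deg=0.01):
        """Extend a two-point track by linear extrapolation.

        The velocity implied by the first two points predicts where the
        object should appear later; only detections within tolerance_deg
        of the prediction are accepted, which suppresses false linkages.
        """
        (t0, ra0, dec0), (t1, ra1, dec1) = track[-2], track[-1]
        v_ra = (ra1 - ra0) / (t1 - t0)
        v_dec = (dec1 - dec0) / (t1 - t0)
        extended = []
        for (t, ra, dec) in detections:
            pred_ra = ra1 + v_ra * (t - t1)
            pred_dec = dec1 + v_dec * (t - t1)
            if math.hypot(ra - pred_ra, dec - pred_dec) < tolerance_deg:
                extended.append(track + [(t, ra, dec)])
        return extended

    # Build candidate tracks from all night-1/night-2 pairs, then require
    # a consistent night-4 detection to confirm them.
    tracks = [[a, b] for a in night1 for b in night2]
    confirmed = [t for track in tracks for t in link(track, night4)]
    for t in confirmed:
        print(t)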

Achieving required photometric accuracy and precision

The Rubin Observatory Science Requirements Document (SRD) requires a level of photometric (intensity data) accuracy and precision that may be difficult to achieve over the entire sky, particularly since Rubin Observatory will operate under a wide variety of seeing, sky brightness, and atmospheric conditions. Achieving this requires a thoroughly tested calibration procedure and associated image processing pipeline. In addition to the point-source requirements in the SRD, accurate photometric redshifts require precision photometry for spatially extended objects.
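
For a concrete, much-simplified piece of the calibration problem, the sketch below recovers a per-exposure photometric zeropoint from reference stars, using a median to resist corrupted measurements. The star catalog, scatter, and zeropoint value are simulated assumptions.

    import numpy as np

    rng = np.random.default_rng(seed=2)

    # Hypothetical reference stars: true catalog magnitudes and the raw
    # instrumental fluxes measured on one exposure.
    true_mag = rng.uniform(16.0, 19.0, size=200)
    zeropoint_true = 28.3                      # what we are trying to recover
    flux = 10 ** (-0.4 * (true_mag - zeropoint_true))
    flux *= rng.normal(1.0, 0.02, size=200)    # 2% measurement scatter
    flux[:5] *= 0.5                            # a few corrupted stars (clouds, blends)

    def estimate_zeropoint(catalog_mag, instrumental_flux):
        """Robustly estimate the magnitude zeropoint of one exposure.

        The zeropoint is the offset between instrumental magnitudes
        (-2.5 log10 flux) and catalog magnitudes; a median resists the
        corrupted measurements that would otherwise bias a plain mean.
        """
        inst_mag = -2.5 * np.log10(instrumental_flux)
        return np.median(catalog_mag - inst_mag)

    print(f"recovered zeropoint: {estimate_zeropoint(true_mag, flux):.3f}")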

Achieving required astrometric accuracy and precision

The Rubin Observatory SRD requires a level of astrometric (position on the sky) accuracy and precision that is difficult to achieve over the entire sky. Achieving this astrometric performance requires a global, whole-sky, numerical solution for all per-frame astrometric quantities that minimizes any necessary trade-offs. Considerable work will be required to develop such a solution.
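
The sketch below shows the structure of such a global solution in a deliberately tiny 1-D form: per-frame offsets and star positions are solved for simultaneously in one least-squares system, with one frame fixed to remove the overall degeneracy. The real problem is far larger, sparse, and two-dimensional; all numbers here are simulated.

    import numpy as np

    rng = np.random.default_rng(seed=3)

    n_frames, n_stars = 5, 40
    star_pos = rng.uniform(0.0, 1.0, n_stars)      # true 1-D star positions
    frame_off = rng.normal(0.0, 0.05, n_frames)    # unknown per-frame shifts
    frame_off[0] = 0.0                             # gauge: frame 0 defines the grid

    # Simulated measurements: every frame sees every star (real overlap
    # patterns are sparse, which is part of what makes the problem hard).
    obs = star_pos[None, :] + frame_off[:, None]
    obs += rng.normal(0.0, 0.001, obs.shape)

    # One linear equation per measurement: obs = star_pos[s] + frame_off[f].
    # Unknowns: n_stars positions plus n_frames-1 offsets (frame 0 fixed).
    n_unknowns = n_stars + n_frames - 1
    A = np.zeros((n_frames * n_stars, n_unknowns))
    b = obs.ravel()
    for f in range(n_frames):
        for s in range(n_stars):
            row = f * n_stars + s
            A[row, s] = 1.0                        # star-position coefficient
            if f > 0:
                A[row, n_stars + f - 1] = 1.0      # frame-offset coefficient

    solution, *_ = np.linalg.lstsq(A, b, rcond=None)
    recovered_offsets = solution[n_stars:]
    print(np.round(recovered_offsets - frame_off[1:], 4))  # ~zero residuals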

Achieving optimal object detection and shape measurement from stacks of images

Most objects that will be used for dark matter and dark energy science are too faint to be usefully measured in a single Rubin Observatory exposure. Instead, Rubin Observatory must detect and measure the properties of objects by combining information from multiple exposures of the same region of sky (image stacks). Weak lensing galaxy shape measurements are particularly vulnerable to systematic effects introduced by errors in determining the local point-spread function (PSF), a measure of how the innate characteristics of the imaging system affect the image. These systematic effects must be minimized. Exposures may vary significantly in their signal-to-noise and PSF quality, and defining how to optimally combine information from all of them is a research problem.
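
A minimal sketch of one ingredient, inverse-variance weighting, follows: each exposure is weighted by 1/variance so noisier images contribute less to the combined measurement. A production coadd must also match PSFs and mask defects; the image sizes and noise levels below are invented for the example.

    import numpy as np

    rng = np.random.default_rng(seed=4)

    # Hypothetical stack: 10 exposures of the same 64x64 patch, each with
    # its own noise level (and hence its own per-pixel variance plane).
    truth = np.zeros((64, 64))
    truth[32, 32] = 100.0                        # one faint point source
    sigmas = rng.uniform(5.0, 20.0, size=10)     # per-exposure noise levels

    exposures = [truth + rng.normal(0.0, s, truth.shape) for s in sigmas]
    variances = [np.full(truth.shape, s**2) for s in sigmas]

    def coadd(images, var_planes):
        """Inverse-variance weighted mean of aligned exposures.

        Weighting each exposure by 1/variance maximizes the signal-to-noise
        of the combined image for background-limited data; a production
        coadd must also match PSFs and mask defects before combining.
        """
        weights = [1.0 / v for v in var_planes]
        num = sum(w * img for w, img in zip(weights, images))
        den = sum(weights)
        return num / den

    stack = coadd(exposures, variances)
    coadd_sigma = np.sqrt(1.0 / sum(1.0 / s**2 for s in sigmas))
    print(f"single-image SNR (best): {100.0 / sigmas.min():.1f}")
    print(f"coadd SNR: {stack[32, 32] / coadd_sigma:.1f}")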

Developing a flexible approach that enables highly reliable classification of objects

Classification of astronomical objects is important and difficult. A wide variety of information must be assessed to reliably classify an object, including spatial morphology in multiple colors, photometry in multiple colors, time-dependent behavior, and astrometric motion. Further, the best classifications will make use of surveys in other wavelength regimes and of spectral information where available, not solely information from Rubin Observatory.

Experience from many surveys has shown that no single algorithm can do a good job on all objects. Rather, good algorithms tend to be specialists, limited to particular object classes, e.g. eclipsing binaries or supernovae. A successful system must allow the development and incorporation of a wide variety of algorithms in a flexible manner, as in the sketch below. Research will include the application of human computation techniques to this challenge.
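
The flexibility requirement suggests a plug-in architecture along these lines, where each specialist classifier is a small function registered with a common framework and the system keeps the most confident opinion. The registry, specialist rules, and feature names are illustrative assumptions.

    # A minimal sketch of a pluggable classifier framework, assuming each
    # specialist returns (label, confidence) or None when it abstains.
    CLASSIFIERS = []

    def register(fn):
        """Decorator that adds a specialist classifier to the ensemble."""
        CLASSIFIERS.append(fn)
        return fn

    @register
    def supernova_like(features):
        # Specialist: single rise-and-fall light curve, no prior variability.
        if features["rise_fall"] and not features["periodic"]:
            return ("supernova_candidate", 0.8)
        return None

    @register
    def eclipsing_binary(features):
        # Specialist: strictly periodic dimming events.
        if features["periodic"] and features["dips"]:
            return ("eclipsing_binary", 0.9)
        return None

    def classify(features):
        """Poll every specialist and keep the most confident opinion.

        A new algorithm is added by writing one function and decorating
        it; nothing else in the system has to change.
        """
        votes = [v for fn in CLASSIFIERS if (v := fn(features)) is not None]
        return max(votes, key=lambda v: v[1]) if votes else ("unclassified", 0.0)

    print(classify({"rise_fall": True, "periodic": False, "dips": False}))
    print(classify({"rise_fall": False, "periodic": True, "dips": True}))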

Retuning algorithm behavior on the fly

Several key algorithms employed in the Rubin Observatory application pipelines are complex, containing many data-dependent decisions and a large number of tuning parameters that affect their behavior. As observing conditions change, an algorithm may begin to fail for a particular choice of tuning parameters. Rubin Observatory's extremely large data volume makes human intervention in such cases impractical, but it is essential that the pipelines continue to function successfully. Rubin Observatory will incorporate adaptive tuning to address these dynamics.
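
A toy sketch of such a feedback loop follows: a measured false-positive rate is compared with a target, and a detection threshold is nudged proportionally until the pipeline returns to its operating point. The feedback model, gain, and target values are invented for illustration.

    TARGET_FP_RATE = 0.05   # desired fraction of detections that are bogus
    GAIN = 2.0              # how aggressively the threshold responds

    def measured_fp_rate(threshold):
        """Stand-in for the quality-assessment feedback signal.

        In this toy model a lower threshold admits more false positives.
        """
        return max(0.0, 0.25 - 0.04 * threshold)

    threshold = 3.0
    for visit in range(8):
        fp = measured_fp_rate(threshold)
        # Proportional control: raise the threshold when too many false
        # positives slip through, lower it when we are being too strict.
        threshold += GAIN * (fp - TARGET_FP_RATE)
        print(f"visit {visit}: fp_rate={fp:.3f} -> threshold={threshold:.2f}")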

Verifying scientific usefulness of the Rubin Observatory database schema and its implementation against realistic queries

The Rubin Observatory database schema must efficiently support queries over data that have many relationships among sky locations, observing epochs, and filters. A high-performance implementation of this schema has many complexities that are addressed in the peta-scale database architecture and analysis challenge. The ultimate test of how well these challenges have been met is to perform science with the database. To do this usefully, we are simulating Rubin Observatory data, using data from current surveys, and engaging the Rubin Observatory/LSST Science Collaborations and the scientific community.
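
A much-reduced sketch of this kind of verification might look like the following: load a miniature mock Object table, then run and time a representative science query against it. The schema fragment, column names, and query are illustrative, not the actual Rubin Observatory schema.

    import random
    import sqlite3
    import time

    # A toy verification harness: a miniature mock Object table with a
    # spatial index, queried the way a scientist might query the real one.
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE Object (
        objectId INTEGER PRIMARY KEY,
        ra REAL, dec REAL,
        g_mag REAL, r_mag REAL)""")
    conn.execute("CREATE INDEX idx_radec ON Object (ra, dec)")

    random.seed(5)
    rows = [(i, random.uniform(0, 360), random.uniform(-90, 0),
             random.uniform(15, 25), random.uniform(15, 25))
            for i in range(100_000)]
    conn.executemany("INSERT INTO Object VALUES (?,?,?,?,?)", rows)

    # A representative science query: red objects in a small sky region.
    query = """SELECT objectId, g_mag - r_mag AS color
               FROM Object
               WHERE ra BETWEEN 150 AND 152
                 AND dec BETWEEN -3 AND -1
                 AND g_mag - r_mag > 1.0"""

    start = time.perf_counter()
    result = conn.execute(query).fetchall()
    elapsed = time.perf_counter() - start
    print(f"{len(result)} objects in {elapsed * 1000:.1f} ms")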




