Data Management

An illustration of the innovative "data mining sphere" developed by the Rubin Observatory database team.

Data Management is responsible for creating the software, services, and systems that will be used to produce Rubin Observatory's data products. If you are interested in installing the pipeline software, go to pipelines.lsst.io.

The speed with which Rubin Observatory maps the southern sky and the depth to which it can see will produce an enormous volume of data: about 20 terabytes (TB), or 20 trillion bytes, of raw data per night. The total amount of data collected over the ten years of operation will be about 60 petabytes (PB), and processing this data will produce a 20 PB catalog database. The total data volume after processing will be several hundred PB. Processing will require about 150 TFLOPS (trillion floating point operations per second) of computing power for the first Data Release, increasing to 950 TFLOPS by Data Release 11 at the end of the ten-year survey. Processing such a large volume of data, converting the raw images into a faithful representation of the universe, implementing automated data quality assessment and automated discovery of moving or transient sources, and archiving the results in a useful form for a broad community of users is a major challenge.
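As a rough consistency check on the figures above, the short Python sketch below derives the number of observing nights implied by the nightly and total raw data volumes, and the yearly growth in computing power implied by the two TFLOPS figures. It assumes decimal units, and it assumes one data release per year with a constant growth factor between Data Release 1 and Data Release 11; neither assumption is stated in the text, and the variable names are illustrative only.

```python
# Back-of-the-envelope check of the quoted data volumes.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 PB = 10**15 bytes).
TB = 10**12
PB = 10**15

raw_per_night = 20 * TB   # quoted raw data volume per night
raw_total = 60 * PB       # quoted total over the survey
survey_years = 10

# Observing nights implied by the two quoted figures.
implied_nights = raw_total // raw_per_night
nights_per_year = implied_nights / survey_years

# Assumption: ten equal yearly steps from Data Release 1 (150 TFLOPS)
# to Data Release 11 (950 TFLOPS), with a constant growth factor.
growth_per_year = (950 / 150) ** (1 / 10)

print(f"implied observing nights: {implied_nights}")
print(f"implied nights per year:  {nights_per_year:.0f}")
print(f"implied compute growth:   {growth_per_year:.3f}x per year")
```

Under these assumptions the quoted volumes correspond to about 3,000 observing nights, or roughly 300 per year, and to compute capacity growing by about 20% per year.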

The data management system is architected in three layers: an infrastructure layer consisting of the computing, storage, and networking hardware and system software; a middleware layer, which handles distributed processing, data access, the user interface, and system operations services; and an applications layer, which includes the data pipelines and products and the science data archives. The applications layer is organized around the data products being produced.

The nightly pipelines are based on image subtraction, a process that highlights differences between two exposures of the same field, and are designed to rapidly detect interesting transient events in the image stream and send out alerts to the community within 60 seconds of completing the image readout.
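The idea behind image subtraction can be illustrated with a toy Python sketch. This is not the Rubin pipeline code: it skips the image registration, PSF matching, and noise modelling that a real difference-imaging pipeline needs, and the function names, pixel values, and threshold are all hypothetical.

```python
def subtract(science, template):
    """Return the pixel-by-pixel difference science - template."""
    return [
        [s - t for s, t in zip(science_row, template_row)]
        for science_row, template_row in zip(science, template)
    ]


def detect(diff, threshold):
    """Return (row, column) positions where |difference| exceeds threshold."""
    return [
        (r, c)
        for r, row in enumerate(diff)
        for c, value in enumerate(row)
        if abs(value) > threshold
    ]


# Hypothetical 3x3 images of the same field: an earlier template
# exposure and a new science exposure with one brightened pixel.
template = [
    [10, 10, 10],
    [10, 10, 10],
    [10, 10, 10],
]
science = [
    [10, 11, 10],
    [10, 10, 60],
    [9, 10, 10],
]

diff = subtract(science, template)
candidates = detect(diff, threshold=5)
print(candidates)
```

Small pixel-to-pixel fluctuations fall below the threshold, so only the pixel that changed substantially between the two exposures is flagged as a candidate transient.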

The data release pipelines, in contrast, are intended to produce the most completely analyzed data products of the survey, in particular those that measure very faint objects and cover long time scales. Each year, a new run processes the entire available survey data set, cumulatively increasing the depth and completeness of the available data. The data release pipelines consume most of the computing power of the data management system.
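To see why reprocessing the full data set every year dominates the computing budget, the sketch below adds up the raw input to each yearly run. It assumes that raw data accumulates evenly at 6 PB per year (60 PB over ten years) and that there is exactly one run per year over everything collected so far; the actual release schedule may differ.

```python
# Assumption: 60 PB of raw data spread evenly over a ten-year survey.
PB_PER_YEAR = 6
SURVEY_YEARS = 10

# Raw input (in PB) to the data release run at the end of each year:
# every run starts again from all data collected so far.
inputs_pb = [PB_PER_YEAR * year for year in range(1, SURVEY_YEARS + 1)]

# Total raw data volume read across all yearly runs.
total_reprocessed_pb = sum(inputs_pb)

print(inputs_pb)
print(total_reprocessed_pb)
```

Under these assumptions the yearly runs together read 330 PB of raw data, more than five times the 60 PB actually collected.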

The calibration products pipeline produces the wide variety of calibration data required by the other pipelines.

All of these pipelines are designed to run on small and medium-sized platforms as well as to make efficient use of Linux clusters with thousands of nodes.

Although the data management facilities will have substantial computing power (the 150 TFLOPS required for processing the first Data Release equals the performance of the world's most powerful computer in 2004), if current trends continue, they will not even qualify for the TOP500 list when Rubin Observatory sees first light. Hence, while Rubin Observatory is making novel use of advances in information technology, it is not taking the risk of pushing the expected technology to the limit.

Image Credit: LSST
