LSST Data Management
The LSST data management system must:
- reliably process unprecedented data volumes,
- ensure consistent data quality without manual intervention,
- meet stringent near-real-time transient alerting deadlines,
- accommodate both scientific and computing technology evolution over at least a decade, and
- serve the LSST data products to a diverse community of users located across multiple continents.
The LSST data management system comprises a set of productions, each made up of a series of pipelines; a large archive of images; and a number of catalogs containing the detected astronomical sources and resolved astronomical objects. Underneath these are the software middleware and technology infrastructure that allow the visible elements to work securely, reliably, and scalably. Processing and data are distributed across multiple computing centers located on the observatory mountaintop, in a base facility near the observatory, and at multiple archive and data centers.
The results of each data release processing run form a data release: a static, self-consistent data set used for scientific analysis of LSST data and for publication of the results.
Periodically, new calibration data products, such as bias frames and flat fields, are created for use by the other processing functions.
All LSST data must be made available through an interface that utilizes, to the maximum possible extent, community-based standards such as those being developed by the Virtual Observatory.
Overview of Data Management
The rapid cadence of the LSST observing program will produce an enormous volume of data, about 20 TB per night. Over the ten years of operations this amounts to 60 PB of raw data and a 15 PB catalog database. The total data volume after processing will be several hundred PB, processed using ~150 TFlops of computing power. Processing such a large volume of data, converting the raw images into a faithful representation of the universe, automatically assessing data quality, automatically discovering moving or transient sources, and archiving the results in a form useful to a broad community of users together pose a major challenge.
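The raw-data figure quoted above can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the number of observing nights per year is not given in the text, so the value of 300 is an assumption, and the function and constant names are hypothetical.

```python
# Back-of-envelope check of the raw data volume quoted in the text.
TB_PER_NIGHT = 20       # from the text: ~20 TB per night
SURVEY_YEARS = 10       # from the text: ten years of operations
NIGHTS_PER_YEAR = 300   # ASSUMPTION: not stated in the text, chosen for illustration


def raw_volume_pb(tb_per_night, nights_per_year, years):
    """Return the total raw data volume in PB (taking 1 PB = 1000 TB)."""
    return tb_per_night * nights_per_year * years / 1000


print(raw_volume_pb(TB_PER_NIGHT, NIGHTS_PER_YEAR, SURVEY_YEARS))  # 60.0
```

Under these assumptions the estimate comes to 60 PB, consistent with the raw-data total given above. The result scales linearly with the number of usable nights, so a different assumption shifts the total proportionally.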
The data management system is architected in three layers: an infrastructure layer consisting of the computing, storage, and networking hardware and system software; a middleware layer, which handles distributed processing, data access, user interface, and system operations services; and an applications layer, which includes the data pipelines and products and the science data archives. The applications layer is organized around the data products being produced.
The nightly pipelines are based on image subtraction and are designed to rapidly detect interesting transient events in the image stream and send alerts to the community within 60 seconds of completing the image readout. The data release pipelines, in contrast, are intended to produce the most completely analyzed data products of the survey, in particular those that measure very faint objects and cover long time scales. A new run begins each year, processing the entire survey data set available at that time. The data release pipelines consume most of the computing power of the data management system. The calibration products pipeline produces the wide variety of calibration data required by the other pipelines. All of these pipelines are architected to make efficient use of Linux clusters with thousands of nodes.
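The idea behind image-subtraction detection can be illustrated with a toy example: subtract a reference (template) image from a new exposure and flag pixels whose residual exceeds a noise threshold. This is a minimal sketch, not the LSST pipeline. It assumes the two images are already aligned and that the per-pixel noise is known, and it omits steps such as registration and PSF matching that a real pipeline needs. The function name, parameters, and pixel values are all hypothetical.

```python
def detect_transients(science, template, noise_sigma, n_sigma=5.0):
    """Return (row, col, residual) for pixels whose difference from the
    template exceeds n_sigma * noise_sigma in absolute value."""
    threshold = n_sigma * noise_sigma
    detections = []
    for r, (srow, trow) in enumerate(zip(science, template)):
        for c, (s, t) in enumerate(zip(srow, trow)):
            residual = s - t  # difference image value at this pixel
            if abs(residual) > threshold:
                detections.append((r, c, residual))
    return detections


# Hypothetical 3x3 pixel values: the new exposure matches the template
# to within noise except for one pixel that has brightened.
template = [
    [100, 101, 99],
    [100, 100, 102],
    [98, 100, 101],
]
science = [
    [101, 100, 99],
    [100, 160, 101],
    [99, 100, 100],
]

print(detect_transients(science, template, noise_sigma=2.0))  # [(1, 1, 60)]
```

Using the absolute residual flags both brightening and fading sources. Static objects cancel in the difference, which is why this approach suits a stream of repeated exposures of the same field.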
Although the data management facilities will have substantial computing power (~150 TFlops, comparable to the world's most powerful computer in 2004), the continuation of current trends suggests that the system will not even qualify for the TOP500 list by the time of first light in 2014. Hence, while LSST makes novel use of advances in information technology, it does not take the risk of pushing the expected technology to its limit.