Innovation / Petascale Data Challenges
Astronomy is undergoing a revolution in the way we probe the universe and the way we answer fundamental questions. New technology enables this: novel detectors are opening new windows on the universe and creating unprecedented volumes of high-quality data, and computing technology is keeping up with this explosion. In turn, this is driving a shift in the way science is produced in astronomy and astrophysics: huge surveys of the sky across a wide range of wavelengths can be analyzed statistically for low-level correlations, and inverse problems may be solved by statistical inversion, producing new understanding of the underlying physics. LSST is the lighthouse project in this revolution, and solutions to LSST’s challenges are already having spin-off effects in broader areas of technology and “big data” science.
The realization of the LSST involves extraordinary engineering and technological challenges: the fabrication of large, high-precision aspheric optics; construction of a huge, highly integrated array of sensitive, wide-band imaging sensors; and the operation of a data management facility handling tens of terabytes of data each day. The design and development effort includes structural, thermal, and optical analyses of all key hardware subsystems, prototyping and development of data management systems, and extensive systems engineering studies. To validate system performance, full end-to-end simulations are being performed. Over 100 technical personnel at a range of institutions are currently engaged in this program. LSST R&D has led to a new-generation imaging CCD that is highly segmented, low noise, and sensitive from the UV to the near IR. The rapid cadence of the LSST observing program will produce about 30 TB per night, leading to a total database over the ten years of operations of 60 PB for the raw data and 30 PB for the catalog database. The total data volume after processing will be over one hundred PB, processed using 250 TFlops of computing power. Processing such a large volume of data, converting the raw images into a faithful representation of the universe, assessing data quality automatically, and archiving the results in a useful form for a broad community of users together constitute a major challenge.
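The raw-data figures quoted above can be cross-checked with a little arithmetic. The following is a minimal back-of-envelope sketch, not an official LSST sizing model: it assumes decimal units (1 PB = 1000 TB) and treats the quoted 30 TB per night and 60 PB totals as exact, although they are clearly rounded. The function and variable names are purely illustrative.

```python
# Back-of-envelope consistency check on the quoted raw-data volumes.
# Assumption: decimal units, 1 PB = 1000 TB. The inputs are the rounded
# figures quoted in the text, so the result is only approximate.

TB_PER_PB = 1000

RAW_TB_PER_NIGHT = 30   # quoted nightly data rate
RAW_TOTAL_PB = 60       # quoted ten-year raw-data total
SURVEY_YEARS = 10       # quoted duration of operations


def implied_observing_nights(total_pb: float, tb_per_night: float) -> float:
    """Number of nights at the nightly rate needed to reach the total."""
    return total_pb * TB_PER_PB / tb_per_night


nights = implied_observing_nights(RAW_TOTAL_PB, RAW_TB_PER_NIGHT)
nights_per_year = nights / SURVEY_YEARS

print(f"Implied observing nights over the survey: {nights:.0f}")
print(f"Implied observing nights per year: {nights_per_year:.0f}")
```

Under these assumptions, the two quoted numbers are mutually consistent only if the full nightly rate is reached on a subset of calendar nights, which is plausible for a ground-based survey that loses time to weather and maintenance.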
The linkage between data-driven modeling and discovery has entered a new paradigm. The acquisition of scientific data in all disciplines is now accelerating and causing a nearly insurmountable data avalanche. It is no longer possible for humans to look at any representative fraction of the data. Instead, we may be looking over the shoulders of assisted learning machines at innovative visualizations of metadata. Discoveries will be made via searches for correlations. The role of the experimental scientist is increasingly that of an inventor of ambitious new searches and new algorithms. Novel theories of nature are tested by searching for the predicted statistical relationships across large databases. With this accelerated advance in data generation capability, we will require novel, increasingly automated, and increasingly effective scientific knowledge discovery systems.
The LSST scientific database will include:
- Over 100 database tables
- Image metadata consisting of 700 million rows
- A source catalog with 3 trillion rows
- An object catalog with 20 billion rows, each with 200+ attributes
- A moving object catalog with 10 million rows
- A variable object catalog with 100 million rows
- An alerts catalog, with alerts issued worldwide within 60 seconds
- Calibration, configuration, processing, and provenance metadata
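To give a feel for the relative scale of the catalogs listed above, the row counts can be tallied directly. This is a rough sketch only: the dictionary keys are hypothetical labels rather than actual LSST table names, the row counts are the rounded figures from the list, and "200+ attributes" is treated as exactly 200, so the attribute count is a lower bound.

```python
# Tally of the quoted catalog row counts.
# Keys are illustrative labels, not real LSST table names.
CATALOG_ROWS = {
    "image_metadata": 700_000_000,
    "source": 3_000_000_000_000,
    "object": 20_000_000_000,
    "moving_object": 10_000_000,
    "variable_object": 100_000_000,
}

# Assumption: "200+ attributes" taken as 200, giving a lower bound.
OBJECT_ATTRIBUTES_MIN = 200

total_rows = sum(CATALOG_ROWS.values())
largest_catalog = max(CATALOG_ROWS, key=CATALOG_ROWS.get)
source_fraction = CATALOG_ROWS["source"] / total_rows

# Minimum number of individual attribute values in the object catalog.
object_values_min = CATALOG_ROWS["object"] * OBJECT_ATTRIBUTES_MIN

print(f"Total rows across listed catalogs: {total_rows:,}")
print(f"Largest catalog: {largest_catalog} ({source_fraction:.1%} of rows)")
print(f"Object catalog attribute values (lower bound): {object_values_min:,}")
```

The tally makes one design consequence visible: by row count, the source catalog dominates everything else, so any query or storage strategy is likely to be driven by that single table.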
The science archive will consist of 400,000 sixteen-megapixel images per night (for 10 years), comprising 60 PB of pixel data. This enormous LSST data archive and object database enables a diverse multidisciplinary research program: astronomy & astrophysics; machine learning (data mining); exploratory data analysis; extremely large databases; scientific visualization; computational science & distributed computing; and inquiry-based science education (using data in the classroom). Many possible scientific data mining use cases are anticipated with this database. The advances in these technology areas will be exported to other big data science applications (biology, remote sensing, etc.) and will drive innovations in industry. A collaboration is already forming between industry and LSST on the design of extremely large databases.