October 2012  •  Volume 5 Number 2

The LSST Data Avalanche: Astroinformatics Rises to the Challenge

Unlike previous articles in this series, this E-News article is not based on a chapter from the LSST Science Book. LSST formed the Informatics and Statistics Science Collaboration in 2009. Kirk D. Borne is the Chair of the collaboration. The members of the Collaboration are listed at the end of the article.

LSST opens the world of data-intensive astronomy, requiring skills in the area of computational and data sciences in order to maximize the opportunities for knowledge. (Graphic: Emily Acosta, LSST)  

Every night for 10 years, LSST will obtain approximately 2,000 images of the sky with its 3-billion-pixel camera, which corresponds to about 15 terabytes of data per night. As the survey progresses, researchers will have hundreds of petabytes of data to access, analyze, and interpret. Adjectives such as “flood,” “avalanche,” “fire hose,” and “big data” are used to describe this onslaught. One of the major questions facing LSST scientists and engineers is how to handle the large and complex data collection that LSST will generate. The Informatics and Statistics Science Collaboration is researching the science and engineering of this challenge. To keep up with the flood of data, researchers will need to develop more powerful algorithms, methodologies, and approaches. Rising to the challenge will enable scientists to undertake new modes of discovery, where data-driven, data-rich science goes beyond traditional science.
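
As a rough check on these figures, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: the 2 bytes per pixel and the 365 observing nights per year are illustrative assumptions, not numbers from the article, and decimal units (1 TB = 10^12 bytes) are used throughout.

```python
# Back-of-the-envelope data-volume estimate from the figures quoted above.
IMAGES_PER_NIGHT = 2_000      # from the article (approximate)
PIXELS_PER_IMAGE = 3e9        # from the article ("3-billion pixel camera")
TB_PER_NIGHT_QUOTED = 15      # from the article
SURVEY_YEARS = 10             # from the article
BYTES_PER_PIXEL = 2           # ASSUMPTION: 16-bit raw pixels
NIGHTS_PER_YEAR = 365         # ASSUMPTION: observing every night


def raw_pixel_tb_per_night(images, pixels, bytes_per_pixel):
    """Raw pixel volume per night, in terabytes (10**12 bytes)."""
    return images * pixels * bytes_per_pixel / 1e12


def cumulative_pb(tb_per_night, nights_per_year, years):
    """Total volume over the survey, in petabytes (1 PB = 1000 TB)."""
    return tb_per_night * nights_per_year * years / 1000


nightly_raw_tb = raw_pixel_tb_per_night(
    IMAGES_PER_NIGHT, PIXELS_PER_IMAGE, BYTES_PER_PIXEL
)
survey_pb = cumulative_pb(TB_PER_NIGHT_QUOTED, NIGHTS_PER_YEAR, SURVEY_YEARS)

print(f"Raw pixels per night: {nightly_raw_tb:.1f} TB")
print(f"Ten years at {TB_PER_NIGHT_QUOTED} TB/night: {survey_pb:.2f} PB")
```

Under these assumptions the raw pixels alone come to about 12 TB per night, the same order as the quoted 15 TB, and ten years at 15 TB per night accumulates roughly 55 PB. The "hundreds of petabytes" figure is larger, presumably because it also counts processed images and catalogs, which this sketch does not model.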

This new “big data” isn’t limited to large astronomy surveys. The growth of data volumes in nearly all scientific disciplines, business sectors, and government is swamping our ability to gain useful insights and understanding from the data in an efficient or effective way. How are we going to access, retrieve, interpret, analyze, mine, integrate, and visualize massive quantities of data? The answer is the informatics approach: the use of digital data, information, and related services for research and knowledge generation [D.N. Baker, EOS 89 (2008)]. Researchers will use the discipline of informatics, or more specifically, astroinformatics, to organize, explore, visualize, and mine the LSST data for new astronomical discoveries. A data-driven revolution in science is underway.

Astroinformatics encompasses a set of naturally related specialties including data organization, data descriptions, astronomical classification taxonomies, astronomical concept ontologies, data mining, visualization, and statistics. The accompanying cyberinfrastructure includes databases, virtual observatories (distributed data), high-performance computing (clusters and petascale machines), distributed computing (the Grid, the Cloud, and peer-to-peer networks), intelligent search and discovery tools, and innovative visualization environments.

Astroinformatics will allow data integration, data mining, and knowledge discovery across heterogeneous massive data collections. It will allow re-use and re-purposing of archival data for new projects, integration of data within different contexts, literature linkages, classification of objects, quantitative scoring of classifications, discovery of “interesting” objects and new classes of objects, development of an astronomical “genome,” and employment of data in educational settings, among other uses. According to Borne, “We are not just using more data; qualitatively different methods for doing science with big data are required. It’s a revolutionary new way to do science.”

Borne sees a wide variety of data mining and statistics use cases for the LSST data collection. These include:

  • Provide rapid probabilistic classifications for millions of events each night;
  • Find new multivariate correlations and associations in high-dimensional astronomical parameter spaces (on the order of 1,000 attributes);
  • Discover voids in these high-dimensional parameter spaces, for example, period gaps;
  • Discover new and exotic classes and subclasses of objects and astrophysical processes, along with new properties of known classes;
  • Discover new and improved rules for classifying known classes of objects;
  • Identify novel, unexpected behavior in the time domain from time series data;
  • Hypothesis testing – verify existing (or generate new) astronomical hypotheses with strong statistical confidence, using millions of training samples;
  • Serendipity – discover the rare one-in-a-billion type of objects through outlier detection, which Borne calls “Surprise Discovery” algorithms;
  • Quality Assurance – identify data pipeline processing errors through deviation detection.
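
To make the outlier-detection use case concrete, here is a minimal sketch in Python. It is not Borne's "Surprise Discovery" algorithm or any LSST pipeline code; it swaps in a generic k-nearest-neighbor distance score, and the function names, the synthetic two-attribute "catalog," and the planted anomaly are all hypothetical, chosen only for illustration.

```python
import math
import random


def knn_outlier_scores(points, k=3):
    """Score each point by its mean distance to its k nearest neighbors.

    Points in sparsely populated regions of parameter space get high scores.
    This brute-force O(n^2) version is for illustration only.
    """
    if k >= len(points):
        raise ValueError("k must be smaller than the number of points")
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def top_outliers(points, k=3, n=1):
    """Return the indices of the n highest-scoring (most isolated) points."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical catalog: 200 "ordinary" objects described by two attributes,
# plus one planted anomaly far from the rest (index 200).
rng = random.Random(0)
catalog = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
catalog.append((8.0, 8.0))

surprise = top_outliers(catalog, k=3, n=1)
print("Most isolated object index:", surprise[0])
```

The same scoring idea applies to the quality-assurance case, where an unusually isolated measurement may indicate a processing error rather than a new kind of object. At the scale described in this article, a brute-force pairwise comparison would be impractical, so a real system would need indexed or approximate neighbor searches.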

The landscape of astronomical research is changing rapidly. With powerful statistical and informatics methods and the advent of large surveys and massive data collections, astronomers will be able to meet the massive data-to-knowledge challenges of LSST and to discover the unknown unknowns at an unprecedented rate.

For more information:

K.D. Borne (2006) Data-Driven Discovery through e-Science Technologies. 2nd IEEE International Conference on Space Mission Challenges for Information Technology (SMC-IT’06).

K.D. Borne and T. Eastman (2006) Collaborative Knowledge Sharing for E-Science. AAAI Workshop on the Semantic Web for Collaborative Knowledge Acquisition, 104-105.

K.D. Borne (2010) Astroinformatics: Data-Oriented Astronomy Research and Education. Journal of Earth Science Informatics, 3, 5-17.

R. McKercher and S. Jacoby (2011) LSST Key Player in Sea Change of Data Availability. E-News, 4 (2).

Article written by Anna H. Spitz and Kirk D. Borne


