Counting Bytes
Today's large disk drives have capacities measured in hundreds of gigabytes, but LSST will generate terabytes of data every night and eventually store more than 50 petabytes. To keep these numbers straight and give some sense of scale, here is a glossary of storage terms:
- Megabyte (MB) = 10^6 bytes = a Ph.D. thesis' worth of text;
- Gigabyte (GB) = 10^9 bytes = forty (four-drawer) file cabinets full of text, or two compact discs' worth of music;
- Terabyte (TB) = 10^12 bytes = forty thousand file cabinets of text, or a feature film stored in digital form;
- Petabyte (PB) = 10^15 bytes = forty million file cabinets of text, or all of CNN's news footage for five years.
SOFTWARE: THE SOUL OF THE MACHINE
LSST will tile the sky repeatedly with overlapping images of approximately ten square degrees. Each "visit" to a field is a pair of 15-second exposures, and each exposure requires an additional 2 seconds to read the image from the detector. It takes two bytes of data to represent the amount of light falling on each of LSST's 3.2 billion pixels. While the second exposure is being read out, the telescope moves to the next position on the sky, in an average of 5 seconds. Current estimates indicate LSST will create 12.8 gigabytes (GB) of data every 39 seconds, a sustained data rate of 330 megabytes (MB) per second. While such a rate is not unheard of by modern internet standards, it represents a dramatic increase for astronomy. The highest data rate in current astronomical surveys is approximately 4.3 MB per second, in the Sloan Digital Sky Survey (SDSS).
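The arithmetic behind these figures is simple enough to check. The following sketch uses only the numbers quoted above:

```python
# Back-of-envelope check of the LSST data-rate figures quoted in the text.
PIXELS = 3.2e9           # pixels in the LSST focal plane
BYTES_PER_PIXEL = 2      # two bytes (16 bits) per pixel
EXPOSURES_PER_VISIT = 2  # each visit is a pair of 15-second exposures

# 15 s exposure + 2 s readout, twice, plus an average 5 s slew to the next field
visit_seconds = 2 * (15 + 2) + 5   # = 39 s

visit_bytes = PIXELS * BYTES_PER_PIXEL * EXPOSURES_PER_VISIT
rate_mb_per_s = visit_bytes / visit_seconds / 1e6

print(f"{visit_bytes / 1e9:.1f} GB per visit")  # 12.8 GB
print(f"{rate_mb_per_s:.0f} MB/s sustained")    # ~328 MB/s, the ~330 MB/s quoted
```

Over a ten-hour night of back-to-back visits, the same rate yields roughly 12 TB, consistent with the "up to 13 terabytes" cited below.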
Over a ten-hour winter night, LSST will thus collect up to 13 terabytes (million megabytes, TB) of 16-bit image data. While this seems a daunting amount of data to process, examine, store, and disseminate, its magnitude is not unprecedented. A feature-length High-Definition Television (HDTV) movie, before editing, requires several terabytes to store in raw form; by the time LSST is in operation, Hollywood and others will routinely be dealing in similar amounts of data!
The data reduction and analysis for LSST will be done in a way unlike that of most current observing programs. The data from each visit will be analyzed and new sources detected in the minute before the next pair of exposures is ready. This will allow interrupting the normal schedule of operations to follow any new, rapidly-varying events as they occur. It will also allow nearly-instantaneous notification to other observing resources such as radio and infrared telescopes and X-ray and gamma-ray observatories in space.
As each image becomes available, it will be corrected for geometric distortions and any small variations in sensitivity across the detectors. Ambient light from the night sky will be removed. The image will then be added to data previously collected from the same location in the sky to build up a very deep master image. The collection of these master images will become a key data product of LSST: a very deep map of the entire sky visible from its remote mountain site.
The master image will also be subtracted from each individual image as it comes in. The result will be an image which contains only the difference between the sky at that time and its average state: a picture containing only what has changed. Objects in this difference image will be classified according to their appearance and, by looking into a database of all previous classifications and images, according to their evolution in time. These data, and the individual exposures themselves, will then be added to the database.
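The calibrate, coadd, and subtract loop described in the two paragraphs above can be sketched in a few lines. This is a toy illustration with invented variable names; a real pipeline must also correct geometric distortion and match the atmospheric blurring (the point-spread function) between images before subtracting, both omitted here:

```python
import numpy as np

def process_visit(raw, flat, sky_level, master, n_prior):
    """Toy version of the per-visit processing described in the text.

    raw       : newly read-out image (2-D array of counts)
    flat      : sensitivity map correcting pixel-to-pixel variations
    sky_level : ambient night-sky background to remove
    master    : deep coadd of all previous visits to this field
    n_prior   : number of visits already folded into `master`
    """
    # Correct sensitivity variations and remove ambient sky light
    calibrated = raw / flat - sky_level
    # The difference image contains only what changed relative to the average sky
    difference = calibrated - master
    # Fold the new image into the deep master image (a running mean)
    new_master = (master * n_prior + calibrated) / (n_prior + 1)
    return calibrated, new_master, difference
```

In the real system, objects found in `difference` would then be classified against the database of prior detections, and both the catalog entries and the individual exposure would be archived.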
Quality control will be an important aspect of data processing. Major problems will be relatively simple to diagnose automatically from the data stream. These include the effects of atmospheric blurring and of the mechanical and electronic health of the system. Experience with current automated surveys, which can be seen as precursor projects to LSST, has shown that such simple measures are not enough to ensure that data quality remains at the highest possible level. Subtle problems manifest themselves only through using the data to do science. Rather than playing the passive role of providing data to the community, the LSST team will engage in several key scientific projects to guarantee data quality.
A single exposure will detect sources at 24th magnitude. This is much fainter than the faintest sources detectable on the photographic plates, exposed for many hours on the Mt. Wilson 100-inch telescope, used by Edwin Hubble to discover the expansion of the universe. At this level of brightness, the most common objects in the sky are not stars but galaxies — 60,000 of them per square degree on the sky. In one pass across the visible sky (20,000 square degrees, or about three nights of observation), LSST will detect and classify 840 million persistent sources. Over time, LSST will survey 31,000 square degrees. By adding together the first five years of data, the all-sky map will reach 27th magnitude, and its database will contain over three billion sources, not counting transient events.
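The jump from 24th magnitude in a single visit to 27th magnitude in the five-year map follows from how stacked images gain depth. Assuming background-limited imaging, where noise averages down as the square root of the number of exposures, the limiting magnitude improves by 1.25 log10(N); the visit count below is an illustrative round number, not a figure from the text:

```python
import math

def coadd_depth(single_visit_mag, n_visits):
    """Limiting magnitude of a stack of N equal exposures.

    Assumes background-limited imaging: noise falls as sqrt(N), so the
    depth gain is 2.5 * log10(sqrt(N)) = 1.25 * log10(N) magnitudes.
    """
    return single_visit_mag + 1.25 * math.log10(n_visits)

# A single visit reaches 24th magnitude; on the order of 250 visits per
# field over five years would push the coadd to about 27th magnitude.
print(round(coadd_depth(24.0, 250), 1))  # 27.0
```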
Information about the color of each source allows, for example, an estimate of the distance to each galaxy, or of the mass and evolutionary state of most stars. LSST will provide this data by observing in five colors, using filters in front of the camera. Further properties such as brightness, size, orientation and shape will be measured for each object in each color, allowing a much more detailed object classification. If 100 parameters are measured, after a single pass over the entire sky, the database will contain about 150 TB of data. In order to study change, however, such data will be retained from each pass over the sky, leading to over 5 PB of classification data in five years.
In addition to this object database, the individual images will themselves be retained, in an image database of over 150 TB for each individual pass, or 30 petabytes (PB, a thousand million megabytes) in five years. The image database will be a movie of the entire sky visible from the site of LSST...true cosmic cinematography.
Changes discovered by image subtraction will be compared against the database of known objects, allowing the type of change to be classified. Is this a new source? If not, how is it changing? Notification of specific types of events will automatically be sent to a variety of research programs, some of them automated in themselves. For example, a new source brightening over a period of a few days with a particular color signature will be identified as one of the several hundred thousand supernovae LSST is expected to discover each year.
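The classification step described above — is the change a new source, and does its rise time and color match a known event type? — can be caricatured as a rule table. Every threshold below is invented for illustration; the real pipeline would consult the full database of prior classifications and light curves:

```python
def classify_transient(is_new_source, rise_days, color_index):
    """Toy rule-based classifier for difference-image detections.

    is_new_source : True if nothing was previously catalogued here
    rise_days     : days the source took to brighten
    color_index   : a simple color measure (all thresholds invented)
    """
    if not is_new_source:
        return "known variable"
    # A new source brightening over a few days with a supernova-like color
    if 2 <= rise_days <= 20 and -0.5 <= color_index <= 0.5:
        return "supernova candidate"
    return "unclassified transient"
```

An event tagged "supernova candidate" would trigger the automatic notifications to other observatories mentioned earlier.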
If a series of new sources can be recognized by software as a single object moving across the sky, it will be tagged as a potential solar system object. New and archived data for these objects will be combined and a preliminary orbit determined. If the orbit satisfies certain criteria, the object will be classified as a Near-Earth Object and the orbital data will be sent on automatically to several projects currently in progress, which will assess the risk it may pose to Earth.
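The linking step — deciding whether a handful of detections is one object in motion — can be sketched as a consistency check against constant motion on the sky. This is a stand-in for illustration only; real orbit determination fits Keplerian orbits, not straight lines, and the tolerance below is invented:

```python
def is_linear_track(detections, tol=1e-3):
    """Test whether (time, ra, dec) detections fit a single object
    moving at a constant rate across the sky.

    detections : list of (time_days, ra_deg, dec_deg), time-ordered
    tol        : allowed positional residual in degrees (invented)
    """
    (t0, ra0, dec0), (t1, ra1, dec1) = detections[0], detections[-1]
    span = t1 - t0
    ra_rate = (ra1 - ra0) / span    # degrees per day
    dec_rate = (dec1 - dec0) / span
    # Every intermediate detection must lie near the predicted position
    for t, ra, dec in detections[1:-1]:
        dt = t - t0
        if abs(ra0 + ra_rate * dt - ra) > tol or abs(dec0 + dec_rate * dt - dec) > tol:
            return False
    return True
```

A set of detections passing this test would be handed to an orbit-fitting stage, and the resulting orbit checked against the Near-Earth Object criteria.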
The object and image databases themselves will become a powerful tool for observational astronomy. One will be able to ask new questions and perform new surveys without needing to perform new observations. By retaining a time-dependent picture of the whole sky, one need not anticipate every sort of change to be discovered before observations are made. LSST will make unusual events commonplace and the rarest of events observable. Such "data mining" allows exploiting LSST to its fullest potential, but implementing the software to accomplish this stands as a daunting challenge.
THE ABILITY TO DO FAST analyses on petabytes of data will revolutionize how we detect faint moving objects or probe the underlying dark mass-energy of our universe. Weak gravitational lensing, the deflection of light by intervening clumps of dark matter, causes distortions in the observed shapes of galaxies. LSST's high throughput and multiple short exposures will enable unprecedented control and rejection of systematic errors in image shape distortion. These data may then be processed to yield a mass map of the intervening universe. Closer to home, potentially devastating near-Earth objects now go undetected. New techniques of extracting relevant source parameters can be used on the imaging data to automatically find such objects. Similar image-probing techniques are important to other areas of science as well (satellite observations, biology, oceanography, etc.), and the software tools developed to mine the LSST data resource will find wide application.
The high data rate, combined with the need for real-time analysis and later data exploration, requires a fresh approach, making use of the best technology and developing innovative software for optimal data management. While much headway can be made in efficient algorithms and associated software, there will also be hardware challenges in processing and storing this much data. It will be particularly effective to have data analysis innovations in place when the telescope and camera systems are first put into use.
Current projects show that approximately 5000 mathematical operations are required per pixel of the image to process and classify survey data. Scaling this to the size of the LSST data stream shows that approximately a thousand of today's high-end processors will be required — a feasible proposition. Advances in processor power over the next five years will reduce this number to a few hundred, by which time the required LSST computer system will seem quite pedestrian. Storing this data is also well within the reach of even today's technology. At current prices, a one-petabyte disk storage system costs less than $1 million; in five years this price should drop to well below $100,000. Keeping all of the LSST data online will certainly be affordable.
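The processor count quoted above follows directly from the data rate. The per-processor throughput below is an assumed illustrative figure (roughly one billion operations per second for a high-end processor of the era), not a number from the text:

```python
OPS_PER_PIXEL = 5_000         # from experience with current survey pipelines
PIXELS_PER_VISIT = 3.2e9 * 2  # two full-detector exposures per visit
VISIT_SECONDS = 39            # one visit every 39 seconds

# Assumed sustained throughput of one "high-end" processor (~1 GFLOPS).
OPS_PER_PROCESSOR = 1e9

required_ops_per_s = OPS_PER_PIXEL * PIXELS_PER_VISIT / VISIT_SECONDS
processors = required_ops_per_s / OPS_PER_PROCESSOR
print(f"{processors:.0f} processors")  # 821 — "approximately a thousand"
```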
More interesting challenges are presented by data mining. We now need to discover ways to search for correlations in such a massive database, an ability which will be key to extracting unanticipated science. While the software required by LSST science programs presents challenges, assuring opportunity for unanticipated science using such huge databases presents far greater ones. Designing optimal data-handling and search routines will be an exciting aspect of this project, for many science programs may need access to the full imaging data archive. One example of this is a search for what appear to be collections of faint point sources of light but which in reality are all part of a single, extended but low-surface-brightness object. Another example is the search for patterns in the appearance of objects transient in time and space.
THE GOAL OF THE LSST project is to make all of the data available to anyone who is interested, anywhere in the world. How much of the data is interesting to how many is a question which must guide the way data is distributed. The overwhelming majority of users will probably not be professional astronomers. They will be interested in browsing deep color images, the most recently acquired images, an all-sky map, or what changed last night — this could range from several GB to 100 TB of data. They will not have the very high-speed internet access available to research institutions, so they will need to use tools designed to browse the sky at low resolution before "zooming in" on a particularly interesting area. LSST can accommodate these users by deploying one or more large but otherwise conventional web sites. Most research applications of the LSST databases will likewise require relatively small amounts of data at a time. Searching catalogs of objects and sporadic downloads of images by the professional community can also be served by web-based access to one or more LSST data centers and will become the cornerstone of the National Virtual Observatory, a project to make all astronomical data widely available.
When a project requires data more rapidly than internet access can provide, "sneaker net" — writing data to disks and physically shipping them to the project — may provide the most cost-effective solution. For example, an astronomy department may wish to have a copy of the databases and the deep all-sky map for local use. Planetariums might wish to keep large quantities of data on-site for use in developing exhibits and shows. Storing such data sets would require disk space costing a few tens of thousands of dollars. This would be a small fraction of the cost associated with providing the computing power necessary to make use of the information. Large institutions, or even countries, might provide their scientists access to copies of the entire data set. Projects which require access to all of the data at once will almost certainly also require significant computational resources to achieve their goals. For these projects, the cost of acquiring a complete copy will be quite small compared with their overall budget.
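The case for "sneaker net" is easy to quantify. In the comparison below, both the network speed and the shipping time are assumed round numbers for illustration, not figures from the text:

```python
# When does shipping disks beat the network for a petabyte?
DATA_BYTES = 1e15        # one petabyte
LINK_BITS_PER_S = 1e9    # a 1 Gb/s research-network link (assumed)
SHIPPING_DAYS = 3        # freight time for a crate of disks (assumed)

network_days = DATA_BYTES * 8 / LINK_BITS_PER_S / 86_400
print(f"network: {network_days:.0f} days, shipping: {SHIPPING_DAYS} days")
# network: 93 days, shipping: 3 days
```

Even on a fast research link, moving a petabyte takes months; a crate of disks arrives in days.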
Finally, one might envision an investigation which requires rapid access to new data as well as the data set as a whole. A small number of these projects will be accommodated at the LSST data center. Investigators will be able to bring their own computers to the center and tap into the primary LSST data stream. Space for this will be limited, so a national committee will judge these projects competitively based on their scientific merit. This is the way limited telescope resources are allocated today. The only difference is that the sky will be down here on Earth and the telescope will be the data connection.
Achieving this will require the efforts not only of astronomers but also of experts on statistics and algorithm development, computer science, and data mining and visualization. The effort invested in software, data system design, tools for visualizing and analyzing data, and, of course, making sense of the data, may be comparable to that spent on the telescope hardware itself.