News Releases

Fastest disk-to-disk transfer of terabyte-size astronomy data across the Pacific

San Diego, September 28, 2005 - Scientists from the National Center for Data Mining (NCDM) at the University of Illinois at Chicago established a new record for transferring nearly a terabyte of astronomy data, disk-to-disk, across the Pacific, using their UDT protocol. NCDM and its partners achieved this milestone at iGrid 2005, in San Diego, CA.

New network protocols have proved successful at achieving high performance from one computer’s memory to another, but until now it has been a challenge to carry that performance over to data on disk, something required by actual data-intensive applications. Researchers pulled data from disk on one side of the Pacific and wrote it to disk on the other side almost as fast as if the two disks were in the same room, which is a major milestone, according to Robert L. Grossman, Director of the National Center for Data Mining at the University of Illinois at Chicago and Managing Partner of Open Data Partners LLC.

NCDM transferred the entire Release 3 of the Sloan Digital Sky Survey (SDSS) data set (785 Gigabytes compressed) from the iGrid floor to nodes at KISTI in Korea in less than 3.5 hours. The average transfer speed was over 650 Mb/sec and the peak speed was over 1000 Mb/sec. This was the first time that an astronomy data set of this size was transferred this fast across the Pacific. With conventional networks and network protocols, this transfer would not have been practical.
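As a rough check on how the figures above relate to each other, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes decimal units (1 Gigabyte = 10^9 bytes, 1 Mb = 10^6 bits) and a constant sustained rate, neither of which the release specifies, and the function names are made up for this example.

```python
# Back-of-the-envelope check of the transfer figures quoted above.
# Assumptions (not stated in the release): decimal units and a constant rate.

def transfer_seconds(size_gigabytes: float, rate_megabits_per_sec: float) -> float:
    """Seconds needed to move size_gigabytes at a sustained rate."""
    bits = size_gigabytes * 1e9 * 8
    return bits / (rate_megabits_per_sec * 1e6)

def required_rate_mbps(size_gigabytes: float, hours: float) -> float:
    """Average rate in Mb/sec needed to finish within the given number of hours."""
    bits = size_gigabytes * 1e9 * 8
    return bits / (hours * 3600) / 1e6

SDSS_GB = 785  # compressed size of SDSS Release 3, from the release

hours_at_650 = transfer_seconds(SDSS_GB, 650) / 3600
floor_rate = required_rate_mbps(SDSS_GB, 3.5)

print(f"Time at a steady 650 Mb/sec: {hours_at_650:.2f} hours")
print(f"Rate needed to finish in 3.5 hours: {floor_rate:.0f} Mb/sec")
```

Under these assumptions, a steady 650 Mb/sec would move the data set in roughly 2.7 hours, and finishing within 3.5 hours needs an average of roughly 500 Mb/sec, so the two quoted bounds ("over 650 Mb/sec" and "less than 3.5 hours") are consistent with each other.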

Being able to transfer data disk-to-disk across long-distance networks is an open challenge with many real-life applications. Many scientific researchers, even those with access to high-speed networks, are still forced to resort to mailing hard drives full of data in order to share their information due to network issues and drawbacks with current transfer protocols. Using NCDM’s high-speed disk-to-disk transfer technology, information can be shared more quickly, more efficiently, and more often.

Since September, the NCDM has been routinely transferring large data sets of this type across the Pacific to test the robustness of its protocol. iGrid marks the first public showing of such large disk-to-disk transfers of astronomy data at these speeds.

Analyzing Streaming Data at 10 Gb/s
In a related demo, NCDM and its partners set a milestone for performing statistical operations on high-volume data flows. Keeping up with the processing of data flowing over optical networks has been a challenge as the bandwidth of these networks has increased. By layering statistical operations over the NCDM-developed network protocol UDT, researchers were able to process the data as fast as it arrived, that is, at line speed.

The demonstration used four computers connected to iGrid using 10 Gb/s links. Single computers were located in Korea and Japan and two were located in Chicago. In the demonstration, data was pulled from these four disks scattered around the world to create four streams, which were analyzed at iGrid and the results merged, at an average aggregate throughput of over 10 Gb/s, and a peak throughput exceeding 14 Gb/s. The results (a type of statistical summary called a histogram) were displayed and updated in real time.
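The idea of analyzing several streams separately and merging the results can be illustrated with a small Python sketch. It is not NCDM's implementation and contains no UDT or network code: the function names, the fixed bin width, and the sample values are all assumptions made for illustration. It only shows why a histogram suits this kind of streaming analysis: each stream can be summarized one value at a time without storing the data, and the partial summaries combine by simple addition.

```python
# Illustrative sketch only: per-stream histograms that are merged afterwards.
# Names, bin width, and sample data are hypothetical, not from the demonstration.
from collections import Counter
from typing import Iterable, List

def stream_histogram(values: Iterable[float], bin_width: float) -> Counter:
    """Count values per bin, consuming the stream one value at a time."""
    hist: Counter = Counter()
    for v in values:
        hist[int(v // bin_width)] += 1  # bin index = floor(v / bin_width)
    return hist

def merge_histograms(partials: List[Counter]) -> Counter:
    """Combine per-stream histograms by adding the counts bin by bin."""
    merged: Counter = Counter()
    for h in partials:
        merged.update(h)  # Counter.update adds counts rather than replacing them
    return merged

# Four small stand-in "streams", one per source machine in the demonstration.
streams = [
    [0.5, 1.5, 1.7],
    [0.2, 2.9],
    [1.1],
    [2.0, 2.5, 0.9],
]

partials = [stream_histogram(s, bin_width=1.0) for s in streams]
merged = merge_histograms(partials)

for bin_index in sorted(merged):
    print(f"[{bin_index}, {bin_index + 1}): {merged[bin_index]}")
```

Because merging is just addition, the combined histogram can be refreshed whenever any stream's partial counts change, which is one simple way a display could be updated in real time.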

In the past, statistical operations of this type have generally used lower-bandwidth networks and different protocols, and have not usually exceeded 1 Gb/s aggregate throughput.

As these types of networks and protocols become more widely deployed, it will become more common to analyze and monitor different continuous streams of data for a variety of applications, including applications involving weather data, astronomy data, earth science data, and defense data.

UDT
Behind the scenes, the NCDM researchers use a novel data transport protocol called UDT, which they developed and make available as an open source library. It is well known in the high performance computing field that standard TCP significantly under-utilizes the abundant optical bandwidth that is already widely deployed today. The UDT protocol can achieve very high bandwidth utilization while remaining both fair and friendly to co-existing flows. Using UDT, 7 Gb/s pure memory-memory data transfer speed can be achieved between a single pair of machines, which closely approaches the actual hardware limitations.

The NCDM researchers also developed streaming-based data mining algorithms to analyze the data at the end host.

National Center for Data Mining, University of Illinois at Chicago
The National Center for Data Mining (NCDM) at the University of Illinois at Chicago (UIC) was established in 1998 to serve as a national resource for high performance and distributed data mining. NCDM performs research, sponsors data mining standards, operates an international data mining testbed, and performs outreach. NCDM is coordinating the development of the Predictive Model Markup Language (PMML), the standard for data mining models, and sponsoring the Teraflow Testbed, a worldwide testbed for high performance and distributed data mining. For more information about NCDM, see www.ncdm.uic.edu

SDSS
The Sloan Digital Sky Survey is systematically mapping one-quarter of the entire sky, producing images in five colors and determining the positions and brightnesses of more than 200 million celestial objects. The spectroscopic measurements of distances to a million of the nearest galaxies are giving us a three-dimensional picture of the universe within a much larger volume than that explored to date. The SDSS is also recording the distances to 100,000 quasars, among the most distant objects known, giving us unprecedented hints at the distribution of matter to the edge of the visible universe.

The results of the SDSS are available to the scientific community electronically, both as images and as precise database catalogs of all the objects discovered. The Survey also represents a significant increase in scale: the total quantity of information produced is about 15 terabytes.

The SDSS is a joint project of The University of Chicago, Fermilab, the Institute for Advanced Study, the Japan Participation Group, The Johns Hopkins University, the Korean Scientist Group, Los Alamos National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, University of Pittsburgh, University of Portsmouth, Princeton University, the United States Naval Observatory, and the University of Washington. Funding for the creation and distribution of the SDSS Archive has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Aeronautics and Space Administration, the National Science Foundation, the U.S. Department of Energy, the Japanese Monbukagakusho, and the Max Planck Society.

StarLight
StarLight, the optical STAR TAP, is an advanced optical infrastructure and proving ground for network services optimized for high-performance applications. StarLight, funded by the National Science Foundation, is being developed by the Electronic Visualization Laboratory (EVL) at the University of Illinois at Chicago, the International Center for Advanced Internet Research (iCAIR) at Northwestern University, and the Mathematics and Computer Science Division at Argonne National Laboratory, in partnership with Canada’s CANARIE and Holland’s SURFnet. www.startap.net/starlight

Kitakyushu JGNII Research Center
Kitakyushu JGNII Research Center was established in April 2004 as a center for research on the JGNII projects governed by NICT. The center promotes R&D for realizing the Next Generation Internet, with a high-quality infrastructure to support the forthcoming ubiquitous society in a safe and convenient environment.

JGN II is an open testbed network environment for research and development, which was previously operated by the Japan Gigabit Network, and expanded by the National Institute of Information and Communications Technology (NICT) as a new ultra-high-speed testbed network for R&D collaboration between industry, academia, and government. Its aim is to promote a broad spectrum of research and development projects, ranging from fundamental core research and development to advanced experimental testing, in areas including the advancement of network-related technologies for the next generation, and a diverse range of network application technologies.

KISTI, Korea Institute of Science and Technology Information
Korea Institute of Science and Technology Information (KISTI) is a national institute under the supervision of MOST (Ministry Of Science and Technology) and is playing a leading role in building the nationwide infrastructure for knowledge and information by linking supercomputing with the optical research network (KREONet2). KISTI aims to become a leading institution in e-Science, Grid and advanced network technologies.

iGRID2005
The International Grid (iGrid) collaborative event showcases ongoing global collaborations in middleware development and applications research that require high-performance multi-gigabit networks. The iGrids are organized every two or three years by institutions, organizations, consortia and National Research & Education Networks who also participate in the Global Lambda Integrated Facility. Overall planning responsibilities for iGrid 2005 are being handled by the Electronic Visualization Laboratory at the University of Illinois at Chicago and Calit2 at the University of California, San Diego, in cooperation with the Mathematics and Computer Science Division of Argonne National Laboratory, SURFnet, University of Amsterdam, and CANARIE.

Contact:
Shirley Connelly
Associate Director, NCDM
312.413.2176
connelly @ uic.edu

Robert Grossman
Director, NCDM
312.413.2176
grossman @ uic.edu