Data for the People: How to Fast-Track Your Network and Share the Data Love

July 25, 2007

So you’ve used a grid to split up your job, process it faster, then return your results. You now have a nice chunky terabyte of data. What do you do with it? Bob Grossman, Director of the National Center for Data Mining at the University of Illinois at Chicago, U.S., says the answer is share, share, share.

“In terms of impact on society, the ability to use transparently other people’s data is going to be transforming,” Grossman says.

“It is about ‘network effects’,” he continues. “In the same way that a network becomes more interesting as more people join it, you can draw more interesting conclusions about your own data if you put it into the context of other people’s data.”

A fine notion in principle

But how can you get these network-busting bundles of new data to the people who need them?

Simple, says Grossman. You just send them, to everyone and anyone who might like to take a look.

“Our motivation for the last ten years has been to create a web for data, so it’s easy to browse, explore and download it. The system we built, called DataSpace, still controls who can write data, but we encourage anyone in the world to read it.”

Driven by this ultimate goal, Grossman turned his eye to the networks: could they distribute large sets of data across thousands of miles, and all without wasting a second? No, not really, not at all.

Grossman describes the old faithful TCP internet protocol, still going strong after nearly 25 years, as “a huge success story,” but, he says, new versions of TCP just weren’t coming out fast enough to solve his problem in good time.

“It was clear the network would change, but we didn’t want to wait ten years for that to happen. So we built our own infrastructure instead.”

Enter the fast lane

UDT, or User Datagram Protocol (UDP)-based Data Transfer, is the result. Able to shoot data around the world at 10 gigabits per second, UDT compares well with the three or four megabits per second that standard TCP, as it was usually deployed, was achieving. “And if you’re impatient like me…” jokes Grossman, “…I know which one I’d prefer.”

UDT has enjoyed much initial success, winning the annual Bandwidth Challenge held at the SC06 supercomputing conference last November by transporting 1.3 terabytes of Sloan Digital Sky Survey (SDSS) data from Chicago to Florida, with a sustained data transfer rate of eight gigabits per second.

For those keen on a more global challenge, UDT was used just last month to move 1.4 terabytes of SDSS data from Chicago all the way to Moscow. The transfer was complete in about 4.5 hours using a one-gigabit per second link.
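
To put those figures in perspective, here is a rough back-of-the-envelope check of the numbers quoted above. It is only a sketch: it assumes decimal terabytes (10^12 bytes) and treats the quoted times and rates as exact, which the article does not state, and the function names are ours for illustration.

```python
# Rough arithmetic on the transfer figures quoted in the article.
# Assumption: 1 terabyte = 10**12 bytes (decimal), 8 bits per byte.

BITS_PER_TERABYTE = 8 * 10**12


def effective_gbps(terabytes, hours):
    """Average throughput in gigabits per second for a completed transfer."""
    seconds = hours * 3600
    return terabytes * BITS_PER_TERABYTE / seconds / 1e9


def transfer_minutes(terabytes, gbps):
    """Time in minutes to move a dataset at a sustained rate."""
    seconds = terabytes * BITS_PER_TERABYTE / (gbps * 1e9)
    return seconds / 60


# Chicago to Moscow: 1.4 TB in about 4.5 hours over a 1 Gb/s link.
moscow_gbps = effective_gbps(1.4, 4.5)

# SC06 Bandwidth Challenge: 1.3 TB at a sustained 8 Gb/s.
sc06_minutes = transfer_minutes(1.3, 8.0)

print(f"Chicago-Moscow average rate: {moscow_gbps:.2f} Gb/s")
print(f"SC06 transfer time at 8 Gb/s: {sc06_minutes:.1f} minutes")
```

Under these assumptions the Moscow transfer averaged roughly 0.7 gigabits per second, or about 70 percent of the one-gigabit link, and the SC06 dataset would have needed a little over 20 minutes at the sustained rate.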

Even more exciting, UDT is now an option for GridFTP.

This progress points in some interesting directions for Grossman and his team.

“We want to lower the cost of getting hold of other people’s terabytes,” he says. “I want to be able to find out, in just a few minutes, whether someone’s data is going to be useful for my research.”

- Cristy Burne, iSGTW

Source: International Science Grid This Week (iSGTW)