August 24, 2001
Two weeks ago, the NSF announced a plan to link computers in four major
research centers with a comprehensive infrastructure called the TeraGrid.
The project will create the world's first multi-site computer facility, the
Distributed Terascale Facility (DTF). NCSA director Dan Reed agreed to
answer some questions for HPCwire
concerning the purpose and promise of the DTF.
HPCwire: How long has the DTF project been in development? How and by whom
was the plan developed?
REED: The DTF and the TeraGrid build on the information infrastructure PACI
was created to develop and deploy (e.g., Grid software, scalable commodity
clusters, software tools, visualization and data management, and community
application codes). The DTF TeraGrid plan was developed jointly by people
from NCSA, SDSC, Argonne, and Caltech who have been involved in the NSF
PACI program since 1997. As such, it is a natural outgrowth of the Grid
computing vision we have been developing over the past four years.
HPCwire: What specific Grand Challenge questions is the DTF being created
to address?
REED: The DTF does not target a fixed set of applications. Rather, the size
and scope of the DTF will enable scientists and engineers to address a
broad range of compute-intensive and data-intensive problems, with resources
allocated through national peer review. However, there are many
exemplars of expected use.
For example, the MIMD Lattice Computation (MILC) collaboration is a
multi-institutional group that studies lattice quantum chromodynamics (QCD). Worldwide, MILC uses
more than two million processor hours per year. The MILC collaboration both
tests QCD theory and helps interpret experiments at high-energy
accelerators. At present, the MILC code's fastest measured single
processor performance is on NCSA's Itanium Linux cluster. Parallel
molecular dynamics codes like NAMD are designed for high-performance
simulation of large biomolecular systems. Such codes can predict structure
and binding energies, determine optimal transition paths, and examine
free energies of transitions. Other scientific areas that will benefit from
use of the DTF systems and the TeraGrid include:
* The study of cosmological dark matter using Tree-Particle-Mesh (TPM)
N-body codes
* Higher-resolution, more timely weather forecasts. For example, the Weather
Research and Forecast (WRF) model will advance weather prediction, making it
possible to predict weather patterns more accurately on a 1-kilometer scale.
* Biomolecular electrostatics: The DTF will provide the resources for detailed
investigation of the assembly and function of microtubule and ribosomal
complexes using new "parallel focusing" algorithms for fast elucidation of
biomolecular electrostatics on parallel systems.
Also, the DTF will enable a new class of data-intensive applications that
couple data collection through scientific instruments with data analysis to
create new knowledge and digital libraries. Targeted data-intensive
applications will include the LIGO gravitational wave experiments, the
proposed National Virtual Observatory (NVO), the ATLAS and CMS detectors at
the LHC, and other NSF Major Research Equipment (MRE) projects such as NEES.
HPCwire: What projects will constitute NCSA's prime focus? Which industrial
partners will be cooperating? What will be the most concrete long-term
benefits?
REED: NCSA and its Alliance partners have been leaders in Grid software and
cluster computing systems. The Itanium Processor Family Linux clusters at
the core of the DTF are based on ideas and experiences with NCSA's
large-scale IA-32 and Itanium Linux clusters. The DTF's Grid software and
tools build on ideas and infrastructure developed by Argonne and USC-ISI.
Intel and IBM are close collaborators on microprocessors and compilers,
clusters, and Grid software. Qwest is partnering with the DTF consortium on
wide-area networking. NCSA's industrial partners will also continue to work
with NCSA on new technologies and their applications.
HPCwire: Is the DTF itself significantly scalable? To what extent? Are
there currently plans to add centers to the DTF?
REED: We believe the DTF will be the backbone for a national Grid of
interconnected facilities. Just as the early ARPAnet and NSFnet anchored
the Internet, the DTF TeraGrid will anchor the creation of a national and
international Grid of shared data archives, computing facilities, and
scientific instruments. Concretely, the DTF will provide a resource that
scales from the desktop all the way to the 13.6 aggregate teraflops of the
DTF clusters. This means that researchers will be able to port their work
easily from their own PCs or small clusters to our large systems.
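(A rough, hypothetical illustration of that portability: the sketch below
uses mpi4py, a modern Python MPI binding rather than any DTF-specific
software, and the same program runs unchanged whether it is launched with a
few processes on a desktop PC or with thousands on a large cluster.)

    # Hypothetical sketch, not DTF software: the same message-passing program
    # scales from a laptop to a cluster; only the process count changes at
    # launch time.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Each process sums its own slice of the work; results are combined on rank 0.
    local_sum = sum(range(rank, 1_000_000, size))
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)

    if rank == 0:
        print(f"total = {total}, computed with {size} process(es)")

Launching this with "mpiexec -n 2" on a desktop or "mpiexec -n 1024" on a
cluster requires no source changes.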
HPCwire: In terms of both the computing systems being integrated and the
optical network itself, how much existing hardware and technology is being
used and how much is being built from the ground up?
REED: The 13.6 TF DTF computing system will include 11.6 teraflops of
computing purchased through the NSF DTF agreement and two 1-teraflop
Linux clusters already on the floor at NCSA. The NSF Cooperative Agreement
with NCSA paid for the latter clusters, and they will be integrated into
the DTF system. We expect to add more cluster capability to the NCSA system
in the coming years. The DTF network will be built by Qwest in cooperation
with the four DTF partners. It will connect to Abilene, to international
networks via STAR TAP, and to the Illinois and California research
communities via I-WIRE and CalREN-2, respectively.
HPCwire: The TeraGrid will use Linux across Abilene, STAR TAP, and CalREN-2.
What principal measures will be implemented to maintain security throughout
such a heterogeneous open-source environment?
REED: We will leverage the Globus public-key Grid Security Infrastructure
(GSI) for integrated TeraGrid security. This incorporates PACI-operated
security infrastructure, including Certificate Authorities (CAs),
certificate repositories for portal users, and revocation mechanisms, as
well as GSI-enabled interfaces to DTF resources, client applications, and
libraries. We will also build on the Globus Community Authorization Service
(CAS) for community-based access control to manage access to data, compute,
network, and other resources.
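(To make the community-based model concrete, here is a minimal, hypothetical
sketch; it is not Globus or CAS code, and the certificate subjects, rights,
and resource names are invented for illustration. A community policy grants
each member certain rights, and a resource honors an action only if the
community has delegated the matching right to the requesting user.)

    # Hypothetical sketch of community-based access control in the spirit of CAS.
    # Not actual Globus code; all names and rights below are invented.

    # The community policy maps a certificate subject to the rights the
    # community has delegated to that member.
    COMMUNITY_POLICY = {
        "/O=Grid/OU=Alliance/CN=Alice Researcher": {"read-data", "submit-job"},
        "/O=Grid/OU=Alliance/CN=Bob Student": {"read-data"},
    }

    # Each resource maps the actions it supports to the community right required.
    RESOURCE_ACL = {
        "itanium-cluster": {"submit-job": "submit-job", "stage-data": "read-data"},
    }

    def authorize(cert_subject, resource, action):
        """Return True if the community has delegated the right needed for
        this action on this resource to the given certificate subject."""
        required = RESOURCE_ACL.get(resource, {}).get(action)
        if required is None:
            return False  # the resource does not expose this action
        return required in COMMUNITY_POLICY.get(cert_subject, set())

    if __name__ == "__main__":
        print(authorize("/O=Grid/OU=Alliance/CN=Alice Researcher",
                        "itanium-cluster", "submit-job"))   # True
        print(authorize("/O=Grid/OU=Alliance/CN=Bob Student",
                        "itanium-cluster", "submit-job"))    # False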
HPCwire: Judging by the news releases, strategic administration of the
TeraGrid is as distributed as its resources. How will critical operational
policy directions be determined, and what is your role in that process?
REED: We will establish a TeraGrid Operations Center (TOC) that will
leverage elements of the operations centers at NCSA, SDSC, Argonne, and
Caltech. The
TOC will establish a set of policies that guide the TeraGrid's operation,
usage, and technology transfer. Operationally, TOC staff will provide 24x7
and online support for the TeraGrid, deploying automated monitoring tools
for verifying TeraGrid performance and coordinating distributed hardware
and software upgrades. All of the principals (Berman, Foster, Messina,
Stevens, and Reed) will work collaboratively as a team to establish
coordinated policies. I will serve as the TeraGrid Chief Architect, charged
with providing advice and guidance on technical directions related to
clusters, networks, and technologies, and on new opportunities for the DTF
and its evolution.
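(As a rough illustration of the automated monitoring the TOC describes, the
hypothetical sketch below runs a simple connectivity check against each
partner site; the host names, port, and timeout are placeholders, not actual
TeraGrid endpoints or TOC tooling.)

    # Hypothetical cross-site health check; hosts, port, and timeout are
    # placeholders, not real TeraGrid endpoints.
    import socket
    import time

    SITES = {
        "login.ncsa.example.org": 22,
        "login.sdsc.example.org": 22,
        "login.anl.example.org": 22,
        "login.caltech.example.org": 22,
    }
    TIMEOUT_SECONDS = 5.0

    def check_site(host, port):
        """Attempt a TCP connection; report (reachable, elapsed seconds)."""
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=TIMEOUT_SECONDS):
                return True, time.monotonic() - start
        except OSError:
            return False, time.monotonic() - start

    if __name__ == "__main__":
        for host, port in SITES.items():
            up, elapsed = check_site(host, port)
            status = "UP" if up else "DOWN"
            print(f"{host:30s} {status:5s} {elapsed * 1000:7.1f} ms")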
HPCwire: Ruzena Bajcsy, NSF assistant director for Computer and Information
Science and Engineering, has stated that "the DTF can lead the way toward a
ubiquitous 'Cyber-Infrastructure'..." Do you agree that this project is the
first step toward the development of such an infrastructure? What is the
next step? Please describe your vision of a "ubiquitous Cyber-Infrastructure."
REED: Yes. The DTF TeraGrid is the first step in developing and deploying a
comprehensive computational, data management, and networking infrastructure of
unprecedented scale and capability. This is the idea of the TeraGrid--a
cyberinfrastructure that integrates distributed scientific instruments,
terascale and petascale computing facilities, multiple petabyte data
archives, and gigabit (and soon terabit) networks--all widely accessible by
scientists and engineers. The development of such an infrastructure is
critical to sustain U.S. competitiveness and to enable new advances in
science and engineering. New scientific instruments and high-resolution
mobile sensors are flooding us with new data, ranging from full sky surveys
in astronomy to ecological and environmental data to genetic sequences. The
TeraGrid is the blueprint for the infrastructure that will allow us to
glean insights from this data torrent. Terabytes of data from individual
experiments and petabytes from research
collaborations will soon be the norm. Simply put, breakthrough science and
engineering is critically dependent on a first-class computational and data
management infrastructure.
In the long run, the TeraGrid vision will help to transform how we work and
our notions of "research" and "computing." We will move away from "island
universes" to an ubiquitous fabric where applications execute without
explicit reference to place. As an example, imagine an earthquake
engineering system that integrates "teleobservation" and "teleoperation,"
enabling researchers to control experimental tools--seismographs, cameras,
or robots--at remote sites and to obtain real-time remote access to the data
those tools generate. By combining these capabilities with video and audio
feeds, large-scale computing facilities for integrated simulation, data
archives, high-performance networks, and structural models, researchers will
be able to improve the seismic design of buildings, bridges, utilities, and
other infrastructure. Many such examples exist of how our understanding of our
natural world will be enhanced and accelerated through the use of an
integrated infrastructure. Similarly compelling examples exist in fields as
diverse as biology and genomics, neuroscience, aircraft design, high-energy
physics and astrophysics, and intelligent, mobile environments for IT research.
HPCwire: Does the DTF, in fact, constitute a de facto push by the NSF
toward virtual unification of SDSC and NCSA?
REED: NCSA, SDSC, and their two partnerships, the Alliance and NPACI, each
contribute unique and complementary skills and technologies to the
collaborative development of the DTF TeraGrid. Concurrently, each will
continue to separately develop and deploy new computing infrastructure as
part of their ongoing PACI missions.
HPCwire: How would you characterize your leadership of NCSA? How does it
differ from that of your predecessor, Larry Smarr? What are your greatest
challenges at this time, and how are you dealing with them?
REED: NCSA and the Alliance are about enabling breakthrough science and
engineering via advanced computing infrastructure. That is a long and rich
tradition that both Larry and I believe in passionately. NCSA's role is not
only to support today's computational science research but also to "invent
the future" by developing those technologies that will make today's
scientific dreams tomorrow's reality. The TeraGrid is THE NEXT MAJOR STEP
along that path, one that leads NCSA and the Alliance to petaflops, terabit
networks, hundreds of petabytes, and ubiquitous mobile sensors. We're
continuing to invent the revolution that will transform science and
engineering research.
Copyright 1993-2001 HPCwire. Redistribution of this article is forbidden by
law without the expressed written consent of the publisher. For HPCwire
subscription information send e-mail to sub@hpcwire.com. Tabor Griffin
Communications' HPCwire is also available at
http://www.tgc.com/hpcwire.html