June 7, 2002
Researchers Achieve Production Grid Breakthrough
Physics researchers have carried out the first production-quality simulated
data generation on a data Grid, comprising sites at Caltech, Fermilab, the
University of California-San Diego, the University of Florida, and the
University of Wisconsin-Madison.
"This achievement represents an extremely challenging and important
milestone in the integration of Grid middleware components within the
current 'real world' LHC computing environment," the researchers announced.
Doug Olson of Lawrence Berkeley National Laboratory and the Particle
Physics Data Grid said it has been "decided that a worldwide Grid
environment is required and will be used for the computing work of the
physics experiments at the LHC," the Large Hadron Collider at CERN in
Switzerland. Technical details of the worldwide Grid are still being worked
out, he said.
Globus Project co-leader Ian Foster called the work "a major achievement in
terms of production Grid computing."
The work was done by members of the U.S. Compact Muon Solenoid
Collaboration (CMS) in concert with the Particle Physics Data Grid, the
Grid Physics Network, and the International Virtual Data Grid Laboratory,
and was funded by the U.S. Department of Energy, the National Science
Foundation and the EU-DataGrid project, among others.
The deployed data Grid serves as an integration framework, with Grid
middleware components brought together to form the basis for distributed
CMS Monte Carlo Production (CMS-MOP) and used to produce data for the
global CMS physics program, the researchers said. The middleware components
include Condor-G, DAGMan, GDMP, and the Globus Toolkit, packaged together in
the first release of the Virtual Data Toolkit.
The CMS-MOP distributed production system employs a tier-like hierarchy in
which a production manager at a Tier-1 center distributes production jobs
to several remote Tier-2 sites, they said. Once generated at the Tier-2
sites, the simulated data is automatically published back to the Tier-1
center as well as replicated to selected Tier-2 sites.
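The tiered flow described above can be illustrated with a minimal Python sketch. It is not the CMS-MOP implementation: the round-robin policy, the function names, and the site labels ("tier1", "site_a", and so on) are all assumptions made for illustration, and no real Grid middleware is called.

```python
from collections import defaultdict
from itertools import cycle


def distribute_jobs(job_ids, tier2_sites):
    """Assign each job to a Tier-2 site in round-robin order.

    Round-robin is an assumed policy; the article does not say how the
    production manager chooses a site for a job.
    """
    assignment = defaultdict(list)
    for job_id, site in zip(job_ids, cycle(tier2_sites)):
        assignment[site].append(job_id)
    return dict(assignment)


def publish_results(assignment, replica_sites, tier1="tier1"):
    """Return a toy catalog mapping each dataset to the sites holding a copy.

    Each dataset stays at the Tier-2 site that produced it, is published
    back to the Tier-1 center, and is replicated to the selected sites.
    """
    catalog = {}
    for site, job_ids in assignment.items():
        for job_id in job_ids:
            holders = {site, tier1} | set(replica_sites)
            catalog[f"events_{job_id}"] = sorted(holders)
    return catalog


# Hypothetical site labels, not the real collaboration sites.
assignment = distribute_jobs([1, 2, 3, 4, 5], ["site_a", "site_b"])
catalog = publish_results(assignment, replica_sites=["site_c"])
```

In this toy run, jobs 1, 3, and 5 go to site_a and jobs 2 and 4 go to site_b, and every dataset ends up listed at its producing site, at the Tier-1 center, and at site_c.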
"This integration exercise showed that the Grid still presents significant
challenges in harnessing distributed resources," the researchers said.
Issues of data and security had to be overcome, such as how to get software
and data to many remote systems and verify that they arrived, and how to get
results back.
Issues of heterogeneity and error recovery also had to be addressed, they
said. "To use other sites' resources, you need to interface with many batch
systems; the Grid means more errors, more crashes, more mysterious
failures," they wrote. The unanticipated errors that had to be handled
included key machines crashing in the middle of a run; Grid credentials
expiring in the middle of a run; jobs completing successfully but losing
their results before they could be sent back; various pieces of middleware
behaving unexpectedly; and the network going down.
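One common way to cope with failures of this kind is to resubmit a job a bounded number of times. The Python sketch below illustrates that general pattern only; the article does not describe how CMS-MOP actually recovered from errors, and the exception class, function names, and retry limit here are hypothetical.

```python
class TransientGridError(Exception):
    """Stand-in for a recoverable failure, e.g. a crash or a lost result."""


def run_with_retries(job, max_attempts=3):
    """Call job() until it succeeds or max_attempts is reached.

    Returns a (result, attempts_used) tuple. Raises RuntimeError if every
    attempt fails with a TransientGridError.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except TransientGridError as err:
            last_error = err  # remember the failure, then resubmit
    raise RuntimeError(f"job failed after {max_attempts} attempts") from last_error


def make_flaky_job(failures):
    """Build a fake job that fails `failures` times before succeeding."""
    state = {"left": failures}

    def job():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientGridError("simulated lost result")
        return "events.dat"  # hypothetical output file name

    return job


result, attempts = run_with_retries(make_flaky_job(2))
```

Here the fake job fails twice and succeeds on the third attempt. A real system would also need to distinguish recoverable failures from permanent ones, which this sketch does not attempt.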
"Despite these challenges, over 50,000 proton-proton collision events
inside the CMS detector have been simulated using CMS-MOP and validated for
use by CMS physicists," the researchers said. Production of another 150,000
simulated events is underway.
Copyright 1993-2002 HPCwire. Redistribution of this article is forbidden by law without the expressed written consent of the publisher.