April 19, 1999
The name of the game is connectivity. And for high-performance networks that
increase available bandwidth without breaking the bank, ATM has long been
recognized as a solid choice. The technology is built on the promise of QoS
(Quality of Service) and support for voice, video and data--and, when used
in the right environment, ATM has delivered. Not only have large corporations
bought into ATM for enterprise solutions, but in 1995, the National Science
Foundation sponsored the very-high-speed Backbone Network Service (vBNS) to
interconnect the U.S. research and education communities. Implemented and
operated by MCI, vBNS has successfully provided IPv4 transport services over
an OC-12 ATM backbone connecting 11 POPs (points of presence) and serving 71
site connections.
Recently, the NSF raised the stakes. In 1997, it issued a request for
solutions that could extend the interconnectivity of vBNS over the Internet
to international research and education networks, such as APAN (Asian Pacific
Advanced Network), whose members include Australia, Hong Kong, Indonesia,
Japan, Korea, Malaysia, Singapore and Thailand (www.apan.net). In October
1998, the NSF formally approved a proposal from Indiana University
(see "Network Operations Organizational Structure",
below). Work on that trans-Pacific link began many months in advance of that
approval, however, and is now largely completed.
As members of an Indiana University team that helped design and build that network,
we joined forces with groups from Ameritech Advanced Data Services, Argonne National
Labs, APAN, AT&T, Australian National University, Japan Science and Technology Corp.,
Kokusai Denshin Denwa (KDD), Korea Telecom, Korea Advanced Institute for Science and
Technology, and the National University of Singapore to construct the ATM WAN that
extended from Chicago to Tokyo. And while it might seem that building a global network
should be as simple as connecting switches and routers over the Internet, building the
actual ATM WAN connection was fraught with obstacles, not the least of which was
overcoming differences of time, distance and language.
For example, teams from AT&T and KDD completed the first round of circuit
testing in July and, after proclaiming the link clean, handed the process
over to us. Much to our dismay, we experienced extreme packet loss during
ping tests between the routers at each end of the link. We reasoned that
the problem must lie in our equipment and/or its configuration.
We checked whether the routers were shaping traffic more loosely than the
telco service expected, forcing the telco traffic-policing mechanism to
drop cells. This was not the culprit. And we ruled out the possibility
that IP-over-ATM encapsulation was set differently at the ends of the link
(aal5snap versus vcmux), but each ruled-out cause narrowed the search. We
eventually discovered that payload scrambling was enabled on one side of the
local DS-3 link and disabled on the other.
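A quick sanity check for this class of problem is to compare the critical PVC
parameters at each end before running traffic. The Python sketch below (with
made-up parameter names and values, not any vendor's CLI output) illustrates
the idea:

# Hypothetical end-to-end sanity check for an IP-over-ATM PVC.
# Parameter names and values are illustrative, not vendor CLI output.
chicago = {"encapsulation": "aal5snap", "ds3_scrambling": True,  "shaping_pcr_mbps": 35}
tokyo   = {"encapsulation": "aal5snap", "ds3_scrambling": False, "shaping_pcr_mbps": 35}

def compare_endpoints(a, b):
    """Report any parameter that must match on both ends of the link."""
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            print(f"MISMATCH {key}: {a.get(key)} vs {b.get(key)}")

compare_endpoints(chicago, tokyo)   # flags the ds3_scrambling mismatch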
However, despite resolving this problem, we still encountered a 1 percent to
2 percent packet loss. Using router-to-router pings, we narrowed the packet
loss to the vBNS router that interconnects the trans-Pacific network
infrastructure and Indiana University. The university was connected via
a VC (Virtual Circuit) on one interface while the trans-Pacific network
was similarly connected but to a different interface. When MCI moved the
connections to the same interface, the problem evaporated, but as of this
writing we were still trying to figure out why the old configuration didn't work.
Since many of the challenges we faced can occur during construction of
any ATM WAN, we outline them here and discuss the solutions we developed
while building the connection, which we call the Trans-Pacific Advanced Connection
(see "Bonus Points: The ABCs of TransPAC", below).
The Object of the Game
We based our TransPAC network on a 35-Mbps VBR-nrt ATM service provisioned
as a single PVC (Permanent Virtual Circuit), extending from the ATM-based
exchange point in Chicago, called the Science Technology and Research Transit
Access Point, or STAR TAP (see "Bonus Points: STAR TAP", below), to the Tokyo
APAN XP (exchange point). Additional bandwidth should become
available once carriers upgrade the trans-Pacific infrastructure. Routers
in Chicago and Tokyo provide Layer 3 IP services.
Several factors played into our decision to use ATM as opposed to,
say, DS-3 service. First, we were working with a limited budget and the
ATM VBR (variable bit rate) service was much less expensive than equivalent
dedicated bandwidth. Second, we needed to be able to carve out several
discrete pipes. For example, researchers on both sides of the Pacific
are interested in collaborating on wide-area native IPv6 trials, but
tunneling IPv6 in IPv4--the primary transport service offered over TransPAC
--was not an option, since this would require end systems to run IPv4 and
IPv6 stacks and would result in encapsulation overhead. Using ATM allowed
us to provision a PVC to connect IPv6 routers on either side of the link.
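The overhead half of that argument is easy to quantify: tunneling IPv6 in IPv4
adds a 20-byte outer IPv4 header to every packet. A rough back-of-the-envelope
sketch:

# Per-packet overhead of IPv6-in-IPv4 tunneling (20-byte outer IPv4 header),
# relative to the original packet size.
IPV4_HEADER = 20  # bytes

for packet_size in (64, 576, 1500):      # typical packet sizes in bytes
    overhead = IPV4_HEADER / packet_size * 100
    print(f"{packet_size:5d}-byte packet: ~{overhead:.1f}% encapsulation overhead")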
Setting Up the Board
Most applications running over the TransPAC link will run over IP. Although
IP provides universal connectivity among
heterogeneous systems and interconnecting networks, using IP over a
long-distance link, such as ours, and ensuring that packet forwarding
conforms to the vBNS acceptable-use policy present a number of interesting challenges.
For example, the speed of light tends to slow IP connections. The
distance between Chicago and Tokyo is about 10,000 kilometers, or 10
million meters. The speed of light is about 300 million meters per second,
so the delay between Chicago and Tokyo (commonly known as propagation delay)
is about 10,000,000/300,000,000, or 33 milliseconds. Other factors compound
the delay: Light travels through fiber at only about two-thirds of its velocity
in a vacuum, and router queuing delay and various other delays in the
carrier's cloud must be considered. Taken together, the delay becomes significant
--in our case, it measures 100 ms one way. If a network node in Chicago starts
sending packets to Tokyo at the rate of 35 Mbps (the maximum rate supported by
TransPAC), the Chicago node would transmit roughly 3.5 megabits--about 440,000
bytes--before the first packet arrived in Tokyo. The number of bytes in transit
on a link is referred to as the bandwidth-delay product.
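The arithmetic is simple enough to capture in a few lines; this sketch just
restates the numbers given above:

# Back-of-the-envelope bandwidth-delay product for the TransPAC link.
bandwidth_bps = 35e6      # 35 Mbps, the maximum rate supported by TransPAC
one_way_delay_s = 0.100   # ~100 ms one-way delay, as measured

bdp_bits = bandwidth_bps * one_way_delay_s
print(f"Bytes in flight one way: {bdp_bits / 8:,.0f}")   # ~437,500 bytes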
TCP, the mainstay for providing error-free reliable communications over the
Internet, is adversely affected by high bandwidth-delay product links.
Through the efforts of Van Jacobson and others, TCP has gained a sophisticated
congestion-control mechanism that regulates the rate at which data is transmitted
over a given application connection. TCP attempts to adapt its transmission rate to
available capacity, both as the connection starts its initial transmission and
in the presence of network congestion as detected by packet loss. Communications
links with a high bandwidth-delay product, like TransPAC, hamper TCP's feedback
mechanism and typically impinge on an application's ability to ramp up transmission
rates on startup or recover from congestion (seen as packet loss). Although most
vendors have implemented modern TCP options, such as the RFC 1323 window-scaling
extension, that improve performance over high bandwidth-delay product paths, we
could not assume that all end systems using this network had been upgraded, and
thus discarded this as a possible workaround.
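To see why the defaults fall short, compare the window a TCP sender needs to
keep this link full against the classic 64-KB window limit of unextended TCP
stacks. The sketch below assumes a 200-ms round-trip time (twice the 100-ms
one-way delay):

# Window needed to fill the pipe: bandwidth x round-trip time.
bandwidth_bps = 35e6          # TransPAC maximum rate
rtt_s = 0.200                 # assumed RTT: twice the ~100 ms one-way delay

needed_window_bytes = bandwidth_bps * rtt_s / 8
print(f"Window needed to fill the link: ~{needed_window_bytes / 1024:,.0f} KB")
print(f"A classic 64-KB TCP window fills only "
      f"{64 * 1024 / needed_window_bytes:.0%} of the pipe")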
To combat the side effects of high bandwidth-delay product and minimize
packet loss, TransPAC includes a Layer 3 buffering device (a router)
between the TransPAC network and the STAR TAP ATM switch. TransPAC also
provides support for end users who need to understand these issues and
tune their applications and workstations to perform better over similar
long-distance links.
Rules Are Rules
There are two paths between APAN and the United States--one is through
the commercial Internet, the other is over TransPAC. Some APAN institutions
do not meet TransPAC's AUP (acceptable-use policy), which governs the
traffic that may transit the link, and any U.S.-bound traffic from these
sites should take the commercial Internet path. AUP-compliant APAN sites
should use the TransPAC link. Sounds easy, but there's a rub: While router
vendors seem to have perfected the art of routing packets fast, they still
have work to do in the area of efficiently routing packets in a manner
conforming to desired administrative policies.
Because routers typically forward packets solely according to destination
address, they cannot distinguish AUP-compliant traffic from other traffic.
Under destination-based routing, all sites at one end of the link would
take the shortest path to the other--which, in our case, would inevitably
be over TransPAC, since its metric is shorter than that for the commercial
Internet. When there are multiple paths from point "A" to point "B,"
destination-based IP routing has a hard time doing the "right thing," from
a policy standpoint. This type of routing dilemma is commonly called the
fish-routing problem (see "Bonus Points: A Fish Called WANda", below).
Fortunately, Cisco has a flexible set of knobs (user-controllable parameters),
which it dubs route-maps, that allow the router to be configured to route
packets based on both destination and source address, a process known as
explicit routing. To implement this function, a table is constructed that
identifies all the source-destination pairs that require special handling.
Armed with the table, the router can make the appropriate routing decision.
But though route-maps provide a workable remedy, they're rather slow, and
the longer the list of source-destination pairs, the slower the route-map.
Suppose you have 50 Asian sites connecting with 100 vBNS sites. You'd end
up with a route-map list of 5,000 lines. That's too long.
We pondered several possible solutions, ranging from dynamically updating
the list of route-maps so that only currently in-use applications would be
represented, to monitoring the link to determine the most-used route-map
source-destination pairs and moving them to the top of the list (testing
revealed that the router searched the list sequentially). We finally
settled on a different approach: If we could keep the route-map list short,
the problem would be manageable. Rather than define source-destination pairs,
we used the route-map to identify packets coming from institutions permitted
to use TransPAC, regardless of their destination. This trimmed the list to just
one line per TransPAC authorized institution. Packets from these authorized
institutions were routed to a second router. The second router had a different
forwarding view of the network, one that used TransPAC for all packets destined
for STAR TAP-connected sites in the United States.
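To make the mechanics concrete, here is a minimal sketch of the classification
step in Python. The prefixes are invented for illustration; the point is that
matching on source alone keeps the policy list to one entry per authorized
institution:

import ipaddress

# Hypothetical list of AUP-compliant (TransPAC-authorized) source prefixes --
# one entry per institution, which keeps the policy list short.
AUTHORIZED_SOURCES = [ipaddress.ip_network(p)
                      for p in ("192.0.2.0/24", "198.51.100.0/24")]

def next_hop(src_ip):
    """Source-based classification: authorized sources go to the second
    router (whose forwarding view prefers TransPAC); everything else
    follows the normal destination-based path."""
    src = ipaddress.ip_address(src_ip)
    if any(src in net for net in AUTHORIZED_SOURCES):
        return "second router (TransPAC view)"
    return "default path (commercial Internet)"

print(next_hop("192.0.2.17"))     # -> second router (TransPAC view)
print(next_hop("203.0.113.5"))    # -> default path (commercial Internet)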
Equipment and Game Pieces
The high-performance networks interconnected by an ATM WAN link frequently
operate at bandwidths that exceed the bandwidth of the WAN link itself. Hence,
there is a real risk that the link will become saturated. To avoid this, during
threatening periods, certain applications should receive priority over others.
Given the lack of maturity (and, more important, the lack of user-friendliness)
of client-side RSVP implementations, as well as a general sense that transit
links such as TransPAC are inappropriate places to deploy stateful reservation
mechanisms, we plan to implement a differentiated services model for TransPAC soon.
In accordance with statements of direction from several vendors, TransPAC plans
to implement four service classes. We are testing weighted fair queuing (WFQ) to
implement this priority scheme. WFQ examines the IP precedence bits in the IP
header to sort packets into several queues. These queues are then serviced in a
round-robin fashion, with higher-priority queues receiving a longer service
interval. Depending on the implementation, fair-queuing may be in effect within
each of these queues. So, for example, if three 10-Mbps streams traverse the
high-priority queue, they would each receive 33 percent of the bandwidth that's
available to that queue.
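As a rough illustration of that service discipline, the following sketch (our
own simplification, not any vendor's scheduler) drains four queues in
proportion to their weights:

from collections import deque

# Simplified weighted round-robin: each pass, a queue may send up to
# 'weight' packets. This is an illustration, not a vendor implementation.
queues = {1: deque(), 2: deque(), 4: deque(), 8: deque()}   # weight -> queue

def service_one_round(queues):
    sent = []
    for weight, q in sorted(queues.items()):
        for _ in range(weight):
            if q:
                sent.append(q.popleft())
    return sent

# Load each queue with ten packets tagged by its weight, then run one round.
for weight, q in queues.items():
    q.extend(f"w{weight}-pkt{i}" for i in range(10))
print(service_one_round(queues))   # 1 + 2 + 4 + 8 = 15 packets per round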
All this is well and good, and sounds like it's exactly what the doctor ordered.
But while several vendors offer WFQ on campus routers with 100BASE-X interfaces,
no vendors that we have talked with as of this writing offer it on routers that
support both BGP and ATM interfaces. This is something of a paradox, since a
popular model within the IETF places differentiated services in the core of
wide-area networks (where BGP and ATM are currently mainstays) and RSVP at
the end points. Regardless, all the vendors we have spoken with assured us
that this feature is on their product road maps. Our sense is that this
functionality should be available sometime this quarter.
While not immediately applicable to TransPAC, we tested one 100BASE-X WFQ
implementation (Cisco's Catalyst 8510 Layer 3 switch) to determine if there
are some generalizations that we could make to characterize WFQ's behavior.
Using a Netcom Systems Smartbits SMB-2000 traffic generator/analyzer, we
determined empirically that the worst-case bandwidth available to a given
stream can be determined by the following generalized equality:
BWS=(WS*L)/(WT*MS)--where BWS equals worst-case stream bandwidth;
WS equals the weighted round-robin weight for the stream; L equals line bandwidth;
WT equals the total weighted round-robin weight; and MS equals the stream
multiplier (the number of streams sharing the queue).
So, for example, the 8510 uses four queues per interface, which, by default,
receive weights of 1, 2, 4 and 8, respectively. Queue weights are adjustable,
but they must add up to 15. Packets are mapped to queues based on their IP
precedence bits according to the following table:
Precedence    Queue Weight
0-1           1
2-3           2
4-5           4
6-7           8
For two packet streams with an IP precedence of 5: WS=4, L=100 Mbps (10^8 bps),
WT=15 and MS=2. This gives a worst-case bandwidth for each stream of
(4*10^8)/(15*2), or about 13.3 Mbps.
Researchers who use the TransPAC link need to be aware of the resources
available to their applications, so a generalized equality like this one
will come in handy in the future.
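The equality is easy to apply in practice; the sketch below reproduces the
Catalyst 8510 example given above:

def worst_case_stream_bandwidth(ws, line_bps, wt, ms):
    """BWS = (WS * L) / (WT * MS): worst-case bandwidth for one stream,
    where ws is the stream's queue weight, line_bps the line rate,
    wt the sum of all queue weights and ms the number of streams
    sharing the queue."""
    return (ws * line_bps) / (wt * ms)

# Two streams at IP precedence 5 on a 100-Mbps line (queue weight 4 of 15):
bws = worst_case_stream_bandwidth(ws=4, line_bps=100e6, wt=15, ms=2)
print(f"~{bws / 1e6:.1f} Mbps per stream")   # ~13.3 Mbps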
In lieu of per-VC WFQ, we are left with another mechanism that was really
designed as a tool to help avoid congestion. WRED (Weighted Random Early
Detection) provides a means through which traffic can be selectively dropped.
This is helpful if your environment carries traffic, such as TCP, that
responds to dropped packets by slowing down.
The "weighted random" part of WRED indicates that packets are discarded at
random, with drop probabilities weighted by their IP precedence bits. "Early
detection" means
that this selection is done well in advance of buffer exhaustion. The
idea is that rather than allowing buffers to fill completely and then
be forced to drop all traffic non-selectively, you begin to drop
selectively before buffers are full. The advantage is that many traffic
flows are not simultaneously affected and, in the case of TCP, do not
all enter their slow start phases and begin to ramp up at the same time
--an effect called "global synchronization." In environments where RED
is not implemented, it is common to see waves of congestion as end systems
slow down and speed up their transmissions in unison. WRED is not a
substitute for WFQ, however; indeed, the two should complement each other.
But given our current options, we feel WRED is better than nothing.
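Conceptually, the drop decision looks something like the sketch below, a
simplification of the RED algorithm with a per-precedence minimum threshold.
The thresholds and maximum drop probability are invented for illustration:

import random

# Simplified WRED drop decision. Higher IP precedence gets a higher
# minimum threshold, so lower-priority traffic starts being dropped
# first as the average queue depth grows. Values are illustrative only.
MAX_THRESHOLD = 100           # packets of average queue depth
MAX_DROP_PROBABILITY = 0.10

def min_threshold(precedence):
    return 20 + 10 * precedence   # hypothetical per-precedence minimum

def should_drop(avg_queue_depth, precedence):
    lo, hi = min_threshold(precedence), MAX_THRESHOLD
    if avg_queue_depth < lo:
        return False              # below the minimum: never drop early
    if avg_queue_depth >= hi:
        return True               # queue effectively full: always drop
    # Drop probability rises linearly between the two thresholds.
    p = MAX_DROP_PROBABILITY * (avg_queue_depth - lo) / (hi - lo)
    return random.random() < p

print(should_drop(avg_queue_depth=60, precedence=0))  # ~5 percent chance of a drop
print(should_drop(avg_queue_depth=60, precedence=7))  # False: below its minimum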
Bonus Points: STAR TAP
One piece of the High-Performance International Internet Service
(HPIIS) architecture as envisioned by the National Science Foundation
is a Layer 2 "meet me" point at which research and education (R&E)
connections, such as Indiana University's Trans-Pacific Advanced Connection
(TransPAC), enter the United States. The Science Technology and Research
Transit Access Point (STAR TAP) is an ATM-based exchange point in Chicago
that serves this purpose (www.startap.net).
Although the primary purpose of TransPAC is to interconnect the Asian-Pacific
Advanced Network (APAN) and the very-high-speed Backbone Network Service (vBNS),
this connection also gives APAN access to the many other high-performance R&E
networks that peer at the STAR TAP. Some of these networks are CANARIE (Canada),
NREN (NASA) and ESnet (U.S. Department of Energy), as well as other HPIIS networks
soon to arrive, such as MirNET (Russia).
Bonus Points: A Fish Called WANda
The fish-routing problem (named after the diagram typically used to represent it)
results when two or more networks with different routing policies aggregate
at a single router. This typically occurs when networks operating under
contracts with different ISPs meet at a GigaPOP (a high-speed aggregation point).
In the example shown here, Site A has contracted with ISP 1 and Site B
has contracted with ISP 2 for commodity Internet services. Because the
GigaPOP's router will have a single best route for Destination C,
traffic from both Site A and Site B will follow this best route (from ISP 1).
Without the ability to perform explicit routing--in other words, routing based
on source and destination, rather than just destination--the GigaPOP cannot
ensure that traffic intended for Destination C is following the path contracted
by the traffic source.
Bonus Points: The ABCs of TransPAC
More than just network infrastructure, the Trans-Pacific Advanced Connection
(TransPAC) is a collection of network and human resources whose goal is to
facilitate international collaboration in research and education.
Closely coordinated operational and user support activities are necessary
for networks to succeed on a global scale, and they encompass a long chain
of service providers (see diagram below).
The TransPAC NOC and User Services Groups comprise engineers and
technicians from Indiana University and from APAN (Asian Pacific
Advanced Network). Although composed of participants from multiple
organizations across many countries, these support groups can act
in a well-coordinated fashion by using state-of-the-art communications
and collaboration tools, such as videoconferencing and shared
environment conferencing.
The groups rely heavily on common systems and procedures for
scheduling, maintenance, monitoring, problem management, notification,
documentation, reporting and support.
© 2000 CMP Media Inc.