April 2003 Bulletin

Newsletter

Strategies & Issues: Measuring End-to-End Internet Performance

by Richard Carlson, T.H. Dunigan, Russ Hobby, Harvey B. Newman, John P. Streck, and Mladen A. Vouk

Network Magazine

When is "best-effort" service not good enough? If you've ever run a high-performance application over the Internet, only to achieve less-than-optimal results, your answer to this question would probably be "too often."

The root of the Internet's performance problems lies in our inability to obtain a system-level, end-to-end view of this network. While there are tools that can provide information on performance parameters, resolving the Internet's end-to-end performance problem requires a more sophisticated solution.

Today, the "standard" QoS mechanism on the public network is over-provisioning. Other techniques, such as Differentiated Services, are used on a piecemeal basis. Providers don't have control over the entire network, and where that control ends, Service Level Agreements (SLAs) are essentially unenforceable. Until a viable QoS-sensitive end-to-end business model emerges, consistently obtaining end-to-end QoS over the Internet will likely be impossible.

PERFORMANCE ELEMENTS

A system-level view of the Internet encompasses host platforms, which include their hardware, operating system (OS), and application software.

End-to-end performance involves host system characteristics such as memory and I/O bandwidth and CPU speed; the OS; and the application implementation. To maintain throughput levels, the host system must be able to move data from the application buffers, through the kernel, and onto the network interface buffers at a speed faster than that of the network interface.

Faster and wider I/O and memory buses, faster processors, larger amounts of memory, and accelerated disk transfer rates mean that many computers are no longer hardware limited, even at 1Gbit/sec speeds. In addition, OS vendors have reduced the number of memory-to-memory copies being made of data as it moves from an application to the network interface. With this change to "zero copy" stacks, OSs are gaining the capability to drive networks at speeds of up to several gigabits per second.
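To see why copy elimination matters, consider a rough back-of-the-envelope model (the copy counts and wire rate below are illustrative assumptions, not measurements): every memory-to-memory copy reads and writes the payload once, so each copy adds roughly twice the wire rate in memory-bus traffic.

def memory_traffic_gbps(wire_rate_gbps, copies):
    # Each copy = one read + one write of the payload; the final handoff
    # to the NIC costs one more read (DMA from the last buffer).
    return wire_rate_gbps * (2 * copies + 1)

for copies in (3, 2, 1, 0):   # e.g. app -> socket -> kernel -> driver, down to zero copy
    print(f"{copies} copies at 1 Gbit/sec wire rate -> "
          f"~{memory_traffic_gbps(1.0, copies):.0f} Gbits/sec of memory traffic")

At gigabit speeds, the difference between three copies (about 7Gbits/sec of memory traffic) and zero copies (1Gbit/sec) is what separates a hardware-limited host from one that can keep the interface full.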

So if hardware and the OS aren't limiting end-to-end Internet performance, what is? To find the answer to this question, we'll delve into three major problem areas: network configuration, network protocols, and applications.

CONFIGURATION LIMITATIONS

Duplex-mismatch conditions are just one example of a network configuration problem. A recent study at the NASA Jet Propulsion Laboratory in Pasadena, CA, showed that more than 50 percent of trouble tickets were generated by Fast Ethernet duplex-mismatch conditions. In this instance, the network and host NIC failed to correctly auto-negotiate the full/half-duplex setting for the link. The duplex-mismatch condition caused a massive loss of packets in one direction. For protocols that must be able to travel in both directions, such as TCP/IP, this can cause major problems.

Detecting duplex-mismatch conditions requires high-bandwidth tests. Such transfers fail in one direction because the half-duplex-configured interface expects congestion to be controlled by collision detection, while the full-duplex-configured interface doesn't perform this function. The full-duplex side effectively drowns out the half-duplex end and prevents any traffic from getting through in that direction.
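A minimal way to run such a test is to time a bulk TCP transfer in each direction and compare the rates; a duplex mismatch typically shows up as a drastically lower rate one way. The sketch below is a bare-bones illustration, not a replacement for purpose-built testers; the port number and transfer size are arbitrary assumptions.

import socket, sys, time

PORT, CHUNK, TOTAL = 5001, 64 * 1024, 64 * 1024 * 1024   # 64MB test transfer (assumed)

def server():
    # Receive a bulk transfer and report the achieved rate.
    with socket.create_server(("", PORT)) as srv:
        conn, addr = srv.accept()
        received, start = 0, time.time()
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            received += len(data)
        secs = time.time() - start
        print(f"received {received} bytes from {addr[0]}: "
              f"{received * 8 / secs / 1e6:.1f} Mbits/sec")

def client(host):
    # Send TOTAL bytes as fast as the path allows and report the achieved rate.
    buf = b"\x00" * CHUNK
    with socket.create_connection((host, PORT)) as conn:
        sent, start = 0, time.time()
        while sent < TOTAL:
            conn.sendall(buf)
            sent += len(buf)
        secs = time.time() - start
    print(f"sent {sent} bytes: {sent * 8 / secs / 1e6:.1f} Mbits/sec")

if __name__ == "__main__":
    server() if sys.argv[1] == "server" else client(sys.argv[2])

Run the server on one host and the client on the other, then swap the roles; two healthy Fast Ethernet ends should report similar rates in both directions.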

While bandwidth tests can detect the problem, they can't localize it unless the right probes are installed in and around each layer-2 and layer-3 device on the network. Some newer diagnostic tools, such as the Argonne National Laboratory's (ANL) Network Configuration Tester (http://miranda.ctd.anl.gov:7123), can find these problems more easily. This system performs a bandwidth test using a Java applet on the client computer, and can determine some characteristics of the link by examining kernel variables on a remote server.

Another configuration problem lies in the different speeds and types of individual links on a network path. For example, an enterprise might have a DS3 (45Mbits/sec) uplink from the campus to the local ISP, while its desktops have Fast Ethernet links. Diagnostic tools such as pathchar, pchar, and clink (which we'll discuss in more detail later) are being developed to determine a network path's link type, speed, and latency (round-trip time). These should be widely deployed within the next two years.
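The estimation step behind these tools is straightforward: for probes of increasing size sent to the same hop, the minimum round-trip time grows roughly linearly with packet size, so the slope of that line estimates the inverse of the link bandwidth and the intercept estimates the fixed delay. The sketch below fits that line over hypothetical (size, minimum RTT) samples; the sample values are invented for illustration.

def fit_link(samples):
    # samples: list of (packet_size_bytes, min_rtt_seconds) pairs for one hop.
    n = len(samples)
    sx = sum(size for size, _ in samples)
    sy = sum(rtt for _, rtt in samples)
    sxx = sum(size * size for size, _ in samples)
    sxy = sum(size * rtt for size, rtt in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # seconds per byte
    intercept = (sy - slope * sx) / n                   # fixed round-trip delay
    return 8 / slope, intercept                         # bits/sec, seconds

# Hypothetical minimum RTTs for probes of four sizes across one link.
samples = [(64, 0.0010114), (512, 0.0010910), (1024, 0.0011821), (1500, 0.0012667)]
bandwidth, delay = fit_link(samples)
print(f"estimated link speed ~{bandwidth / 1e6:.0f} Mbits/sec, "
      f"fixed delay ~{delay * 1e3:.2f} ms")

With these made-up samples the fit comes out near 45Mbits/sec, the DS3 uplink in the example above; repeating the measurement hop by hop is how pathchar-style tools build a per-link profile of the path.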

Maximum Transmission Unit (MTU) size is yet another network configuration issue. The IEEE 802.3 Ethernet protocol specifies an MTU of 1,500 bytes. The maximum throughput that an application can achieve is proportional to its packet size. In theory, larger packets provide higher throughput. Most Gigabit Ethernet switches and routers support Jumbo Frames, which allow the MTU size to be increased up to 9,000 bytes, thereby providing higher throughput with no additional packet loss. However, the lack of a standard for the Gigabit Ethernet Jumbo Frame protocol could lead to interoperability problems.
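One way to make the proportionality concrete is the widely cited approximation for loss-limited TCP throughput (attributed to Mathis et al.): throughput is at most MSS / (RTT x sqrt(loss rate)). The round-trip time and loss rate below are assumed values chosen only to show the effect of the larger MTU.

from math import sqrt

def loss_limited_throughput_bps(mss_bytes, rtt_seconds, loss_rate):
    # Mathis et al. approximation: throughput <= MSS / (RTT * sqrt(p)).
    return (mss_bytes * 8) / (rtt_seconds * sqrt(loss_rate))

rtt, p = 0.070, 1e-5                 # 70 ms round trip, 0.001% loss (assumed)
for mtu in (1500, 9000):
    mss = mtu - 40                   # minus 20-byte IP and 20-byte TCP headers
    mbps = loss_limited_throughput_bps(mss, rtt, p) / 1e6
    print(f"MTU {mtu}: loss-limited throughput ~{mbps:.0f} Mbits/sec")

Everything else being equal, the 9,000-byte frames raise the ceiling by roughly the same factor of six by which the packet size grew.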

PROTOCOL PROBLEMS

Network protocol issues can also have a dramatic impact on performance. For example, TCP has a number of built-in control algorithms that regulate how it behaves during the sending and receiving of data. The underlying premise is that multiple TCP streams should each receive a fair share of network resources. If a link becomes congested, all TCP streams sharing that link are expected to reduce their transmission rate to decongest the link. This process is referred to as TCP backoff.
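The fairness property follows from the additive-increase/multiplicative-decrease rule each stream applies. The toy simulation below (all values arbitrary) starts two flows with very unequal rates; because both halve on every congestion event but add the same fixed increment otherwise, their shares converge toward equality.

LINK_CAPACITY = 100          # arbitrary units
flows = [90.0, 10.0]         # deliberately unfair starting rates

for rtt in range(1, 201):
    if sum(flows) > LINK_CAPACITY:
        flows = [rate / 2 for rate in flows]      # congestion: every flow backs off
    else:
        flows = [rate + 1 for rate in flows]      # otherwise: additive increase
    if rtt % 50 == 0:
        print(f"round trip {rtt}: flow shares = {[round(rate, 1) for rate in flows]}")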

With the current TCP algorithm, a small amount of data loss (even a few percentage points' worth) can have a major impact on network and application performance. With this algorithm, losing a single packet during a round trip reduces throughput to that destination by 50 percent, and the rate climbs back only gradually while no further loss is detected. Thus, a long, high-throughput network stream can take several minutes to recover from a single packet-loss event.
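The "several minutes" figure follows from simple arithmetic: after a loss, the congestion window is halved, and in congestion avoidance it grows back by only about one segment per round trip. The path parameters below (1Gbit/sec, 70ms round trip, 1,460-byte segments) are assumed for illustration.

link_bps, rtt_s, mss_bytes = 1e9, 0.070, 1460            # assumed path parameters

window_segments = (link_bps * rtt_s) / (mss_bytes * 8)   # segments needed to fill the path
recovery_rtts = window_segments / 2                      # one segment regained per round trip
recovery_seconds = recovery_rtts * rtt_s

print(f"window ~{window_segments:.0f} segments; "
      f"recovery from a single loss ~{recovery_seconds / 60:.1f} minutes")

With these numbers the window is roughly 6,000 segments, so regaining the lost half takes about 3,000 round trips, or on the order of three and a half minutes.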

Some large-volume, very high-speed data transfer applications attempt to circumvent the Internet's TCP backoff problem through a variety of measures. These include staging and caching application data on servers near the end user, simultaneously using multiple traffic streams for a single application, and using UDP-based protocols that obtain reliability by forward error correction or bulk retransmission of missing data blocks.
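The appeal of parallel streams is easy to quantify with the same loss-limited approximation used earlier: N independent streams each claim their own share, so the aggregate scales roughly with N. The segment size, loss rate, and round-trip time are again assumed values.

from math import sqrt

def single_stream_mbps(mss_bytes=1460, rtt_seconds=0.070, loss_rate=1e-5):
    # Loss-limited throughput of one standard TCP stream (Mathis et al. approximation).
    return (mss_bytes * 8) / (rtt_seconds * sqrt(loss_rate)) / 1e6

for streams in (1, 4, 8, 16):
    print(f"{streams:2d} parallel stream(s): "
          f"~{streams * single_stream_mbps():.0f} Mbits/sec aggregate")

Of course, a set of parallel streams collectively takes more than one stream's fair share, which is part of the fairness tension discussed below.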

An alternative is to use Error Correction Code (ECC) to reconstruct some of the lost packets at the destination. Only burst losses need to be retransmitted, since they may not be recoverable from the coding scheme.
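A minimal illustration of the idea (not the authors' specific coding scheme) is single-parity FEC: one XOR parity packet per block lets the receiver rebuild any one lost packet in that block without a retransmission, while a burst that removes two or more packets from the same block still forces a retransmission.

from functools import reduce

def make_parity(packets):
    # XOR all equal-length packets in the block together.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets)

def recover_missing(received_packets, parity):
    # XORing the surviving packets with the parity reproduces the single missing one.
    return make_parity(list(received_packets) + [parity])

block = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]        # one block of fixed-size packets
parity = make_parity(block)

survivors = [block[0], block[1], block[3]]          # packet 2 was lost in transit
print(recover_missing(survivors, parity))           # -> b'CCCC'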

Research is underway to determine how to quickly recover from packet loss without negatively impacting network reliability. One objective is to devise new congestion control algorithms for TCP that won't suffer the effects of a single packet loss, but will still maintain fairness of use. A few new algorithms have shown improved performance, but there's some question as to whether they exceed their fair share of available bandwidth.

Another network protocol issue relates to buffers. End-to-end performance is a function of link speed and the distance between the two end systems, which determines packet delay. As platform and network speeds increase, the time to transmit a single packet is reduced, so more packets can be sent within a particular time frame. The amount of data in flight on the network is the product of bandwidth and delay.
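Computed for a few example paths (the round-trip times are assumed), this bandwidth-delay product is the amount of data that can be in flight at once, and therefore the amount the end systems' buffers must be able to hold to keep the path full.

def bdp_bytes(bandwidth_bps, rtt_seconds):
    # Bandwidth-delay product: bits in flight, converted to bytes.
    return bandwidth_bps * rtt_seconds / 8

paths = {
    "Fast Ethernet, 10 ms round trip": (100e6, 0.010),
    "DS3 cross-country, 70 ms round trip": (45e6, 0.070),
    "Gigabit Ethernet, 70 ms round trip": (1e9, 0.070),
}
for name, (bandwidth, rtt) in paths.items():
    print(f"{name}: ~{bdp_bytes(bandwidth, rtt) / 1024:.0f} Kbytes in flight")

The gigabit path needs several megabytes of buffering at each end; default TCP socket buffers (often 64Kbytes or less) will cap throughput far below the link rate on such a path.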