Contact: Dave Andersen

Cluster-based storage systems are becoming an increasingly important target for both research and industry. These storage systems consist of a networked set of smaller storage servers, with data spread across these servers to increase performance and reliability. Building these systems using commodity TCP/IP and Ethernet networks is attractive because of their low cost and ease-of-use, and because of the desire to share the bandwidth of a storage cluster over multiple compute clusters, visualization systems, and personal machines. Furthermore, non-IP storage networking lacks some of the mature capabilities and breadth of services available in IP networks. However, building storage systems on TCP/IP and Ethernet poses several challenges. Our work analyzes one important barrier to high-performance storage over TCP/IP: the Incast problem.

TCP Incast is a catastrophic TCP throughput collapse that occurs as the number of storage servers sending data to a client increases past the ability of an Ethernet switch to buffer packets. In a clustered file system, for example, a client application requests a data block striped across several storage servers, issuing the next data block request only when all servers have responded with their portion (Figure 1). This synchronized request workload can result in packets overfilling the buffers on the client's port on the switch, resulting in many losses. Under severe packet loss, TCP can experience a timeout that lasts a minimum of 200ms, determined by the TCP minimum retransmission timeout (RTOmin).

Figure 1: A simple cluster-based storage environment with one client requesting data from multiple servers through synchronized reads.

When a server involved in a synchronized request experiences a timeout, other servers can finish sending their responses, but the client must wait a minimum of 200ms before receiving the remaining parts of the response, during which the client's link may be completely idle. The resulting throughput seen by the application may be as low as 1-10\% of the client's bandwidth capacity, and the per-request latency will be higher than 200ms (Figure 2).

Figure 2: TCP Incast: Throughput collapse for a synchronized reads application performed on a real storage cluster.

The motivation for solving this problem is the increasing interest in using Ethernet and TCP for interprocessor communication and bulk storage transfer applications in the fastest, largest data centers, instead of Fibrechannel or Infiniband. Provided that TCP adequately supports high bandwidth, low latency, synchronized and parallel applications, there is a strong desire to "wire-once" and reuse the mature, well-understood transport protocols that are so familiar in
lower bandwidth networks.


Existing TCP improvements---NewReno, SACK, RED, ECN, Limited Transmit, and modifications to Slow Start---sometimes increase throughput, but do not substantially change the incast-induced throughput collapse (See FAST 2008 paper below). However, we have identified three solutions to the Incast problem.

First, larger switch buffers can delay the onset of Incast (doubling the buffer size doubles the number of servers that can be contacted). But increased switch buffering comes at a substantial dollar cost -- switches with 1MB packet buffering per port may cost as much as $500,000. Second, Ethernet flow control is effective when the machines are on a single switch, but is dangerous across inter-switch trunks because of head-of-line blocking.

Finally, reducing TCP's minimum RTO allows nodes to maintain high throughput with several times as many nodes. We find that providing microsecond resolution TCP retransmissions can achieve full throughput for as many as 47 servers in a real world cluster environment (Figure 3).

Figure 3: Reducing RTO to microsecond granularity alleviates Incast for up to 47 concurrent senders. Measurements taken on 48 Linux 2.6.28 nodes, all attached to a single switch.



David Andersen
Greg Ganger
Garth Gibson
Srini Seshan


Michael Stroucken


Elie Krevat
Amar Phanishayee
Hiral Shah
Wittawat Tantisiriroj
Vijay Vasudevan



Related Items


We thank the members and companies of the PDL Consortium: Broadcom, Ltd., Citadel, Dell EMC, Facebook, Google, Hewlett-Packard Labs, Hitachi Ltd., Intel Corporation, Microsoft Research, MongoDB, NetApp, Inc., Oracle Corporation, Samsung Information Systems America, Seagate Technology, Tintri, Two Sigma, Uber, Veritas and Western Digital for their interest, insights, feedback, and support.




© 2017. Last updated 16 August, 2012