Cluster-based storage systems are becoming an increasingly important target for both research and industry. These storage systems consist of a networked set of smaller storage servers, with data spread across these servers to increase performance and reliability. Building these systems using commodity TCP/IP and Ethernet networks is attractive because of their low cost and ease-of-use, and because of the desire to share the bandwidth of a storage cluster over multiple compute clusters, visualization systems, and personal machines. Furthermore, non-IP storage networking lacks some of the mature capabilities and breadth of services available in IP networks. However, building storage systems on TCP/IP and Ethernet poses several challenges. Our work analyzes one important barrier to high-performance storage over TCP/IP: the Incast problem.
TCP Incast is a catastrophic TCP throughput collapse that occurs as
the number of storage servers sending data to a client increases past
the ability of an Ethernet switch to buffer packets.
In a clustered file system, for example, a client application requests
a data block striped across several storage servers, issuing the next
data block request only when all servers have responded with their
portion (Figure 1). This synchronized request workload
can result in packets overfilling the buffers on the client's port on
the switch, resulting in many losses. Under severe
packet loss, TCP can experience a timeout that lasts a minimum
of 200ms, determined by the TCP minimum retransmission timeout
When a server involved in a synchronized request experiences a timeout, other servers can finish sending their responses, but the client must wait a minimum of 200ms before receiving the remaining parts of the response, during which the client's link may be completely idle. The resulting throughput seen by the application may be as low as 1-10\% of the client's bandwidth capacity, and the per-request latency will be higher than 200ms (Figure 2).
The motivation for solving this problem is the increasing interest in
using Ethernet and TCP for interprocessor communication and bulk
storage transfer applications in the fastest, largest data centers,
instead of Fibrechannel or Infiniband. Provided that TCP adequately
supports high bandwidth, low latency, synchronized and parallel
applications, there is a strong desire to "wire-once" and reuse the
mature, well-understood transport protocols that are so familiar in
lower bandwidth networks.
Existing TCP improvements---NewReno, SACK, RED, ECN, Limited Transmit, and modifications to Slow Start---sometimes increase throughput, but do not substantially change the incast-induced throughput collapse (See FAST 2008 paper below). However, we have identified three solutions to the Incast problem.
First, larger switch buffers can delay the onset of Incast (doubling the buffer size doubles the number of servers that can be contacted). But increased switch buffering comes at a substantial dollar cost -- switches with 1MB packet buffering per port may cost as much as $500,000. Second, Ethernet flow control is effective when the machines are on a single switch, but is dangerous across inter-switch trunks because of head-of-line blocking.
Finally, reducing TCP's minimum RTO allows nodes
to maintain high throughput with several times as many nodes. We find
that providing microsecond resolution TCP retransmissions can
achieve full throughput for as many as 47 servers in a real world cluster environment (Figure 3).
We thank the members and companies of the PDL Consortium: Amazon, Google, Hitachi Ltd., Honda, Intel Corporation, IBM, Meta, Microsoft Research, Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Two Sigma, and Western Digital for their interest, insights, feedback, and support.