Tuesday, April 19, 2005

Retries in NFSv4 Considered Harmful

I've been asked lately about this text in RFC3530, which I'm responsible for:
When processing a request received over a reliable transport such as TCP, the NFS version 4 server MUST NOT silently drop the request, except if the transport connection has been broken. Given such a contract between NFS version 4 clients and servers, clients MUST NOT retry a request unless one or both of the following are true:
  • The transport connection has been broken
  • The procedure being retried is the NULL procedure
Rather than resurrect the original discussion from the NFSv4 working alias (I think it was in 2002, since that's when I joined Network Appliance and rejoined the NFSv4 standards effort), resubmit it to the alias, get asked about it again several years from now, and do it all over again, let this blog entry serve as my words on the issue.

But is there an exception for clients that retry a request that is already in progress on the NFSv4 server? Apparently some NFSv4 servers implement this exception (which is a holdover from code inherited from NFSv3).

No, there is no exception. If there were an exception, it would contradict, or at least be inconsistent with:
clients MUST NOT retry a request
So really the question is, why shouldn't clients be allowed to retry a request over the same, unbroken connection? The answer is that TCP guarantees delivery of data for as long as the connection stays up. Having an application-level retry above TCP defeats much of the purpose of TCP. NFSv4 places no limit on transfer sizes. Typical NFSv4 servers allow 64 Kbyte transfers, but some support much more, including one megabyte or more. Allowing clients to retry a one megabyte, or even a 64 Kbyte, transfer over the same connection is an awful waste of resources on the client, on the server, and on the network. And it is unnecessary, because the NFSv4 server MUST NOT silently drop a request unless the connection is broken. If you know your NFSv4 server honors that rule, then you need never retry a request unless the connection is broken.
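The rule quoted from the specification boils down to a two-condition check. Here is a minimal sketch in Python; the function name and its arguments are illustrative assumptions, not taken from any real client. The one fact borrowed from the protocol is that NFSv4 assigns procedure number 0 to NULL and 1 to COMPOUND.

```python
NULL_PROCEDURE = 0      # NFSv4 procedure number for NULL
COMPOUND_PROCEDURE = 1  # NFSv4 procedure number for COMPOUND


def may_retry(connection_broken: bool, procedure: int) -> bool:
    """Return True if the retry rule permits resending this request.

    A retry is allowed only when the transport connection has been
    broken, or when the procedure being retried is NULL. There is no
    third case for requests still in progress on the server.
    """
    return connection_broken or procedure == NULL_PROCEDURE
```

With this check, a COMPOUND request on an unbroken connection is never resent, however long the server takes to answer it.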

A broken connection is the one out allowed by the NFSv4 specification to handle the case of an NFSv4 server rebooting, or the case of a network partition causing a connection to time out. The latter is necessary if the client's API to the TCP connection doesn't provide any feedback telling the sender that the receiver has acknowledged receipt of the data sent.

The other out allowed is retrying a NULL procedure. This is there to handle the case of an NFSv4 server crashing without sending a disconnect indication (in TCP, this is a FIN) to the client. The client is then unaware that the connection no longer exists on the server. If the client sends a request and the server then crashes, the client will wait forever. Clients should wait a reasonable amount of time for a response (personally, I think 60-180 seconds is reasonable, and I wrote the Solaris 2.6 NFS/TCP client with such timeouts). Then they can either break the connection on their side and retry, or they can send a NULL procedure "ping" to the NFSv4 server. I prefer the latter, because NULL pings don't use many resources, and if the NFSv4 server is live, the ping saves the cost of unnecessarily re-sending a request that results in a big transfer.

Of course, some NFSv4 client implementors might be worried that there will still be some NFSv4 servers that drop requests without a disconnect indication or a nice NFSv4 error code like NFS4ERR_DELAY or NFS4ERR_RESOURCE. Given that there are NFSv4 servers out there that don't follow the specification with respect to retries of requests that are in progress (and I thought it was clear that servers MUST NOT do that), I guess I have to accept that some NFSv4 servers might mistakenly drop requests. So clearly, if the NULL ping does not force a disconnect, and some number of seconds later the response to the original request still has not been received, NFSv4 clients have no recourse but to disconnect and retry.
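Putting the last few paragraphs together, the client's recovery policy for a request that is still waiting on a reply might look like the sketch below. The 60-second reply timeout is the low end of the range suggested above. The 30-second grace period after the ping stands in for "some number of seconds" and is purely an assumption, as are all the names.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    WAIT = "wait"
    SEND_NULL_PING = "send_null_ping"
    RECONNECT_AND_RETRY = "reconnect_and_retry"


REPLY_TIMEOUT = 60.0  # seconds to wait for a reply before pinging
PING_GRACE = 30.0     # hypothetical extra wait after the NULL ping


def next_action(elapsed: float,
                connection_broken: bool,
                ping_sent_at: Optional[float] = None) -> Action:
    """Decide what to do about one request that has no reply yet.

    elapsed      -- seconds since the original request was sent
    ping_sent_at -- value of `elapsed` when the NULL ping was sent,
                    or None if no ping has been sent so far
    """
    if connection_broken:
        # The one case where resending the original request is allowed.
        return Action.RECONNECT_AND_RETRY
    if elapsed < REPLY_TIMEOUT:
        return Action.WAIT
    if ping_sent_at is None:
        # Cheap probe instead of resending a possibly large request.
        return Action.SEND_NULL_PING
    if elapsed - ping_sent_at < PING_GRACE:
        return Action.WAIT
    # Ping did not force a disconnect and the reply is still missing:
    # break the connection ourselves, then retry on a new one.
    return Action.RECONNECT_AND_RETRY
```

Note that the original request is never resent on the same connection. The only thing sent on the old connection after the timeout is the NULL ping.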

Note that the reason NFSv4 introduces strict rules about retries over TCP is that there were no rules at all for NFSv3 over TCP. As a result, some of the initial clients (and even some modern clients) had timeouts that were too short, servers would (and still do) drop requests, and client implementors really had no clear guidelines for when to retry and when to time out. With NFSv4, we never retry on the same connection, and for the most part, rely on the timers built into the connection-oriented transport. This seems like a better way, but that's just my opinion. :-)