Saturday, March 11, 2006

Connectathon 2006

I was at Connectathon last week, and gave a presentation, "NFS over TCP, Again". The slides are now posted at the Connectathon web site. The material should hopefully be self-explanatory, but I'll annotate some of it here based on the questions and discussions.

Slide 3 asks "Why NFS/TCP?" In addition to the reasons I gave, Max Matveev pointed out that even though both TCP and UDP have the same weak 16-bit checksum algorithm (a topic discussed in more depth by Alok Aggarwal), it turns out NFS over UDP/IP is much more prone to data corruption than over TCP/IP. NFS needs to send requests and responses that exceed the Maximum Transmission Unit (MTU) of the network media used between the NFS client and server. TCP does this by breaking the NFS message into segments that fit into the MTU. UDP does this by breaking the NFS message into IP fragments that each fit into the MTU. With TCP, each segment has a unique sequence number. With UDP, each fragment of a datagram shares a per-datagram 16-bit identifier, but has a unique fragment offset to indicate the fragment's place in the datagram. Let's say we are using NFS/UDP, and an NFS WRITE request is sent at time T, with datagram identifier X. The request is broken into N fragments. The first fragment is lost in transit somewhere, but the server receives the last N-1 fragments and holds them until it gets the first fragment, or until the time to live (TTL) timer on each of the fragments expires.

Meanwhile, the client is busy doing other NFS/UDP things, and the datagram identifier gets re-used. The identifier is just 16 bits; assuming 32-kilobyte writes and gigabit/sec transmission speeds, 2^16 * 32 * 1024 * 8 / 1000^3 is just 17.2 seconds. If the TTL is greater than 17 seconds, then the re-use of the identifier for another 32-Kbyte NFS WRITE will result in the first fragment of the new NFS request being used as the first fragment of the old NFS request. That first fragment has some interesting stuff in it, such as the file handle and the offset into the file. If the file handles are different, then we are writing data for one file into another file. That's both a security hole and data corruption. If the file handles are the same, and the offsets are different, then we get data corruption. If the file handles and offsets are the same, we can still get data corruption, because 17 seconds ago, a retry of the first NFS WRITE might have succeeded with no transmission loss, and this new NFS WRITE request is an intentional overwrite (say, a database record update).
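The wrap-around arithmetic above is easy to check with a few lines of code (the write size and link speed are just the example parameters from the text, not anything fixed by the protocol):

```python
def ip_id_wrap_seconds(write_bytes, link_bits_per_sec):
    """Time to exhaust the 16-bit IP datagram identifier space,
    assuming every datagram is a full NFS WRITE and the link is saturated."""
    ids = 2 ** 16                      # 16-bit IP identification field
    bits_per_write = write_bytes * 8   # one datagram per WRITE request
    return ids * bits_per_write / link_bits_per_sec

# 32 Kbyte writes over gigabit/sec, as in the text:
print(round(ip_id_wrap_seconds(32 * 1024, 10 ** 9), 1))  # 17.2 seconds
```

Smaller writes or faster links shrink that window further, which is why the TTL comparison matters.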

I admit to never encountering the above myself, but I'd long since given up on NFS/UDP back when ethernet was a lot slower.

Slide 4 says that in Linux, the NFS/UDP total timeout is about a minute. Someone, who shall go nameless to protect the guilty, challenged that. After the presentation, I did a quick experiment with a Linux client that is running:

2.6.11-1.27_FC3 #1 Tue May 17 20:27:37 EDT 2005 i686 i686 i386 GNU/Linux
At 16:07:39, I tried to do an ls of an NFSv3/UDP mount point to a dead NFS server, and collected a packet trace. The packet trace showed retransmissions at relative time offsets (in seconds) of 9.9, 19.8, 39.7, 1.1, 2.1, 4.3, 8.8, 16.6, 35.1, 1.1, 2.2, 4.4, 8.8, etc. At 16:08:48, the messages log wrote "server not responding".

The initial timeout appears to be 10 seconds [not 100 milliseconds as I claimed in the slide], the overall "call" timeout is about a minute (9.9 + 19.8 + 39.7 = 69.4 secs ~= 16:08:48 - 16:07:39 = 21 + 48 = 69 secs), and then the algorithm looks extremely Solaris-like, with the exception that instead of 35.1 seconds, Solaris would use 20 seconds on the 5th retransmit.
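The first cycle of the trace looks like a capped, doubling backoff that retransmits until the call's overall timeout budget is spent. Here is a rough model of that first cycle; the constants are inferred from the trace above, not taken from the Linux source:

```python
def backoff_intervals(initial, cap, total_budget):
    """Doubling retransmit intervals, capped, until the call's total
    timeout budget is exhausted (a sketch, not the actual kernel code)."""
    intervals, elapsed, t = [], 0.0, initial
    while elapsed + t <= total_budget:
        intervals.append(t)
        elapsed += t
        t = min(t * 2, cap)   # exponential growth up to the cap
    return intervals

# First major timeout cycle: 10 s initial, ~70 s budget, as measured above.
print(backoff_intervals(10, 40, 70))  # [10, 20, 40]
```

The model gives 10, 20, 40 where the trace shows 9.9, 19.8, 39.7; the small shortfalls are presumably timer granularity.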

In slide 9, I advocate using NULL RPC pings to probe whether a TCP connection is alive or not. This is to allow the client to quickly deal with the situation where a server crashed or failed over without sending a TCP disconnect indication. I didn't go into details about an algorithm, but here is what I had in mind:

When a request is sent over the connection, start (or reset if one is already started) a server crash timer that will be less than the timeout specified in the timeo= mount option. Each time a response from the NFS server is received, cancel the timer. Also, cancel the timer if the connection is ever disconnected.

When the server crash timer fires, send a NULL RPC (procedure zero) request. Reset the server crash timer. When a response to the NULL RPC request is received, cancel the server crash timer.
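The two paragraphs above can be sketched as a small state machine. This is only an illustration of the timer logic; the `send_null_rpc` hook and the 10 second default are placeholders, and a real client would wire this into its RPC transport rather than use Python threads:

```python
import threading

class ServerCrashDetector:
    """Sketch of the NULL RPC ping algorithm described above."""

    def __init__(self, send_null_rpc, interval=10.0):
        self.send_null_rpc = send_null_rpc  # hypothetical transport hook
        self.interval = interval            # must be < the timeo= value
        self.timer = None

    def request_sent(self):        # start (or reset) on every request
        self._reset()

    def response_received(self):   # any server response cancels the timer
        self._cancel()

    def disconnected(self):        # so does a TCP disconnect
        self._cancel()

    def _fire(self):               # no response within the interval:
        self.send_null_rpc()       # probe with a NULL RPC (procedure 0)
        self._reset()              # and re-arm; a reply will cancel us

    def _reset(self):
        self._cancel()
        self.timer = threading.Timer(self.interval, self._fire)
        self.timer.daemon = True
        self.timer.start()

    def _cancel(self):
        if self.timer:
            self.timer.cancel()
            self.timer = None
```

Note that the ping rate is self-limiting: the timer only re-arms while responses are absent, so a healthy server is never pinged at all.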

I was asked about a value for the server crash timer. I suggested 10 seconds. Someone objected that over low speed and/or high latency links, 10 seconds might not be enough. Even at 14.4 kbits/sec, 10 seconds is plenty of time to transmit a NULL RPC request (98 bytes over ethernet) and receive its response (86 bytes):
( 98 + 86 ) * 8 / (14.4 * 1024) = 0.0998 seconds = 99.8 milliseconds

As for high latency, Brent Welch pointed out that long distance WAN links aren't going to exceed a few hundred milliseconds.
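The serialization arithmetic above is reproduced here (using the same 1024 bits-per-kbit convention as the formula in the text):

```python
def null_rpc_serialization_secs(req_bytes, resp_bytes, kbits_per_sec):
    """Time to serialize a NULL RPC exchange onto a slow link.
    Ignores propagation delay, which Brent's point covers separately."""
    return (req_bytes + resp_bytes) * 8 / (kbits_per_sec * 1024)

# 98-byte request plus 86-byte response at 14.4 kbits/sec:
print(round(null_rpc_serialization_secs(98, 86, 14.4), 4))  # ~0.0998 seconds
```

So even adding a few hundred milliseconds of WAN latency, the exchange completes in well under a second, comfortably inside a 10 second timer.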

Another objection was the additional traffic these NULL pings will introduce. This objection misses the context for why I suggested pings. At least one NFS client out there uses 10 second RPC timeouts over TCP, with the justification being that the client needs to quickly detect a server crash or failover. With 10 second RPC timeouts, the traffic is going to be much higher than with my proposed 10 second server crash timeout. And from the algorithm I've presented here, the NULL pings won't happen any more frequently than once per 10 seconds between any client and server pair.

Slide 10 states,
RFC3530 requires an NFSv4 server to disconnect any
time it detects an NFSv4 client sending a retry over the
same connection

Rick Macklem pointed out that the RFC doesn't explicitly say that. He is correct. But it does say the server MUST not drop a request without disconnecting. NFS servers usually have a work avoidance cache: when an impatient client re-sends a request that is still in progress, the server drops the re-sent request rather than re-processing it. When I wrote that slide, I was not anticipating that a server implementation would not support work avoidance. However, in his own Connectathon presentation, Rick made some pretty interesting arguments for an NFSv4 server not supporting work avoidance.
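To make the constraint concrete, here is a toy model of the choice a TCP server faces on a retry. The cache structure is my own simplification; the point is only that over TCP the "drop" branch of the work avoidance cache has to become a disconnect to stay within the MUST above:

```python
# Toy duplicate request cache: maps RPC XID -> state.
IN_PROGRESS, DONE = "in-progress", "done"

def handle_retry(cache, xid):
    """Return the action a compliant NFSv4/TCP server may take
    for a (possibly retried) XID."""
    state = cache.get(xid)
    if state is None:
        return "process"           # never seen: execute normally
    if state == DONE:
        return "replay-response"   # answer again from the cache
    # In progress: silently dropping would strand the client on TCP,
    # so the only compliant way to "drop" is to disconnect.
    return "disconnect"

cache = {7: IN_PROGRESS, 9: DONE}
print(handle_retry(cache, 7))   # disconnect
print(handle_retry(cache, 9))   # replay-response
print(handle_retry(cache, 11))  # process
```

A server without work avoidance, as Rick argued for, simply re-processes in the first branch and the disconnect case never arises.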

Slide 16 gives advice for NFS users: verify your NFS client's default timeout, and if it is under 60 seconds, increase it. A note on verifying the timeout. As an example method to break the network path from client to server, I listed disconnecting the client from the network switch. Depending on the client, that might not work: if the client's network interface adapter does not detect the presence of the local area network, it might indicate that to the IP layer, and the IP layer in turn might report a network path problem immediately. Another way to do this is to disconnect the server from the switch. It may be that your server is in production, and you don't want to do that. So a third way would be to interpose a switch between your client and the main switch, and break the connection between the interposed switch and the main switch. (The way I do this is to put the server in a breakpoint so it stops responding.)

Enough about my presentation, but feel free to post questions in the comments.

Here are some comments on some of the other presentations.

  • Sam Falkner and Lisa Week gave an NFSv4 ACL talk, discussing some work they are doing to integrate the POSIX and NFSv4 ACL models for authorization. Slide 3 mentions that the new ZFS file system implements pure NFSv4 ACLs (just like Data ONTAP's WAFL does; great ideas transcend companies it appears :-). Slide 7 has an interesting idea for integrating ACLs with UNIX mode bits.
  • Alok Aggarwal of Sun presented his ideas for adding checksums to the NFSv4 protocol. While checksums as part of the NFSv4 protocol are a long way off, I think Alok makes a strong case for investing in good networking and storage hardware that will hopefully be less susceptible to corrupting data.
  • Tom Talpey of NetApp gave an NFS/RDMA update. The news on this effort includes: (1) The Linux NFS/RDMA server work has moved from CITI (University of Michigan) to Open Grid; (2) Sun and NetApp are funding an OpenSolaris client and server implementation at Ohio State University.
  • Garth Goodson of NetApp gave a useful summary of what Parallel NFS (pNFS) is, and its current status.
  • Tom Talpey gave an overview of the bugs in the Network Status Monitor protocol, which to me makes yet another case for using NFSv4.
  • Bryce Harrington of Open Source Development Labs (OSDL) discussed OSDL's efforts to test the Linux NFSv4 client and server.
  • Lisa Week presented the current state of the NFSv4.1 protocol. The current "what's in list" has: pNFS, directory delegations and notifications, SECINFO changes, exactly once semantics (aka sessions), implementation IDs, and clarifications and corrections motivated by NFSv4.0 implementation experience.
  • Tom Haynes of Sun discussed issues around scaling NFS server exports. The big takeaway is that in order to do client-based access control on each NFS request, servers need to consider vast scales. A grid of say 25,000 NFS clients X say 1300 exports combined with some really horrendous automounters translates into real challenges for NAS vendors playing in the high end. Slide 20 summarizes some excellent advice for server vendors.
  • Rick Macklem of the University of Guelph presented his ideas on a Recent Request Cache for NFSv4. After he took audience abuse for *gasp* using an overhead projector (which I suspect some of the age-20-something attendees had never seen before), he delivered his talk (on handwritten transparencies of course :-). I really did like his idea for computing a checksum of some of the NFS arguments and using that as an additional key. As I mentioned in my talk, RPC transaction identifier (aka XID) re-use, due to bad XID generation algorithms, causes lots of pain for some users of some NFS clients (which will go nameless to protect the guilty). Another very cool idea from Rick was using the TCP-level acknowledgement from the client to the server as an indication to the server that the client received the NFS response. The server can then delete the response from the request cache entry. Or at least, the server can move that response nearer to the front of the might-be-or-to-be-deleted list of responses. Tom Talpey asked Rick about the layering violation this would cause. Rick suggested that a socket option be created to allow the server to receive a callback when the client's TCP receiver acknowledges receipt.
  • Jeremy Allison of the Samba Team gave an update on Samba. It is always stimulating to listen to Jeremy predict the impending death of NFS and its takeover by CIFS. Meanwhile the number of NFSv4 implementations grows (more on that in a bit).
  • Andy Adamson of CITI at the University of Michigan discussed the work he is doing on SPKM-3 (Simple Public Key Mechanism). SPKM-3 is a GSS-API security mechanism for which I wrote an RFC several years back. SPKM-3 was in turn based on the SPKM-2 specification that Carlisle Adams of what was then Bell Northern Research wrote years before. Andy noted that the current SPKM RFCs use outdated crypto algorithms and old X.509 public key certificate specifications, and so the document needs updating. The consumers of SPKM will be people who want to use NFSv4 on transcontinental links, yet are in different organizations, making Kerberos V5 not feasible.
  • I didn't get to Sam Falkner's NFSv4/DTRACE talk, nor any of the NDMP talks.
In addition to the talks, there was of course interoperability testing of NFS, CIFS, SSH, and NDMP. NetApp was there testing CIFS, NFS, and NDMP, and demonstrated pNFS. There were two new NFSv4 implementations from companies I'm not allowed to mention (due to Connectathon non-disclosure rules). Without naming more companies, I learned that two other companies are planning on releasing NFSv4 features. So these four newcomers, plus BSD, EMC, Hummingbird, IBM (AIX), Linux, NetApp, and Sun will bring us to 11 NFSv4 implementations.