Friday, October 27, 2006

Review of "Why NFS Sucks" Paper from the 2006 Linux Symposium

Olaf Kirch of SUSE/Novell, a major Linux distributor, gave a talk on NFS on July 26, 2006, at the Linux Symposium. There were some press reports of his presentation that were stunning in their inaccuracies (e.g., that Sun invented RFS). In fairness, Olaf's paper has fewer errors, and since I wasn't there, I'll presume his presentation was no less accurate than his paper. Also, according to firsthand accounts from engineers I've exchanged email with, his presentation was far less critical of NFS than the paper. One attendee told me:

The parts of his talk that I did hear, though, left me with the impression that NFSv4 is the best thing since sliced bread since it fixes all the nits and problems with NFSv2/v3.

There were a few inaccuracies, but overall it was actually rather positive.


Kirch's paper pokes lots of holes, some accurate, without always explaining why those holes are there or how hard it would be to fill them. You might gain a better understanding of NFS's warts by reading the original NFSv2 USENIX paper.

The first section, on History, claims that AT&T's RFS predated NFS and that Sun designed NFS in reaction to weaknesses of RFS. That is reversed. Sun released NFS with SunOS 2.0 in 1985. RFS arrived in System V Release 3, which came in 1987. I was an employee of Lachman Associates, Inc. at the time, when Lachman obtained early access to System V Release 3 source code and ported NFS from SunOS 2.0 to System V Release 3 during 1985 (Lachman also ported NFS to System V Release 2 in the same time frame). RFS was, if anything, a reaction to NFS, and it is a classic example of the problems one will get if 100% adherence to POSIX semantics is the primary goal of a remote file access protocol. Kirch's explanation of the problems with RFS is correct, but later in his paper he criticizes NFS for not going down the same road.

The paper claims that the NFSv3 specification was written mostly by Rick Macklem and published in 1995. RFC 1813, indeed published in 1995, documents the specification, but it was made available in PostScript form by Sun in 1993. The primary contributors to the specification were Brian Pawlowski, Peter Staubach, Brent Callaghan, and Chet Juszczak (Chet being the catalyst for finally getting the NFS industry to sit down at the 1992 Connectathon and get serious about NFSv3). Rick certainly contributed to the NFSv3 specification, but so did several others, and they are listed in the acknowledgements of RFC 1813. For what it is worth, Rick's contributions to NFSv3 outweighed mine.

Regarding the claim that WebNFS gained no real following outside of Sun, I know of many NetApp customers that use it from Solaris clients to NetApp filers. Without NFSv4, it is the most practical way to use NFS through a firewall. It is certainly the case that web browsers unfortunately don't support nfs:// URLs, though I've noticed Mac OS X uses nfs:// syntax for some applications. In the Linux world there's no WebNFS following, but that is a function of Linux having no support for it.

Kirch states that the NFSv4 WG formed in reaction to Microsoft rebranding SMB as CIFS. Actually, the rebranding took place after Sun announced WebNFS. The Sun-hosted NFSv4 BOF at the 1996 San Jose IETF meeting took place after the Microsoft-hosted SMB BOF at the 1996 Montreal IETF meeting. I was at the SMB BOF in Montreal, and then co-chaired (with Brent Callaghan) the NFSv4 BOF at San Jose. Readers are free to connect the dots.

In the section on NFS file handles, Kirch notes the difficulties the Linux dentry model poses for NFS. NFS was around for years before Linux arrived. I submit that it "sucks" to design a VFS layer that didn't account for the most popular remote file access protocol of its time. Few UNIX systems, then or now, with a VFS layer share the problems the Linux VFS layer has with NFS.

In the section on write performance, Kirch claims "virtually all" NFS server implementations provide an option to turn off stable writes. Actually, Solaris never did, and NetApp's ONTAP never has either. Those two are rather significant servers, so "virtually all" is a stretch. In fact, I'm not even sure most servers had such an option.

At any rate, it is hard to understand what Kirch is arguing when he claims that even the safe unstable writes of NFSv3 are unsatisfactory. He doesn't offer any alternative for the problem of ensuring data reliability in the face of a server or client crash. Once the storage is decoupled from the application, and the storage and compute environments can fail independently, one has this problem.
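To make the crash problem concrete, here is a minimal sketch of the WRITE/COMMIT verifier scheme behind NFSv3's safe unstable writes. The toy server model and the names (server_write, server_commit, server_crash) are mine, not from any real implementation; the point is only that a changed verifier on COMMIT tells the client the server rebooted and the write must be retransmitted.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical in-memory server model -- not real NFS code.  The
     * verifier changes on every "reboot", per RFC 1813. */
    struct server {
        uint64_t verifier;      /* write verifier, changes on reboot */
        char     buffered[64];  /* data acknowledged but not yet on disk */
        char     disk[64];      /* stable storage */
    };

    /* UNSTABLE write: the server may reply before data reaches disk. */
    static uint64_t server_write(struct server *s, const char *data)
    {
        strncpy(s->buffered, data, sizeof(s->buffered) - 1);
        return s->verifier;              /* the reply carries the verifier */
    }

    /* COMMIT: flush buffered data to stable storage. */
    static uint64_t server_commit(struct server *s)
    {
        memcpy(s->disk, s->buffered, sizeof(s->disk));
        return s->verifier;
    }

    /* A crash loses the buffered data and changes the verifier. */
    static void server_crash(struct server *s)
    {
        memset(s->buffered, 0, sizeof(s->buffered));
        s->verifier++;
    }

    int main(void)
    {
        struct server srv = { .verifier = 42 };
        const char *data = "hello";

        uint64_t wv = server_write(&srv, data);   /* unstable write */
        server_crash(&srv);                       /* server reboots */
        uint64_t cv = server_commit(&srv);        /* client commits */

        if (wv != cv) {
            /* Verifier mismatch: the server may have lost the uncommitted
             * data, so the client retransmits the write and recommits. */
            server_write(&srv, data);
            server_commit(&srv);
        }
        printf("on disk: %s\n", srv.disk);
        return 0;
    }

The client holds the dirty data in its cache until COMMIT succeeds with a matching verifier, which is exactly what makes the scheme safe across server crashes.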

As for his claim that the performance gain of NFSv3 safe unstable writes is a mirage due to internal write buffers in modern disk drives that don't actually flush data: in my experience, NFS vendors are well aware of the issue and spend a lot of engineering resources to keep those disk buffers stable or force them to disk. The SPEC SFS committee reviews benchmark submissions all the time, and rigorously enforces the requirement that committed NFSv3 writes go to stable storage.

In the section on NFS over UDP, Kirch makes some concise and excellent arguments for why you should use NFS over TCP.

Kirch's criticisms in the Retransmitted Requests section are dead-on accurate. This is why NFSv4.1 will support true exactly-once semantics (I spent much of the summer getting the NFSv4.1 spec in shape for the exactly-once semantics description, which is why my blogging output has been pathetic of late).
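For the curious, here is a minimal sketch of how a sessions-style reply cache can provide exactly-once semantics. The slot and sequence-ID scheme below is a simplification in the spirit of what NFSv4.1 sessions will provide; the names and types are illustrative, not from the spec.

    #include <stdio.h>
    #include <stdint.h>

    #define NSLOTS 4

    /* One slot: the highest sequence ID seen and the cached reply for it. */
    struct slot {
        uint32_t seqid;
        char     reply[64];
    };

    static struct slot slots[NSLOTS];

    /* Execute a request "for real" -- stand-in for actual server work. */
    static void execute(const char *req, char *reply, size_t len)
    {
        snprintf(reply, len, "result-of(%s)", req);
    }

    /* Exactly-once dispatch: a retransmission reuses the same slot and
     * seqid, so it gets the cached reply instead of being re-executed. */
    static const char *dispatch(uint32_t slotid, uint32_t seqid,
                                const char *req)
    {
        struct slot *s = &slots[slotid];

        if (seqid == s->seqid)
            return s->reply;          /* retransmission: replay cached reply */
        if (seqid == s->seqid + 1) {  /* new request on this slot */
            execute(req, s->reply, sizeof(s->reply));
            s->seqid = seqid;
            return s->reply;
        }
        return "ERR_SEQ_MISORDERED";  /* anything else is a protocol error */
    }

    int main(void)
    {
        printf("%s\n", dispatch(0, 1, "REMOVE file-a"));  /* executed */
        printf("%s\n", dispatch(0, 1, "REMOVE file-a"));  /* replayed */
        return 0;
    }

Because the cache is bounded by the number of slots and indexed explicitly, the server never has to guess (as NFSv2/v3 duplicate request caches do) whether a request is a retransmission.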

There are some inaccuracies in the Cache Consistency section. Kirch claims the client revalidates the cache at regular intervals. Actually, clients set a time to live on the cache for a certain interval, and the next time the cache is accessed, it is revalidated if the time to live has expired. So if a file is cached but not actively in use (no process is issuing read or write system calls to it), no over-the-wire revalidation requests (GETATTRs) occur.
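A minimal sketch of the time-to-live behavior just described; getattr_over_the_wire() is a hypothetical stand-in for a real GETATTR call, and the TTL constant echoes the familiar acregmin/acregmax mount options. Note that a cached file that is never accessed generates no network traffic at all.

    #include <stdio.h>
    #include <time.h>

    struct attrcache {
        time_t mtime;    /* cached file modification time */
        time_t expires;  /* wall-clock time at which the entry goes stale */
    };

    /* Hypothetical stand-in for an over-the-wire GETATTR. */
    static time_t getattr_over_the_wire(void)
    {
        printf("GETATTR sent to server\n");
        return time(NULL);   /* pretend the server returned this mtime */
    }

    #define ACREGMIN 3  /* attribute cache TTL in seconds, cf. acregmin */

    /* Called on each access (read/write/stat).  If the file is cached
     * but never accessed, this never runs, so no GETATTRs go out. */
    static time_t get_cached_mtime(struct attrcache *c)
    {
        time_t now = time(NULL);
        if (now >= c->expires) {             /* TTL expired: revalidate */
            c->mtime   = getattr_over_the_wire();
            c->expires = now + ACREGMIN;
        }
        return c->mtime;                     /* otherwise serve from cache */
    }

    int main(void)
    {
        struct attrcache c = { 0, 0 };
        get_cached_mtime(&c);   /* expired: one GETATTR on the wire */
        get_cached_mtime(&c);   /* fresh: served from cache, no traffic */
        return 0;
    }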

Kirch also claims "most file systems store time stamps with second granularity". Perhaps in Linux this is the case. Outside the Linux world, file systems have been storing time stamps with microsecond or finer resolution for at least a decade, probably closer to two decades. It is certainly a huge problem if you are using Linux as your NFS server.

Kirch also glosses over the fact that applications that concurrently access the same file need a synchronization method, and that this method is usually byte-range file locking. He mentions that NFS clients that set a byte-range file lock will either bypass the cache for reads or writes, or invalidate the cache before each read and flush the cache after each write. But he doesn't note that even if an application were doing concurrent I/O to the same file on a local file system, synchronization would be necessary. This is no different from a multi-threaded application accessing a shared data structure: synchronization primitives like spinlocks are needed even if the data structure is kept in local memory.
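For the application side of this, the usual synchronization method on UNIX is POSIX byte-range locking via fcntl(), which works the same way on local and NFS file systems. A minimal example (the file name is made up):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("shared.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Lock bytes 0..99 exclusively; F_SETLKW blocks until granted.
         * On an NFS client, holding the lock also causes the client to
         * bypass or invalidate its cache for the locked region. */
        struct flock fl = {
            .l_type = F_WRLCK, .l_whence = SEEK_SET,
            .l_start = 0, .l_len = 100
        };
        if (fcntl(fd, F_SETLKW, &fl) < 0) { perror("fcntl"); return 1; }

        const char msg[] = "updated under lock";
        pwrite(fd, msg, sizeof(msg), 0);    /* safe: we hold the lock */

        fl.l_type = F_UNLCK;                /* release the byte range */
        fcntl(fd, F_SETLK, &fl);
        close(fd);
        return 0;
    }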

It is hard to tell whether Kirch considers cache consistency a performance problem or a correctness problem, as he dismisses NFSv4 delegations, which are not available when there is contention, and in a later section notes that the cluster file systems he touted earlier as possible solutions have their own problems, including scaling "beyond a few hundred nodes". NetApp has many customers with grid computing farms of thousands to tens of thousands of NFSv3 clients accessing as few as one filer. For some, NFSv4 delegations will be very appropriate.

The section on POSIX Conformance is accurate.

The section on Access Control Lists is mostly accurate. Note that when the ACCESS procedure was introduced in NFSv3, ACLs weren't widely used in UNIX at all. ACCESS was needed anyway to deal with the situation where NFS servers mapped superuser (uid 0) to "nobody", but clients would let superuser open the file anyway, resulting in user surprises, like being able to read the parts of a file with mode 0000 that were in local cache, but not the uncached bits. The deficiencies he says exist in NFSv4 ACLs are actually problems with the Linux implementation, not the protocol itself. Kirch is accurate that mapping NFSv4 ACLs to draft (but never standardized) POSIX ACLs is not always possible. That is by intent; it was never a goal to provide a perfect mapping. NTFS ACLs have won, and it is time to move on from draft POSIX ACLs.

The section on NFS Security is accurate.

In the section on NFS File Locking, Kirch states that no one has explained why NFS originally did not support file locking. The explanation is that SunOS 2.0 was based on a 4.2BSD kernel, and 4.2BSD had very limited support for file locking. Only when SunOS added support for System V APIs and complied with the System V Interface Definition (SVID) did Sun acknowledge the requirement to support byte-range locking on NFS and local file systems. This section is mostly accurate, but it skips the vast improvements NFSv4 makes over NFSv3 in terms of lock recovery.

Kirch's appraisals of AFS and CIFS are fairly accurate, though I cannot reconcile his accurate statement that "crash recovery" in CIFS is the "job of the application" with his opinion that "CIFS could be serious competition to NFS in the Linux world". Without real crash recovery, except perhaps for desktops, CIFS isn't a viable competitor to NFS. For example, you don't see Oracle recommending its database be used over CIFS. If CIFS had crash recovery, there might never have been an NFSv4.

In the Future NFS trends section, Kirch doubts whether NFSv4 will meet its goal of interoperability with the Windows world. It already has. Not in the sense that NFSv4 is widely deployed on Windows (even though Hummingbird has an NFSv4 client for Windows), but in the sense that with state, on multiprotocol servers like filers, NFSv4 clients can coordinate much better with CIFS clients, and a CIFS open cannot suddenly stop NFSv4 I/O to previously opened files, unlike NFSv3 I/O.

In the section "So how bad is it really", Kirch says NFSv4 ACLs aren't CIFS-compatible. News to those of us at NetApp. Our NFSv4 and NTFS ACLs are pretty much the same. As for there being "no mechanism to enforce NFSv4 ACLs locally, or via NFSv3", filers and other NFS servers enforce NFSv4 ACLs just fine, as do local file systems on conventional systems, like ZFS on Solaris. Perhaps he is talking about issues in Linux.

Kirch is correct that the inability to perform callbacks over an established TCP connection is an issue. NFSv4.1 will address it (another area of the NFSv4.1 spec that I've been hammering on). He also suggests NFS should have a better session protocol to enable a more efficient and robust replay detection cache. Again, to be fixed in NFSv4.1.

6 Comments:

Blogger Jeff Garzik said...

A few comments...

1) Linux time has nanosecond granularity these days. Quite true that native Linux filesystems backing a Linux NFS server might only support second granularity, including the most popular options, ext2/3 filesystems.

2) With regards to cluster filesystems, I think you misunderstood the "scaling above hundreds of nodes" statement. He is very likely referring to the size of the -server- grid, not the client count.

3) As the author of an NFSv4-only (TCP-only) server, I am very happy to hear that NFSv4.1 will support callbacks over the same TCP connection. Requiring a separate connection is a huge pain, and not friendly to firewalls.

4) I'm also interested in a robust session protocol. Similar to iSCSI, I would like to see the ability to support multiple connections (i.e. multiple paths) for a single session. And if the server implementation is smart enough, sharing that session across multiple servers.

Saturday, November 04, 2006 3:52:00 AM  
Anonymous Anonymous said...

In regards of cache consistency, Linux client pings NFS server every 30 seconds with RENEW op, not sure why. Related to delegation and making sure callback path is there maybe.

Thursday, November 09, 2006 5:13:00 PM  
Blogger Mike Eisler said...

> Linux client pings NFS server every
> 30 seconds with RENEW op, not sure why.

Alex,

If the client has an open, a byte range lock, or a delegation, then it has leased state, and it is obligated to renew that lease in order to keep its state. Otherwise, it can lose its open, lock, or delegation.
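To make the obligation concrete, a minimal sketch with made-up helper names; client_holds_state() and send_renew() are hypothetical stand-ins for the real client machinery:

    #include <stdbool.h>
    #include <unistd.h>

    /* Hypothetical stand-ins for the real NFSv4 client machinery. */
    extern bool client_holds_state(void);  /* any opens, locks, delegations? */
    extern void send_renew(void);          /* issue a RENEW to the server */

    /* Renew well inside each lease period, or the server may reap our
     * opens, locks, and delegations when the lease expires. */
    void lease_renewer(unsigned lease_seconds)
    {
        while (client_holds_state()) {
            send_renew();
            sleep(lease_seconds / 2);
        }
    }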

Thursday, November 09, 2006 7:47:00 PM  
Anonymous Anonymous said...

I'm not exactly sure, but how is an NFSv4 server supposed to support more than a handful of clients? Around here we have 3K+ clients, with 10-12 mounts each... that's 30K+ tcp connections last time I looked. Using V3, we have to make sure to use UDP, or all hell breaks loose...

Wednesday, November 22, 2006 8:11:00 AM  
Blogger Mike Eisler said...

This comment has been removed by the author.

Wednesday, November 22, 2006 10:52:00 AM  
Blogger Mike Eisler said...

Let's try this again, without the big typo. :-)


I'm not exactly sure, but how is an NFSv4 server supposed to support more than a handful of clients? Around here we have 3K+ clients, with 10-12 mounts each... that's 30K+ tcp connections last time I looked. Using V3, we have to make sure to use UDP, or all hell breaks loose...


Thanks for your comment, AC.

I've always been concerned, to the point of paranoia, about TCP connection scalability, having learned many hard lessons over the years. This is why the Solaris NFS/TCP client, which I worked on, uses a single TCP connection between the client and a server IP address, regardless of how many NFS mounts there are. I understand that the Linux NFS/TCP client now has a per-client/server-pair approach to creating connections.

NFS/TCP has gotten a bad rap. I'm happy to report that in the past year, as measured by NFS operations/sec, NetApp's customers collectively now send more NFS traffic over TCP than over UDP.

Several NetApp customers have grids of tens of thousands of NFS clients and use TCP to their filers. While these grids share a common pool of filers, all clients in the grid can have connections to all filers in the storage pool.

Wednesday, November 22, 2006 11:52:00 AM  
