Friday, October 27, 2006

Review of "Why NFS Sucks" Paper from the 2006 Linux Symposium

Olaf Kirch of SUSE/Novell, a major Linux distributor, gave a talk on NFS on July 26, 2006, at the Linux Symposium. There were some press reports of his presentation that were stunning in their inaccuracies (e.g., that Sun invented RFS). In fairness, Olaf's paper has fewer errors, and since I wasn't there, I'll presume his presentation was no less accurate than his paper. Also, according to firsthand accounts from engineers I've exchanged email with, his presentation was far less critical of NFS than the paper. One attendee told me:

The parts of his talk that I did hear, though, left me with the impression that NFSv4 is the best thing since sliced bread since it fixes all the nits and problems with NFSv2/v3.

There were a few inaccuracies, but overall it was actually rather positive.


Kirch's paper pokes lots of holes, some of them legitimate, without always explaining why those holes are there or how hard it would be to fill them. You might gain a better understanding of NFS's warts by reading the original NFSv2 USENIX paper.

The first section on History claims that AT&T's RFS predated NFS, and that Sun designed NFS in reaction to weaknesses of RFS. That is reversed. Sun released NFS with SunOS 2.0 in 1985. RFS arrived in System V Release 3, which came in 1987. I was an employee of Lachman Associates, Inc. at the time, when Lachman obtained early access to System V Release 3 source code, and ported NFS from SunOS 2.0 to System V Release 3 during 1985 (Lachman also ported NFS to System V Release 2 in the same time frame). RFS was, if anything, a reaction to NFS, and is a classic example of the problems one will get if 100% adherence to POSIX semantics is the primary goal of a remote file access protocol. Kirch's explanation of the problems with RFS is correct, but later in his paper he criticizes NFS for not going down the same road.

The paper claims that the NFSv3 specification was written mostly by Rick Macklem, and published in 1995. RFC 1813, indeed published in 1995, documents the specification, but it was made available in PostScript form by Sun in 1993. The primary contributors to the specification were Brian Pawlowski, Peter Staubach, Brent Callaghan, and Chet Juszczak (Chet being the catalyst for finally getting the NFS industry to sit down at the 1992 Connectathon and get serious about NFSv3). Rick certainly contributed to the NFSv3 specification, but so did several others, and they are listed in the acknowledgements of RFC 1813. For what it is worth, Rick's contributions to NFSv3 outweighed mine.

Regarding the claim that WebNFS gained no real following outside of Sun, I know of many NetApp customers that use it from Solaris clients to NetApp filers. Without NFSv4, it is the most practical way to use NFS through a firewall. It is certainly the case that web browsers unfortunately don't support nfs:// URLs, though I noticed Mac OS X uses nfs:// syntax for some applications. In the Linux world there's no WebNFS following, but that is a function of Linux having no support for it.

Kirch states that the NFSv4 WG formed in reaction to Microsoft rebranding SMB as CIFS. Actually, the rebranding took place after Sun announced WebNFS. The Sun-hosted NFSv4 BOF at the 1996 San Jose IETF meeting took place after the Microsoft-hosted SMB BOF at the 1996 Montreal IETF meeting. I was at the SMB BOF in Montreal, and then co-chaired (with Brent Callaghan) the NFSv4 BOF at San Jose. Readers are free to connect the dots.

In the section on NFS file handles, Kirch notes the difficulties the Linux dentry model poses for NFS. NFS was around for years before Linux arrived. I submit that it "sucks" to design a VFS layer that didn't account for the most popular remote file access protocol of the time. Few UNIX systems, then or now, with a VFS layer share the problems the Linux VFS layer has with NFS.

In the section on write performance, Kirch claims "virtually all" NFS server implementations provide an option to turn off stable writes. Actually, Solaris never did, and NetApp's ONTAP never has, either. Those two are rather significant servers, and so "virtually all" is a stretch. Actually, I'm not even sure most servers had such an option.

At any rate, it is hard to understand what Kirch is arguing when he claims that even the safe unstable writes of NFSv3 are unsatisfactory. He doesn't offer any alternative for the problem of ensuring data reliability in the face of a server or client crash. Once storage is decoupled from the application, and the storage and compute environments can fail independently, this problem is unavoidable.

As for his claim that the performance gain of NFSv3 safe unstable writes is a mirage due to internal write buffers in modern disk drives that don't actually flush data: in my experience, NFS vendors are well aware of the issue, and spend a lot of engineering resources to keep those disk buffers stable or force them to disk. The SPEC SFS committee reviews benchmark submissions all the time, and rigorously enforces the requirement that committed NFSv3 writes go to stable storage.
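To make the "safe" in safe unstable writes concrete, here is a minimal sketch (illustrative Python, not real client or server code) of the NFSv3 UNSTABLE WRITE plus COMMIT pattern: the client buffers its writes and remembers the server's write verifier; if the verifier changes at COMMIT time, the server rebooted and may have lost the buffered data, so the client replays the uncommitted writes.

```python
# Sketch of the NFSv3 unstable-write protocol (hypothetical class names;
# a real client tracks this per file and per RPC).

class Server:
    def __init__(self):
        self.verifier = 1        # changes on every server reboot
        self.memory = {}         # data buffered in server RAM
        self.disk = {}           # data on stable storage

    def write_unstable(self, offset, data):
        self.memory[offset] = data
        return self.verifier

    def commit(self):
        self.disk.update(self.memory)   # force buffered data to disk
        self.memory.clear()
        return self.verifier

    def reboot(self):
        self.verifier += 1
        self.memory.clear()      # uncommitted data is lost

class Client:
    def __init__(self, server):
        self.server = server
        self.uncommitted = {}    # writes we may need to replay

    def write(self, offset, data):
        self.uncommitted[offset] = data
        self.verf = self.server.write_unstable(offset, data)

    def commit(self):
        if self.server.commit() != self.verf:
            # Verifier mismatch: server rebooted. Replay and re-commit.
            for off, data in self.uncommitted.items():
                self.verf = self.server.write_unstable(off, data)
            self.server.commit()
        self.uncommitted.clear()

srv = Server()
cli = Client(srv)
cli.write(0, b"hello")
srv.reboot()                     # crash before COMMIT
cli.commit()                     # mismatch detected; write replayed
assert srv.disk[0] == b"hello"   # data reached stable storage anyway
```

The point is that "unstable" does not mean "unsafe": the verifier lets the client detect a lost-write window and recover, which is why the performance gain is real and not a correctness trade-off.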

In the section on NFS over UDP, Kirch makes some concise and excellent arguments for why you should use NFS over TCP.

Kirch's criticisms in the Retransmitted Requests section are dead-on accurate. This is why NFSv4.1 will support true exactly-once semantics (I spent much of the summer getting the NFSv4.1 spec in shape for the exactly-once semantics description, which is why my blogging output has been pathetic of late).
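The core of the exactly-once mechanism is a bounded, sessions-style reply cache. The sketch below is illustrative Python (the class and parameter names are mine, not from the spec): each slot remembers the sequence number and reply of the last request it executed, so a retransmission gets the cached reply back instead of being executed a second time.

```python
# Hypothetical sketch of a sessions-style reply cache giving
# exactly-once semantics for non-idempotent requests.

class ReplyCache:
    def __init__(self, num_slots):
        # One (last_seqid, last_reply) pair per slot.
        self.slots = [(0, None)] * num_slots

    def handle(self, slot, seqid, execute):
        last_seqid, last_reply = self.slots[slot]
        if seqid == last_seqid:
            return last_reply        # retransmission: replay cached reply
        if seqid != last_seqid + 1:
            raise ValueError("misordered request")
        reply = execute()            # new request: run it exactly once
        self.slots[slot] = (seqid, reply)
        return reply

# A non-idempotent operation (think REMOVE) retransmitted after a
# timeout must not run twice.
count = 0
def remove():
    global count
    count += 1
    return "ok"

cache = ReplyCache(num_slots=4)
r1 = cache.handle(slot=0, seqid=1, execute=remove)
r2 = cache.handle(slot=0, seqid=1, execute=remove)  # retransmission
assert r1 == r2 == "ok" and count == 1
```

Because the client advances the sequence number only after it has seen a reply for a slot, the server knows exactly when a cached reply can be discarded, which is what makes the cache bounded and reliable, unlike the heuristic duplicate request caches Kirch rightly criticizes.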

There are some inaccuracies in the Cache Consistency section. Kirch claims the client revalidates the cache at regular intervals. Actually, clients set a time to live on the cache for a certain interval, and the next time the cache is accessed, it is revalidated only if the time to live has expired. So if a file is cached but not actively in use (no process is issuing read or write system calls against it), no over-the-wire revalidation requests (GETATTRs) occur.
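The distinction matters for idle files, so here is a minimal sketch of revalidate-on-access caching (illustrative Python; real clients keep separate TTLs for files and directories and scale them with attribute age):

```python
import time

# Hypothetical attribute cache: a GETATTR goes over the wire only when
# the cache is *consulted* after its time-to-live has expired. An idle
# cached file generates no revalidation traffic at all.

class AttrCache:
    def __init__(self, ttl, getattr_rpc):
        self.ttl = ttl
        self.getattr_rpc = getattr_rpc   # the over-the-wire GETATTR call
        self.attrs = None
        self.fetched_at = None

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        if self.attrs is None or now - self.fetched_at > self.ttl:
            self.attrs = self.getattr_rpc()  # revalidate only on access
            self.fetched_at = now
        return self.attrs

rpcs = 0
def getattr_rpc():
    global rpcs
    rpcs += 1
    return {"mtime": 1000}

cache = AttrCache(ttl=3.0, getattr_rpc=getattr_rpc)
cache.get(now=0.0)   # first access: one GETATTR
cache.get(now=1.0)   # within TTL: served from cache, no RPC
cache.get(now=5.0)   # TTL expired *and* accessed: revalidate
assert rpcs == 2
```

Between the access at t=1.0 and the access at t=5.0, no traffic occurs no matter how long the file sits idle, which is precisely what the "regular intervals" description gets wrong.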

Kirch also claims "most file systems store time stamps with second granularity". Perhaps in Linux this is the case. Outside the Linux world, file systems have been storing time stamps with microsecond or finer resolution for at least a decade, probably closer to two decades. It is certainly a huge problem if you are using Linux as your NFS server.

Kirch also glosses over the fact that applications that concurrently access the same file need a synchronization method, and that this method is usually byte range file locking. He mentions that NFS clients that set a byte range file lock will either bypass the cache for reads and writes, or invalidate the cache before each read and flush the cache after each write. But he doesn't note that even if an application were doing concurrent I/O to the same file on a local file system, synchronization would be necessary. This is no different than a multi-threaded application accessing a shared data structure. Synchronization primitives like spinlocks are needed, even when the data structure is kept in local memory.
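The synchronization method itself is nothing exotic; a sketch of POSIX byte-range locking via fcntl follows. On an NFS mount the same calls go through the lock protocol (NLM for v2/v3, built into the protocol for v4), and a well-behaved NFS client flushes its writes before the unlock so the next lock holder sees them; the application code is identical for local and remote files.

```python
import fcntl
import os
import tempfile

# Byte-range locking with fcntl.lockf: lock bytes 0-7 exclusively,
# update them, then unlock. (The file here is a local temp file for
# illustration; the calls are the same on an NFS mount.)
fd, path = tempfile.mkstemp()
os.write(fd, b"0" * 16)

fcntl.lockf(fd, fcntl.LOCK_EX, 8, 0)   # exclusive lock on bytes 0-7
os.pwrite(fd, b"UPDATED!", 0)          # update the locked range
fcntl.lockf(fd, fcntl.LOCK_UN, 8, 0)   # unlock; NFS flushes before this

os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 16)
os.close(fd)
os.unlink(path)
assert data == b"UPDATED!" + b"0" * 8
```

A second process trying to lock an overlapping range would block until the unlock, exactly as two threads contend on a spinlock; the remote case merely moves the arbitration to the server.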

It is hard to tell whether Kirch considers cache consistency a performance problem or a correctness problem, as he dismisses NFSv4 delegations, which are not available when there is contention, and in a later section notes that the cluster file systems he touted earlier as possible solutions have their own problems, including scaling "beyond a few hundred nodes". NetApp has many customers with grid computing farms of thousands to tens of thousands of NFSv3 clients accessing as few as one filer. For some, NFSv4 delegations will be very appropriate.

The section on POSIX Conformance is accurate.

The section on Access Control Lists is mostly accurate. Note that when the ACCESS procedure was introduced in NFSv3, ACLs weren't widely used in UNIX at all. ACCESS was needed anyway to deal with the situation where NFS servers mapped superuser (uid 0) to "nobody", but clients would let superuser open the file anyway, resulting in user surprises, like being able to read the parts of a file with mode 0000 that were in local cache but not the uncached bits. The deficiencies he states are issues in NFSv4 ACLs are actually problems with the Linux implementation, not the protocol itself. Kirch is accurate that mapping NFSv4 ACLs to draft (but never standardized) POSIX ACLs is not always possible. That is by intent; it was never a goal to provide a perfect mapping. NTFS ACLs have won, and it is time to move on from draft POSIX ACLs.

The section on NFS Security is accurate.

In the section on NFS File Locking, Kirch states that no one has explained why NFS originally did not support file locking. The explanation is that SunOS 2.0 was based on a 4.2BSD kernel, and 4.2BSD had very limited support for file locking. Only when SunOS added support for System V APIs, and complied with the System V Interface Definition (SVID), did Sun acknowledge the requirement to support byte range locking on NFS and local file systems. This section is mostly accurate, but skips noting the vast improvements NFSv4 makes over NFSv3 in terms of lock recovery.

Kirch's appraisals of AFS and CIFS are fairly accurate, though I cannot reconcile his accurate statement that "crash recovery" in CIFS is the "job of the application" with his opinion that "CIFS could be serious competition to NFS in the Linux world". Without real crash recovery, except perhaps for desktops, CIFS isn't a viable competitor to NFS. For example, you don't see Oracle recommending its database be used over CIFS. If CIFS had crash recovery, there might never have been an NFSv4.

In the Future NFS trends section, Kirch doubts whether NFSv4 will meet its goal of interoperability with the Windows world. It already has. Not in the sense that NFSv4 is widely deployed on Windows (even though Hummingbird has an NFSv4 client for Windows), but in the sense that with state, on multiprotocol servers like filers, NFSv4 clients can coordinate much better with CIFS clients, and a CIFS open cannot suddenly stop NFSv4 I/O to previously opened files, unlike NFSv3 I/O.

In the section "So How bad is it really", Kirch says NFSv4 ACLs aren't CIFS compatible. News to those of us at NetApp. Our NFSv4 and NTFS ACLs are pretty much the same. As for there being "no mechanism to enforce NFSv4 ACLs locally, or via NFSv3", filers and other NFS servers enforce NFSv4 ACLs just fine, as do local file systems on conventional systems, like ZFS on Solaris. Perhaps he is talking about issues in Linux.

Kirch is correct that the inability to perform callbacks over an established TCP connection is an issue. NFSv4.1 will address it (another area of the NFSv4.1 spec that I've been hammering on). He also suggests NFS should have a better session protocol to enable a more efficient and robust replay detection cache. Again, to be fixed in NFSv4.1.

OSDL's NFSv4 Press Release

I got a question about the implications about this excerpt from OSDL's NFSv4 press release:

The Open Source Development Labs (OSDL), the global consortium dedicated to accelerating the adoption of Linux® and open source software, today announced that the Network File System v4 (NFSv4) for Linux is available in Red Hat Enterprise Linux from Red Hat and SUSE Linux Enterprise from Novell. This milestone reflects the maturity of NFSv4 for Linux in the enterprise and coincides with Network Appliance’s latest donation of $100,000 to the NFSv4 testing community.

''NFS testing has been a key priority for OSDL and the Linux development community, and we have passed a significant milestone for it to be ready for enterprise validation,'' said Stuart Cohen, CEO of OSDL.

First, this is all good news, and it is consistent with the claims I made last year at SNIA and LISA that, unlike the history with NFSv3, Linux is not lagging the industry on NFSv4. There are several commercial NFS vendors that are behind Linux in NFSv4 support.

Second, given the juxtaposition of "test", "significant milestone", "enterprise", and "Linux", a reasonable reader might conclude that OSDL is stating that Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise (SLE) have passed all of OSDL's NFSv4 tests, and that NFSv4 on the current releases of those two distributions is enterprise-ready.

I asked around, and apparently OSDL did its testing on Linux kernel code from kernel.org, not on RHEL or SLE. RHEL and SLE, at the time this blog post was written, did not have all the necessary NFSv4 updates. I'm told that RHEL and SLE will need several updates from the mainline (kernel.org) code before both distributions have an NFSv4 implementation that is "ready for enterprise validation."