Tuesday, April 11, 2006

NFSv3 Exclusive Create and NTFS Qtrees

A customer recently had trouble with NTFS qtrees in Data ONTAP when using NFSv3 or NFSv4 to gunzip some files. There was no such problem with NFSv2. The cause was narrowed down to the fact that gunzip, or at least the gunzip the customer was using, creates files with the exclusive create flag set.

Calling the open() system call with the O_EXCL flag (together with O_CREAT) tells the kernel (UNIX or Linux) to return an error if the specified file already exists, and otherwise to create the file. This allows applications that want to use lock files to work correctly. However, NFSv2 doesn't do anything special with exclusive create; its CREATE procedure is used for both exclusive and non-exclusive create. If the file already exists, NFSv2 CREATE just returns success from the NFSv2 server to the NFSv2 client. NFSv2 clients simulate the O_EXCL semantic by first issuing an over-the-network NFSv2 LOOKUP procedure to see if the file exists. If it does, the client returns an error to the process attempting the open(). Otherwise, it issues the CREATE and returns the NFSv2 server's result for the CREATE (which will likely be success, barring permissions issues, out-of-space issues, or other issues). Clearly this isn't useful for creating lock files from multiple NFS clients, because two clients could both find that a file does not exist, both issue the CREATE operation, and both get success.
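To make the race concrete, here is a minimal Python sketch of the LOOKUP-then-CREATE sequence. It is a toy in-memory model, not real NFS client or server code; ToyNfs2Server, exclusive_open_steps, and race are hypothetical names invented for the illustration, and the only behavior assumed is what is described above (CREATE succeeds even if the file exists, and the client checks with LOOKUP first).

```python
# Toy in-memory model of the NFSv2 behaviour described above.
# All names here are hypothetical; this is not real NFS code.

class ToyNfs2Server:
    def __init__(self):
        self.files = set()

    def lookup(self, name):
        """LOOKUP: report whether the name exists."""
        return name in self.files

    def create(self, name):
        """CREATE: succeeds whether or not the file already exists."""
        self.files.add(name)
        return "OK"


def exclusive_open_steps(server, name):
    """Client-side O_EXCL emulation, split into its two round trips.

    Written as a generator so a caller can interleave two clients and
    expose the window between LOOKUP and CREATE.
    """
    exists = server.lookup(name)      # round trip 1: LOOKUP
    yield "looked up"
    if exists:
        yield "EEXIST"                # error returned to the process
    else:
        server.create(name)           # round trip 2: CREATE
        yield "created"


def race(server, name):
    """Interleave two clients so both LOOKUPs precede either CREATE."""
    a = exclusive_open_steps(server, name)
    b = exclusive_open_steps(server, name)
    next(a)
    next(b)
    return next(a), next(b)


if __name__ == "__main__":
    # Both clients believe they created the lock file.
    print(race(ToyNfs2Server(), "app.lock"))
```

When the two clients run strictly one after the other, the second one does see the file and gets an error; the problem only appears when the LOOKUPs overlap, which is exactly the case a lock file is supposed to handle.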

Enter NFSv3 CREATE. The designers of NFSv3 (BTW, I'm a credited designer, but I can't take any credit for NFSv3 CREATE) produced a very clever yet simple algorithm for implementing exclusive create. Here are the arguments to NFSv3 CREATE:

CREATE3res NFSPROC3_CREATE(CREATE3args) = 8;

enum createmode3 {
     UNCHECKED = 0,
     GUARDED   = 1,
     EXCLUSIVE = 2
};

union createhow3 switch (createmode3 mode) {
case UNCHECKED:
case GUARDED:
     sattr3       obj_attributes;
case EXCLUSIVE:
     createverf3  verf;
};

struct CREATE3args {
     diropargs3   where;
     createhow3   how;
};

The key thing to understand is that if a non-exclusive create is done, the client provides an initial set of attributes, most likely consisting of the permission bits. However, if an exclusive create is done, the client provides no attributes, but does offer a 64-bit verifier. What happens in an exclusive CREATE is that the verifier is recorded in one of the new file's attributes. If for some reason the client has to retry the request due to a timeout or a server reboot, the retry uses the same verifier. Because the verifier in the request matches what is stored in the file, the server returns success to the client, rather than an NFS3ERR_EXIST error. If another client tries to do an exclusive CREATE around the same time, its verifier won't match what the server has recorded in the file, and so the other client gets NFS3ERR_EXIST. So now we have a perfect implementation of POSIX exclusive file create semantics. But we aren't quite done, because recall that the client didn't get to set the desired permission bits. The NFSv3 protocol requires the "winner" of the exclusive create to follow up with an NFSv3 SETATTR operation to set all the attributes, including the mode bits.
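The verifier check can be sketched in a few lines of Python. Again this is a toy model under stated assumptions, not a real server: ToyNfs3Server and demo are hypothetical names, the verifier is kept in its own field rather than in an actual file attribute, and the two 8-byte verifiers are fixed strings chosen only to keep the example deterministic.

```python
# Toy model of NFSv3 exclusive CREATE; hypothetical names throughout.

NFS3_OK = "NFS3_OK"
NFS3ERR_EXIST = "NFS3ERR_EXIST"


class ToyNfs3Server:
    def __init__(self):
        # name -> {"verf": bytes, "mode": int or None}
        self.files = {}

    def create_exclusive(self, name, verf):
        """EXCLUSIVE create: the client sends a verifier, no attributes."""
        entry = self.files.get(name)
        if entry is None:
            # New file: remember the verifier; the mode is not set yet.
            self.files[name] = {"verf": verf, "mode": None}
            return NFS3_OK
        if entry["verf"] == verf:
            # Same verifier: a retry of the request that created the file.
            return NFS3_OK
        # Different verifier: someone else created the file first.
        return NFS3ERR_EXIST

    def setattr(self, name, mode):
        """Follow-up SETATTR from the winner supplies the mode bits."""
        self.files[name]["mode"] = mode
        return NFS3_OK


def demo():
    server = ToyNfs3Server()
    verf_a = b"client-A"   # 8 bytes = 64 bits; arbitrary for the sketch
    verf_b = b"client-B"
    first = server.create_exclusive("app.lock", verf_a)
    retry = server.create_exclusive("app.lock", verf_a)  # e.g. after a timeout
    other = server.create_exclusive("app.lock", verf_b)
    server.setattr("app.lock", 0o644)                    # winner sets the mode
    return first, retry, other, server.files["app.lock"]["mode"]


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that the decision is made entirely on the server in a single request, so there is no window between a check and a create for a second client to slip into.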

Here is where we get into trouble with NTFS qtrees in ONTAP. With an NTFS qtree, CIFS and CIFS alone owns the security attributes of a file. So when the NFSv3 client issues the SETATTR to set things like owner, group, and mode bits, ONTAP returns an error. This causes an error to be returned to the process on the NFSv3 client that issued the open() with the O_EXCL|O_CREAT flags.

NFSv4 uses an OPEN operation, but OPEN implements exclusive create the same way.

The vexing thing is that the SETATTR is unnecessary because this is an NTFS qtree; NTFS has already filled in reasonable attributes for the file. But there's nothing in the NFSv[34] protocols to tell the client that.

What to do besides switching to NFSv2 or UNIX qtrees? You can enable the
cifs.ntfs_ignore_unix_security_ops

option on your filer. This option will cause ONTAP to ignore any NFS SETATTR requests, but return success instead of an error.
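The effect of the option on the exclusive-create sequence can be illustrated with another toy Python model. This is only a sketch of the behavior described in this post, not ONTAP's implementation: ToyQtree and exclusive_open are hypothetical names, the exclusive CREATE is assumed to have already succeeded, and "ERROR" is a placeholder because the specific error ONTAP returns is not stated here.

```python
# Toy model of the follow-up SETATTR against a qtree; hypothetical names.

class ToyQtree:
    def __init__(self, style, ignore_unix_security_ops=False):
        self.style = style                      # "ntfs" or "unix"
        self.ignore_unix_security_ops = ignore_unix_security_ops
        self.mode = None                        # UNIX mode bits, if any

    def setattr_mode(self, mode):
        if self.style == "ntfs":
            if self.ignore_unix_security_ops:
                # Request is ignored, but the client is told it worked.
                return "NFS3_OK"
            # CIFS owns the security attributes, so the request is refused.
            return "ERROR"                      # placeholder error
        self.mode = mode
        return "NFS3_OK"


def exclusive_open(qtree, mode=0o644):
    """Client side after a successful exclusive CREATE (assumed)."""
    status = qtree.setattr_mode(mode)           # the mandatory SETATTR
    if status == "NFS3_OK":
        return "open() succeeds"
    return "open() fails"


if __name__ == "__main__":
    print(exclusive_open(ToyQtree("unix")))
    print(exclusive_open(ToyQtree("ntfs")))
    print(exclusive_open(ToyQtree("ntfs", ignore_unix_security_ops=True)))
```

Note that in the third case the mode bits are never stored; the client simply stops seeing the failure, which is acceptable here because NTFS has already supplied attributes for the file.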

What I find very interesting is how rarely this situation comes up. Very few UNIX utilities apparently attempt exclusive creates. It is curious that gunzip does an exclusive create at all. But if you are depending on gunzip to fail when it attempts to overwrite an existing file, avoid NFSv2.