Friday, March 11, 2005

retry= option to automount maps revisited

My colleague, Tom Haynes, points out that there can be issues with setting retry= to high values. I've updated the blog entry on automounter tuning appropriately. Bottom line: beware of doing anything higher than retry=2 on Solaris 10 (for now anyway), or force the NFS version to vers=3.

Thursday, March 10, 2005

Automounter Tuning

There are two aspects to the automounter that you should pay attention to as a system administrator:
  1. The retry count on mount attempts.
  2. The duration of a mount.
Let's look at the retry count. When you do a manual mount (i.e. when you use the mount command), the mount command makes a remote procedure call to the NFS server's mount daemon. If this call times out, it will try again, and again. The number of times it will try varies with each NFS client, but it is a big number. When I last looked at this, in Solaris 8, the number of retries was 10000. But the 10000 is the number of times the mount command calls the API to send a remote procedure call, usually a macro called CLNT_CALL(). Internally, CLNT_CALL() will retransmit multiple times, typically 5 times. So we are talking on the order of 50,000 attempts over the network. What this means is that by default, a mount command can take days to time out against a dead NFS server. This is why the bg option was added to the mount command, so that mounting NFS file systems automatically at boot time didn't prevent the system from coming up; bg stands for background, and after the first failed attempt, the bg option makes the mount operation keep retrying in the background.
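For example, a boot-time mount with the bg option (the server name is just illustrative) gives up on the foreground quickly and keeps retrying in the background:

# mount -o bg filer:/vol/vol0 /mnt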

There is an NFS mount option, retry, which is used to change the default number of retries. You can do:

# mount -o retry=0 filer:/vol/vol0 /mnt

In that case, a single CLNT_CALL() attempt will be made to access the NFS server, and the mount will fail if the call times out. Why would you want to do such a thing? You probably wouldn't. But if you use the automounter, most likely that's exactly what you are doing. Most automounters will make only one or two attempts to mount an NFS file system. That's not a very nice thing if the file system that doesn't get mounted is your home directory as you log in, or your database as your DBMS starts running. The good news is that you can override the automounter's default. Just add the option retry=1000 to your automounter maps, and you'll get much more robust automounting. The simplest approach is to add retry=1000 to the entries in your master automounter map file or table in NIS or LDAP. Note that the retrans option has nothing to do with mount retries. Page 98 of my book talks about the retry and retrans options.
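For example, a master map line like this one (the /home mount point and auto_home map name are just illustrative) picks up the more robust retry behavior for every entry in that map:

/home   auto_home   -nosuid,retry=1000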

Beware, though, of retry= values higher than 2 on some versions of Solaris. My co-author for Managing NFS and NIS, Second Edition, Ricardo Labiaga, who overhauled the automounter in Solaris 2.6, says that this was a problem before Solaris 2.6, as http://www.sunhelp.org/faq/autofs.html (thanks to my colleague Tom Haynes for the link) points out:
CAUTION: this can "hold up" other automount requests
for 15 seconds per retry specified, on some versions of
Solaris. Do not make this value much larger than 2!!
I found with Solaris 8 that Ricardo is correct; retry=1000 works great. However, I had problems with Solaris 10. I set my master map, /etc/auto_master, to:
/net -hosts -nosuid,nobrowse,retry=10000
I then stopped one of my NFS servers (mre1.sim) at a breakpoint so that it would not respond. Then I did:
% ls /net/mre1.sim &
As expected, the above hung.

Unfortunately, so did:
% ls /net/server2 &
% ls /net/server3 &
even though server2 and server3 are live. Setting retry=5 wasn't very satisfying either; it took about a minute for the above to complete. As a workaround, I added vers=3 to the map options, and things work correctly.
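For the record, the /net entry with the workaround applied looks like this (the same options as above, plus vers=3):

/net -hosts -nosuid,nobrowse,vers=3,retry=10000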

Let's look at the duration. The automounter is also an auto-unmounter. The idea is that when NFS file systems are no longer used, the automounter should unmount them. This is a good thing, because from time to time, automounter maps are changed. If the automounter never unmounted anything, then the map updates would never be seen by the client. Ideally, the automounter would attempt an unmount only when it knew the file system hadn't been used for some amount of time. However, automounters don't have an interface to know if there are any processes currently with open files in the NFS file systems. As a result, the automounter has a simple-minded approach: it waits some number (N) of seconds, then attempts to unmount a file system, and does this every N seconds. If the file system is in use (busy), the unmount fails.

It turns out that an unmount attempt of a busy file system can be really bad performance-wise. An unmount attempt will flush all cached data, force all modified but unwritten blocks to be written to the NFS server, and flush all cached metadata (attributes, directories, and name cache). At the end of that, if there are still references to the filesystem, the unmount fails. This means that the processes benefiting from caching will now take latency hits as their working sets of cached data are rebuilt.

Thus, you will want to consider tuning your automount duration higher. For example, the automount command in Solaris has a -t option to set the duration, overriding the default of 600 seconds. You want to strike a balance between good performance and the benefits of re-synchronization with automounter map updates. If you change the location of an NFS file system no more than once a month, then setting the timeout to 86,400 seconds (24 hours) is reasonable. If you are changing things once every few days, you might find 3600 seconds is short enough; I have many years of experience with -t 3600 and can vouch for it. Chapter 9 of my book goes into a deep discussion of the automounter, including the -t option.
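For example, to raise the duration to one hour on a Solaris client (on Solaris 10, if I recall correctly, you can also make this persistent by setting AUTOMOUNT_TIMEOUT in /etc/default/autofs):

# automount -t 3600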

Wednesday, March 09, 2005

What's the deal on the 16 group id limitation in NFS?

So the executive summary here is:
  • The 16 group id limit comes from the AUTH_SYS RPC authentication flavor, not from NFS itself.
  • To get past it, use RPCSEC_GSS (ideally with NFS version 4), or lean on NFSv4 ACLs instead of piling on groups.
Now I'll provide the deeper explanation for why.

NFS is built on ONC RPC (Sun RPC). NFS depends on RPC for authentication and identification of users. Most NFS deployments use an RPC authentication flavor called AUTH_SYS (originally called AUTH_UNIX, but renamed to AUTH_SYS).

AUTH_SYS sends 3 important things:
  • A 32-bit numeric user identifier (what you'd see in the UNIX /etc/passwd file)
  • A 32-bit primary numeric group identifier (ditto)
  • A variable-length list of up to 16 32-bit numeric supplemental group identifiers (what you'd see in the /etc/group file)
So the 16 group id limit actually refers to the supplemental group identifiers, and it is specific to AUTH_SYS, not NFS. It is just that NFS (i.e. Not For Security :) has historically been deployed with AUTH_SYS. It doesn't help either that most, if not all, NFS clients and servers use AUTH_SYS by default, even if they support better forms of authentication like AUTH_DH (AUTH_DES) or RPCSEC_GSS (both AUTH_DH and RPCSEC_GSS rely on cryptography to authenticate users).
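A quick way to see how many group ids a client would have to squeeze into an AUTH_SYS credential for a given user is something like this (jdoe is a hypothetical user; on Solaris you may need /usr/xpg4/bin/id to get the -G option):

% id -G jdoe | wc -w

Anything over 17 (one primary gid plus 16 supplemental gids) won't fit.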

It turns out that with 800 available bytes (someday I'll talk about why that limit is there) of authentication stuff in the variable-length ONC RPC header for credentials and verifier, we could actually support nearly 200 supplemental group identifiers: each identifier is a 4-byte XDR integer, so 800 bytes works out to about 200 of them. So why don't NFS clients and servers do that?
  • The standard (yes, AUTH_SYS is part of an IETF standard) says 16. An NFS client that sends more is breaking the standard, and if it did send more, and the server rejected it (per the standard), what would the client do? It would have to truncate the number of supplemental group identifiers. Which 16 would it pick?
  • An NFS server could be forgiving and accept more than 16 supplemental group identifiers, but that then begs the question as to which client is going to send more, given the first bullet item.
So why does the standard limit us to 16 group identifiers? The value 16 is a reflection of what UNIX operating systems supported at the time (the 1980s). Indeed, when Sun owned and controlled ONC RPC (before graciously giving IETF control), my foggy recollection (and I'm really dating myself here) is that AUTH_SYS started off with 8, then went to 12, and finally settled on 16 supplemental group identifiers. Since then, most AUTH_SYS clients and servers live in operating environments and file systems that support at least 32 supplemental group identifiers. Which is great if you don't have to use NFS to access data. Even if an NFS client's operating environment supports more than 16 supplemental groups, in every case I know of the NFS client will refuse to violate the AUTH_SYS standard, and so it will not send more than 16 supplemental groups. Some clients will truncate the number of supplemental groups to 16, and others will simply refuse to issue the NFS/AUTH_SYS request. So even if an NFS server wanted to be forgiving and accept AUTH_SYS requests that had more than 16 supplemental groups, this would be in vain.

So how do we get out of this?
  1. One possible answer is to create an RPC authentication flavor like AUTH_SYS but with no limit on the number of group identifiers. The trouble is, AUTH_SYS is really bad. It isn't rocket science to exploit it. The 'Net is a much more dangerous place today than in the 1980s, and so it would be unethical if IETF published an AUTH_SYS_PLUS standard. In theory, nothing prevents someone from asking IANA for a new ONC RPC flavor number, building their own authentication flavor that does just that, and publishing it. But I think it would be unethical for vendors of NFS software to support it. The free market often trumps ethics, though, so we'll see if any vendor cracks first. And gee, why stop at ~200 group identifiers? Just ignore the 800-byte limitation in the ONC RPC header, and send as many as the client wants. But as we will see later, supporting nearly 200 supplemental group identifiers has other issues beyond ONC RPC and NFS.
  2. Another way is to use a flavor like RPCSEC_GSS, which doesn't send group identifiers. Instead, it lets the NFS server decide what groups the user is in (the server determining access controls; what a novel concept!) based on the local /etc/group file or group tables in NIS or LDAP. Since there is no group id array in the RPC message, only internal NFS server limitations get in the way. NetApp's ONTAP server, for example, supports 32 supplemental group identifiers. Last I checked, Solaris was either unlimited or up to 64, but it was subject to a tunable parameter. A side benefit of RPCSEC_GSS is that, if used with something like Kerberos V5 or public key certificates, it gives you true authentication.
Does RPCSEC_GSS completely get you out of the 16 group id tangle? Not quite. As my colleague Chuck Lever pointed out to me recently, there is a side-band protocol called NLM used for advisory byte-range locking. I've seen just one NLM client use RPCSEC_GSS, and it wasn't Linux or Solaris. And not all NLM servers support RPCSEC_GSS. Practically speaking, this means that you either have to forgo NLM locking (for example, use the llock mount option in Solaris, or the nolock option in Linux), or you'll have to use NFS version 4.
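For example (hypothetical server and export names), these mounts keep lock requests local to the client, so NLM and its RPCSEC_GSS limitations never come into play. On Solaris:

# mount -o llock filer:/vol/vol0 /mnt

On Linux:

# mount -o nolock filer:/vol/vol0 /mnt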

NFS version 4 combines locking and filing (and mounting) in one single protocol. So use NFS version 4 with RPCSEC_GSS to blast past the 16 group identifier limitation.
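A minimal sketch of such a mount on a Solaris 10 client (the server and path are hypothetical, and Kerberos V5 is assumed to already be configured on both ends):

# mount -o vers=4,sec=krb5 filer:/export/home /mnt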

Some caveats:
  • Most people use a directory service like NIS or LDAP to store their supplemental group identifier information. If you establish more than 16 supplemental groups in NIS or LDAP for your users, you'll want to make sure that all your other NFS clients support NFSv4 and support RPCSEC_GSS, and of course are configured to use Kerberos V5.
  • For a similar reason, make sure your NFS clients can support more than 16 group identifiers per user. When a user logs into his desktop system, the operating system will establish his credentials. If the user is in more than 16 groups, he may well be denied login access if his home directory is NFS mounted.
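If you want to audit for this ahead of time, a rough sketch along these lines, run against the same NIS or LDAP group tables your clients use, lists users with more than 16 supplemental group memberships (it only counts explicit memberships in the group map, so treat the output as approximate):

% getent group | awk -F: '
      { n = split($4, members, ",")
        for (i = 1; i <= n; i++) count[members[i]]++ }
      END { for (user in count)
                if (count[user] > 16)
                    print user " is in " count[user] " supplemental groups" }'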
So you might ask: this is great, but why am I limited to 32 or 64 group identifiers? The reason relates to how operating systems set up their in-kernel credentials. Usually the supplemental group identifiers are a simple array of integers. This means that each access check can require searching the entire array of integers. This is one thing if the array holds 16-64 group identifiers, but get into the 100s or 1000s, and the performance impact of that many group identifiers might start to get in the way. An answer might be to organize in-kernel credentials as hash tables or trees, but this has costs too. Not to mention that as each in-kernel credential gets bigger, the impact on kernel memory usage, which takes memory away from applications, becomes important.

Another approach to consider is ACLs. NFSv4 has them. An ACL (Access Control List) is a list of ACEs (Access Control Entries). In NFSv4 an ACE is basically:
  • user name or group name
  • permission bits
  • whether the named user or group is being denied or allowed access
How does this solve the same problem that lots of groups solves? For a given file, you can list a bunch of users that are allowed access, and there is no over-the-network specification that limits how many user ACEs you can have in an ACL. The limits are purely on the server. So for a given set of files, you can let lots of users, and lots of different sets of users, access each file. Compare that to what lots of supplemental groups do for you. Each file has a single group id assigned to it, and you can then assign a lot of users to the group id in /etc/group or the group table in NIS or LDAP. You can assign a different group id to each file. So for a set of files, you can grant access to lots of users, and lots of different sets of users. Semantically the same.
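For example, on a Solaris 10 client talking NFSv4 to a server with ACL support, something along these lines grants one extra user read access without touching group membership at all (the user name and file path are made up, and the exact ACE syntax can vary by release):

% chmod A+user:jdoe:read_data/read_attributes:allow /net/filer/proj/plan.txt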

So what ACLs do for the NFS community is make extended access purely a server problem in terms of flexibility and performance. Of course, there needs to be a way to edit the ACLs on a given file, which is what NFSv4 provides for you.