Wednesday, December 14, 2005

I gave a talk at LISA '05

I was in San Diego last week to give a talk at the Hit the Ground Running Session at the LISA '05 conference. USENIX has now posted the slides, or you can read them here:

As an interesting aside, you'll note the disposable email address in the first slide. This email address shows up in exactly two places on the web (at least according to Google), and Google only finds the PDF version at usenix.org. And yet this address has yet to be spammed. Interesting. I can believe spammers aren't adept at parsing text in images, but PDF?

Wednesday, September 21, 2005

Changes to comments settings

I've changed the comments settings for this blog so that non-Blogger members can post. However, comment posts now require word verification. We'll see whether this promotes spam or content.

opensolaris.org: the Future of Open Source Communities?

Three months ago I blogged about Sun posting its NFS client and server code to opensolaris.org. At the time I was thinking that, from the perspective of NFS implementers, this was a way to enhance interoperability, since the source code of one of the major, and certainly the most mature, NFS implementations was online for anyone to browse and use as a debugging aid, if not as a basis for implementing a competing NFS client or server.

A few weeks back I visited opensolaris.org to look for the source code to their oh-so-cool source code browser. After I didn't find it (yet: the author says posting it is on the TODO list), I visited some of the "communities" (basically interest groups for particular aspects of OpenSolaris) and was astonished by the high degree of activity from non-Sun employees. Curious, I went to the NFS community, and noticed that the blogs of the NFSers were linked from there (very nice). In particular, Eric Kustarz posted a blog article about a change to how the Solaris client deals with privileged ports.

Reading Eric's article and the comments, aside from some concerns I had about the change, I was struck by several thoughts. One, as with conventional open source, the world outside of Sun now has direct, and early (i.e. before first customer ship), visibility into what is going on. Two, unlike many large open source projects, the information for getting that visibility is well organized. I don't have to subscribe to an alias with thousands of messages per day in order to get that visibility into the particular parts of the operating system I'm interested in. In a sense, sourceforge, with its numerous project pages, has this already and probably was a source of inspiration. But sourceforge is a collection of projects, whereas opensolaris.org has a common "look and feel" to all the communities that are part of the greater whole: OpenSolaris. Three, since it is this easy to track what is going on in Solaris NFS land, maybe I could influence the outcome?

I tested the latter hypothesis by posting a comment to the NFS discussion forum. I suggested a slightly different approach after presenting some of the pitfalls of the change. Within minutes, Noel Dellofano responded and agreed to consider my comments.

This is revolutionary: as an employee of another NFS server vendor, I could influence the design and implementation of an important NFS client without having to wait for our mutual customers to file a trouble ticket. And as we all know, filing trouble tickets is not always the fastest way to get a resolution, because we are talking about code that has already been released, and vendors understandably have heavy processes for vetting and limiting change to released products. So that's what's in it for the customer, Sun, and other vendors: fewer interoperability bugs out of the chute.

But I think the bigger point is that because Sun has made opensolaris.org so easy to navigate, so easy to participate in, and so open to "outsiders" (not to mention flame free), those "outsiders" are going to find that they get much more leverage with OpenSolaris than with other open operating systems. By "leverage" I mean:
leverage = productive outcomes / time spent

Here's another example illustrating leverage. Without naming names, or naming operating systems, I once spent several hours debating a programmer on an issue with an open source operating system's NFS implementation. After noting that the open source operating system's file system design didn't lend itself well to supporting NFS semantics, one of the retorts I got back was: "I think the future of file access for [this operating system] is CIFS". That's low leverage.

High leverage attracts participation, and the path from participant to contributor can be a slippery slope.

Whether this higher leverage translates into increased market share for OpenSolaris versus other open source kernels remains to be seen. But the design and execution of opensolaris.org may represent the future of open source communities.

Wednesday, August 17, 2005

My Presentation at the Recent SNIA Conference

I gave a presentation at the 2005 SNIA Developer Solutions Conference, entitled the Future of NFS. You can read it now.

Tuesday, June 14, 2005

A great day for NFS interoperability and proliferation!

The opensolaris.org folks have released Solaris 10 source code. This is only about 6 months after the Solaris 10 FCS, which considering the legal issues is a tremendous accomplishment.
For NFSers this gives us access to source code of the Solaris 10 NFS client and server, especially the Solaris NFSv4 client and server. Make sure you understand the terms and conditions of the licenses for this code before using it.

It appears that most of the GSS-API code did not make it, including the RPCSEC_GSS source code. You'll have to scour the net for the tirpc-99 wad Sun released in 1999. One of these days, I'll have to post the source code for that.

Wednesday, June 08, 2005

Using Active Directory as your KDC for NFS

Recently I've been asked how to use Active Directory as the Key Distribution Center (KDC) for NFS, especially when used with NetApp filers and Linux 2.6 clients.

At the theoretical level, I've always known this was possible. I've used Solaris 10 NFSv[234] clients with filers configured to use Active Directory. I've used CITI's early access NFSv3 with Kerberos V5 authentication for Linux 2.4 with filers using Active Directory. And of course, back in my Sun days, I led the team that proved NFS clients and servers could authenticate via Active Directory, work which to this day is the best documented example of how to do so.

But now that Linux 2.6 with NFSv4 and NFS/Kerberos V5 authentication is getting more real, does this still work, and if so, with all 3 NFS versions? It is a reasonable question, since Linux 2.6 continues to change.

I'm happy to report that with Windows 2000 (and 2003!) as the KDC, Fedora Core 3 (Linux 2.6.11-1.27_FC3) as the NFS client, and Data ONTAP 7.0.0.1 as the NFSv4 server, the answer is yes, at least as measured by this trivial sanity checking script:
#!/bin/sh
# NFS/Kerberos sanity.sh for Linux 2.6
#
# Mounts the given export over every combination of transport (tcp, udp),
# NFS version (2, 3, 4), and security flavor (sys, krb5, krb5i), skipping
# NFSv4 over udp, and writes a 1 MB test file with dd on each mount.

if [ $# -lt 3 ]
then
    echo "Usage: $0 server_name server_export mount_point"
    echo "example:"
    echo "  $0 mre1.sim /vol/vol0/home /mnt"
    exit 1
fi

size=1m
file=$size.$$.`uname -n`
echo file = $file
serv=$1
fs=$2
mnt=$3

cd /
sudo umount -f $mnt

for proto in tcp udp
do
    case $proto in
    udp )
        moreopts=",rsize=4096,wsize=4096"
        ;;
    * )
        moreopts=""
        ;;
    esac

    for vers in 2 3 4
    do
        if [ $proto = udp ] && [ $vers = 4 ]
        then
            echo NFSv4 is not supported over udp
        else
            for sec in sys krb5 krb5i # krb5p not yet supported by Linux
            do
                echo ----------------------------------------
                case $vers in
                4 )
                    opts="-t nfs4 -o proto=$proto,sec=${sec}$moreopts"
                    ;;
                * )
                    opts="-o vers=$vers,proto=$proto,sec=${sec}$moreopts"
                    ;;
                esac

                if sudo mount $opts $serv:$fs $mnt
                then
                    cd $mnt
                    mount | grep -w $mnt
                    rm -f $file
                    if time dd if=/dev/zero of=$file bs=1024 count=1024
                    then
                        echo $opts PASS
                        rm -f $file
                    else
                        echo $opts FAIL
                        exit 1
                    fi
                else
                    echo sudo mount $opts failed. FAIL
                    exit 1
                fi

                cd /
                sudo umount -f $mnt
            done
        fi
    done
done

But before one runs this script, some configuration on the KDC, the Linux client, and the ONTAP filer is necessary.

Let's look at the KDC.

I am assuming that an Active Directory realm has been created. My example uses ADNFSV4.LAB.NETAPP.COM as the Kerberos realm.

The first thing we need to create is users. Let's walk through an example of creating a user named jsmith. The first step is to highlight the Users folder in Active Directory:

Highlight Users Folder
Windows Server 2000 Screenshot




Having done that, right click in the folder to pop up the action menu for the folder:


Pop Up Action Menu for Users
Windows Server 2000 Screenshot




Pick the New --> User option. Now we fill in the information. I find that the First name, Full name, and User login name have to agree with each other, but you may have a different experience:


Fill in Information for New User
Windows Server 2000 Screenshot



Now click next to get to the password setting window:


Password for New User
Windows Server 2000 Screenshot




Finally, we get to the confirmation window. Click finish to complete adding the user:



Confirmation Window for New User
Windows Server 2000 Screenshot




Now we see that the user, jsmith, is in the Users folder of the Active Directory realm:


Active Directory listing for ADNFSV4.LAB.NETAPP.COM realm
Windows Server 2000 Screenshot



Now we need to create a "machine" credential for the Linux NFS client. Currently, Linux 2.6 requires a credential of form:
nfs/hostname@REALM-NAME
Our host name will be scully.lab.netapp.com. The realm name is ADNFSV4.LAB.NETAPP.COM.

We start by creating yet another User principal.

You must create this principal as type User. Do NOT create the principal as type Computer. There is some dispute about this. Mario Wurzl says that he has no problem creating machine credential principals as type Computer. However, Microsoft's Kerberos Interoperability document says otherwise:
Use the Active Directory Management tool to create a new user account for the UNIX host:
  • Select the Users folder, right-click and select New, then choose user.
  • Type the name of the UNIX host.
The above passage is taken from a series of steps for creating a principal of the form host/hostname@REALM. We are ultimately going to create a principal of the form nfs/hostname@REALM, so I contend the above excerpt from Microsoft applies. It may be the case that principals of type Computer work fine for machine credentials. I have never tried that, and absent a compelling reason, won't try it.

As we will see, this principal can have any name, but let's use a convention:
servicename immediately followed by the Capitalized, unqualified hostname
E.g. concatenate the service name "nfs" with the capitalized base hostname "Scully". So, our new principal will be:
nfsScully
You might be asking: "Whoa, where did this weird convention come from? Why not just call the principal ``scully''?" The issue is that you may find you need multiple machine credentials for various services. You might need host/hostname@REALM, nfs/hostname@REALM and root/hostname@REALM. You can't use hostname as the user principal name for all three of these; see the examples below. Credit goes to my old Kerberos project team at Sun for coming up with this convention.
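Applied to the three services mentioned above, the convention gives user principals like these (the hostnames are from the running example, and only nfsScully is actually created in this walkthrough):

hostScully --> host/scully.lab.netapp.com@ADNFSV4.LAB.NETAPP.COM
nfsScully  --> nfs/scully.lab.netapp.com@ADNFSV4.LAB.NETAPP.COM
rootScully --> root/scully.lab.netapp.com@ADNFSV4.LAB.NETAPP.COM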

OK. Repeat the steps used to create principal jsmith in the Users folder for principal nfsScully.

The next step requires opening a Command Prompt window on the Windows 2000 server, and mapping nfsScully to its real machine principal,
nfs/scully.lab.netapp.com@ADNFSV4.LAB.NETAPP.COM
The command to do this is ktpass, and it is invoked as:
ktpass -princ nfs/scully.lab.netapp.com@ADNFSV4.LAB.NETAPP.COM -mapuser nfsScully -pass XXXXXXXX -out UNIXscully.keytab
I have deliberately italicized the XXXXXXXX in the above to indicate that a real password needs to be provided. (This password does not have to be the same as the one used when user principal nfsScully was created in the Active Directory GUI. In fact, I've never used the same password for the GUI and the ktpass command, and I cannot say whether it works if the passwords are the same.) You should generate password XXXXXXXX randomly, lest an attacker try to impersonate Linux client scully. And you should be doing all this on a secure connection to the Windows 2000 server, lest an attacker packet-sniff your session and grab the password. Here is a screen shot of the above example:


ktpass example
Windows Server 2000 Screenshot

You would then securely copy UNIXscully.keytab to
scully.lab.netapp.com:/etc/krb5.keytab
using a tool like scp (SSH for file copy). Note that it is possible on the Linux client to kinit to nfsScully via password XXXXXXXX. I think this is unfortunate. Machine credential passwords should be randomly generated keys that even you, the system administrator, don't know. Generate XXXXXXXX randomly and blind if possible, such as via a .bat script under the Windows 2000 command shell.
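Here is a minimal sketch of that copy step, plus a check on the Linux client that the keytab arrived intact. It assumes you can ssh as root to the client; klist -ke just lists the principals and encryption types stored in a keytab:

$ scp UNIXscully.keytab root@scully.lab.netapp.com:/etc/krb5.keytab
$ ssh root@scully.lab.netapp.com 'klist -ke /etc/krb5.keytab'

You should see an entry for nfs/scully.lab.netapp.com@ADNFSV4.LAB.NETAPP.COM.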

Now it is time to focus attention on the Linux client.

Log onto the Linux client, and create an /etc/krb5.conf file. Here is an example:
[libdefaults]
default_realm = ADNFSV4.LAB.NETAPP.COM
default_tkt_enctypes = des-cbc-md5 ; or des-cbc-crc
default_tgs_enctypes = des-cbc-md5 ; or des-cbc-crc

[realms]
ADNFSV4.LAB.NETAPP.COM = {
kdc=ant-c0.lab.netapp.com:88
default_domain=lab.netapp.com
}

[domain_realm]
.netapp.com = ADNFSV4.LAB.NETAPP.COM
.lab.netapp.com = ADNFSV4.LAB.NETAPP.COM
.sim.netapp.com = ADNFSV4.LAB.NETAPP.COM
.adnfsv4.lab.netapp.com = ADNFSV4.LAB.NETAPP.COM

[logging]
FILE=/var/krb5/kdc.log
It is important to realize that:
  • The encryption type specifiers (default_tkt_enctypes and default_tgs_enctypes, set to des-cbc-md5 or des-cbc-crc) cannot be omitted. Microsoft states:
    Only DES-CBC-MD5 and DES-CBC-CRC encryption types are available for MIT interoperability.
  • The [domain_realm] section that maps DNS domain names to the Active Directory realm is critical.
  • Active Directory only supports upper case realm names. This is the case even though the screen shots of the Windows 2000 Active Directory tree show a lower case domain name.
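Before moving on, a quick sanity check (strictly optional) is to verify that this krb5.conf lets the Linux client obtain tickets from the Active Directory KDC, using the jsmith user created earlier:
$ kinit jsmith
Password for jsmith@ADNFSV4.LAB.NETAPP.COM:
$ klist
klist should show a ticket granting ticket for the ADNFSV4.LAB.NETAPP.COM realm.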
You want to make sure gssd is running on the Linux client:
$ ps -eaf | grep gssd
root 2587 1 0 15:37 ? 00:00:00 rpc.gssd -m
If it is not, then you will need to start gssd:
# cd /
# /etc/init.d/rpcgssd stop

# /etc/init.d/rpcgssd start
You may have to set SECURE_NFS=yes in the /etc/sysconfig/nfs file to enable Kerberized NFS. If the file is empty or does not exist, you can simply do:
# echo "SECURE_NFS=yes" > /etc/sysconfig/nfs
If it already has other settings, edit it and add the SECURE_NFS=yes line instead, then restart rpcgssd as shown above.
That takes care of the KDC and NFS client. What of the filer?

ONTAP supports having the filer directly join an Active Directory realm, without having to use the ktpass command to produce a keytab. Indeed, if you are running CIFS as well as NFS, you have already joined the Active Directory realm as a consequence of running "cifs setup" at the filer's command line.

Prior to joining the Active Directory realm, we need to set the DNS server in the filer's resolv.conf file (in the etc subdirectory of the root volume [often /vol/vol0]) to refer to the IP address of the Active Directory server. If you do not do this, the filer will be unable to resolve the Active Directory realm to the Active Directory server. This does not mean the filer has to have its DNS domain name be the same as the Active Directory realm it belongs to; the example we've been working through assumes the DNS domain name and the Active Directory realm are different.
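For example, a minimal /vol/vol0/etc/resolv.conf on the filer might contain just one line; the IP address here is hypothetical, so substitute the address of your Active Directory server:
nameserver 10.10.10.10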

Invoke nfs setup on the filer's command line interface:
mre1> nfs setup
Enable Kerberos for NFS?
y

The filer supports these types of Kerberos Key Distribution Centers (KDCs):

1 - UNIX KDC
2 - Microsoft Active Directory KDC

Enter the type of your KDC (1-2):
2
The default name of this filer will be 'MRE1'.

Do you want to modify this name? [no]:

The filer will use Windows Domain authentication.

Enter the Windows Domain for the filer []:
ADNFSV4.LAB.NETAPP.COM
ADNFSV4.LAB.NETAPP.COM is a Windows 2000(tm) domain.

In order to create this filer's domain account, you must supply the name
and password of an administrator account with sufficient privilege to
add the filer to the ADNFSV4.LAB.NETAPP.COM domain.

Please enter the Windows 2000 user [Administrator@ADNFSV4.LAB.NETAPP.COM]:

Password for Administrator:

CIFS - Logged in as administrator@ADNFSV4.LAB.NETAPP.COM.
CIFS - Updating existing filer account
'cn=mre1,cn=computers,dc=adnfsv4,dc=lab,dc=netapp,dc=com'
CIFS - Connecting to domain controller.

Welcome to the ADNFSV4 (ADNFSV4.LAB.NETAPP.COM) Windows 2000(tm) domain.

Kerberos now enabled for NFS.

NFS setup complete.

If you have previously done a "cifs setup", then you won't be prompted for the realm, host name, and administrator login, because CIFS setup has already taken care of that. Both "nfs setup" and "cifs setup" create the "nfs/mre1.sim.netapp.com" principal on the Active Directory KDC. If you go back to the Windows 2000 server, you will see an entry for MRE1 in the Computers folder under the adnfsv4.lab.netapp.com tree.

(Note that if the Active Directory KDC is running Windows 2003, "nfs setup" will ask an additional question:
Active Directory container for filer account? [cn=computers]:
Simply press the Enter key.)

When using Active Directory as the KDC, no krb5.keytab is created. Instead, when the machine account MRE1 is created in the Active Directory database, the password for MRE1 (randomly generated by Data ONTAP) is recorded in stable storage in a file on the filer. The password for MRE1 is used to obtain service keys for CIFS and NFS, and potentially other Kerberized network services. Even if the password for administrator changes, the filer will still be able to obtain service keys for CIFS and NFS.

You also need to export the volumes with sec=krb5 or sec=krb5i (Linux currently does not support sec=krb5p). krb5 is plain authentication, krb5i is authentication with integrity protection on the requests and responses, and krb5p is like krb5i but also encrypts the requests and responses. If using NFSv4, it is critical to note that if an ancestor and a descendant directory are both exported, and the descendant is exported with sec=flavorX, then the ancestor must include flavorX in its list of flavors. So for example:
/vol/vol0 -sec=sys
/vol/vol0/home -sec=krb5
will break most NFSv4 clients. You will need to change this to:
/vol/vol0 -sec=sys:krb5
/vol/vol0/home -sec=krb5
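One easy thing to forget: changes to the filer's /etc/exports don't take effect until you re-export. On the filer console, something like the following should do it (exportfs -a re-exports everything listed in /etc/exports):
mre1> exportfs -a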
At this point you should be ready to try some NFS mounts. I suggest running the sanity test shell script listed earlier in this article to put the Linux NFS client through its paces. First you want to kinit as a user:
$ kinit jsmith
Password for jsmith@ADNFSV4.LAB.NETAPP.COM:
Then run the shell script:
$ sh sanity.sh mre1.sim /vol/vol0/home /mnt

Tuesday, April 19, 2005

Retries in NFSv4 Considered Harmful

I've been asked lately about this text in RFC3530, which I'm responsible for:
When processing a request received over a reliable transport such as TCP, the NFS version 4 server MUST NOT silently drop the request, except if the transport connection has been broken. Given such a contract between NFS version 4 clients and servers, clients MUST NOT retry a request unless one or both of the following are true:
  • The transport connection has been broken
  • The procedure being retried is the NULL procedure
Rather than resurrect the original discussion from the NFSv4 working alias (I think it was in 2002, since that's when I joined Network Appliance and rejoined the NFSv4 standards effort), resubmit it to the NFSv4 working alias, get asked about it again several years from now, do it all over again, etc., let this blog entry serve as my words on the issue.

But is there an exception for clients that retry a request that is already in progress on the NFSv4 server? Apparently some NFSv4 servers implement such an exception (a holdover from code inherited from NFSv3).

No, there is no exception. If there were an exception, then this would contradict, or at least be inconsistent with:
clients MUST NOT retry a request
So really the question is, why shouldn't clients be allowed to retry a request over the same, unbroken connection? The answer is that TCP guarantees delivery of data. Having an application-level retry above TCP defeats much of the purpose of TCP. NFSv4 supports unlimited transfer sizes. Typical NFSv4 servers allow 64 Kbyte transfers, but some support much more, including one megabyte or more. Allowing clients to retry a one megabyte, or even a 64 Kbyte, transfer over the same connection is an awful waste of resources on the client, the server, and the network. And it is unnecessary because the NFSv4 server MUST never drop a request unless the connection is broken. If you know your NFSv4 server follows that rule, then you need never retry a request unless the connection is broken.

A broken connection is the one out allowed by the NFSv4 specification, to handle the case of an NFSv4 server rebooting, or the case of a network partition causing a connection to time out. The latter is necessary if the client's API to the TCP connection doesn't have any feedback telling the sender that the receiver has acknowledged receipt of the data sent.

The other out allowed is retrying a NULL procedure. This is there to handle the case of an NFSv4 server crashing without sending a disconnect indication (in TCP, a FIN segment) to the client. The client is then unaware that the connection no longer exists on the server. If the client sends a request and the server crashes, the client will wait forever. Clients should wait a reasonable amount of time for a response (personally, I think 60-180 seconds is reasonable, and I wrote the Solaris 2.6 NFS/TCP client with such timeouts). Then they can either break the connection on their side and retry, or they can send a NULL procedure "ping" to the NFSv4 server. I prefer the latter, because NULL pings don't use many resources, and if the NFSv4 server is live, it saves the cost of unnecessarily re-sending a request that results in a big transfer.

Of course, some NFSv4 client implementors might be worried that there will still be some NFSv4 servers that drop requests without a disconnect indication, or a nice NFSv4 error code like NFS4ERR_DELAY or NFS4ERR_RESOURCE. Given that there are NFSv4 servers out there that don't follow the specification with respect to retries of requests that are in progress (and I thought it was clear that servers MUST NOT do that), I guess I have to accept that some NFSv4 servers might mistakenly drop requests. So clearly, if the NULL ping does not force a disconnect, and some number of seconds later the response to the original request still has not been received, NFSv4 clients have no recourse but to disconnect and retry.

Note that the reason NFSv4 introduces strict rules about retries over TCP is because there were no rules at all for NFSv3 over TCP. As a result some of the initial clients (and even some modern clients) had timeouts that were too short, servers would (and still do) drop requests, and client implementors really had no clear guidelines for when to retry and when to timeout. With NFSv4, we never retry on the same connection, and for the most part, rely on the timers built in to the connection-oriented transport. This seems like a better way, but that's just my opinion. :-)

Friday, March 11, 2005

retry= option to automount maps revisited

My colleague, Tom Haynes, points out that there can be issues with setting retry= to high values. I've updated the blog entry on automounter tuning appropriately. Bottom line: beware of doing anything higher than retry=2 on Solaris 10 (for now anyway), or force the NFS version to vers=3.

Thursday, March 10, 2005

Automounter Tuning

There are two aspects to the automounter that you should pay attention to as a system administrator:
  1. The retry count on mount attempts.
  2. The duration of a mount.
Let's look at the retry count. When you do a manual mount (i.e. when you use the mount command), the mount command will make a remote procedure call to the NFS server's mount daemon. If this call times out, it will try again, and again. The number of times it will try varies with each NFS client, but it is a big number. When I last looked at this, in Solaris 8, the number of retries was 10000. But the 10000 is the number of times the mount command calls the API to send a remote procedure call, usually a macro called CLNT_CALL(). Internally, CLNT_CALL() will retransmit multiple times, typically 5 times. So we are talking on the order of 50,000 attempts over the network. What this means is that by default, a mount command can take days to time out against a dead NFS server. This is why the bg option was added to the mount command, so that mounting NFS file systems automatically at boot time didn't prevent the system from coming up; bg stands for background, and after the first mount attempt the bg option forces the mount operation to continue in the background.

There is an NFS mount option, retry, which is used to change the default number of retries. You can do:

# mount -o retry=0 filer:/vol/vol0 /mnt

In which case a single CLNT_CALL() attempt will be made to access the NFS server, and the mount will fail if the call times out. Why would you want to do such a thing? You probably wouldn't. But if you use the automounter, most likely that's exactly what you are doing. Most automounters will make only one or two attempts to mount an NFS file system. That's not a very nice thing if the file system that doesn't get mounted is your home directory as you log in, or your database as your DBMS starts running. The good news is that you can override the automounter's default. Just add the option retry=1000 to your automounter maps, and you'll get much more robust automounting. The simplest approach is to add retry=1000 to the entries in your master automounter map file or table in NIS or LDAP; see the example entry below. Note that the retrans option has nothing to do with mount retries. Page 98 of my book talks about the retry and retrans options.
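For example, a hypothetical /etc/auto_master entry (the map name and mount point here are placeholders; adapt them to your site) would look like:
/home auto_home -nosuid,retry=1000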

Beware, though, of retry= values higher than 2 on some versions of Solaris. My co-author for Managing NFS and NIS, Second Edition, Ricardo Labiaga, who overhauled the automounter in Solaris 2.6, says that before Solaris 2.6 this was a problem, as http://www.sunhelp.org/faq/autofs.html (thanks to my colleague Tom Haynes for the link) points out:
CAUTION: this can "hold up" other automount requests
for 15 seconds per retry specified, on some versions of
Solaris. Do not make this value much larger than 2!!
I found with Solaris 8 that Ricardo is correct: retry=1000 works great. However, I had problems with Solaris 10. I set my master map, /etc/auto_master, to:
/net -hosts -nosuid,nobrowse,retry=10000
I then put one of my NFS servers (mre1.sim) into a break point so that it would not respond. Then I did:
% ls /net/mre1.sim &
As expected, the above hung.

Unfortunately, so did:
% ls /net/server2 &
% ls /net/server3 &
and server2 and server3 are live. Setting retry=5 wasn't very satisfying either; it took about a minute for the above to complete. As a workaround, I added "vers=3" to the map options, and things work correctly.
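In other words, the workaround map entry ends up looking something like:
/net -hosts -nosuid,nobrowse,vers=3,retry=10000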

Let's look at the duration. The automounter is also an auto-unmounter. The idea is that when NFS file systems are no longer used, the automounter should unmount them. This is a good thing, because from time to time automounter maps are changed, and if the automounter never unmounted anything, then the map updates would never be seen by the client. Ideally, the automounter would attempt an unmount only when it knew the file system hadn't been used for some amount of time. However, automounters don't have an interface for knowing whether any processes currently have files open in the NFS file systems. As a result, the automounter has a simple-minded approach: it waits some number (N) of seconds and then attempts to unmount a file system, and it does this every N seconds. If the file system is in use (busy), the unmount fails.

It turns out that an unmount attempt of a busy file system can be really bad performance-wise. An unmount attempt will flush all cached data, force all modified but unwritten blocks to be written to the NFS server, and flush all cached metadata (attributes, directories, and name cache). At the end of that, if there are still references to the filesystem, the unmount fails. This means that the processes benefiting from caching will now take latency hits as their working sets of cached data are rebuilt.

Thus, you will want to consider tuning your automount duration higher. For example, the automount command in Solaris has a -t option to override the default duration of 600 seconds. You want to strike a balance between good performance and the benefits of re-synchronizing with automounter map updates. If you change the location of an NFS file system no more than once a month, then setting the timeout to 86,400 seconds (24 hours) is reasonable. If you are changing things once every few days, you might find 3600 seconds is short enough; I have many years of experience with -t 3600 and can vouch for it. Chapter 9 of my book goes into a deep discussion of the automounter, including the -t option.
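For example, to bump the duration to one hour on Solaris you would run something like the following (a sketch; on a production system you would normally arrange for the autofs startup script to pass this option instead of running it by hand):
# automount -t 3600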

Wednesday, March 09, 2005

What's the deal on the 16 group id limitation in NFS?

So the executive summary here is: the 16 group id limit comes from the AUTH_SYS RPC authentication flavor, not from NFS itself, and the way around it is to use RPCSEC_GSS (e.g. over Kerberos V5), ideally with NFSv4, and/or to use NFSv4 ACLs. Now I'll provide the deeper explanation for why.

NFS is built on ONC RPC (Sun RPC). NFS depends on RPC for authentication and identification of users. Most NFS deployments use an RPC authentication flavor called AUTH_SYS (originally called AUTH_UNIX, but renamed to AUTH_SYS).

AUTH_SYS sends 3 important things:
  • A 32 bit numeric user identifier (what you'd see in the UNIX /etc/passwd file)
  • A 32 bit primary numeric group identifier (ditto)
  • A variable length list of up to 16 32-bit numeric supplemental group identifiers (what you'd see in the /etc/group file)
So the 16 group id limit actually refers to the supplemental group identifiers, and it is specific to AUTH_SYS, not NFS. It is just that NFS (i.e. Not For Security :) has historically been deployed with AUTH_SYS. It doesn't help either that most, if not all NFS clients and servers use AUTH_SYS by default, even if they support better forms of authentication like AUTH_DH (AUTH_DES) or RPCSEC_GSS (both AUTH_DH and RPCSEC_GSS rely on cryptography to authenticate users).

It turns out that with 800 (someday I'll talk about why that limit is there) available bytes of authentication stuff in the variable length ONC RPC header for credentials and verifier, we could actually support nearly 200 supplemental group identifiers. So why don't NFS clients and servers do that?
  • The standard (yes, AUTH_SYS is part of an IETF standard) says 16. An NFS client that sends more is breaking the standard, and if it did send more, and the server rejected it (per the standard), what would the client do? It would have to truncate the number of supplemental group identifiers. Which 16 would it pick?
  • An NFS server could be forgiving and accept more than 16 supplemental group identifiers, but that then begs the question as to which client is going to send more, given the first bullet item.
So why does the standard limit us to 16 group identifiers? The value 16 is a reflection of what UNIX operating systems supported at the time (the 1980s). Indeed, when Sun owned and controlled ONC RPC (before graciously giving the IETF control), my foggy recollection (and I'm really dating myself here) is that AUTH_SYS started off with 8, then went to 12, and finally settled on 16 supplemental group identifiers. Since then, most AUTH_SYS clients and servers live in operating environments and file systems that support at least 32 supplemental group identifiers. Which is great if you don't have to use NFS to access data. Even if an NFS client's operating environment supports more than 16 supplemental groups, in every case I know of, the NFS client will refuse to violate the AUTH_SYS standard and so will not send more than 16 supplemental groups. Some clients will truncate the number of supplemental groups to 16, and others will simply refuse to issue the NFS/AUTH_SYS request. So even if an NFS server wanted to be forgiving and accept AUTH_SYS requests that had more than 16 supplemental groups, this would be in vain.

So how do we get out of this?
  1. One possible answer is to create an RPC authentication flavor like AUTH_SYS but with no limit on the number of group identifiers. The trouble is, AUTH_SYS is really bad. It isn't rocket science to exploit it. The 'Net is a much more dangerous place today than in the 1980s, and so it would be unethical for the IETF to publish an AUTH_SYS_PLUS standard. In theory, nothing prevents someone from asking IANA for a new ONC RPC flavor number, building their own authentication flavor that does just that, and publishing it. But I think it would be unethical for vendors of NFS software to support it. Then again, the free market often trumps ethics, so we'll see if any vendor cracks first. And gee, why stop at ~200 group identifiers? Just ignore the 800 byte limitation in the ONC RPC header, and send as many as the client wants. But as we will see later, supporting nearly 200 supplemental group identifiers has other issues beyond ONC RPC and NFS.
  2. Another way is to use a flavor like RPCSEC_GSS, which doesn't send group identifiers. Instead, it lets the NFS server decide what groups the user is in (the server determining access controls; what a novel concept!) based on the local /etc/group file or group tables in NIS or LDAP. Since there is no group id array in the RPC message, only internal NFS server limitations get in the way. NetApp's ONTAP server, for example, supports 32 supplemental group identifiers. Last I checked, Solaris was either unlimited or up to 64, subject to a tunable parameter. A side benefit of RPCSEC_GSS, if used over something like Kerberos V5 or public key certificates, is that it gives you true authentication.
Does RPCSEC_GSS completely get you out of the 16 group id tangle? Not quite. As my colleague Chuck Lever pointed out to me recently, there is this side band protocol called NLM used for advisory byte range locking. I've seen just one NLM client use RPCSEC_GSS, and it wasn't Linux or Solaris. And not all NLM servers support RPCSEC_GSS. Practically speaking, this means that you have to either not use locking (for example use the llock mount option in Solaris, or use the nolock option in Linux), or you'll have to use NFS version 4.
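For instance, a hypothetical Linux NFSv3 mount that avoids NLM, reusing the server and export names from the Active Directory example in my earlier post:
# mount -o vers=3,sec=krb5,nolock mre1.sim:/vol/vol0/home /mnt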

NFS version 4 combines locking and filing (and mounting) in a single protocol. So use NFS version 4 with RPCSEC_GSS to blast past the 16 group identifier limitation.
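Concretely, reusing the NFSv4 mount syntax from the Linux sanity script in my earlier Active Directory post (hostnames and paths borrowed from that example):
# mount -t nfs4 -o proto=tcp,sec=krb5 mre1.sim:/vol/vol0/home /mnt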

Some caveats:
  • Most people use a directory service like NIS or LDAP to store their supplemental group identifier information. If you establish more than 16 supplemental groups in NIS or LDAP for your users, you'll want to make sure that all your other NFS clients support NFSv4 and support RPCSEC_GSS, and of course are configured to use Kerberos V5.
  • For a similar reason, make sure your NFS clients can support more than 16 group identifiers per user. When a user logs into his desktop system, the operating system will establish his credentials. If the user is in more than 16 groups, he may well be denied login access if his home directory is NFS mounted.
So you might ask, this is great, but why am I limited to 32 or 64 group identifiers? The reason relates to how operating systems set up their in-kernel credentials. Usually the supplemental group identifiers are a simple array of integers. This means that each access attempt can require searching the entire array of integers. That is one thing if the array is 16-64 group identifiers, but get into the 100s to 1000s or more, and the performance impact of that many group identifiers might start to get in the way. An answer might be to organize in-kernel credentials as hash tables or trees, but this has costs too. Not to mention that as each in-kernel credential gets bigger, the impact on kernel memory usage, which takes away from applications, becomes important.

Another approach to consider is ACLs. NFSv4 has them. An ACL (Access Control List) is a list of ACEs (Access Control Entries). In NFSv4 an ACE is basically:
  • user name or group name
  • permission bits
  • whether the named user or group is being denied or allowed access
How does this solve the problem that lots of groups solves? For a given file, you can list a bunch of users that are allowed access, and there is no over the network specification that limits how many user ACEs you can have in an ACL. The limits are purely on the server. So for a given set of files, you can let lots of users and lots of different sets of users access each file. Compare that to what lots of supplemental groups do for you. Each file has a single group id assigned to it, and you can then assign a lot of users to the group id in /etc/group or the group table in NIS or LDAP. You can assign a different group id to each file. So for a set of files, you can grant access to lots of users, and lots of different sets of users. Semantically the same.

So what ACLs do for the NFS community is make extended access purely a server problem in terms of flexibility and performance. Of course, there needs to be a way to edit the ACLs on a given file, which is what NFSv4 does for you.