Watch out for Python's `fcntl.flock` on Solaris

My employer uses 4Suite, a content management system written in Python, to serve requests for resources to both man and machine. It spawns off multiple processes, each of which can handle requests independently. These requests can certainly affect the same resources, which is why proper care must be taken to ensure that only one process is working on a particular resource at a time. Deep in the bowels of our code, we use Python's fcntl.flock to lock this resource, but it wasn't working. After spending a painful amount of time debugging this, I finally figured it out today, and I wanted to share what I've learned with you.

The problem manifested, of course, as missing data. If you have two different data sources trying to update a resource at the same time, then their changes can conflict, and only the last one to make its changes actually has any effect. Specifically, you can have two processes read the contents of the same file, then they both do some work based on that file's contents, then one of the processes writes the new file, and finally the second process writes its version of the new file. The second process wins, and the data written by the first process is discarded. This is why it's important for the first process to first lock the file, so that the second process (and beyond) will not even start to work with the file until the first one is finished. When I looked in the server logs, though, I could clearly see two processes working at the same time. Bad news.
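To make that concrete, here is a minimal sketch of a read-modify-write cycle done under a lock. The helper name `locked_update` and its `transform` callback are made up for this illustration and are not taken from our code base, and the sketch is written for a current Python rather than an old one.

```python
import fcntl

def locked_update(path, transform):
    """Read path, apply transform to its text, and write the result back,
    holding an exclusive flock for the whole read-modify-write cycle.

    Hypothetical helper, for illustration only."""
    f = open(path, 'a+')                  # 'a+' creates the file if it is missing
    try:
        fcntl.flock(f, fcntl.LOCK_EX)     # wait until nobody else holds the lock
        try:
            f.seek(0)
            old = f.read()                # read ...
            new = transform(old)          # ... modify ...
            f.seek(0)
            f.truncate()
            f.write(new)                  # ... write
            f.flush()                     # push the data out before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    finally:
        f.close()
    return new
```

Something like `locked_update('some-filename', lambda text: text + 'one more line\n')` would then be safe to call from several processes at once, as long as the lock really does hold for the whole cycle.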

How should this stuff work? Well, fire up two Python consoles and let's take a look. In both Python consoles, we'll open and lock the same file.

$ python
Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import fcntl
>>> f = open('some-filename', 'a')
>>> fcntl.flock(f, fcntl.LOCK_EX)
>>>

Figure 1 — First console acquires a lock

The second one will block (that is, wait) until the other process removes its lock before it continues.

$ python
Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import fcntl
>>> f = open('some-filename', 'a')
>>> fcntl.flock(f, fcntl.LOCK_EX)
# This console blocks here, and will not (yet)
# continue executing

Figure 2 — Second console tries to acquire a lock

Then, in the first console, if we remove the lock on the file, we can see that the second console then successfully acquires its lock. If we then try to acquire a new lock, the first process will block as it waits on the second process.

>>> fcntl.flock(f, fcntl.LOCK_UN)
>>> fcntl.flock(f, fcntl.LOCK_EX)
# Here this first process blocks

Figure 3 — First console releases its lock and tries to acquire a new lock
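If you'd rather not leave a console hanging while you experiment, `fcntl.flock` also accepts the `fcntl.LOCK_NB` flag, which makes the call fail immediately instead of waiting. Here is a small sketch of how I'd wrap that. The `try_lock` name is just something I picked for this example, and the code assumes a reasonably recent Python.

```python
import errno
import fcntl

def try_lock(f):
    """Try to take an exclusive lock on the open file f without waiting.

    Returns True if we now hold the lock and False if someone else does.
    Hypothetical helper, for illustration only."""
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError) as e:
        # A lock that is already taken is reported as EACCES or EAGAIN.
        if e.errno in (errno.EACCES, errno.EAGAIN):
            return False
        raise                             # anything else is a real error
    return True
```

In the second console, `try_lock(f)` in place of the blocking call lets you poll the lock and get your prompt back each time.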

This all works as we expect. What could be causing the problem?

After some searching around the Internet, I discovered this valuable advice from Wikipedia:

All fcntl locks associated with a file for a given process are removed when any file descriptor for that file is closed by that process, even if a lock was never requested for that file descriptor.

You can confirm this by looking at the fcntl(2) man page. Here, Wikipedia is referring to the underlying system calls, not the Python interface. So how is the Python interface actually implemented? Python's documentation tells us to “[s]ee the Unix manual flock(2) for details. (On some systems, this function is emulated using fcntl().)” If this is a system where Python's locking ends up using fcntl, then simply opening and closing that file again in the second console (which now holds the lock) should be enough to remove the lock.

>>> f2 = open('some-filename', 'a')
>>> f2.close()
>>>

Figure 4 — Second console simply opens and closes the file again

What happened when you tried this? When I try it, the behavior depends on the platform. On Linux, the second console still holds the lock (so the first console still blocks). On Solaris, the same experiment makes the second console lose its lock, and the first console then acquires it. Here we see the problem. On Solaris (or any other system that uses fcntl-style locking), if the process that holds the lock simply opens and closes the locked file again before it explicitly removes its lock, it unintentionally and invisibly loses the lock. Nothing fails and nothing is reported, so the bug only shows up later as two processes working on the same resource at once.
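The two-console experiment can also be rolled into one script. What follows is a rough sketch rather than a polished tool: the function name `lock_survives_reopen` is my own invention, it uses `os.fork` to stand in for the other console, and it assumes a current Python. Passing `fcntl.lockf` (Python's wrapper around fcntl-style locks) instead of `fcntl.flock` lets you watch the fcntl behavior even on a system where `flock` behaves well.

```python
import fcntl
import os

def lock_survives_reopen(path, lock=fcntl.flock):
    """Return True if an exclusive lock taken with `lock` is still held
    after this process opens and closes `path` a second time.

    Hypothetical helper, for illustration only."""
    holder = open(path, 'a')
    try:
        lock(holder, fcntl.LOCK_EX)
        open(path, 'a').close()        # the innocent-looking reopen from Figure 4
        pid = os.fork()
        if pid == 0:
            # Child process: plays the part of the other console.
            status = 2                 # 2 = something unexpected went wrong
            try:
                probe = open(path, 'a')
                try:
                    lock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
                    status = 0         # got the lock, so the parent lost it
                except (IOError, OSError):
                    status = 1         # refused, so the parent still holds it
            finally:
                os._exit(status)       # never fall back into the parent's code
        _, wait_status = os.waitpid(pid, 0)
        code = os.WEXITSTATUS(wait_status)
        if code not in (0, 1):
            raise RuntimeError('probe process failed')
        return code == 1
    finally:
        holder.close()
```

Going by the experiments above, I'd expect `lock_survives_reopen('some-filename')` to come back true on Linux and false on Solaris, while `lock_survives_reopen('some-filename', fcntl.lockf)` should come back false wherever fcntl locks follow the rule Wikipedia describes.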

To aggravate things further, it's difficult to determine in advance which kind of system Solaris is. The Python source code checks a C macro called HAVE_FLOCK to decide whether to call the system's flock() directly or to emulate it with fcntl(), and it looks like Solaris does make that function available. The Solaris flock(2) man page, however, has this caveat:

The compatibility version of flock() has been implemented on top of fcntl(2) locking. It does not provide complete binary compatibility.

No kidding. Don't forget to read carefully. So either way, Python on Solaris ends up using this very slippery fcntl-style locking.

So how do you actually fix this problem in practice? Well, as I see it you can either make sure that your processes don't repeatedly open (and close) a locked file (perhaps by always accessing a single file object), or you could lock a special guard file that would not need to be opened again by your application, or you could avoid Solaris. I went with the second option.
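For what it's worth, the guard-file approach can be sketched roughly like this. The `guard_lock` name and the `.lock` suffix are choices I made for this example, not something any library prescribes, and the code assumes a Python new enough to have `contextlib` and the `with` statement.

```python
import contextlib
import fcntl

@contextlib.contextmanager
def guard_lock(resource_path):
    """Hold an exclusive lock on a guard file that sits next to the resource.

    Hypothetical helper: the '.lock' suffix is an arbitrary choice."""
    guard = open(resource_path + '.lock', 'a')   # nothing else opens this file
    try:
        fcntl.flock(guard, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(guard, fcntl.LOCK_UN)
    finally:
        guard.close()
```

Inside a `with guard_lock('some-filename'):` block, the application can open and close `some-filename` as often as it likes. The lock lives on the guard file, which is opened exactly once, so even fcntl-style locking has no second close to trip over.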

I have to tell you, I was, and am, very happy and relieved to have found out why these locks weren't working. It is so frustrating to dig for so long and sit, staring, at code that looks like it should work, only it doesn't. On the other hand, it's always a great feeling when you excavate that one fact that you need to bring the whole picture into focus.
