Adding recursion to a read/write lock

Adding recursion to a read/write lock

Disclaimer: Years of working with custom locks, lock free, and memory consistency models has taught me that low level multi-threading is hard. I make no claims that the code below is safe or the right way to solve the problem. Use these techniques at your own risk. So far this code has passed all of my functional tests and is in active use in my codebase dealing with large databases. But… if I do find a flaw in this code you can be sure I’ll write up an interesting blog explaining the problem.

The windows Slim Read/Write lock is great, it allows for shared read locks, exclusive write locks, it is fast and very light weight (just one pointer in size). The only problem is that it doesn’t support recursion, and I really need recursion. My use case is frequent simultaneous reads of a database, and very occasional writes where I need to block all reads. Each read takes a reasonable amount of time, so I’m not too concerned about read lock overhead.

I have a wrapper class around SRWLock called ReadWriteLock, and I’m writing a wrapper around that called, unsurprisingly, RecursiveReadWriteLock.

Adding exclusive recursion turned out to be surprisingly simple. By exclusive recursion I mean that recursion is only allowed if the current thread has an exclusive lock. In many cases this is enough. I’ll call this class ExRecursiveReadWriteLock.

If the current thread has an exclusive lock then we are guaranteed that no other thread has a shared lock. We can also make the assumption that we will release any shared locks we take before we release our exclusive lock. This means that we can simply ignore any shared locks while we have the exclusive lock. Simple! If another threads tries to take a shared lock it will fall through to the ReadWriteLock which will block until the exclusive lock has been released.

void ExRecursiveReadWriteLock::EnterShared() const
{
	if (Threading::GetCurrentThreadId() == m_ExclusiveLockThread)
		return;

	m_ReadWriteLock.EnterShared();
}

void ExRecursiveReadWriteLock::LeaveShared() const
{
	if (Threading::GetCurrentThreadId(); == m_ExclusiveLockThread)
		return;

	m_ReadWriteLock.LeaveShared();
}

To allow for recursion taking exclusive locks we just add a count. If the current thread tries to take an exclusive lock that it already has then we just increment the count. If another thread tries to take an exclusive lock it falls through to the ReadWriteLock and blocks.

void ExRecursiveReadWriteLock::EnterExclusive() const
{
	int thread_id = Threading::GetCurrentThreadId();
	if (thread_id == m_ExclusiveLockThread)
	{
		++m_ExclusiveLockThreadCount;
		return;
	}

	m_ReadWriteLock.EnterExclusive();

	m_ExclusiveLockThread = thread_id;

	m_ExclusiveLockThreadCount = 1;
}

void ExRecursiveReadWriteLock::LeaveExclusive() const
{
	if (--m_ExclusiveLockThreadCount)
		return;

	m_ExclusiveLockThread = 0;

	m_ReadWriteLock.LeaveExclusive();
}

GetCurrentThreadId is just a single instruction on Windows, so this implementation is very fast. If you only need recursion when you have an exclusive lock this is what you should use. However, it’s pretty tricky to verify that this is the case.

So far it’s very simple and not really worthy of a blog post. However, things start to become interesting when considering recursive shared locks when you don’t have an exclusive lock. The important thing to note is that the above implementation doesn’t support the case where no one has an exclusive lock and multiple threads want to take recursive shared locks.

Implementing full recursive read/write locks is a bit more complex and requires more overhead (another lock). In order to cope with multiple threads taking recursive read locks we need to keep a shared lock count for each thread. This could be achieved with thread local storage, but I’ve found that thread local storage on Windows is not as fast as you might expect. It also has some setup and teardown overhead and is very platform specific. Instead, I added a dynamic resizable array protected by a full lock (critical section).

The EnterShared function then becomes:

void RecursiveReadWriteLock::EnterShared() const
{
	int thread_id = Threading::GetCurrentThreadId();

	if (thread_id == m_ExclusiveLockThread)
		return;

	// if we already have a shared lock just increment the count
	{
		CriticalSectionScope lock(m_SharedLocksLock);
		int shared_lock_count = m_SharedLocks.GetCount();
		for(int i = 0; i < shared_lock_count; ++i)
		{
			SharedLock& shared_lock = m_SharedLocks[i];
			if (shared_lock.m_ThreadId == thread_id)
			{
				++shared_lock.m_Count;
				return;
			}
		}

		SharedLock new_shared_lock = { thread_id, 1 };
		m_SharedLocks.Add(new_shared_lock);
	}

	m_ReadWriteLock.EnterShared();
}

It iterates over the array to see if this thread has already taken a lock, if it has it increments the lock count and returns. Otherwise it takes the shared lock. As you can see this is more complex and requires a full lock. However, in my use case there probably won’t be any more than 4 threads locked at a time, usually nore more than 1 or 2. The full lock will only be held for a very short amount of time, and is only held while locking. Once we have acquired a lock the full lock isn’t held anymore, so we still get full multi-read concurrency.

LeaveShared is a similar function, we only release the shared read when the count for the current thread reached zero.

void RecursiveReadWriteLock::LeaveShared() const
{
	int thread_id = Threading::GetCurrentThreadId();

	if (thread_id == m_ExclusiveLockThread)
		return;

	// decrement the shared count count and call m_ReadWriteLock.LeaveShared() if zero
	{
		CriticalSectionScope lock(m_SharedLocksLock);
		int shared_locks_count = m_SharedLocks.GetCount();
		for(int i = 0; i < shared_locks_count; ++i)
		{
			if (m_SharedLocks[i].m_ThreadId == thread_id)
			{
				int count = m_SharedLocks[i].m_Count;
				if (count == 1)
				{
					m_SharedLocks.RemoveSwapLast(i);
					break;
				}
				m_SharedLocks[i].m_Count = count - 1;
				return;
			}
		}
	}

	m_ReadWriteLock.LeaveShared();
}

Usually, I would say that this was far too much overhead for a simple low level locking mechanism. But as I said, in my use case I’m not too worried about lock overhead because my work times are relatively high.

However, there are some things we can do to keep this fast. I implemented a specific array class that has an embedded default array, with default size set to 8. This means that the class won’t allocate any memory unless more than 8 threads take a shared lock at the same time. It also means that the array will be inline with the class and should already be in the cache. Iterating over a few items in an array that is already in the cache will be very fast, which is great, because we don’t want to hold the full lock any longer than we have to.

The EnterExclusive and LeaveExclusive stay the same as the previous implementation.

So far this code has passed all of my tests and seems solid. It also seems fairly fast. The full lock is rarely contended because it is held for such a short amount of time, and critical sections that are not contended are very fast, in my measurements they are actually slightly faster than an uncontended SRWLock.

However, as thread count increases this implementation will not scale well. As long as the array fits within a cache line you shouldn’t notice any real degradation. In my implementation that’s 14 threads. As of 2018 that’s probably fine for most cases. If you have more than 14 threads holding concurrent shared locks at the same time you should probably consider a different solution anyway.

One last piece of advice, when you’re writing your own shared read/write lock, add in debug information and asserts for invalid recursive locking. Debug info can work the same way as described above, keep information on each thread, and also keep track of who has the exclusive lock. Recursively trying to lock an SRWLock is undefined behavior, but usually just deadlocks the system. This is actually pretty useful for tracking down bugs, but it's much nicer to have an assert which fires, and tells you who has the lock.

100% CPU usage with very low kernal times. Multiple shared locks on 12 threads on 350MB database.
100% CPU Utilisation

Comments are closed.