Monday, September 24, 2012

Work stealing and mostly lock-free access

ParaSail uses work stealing to schedule the picothreads that make up a ParaSail program.  With work stealing, there are a relatively small number of heavy-weight worker processes, roughly one per physical core/processor, each serving their own queue of picothreads (in a LIFO manner), and periodically stealing a picothread from some other worker process (using FIFO, so as to pick up a picothread that has been languishing on the other worker's queue).  See the following blog entry for more discussion of work stealing:

What this means is that a worker's queue is referenced mostly by only one process, namely the owner of the queue.

A similar situation arises in the region-based storage management used in ParaSail.  A region is created when a new scope is entered, and most of the allocation and deallocation within a region is done by the worker that created the scope.  But due to work stealing, some other worker might be executing a picothread that is expanding or shrinking an object associated with the region, so some synchronization is necessary in this case.

So what sort of synchronization should be used for these situations where most of the access to a resource arises from an owning worker process, but some of the access arises from other non-owning workers?  We could use a traditional lock-based mutex all of the time, but this slows down the common case where all the access comes from the owner.  We could use a general N-way lock-free synchronization, but this generally requires some kind of atomic compare-and-swap and involves busy waiting.  Atomic compare-and-swap is not always available in a portable fashion at the high-level language level, and busy waiting presumes that there are never more worker processes than there are available physical processors/cores, so the current holder of the lock-free resource is actually making progress while other workers are busy-waiting.

So for ParaSail we are adopting a middle ground between fully lock-based synchronization and N-way lock-free synchronization, which recognizes the asymmetric nature of the problem, namely that one process, the owner, will be performing most of the references.  With the adopted solution, we only need atomic load and store, rather than atomic compare-and-swap, and there is never any busy waiting, so we can run on top of, for example, a time-slicing operating system, where some worker processes might be preempted.

So what is the adopted solution?  For a given resource, we have two flags which are atomic variables, one mutex, and a queue, named as follows:
  • Owner-wants-resource flag
  • Nonowner-wants-resource flag
  • Resource mutex
  • Nonowner-waiting queue
When the owner wants to use the resource:
  • Owner sets the owner-wants-resource flag atomically;
  • It then checks the nonowner-wants-resource flag:
    •  If nonowner-wants-resource flag is set:
      • Owner calls the mutex lock operation;
      • Owner manipulates the resource;
      • Owner clears the owner-wants-resource flag;
      • <<Check_Queue>> Owner then checks the nonowner-waiting queue:
        • If the queue is empty, it clears the nonowner-wants-resource flag;
        • If the queue is not empty, it wakes up one of the waiting nonowners;
      • Owner calls the mutex unlock operation (note that this might be combined with the above waking up of one of the nonowners -- e.g. using a lock handoff).
    •  If nonowner-wants-resource flag is not set:
      • Owner manipulates the resource;
      • Owner clears the owner-wants-resource flag.
      • Owner rechecks the nonowner-wants-resource flag:
        • If nonowner-wants-resource flag is now set:
          • Owner calls the mutex lock operation;
          • Owner does the <<Check_Queue>> operation (see above);
          • Owner calls the mutex unlock operation (note that this might be combined with the waking up of one of the nonowners by Check_Queue -- e.g. using a lock handoff).
When a nonowner wants to use the resource:
  • Nonowner calls the mutex lock operation;
  • Nonowner sets the nonowner-wants-resource flag atomically;
  • Nonowner checks the owner-wants-resource flag;
    • While owner-wants-resource flag is set:
      • Nonowner adds itself to the nonowner-waiting queue;
      • When woken up, reacquire the lock (or get lock automatically via a handoff from the process that woke us up);
  • Nonowner manipulates the resource;
  • Nonowner does the <<Check_Queue>> operation (see above);
  • Nonowner calls the mutex unlock operation (note that this might be combined with the waking up of another nonowner by Check_Queue -- e.g. using a lock handoff).
How do we know this approach is safe?  We need to prove that the resource is never manipulated simultaneously by the owner and a nonowner.  This can only happen if the owner decides to not use the mutex, since otherwise the manipulation happens under protection of the mutex lock.  We know that the owner sets the owner-wants-resource flag before checking the nonowner-wants-resource flag, and similarly a nonowner sets the nonowner-wants-resource flag before checking the owner-wants-resource flag.  Therefore, if the owner decides to bypass the mutex, while a nonowner is going after the resource simultaneously, the nonowner must not yet have checked the owner-wants-resource flag (think about it!). If later the nonowner does reach the check on the owner-wants-resource before the owner is done, it will put itself onto a queue rather than immediately manipulating the resource.

How do we know this approach does not leave a nonowner waiting on the queue forever?  We know the owner rechecks the nonowner-wants-resource flag after clearing the owner-wants-resource flag, so the owner will never miss the possibility that a nonowner queued itself while the owner was manipulating the resource.

So what does this approach accomplish?  We see that the owner only uses a lock-based mutex when it bumps into a nonowner that is simultaneously manipulating the resource.  On the other hand, a nonowner always uses a lock-based mutex, and in addition it uses a queue if it happens to bump into the owner simultaneously manipulating the resource.  As mentioned above, this approach also avoids the need for atomic compare-and-swap, as well as avoiding the need for busy waiting.


  1. I would like to add a couple of observations:

    * you definitely need memory barriers between flag manipulations (atomic operations are not necessarily memory barriers);

    * typical mutex implementation already does something similar to your flag-set-check protocol and it also has a queue of waiters inside of the mutex, so it might be easier to just to mutex_lock();... mutex_unlock(); without much ado.

  2. I believe memory barriers should be provided automatically as needed if you declare the object as atomic/volatile in languages such as C, C++, Java, or Ada.

    Mutex implementations need not use queuing if you use the priority ceiling approach on a monoprocessor, and busy waiting on a multiprocessor. This is the implementation I am presuming. I also should have made it clear that the nonowner-waiting queue is not itself a concurrent data structure -- it relies on the mutex to protect against simultaneous access. (In Ada, it would be an entry queue associated with the protected object used to represent the mutex.)

  3. Declaring an object volatile definitely does not introduce a memory barrier in C (let's say C99) or C++.

    In these languages atomics have to be implemented manually (usually through inlined asm) and "typical" atomic operations implementations, also do not incur barriers: they are expensive and people using atomics are supposed to understand what they are doing! E.g., see Linux kernel Documentation/atomic_ops.txt:

    "*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***"

    1. Right you are. C1x and C++11 do have the notion of atomic variables which ensure sequential consistency, but alas, these are probably not yet implemented in most C/C++ compilers. Luckily, the ParaSail run-time is currently implemented in Ada 95, where the rules specifically allow synchronization using atomic variables. Again, that doesn't prove that all implementations ensure that, but they have had more time to do so. ;-)

      The other approach is to just rely on a very efficient implementation of a mutex (e.g. a Linux futex), which special cases the no-contention situation. Probably the only way to know which has the best performance is to do experiments, which will be the next job.