Monday, 28 July 2014

Poll me, maybe?

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
The java.util.Queue interface has the ever important poll/offer methods, documented thus:
This allows the caller to assume that if the poll method returns null the queue is empty, and in a single threaded world this is quite straight forward. A problem for multi producer lock free concurrent queues however is that while an element may not be visible the queue may not be empty. "WOT?" I hear you cry, lets look at an example:
The above snippet shows the offer/poll methods of an MPSC linked queue as originated by Mr. Vyukov(MPSC - Multi-Prodcuer, Single-Consumer) and ported into Java by your humble servant (though others have ported this algo before me: Akka/RxJava/Netty, and others...).
Where be dragons?

Multiple Producers Break The Chain

How the above algorithm works for multiple producers:

  • On line 17 we use the same mechanism offered by AtomicReference.getAndSet to atomically swap the producerNode with the new node.
  • This means no producerNode is ever returned more than once, preventing producer threads from overwriting the producerNode reference field and that node's reference to the next node.
  • We use the same mechanism offered by AtomicReference.lazySet to set this new node as the next node from the previous.
  • On the consumer thread side we process nodes by grabbing the next node and replacing the consumerNode with it after we pluck out it's value.

The problem is in the offer method lines 17-19 where we first xchgProducerNode and then set the new node (now the producerNode) to be the next node after the previous producerNode. For a short moment the old producer node has null as the next element, the chain is broken. This short moment is of undefined length, depending on scheduler pressure and the whimsical sense of humour of the gods the producer thread which got interrupted at line 18  (puff) may be held up for a while.
And yet, while this producer thread is sleeping (what dreams may come?), other producers can make progress. They may add any number of nodes to the queue, each linking the producerNode to the next. The producerNode can be at any 'distance' from that suspended node on that suspended thread waiting for it's next field to be patched through and may have any number of further broken links in the way.
Looking at the poll method in this light the problem becomes more obvious. If a node may have it's next set to null due to the timing described above, then poll may return null when the queue is in fact not empty.

Unbreaking The Chain

To fix this issue we could do the following:

And indeed this is how other implementors have chosen to tackle the issue (in spirit, if not in code).
In doing so we have given up quite allot however:
  1. poll() is no longer wait free. Which is a shame, I quite like wait freedom.
  2. The consumer thread is volunteered into spin waiting on the next field to become visible.
  3. In the event of hitting null we now read the producerNode. This introduces the potential for a cache miss. This is a big problem to my mind as this cache miss has an unknown and potentially very large cost.
  4. The producerNode read from the consumer thread will have a negative impact on the producer threads contending to write to it. This has been previously explored here. This will be particularly bad when the consumer is spinning on the poll() method while waiting for the next value.
Is it worth it? I'm not convinced... Given that a Queue already mandates an isEmpty method could we not settle for a relaxed definition of poll? given the above observation of the queue emptiness is also (by it's lock free and concurrent nature) imprecise should we really sacrifice performance/scalability for it?
At the moment I am leaning towards offering weaker guarantees for the lock-free queues offered as part of JCTools, but I'm hoping to get some feedback from prospective users on how important that part of the queue interface is to their purposes.

NOTE: The same issue exists for the offer method when we look at an SPMC queue, as discussed in this issue.

5 comments:

  1. Thanks for the detailed analysis, we came to the same conclusions. Unfortunately we cannot tolerate relaxed semantics in Akka. When sending a message to an actor we need to atomically schedule it to run unless it is currently active. This wake-up signal must not be lost because otherwise the actor will lie dormant until the next message arrives, which might be never. Therefore the woken-up actor must not find its mailbox empty, hence the spinning. We could technically parry for it within the actor code for that precise case instead of making poll() reliable, but I am not sure whether that is worth the code complexity; someone should benchmark it … ;-)

    ReplyDelete
    Replies
    1. I think it's a relatively minor change, and to my mind reflects the intent more correctly.
      Consider the case where an actor has gotten into this state (poll() == null && !isEmpty()), to my mind the right thing to do is re-schedule it and move on to executing an actor with visible messages in it's queue. I consider that a better solution than highjacking the thread until the message becomes visible. Others might disagree...
      The above may or may not improve performance but might be worth a shot, I'm not sure how you guys benchmark Akka...

      Delete
  2. I'm sure you covered this elsewhere, but could you please repeat it for my benefit?

    Why don't you use the following way of inserting which avoids this themporary "gap":

    while (true) {
    curr_head = get head;
    new_node.next = curr_head;
    if (cmp_and_swap(curr_head, new_node)) { break;}
    }

    ReplyDelete
    Replies
    1. Not covered elsewhere as such, and no need to apologize if I did ;-)
      What you suggest is to replace getAndSet() with a CAS loop, and for JDK7 you are making a very valid swap as getAndSet is a CAS loop anyway, so the above is indeed a nice win in simplifying poll without hurting offer much. I'll update my JDK7 implementation of this queue to reflect your suggestion.
      For JDK8 getAndSet is a LOCK XCHG intrinsic (on x86) rather than a CAS retry loop. This intrinsic performs better than a CAS loop and is so quite desirable. A similar improvement have been made on getAndAdd which I've benchmarked here: http://psy-lob-saw.blogspot.com/2014/06/jdk8-update-on-scalable-counters.html
      So in the JDK8 case we'll be making offer worse for the benefit of poll. How much worse is hard to judge without a relevant usecase/benchmark, but there's certainly a case to be made for avoiding gaps which may be worth the trade off.

      Delete
    2. sorry, on second read what you are suggesting is not going to work... the prevHead.next is the new node, not the other way around. The change you suggest will allow reversed traversal (LIFO) so starting from the last element one could progress to the first. In the above described algorithm/data structure head is the last element added, and the graph goes: tail.next -> node.next -> head.next -> null
      The next field being set after the getAndSet is the prevHead.next, and re-writing it concurrently will lead to a bug as the value will be trashed when a CAS is lost.
      So sadly this is not a valid alternative for the data structure discussed, but perhaps another data structure is possible.

      Delete