Comments on "Psychosomatic, Lobotomy, Saw: Degrees Of (Lock/Wait) Freedom"

Going back to the qualification of non-blocking as "failure or suspension of any thread cannot cause failure or suspension of another thread": anything which triggers a global safepoint is blocking. The global safepoint is in effect a lock, and any thread waiting on it falls under "failure or suspension of another thread". If you suspend a Java thread on its way to a safepoint, or the GC thread in a safepoint, everyone is stuck. The amount of time you plan to spend in a lock, even a short or bounded one, does not matter if suspending a thread inside it will have the effect of blocking some other thread until the thread inside the lock is resumed.
The reason I separate allocation and deopt from safepoints is to differentiate cause and effect. A Java thread can cause a global safepoint through allocation and deoptimization (and a few other ways). A Java thread can be affected by other Java threads, or by the JVM itself, when hitting a safepoint initiated by them.
> "non-blocking (systems) as opposed to algorithms or sets of threads only exist on paper, in the sense that if you have a system that MUST be non-blocking (as a system), you probably need to execute it on paper"
This is probably true; my point is that to communicate using the "lock free" term we need to recognise that it means different things to different people in different contexts.
It's fuzzy on the boundaries of "system" and "progress", and should perhaps not be waved around quite as enthusiastically as I have been waving it :-).
-- Nitsan, 2015-05-22 18:53

Being nit-picky, let's dissect the three JVM items you identify as causing blocking: safepoints, deopt, and allocation. I don't think any of these are specific to JVMs; the same qualities are shared by pretty much all OSs and by hardware. The reality is that if a system MUST be non-blocking, it should probably not be on the JVM, an OS, a hypervisor, hardware that uses power management, hardware that does ECC, etc. Basically, non-blocking systems (as opposed to algorithms or sets of threads) only exist on paper, in the sense that if you have a system that MUST be non-blocking (as a system), you probably need to execute it on paper... ;-)

We probably agree that having the JVM suspend the execution of a thread temporarily for various reasons does not remove wait-freedom qualities from the thread, or from the system of threads your algorithm (or whole program) represents. The JVM itself is not wait-free. But neither is any OS I know of. And neither is hardware. E.g. on any OS with interrupts enabled, any thread may experience a delay for some amount of time for interrupt and/or exception processing, but will still be considered wait-free if other qualities hold.

The key point is that all three examples of JVM blocking are bounded in time, and do not create any new cross-thread blocking potential within the affected threads (that run on the JVM):

Safepoint: A JVM may bring all threads to a global safepoint that can last an arbitrary (but bounded) amount of time.
That makes for an ugly pause, but it does not make the threads running on it non-wait-free -- not any more than e.g. a page fault in an OS does to a C program, at least. The magnitude may be larger and uglier (on JVMs that make threads wait for large and ugly safepoints to complete), but the concept is the same.

Time-to-safepoint (TTSP): The real issue with safepoints is cross-thread dependence on execution of code, and the resulting time-to-safepoint that extends the perceived stall seen by all threads in a global safepoint. In the fairytale land of "all-safepoints-all-the-time", a safepoint can happen anywhere in code, without any necessity for instructions to be executed by any thread, removing the dependence between threads. However, in the practical world of optimized code, safepoint opportunities occur more rarely, which means that threads need to "execute forward" to reach a safepoint, creating a cross-thread dependency. When safepoint opportunities are eliminated from long code paths (e.g. counted loops that count to a billion, or runtime things like array fills of 1GB), TTSP can get very long. When page faults may need to be satisfied for threads to make forward progress (e.g. when a thread is blocked on access to a mapped file), that gets ugly too. However, as ugly as TTSP is, it is still a bounded operation. It does not (in itself) make the threads affected by it non-wait-free.

Deoptimization: Deoptimization without safepoints is possible, but I'll grant that all current JVMs use global safepoints for at least some deoptimizations. And global safepoints are ugly. However, see the safepoint discussion above for why they do not make affected threads non-wait-free any more than an OS fault handler that grabs a spin lock does.

Allocation: Allocation on the JVM is actually in better shape than allocation on anything else in this regard...
The key is that it's not the JVM that stalls you when you allocate; it's the allocation itself that includes a potential semantic lock (just like a get on a semaphore does). An allocating C thread (using any variant of malloc() or new) is just as susceptible to long delayed operations. Whenever you allocate from something other than a completely-thread-local, self-managed pool of memory, you've introduced blocking into your algorithm. JVMs (and other GC'ed runtimes) reduce the likelihood of allocation-related contention and blocking dramatically compared to other environments, but allocation is still a blocking operation, just like anywhere else.
-- Gil Tene, 2015-05-22 15:47

I totally agree with most of your reasoning. What I'm trying to point to is the following: while in computer science there are crucial differences between blocking/lock-free/wait-free algorithms, in engineering the difference is not crucial (and sometimes even minuscule). E.g. there is no way to implement distributed shared memory with RW access without any locks (== exclusive owning at some granularity) involved. Such locks could be designed to have predictable timing, but, again, such timings are predictable only because at the hardware level you have knowledge of the max mutator count (== physical cores). Imagine you have N physical cores trying to apply lock xadd on the same memory location -- do you really expect an upper bound on _any single lock xadd duration_ of O(1)? I'd expect it to be _at least_ O(N) -- i.e. this number is fixed only as long as the core count is fixed. Are you working at Azul right now?
-- it would be interesting to know how long an atomic add could take on Azul's Vega with its ~1000 cores...

Another point: afaik there are no practical algorithms which are totally lock-free. All of the "lock-free" implementations I'm aware of are really lock-free only on the fast path -- they all introduce some kind of back-off (sometimes even in the form of blocking) on slow paths. This is, sure, because starvation is still possible in the lock-free world, and from my point of view starvation in the lock-free world is nothing really different from starvation in the blocking world -- i.e. there are specifics, but it's the same phenomenon at its root.

Yet another point: all of the lock implementations I'm aware of are themselves lock-free (sic!) on the fast path. Well, you know -- all this [CAS(ownerThreadId) -> adaptive spinning -> ok, ok, I back off to enqueue you on the waiters queue and schedule you off the CPU] stuff. So if we're calling, say, the Disruptor lock-free (on the fast path) -- why do we not call code with ReentrantLock lock-free (on the fast path)?
-- Ruslan Cheremin, 2015-05-22 09:42

> Yes. It's locked, but the lock duration is limited by the single instruction. Unlike a synchronized block, it can't be prolonged by a page fault (as it'd abort the instruction) and it obviously can't be prolonged by a context switch.

Yes, sure. But any thread's execution could be prolonged by a page fault or a context switch or some interrupt or...
If we worry about thread_2 waiting on a lock acquired by thread_1, which was interrupted inside the locked region because of a page fault -- why don't we worry about thread_2 waiting on a plain local memory load because of the same page fault?

Sure, there are some reasons to worry about the duration of a locked region _more_ than about any thread-local region's duration -- e.g. because this way one page fault could slow down not 1, but 2, 3... threads at once. But still I see no crucial difference from a practical point of view.
-- Ruslan Cheremin, 2015-05-22 09:15

I'll take the lawyer's way out and hang on the definition of non-blocking as: "failure or suspension of any thread cannot cause failure or suspension of another thread". The HW implementation does not matter, and its speed doesn't matter, as long as instructions are atomic from the executing thread's point of view.
Can conflicting LOCKed instructions be considered blocking? No, because a thread cannot be suspended or interrupted halfway through them. You need a crack to put your wedge in, and an atomic instruction guarantees exactly that there is no crack.
If we agree that the instructions are at least non-blocking, we can argue that LOCKed instructions cannot be considered wait-free if starvation is possible. But I'm not sure that starvation is possible, i.e. that there is no bounded return time for LOCKed instructions. I think LOCKed instructions maintain exclusive access to the cache line for the duration of their execution, so there's no gap in which other cores can 'steal' the line.
That a core will get exclusive access in a bounded time is perhaps enforced by some fairness in the MESI protocol implementation.
The last paragraph is entirely conjecture based on internet rumours... I can't find an all-out, spelling-out-the-details kind of explanation of the hardware-level implementation.
-- Nitsan, 2015-05-22 00:12

> Does core_1 intervene and prevent main memory from satisfying the request (so, effectively, "locks" the cache line)?

Take it with a big shovel of salt: AFAIK no and yes. Instead of main memory, a higher-level cache is involved. It's inclusive, so it has the given cache line. Moreover, it knows that it's core_1 who has the only true version.

> is there any difference between such "locking" of cache line ownership

Yes. It's locked, but the lock duration is limited by the single instruction. Unlike a synchronized block, it can't be prolonged by a page fault (as it'd abort the instruction) and it obviously can't be prolonged by a context switch.
-- maaartinus, 2015-05-21 13:10

About LOCKed instructions: again, how would you implement them given MESI+ as your base cache-coherency protocol? Let core_1 process a lock xadd: first core_1 needs to issue an RFO for the memory location, then the cache line is acquired in E-state, then core_1 needs to read the value into some internal register, add something, and write the result back to the cache line. But what if core_2 issues an RFO for the same cache line while the RMW operation on core_1 is in progress? How will the request be arbitrated?
Does core_1 intervene and prevent main memory from satisfying the request (so, effectively, "locks" the cache line)? Or does core_1 just lose its ownership, detect this when writing back the updated value, and retry (so, effectively, re-introducing a CAS loop at the hardware level)? I'm not sure, but I'd bet on the first option ("locking" the cache line for the full duration of the RMW). And if that is indeed true of the current implementation of RMW in hardware, then is there any difference between such "locking" of cache line ownership and fine-grained short code under synchronized()?

My understanding of the difference is the following: it is all about an (unlimited) number of software threads with a complex scheduling policy (and don't forget about interrupts!). Given a limited (and known in advance) number of hardware cores and fairness of the "lock" you use at the hardware level, you are able to give an upper bound on a locked region's duration. But with an unknown number of software threads with different priorities and a priority-switching policy, an unknown number of interrupts, and without the implementation details of the locks you use, there is no way to estimate such an upper bound. For me, as a physicist, it is like the transition from Newton-laws-guided motion of 1-2-3-... particles in a box to molecular kinetics with the number of particles rising towards 10^23: you can simulate the exact dynamics of a limited number of particles, but as the number grows you're forced to switch to a statistics-based approach at some point, even though individual particle motion still remains fully deterministic.
The "blocking"/"non-blocking" dichotomy looks to me like exactly the same transition.
-- Ruslan Cheremin, 2015-05-21 09:47

It's a good point that the 'system' definition should include the hardware (and OS/JVM) if we care about the real world. In particular, if we limit our discussion to Intel x86, are the LOCK instructions to be considered blocking? Can we use these instructions and claim lock-freedom?
How would you translate "An algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread for some operations" to the instruction level? Can we just replace Algorithm with Instruction?
The answer (as I understand it, on recent CPUs, based on the Intel manuals) is that the LOCK instructions (XCHG/LOCK XADD/LOCK CMPXCHG etc.) are non-blocking:
- On old processors the memory bus is locked for a LOCK instruction; newer processors perform a location-limited exclusive access as long as the value is in the cache. This has performance implications, but to my understanding does not make them blocking by the above definition.
- The instructions are atomic and exclusive; that means threads block each other, but the blocking duration cannot be stretched by "failure or suspension of any thread". Thread failure cannot induce deadlock.
It is generally accepted in the papers I have read that the LOCK instructions are wait-free.
As for the difference between LOCK XADD and a CAS loop, the big difference is in the semantics. A LOCK XADD operation cannot fail indefinitely: it will get ordered and executed and the result will be returned. There is no failure mode. A CAS loop has distinct read-modify-compareAndUpdate stages, where only the compareAndUpdate is atomic; the getAndSet/Add is a single stage.
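That semantic difference is easy to show in a short sketch (the method name and bound below are hypothetical, purely for illustration): an update that must inspect the current value and possibly give up requires the read-then-compareAndUpdate shape, while a fetch-and-add always commits.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedAdd {
    // Increment only while below a limit. The "give up" branch is why this
    // needs a CAS loop: an unconditional fetch-and-add (LOCK XADD) cannot
    // inspect the current value before committing its update.
    static boolean incrementIfBelow(AtomicInteger counter, int limit) {
        for (;;) {
            int current = counter.get();      // read stage
            if (current >= limit) {
                return false;                 // modify stage decides to bail out
            }
            if (counter.compareAndSet(current, current + 1)) {
                return true;                  // compareAndUpdate succeeded
            }
            // CAS failed: another thread changed the value; re-read and retry.
        }
    }
}
```

By contrast, counter.getAndIncrement() always commits and always returns the old value; there is no branch to take on "failure".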
This also means they can't do everything the CAS loop allows you to do (see the Mpsc/Spmc/Mpmc queues for examples where CAS cannot be replaced by LOCK XADD).
-- Nitsan, 2015-05-21 01:22

The interesting thing about progress-freedom is that we can evaluate an algorithm's "degree of freedom" only given the "degree of freedom" of its building parts -- i.e. instructions. So "...this method is non-blocking because there is only one CAS" is true only while we're sure CAS itself is really wait-free. But is it? On any hardware a JVM is likely to run on?

I.e. let's take getAndAdd() implemented as a CAS loop (as it was until 1.8) -- this is, theoretically, lock-free (if CAS is indeed lock-free on this JVM impl on this hardware platform), but not wait-free, because, theoretically, the loop could iterate forever, given enough threads to compete. But the 1.8 JVM (afaik) compiles getAndAdd to an atomic lock xadd -- now it becomes wait-free. But how? Is there really any _non-trivial_ difference between how lock xadd is implemented on top of MESI+ and how a looped lock cmpxchg is implemented on top of the same MESI+? My guess (only a guess) is: no, there is no non-trivial difference. A CAS loop is roughly equivalent to lock xadd, just with a bigger window of contention. And if that is true, then the CAS loop should also be seen as wait-free -- because, given a finite number of physical CPUs to compete, it will terminate in a finite number of iterations.

So, I mean, we should really keep in mind the separation between computer science and software architecture.
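The pre-1.8 CAS-loop shape described above can be sketched as follows (a minimal illustration, not the actual AtomicInteger source):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasLoopAdd {
    // getAndAdd via a CAS loop -- lock-free (some thread's CAS always
    // succeeds) but not wait-free on paper, since this particular thread
    // can lose the race an unbounded number of times.
    static int getAndAdd(AtomicInteger value, int delta) {
        for (;;) {
            int current = value.get();
            if (value.compareAndSet(current, current + delta)) {
                return current; // return the old value, as getAndAdd does
            }
            // Lost the race to another thread: retry with a fresh read.
        }
    }
}
```

On a 1.8+ HotSpot on x86, AtomicInteger.getAndAdd itself is intrinsified to a single lock xadd; the loop above is the "bigger window of contention" version of the same operation.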
A well-designed blocking (from the CS point of view) algorithm could easily outperform a wait-free (from the CS point of view) one in most (if not all) practical applications -- and even more, blocking could be equivalent to a wait-free algorithm if we go deep enough in reducing the actual instructions executed down to their roots.

P.S. For some reason the comment doesn't appear after sending, so I may have duplicated it a few times, sorry :)
-- Ruslan Cheremin, 2015-05-20 11:30
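As a footnote to the "locks are lock-free on the fast path" observation earlier in the thread, that shape can be sketched as a toy mutex (a hypothetical class, not any real JDK implementation; ReentrantLock adds a waiters queue, fairness and interruption handling on top of the same idea):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

// Toy mutex showing the CAS -> spin -> park progression: the fast path is a
// single CAS, indistinguishable in shape from a step of a lock-free algorithm;
// only the slow path blocks by giving up the CPU.
public class SpinThenParkLock {
    private final AtomicReference<Thread> owner = new AtomicReference<>();

    public void lock() {
        Thread me = Thread.currentThread();
        if (owner.compareAndSet(null, me)) return;   // fast path: one CAS
        for (int i = 0; i < 1024; i++) {             // brief spin phase
            if (owner.compareAndSet(null, me)) return;
        }
        while (!owner.compareAndSet(null, me)) {     // slow path: block
            LockSupport.parkNanos(1_000L);
        }
    }

    public void unlock() {
        owner.set(null); // toy version: no owner check, not reentrant
    }

    public boolean isHeld() {
        return owner.get() != null;
    }
}
```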