Comments on Psychosomatic, Lobotomy, Saw: SPSC IV - A look at BQueue (newest first)

Nitsan (2015-05-11 22:05):
The idea with batching is to reduce cost by removing unnecessary operations, based on the knowledge that more than one element is involved. For the offheap queue case, let's consider what's required per element:
- increment the counter
- mark the slot as used/unused
You can reduce cost by readAcquiring a batch, which moves the counter only once (see the sketch below). I don't think you can batch the used/unused mark, and doing it per element ensures good locality in any case. The big win in the multi-consumer case is in reducing contention on the counter by claiming batches of events instead of a single event every time.
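A minimal sketch of that batched read, assuming a hypothetical readAcquireBatch next to the per-element readRelease discussed in this thread (the post's offheap queue only exposes per-element readAcquire/readRelease, so the batch method and its shape are an assumption):

    // Hypothetical API: claim a batch with ONE counter move, then mark
    // each slot unused individually, which keeps good locality.
    interface OffheapSpsc {
        // Claims up to n elements, advancing the read counter once;
        // fills addrs with their offheap addresses, returns the count.
        int readAcquireBatch(long[] addrs, int n);
        // Marks a single slot unused; still done per element.
        void readRelease(long addr);
    }

    final class BatchingConsumer {
        static void drain(OffheapSpsc queue, long[] addrs) {
            int claimed = queue.readAcquireBatch(addrs, addrs.length);
            for (int i = 0; i < claimed; i++) {
                // process the element at addrs[i] here ...
                queue.readRelease(addrs[i]);
            }
        }
    }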
forked_franz (2015-05-11 18:33):
Hi Nitsan,
Before trying to dispute anything I have to understand better all the mechanics behind a lot of your implementations :D For example, I'm now modifying your SPSC offheap concurrent ring buffer and I really don't know how to implement the batching logic in order to provide it with a higher-level interface (similar to the EventHandler of the Disruptor). Repeatedly calling readAcquire with only a final commit, or...?

Nitsan (2015-05-02 12:18):
If you use the Disruptor as an SPSC queue I think the benefit from batching is marginal (depending on the size of the batch and how the API is constructed). If you are submitting large batches you could benefit from System.arraycopy in copying them in, or you could delay consumer visibility and reduce near-empty queue contention (see the sketch below). There will be a tradeoff, and I doubt the difference will be measurable in a real application. I'm happy to learn something new here, so if you care to dispute the above (with some numbers and evidence) please do.
The JCTools SPSC queue will be a faster solution in the plain Queue use case (where elements are actually added to and removed from the queue). If elements are allocated for offering and GC-ed after being polled, GC will become something to worry about; there are separate object-pooling solutions that can be used in this case and may prove helpful. The Queue interface does not concern itself with object reuse.
In the MPSC case there's a benefit to batching in reducing contention on the producer side; it's on the cards for the JCTools queues at some point.
The Disruptor offers functionality over and above a queue:
- Pooling of queue events
- Chaining several stages of a pipeline over one queue
- Parallel consumption of events in an SC manner (effectively a broadcast)
So using it as a generic queue (i.e. passing messages of many types, adding and removing them to/from the Disruptor) is not optimal. It's still a better choice than the JDK for certain use cases, but JCTools offers a more direct implementation for the generic queue requirement.
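The batch offer mentioned above might look like the following sketch; the field names are assumptions in the style of an SPSC array queue, not actual JCTools code. One arraycopy plus a single ordered store publishes the whole batch at once, which is the delayed-visibility effect being described:

    import java.util.concurrent.atomic.AtomicLong;

    final class SpscBatchOfferSketch<E> {
        final Object[] buffer;
        final int mask;
        final AtomicLong producerIndex = new AtomicLong();

        SpscBatchOfferSketch(int capacityPow2) {
            buffer = new Object[capacityPow2];
            mask = capacityPow2 - 1;
        }

        // Assumes the caller has checked there is room and that the run
        // [offset, offset + n) does not wrap around the buffer.
        void offerBatch(E[] src, int n) {
            long pIndex = producerIndex.get(); // plain read: single producer
            int offset = (int) pIndex & mask;
            System.arraycopy(src, 0, buffer, offset, n);
            producerIndex.lazySet(pIndex + n); // one release store for n elements
        }
    }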
forked_franz (2015-05-02 10:45):
Hi Nitsan!
I know that it is an old post and what I have to ask seems a little off-topic, but I'm really interested in your opinion about the good/optimal use cases for the Disruptor. In the last few years I have used it in many projects as a replacement for a queue, plus the ability to batch on both the producer and the consumer side (typically when I/O bound, in the latter case). In Martin T.'s post about Smart Batching I recognize that even a plain queue could do the same from the consumer side, but not on the producer side...

Francesco

Nitsan (2014-12-28 11:45):
Implemented a queue that should work for your use case: SpscGrowableArrayQueue
https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/SpscGrowableArrayQueue.java
Let me know if it works for you.
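Usage is through the plain java.util.Queue interface; a quick sketch (the two-argument initial/max capacity constructor is my reading of JCTools at the time, so treat it as an assumption):

    import java.util.Queue;
    import org.jctools.queues.SpscGrowableArrayQueue;

    public class GrowableDemo {
        public static void main(String[] args) {
            // Starts small (cheap for short-lived queues) and grows on
            // overfill up to the given maximum, per the discussion below.
            Queue<Integer> q = new SpscGrowableArrayQueue<>(16, 1 << 16);
            q.offer(42);                  // single producer thread
            System.out.println(q.poll()); // single consumer thread
        }
    }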
Nitsan (2014-12-11 08:04):
2) Sounds like I'd need to spend more time understanding your application to work this out...
4) The 'linked list' behaviour would have a similar impact to the growth of an ArrayList, so I think overall this is a good compromise, and the small queue will still be quite small (you'll pay for 64b of padding upfront, that's it).
When I get round to implementing it you can have a play and we'll see if it helps.

David Karnok (2014-12-03 18:00):
Thanks for the explanation.
2) There is an asynchronous unsubscription which would eventually stop the producers on a best-effort basis. So when this happens, the queue might still be in use on its producer side. We would need to make sure the producer doesn't offer if the unsubscription has happened, or, if the unsubscription happens while the producer is offering, that it cleans up and doesn't use the queue again. Otherwise, if the producer is not producing, the async unsubscription should clean up immediately. An AtomicInteger might do the trick, but it adds increment/decrement overhead around each offer (see the sketch below).
4) I've been thinking, and perhaps we could do the following: start off with a small queue and no padding, and in case of overfill, or after a certain number of elements, switch over to a (fully padded) larger queue for the rest of the lifetime. I haven't mentioned this before, but GC pressure seems to be of concern to Netflix, so anything that allocates and throws away objects (such as a linked list) should be avoided if possible.
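The AtomicInteger guard being described might look like this sketch; the names and the cleanup protocol are illustrative guesses, not RxJava code, and it mainly shows where the per-offer increment/decrement overhead comes from:

    import java.util.Queue;
    import java.util.concurrent.atomic.AtomicInteger;

    final class GuardedProducer<E> {
        static final int UNSUBSCRIBED = Integer.MIN_VALUE / 2;
        final AtomicInteger inUse = new AtomicInteger();
        final Queue<E> queue;

        GuardedProducer(Queue<E> queue) { this.queue = queue; }

        boolean offer(E e) {
            if (inUse.incrementAndGet() <= 0) { // unsubscription already flagged
                inUse.decrementAndGet();
                return false;                   // never touch the queue again
            }
            try {
                return queue.offer(e);
            } finally {
                inUse.decrementAndGet();        // the per-offer overhead in question
            }
        }

        void unsubscribe() {
            // Pushes the count negative; whichever side observes the flag
            // last would be responsible for the actual cleanup (elided here).
            inUse.addAndGet(UNSUBSCRIBED);
        }
    }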
1. "...Let me try and cover the main points here<br />1. "it allocates several kilobytes of memory due to padding. " : the queue allocates 128b padding on the counters, the object and the array. It's allot of padding it the queue is very small. The problem is however orthogonal to resizable queues. An infinitely resizable queue is probably not what you want.<br />2. "We tried pooling queues but making sure the queue is returned to the pool at the right time adds a per element overhead." : I'm not certain why the overhead is per element. Surely queues are only added/removed when the operators are created/disposed of?<br />3. "I've tried combining the two in the form of a concurrent array queue which can grow if necessary" : you've removed all the padding and introduced a growOnFull behaviour. These are 2 distinct features. In particular removing the padding will make creating small queues faster. But will also expose you to false sharing.<br />4. "Is it worth pursuing this resizable queue?" : yes, but I'd recommend taking a different approach. You can get most of the padding benefits from duplicating the cold fields and keeping the padding between consumer/producer fields (e.g. replace buffer with pbuffer and cbuffer to be accessed from respective ends). You can also reduce the pad size to 64b instead of 128b. This will save you most of the allocation overhead. For resizing I would recommend linking the buffers when growing rather than replacing THE buffer. In this scenario you allocate an array of size "capacity + 1" and use the spare element as the pointer to the next buffer when the producer considers the current pbuffer full. This is along the lines of the Fast Flow unbounded SPSC queue. A further tweak to this method would be using some kind of pool for the object arrays, the pool will only be accessed on queue resize and consumer exhaustion of buffer. This approach is currently filed under issue #18 on JCTools, I will now update it with above details.Nitsanhttps://www.blogger.com/profile/10496299147100350513noreply@blogger.comtag:blogger.com,1999:blog-5171098727364395242.post-3894222668556995732014-12-03T12:27:19.191+00:002014-12-03T12:27:19.191+00:00Hello! I've been working on improving the perf...Hello! I've been working on improving the performance of some operators in the RxJava library and I'd like your opinion about a few things. <br /><br />We are currently using your SpscArrayQueue which gives great performance for long lasting data transfers, but is a real "killer" for short lived operations (i.e., the queue is created, a few elements pass, then it is forgotten). The reasons seems to be that whenever the queue is created, it allocates several kilobytes of memory due to padding. We tried pooling queues but making sure the queue is returned to the pool at the right time adds a per element overhead. 
David Karnok (2014-12-03 12:27):
Hello! I've been working on improving the performance of some operators in the RxJava library and I'd like your opinion about a few things.
We are currently using your SpscArrayQueue, which gives great performance for long-lasting data transfers but is a real "killer" for short-lived operations (i.e., the queue is created, a few elements pass, then it is forgotten). The reason seems to be that whenever the queue is created, it allocates several kilobytes of memory due to padding. We tried pooling queues, but making sure the queue is returned to the pool at the right time adds a per-element overhead.
In an experimental branch, we switched to SpscLinkedQueue, which suffers from the same "wasteful" allocation due to the node size.
I've tried combining the two in the form of a concurrent array queue which can grow if necessary ( https://gist.github.com/akarnokd/d561a92481062c94d792 ). It performs great in the short-lived scenario mentioned above (2.7x better than SpscArrayQueue(128)), but lags behind on prolonged single-threaded offer/poll and performs slower/unreliably in your ConcurrentQueueThroughputBusy JMH benchmark.
Is it worth pursuing this resizable queue, or is there another way of having a small memory footprint for short-lived queues yet great throughput for long-lived queues?

Anonymous (2013-12-02 20:20):
I agree boost SPSC is not optimal. There's a 1024cores blog post regarding FF:
http://www.1024cores.net/home/lock-free-algorithms/queues/fastflow

Anonymous (2013-11-27 17:28):
Makes sense. Thank you.

Nitsan (2013-11-27 08:13):
1. In my example I use it as a queue of ints; there's no nulling anywhere. ValueEvent is a wrapper for a mutable int. If it were a reference queue it would wrap an Object reference (see the sketch below).
2. The size being unbounded and unplanned for is part of the issue. It's about expected behaviour. If you take Runnables off a queue and run them, do you expect their state to hang around after they are done?
3. I can't comment on the typical use of the Disruptor, but I know some people try and use it as a replacement for a queue. It's not a bad plan, as it beats the JDK queues, but it's not the interface or design it was built for, IMHO.
4. The Disruptor doesn't inherently require you to null out anything; only when your events wrap references do you need to start worrying about nulling those references out.
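For reference, the two event shapes contrasted in point 1; this is a reconstruction, and the actual ValueEvent in the linked benchmark may differ in detail:

    final class ValueEvent {   // mutable int wrapper: nothing to null out
        int value;
    }

    final class RefEvent {     // reference-holding wrapper: the consumer must
        Object ref;            // set ref = null after use, or the referent is
    }                          // kept alive for as long as the slot lives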
Anonymous (2013-11-26 19:00):
In the example that you posted at https://github.com/nitsanw/examples/blob/revisited/src/disruptor/DisruptorSPSCPerfTest.java, where are you nulling events out?
When you say a generic reference queue, are you talking about a queue where the references held are to elements of different sizes? In your example all the elements of the queue point to ValueEvent references, right?
Is your point that if the references held in the queue point to objects of different kinds we cannot re-use them readily, and hence must null them out and create new ones?
Also, when you say the Disruptor is not a generic queue, do you mean that if the Disruptor were used with a poll/offer interface we would need to null out entries? If you use it in the typical manner, where your events are homogeneous, and you fetch the next sequence, modify the object and publish the sequence, would nulling out by the consumer still be a concern?

Nitsan (2013-11-26 18:41):
Not at all bothersome; the more you ask, the more I have to think.
In a GC runtime, any reference which is not nulled will stop the referred object from being collected. If the polled element from a reference queue is not nulled, there is a memory leak. The number of references leaked is bounded by the size of the queue, so in some cases you may consider it acceptable. In a generic reference queue, where the size of the elements referenced by the queue is unknown, it is generally not acceptable.
The Disruptor is not a generic queue, and using it as such is naive. But if you were to use it as such, you would have to null out the reference holder. This is not a criticism of the Disruptor, but of inappropriate use of it.

Anonymous (2013-11-26 15:55):
Thanks Nitsan. Sorry to be bothersome, but why would there be a memory leak exactly? Could you please point me to an example (maybe in the Disruptor src examples) where the consumer actually nulls out the field in the event wrapper?

Nitsan (2013-11-26 15:49):
When people use the Disruptor as a generic queue replacement (which is not a good use for it) they will have the ring buffer pre-filled with an event holder. The producer will claim a slot and pop an event into it; the consumer will read the slot, get the event, and null it out. If you do not null out the event for generic queues you get a memory leak. So the consumer never nulls out the slot in the ring buffer, but does null out the field in the event wrapper (see the sketch below). The event wrappers are allocated all at once, so may well be laid out back to back in memory. They have no padding, so read/write operations on them can cause false sharing.
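A sketch of that nulling in Disruptor terms: the ring buffer slot keeps its pre-allocated wrapper forever, and only the wrapper's payload field is cleared. ObjectEvent and ClearingHandler are illustrative names; EventHandler is the real Disruptor callback interface:

    import com.lmax.disruptor.EventHandler;

    final class ObjectEvent {
        Object payload;        // the reference the generic-queue use smuggles in
    }

    final class ClearingHandler implements EventHandler<ObjectEvent> {
        @Override
        public void onEvent(ObjectEvent event, long sequence, boolean endOfBatch) {
            Object message = event.payload;
            event.payload = null;  // null the field, not the slot: no leak
            process(message);
        }

        private void process(Object message) { /* consume the message */ }
    }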
Anonymous (2013-11-26 15:31):
Aah, this makes sense. I still don't get the nulling out of the event part in your 3-cache-line exchange explanation and in the false sharing explanation.
I thought that in the Disruptor framework the consumer never nulls out the event in the queue; the producer merely re-uses it when it's safe to do so.

Nitsan (2013-11-26 08:35):
You are comparing the counter size to the event size; I'm just saying the counter size is not significant.
For a given event (of any size), the Disruptor and the queues discussed above act as reference queues, delivering a reference to the event rather than the event itself. The Disruptor is in fact passing a reference to the event wrapper object, which holds another reference to the actual event (when used naively as a queue), hence the extra pointer chase.
For the Disruptor, consumers don't cache the producer index like the SPSCQueue implementation, but rather sample it once per batch read of events from the ring buffer (see the BatchEventProcessor). The SingleProducerSequencer does caching in a similar fashion to the SPSCQueue. This means that the control/synchronization data is only exchanged when the locally cached view is exhausted, so once per many events in theory (this caching is sketched after this comment).
When the queue/disruptor is empty or near empty, the consumer's view is rapidly exhausted, reducing the effectiveness of the batching. To exchange a single event we trigger the following:
1. Read miss of the tail value
2. Read miss of the event from the buffer
3. Potential write miss when nulling out the event in the queue, if the producer has written another event
The above is not about the size of the event (though inlined events will spare you number 3 when inlined into the buffer); it is due to the queue being near empty. If the queue is near empty all the time, because the consumer is faster than the producer for instance, then this will become significant.
False sharing is an extra risk for the Disruptor because of the extra indirection of the event wrappers, which can experience false sharing between them. The worst case is when the event wrappers are small enough, and laid out densely enough, to trigger false sharing when an event is read from one and nulled out/written in another. This worst-case scenario is again about the queue being near empty or near full.
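The "sample once per batch" caching reads roughly like this sketch (consumer side only; the producer's mirror-image caching of the consumer index and the padding are elided, the field names are illustrative, and plain array accesses stand in for the ordered loads/stores a real queue needs):

    import java.util.concurrent.atomic.AtomicLong;

    final class CachedLimitConsumer<E> {
        final Object[] buffer;
        final int mask;
        final AtomicLong producerIndex; // written by the producer thread
        long consumerIndex;             // owned by this consumer thread
        long cachedProducerIndex;       // consumer-local snapshot

        CachedLimitConsumer(Object[] buffer, AtomicLong producerIndex) {
            this.buffer = buffer;
            this.mask = buffer.length - 1;
            this.producerIndex = producerIndex;
        }

        @SuppressWarnings("unchecked")
        E poll() {
            if (consumerIndex == cachedProducerIndex) {    // local view exhausted
                cachedProducerIndex = producerIndex.get(); // miss 1: the tail value
                if (consumerIndex == cachedProducerIndex) {
                    return null;                           // genuinely empty
                }
            }
            int offset = (int) consumerIndex & mask;
            E e = (E) buffer[offset];                      // miss 2: the event
            buffer[offset] = null;                         // miss 3: the null-out
            consumerIndex++;
            return e;
        }
    }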
Anonymous (2013-11-26 05:38):
Edit: the event information : control information ratio would be bigger.

Anonymous (2013-11-26 03:15):
I didn't allege that the problem is the fact that the counter is a long and not an int or a short. When you write an event in the Disruptor you are writing an int for the event and a long for the control. In the FF queue you are just writing an int. The issue I was trying to highlight is that each event's data (set on one core) needs to travel to the other core to be processed, as does the counter.
The fact that the event size is less than or equal to the counter size means that similar amounts of control information and real event information have to travel to another core. But if each event were, say, a large image, then the event information : control information ratio would be smaller, diminishing this particular advantage of the FF queue - the advantage that less data needs to travel across cores.
When you say "when the queue is empty", which queue are you talking about? The Disruptor? If so, why would it have traffic around 3 cache lines when empty but not otherwise? Is it because the cached counter values are no longer good enough to make a decision as to whether to overwrite a particular entry?
Also, why would the event objects be subject to false sharing for the Disruptor and not for the other queues? My C++ implementation actually inlines the array with all the control information, so the event objects are not subject to false sharing.

Nitsan (2013-11-25 18:02):
I don't think your analysis of the cache traffic is correct. The counters are cache-local and only experience a miss (read/write) when the last read value is exhausted. Given that cache coherency happens at the cache line level, it would make no difference if the counters were ints (or, for that matter, shorts); a miss is a line miss. The issue with the counter approach is when the queue is near empty, leading to traffic around 3 cache lines instead of just 1 in the case of FF.
Also note that the Disruptor, when used naively as a queue, adds a further pointer chase. The event objects themselves may be subject to false sharing, leading to further issues.

Anonymous (2013-11-25 16:49):
Thanks Nitsan. I appreciate the caveats regarding the benchmark in the article. One thing I'd like to point out with the Disruptor and tiny events is that, since it uses a couple of longs for synchronization, if the event itself is just an int the synchronization variables cause as much or more cache traffic than the actual payload. With a much larger payload the synchronization costs for the Disruptor should be relatively lower. But your conclusions in general make sense: if we can prevent false sharing in queues with no control variables at all, they should be faster than the Disruptor for SPSC workloads.