For the last few days I’ve been doing a deep dive on v3’s performance.
Although it’s pretty late in the game and this will delay things a little, I’ve decided to change the execution model Cantabile uses to dispatch work items across multiple cores. The problem is always getting the size of the work items right - big enough that it’s worth dispatching to another thread, but small enough that there are still enough separate items to keep every core busy.
In build 3097 the work items were very small, but the execution engine coalesced these small items together to make larger batches of work that could be executed together. This worked reasonably well, but there was a fair bit of overhead during each audio cycle to work out the execution plan.
In the new execution model, responsibility for these groupings has been moved from the execution engine to the higher level application objects. This gives two advantages - the execution groups are more controlled and much of the execution planning can be done up front rather than on every cycle.
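To make the difference concrete, here is a minimal Python sketch of the two approaches. It is not Cantabile's actual code: `ExecutionGroup`, `Plugin`, `coalesce` and the two `run_cycles_*` functions are hypothetical names, each plugin is assumed to have three small steps, and the actual dispatch to worker threads is left out so that only the planning cost is visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExecutionGroup:
    """A batch of work items that is dispatched to a worker as one unit."""
    items: List[Callable[[], None]] = field(default_factory=list)

    def run(self) -> None:
        for item in self.items:
            item()


def coalesce(items: List[Callable[[], None]], batch_size: int) -> List[ExecutionGroup]:
    """Engine-side grouping: merge small items into batches of batch_size."""
    return [ExecutionGroup(items[i:i + batch_size])
            for i in range(0, len(items), batch_size)]


class Plugin:
    """Hypothetical application object that owns its own execution group."""

    def __init__(self) -> None:
        self.steps_run = 0
        # Application-side grouping: built once, reused on every cycle.
        self.group = ExecutionGroup([self.step, self.step, self.step])

    def step(self) -> None:
        self.steps_run += 1


def run_cycles_engine_grouped(plugins: List[Plugin], cycles: int, batch_size: int = 3) -> int:
    """Old approach (as assumed here): rebuild the batches on every cycle."""
    plans_built = 0
    for _cycle in range(cycles):
        small_items = [p.step for p in plugins for _step in range(3)]
        plan = coalesce(small_items, batch_size)
        plans_built += 1
        for group in plan:
            group.run()
    return plans_built


def run_cycles_app_grouped(plugins: List[Plugin], cycles: int) -> int:
    """New approach (as assumed here): plan once up front, then only execute."""
    plan = [p.group for p in plugins]
    plans_built = 1
    for _cycle in range(cycles):
        for group in plan:
            group.run()
    return plans_built
```

Both functions run exactly the same steps; the only difference is how often a plan has to be built, which is the overhead the new model is trying to take out of the audio cycle.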
It’s not completely working yet (I haven’t really tested racks at all), but plugin processing is looking promising. I’ve been testing with 64 instances of a simple VSTi, with no notes sounding (since I’m interested in the load from Cantabile itself, not the plugins).
Here’s how the timing looked in 3097:
00078386 [3940:2]: Parallel Thread: exec:2648861 wakes:9029 soft:2 hard:2615 locks:55610 lockrate:6.2
00078387 [6176:2]: Parallel Thread: exec:6054086 wakes:10326 soft:0 hard:9028 locks:658256 lockrate:63.7
00078387 [6176:2]: Parallel executor (execute): 10326 hits, avg: 1103 ticks (0.403ms), max: 4927 (1.802ms)
00078387 [6176:2]: avg: 0 ticks (0.000ms) max: 2 ticks (0.001ms) - Lock taken
00078387 [6176:2]: avg: 479 ticks (0.175ms) max: 1239 ticks (0.453ms) - Prepared
00078387 [6176:2]: avg: 1102 ticks (0.403ms) max: 4927 ticks (1.802ms) - Tasks executed
Here’s the new execution model:
00070726 [3728:2]: Parallel Thread: exec:324315 wakes:6959 soft:0 hard:3507 locks:338288 lockrate:48.6
00070727 [5336:2]: Parallel Thread: exec:387199 wakes:10502 soft:0 hard:6958 locks:401115 lockrate:38.2
00070727 [5336:2]: Parallel executor (execute): 10502 hits, avg: 786 ticks (0.288ms), max: 1866 (0.682ms)
00070727 [5336:2]: avg: 0 ticks (0.000ms) max: 29 ticks (0.011ms) - Lock taken
00070727 [5336:2]: avg: 90 ticks (0.033ms) max: 803 ticks (0.294ms) - Prepared
00070727 [5336:2]: avg: 786 ticks (0.287ms) max: 1865 ticks (0.682ms) - Tasks executed
Breaking this down:
- Comparing the first two lines in each test, you can see the load is much more evenly distributed across the two threads. In 3097 one thread is doing more than double the work of the other.
- Still on those two lines, you can see the total number of executed items dropped from about 8.7 million to about 700 thousand, reflecting the larger size of each unit of work.
- Comparing the third line, the average execution time has dropped from 0.4ms to about 0.29ms (roughly a 28% improvement).
- Still on the third line, the maximum execution time has dropped from 1.8ms to under 0.7ms - more than twice as fast.
- On the fifth line the average time to prepare the execution has dropped from 0.175ms to 0.033ms - this represents the dramatic reduction in work required to prepare the execution plan because much of it is now pre-planned.
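For anyone who wants to check the arithmetic, the figures above can be recomputed from the log output with a few lines of Python. This is just a sketch: the log text and millisecond values are copied from the two runs above (with the timestamp and thread prefix dropped), while the helper names `exec_counts`, `summarize` and `percent_drop` are made up for the example.

```python
import re

# "Parallel Thread" lines from the two runs, minus the timestamp/thread prefix.
LOG_3097 = [
    "Parallel Thread: exec:2648861 wakes:9029 soft:2 hard:2615 locks:55610 lockrate:6.2",
    "Parallel Thread: exec:6054086 wakes:10326 soft:0 hard:9028 locks:658256 lockrate:63.7",
]
LOG_NEW = [
    "Parallel Thread: exec:324315 wakes:6959 soft:0 hard:3507 locks:338288 lockrate:48.6",
    "Parallel Thread: exec:387199 wakes:10502 soft:0 hard:6958 locks:401115 lockrate:38.2",
]


def exec_counts(lines):
    """Pull the exec:N counter out of each thread line."""
    return [int(re.search(r"exec:(\d+)", line).group(1)) for line in lines]


def summarize(lines):
    """Total executed items, and how lopsided the split between threads is."""
    counts = exec_counts(lines)
    return {"total": sum(counts), "imbalance": max(counts) / min(counts)}


def percent_drop(before, after):
    """Reduction from before to after, as a percentage of before."""
    return 100.0 * (before - after) / before


old, new = summarize(LOG_3097), summarize(LOG_NEW)
avg_drop = percent_drop(0.403, 0.288)      # average execute time, ms
max_ratio = 1.802 / 0.682                  # maximum execute time, ms
prepare_drop = percent_drop(0.175, 0.033)  # average prepare time, ms

print(f"items executed: {old['total']:,} -> {new['total']:,}")
print(f"busiest/idlest thread: {old['imbalance']:.2f}x -> {new['imbalance']:.2f}x")
print(f"average execute time: {avg_drop:.1f}% lower")
print(f"maximum execute time: {max_ratio:.2f}x lower")
print(f"average prepare time: {prepare_drop:.1f}% lower")
```

The thread imbalance comes out at roughly 2.3x in 3097 against roughly 1.2x in the new model, and the prepare step is a little over 80% cheaper.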
So in practice what does this look like on the load meter?
- In 3097 the load varied from about 15 to 30%, occasionally dropping to 7% and occasionally spiking as high as 70%.
- In the new build it’s much more stable and sits around the 8% mark (+/- maybe 2%).
I call that a win! Well worth the crazy number of code changes, but now I need to stabilize it.
Having said all that, v2 runs the same 64 plugins at about 5% - but it can only do that because of its much simpler routing capabilities. v3 needs more audio mixers per plugin - but even so, it works out to less than 0.05% load per plugin for the extra capabilities - probably worth it. Either way, I’ll take a stab at trimming that down too.
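The per-plugin figure is just the gap between the two load readings spread over the 64 instances. As a quick sketch in Python, using the approximate load meter values quoted above (so treat the result as a ballpark, not a measurement):

```python
PLUGIN_COUNT = 64
V3_LOAD_PERCENT = 8.0  # approximate load meter reading, new v3 build
V2_LOAD_PERCENT = 5.0  # approximate load meter reading, v2

# Extra load v3 carries per plugin for its richer routing.
extra_per_plugin = (V3_LOAD_PERCENT - V2_LOAD_PERCENT) / PLUGIN_COUNT
print(f"extra load per plugin: {extra_per_plugin:.3f}%")
```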