Potential fix for Roland Zenology

brad · October 8, 2020, 12:04am

Excellent - thanks guys.

Sorry this took so long to get resolved - busy on other stuff and random bugs are hard to track down.

Derek · October 8, 2020, 5:32am

Oh, but bugs are never random. They just appear that way. Find the failure mode and trigger it and they are 100% deterministic.

But seemingly random bugs are devils to find. Kudos as usual for finding a fix for Roland’s naughtiness…

brad · October 8, 2020, 11:53pm

Technically true, but sometimes the conditions can be so arbitrary that it might as well be random.

eg: in this case it was a combination of the user’s timing when opening and closing the GUI, relative to previous interactions with Cantabile popups, and the .NET garbage collector. Together they exposed a bug in GuiKit that shouldn’t normally happen but was triggered by the plugin not re-enabling a window.

ie: I never got to the point where I could reliably trigger the bug, but understood it well enough that I could trigger it more quickly than by chance alone.

Derek · October 9, 2020, 6:58pm

Sorry for the pedantry. I do a lot on system reliability at work as part of proving what we do is safe. So it was a bit mind boggling to get my head around the fact that software reliability is very different to hardware reliability (which is probabilistic failure). Software is 100% deterministic and will not fail until you hit those input conditions, but hit them and it will fail 100% of the time. The hard part is finding and replicating those conditions. Even stranger, when doing reliability analysis, is that once you have fixed the bug, you can discount all related failures from your stats as you have removed the failure condition. So your reliability improves as you fix software problems, whereas a hardware failure, for example, will stay in the stats.
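A toy sketch of that idea, in Python. Everything here is made up for illustration (`parse_level`, the magic value 4242, the input range); it only shows a latent bug that looks random under varied input, fails every time once you know the trigger, and drops out of the stats once fixed.

```python
import random

def parse_level(value: int) -> int:
    """Hypothetical routine with a latent bug: it fails for exactly one input."""
    if value == 4242:  # the hidden failure condition (invented for this sketch)
        raise RuntimeError("latent bug triggered")
    return value * 2

def parse_level_fixed(value: int) -> int:
    """Same routine with the failure condition removed."""
    return value * 2

def count_failures(func, inputs) -> int:
    """Count how many inputs make func raise RuntimeError."""
    failures = 0
    for v in inputs:
        try:
            func(v)
        except RuntimeError:
            failures += 1
    return failures

# "Field" usage: varied inputs, so failures look sporadic.
rng = random.Random(0)
field_inputs = [rng.randrange(10_000) for _ in range(100_000)]
print("sporadic failures in the field:", count_failures(parse_level, field_inputs))

# Once the trigger is known, it fails every single time.
print("failures when triggered:", count_failures(parse_level, [4242] * 100))

# After the fix, the same inputs produce no failures at all.
print("failures after the fix:", count_failures(parse_level_fixed, field_inputs))
```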

Worst one I ever had in this vein was in the days of MSDOS. I had a real time system that would “randomly” crash with a C runtime stack overflow error at least a few times a day. No matter how large I set the stack, it would still do it. Took me months to find it. If you remember INT18 calls, I was using those to drive the character display, but I also had re-entrant interrupt driver routines for handling data coming in. What I found, almost by accident one day when I noticed that the stack frame pointer in the debugger was not in an area of program memory I was expecting, is that when you make an INT18 or 21 call, MSDOS would set the stack pointer to its own stack that was only about 5 bytes deep! So if a data interrupt came in when I was on my generous stack, all was fine, but if that interrupt came in during an INT call on MSDOS’s puny stack then my program’s stack frame needs were not met and the crash happened. I had to patch my interrupt routines to set their own stack to stop that, and restore the stack to whatever it was when they finished.

Ah, the memories!

Back on topic, kudos to you for finding it, it must have been a bugger to find.

cdv_gabriel · October 10, 2020, 7:12am

@Derek Same here! You made me recall the time when I was developing acquisition systems on Motorola 68000 and 68030 CPUs.
At least they had a much better interrupt architecture (IMHO) compared to the Intel ones! Anyway, interrupts are always nasty beasts…at the time we were developing standalone programs, as no OS was fast enough to sustain the throughput…
OK…enough nostalgia.
I am glad Brad was able to fix the problem (BTW, I just received the Cantabile stickers…they took one month to get from Australia to Italy but they have finally arrived).

Derek · October 10, 2020, 7:47am

The point of all this digression is that Brad has my sympathies when tracking down things like this, and kudos for fixing it. Every developer has been in the same boat at some point.