Recently I noticed that the C compiler was hanging in the simulator.
This was rather odd. The curious thing about it was that nothing had really changed. At least nothing regarding the simulator. Sure, I had reorganized the servers and now run the host inside a VM, but nothing changed inside the simulator. I didn’t even re-compile the binaries; I just moved the whole directory over and re-started it.
So, what gives?
Actually, I do remember having an issue with the C compiler hanging like this in the distant and heroic past, when real programmers still ruled the planet. It had to do with the real-time clock advancing in a weird way, resulting in some loop never realizing that it should re-check something, something… It was a long time ago and I don’t have the notes. I could be wrong, but it was something to that effect.
I’ve put this problem aside, mostly because I have another project I’m working on now and really, really didn’t feel like digging knee-deep into the simulator guts again. It didn’t sound like a fun endeavor.
Fate had different plans though. A nice fellow from the vast plains of the Internet approached me and asked if I could provide install instructions for a UNICOS install CD he had.
This was another task I had been meaning to do for a long time, so, given the impetus, I started working on it. In fact, I had already written up half of it while migrating the host to a VM, but the missing part, how to install a new UNICOS OS inside the simulator, still needed to be penned.
I got to work and – instead of writing detailed instructions – decided to create an install script that automates as much of the process as possible. Again, some of this I had already done previously, but back then a lot of pieces of the simulator were still missing, and there were gaping holes in my understanding of how the pieces fit together anyway, so that process was rather clunky.
The reason you need instructions from me (and why you can’t simply rely on the instructions from Cray) is that the simulator replaces the IO subsystem. The real iron used a set of VME-bus-based machines as IO processors (IOSV was the imaginative name given to them for the J90). It also used a Sun workstation, called the SWS, as the control interface. In my simulator all those pieces are missing, replaced by the simulator itself.
The operative word is replaced, not simulated. It works completely differently from the original; it just presents the same interface to the software running on the mainframe.
Now, the original install process – naturally – started on the SWS. You ran scripts and binaries on this machine, which eventually booted the IOSVs (through BOOTP, I think) and started talking to them. Eventually it installed the operating system for the IOSVs (which was VxWorks based). At that point you could get a console on the IOSVs and start bringing up the mainframe.
Within the simulation environment none of that applies: you start and end with the mainframe.
Long story short, I started putting together this script, taking bits from the old one, reading the scripts on the install CD and trying to piece together a process that actually worked.
And then it hit me: part of the install process is to compile a new kernel. If I wanted the install process to succeed, I had to stop procrastinating and figure out what the problem was.
So, let’s start with what we know:
- Nothing changed
- Things are still broken
- The last time I saw this behavior, it was time (and timing) related
Fine, point #1 is not exactly true, especially because I don’t really know when the last time was that things did work.
What changed recently? Well, it’s 2020 now, something that the original engineers of the program would probably have never guessed we would live to see. QA probably didn’t test for it either. So it’s not inconceivable that there’s a date-related bug that screws things up in 2020.
That’s easy to test, just set the time back and retry.
Nice try, no cheese. On to the next theory…
I remember that the last time I saw this, it was rather easy to find the problem once I managed to trigger the logging at the right time: the system was sitting in a tight loop, checking and re-checking the real-time clock and not being able to get out of the loop for some reason (which I don’t quite recall anymore). I do know that setting the right triggers up is a pain, so I hoped I wouldn’t have to do it this time.
Especially because I have X and TotalView, a GUI debugger, now! Simple enough: start a compilation under the debugger, wait until it hangs, hit ‘break’ and there it should be: the tight loop which causes all the trouble.
I’ve started TotalView:
totalview /opt/ctl/CC/3.4.0.0/bin/cc -a hello.c
And was greeted with this dialog:

Well, sh*t. I don’t even know what a ‘shared-text program’ is – the Cray PVPs didn’t really even have shared memory as a concept. They didn’t have a reasonable enough MMU to support such a thing. It doesn’t really matter though. TotalView flat-out refuses to co-operate. That’s what matters.
I’ve tried it anyway, and it did allow me to break in at least:

So, the disassembly is all bogus, but at least I have a call-stack.
The program is stuck in waitpid, around offset 5c (which in TotalView parlance means offset 5 (octal), parcel 2).
So, this is something. All I need is a disassembly for waitpid, and I’m golden, more or less.
I wrote a disassembler a long time ago that can take object files and create a listing with symbols and such, so I’ll just have to find out which object file (in which library) contains this routine, extract it, transfer it back to the host, disassemble it and … profit.
After some tinkering it turned out it’s in libc.a, in waitpid.o (well, not much of a surprise there).
However, when I tried to disassemble it, my program threw all sorts of asserts, steadfastly refusing to make any sense of the object file. Damn!
I’ve tried to disable the asserts to at least get something, but that didn’t seem to lead to anything useful, just more confused asserts later in the process.
OK, next attack: I can disassemble the whole executable and – since we have the pointer in the disassembly window – find the culprit that way.
This approach worked a little better, resulting in this:
0x000107D4:p0 (0x000107D4:p0) 0022504 - A5 0x00000004
0x000107D4:p1 (0x000107D4:p1) 0030557 - A5 A5+A7
0x000107D4:p2 (0x000107D4:p2) 0025601 - B1 A6
0x000107D4:p3 (0x000107D4:p3) 0025702 - B2 A7
0x000107D5:p0 (0x000107D5:p0) 0025566 - B54 A5
0x000107D5:p1 (0x000107D5:p1) 0040000:0000233:0000000 - S0 0x9B
0x000107D6:p0 (0x000107D6:p0) 0030660 - A6 A6+1
0x000107D6:p1 (0x000107D6:p1) 0071106 - S1 A6
0x000107D6:p2 (0x000107D6:p2) 0031660 - A6 A6-1
0x000107D6:p3 (0x000107D6:p3) 0003616 - SM14 0
0x000107D7:p0 (0x000107D7:p0) 0003614 - SM12 0
0x000107D7:p1 (0x000107D7:p1) 0004000 - EX 0
0x000107D7:p2 (0x000107D7:p2) 0003716 - SM14 1
0x000107D7:p3 (0x000107D7:p3) 0014004:0017545 - JSZ 0x000107D9:p1
0x000107D8:p1 (0x000107D8:p1) 0006004:0000655 - J 0x0001006B:p1
0x000107D8:p3 (0x000107D8:p3) 0006004:0017562 - J 0x000107DC:p2
Well, that’s not terribly enlightening: it’s just a system call of some sort (I didn’t bother to figure out which one; chances are it’s wait or something similar).
So, the actual waiting happens in kernel space (well, d’uh, now that I think about it), which means that TotalView, being a user-land debugger, is useless here.
Argh!
I really, really don’t want to set up an instruction trace in the simulator, that’s such a hassle!
As a last-ditch effort, I started looking at what the real-time-clock code looks like in the simulator.
Here’s the relevant piece:
CInt_t GetRealTimeClock() const {
    if (mUseHostRealTimeClock) {
        boost::timer::nanosecond_type DeltaTime = mRealTimeTimer.elapsed().wall - mRealTimeStart;
        CInt_t DeltaClocks = CInt_t(double(DeltaTime) / mCpuClockPeriod);
        return mRealTimeClock + DeltaClocks;
    } else {
        return mRealTimeClock;
    }
}
Aha!
So, I’m reading the host’s real-time clock (if set up that way) and returning the computed number of clock periods based on that.
Quick check: yes, I’m set up that way. So, first test: let’s turn that off (which means the real-time clock returns an estimate of the number of clock cycles spent, based on the number of instructions executed):
void UpdateRealTimeClock() {
    if (!mUseHostRealTimeClock)
        mRealTimeClock += mRealTimeClockIncrement;
}
Configuration change, retry… and drum-roll, please!
It worked! The compiler didn’t hang anymore.
This is not a solution of course, but it is a strong clue: the problem has something to do with the way I’m dealing with the real-time clock. But, again: nothing has changed in this code for ages. So why did the compiler start being unhappy about it now(ish)?
Then it hit me! When I migrated to a VM, I might have changed the way the host real-time clock behaves. More specifically, it is possible that its resolution got reduced, which would mean that GetRealTimeClock would be more likely to return the same value for subsequent calls. If cc uses these values to measure small delta-times, it’s possible that the call would always report 0 time elapsed and the code would stay in an endless loop. Maybe. It’s as good a theory as any other. So, provided I’m right, what to do about it? How do I ensure that while GetRealTimeClock returns the – well – real time, it also never returns the same value twice?
It is important to note here that the clock in the J90 ran at 100MHz, so the true real-time clock advanced every 10ns. If the host real-time clock is ‘chunky’ enough that it is likely to return the same value when called twice in quick succession, its resolution is probably in the tens of microseconds, if not worse. So it is safe to simply increment the value by 1 every time GetRealTimeClock is called while the underlying real-time clock keeps returning the same value. With that, this is what the modified code looks like:
CInt_t Mainframe_c::GetRealTimeClock() const {
    if (mUseHostRealTimeClock) {
        boost::timer::nanosecond_type DeltaTime = mRealTimeTimer.elapsed().wall - mRealTimeStart;
        CInt_t DeltaClocks = CInt_t(double(DeltaTime) / mCpuClockPeriod);
        if (DeltaClocks == mLastRealTimeReading) {
            DeltaClocks += 1;
        }
        mLastRealTimeReading = DeltaClocks;
        return mRealTimeClock + DeltaClocks;
    } else {
        return mRealTimeClock;
    }
}
OK, technically this allows the clock to run backwards if I managed to call it multiple times within 10ns, but let’s face it, that’s just not going to happen. That would require the simulator to run significantly faster than the original hardware. Fat chance!
I made this change, but does it work? Yes, yes it does. Sort of. It worked on one machine, one VM, but not on another. As luck would have it, the machine that serves the public server is the one still in trouble. So now what?
I played around with the numbers to no avail; I could only make the compiler work if I disabled the host RTC. And of course, in that mode the SSH server refuses to work for some reason (I guess because the host and the client machine clocks are off).
So, I have to roll up my sleeves and actually figure out what’s going on. As much as I would have hated to figure out how to enable tracing, I realized that under these circumstances it wouldn’t necessarily help anyway: tracing slows simulation down and consequently the relationships between the RTC and the instruction sequence would change.
But! I have a trace buffer. It is always enabled and contains the last few hundred executed instructions. So all I have to do is boot into single-user mode (to make sure there isn’t a bunch of other processes running), start a compilation, wait until it gets stuck, and exit the simulator. Upon termination the simulator spits out the disassembly from the trace buffer, and if I’m lucky, I’ll see where the loop is.
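The idea is nothing fancy. Stripped of all simulator specifics, it amounts to something like the sketch below; this is a generic illustration, not the simulator’s actual code, and the names (TraceBuffer_c, Log, Dump) are mine.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Generic sketch of a trace buffer: keep the last N executed instructions in a
// ring buffer and dump them when the simulator exits.
class TraceBuffer_c {
public:
    explicit TraceBuffer_c(std::size_t aDepth) : mEntries(aDepth) {}

    void Log(uint64_t aAddress, const std::string &aDisassembly) {
        mEntries[mNextIdx] = {aAddress, aDisassembly};
        mNextIdx = (mNextIdx + 1) % mEntries.size();    // overwrite the oldest entry
    }

    void Dump() const {                                 // called upon termination
        for (std::size_t i = 0; i < mEntries.size(); ++i) {
            const Entry_s &Entry = mEntries[(mNextIdx + i) % mEntries.size()];
            if (!Entry.mDisassembly.empty())
                std::printf("0x%08llx %s\n",
                            (unsigned long long)Entry.mAddress,
                            Entry.mDisassembly.c_str());
        }
    }

private:
    struct Entry_s { uint64_t mAddress; std::string mDisassembly; };
    std::vector<Entry_s> mEntries;
    std::size_t mNextIdx = 0;
};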
Here I will omit a lot of wandering in the wilderness. The short of it is that for quite a while I thought the problem was somewhere in the kernel. There were several clues that made me think that:
- TotalView told me we were in a syscall.
- The syscall traces appeared to show that a syscall didn’t return to the calling process (no, the kernel didn’t hang, simply scheduled the calling process off of the CPU)
- The trace buffer after termination contained instructions from the idle loop.
With all that, it appeared to me that for some reason the kernel decided to schedule cc off of the processor and never schedule it back again. I chased this line of thinking for days without success. Of course, the reason for that was that all of the above was incorrect:
- While TotalView wasn’t outright lying, it wasn’t exactly correct either. It turns out that cc calls another executable (sccom) to do the actual compilation. So yes, cc sits in waitpid, but that’s not the problem; the problem is that sccom hangs. Incidentally, if one tries to debug sccom with TotalView, it fails to produce even a reasonable stack-trace.
- The syscall actually returned. What confused me was that I was running these tests on a dual-CPU setup and the syscall traces were coming from both CPUs, intermixed.
- The trace buffer had a similar problem: it contained the trace from the wrong CPU. Or at least I was looking at the wrong CPU.
The benefit of all this wasted time was that at least I had a pretty good idea what the last syscall was that cc made: something called mtimes. It is a UNICOS-specific syscall; here are some excerpts from the corresponding man page:
DESCRIPTION
The mtimes system call lets users have a structure in user memory continually updated with multitasking execution overlap information. (This is the same information that appears in accounting records for multitasking programs.) From this mtms structure, users can determine how much execution time the multitasking program has accumulated in an interval.
NOTES
The times in the mtms structure refer to the period since the multitasking group began execution, not to the period since the invocation of the mtimes call.
Monitoring the mtms structure at the user level is somewhat tricky because the operating system running in another CPU may be updating it at the same time.
At this point my understanding was this: cc calls mtimes, and when the syscall returns, it just waits and waits and waits somewhere for … something.
My next attack vector was to turn on instruction tracing upon return from the mtimes syscall and see what on earth is going on there.
Luckily, this was relatively easy to set up, as mtimes is not a frequently called function: in the minimal repro, the first and only call to it came from cc.
Of course, the big question still was: with tracing enabled, does the timing change to the point that it destroys the repro?
One thing that worked in my favor was that I had both a working setup (RTC based on instruction execution) and a failing one (RTC based on the host real-time clock). So I could run the repro under both configurations and compare the differences. With that, I modified the logging so that it turns on when the mtimes syscall returns. This generates an instruction trace for the user-mode application (cc) right from the point of the return from kernel mode. I also turned logging off again when the kernel gets called next, which kept the trace sizes manageable.
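Schematically, the trigger amounts to something like this. Every name below (TraceTrigger_c, cMtimesSyscallNo, mTraceEnabled) is hypothetical and the simulator’s real hooks are organized differently, but the idea is the same: log user-mode instructions from the moment mtimes returns until the kernel is entered again.

struct TraceTrigger_c {
    // Placeholder value; the real UNICOS syscall number for mtimes is different.
    static const int cMtimesSyscallNo = 0;
    bool mTraceEnabled = false;

    // Called when a syscall returns to user mode: start per-instruction logging
    // right after mtimes comes back.
    void OnSyscallReturn(int SyscallNo) {
        if (SyscallNo == cMtimesSyscallNo) mTraceEnabled = true;
    }

    // Called whenever the kernel is entered again: stop logging to keep the
    // trace size manageable.
    void OnKernelEntry() { mTraceEnabled = false; }
};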
After both of the logs were done, the difference was rather stark. The traces diverge almost immediately and are essentially doing this:
- Read the contents of the struct that mtimes is expected to fill in, twice: once into V1, then again into V2.
- Measure, using the RTC, how long the two reads took.
- Execute two tests:
  - First, check that the two reads took less than 200 clock cycles.
  - If they did, check that V1 and V2 contain the same values.
In the successful case it took 180 (simulated) cycles to go through, so the code got out of the loop on the first try.
In the failing case, however, it took consistently about 16,000 clock cycles, which is of course way, way more than 200. (In a real, trace-free run it would be much quicker, but with all the tracing going on, it is obviously much slower.)
Since passing both tests is the only way out of this loop, the compiler hangs.
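To make the shape of that loop concrete, here is a rough C++ reconstruction of what sccom appears to be doing. Every name and type below is mine, pieced together from the instruction trace; it is emphatically not Cray’s source.

#include <cstring>

// Hypothetical stand-in for the real mtms structure the kernel keeps updated.
struct mtms_t { long fields[8]; };

mtms_t g_mtms;        // updated asynchronously, possibly from another CPU
long rtc_stub()       // stand-in for reading the real-time clock register
{
    static long ticks = 0;
    return ticks += 10;   // pretend each read costs a few cycles
}

mtms_t read_stable_mtms()
{
    for (;;) {
        long start = rtc_stub();
        mtms_t v1 = g_mtms;        // first read (V1)
        mtms_t v2 = g_mtms;        // second read (V2)
        long elapsed = rtc_stub() - start;
        // The only exit: the two reads fit in under 200 clock cycles AND saw
        // identical data. If the RTC jumps by thousands of cycles between the
        // two rtc reads, the first test never passes and we spin forever.
        if (elapsed < 200 && std::memcmp(&v1, &v2, sizeof v1) == 0)
            return v1;
    }
}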
Now, why on earth would the compiler do this? Remember the note in the man page of mtimes about checking the values being tricky? It is tricky because the values are potentially updated asynchronously from a different processor. So the wise coders of the compiler decided to read the structure twice to make sure the values are stable, but also to bound the time taken, so as to ensure no interrupts were executed in between the two reads.
Only, in doing so, they made an assumption about how real time passes while executing an instruction stream. This assumption was valid for physical hardware, but not for a simulation or a VM.
In a way, they implemented a timing-based side-channel measurement, not all that dissimilar to the Spectre vulnerability or the many other similar timing attacks.
We have a culprit! Now what to do about it?
Funnily enough, we need to change the RTC so that values read in quick succession are roughly the same, advancing only by a tiny amount. However, we’ll have to make sure that eventually we snap back to the true host RTC, or else our time would not pass in real time. ‘Eventually’ in this case means something significantly shorter than the scheduling granularity of UNICOS, which appears to default to 60Hz, or about 16ms. I ended up choosing the following logic:
- When the RTC is queried, save its last value.
- If it’s read again, look at the previous (saved) value.
- If the delta is large, return (and save) the new value.
- If the delta is small, return (and save) an incremented value of the old one, instead of that of the true RTC.
The threshold for ‘small’ defaults to 50us, but can be set. You can also change the increment value.
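Put together, and sketched in the same style as the earlier snippets, the logic looks roughly like this. The member names mRealTimeSnapThreshold and mRealTimeIncrement are illustrative (the published repo is the authoritative source), and mLastRealTimeReading is assumed to be a mutable member, just as in the earlier version:

CInt_t Mainframe_c::GetRealTimeClock() const {
    if (!mUseHostRealTimeClock) return mRealTimeClock;

    boost::timer::nanosecond_type DeltaTime = mRealTimeTimer.elapsed().wall - mRealTimeStart;
    CInt_t DeltaClocks = CInt_t(double(DeltaTime) / mCpuClockPeriod);

    if (DeltaClocks - mLastRealTimeReading > mRealTimeSnapThreshold) {
        // The delta is large: snap back to the true, host-derived reading.
        mLastRealTimeReading = DeltaClocks;
    } else {
        // The delta is small (reads in quick succession): advance by a small,
        // fixed increment instead, so back-to-back reads never return the same
        // value, yet stay well within the compiler's 200-cycle budget.
        mLastRealTimeReading += mRealTimeIncrement;
    }
    return mRealTimeClock + mLastRealTimeReading;
}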
I made these changes, rinse, repeat and voilà: the compiler works again! In all environments. Consistently. Now, to be fully honest, the threshold value is platform-specific: it’s possible that if you’re running on a much slower platform (hello, Raspberry Pi!), it needs adjusting.
Which brings up a few final points:
Not that the project wasn’t open source before, but I’ve cleaned up and published the git repo at https://github.com/andrastantos/cray-sim. This is where you can get the latest sources.
The online simulator at cray.modularcircuits.com has the patch and is up and running again.
I’ve tested the simulator on the Raspberry Pi: a Pi 4 with a 64-bit OS and 4GB of RAM works out of the box. Just follow the normal Linux build instructions. For Pi 3 and below, unfortunately, it appears they don’t have enough memory to successfully compile the sources. Cross-compilation should work, but YMMV.
I teased setup instructions at the beginning of this post. Those are hopefully coming soon, too.