Oldies But Goldies

A new year, a new target

My long-time friend, Chris, sent me an e-mail the other day: he got his hands on a few tapes that contain UNICOS 6.1.6 and 7.0.4 for YMP-el machines. This is a great find! These versions are several years older than the UNICOS 10 I’ve been dealing with for the last year or so, and they target a machine one generation older than the J90. Of course there was the ‘slight’ problem of getting the content off the tapes.

I’ll let him explain his saga on his blog if he wants to. I’ll just say here that it involved getting a tape reader off of eBay, me mailing him a SCSI card, and him 3D printing a new capstan rubber ring. After all that, during the holidays, he and Jeff (his accomplice in this endeavor) managed to extract the content and put it on my server.

So that’s the origin story of this episode: I have two large binary files on my machine, one for each UNICOS version. Now what?

Cracking the big blobs

First things first: let’s see what’s in the files.

They had a ‘tar’ extension and sure enough they extracted fine using tar. There was a slight problem though: the extracted content was a mere 12MB, whereas the original file was close to 500MB. Clearly there’s more to this than what tar can see. With some poking around, I managed to identify a Cray file-system image (NC1FS) shortly after the end of the tar portion. After that came a relatively short cpio archive, some junk whose purpose I still don’t know, and finally a large cpio file. Splitting the binary into these constituents, the following picture emerged:

  • The tar file at the top of the file contains the install files needed for the IOS. Chief among them, the kernel image that’s loaded from the IOS into the mainframe during boot.
  • The file-system is a small root-fs intended for a RAM-disk and contains the minimal set of executables needed to bootstrap the install process: things like ‘fdisk’, ‘mount’, ‘cpio’. I managed to mount this file-system in my existing J90 simulator, running UNICOS 10, and verified that it is intact.
  • The small cpio file seems to contain configuration files for a default install (the newer systems I’m familiar with have these under /skl).
  • Finally, the large cpio is a complete system image, including root and usr.

This sounds great, actually. It seems all the pieces are there to get a system working. In fact, there’s even a sample configuration file on the mini file-system:

Which is the first clue that we’re dealing with something old here: this style of configuration is what UNICOS 4.0 used on the X-MP. The UNICOS 10 image I’ve been working with recently used a different format. No big deal of course, just shows how much this OS evolved between version 6 (and 7) and 10.

Getting things going

Life would be too simple of course without problems: now I have an OS for a machine that I don’t have a simulator for. Not only that, but the YMP-el is curiously undocumented. There are only a few documents online, none of which detail the instruction set, for example (only summaries are available). But how much could things have changed between it and the J90? If I were a betting man, I would put my money on ‘not too much’. Maybe they’ve added a few instructions here and there, but the two machines should be fairly close to one another.

With that thought, I put together a setup that mapped the small root FS to a virtual disk, loaded a modified version of the above configuration file and the kernel into the J90 emulator and let it loose.

Nothing…

The machine was just chugging along, no input, no output, no smoke. Nothing.

By enabling instruction traces, it became pretty clear why: the mainframe was waiting to hear from the IOS. This is behavior that I’ve seen on the X-MP as well: there, the communication started with the IOS talking first (sending the time and date info). The newer J90 system went the other way and the mainframe sent the first packet (which, by the way, makes way more sense).

Luckily I had some code lying around from my early UNICOS 4.0 experiments that simulated this behavior. I yanked it out of the sources a long time ago, but that’s what revision control is for.

After some futzing around with the code, merging this ancient piece back and patching it up for all the changes since then, I got a console. And a crash report with it:

OK, this is not bad, actually. It means that:

  • The kernel booted
  • It parsed the configuration file properly
  • It found and mounted the root FS
  • It spawned the first process

But why did it crash? Well, obviously – it even tells me as much – because init exited. It’s not supposed to do that: this is the first process created by the kernel and the one that creates all the rest. If it terminates, the kernel doesn’t know what to do.

Looking into the instruction trace, comparing it with the memory dump and the little fragments of source code I have (mostly for UNICOS 10), I realized that the crash was happening in paniktst.s. This piece of code tests whether the interrupt was due to an error-exit or not (the Crays don’t have vectored interrupts, so SW has to differentiate between all sources, including exceptions). Furthermore, this particular routine is called if there was an interrupt while the processor was in monitor mode (privileged mode on modern platforms, a state where most interrupts are disabled).
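
To make the ‘no vectored interrupts’ point concrete, here is a minimal sketch of what software-side dispatch looks like: a single entry point inspects a flags word and works out which sources are pending. The bit positions are made up for illustration, and so are all flag names except PRE (the error-exit flag is called EEX here purely as a label). The real layout would have to come from the hardware documentation or the kernel headers.

```python
from typing import List

# Hypothetical bit assignments, for illustration only.
FLAG_BITS = {
    "EEX": 0,  # error exit (assumed name and position)
    "NEX": 1,  # normal exit, i.e. a system call (assumed)
    "PRE": 2,  # program range / protection error (position assumed)
    "IOI": 3,  # I/O interrupt (assumed)
}


def decode_flags(flags: int) -> List[str]:
    """List the names of all interrupt sources whose bit is set."""
    return [name for name, bit in FLAG_BITS.items() if flags & (1 << bit)]


def is_error_exit(flags: int) -> bool:
    """The kind of test the text describes: was this an error exit?"""
    return bool(flags & (1 << FLAG_BITS["EEX"]))
```

With these made-up positions, `decode_flags(0b0100)` reports only `PRE`, and `is_error_exit(0b0100)` is false.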

So, let’s see what the state of the machine was during the last interrupt:

There’s a lot of mumbo-jumbo here, but two things are important. First, there are two ‘exchanges’ right after one another. The first one exchanges out of monitor mode (kernel-to-user transition), but right after that we’re dropped back into the kernel without executing even a single instruction. The reason? It’s a protection error (PRE bit set in flags). Now, this is because the IBA and ILA (instruction base and instruction limit address) registers are set to the same value. Essentially we’ve set up the processor to allow 0 bytes to be executed, which is exactly what it did.
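
The effect of IBA being equal to ILA can be shown with a minimal range check. This sketch assumes the program address is relative to IBA and that ILA is an absolute, exclusive upper bound. The exact comparison and granularity on the real hardware may differ.

```python
def fetch_allowed(p: int, iba: int, ila: int) -> bool:
    """True if the program-relative address p may be fetched.

    Assumption: p is relative to the instruction base (iba) and the
    instruction limit (ila) is an absolute, exclusive upper bound.
    """
    return p >= 0 and iba + p < ila


# With base == limit the window is empty: even the very first
# instruction fetch is out of range.
empty_window = fetch_allowed(0, 0x1000, 0x1000)
normal_window = fetch_allowed(0, 0x1000, 0x2000)
```

Under these assumptions `empty_window` is false for any value of `p`, which matches the trace: the process is exchanged in and faults before it retires a single instruction.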

And now, the second point: the raw value of the exchange package (that determines the value of IBA and ILA) contains different values for the protection registers than what they end up with:

This obviously ain’t right. Unfortunately the exchange packet layout is not documented in the printed material that I have access to, but I’ve seen Cray messing around with the fields for these two registers before. In fact, machines within the same generation (dual and quad-CPU X-MPs) have these fields mapped out differently. Luckily, I do have access to the kernel headers, in which (in xp.h) the exchange packet layout is documented. And there, in fact, is my answer: the fields are shifted by two bits compared to the J90. Just great!

Anyway, easy enough to fix, though it does mean I now have to make run-time decisions based on machine type, and of course it also means that I have to introduce the machine type to the simulator to begin with.
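
A run-time decision of this kind might look like the sketch below: the field is pulled out of the 64-bit word with a shift that depends on the machine type. Only the two-bit difference comes from the text. The actual shift values, the field width and the direction of the offset are placeholders, and the names are hypothetical.

```python
from enum import Enum


class MachineType(Enum):
    J90 = "J90"
    YMP_EL = "YMP-el"


# Placeholder layout: only the 2-bit offset between the two machines is
# taken from the text; positions, width and direction are invented.
IBA_SHIFT = {MachineType.J90: 16, MachineType.YMP_EL: 18}
IBA_WIDTH = 16


def extract_iba(word: int, machine: MachineType) -> int:
    """Extract the (hypothetical) IBA field from an exchange package word."""
    return (word >> IBA_SHIFT[machine]) & ((1 << IBA_WIDTH) - 1)


# A word laid out for the YMP-el, decoded with both layouts.
word = 0x1234 << 18
right = extract_iba(word, MachineType.YMP_EL)
wrong = extract_iba(word, MachineType.J90)
```

Decoding with the wrong layout does not fail loudly. It just yields a plausible-looking but wrong value, which is why the symptom only shows up later as a protection error.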

With this hurdle down – and the code changed – I’ve restarted the simulator. Which now doesn’t crash, but doesn’t boot either. At least not fully. It is just sitting there looking all boring and dumb.

Looking at the logs (I’m very glad that I implemented some rudimentary syscall extraction), it became pretty clear that the problem is that the third call to ‘close’ doesn’t return to the app as it should, instead it enters an idle loop.

Another nice feature of UNICOS, at least of the official builds, is that they don’t strip the symbols off of the binaries. That means that I can reconstruct the call-tree (I didn’t bother automating it, but the manual process isn’t horrible). As it turns out, close was called from fclose (well, duh), which was called from _cleanup, then _exithandle, and finally from exit.
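
The manual lookup could be automated along these lines: given a symbol table, each return address found on the stack is mapped to the nearest symbol at or below it. This is a sketch only. The function names come from the call chain above, but every address is invented for the example.

```python
import bisect
from typing import Dict, List, Optional, Tuple


def build_index(symbols: Dict[str, int]) -> Tuple[List[int], List[str]]:
    """Sort symbols by address so they can be binary-searched."""
    entries = sorted((addr, name) for name, addr in symbols.items())
    return [a for a, _ in entries], [n for _, n in entries]


def symbolize(addr: int, addrs: List[int], names: List[str]) -> Optional[str]:
    """Map an address to 'symbol+0xoffset', or None if below all symbols."""
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    return f"{names[i]}+{addr - addrs[i]:#x}"


# Invented addresses, for illustration only.
SYMBOLS = {
    "exit": 0x100,
    "_exithandle": 0x140,
    "_cleanup": 0x180,
    "fclose": 0x200,
    "close": 0x260,
}
addrs, names = build_index(SYMBOLS)
backtrace = [symbolize(a, addrs, names) for a in (0x264, 0x210, 0x190, 0x150, 0x108)]
```

The nearest-symbol rule misattributes addresses that fall in unnamed code after the last known symbol, so the output is a hint rather than proof.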

The handle being closed belonged to stderr. It’s pretty clear that the system is waiting for something. But why would close do this? Why would it block?

That’s when the light went on: Buffering, that’s why! Buffered I/O needs to be flushed when the last handle to the file is closed (in this case stderr was redirected to stdout).

And now that I mention it… I haven’t seen any output from the utility being run, only from the kernel. I’ve seen some ‘write’ calls that seem to target stdout, but nothing came out. Which is another clue, by the way: initial kernel output is probably not buffered, and even if it is, the kernel must be polling for completion, since at such an early stage it can’t handle interrupts (monitor mode is largely un-interruptible). So what’s going on is that, for some reason, the handshake with the (simulated) console is not working, the output gets buffered up in the kernel, and when the file handle is closed the application gets blocked.

Looking at my code it became pretty clear why: I don’t send any replies for console messages. None whatsoever. No wonder the poor OS didn’t know it was free to send more data…
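
The failure mode is easy to reproduce in a toy model. The sketch below assumes a window of one, meaning the sender may have a single unacknowledged packet outstanding. The real IOS console protocol is not documented here, so the class and its behavior are illustrative only.

```python
from collections import deque
from typing import Deque, List


class ConsoleChannel:
    """Toy sender that may have one unacknowledged packet outstanding."""

    def __init__(self, console_acks: bool) -> None:
        self.console_acks = console_acks   # does the far end send replies?
        self.awaiting_ack = False
        self.pending: Deque[str] = deque()  # output buffered on the OS side
        self.sent: List[str] = []           # packets handed to the console

    def write(self, data: str) -> None:
        self.pending.append(data)
        self._pump()

    def _pump(self) -> None:
        # Send as long as nothing is outstanding.
        while self.pending and not self.awaiting_ack:
            self.sent.append(self.pending.popleft())
            self.awaiting_ack = True
            if self.console_acks:
                self.awaiting_ack = False   # the reply frees the window

    def close_would_block(self) -> bool:
        # close() has to flush, so it waits while anything is in flight.
        return bool(self.pending) or self.awaiting_ack


silent = ConsoleChannel(console_acks=False)
silent.write("line 1\n")
silent.write("line 2\n")

polite = ConsoleChannel(console_acks=True)
polite.write("line 1\n")
polite.write("line 2\n")
```

With a console that never replies, the second write stays buffered and `close_would_block()` is true forever. With replies, both lines go out and close returns at once.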

With the appropriate fixes (and some additional cleanup), I finally have a prompt in single-user mode:

As you can see I’ve switched from the 6.x release to the 7.x release during this work, which is the next part of the story.

Getting UNICOS 6 to boot

The 6.x release still didn’t work. It crashed trying to execute an instruction that’s not valid on a Y-MP. This doesn’t happen in the kernel though; it happens pretty much immediately after creating the first process:

This last instruction doesn’t exist on Y-MPs (only on X-MPs) so my simulator blows up.

As it turns out, this code sequence comes from /etc/init, so we’ve gotten as far as loading that, but apparently not much further. What can be going on though? All the documentation I’ve seen clearly states that these instructions don’t exist on Y-MPs, and for a good reason: with the expansion of the A registers from 24 to 32 bits, these instructions, which load a constant value into those registers, don’t have enough bits anymore to encode all the possible values.

Could it be that this is actually an X-MP version of the OS? No, that’s not possible either, as that would certainly crash the kernel much earlier. Plus, as you can see in the screen-shot up at the beginning, the kernel clearly states that it was compiled for Y-MP.

But, as I’ve said before, this is not the kernel anymore, it’s the first process, /etc/init. Could that be an X-MP binary? Luckily UNICOS binaries contain some info about their target architecture, and parsing that, sure enough, the binary is a 24-bit one. After some digging and guessing, it turns out Y-MPs supported a 24-bit (more or less) X-MP compatible mode. This mode was set on a per-process basis: it’s bit 35 of word 6 of the exchange packet. My simulator doesn’t know anything about that bit as this support was yanked from the J90, but let’s see if the raw exchange packets have that bit set the right way!

Bingo! As we get out of the kernel, the extended address range mode bit (EAM) is set, and when we enter into the process, it’s cleared. So, we’re dealing with a 24-bit executable in a 32-bit OS.
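
Checking that bit in a raw dump is a one-liner, with one catch: documentation conventions differ on whether bit 0 is the least or the most significant bit of the word. The sketch below supports both and treats the choice as an open assumption. It also assumes the exchange package is available as a list of 64-bit integers.

```python
from typing import Sequence

WORD_BITS = 64
EAM_WORD = 6   # word index, from the text
EAM_BIT = 35   # bit index, from the text (numbering convention assumed)


def get_bit(word: int, bit: int, msb0: bool = False) -> int:
    """Return one bit of a 64-bit word, counting from the LSB or the MSB."""
    pos = (WORD_BITS - 1 - bit) if msb0 else bit
    return (word >> pos) & 1


def address_bits(xp: Sequence[int], msb0: bool = False) -> int:
    """32 if the EAM bit is set in the exchange package, else 24."""
    return 32 if get_bit(xp[EAM_WORD], EAM_BIT, msb0) else 24


kernel_xp = [0] * 16
kernel_xp[EAM_WORD] = 1 << 35   # EAM set (LSB-0 numbering)
init_xp = [0] * 16              # EAM clear
```

Here `address_bits(kernel_xp)` gives 32 and `address_bits(init_xp)` gives 24, mirroring the kernel-to-init transition seen in the trace.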

At first this seemed a much bigger headache than it actually turned out to be. You see, the way I’ve built the simulator was that I’ve created a custom type that’s either 24 or 32 bits long depending on the architecture. This type (called CAddr_t) is used literally everywhere and for a good reason: it’s all too easy to make a mistake in sign-extending, wrapping around, or converting to/from 64-bit values. It’s best to keep those details wrapped in a type, with the proper operator overloads, rather than to sprinkle them around the code.

Except now, I need to simulate both address types, simultaneously, depending on a stupid bit set in the exchange packet! That’s not what the C++ type-system was designed to do. Not only that, but even in the real HW, the underlying machine used 32-bit addressing, it’s just the instruction-set that got swapped out. There are a lot of potential bugs here and a lot of porting/debugging before it will start working.

Or so I thought at first. But then I realized: I actually wrapped most of the register accesses into a little helper class and some macros:

The reason I’ve done this is to have the ability to log register changes. This has proved an invaluable tool many, many times, tracking down tricky bugs and behaviors. In this case it can pay big dividends again: since all assignments and accesses are wrapped, the appropriate truncation can be wrapped in this class. Best of all, this class knows about the CPU (mParent field above) and thus has access to the current operating mode (24 or 32-bit addressing). This does not take care of all cases (most importantly, it doesn’t handle sign-extension properly), but it deals with the bulk of the problems.
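
The idea translates to a few lines in any language. The Python sketch below is not the simulator’s C++ helper class. It only illustrates the technique: a register wrapper that logs every change and truncates values according to a mode flag held by its parent CPU. All names are hypothetical.

```python
from typing import List, Tuple


class Cpu:
    """Minimal stand-in for the simulated CPU."""

    def __init__(self) -> None:
        self.extended_addressing = True   # True: 32-bit, False: 24-bit
        self.log: List[Tuple[str, int, int]] = []

    def addr_mask(self) -> int:
        return 0xFFFFFFFF if self.extended_addressing else 0xFFFFFF


class LoggedAddrRegister:
    """Address register that logs writes and truncates to the current mode."""

    def __init__(self, name: str, parent: Cpu) -> None:
        self.name = name
        self.parent = parent   # plays the role of the mParent field
        self._value = 0

    @property
    def value(self) -> int:
        # Truncate on read too, in case the mode changed since the write.
        return self._value & self.parent.addr_mask()

    @value.setter
    def value(self, new: int) -> None:
        new &= self.parent.addr_mask()
        if new != self._value:
            self.parent.log.append((self.name, self._value, new))
        self._value = new


cpu = Cpu()
a7 = LoggedAddrRegister("A7", cpu)
a7.value = 0x123456789          # truncated to 32 bits
wide = a7.value
cpu.extended_addressing = False
narrow = a7.value               # same register, seen in 24-bit mode
```

As the text notes, masking alone does not cover sign-extension. That still has to be handled in the individual instructions that widen an address into a 64-bit register.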

I was also lucky enough that, even though I don’t have a detailed instruction set description for the Y-MPs, I have a summary of the ISA that lists all the important differences between the two operating modes.

So, it turns out it was only a few hours of work to get the simulator modified. Of course not without errors. I found myself in a similar situation as before: init starts, but somehow hangs. What could be the problem this time?

A quick look at the syscall logs reveals that the last program executed was ‘sh’. Enabling logging shows that we’re getting stuck in this loop:

The loop counter is S1 and the terminal count seems to be in S7. What does the trace say about the loop test at the end?

So, S1 was 0x66cbc2 and S7 was 0xffffff. That’s an awfully long loop, since S1 gets incremented by 1 in every iteration. So, how did S7 become that strange value? It was assigned in the ‘S7 +A7’ instruction. A7 was 0xffffff, but this assignment is supposed to sign-extend! So S7 should have become 0xffffffffffffffff, not its current value. Clearly I’ve screwed up the simulation of this instruction. Not that fixing it helps in this particular case: that would just make the loop even longer. But this instruction might have been executed (incorrectly) many times before, so I can’t really trust this execution state anymore. I have to fix the bug and re-run. That’s exactly what I’ve done and (drum-roll):

I have boot!
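
For reference, the sign-extension that ‘S7 +A7’ is expected to perform in 24-bit mode can be written as below. This is a generic two’s-complement sketch, not the simulator’s code. The 24-bit source width and 64-bit destination width follow the values quoted above.

```python
def sign_extend(value: int, from_bits: int, to_bits: int = 64) -> int:
    """Sign-extend a from_bits-wide value to to_bits (two's complement)."""
    value &= (1 << from_bits) - 1          # keep only the source bits
    sign = 1 << (from_bits - 1)            # the source sign bit
    extended = (value ^ sign) - sign       # negative if the sign bit was set
    return extended & ((1 << to_bits) - 1)  # back to an unsigned bit pattern


s7 = sign_extend(0xFFFFFF, 24)
```

Here `s7` comes out as 0xffffffffffffffff, the value the text expects, while a positive value such as 0x7fffff passes through unchanged.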

What’s next?

This is a good place to end this post. I have both UNICOS 6 and 7 booting at least into the minimal root FS in RAM.

The fact that these executables are actually compiled for X-mode (that is, X-MP compatible) brings about a very interesting experiment. You see, I have a partial copy of UNICOS 4.0 for the X-MP. The problem with that has always been that I don’t have a root FS for it. I have the kernel, I have a config file, I even have the content of the /usr partition, but nothing for root. That’s a problem of course, as many of the crucial utilities, like ‘init’ and ‘sh’, are there. Booting without these is next to impossible. But now I have a chance: I can marry the root FS (and its utilities) from UNICOS 6 with the kernel and /usr from UNICOS 4 and get to a more or less functional system. It won’t be perfect (for example, mkfs and fsck won’t work for sure, and some tools that rely on new syscalls or since-fixed kernel bugs are not going to work), but it’s way more than what I have now. Best of all, this, if it works, is for the X-MP, my original project goal. Finally, it’s not inconceivable that UNICOS 6 contains not only X-MP style binaries, but also a toolset to compile such binaries. If that’s the case, I can even attempt generating new binaries for the system.

Exciting times for sure. Here’s to a happy 2017!