A debug session

Introduction

This piece starts off in one direction and – as usual for debug exercises – ends up in a completely different place from where I originally intended to go with it. It is based on the real-time notes I took during the session, though I've abbreviated them here and there. It's a description of the events and my thoughts as they occurred to me: some correct, some very wrong. It shows the meandering way of getting from an initial problem to the identification of a bug and its fix.

If you’re not into this kind of stuff, you can stop now; the rest will just bore you to death.

The starting point

I see I still have you, so let’s dig in!

I had this problem for a long time: after a while (read: a day or so) of running the simulator and doing stuff on it, processes started to not terminate. The ‘ps’ command would show output like this:

See how process 29463 says ‘defunct’ and the state is set to ‘Z’ which stands for zombie?

I decided to finally attack this problem.

Looking at the output of ‘ps’ you can see a few other peculiarities: the parent process of the zombie is ‘init’. And ‘init’ has a very strange priority: -1648634119. So my theory is the following: someone somehow miscalculated (or simply corrupted) the priority value for this process. This results in ‘init’ never getting scheduled, so it can’t collect the zombie process, which thus gets stuck. Also: ‘init’ not running would prevent new ‘login’ tasks from being spawned, so I would not be able to log in any more (existing shells are fine), which was another symptom I’d noticed.

You can see how ‘init’ is in the ‘R’ or run state, but it’s not scheduled for either of the CPUs. Maybe the scheduler just gets really confused by the wrong priority?

The first idea I had was to create a ‘ps’ equivalent in the simulator so I can monitor the process states. Later, I could add triggers that would watch for the corruption to happen, then start tracing instructions to figure out who is doing this and why.

So, I’ve written up the ‘ps’ equivalent (it’s really not that hard, the process table is documented, and finding it in the kernel binary isn’t all that difficult either).
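To give a flavor of what that looks like, here is a minimal sketch of the idea: just walk the table in simulated memory and print the interesting fields. Every address, offset and field position below is a made-up placeholder; the real ones come from the documented process table layout and from where the table sits in the kernel binary.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Stand-in for the simulator's memory access; the real thing reads a
    // 64-bit word of simulated memory.
    static uint64_t ReadWord(uint64_t /*wordAddr*/) { return 0; }

    static const uint64_t kProcTableBase  = 0100000; // placeholder word address
    static const uint64_t kProcEntryWords = 32;      // placeholder entry size
    static const size_t   kMaxProcs       = 512;     // placeholder slot count

    void DumpProcTable() {
        for (size_t i = 0; i < kMaxProcs; ++i) {
            uint64_t base  = kProcTableBase + i * kProcEntryWords;
            uint64_t pid   = ReadWord(base + 0);          // placeholder field offsets
            uint64_t state = ReadWord(base + 1);
            int64_t  prio  = (int64_t)ReadWord(base + 2);
            if (pid == 0)
                continue; // empty slot
            printf("pid=%5llu state=%llu prio=%lld\n",
                   (unsigned long long)pid, (unsigned long long)state,
                   (long long)prio);
        }
    }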

I also needed a lot of processes, since I figured the issue manifests itself only after ~20k spawns. To that end, I created a shell script that called ‘ls’ in a loop over and over again:

I started my test and after a while, I hit a kernel panic:

Well, that’s not good, but a different not good. What’s going on?

I don’t have epackin.c, but I have epackin.o (inside io.a) so I could disassemble it and start poking around.

This object file contains a few functions, the main one being ‘epackin’. There are a few others, one in particular is called ‘retrans’, the same as the panic message. (Did you notice the typo in the second message? That’s not me, that’s actually in the file!)

Pretty soon it became obvious that if ‘retrans’ ever gets called, it’ll just panic the kernel.

More investigation revealed that the following code (if executed) would lead down to hell:

Not surprisingly, the first execution of this kills the system. Problem is: there are several tests that can jump here earlier in the code and those tests are executed many, many times before the failure.

So how can I debug this? I can’t put breakpoints into the conditions that actually get me here (they are executed millions of times before the bug happens) and nothing that executes afterwards is interesting.

What I would really need is the ability to ‘step back’ in time to see how I got here. Or have a trace of execution *before* the breakpoint.

The trouble is that generating the trace takes a long time, slowing down the emulator to less than 1% of its normal speed. Combine that with the several hours it takes to repro this problem at normal speed, and the picture isn’t pretty. Unless…

Trace buffers

Unless I just store enough bread-crumbs along the way so I can reconstruct the trace of execution after the breakpoint hit. The absolute minimum that’s needed is the value of the program counter. I can store that in a circular buffer (boost happens to have a nice implementation) and then I can re-create the execution flow.

I did just that, only a little more complicated: I also stored every register change. This means that most instructions store two entries in the history buffer (one for a changed register, one for the changed program counter). The structure storing this history is very simple:

Essentially just an enum (Reg_e) as to what changed and the value it changed to. (Val2 is there to record certain peculiarities of the memory segmentation.)

Each entry is 16 bytes, so an average instruction stores only 32 bytes. This shouldn’t take all that long (indeed the slow-down is not very perceptible) and, best of all, the time it takes is constant: it doesn’t depend on the depth of the buffer. I can store – if I so choose – millions of instructions and I only pay for it in memory, not in execution speed.
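For illustration, here is roughly what such a history entry and buffer could look like. Reg_e and Val2 are the names mentioned above; the rest of the naming (and the buffer depth) is my own guess:

    #include <cstdint>
    #include <boost/circular_buffer.hpp>

    // One breadcrumb: which register (or the PC) changed, and what it changed to.
    enum Reg_e : uint32_t { Reg_PC, Reg_A0, /* ... */ Reg_S7 };

    struct HistoryEntry_s {
        Reg_e    Reg;   // what changed
        uint32_t Val2;  // extra info for the memory-segmentation peculiarities
        uint64_t Val;   // the value it changed to
    };                  // 16 bytes per entry

    // A ring buffer deep enough for a few million instructions. Old entries are
    // overwritten automatically, so the cost per instruction stays constant.
    static boost::circular_buffer<HistoryEntry_s> History(4 * 1000 * 1000);

    // Called from the instruction execution path for every state change.
    inline void RecordChange(Reg_e reg, uint64_t val, uint32_t val2 = 0) {
        History.push_back(HistoryEntry_s{reg, val2, val});
    }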

When the breakpoint hits, I can take this trace buffer, look at the memory and re-create the disassembly (with the assumption that the memory content didn’t change since execution), along with most of the side-effects as well.

It’s not as rich as a trace during execution (for example memory read and write addresses are missing) but plenty more than what I had before.

It took a day or so to get the new infrastructure working, but after that I ended up with a few hundred megabytes of logs.

From that, the following picture emerged:

There is an array of buffers in memory, called ‘eiopkts’. This array gets allocated in csl_initIOS() (cslkern.c). The piece of code responsible for the allocation is this (and yes, I do happen to have source code for this file):

As you can see, the allocated array gets initialized by a special pattern, EPAK_EMPTY, which is defined to be 0xEAEBECEDEDECEBEA.

The trace shows that this special value got picked up from memory:

So, it’s fairly obvious that the code is picking up data from an uninitialized buffer. That’s an important clue that I’ll have to follow up on. But before that: which piece of data is it trying to access?

The shifts above extract 10 bits of info, starting 6 bits in from the left of the 64-bit word. The word itself is at offset 0 from wherever A5 is pointing.

The original array is constructed of the following structs:

Where Ep_t is defined as follows:

And finally the Epacket struct is the following:

Now it starts to become clear: the code tries to extract the Ep_length field from the (presumably received) packet. However, it picks up an uninitialized packet, where the length is just garbage. Later on, it tries to compare this value with the maximum packet length (69) and of course the check fails. And it’s not surprising that the kernel panics if it thinks it received an overly long packet from the IO subsystem.
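As a quick illustration (the field position is my reading of the disassembly, and kMaxEpLength just mirrors the 69 quoted above):

    #include <cstdint>
    #include <cstdio>

    static const uint64_t EPAK_EMPTY   = 0xEAEBECEDEDECEBEAULL;
    static const unsigned kMaxEpLength = 69;

    // Pull the 10-bit length field out of the first packet word, assuming it
    // sits 6 bits in from the most significant end of the 64-bit word.
    static unsigned ExtractEpLength(uint64_t firstWord) {
        return (unsigned)((firstWord >> (64 - 6 - 10)) & 0x3FF);
    }

    int main() {
        // An uninitialized buffer still holds the fill pattern, so the "length"
        // is garbage and the sanity check fails -- the path that ends in retrans().
        unsigned len = ExtractEpLength(EPAK_EMPTY);
        printf("Ep_length = %u -> %s\n", len,
               len > kMaxEpLength ? "too long, panic" : "ok");
        return 0;
    }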

So, this is clear, but of course it begs the next question: why did it pick up this particular buffer for parsing?

My current theory is that it has to do with spurious interrupts. If for some reason an interrupt gets fired at the wrong moment, the CPU would pick up a buffer that has not been written yet. This theory is all the more plausible because the IOP that’s doing the IO and the CPU are executed on different threads, so a subtle synchronization problem can result in such behavior.

I expect the chain of events to be something like this:

  1. Code picks up a buffer from the free list
  2. Preps it for receiving a transmission
  3. Mistakenly thinks it *received* a transmission
  4. Tries to parse the transmission and fails.

Looking at the code traces quickly showed that no spurious interrupt got fired. To be more precise, no spurious interrupts were handled.

The way UNICOS (and probably many other OS-es) handles interrupts is that once it’s in kernel mode, it checks for interrupt sources and handles them in a loop until it runs out of things to handle. Then, and only then, does it return to user-land (on the Cray, interrupts are masked in kernel mode) so that more interrupts can fire.
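Roughly, the shape of that loop is something like this (this is just the structure as I understand it, not the actual UNICOS code; all the helper names are hypothetical):

    static const int kNoInterrupt = -1;

    int  CheckPendingInterrupts();    // returns a pending source, or kNoInterrupt
    void ClearInterrupt(int source);  // acknowledge the interrupt
    void DispatchHandler(int source); // e.g. ends up in epackin for channel #16
    void ReturnToUserLand();          // interrupts get unmasked only here

    void KernelInterruptEntry() {
        // Interrupts stay masked for the whole time we're in kernel mode, so
        // the loop has to drain every pending source before returning.
        for (;;) {
            int source = CheckPendingInterrupts();
            if (source == kNoInterrupt)
                break;
            ClearInterrupt(source);   // clear first...
            DispatchHandler(source);  // ...then handle (may uncover a new pending source)
        }
        ReturnToUserLand();
    }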

In this particular instance, the chain of events I extracted from the traces is the following:

  1. Kernel gets interrupted because of an IOI (IO interrupt). The kernel checks that the interrupt is indeed an IOI and retrieves the interrupting IO channel number. (These are unidirectional communication interfaces between the IO subsystem and the mainframe. Two channels are used: channel #16 for IO-to-mainframe communication and #17 for mainframe-to-IO transfers.)
  2. The kernel realizes that it’s channel #16 causing the interrupt, which means an input packet. It clears the interrupt (due to the loop mentioned above, I can even see it in the trace that the interrupt is indeed cleared) and calls the appropriate handler.
  3. The handler – which is epackin – takes a new buffer from the free list and preps it for reception.
  4. The handler parses the received packet and succeeds with it.
  5. Then the handler returns to the interrupt checking loop, which shows another interrupt pending.
  6. The kernel – again – realizes that it’s channel #16 causing the interrupt and calls the handler right back.
  7. The handler preps yet another buffer for reception.
  8. The handler parses the received packet, or rather attempts to do so and fails.

So yeah, the spurious interrupt theory seems more and more believable. But I need more instrumentation to figure out what’s going on.

I started putting more and more logging into the interrupt setting and clearing logic and every time I did that the behavior changed dramatically – another signature of threading issues. For a while the repro took longer and longer (4-6 hours), but then I got lucky: I started getting failures right on the first few input packets. That’s great, because it allows quick turn-around testing.

I started wondering if the channel interface logic (that actually writes the data) is broken. There is a channel address (CA) that gets set by SW to prime a buffer for reception. Then, the HW (well, simulated HW in my case) keeps incrementing the CA register and writing data until it either runs out of data to transmit or it reaches the end of the buffer (pointed to by the channel limit (CL) register). At that point, the channel ‘disconnects’ which causes the interrupt to get fired.

Luckily, the common interrupt handler code happens to read the CA register (though it doesn’t do anything with it) so it’s pretty obvious that the channel actually didn’t do any transfers. But then why did the interrupt fire?

There are (essentially) two places where the channel interrupt can fire:

One spot corresponds to when we bump into the end of the buffer (that’s in SetData), the other is when we run out of stuff to send (that’s in Disconnect).

It seems the only way to set the interrupt incorrectly is if we get a ‘Disconnect’ call without sending any data. So, I’ve added the test (mAddress == mStart) into the call to trigger if that happens; you can see it above in the listing.
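To make the two spots concrete, here is roughly how I picture that part of the channel model, with the debug check added. SetData, Disconnect, mAddress and mStart are the names that actually appear in the simulator; the rest is my own scaffolding and the real code certainly differs in the details:

    #include <cassert>
    #include <cstdint>

    class Channel {
    public:
        // SW primes a buffer for reception: CA (current address) and CL (limit).
        void Prime(uint64_t ca, uint64_t cl) { mStart = mAddress = ca; mLimit = cl; }

        // Called by the IOP for every word it delivers.
        void SetData(uint64_t word) {
            Write(mAddress++, word);
            if (mAddress == mLimit)   // bumped into the end of the buffer...
                RaiseInterrupt();     // ...interrupt spot #1
        }

        // Called by the IOP when it runs out of data to send.
        void Disconnect() {
            // The added debug check: trigger if we disconnect without having
            // transferred any data at all.
            assert(mAddress != mStart && "Disconnect without any data transfer");
            RaiseInterrupt();         // ...interrupt spot #2
        }

    private:
        void Write(uint64_t /*addr*/, uint64_t /*word*/) { /* simulated memory write */ }
        void RaiseInterrupt() { /* set the channel interrupt flag */ }
        uint64_t mStart = 0, mAddress = 0, mLimit = 0;
    };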

Well, it didn’t fire, yet the repro kept happening.

So that’s clearly not it. What else could it be?

The code that calls these functions in our case is the following:

Fairly simple code, it takes a packet (if there’s any) from the queue that’s ready for transmission and then word-by-word feeds it into the channel. When the full packet is sent, it disconnects the channel, which triggers an interrupt.
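In terms of the previous sketch, ChannelTick does something along these lines (again my reconstruction; the packet queue is stand-in scaffolding):

    #include <cstdint>
    #include <deque>
    #include <vector>

    struct Packet { std::vector<uint64_t> words; };

    class Iop {
    public:
        // Called periodically from the IOP thread; 'chan' is the Channel from
        // the sketch above.
        void ChannelTick(Channel &chan) {
            if (mTxQueue.empty())
                return;                       // nothing ready for transmission
            Packet &pkt = mTxQueue.front();

            // Feed the packet into the channel word by word. If the buffer is
            // exactly as big as the packet, SetData raises the interrupt on the
            // very last word.
            for (uint64_t w : pkt.words)
                chan.SetData(w);

            // Done with the packet: disconnect, which raises the interrupt (again).
            chan.Disconnect();
            mTxQueue.pop_front();
        }
    private:
        std::deque<Packet> mTxQueue;
    };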

Some more logging revealed the following sequence:

This log is hard to read without detailed knowledge of the source, so let me help:

The first line indicates that a transfer got completed because the buffer got full. It’s coming from the IOP thread.

The second line is coming from a callback function that’s part of the IOP implementation. It doesn’t add much information to what we know here already.

The third line however is coming from the CPU, indicating that it is checking which IO channel has a pending interrupt (and that it recognized that it’s channel #16). Crucially, this particular line is coming from a different thread (remember, IOP and CPU are running on separate threads).

The fourth line is again coming from the IOP, indicating that ‘Disconnect’ was called.

Now it finally starts to make sense!

What if the input buffer is exactly the same size as the packet to be sent?

Well, the code in ChannelTick will take the packet and start spoon-feeding it into the channel. When the last word is sent, the buffer gets full, and the code in SetData sets the interrupt.

Then, when SetData returns, the code in ChannelTick realizes that it’s finished with the packet and calls Disconnect, which sets the interrupt again.

If we happen to be unlucky enough to check (and clear) the pending interrupt in between the last SetData and Disconnect calls, the interrupt fires twice.

This in turn will make the kernel think that the just-prepped buffer is complete, even though no data transfer has happened. It will take the buffer, try to parse it and fail.

So, how to fix this?

First, SetData should never set the interrupt. Second, every call-site of SetData will have to make sure to call Disconnect at the end. Fairly simple, considering there are only a couple of call-sites for these methods.
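Expressed against the earlier sketch, the fix looks something like this (my reconstruction, not the actual diff):

    #include <cstdint>

    class Channel {
    public:
        void Prime(uint64_t ca, uint64_t cl) { mStart = mAddress = ca; mLimit = cl; }

        // Returns true while there is still room in the buffer; never interrupts.
        bool SetData(uint64_t word) {
            Write(mAddress++, word);
            return mAddress != mLimit;
        }

        // The one and only place where the channel disconnects and interrupts.
        // Every call-site (ChannelTick included) calls this exactly once at the
        // end of a transfer, whether it stopped because the buffer filled up or
        // because it ran out of data.
        void Disconnect() { RaiseInterrupt(); }

    private:
        void Write(uint64_t /*addr*/, uint64_t /*word*/) { /* simulated memory write */ }
        void RaiseInterrupt() { /* set the channel interrupt flag */ }
        uint64_t mStart = 0, mAddress = 0, mLimit = 0;
    };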

Back to the top

An overnight run verified that the issue doesn’t happen anymore. It also reproed my original issue:

So, now I can look in my simulator-side ps equivalent and see what the table looks like:

Great, I have verification that the in-memory process table is corrupted for some reason. The next step is to figure out why.

What I did was modify the simulator to monitor every write into the process table and trigger when an invalid priority value gets written. This is actually easier said than done, as the memory access functions didn’t quite support this kind of surgery, at least initially. After some fiddling, however, I got a trigger and a trace:
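Before we get to the trace itself, the watch I bolted on is conceptually simple – something along these lines, with all names, addresses and bounds being placeholders:

    #include <cstdint>
    #include <cstdio>

    static const uint64_t kProcTableStart  = 0100000;  // placeholder word address
    static const uint64_t kProcTableEnd    = 0110000;  // placeholder word address
    static const int64_t  kMaxSanePriority = 1000;     // placeholder sanity bound

    // Stand-in for hooking into the trace/history machinery described earlier.
    static void StartInstructionTrace() { /* dump/annotate the history buffer */ }

    // Every memory write in the simulator funnels through a hook like this.
    void OnMemoryWrite(uint64_t wordAddr, uint64_t value) {
        if (wordAddr < kProcTableStart || wordAddr >= kProcTableEnd)
            return;                     // not in the process table
        int64_t prio = (int64_t)value;  // (check for the priority field offset omitted)
        if (prio < -kMaxSanePriority || prio > kMaxSanePriority) {
            printf("bogus priority %lld written to %016llx\n",
                   (long long)prio, (unsigned long long)wordAddr);
            StartInstructionTrace();
        }
    }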

Of course here you can’t see how the corrupted value (in S1, highlighted) gets computed, but at least the code clearly attempts to write the priority field. So it’s not a rogue pointer; it’s a computation error. As it turns out, this piece of code is one of the few cases where I have the source as well, so here’s the original assembly for the same section:

After some more digging, I’ve chanced upon the place where the invalid value is computed:

It’s not easy to see here, but S4 contains an invalid floating point value: exponents must be in the 02000 to 06000 range, and here we have 07777 as the exponent, clearly an overflow.

This overflow screws up some integer math later on and results in the process priority being corrupted.

It wasn’t hard to find the piece of code that computes this value; it was much harder to find the first time this corruption happens: it takes several hours and millions of passes through the same code before an invalid value shows up.

Nevertheless, I was on the right track: even though I feel pretty confident about my floating point implementation right now, there could be corner-cases I’m not handling well. In particular, under- and overflows are tricky and I’ve ignored them for a long time.

Luckily just recently I had a conversation with Chris Fenton about a curious difference between his (Verilog) implementation of the Cray floating point math and mine. The discussion boiled down to how to handle underflows in floating-point adds and I still had that in my head.

I went back and re-read the relevant chapters in the processor manuals and it turns out I indeed didn’t handle underflows correctly: the processor simply zeroed out the result in case of an underflow, whereas my implementation just gave an invalid result.

Now, the above invalid float is actually an overflow, not an underflow, but it is still a lead, especially considering that there are reciprocal calculations in nearby code, so an underflow can get converted into an overflow. The troubling part is of course that an underflow corrected to 0 would result in a divide-by-zero once we get to the reciprocal part. Still, a bug is a bug, so let’s fix it; it’s possible that there are special-case checks in place to avoid the divide-by-zero problem.
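The change itself is small. As a sketch (not the actual simulator code): assume the add path has already normalized the result and holds the exponent as a plain integer, and that kMinExponent stands in for the smallest representable exponent in the Cray format:

    #include <cstdint>

    static const int kMinExponent = 1;   // placeholder for the real lower bound

    // Stands in for whatever routine assembles sign/exponent/mantissa back into
    // a 64-bit Cray float word.
    uint64_t PackCrayFloat(int sign, int exponent, uint64_t mantissa);

    uint64_t FinishFloatAdd(int sign, int exponent, uint64_t mantissa) {
        if (exponent < kMinExponent)
            return 0;   // underflow: the real machine simply zeroes out the result
        return PackCrayFloat(sign, exponent, mantissa);
    }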

Now, changing floating-point code in the simulator is risky because it’s not heavily used, and due to representational differences (these Cray machines used a non-IEEE-compliant float format), spotting minute differences is hard. Luckily there’s one test program in UNICOS that exercises many of the floating-point functions, so there’s at least some coverage.

Epilog

I made the change, ran the simulator and tada! (Well, tada after waiting three days.) The bug disappeared: no zombie processes any more, and the OS has stayed up without issues for more than a week now.

The moral of the story? Multi-threaded development is hard, especially if you try to morph single-threaded code (written 5+ years ago) into a multi-threaded environment.

Floating point math is also hard, and so is precisely implementing every nuance of a several-hundred-page manual.

Finally, I’ve also learned that this trace feature I’ve just added is insanely useful and that I should have added it a long time ago.

I’m sure it’ll be indispensable in the future too and soon I won’t be able to imagine how I tracked down any bugs without it before.

With a stable simulator, I can now tackle the next problem: ssh shell access. You see, while the simulator is online and you can ssh into it, it’s a bit of a hack at the moment: you actually ssh into the host running the simulator, which in turn uses rlogin to get you into UNICOS. It works, but just barely: it doesn’t give you scp access and, even more importantly, it doesn’t give you X11 forwarding, so you’re limited to a command line.

Getting sshd running on UNICOS is going to be my next project – and the next post, when I get around to writing it up.
