Introduction
This piece starts off in one direction and – as is usual for debugging exercises – ends up in a completely different place than where I originally intended to go with it. It is based on the real-time notes I took during the session, though I've abbreviated them here and there. It's a description of the events and my thoughts as they occurred to me: some correct, some very wrong. It shows the meandering way of getting from an initial problem to the identification of a bug and its fix.
If you’re not into this kind of stuff, you can stop now, the rest will just bore you to death.
The starting point
I see I still have you, so let’s dig in!
I had this problem for a long time: after a while (read a day or so) of running the simulator and doing stuff on it, processes started to not terminate. The ‘ps’ command would show an output like this:
  F S UID   PID  PPID C         PRI NI   ADDR    SZ    WCHAN     TTY    TIME CMD
  3 S   0     0     0 0           0 20   2566 28738   255352       ?    0:00 swapper
  1 R   0     1     0 0 -1648634119 10  73264   156                ?    0:30 init
103 R   0     2     0 0         999 21  72670    16   cpu-00       ? 1287:16 idle
  3 R   0     3     0 0         999 21  72710    16                ? 1296:43 idle
 43 S   0     4     0 0          20  0  72730    16  3236361       ?    0:16 smppulse
 43 S   0     5     0 0          24  0  72750    16  3240603       ?    0:00 sfspuls0
 43 S   0     6     0 0          24  0  72770    16  3240604       ?    0:00 sfspuls1
 43 S   0     7     0 0          24  0  73010    16  3240605       ?    0:00 sfspuls2
 43 S   0     8     0 0          24  0  73030    16  3240606       ?    0:00 sfspuls3
 43 S   0     9     0 0          20  0  73050    16  3344315       ?    0:00 clock
 43 S   0    10     0 0          20  0  73070    16  3354737       ?    0:00 utility
  1 S   0   452     1 0          28 20  73520   116  3323137 console    0:00 getty
  0 S   0    61     1 0          26 20    560   108 70314001       ?    0:00 fslogd
  1 S  34   456   455 0          30 20 101270   628   267336    p000    0:01 bash
  1 S   0   106     1 0          26 20  73110   108 40510001       ?    0:00 slogdemo
  1 S   0   462   456 0          30 20 104460   644   270676    p000    0:07 bash
  1 S   0   209     1 0          26 20  75104   144  3363437       ?    0:06 msgdaemo
  1 S   0   215     1 0          39 20  75324   220   230500       ?    0:55 fsdaemon
  1 S   0   317     1 0          26 20  76500   284  3363437       ?    0:01 inetd
  1 S   0   453     1 0          28 20  73754   116  3323217      01    0:00 getty
  1 S   0   454     1 0          28 20  74210   116  3323277      02    0:00 getty
  1 S   0   338     1 0          26 20  76064   204  3363437       ?    0:00 portmap
  1 S   0   428     1 0          26 20  77740   248  3363437       ?    0:00 samdaemo
  1 S   0 18603     1 0          26 20  77134   252  3363437       ?    1:08 prngd
  1 S   0   396     1 0          26 20  77554   116  3363437       ?    0:00 errdemon
  1 S   0   412     1 0          26 20 115010   396  1623174       ?    0:43 cron
101 R   0 29783 28347 0          40 20 103060   640   cpu-01    p001    0:00 ps
  1 S   0   455   317 0          26 20  74374    88  3363437    P000    0:20 rlogind
  1 S   0 28336   317 0          26 20 102730    88  3363437    P001    0:01 rlogind
  1 Z   0 29463     1 0          47 10                                  0:00 <defunct>
  1 S   0 28347 28337 0          30 20 113624   628   305056    p001    0:01 bash
  1 S  34 28337 28336 0          30 20 112440   628   305636    p001    0:01 bash
  1 R   0 19590   462 0 -1871006465 20 105664   196             p000    0:51 sh
See how process 29463 says ‘defunct’ and the state is set to ‘Z’ which stands for zombie?
I decided to finally attack this problem.
Looking at the output of ‘ps’ you can see a few other peculiarities: the parent process of the zombie is ‘init’, and ‘init’ has a very strange priority: -1648634119. So my theory is the following: someone somehow miscalculated (or simply corrupted) the priority value for this process. This results in ‘init’ never getting scheduled, so it can’t collect the zombie process, which thus gets stuck. Also, ‘init’ not running would prevent new ‘login’ tasks from being spawned, so I would not be able to log in any more (existing shells are fine), which was another symptom I’d noticed.
You can see how ‘init’ is in the ‘R’ or run state, but it’s not scheduled for either of the CPUs. Maybe the scheduler just gets really confused by the wrong priority?
The first idea I had was to create a ‘ps’ equivalent in the simulator so I can monitor the process states. Later, I could add triggers that would monitor for the corruption to happen and start tracing instructions and figure out why and who is doing this.
So, I’ve written up the ‘ps’ equivalent (it’s really not that hard, the process table is documented, and finding it in the kernel binary isn’t all that difficult either).
I also needed a lot of processes, thinking the issue manifests itself after ~20k spawns. To that end, I’ve created a shell script that called ‘ls’ in a loop over and over again:
#!/bin/sh
CNT=0
while [ $CNT -lt $1 ]; do
    ls > /dev/null
    let CNT=CNT+1
done
I started my test and after a while, I hit a kernel panic:
bash-2.03# ./proctest 100000
================== KERNEL PANIC ================
PANIC: epackin.c: retrans
================================================
PANIC: epackin.c: retrains
Well, that’s not good, but a different not good. What’s going on?
I don’t have epackin.c, but I have epackin.o (inside io.a) so I could disassemble it and start poking around.
This object file contains a few functions, the main one being ‘epackin’. There are a few others, one in particular is called ‘retrans’, the same as the panic message. (Did you notice the typo in the second message? That’s not me, that’s actually in the file!)
Pretty soon it became obvious that if ‘retrans’ ever gets called, it’ll just panic the kernel.
More investigation revealed that the following code (if executed) would lead down to hell:
  0x0000028D:p0 (0x0000028D:p0) 0074565                 - S5 T53                              |
  0x0000028D:p1 (0x0000028D:p1) 0075602                 - T2 S6                               |
  0x0000028D:p2 (0x0000028D:p2) 0040400:0020000:0000000 - S4 0x0000000000002000               |
  0x0000028E:p1 (0x0000028E:p1) 0044045                 - S0 S4&S5                            |
* 0x0000028E:p2 (0x0000028E:p2) 0014000:0005215         - JSZ 0x000002A3:p1 <BLK @CODE BLK>   | Jump if S0 == 0
  0x0000028F:p0 (0x0000028F:p0) 0003435                 - SM29 1,TS                           | Test and set semaphore SM29
* 0x0000028F:p1 (0x0000028F:p1) 0120600:0000000:0000000 - S6 [0x00000000,] <EXT utp EXT>      |
  0x00000290:p0 (0x00000290:p0) 0040500:0000004:0000000 - S5 0x0000000000000004               |
  0x00000290:p3 (0x00000290:p3) 0060456                 - S4 S5+S6                            |
  0x00000291:p0 (0x00000291:p0) 0042360                 - S3 <16                              | form 16 ones from right
  0x00000291:p1 (0x00000291:p1) 0044234                 - S2 S3&S4                            |
  0x00000291:p2 (0x00000291:p2) 0060125                 - S1 S2+S5                            |
Not surprisingly, the first execution of this kills the system. Problem is: there are several tests that can jump here earlier in the code, and those tests are executed many, many times before the failure.
So how can I debug this? I can’t put breakpoints into the conditions that actually get me here (they are executed millions of times before the bug happens) and nothing that executes afterwards is interesting.
What I would really need is the ability to ‘step back’ in time to see how I got here. Or have a trace of execution *before* the breakpoint.
The trouble is that generating the trace takes a long time, slowing down the emulator to less than 1% of its normal speed. Combine that with the several hours it takes to repro this problem at normal speed, and the picture isn’t pretty. Unless…
Trace buffers
Unless I just store enough bread-crumbs along the way so I can reconstruct the trace of execution after the breakpoint hit. The absolute minimum that’s needed is the value of the program counter. I can store that in a circular buffer (boost happens to have a nice implementation) and then I can re-create the execution flow.
I did just that, but in a slightly more complicated form: I’ve also stored every register change as well. This means that most instructions store two entries in the history buffer (one for a changed register, one for the changed program counter). The structure storing this history is very simple:
struct StateChange_s {
    Reg_e Reg;
    CAddr_t Val2;
    CInt_t Val;
};
Essentially just an enum (Reg_e) as to what changed and the value it changed to. (Val2 is there to record certain peculiarities of the memory segmentation.)
Each entry is 16 bytes, so an average instruction stores only 32 bytes. This shouldn’t take all that long (indeed the slow-down is barely perceptible) and, best of all, the time it takes is constant: it doesn’t depend on the depth of the buffer. I can store – if I so choose – millions of instructions and I only pay for it in memory, not in execution speed.
When the breakpoint hits, I can take this trace buffer, look at the memory and re-create the disassembly (with the assumption that the memory content didn’t change since execution), with most of the side-effects as well.
It’s not as rich as a trace during execution (for example memory read and write addresses are missing) but plenty more than what I had before.
It took a day or so to get the new infrastructure working, but after that I’ve ended up with a few hundred megabytes of logs.
From that, the following picture emerged:
There is an array of buffers in memory, called ‘eiopkts’. This array gets allocated in csl_initIOS() (cslkern.c). The piece of code responsible for the allocation is this (and yes, I do happen to have source code for this file):
hdr = (word *)z_alloc(1);       /* header of 'e pkt' tables */
*hdr = *(word *)eiop_hdr;
eiopkts = (GetMachineInfo(mi_nioc) + 2) * NEPKTSC;
if (((DDEVCT + HDDEVCT) * PKTS_PER_DEV) > eiopkts) {
    eiopkts = (DDEVCT + HDDEVCT) * PKTS_PER_DEV;
}
pktlow = (eiopkts / 4) + 20;    /* low water mark */
eiopack = (struct eiopack *) z_alloc(eiopkts * sizeof(struct eiopack));
epackend = eiopack + eiopkts;
#ifdef CRAYJ90
/* pattern the empty packets */
len = btow(eiopkts * sizeof(struct eiopack));
realpkt = (long*)eiopack;
for (i = 0; i < len; i++) {
    realpkt[i] = EPAK_EMPTY;
}
#endif
As you can see, the allocated array gets initialized by a special pattern, EPAK_EMPTY, which is defined to be 0xEAEBECEDEDECEBEA.
The trace shows that this special value got picked up from memory:
64591613574 CPU0: ------- exec 0x0012CD79:p1 (0x0012CD79:p1) S7 [0x00000000,A5] |
64591613574 CPU0:          S7 <== 0xEAEBECEDEDECEBEA (16927884131045272554)
64591613574 CPU0: ------- exec 0x0012CD7A:p0 (0x0012CD7A:p0) S7 S7 < 6 |
64591613574 CPU0:          S7 <== 0xBAFB3B7B7B3AFA80 (13473428111743449728)
64591613574 CPU0: ------- exec 0x0012CD7A:p1 (0x0012CD7A:p1) S7 S7 > 54 |
64591613574 CPU0:          S7 <== 0x00000000000002EB (747)
64591613574 CPU0: ------- exec 0x0012CD7A:p2 (0x0012CD7A:p2) S6 0!S7 | S6 0 or S7
So, it’s fairly obvious that the code is picking up data from an uninitialized buffer. That’s an important clue that I’ll have to follow up on. But before that: which piece of data is it trying to access?
The shifts above extract a 10-bit field starting 6 bits from the top of the 64-bit word. It’s at offset 0 from wherever A5 is pointing to.
The original array is constructed of the following structs:
struct eiopack {
    struct eiopack *epak_next;  /* queue pointer */
    uint unused :32,
        :8,
        epak_redo:6,            /* flag for immediate rewrite */
        epak_use :6,            /* packet use */
        epak_ios :6,            /* ios number */
        epak_iop :6;            /* iop number */
    time_t epak_start;          /* time i/o started */
    time_t epak_end;            /* time i/o completed */
    struct Ep_t pkt;            /* packet (Epacket struct) */
};
Where Ep_t is defined as follows:
struct Ep_t {
    struct Epacket Epacket;
    uint Ep_type :8,                /* Packet type */
        :24,
        :32;
    uint Ep_data [EPAK_MAXLEN-2];   /* Variable length packet */
    uint Ep_trailer;                /* Repeat header for validation*/
};
And finally the Epacket struct is the following:
struct Epacket {
    uint Ep_magic:6;    /* Magic number for verification */
    uint Ep_length:10;  /* Length (words) of packet */
    uint Ep_kernid:3;   /* Owning kernel's id + 1 */
    uint Ep_source:4;   /* Location of code which sent packet */
#ifdef CRAYELS
    uint Ep_cluster:4;  /* IOS cluster */
    uint Ep_proc:5;     /* Location of code to process request */
#else
    uint Ep_cluster:5;  /* IOS cluster */
    uint Ep_proc:4;     /* Location of code to process request */
#endif /* CRAYELS */
    uint Ep_flags:8;    /* Processing options */
    uint Ep_lpath:8;    /* Logical path of code to process req */
    uint Ep_seq:8;      /* Packet sequence number to validate */
    uint Ep_ackseq:8;   /* Seq number of last valid packet */
};
Now it starts to become clear: the code tries to extract the Ep_length field from the (presumably received) packet. However, it picks up an uninitialized packet, where the length is just garbage. Later on, it tries to compare this value against the maximum packet length (69) and of course that check fails. And it’s not surprising that the kernel panics when it sees an overly long packet arrive from the IO subsystem.
So, this is clear, but of course it raises the next question: why did it pick up this particular buffer for parsing?
My current theory is that it has to do with spurious interrupts. If for some reason an interrupt gets fired at the wrong moment, the CPU would pick up a buffer that has not been written yet. This theory is all the more plausible because the IOP that’s doing the IO and the CPU are executed on different threads, so a subtle synchronization problem can result in such behavior.
I expect the chain of events to be something like this:
- Code picks up a buffer from the free list
- Preps it for receiving a transmission
- Mistakenly thinks it *received* a transmission
- Tries to parse the transmission and fails.
Looking at the code traces quickly showed that no spurious interrupt got fired. To be more precise, no spurious interrupts were handled.
The way UNICOS (and probably many other OS-es) handles interrupts is that once it’s in kernel mode, it checks for interrupt sources and handles them in a loop until it runs out of things to handle. Then, and only then, does it return to user-land (on the Cray, interrupts are masked in kernel mode) so that more interrupts can fire.
In this particular instance, the chain of events I extracted from the traces is the following:
- Kernel gets interrupted because of an IOI (IO interrupt). The kernel checks that the interrupt is indeed an IOI and retrieves the interrupting IO channel number. (Channels are unidirectional communication interfaces between the IO subsystem and the mainframe. Two are used: channel #16 for IO-to-mainframe communication, while #17 is for mainframe-to-IO transfers.)
- The kernel realizes that it’s channel #16 causing the interrupt, which means an input packet. It clears the interrupt (due to the loop mentioned above, I can even see it in the trace that the interrupt is indeed cleared) and calls the appropriate handler.
- The handler – which is epackin – takes a new buffer from the free list and preps it for reception.
- The handler parses the received packet and succeeds with it.
- Then the handler returns to the interrupt checking loop, which shows another interrupt pending.
- The kernel – again – realizes that it’s channel #16 causing the interrupt and calls the handler right back.
- The handler preps yet another buffer for reception.
- The handler parses the received packet, or rather attempts to do so and fails.
So yeah, the spurious interrupt theory seems more and more believable. But I need more instrumentation to figure out what’s going on.
I started putting more and more logging into the interrupt setting and clearing logic and every time I did that the behavior changed dramatically – another signature of threading issues. For a while the repro took longer and longer (4-6 hours), but then I got lucky: I started getting failures right on the first few input packets. That’s great, because it allows quick turn-around testing.
I started wondering if the channel interface logic (that actually writes the data) is broken. There is a channel address (CA) that gets set by SW to prime a buffer for reception. Then, the HW (well, simulated HW in my case) keeps incrementing the CA register and writing data until it either runs out of data to transmit or it reaches the end of the buffer (pointed to by the channel limit (CL) register). At that point, the channel ‘disconnects’ which causes the interrupt to get fired.
Luckily, the common interrupt handler code happens to read the CA register (though it doesn’t do anything with it) so it’s pretty obvious that the channel actually didn’t do any transfers. But then why did the interrupt fire?
There are (essentially) two places where the channel interrupt can fire:
void Channel_i::SetData(CInt_t aData) {
    CRAY_ASSERT(IsActive());
    CRAY_ASSERT(mAddress != mLimit);
    GetLogger() << setloglevel(LogLevel_IoTrace)
                << "set " << HexPrinter(mAddress)
                << " value " << HexPrinter(aData)
                << " - " << AsciiDumpPrinter(aData) << std::endl;
    MemAccess<CInt_t>(mAddress) = SwapBytes(aData);
    ++mAddress;
    if (mAddress == mLimit) {
        GetLogger() << setloglevel(LogLevel_IoTrace) << "Going inactive" << std::endl;
        SetInterruptPending(true);
        SetActive(false);
    }
}
void Channel_i::Disconnect() {
    GetLogger() << setloglevel(LogLevel_IoTrace) << "disconnecting" << std::endl;
    if (mAddress == mStart) {
        GetLogger() << setloglevel(LogLevel_IoActivity)
                    << "Disconnecting without sending any data. mAddress: " << HexPrinter(mAddress)
                    << " mStart: " << HexPrinter(mStart)
                    << " mLimit: " << HexPrinter(mLimit) << std::endl;
    }
    SetActive(false);
    SetInterruptPending(true);
}
One spot corresponds to when we bump into the end of the buffer (that’s in SetData), the other is when we run out of stuff to send (that’s in Disconnect).
It seems the only way to set the interrupt incorrectly is if we get a ‘Disconnect’ call without sending any data. So, I’ve added the test (mAddress == mStart) into the call to trigger if that happens; you can see it above in the listing.
Well, it didn’t fire, yet the repro kept happening.
So that’s clearly not it. What else could it be?
The code that calls these functions in our case is the following:
void SimIopCluster_c::ChannelIToC_c::ChannelTick() {
    if (!mPacketValid) {
        if (mPacketQueue.empty()) return;
        SendPacket(mPacketQueue.front());
        mPacketQueue.pop();
        return;
    }
    while (IsActive()) {
        CInt_t Data = mPacket[mPacketIdx++];
        SetData(Data);
        if (mPacketIdx == mPacket.size()) {
            Disconnect();
            mPacketValid = false;
        }
    }
}
Fairly simple code, it takes a packet (if there’s any) from the queue that’s ready for transmission and then word-by-word feeds it into the channel. When the full packet is sent, it disconnects the channel, which triggers an interrupt.
Some more logging revealed the following sequence:
47413489 IOP0.I->C_16: Going inactive
47413489 IOP0.I->C_16: went inactive
47413489 CPU0: CI (ChIdx = 16)
47413489 IOP0.I->C_16: disconnecting
This log is hard to read without detailed knowledge of the source, so let me help:
The first line indicates that a transfer got completed because the buffer got full. It’s coming from the IOP thread.
The second line is coming from a callback function that’s part of the IOP implementation. It doesn’t add much information to what we know here already.
The third line however is coming from the CPU, indicating that it is checking which IO channel has a pending interrupt (and that it recognized that it’s channel #16). Crucially, this particular line is coming from a different thread (remember, IOP and CPU are running on separate threads).
The fourth line is again coming from the IOP, indicating that ‘Disconnect’ was called.
Now it finally starts to make sense!
What if the input buffer is exactly the same size as the packet to be sent?
Well, the code in ChannelTick will take the packet and start spoon-feeding it into the channel. When the last word is sent, the buffer gets full, and the code in SetData sets the interrupt.
Then, when SetData returns, the code in ChannelTick realizes that it’s finished with the packet and calls Disconnect, which sets the interrupt again.
If we happen to be unlucky enough to check (and clear) the pending interrupt in between the last SetData and Disconnect calls, the interrupt fires twice.
This in turn will make the kernel think that the just-prepped buffer is complete, even though no data transfer has happened. It will take the buffer, try to parse it, and fail.
So, how to fix this?
First, SetData should never set the interrupt. Second, every call-site of SetData has to make sure to call Disconnect at the end. Fairly simple, considering there are only a couple of call-sites for these methods.
Back to the top
An overnight run verified that the issue doesn’t happen anymore. It also reproed my original issue:
  F S UID   PID  PPID C       PRI NI   ADDR    SZ    WCHAN     TTY   TIME CMD
  3 S   0     0     0 0         0 20   2566 28738   255352       ?   0:01 swapper
  1 R   0     1     0 0 -21765753 10  75454   156                ?   0:44 init
  3 R   0     2     0 0       999 21  72670    16                ? 568:01 idle
103 R   0     3     0 0       999 21  72710    16   cpu-01       ? 572:53 idle
 43 S   0     4     0 0        20  0  72730    16  3236361       ?   0:32 smppulse
 43 S   0     5     0 0        24  0  72750    16  3240603       ?   0:00 sfspuls0
 43 S   0     6     0 0        24  0  72770    16  3240604       ?   0:00 sfspuls1
 43 S   0     7     0 0        24  0  73010    16  3240605       ?   0:00 sfspuls2
 43 S   0     8     0 0        24  0  73030    16  3240606       ?   0:00 sfspuls3
 43 S   0     9     0 0        20  0  73050    16  3344315       ?   0:00 clock
 43 S   0    10     0 0        20  0  73070    16  3354737       ?   0:00 utility
  1 S   0   454     1 0        30 20 107544   616   265776 console   0:06 bash
  0 S   0    61     1 0        26 20    560   108 70314001       ?   0:00 fslogd
  1 S   0   106     1 0        26 20  73110   108 40510001       ?   0:00 slogdemo
  1 S   0   455     1 0        28 20  77574   116  3323217      01   0:00 getty
  1 S   0   209     1 0        26 20  74700   144  3363437       ?   0:15 msgdaemo
  1 S   0   215     1 0        39 20  75120   220   230500       ?   2:18 fsdaemon
  1 S   0   317     1 0        26 20  76530   284  3363437       ?   0:02 inetd
  1 S   0   388     1 0        26 20 100404   280 60457420       ?   0:00 nfsd
  1 S   0   338     1 0        26 20  76114   204  3363437       ?   0:02 portmap
  1 S   0   395   388 0        26 20 101700   280 60457420       ?   0:00 nfsd
  1 S   0   430     1 0        26 20 105354   248  3363437       ?   0:00 samdaemo
  1 S   0   394   388 0        26 20 101250   280 60457420       ?   0:00 nfsd
  1 S   0   397   388 0        26 20 102330   280 60457420       ?   0:00 nfsd
  1 S   0   456     1 0        28 20 100030   116  3323277      02   0:01 getty
  1 S   0   399     1 0        26 20  74470   116  3363437       ?   0:00 errdemon
  1 S   0   420     1 0        26 20 113160   380  1622622       ?   1:49 cron
  1 S   0   824     1 0        26 20  77360   136  3425146       ?   0:00 biod
  1 Z   0 78364     1 0       980 10                                 0:00 <defunct>
101 R   0 75214   454 0       980 20  73264   640   cpu-00 console   0:00 ps
  1 S   0   417     1 0        26 20 104050   336  3363437       ?   0:00 mountd
  1 S   0   825     1 0        26 20 101034   136  3425146       ?   0:00 biod
  1 S   0   826     1 0        26 20 102760   136  3425146       ?   0:00 biod
  1 S   0   827     1 0        26 20 103170   136  3425146       ?   0:00 biod
  1 S   0   836     1 0        26 20 106140   276  3363437       ?   0:00 pcnfsd
  1 S   0   844     1 0        26 20 106564   216  3363437       ?   0:00 statd
  1 S   0   853     1 0        26 20 112414   356  3363437       ?   0:01 lockd
So, now I can look in my simulator-side ps equivalent and see what the table looks like:
┌──────────────────────────────────────── Console ────────────────────────────────────────┐
ps
ENTRY NAME              PID STATUS                  PRI NICE          UTIME          STIME
----- ---------------- ------ ------ -------- -------- -------------- --------------
    0 sched                 0 SLEEP                     0   20    0:00:00.000    0:00:00.680
    1 init                  1 RUN    18446744073687785863   10    0:00:02.087    0:00:41.665
    2 idle                  2 ONPROC                  999   21    0:00:00.000    9:25:47.363
    3 idle                  3 RUN                     999   21    0:00:00.000    9:30:38.043
    4 smppulse              4 SLEEP                    20    0    0:00:00.000    0:00:32.140
    5 sfspuls0              5 SLEEP                    24    0    0:00:00.000    0:00:00.000
    6 sfspuls1              6 SLEEP                    24    0    0:00:00.000    0:00:00.000
    7 sfspuls2              7 SLEEP                    24    0    0:00:00.000    0:00:00.000
    8 sfspuls3              8 SLEEP                    24    0    0:00:00.000    0:00:00.000
    9 clock                 9 SLEEP                    20    0    0:00:00.000    0:00:00.000
   10 utility              10 SLEEP                    20    0    0:00:00.000    0:00:00.000
   11 bash                454 SLEEP                    30   20    0:00:04.106    0:00:01.420
   12 fslogd               61 SLEEP                    26   20    0:00:00.006    0:00:00.011
   14 slogdemon           106 SLEEP                    26   20    0:00:00.004    0:00:00.033
   15 ls                72966 ONPROC                  751   20    0:00:00.015    0:00:00.047
   16 getty               455 SLEEP                    28   20    0:00:00.177    0:00:00.279
   17 msgdaemon           209 SLEEP                    26   20    0:00:03.241    0:00:11.835
   18 fsdaemon            215 SLEEP                    39   20    0:01:20.180    0:00:56.861
   19 inetd               317 SLEEP                    26   20    0:00:00.969    0:00:01.072
   20 nfsd                388 SLEEP                    26   20    0:00:00.082    0:00:00.135
   22 portmap             338 SLEEP                    26   20    0:00:01.235    0:00:00.646
   24 nfsd                395 SLEEP                    26   20    0:00:00.005    0:00:00.007
   25 samdaemon           430 SLEEP                    26   20    0:00:00.006    0:00:00.010
   26 nfsd                394 SLEEP                    26   20    0:00:00.006    0:00:00.007
   27 nfsd                397 SLEEP                    26   20    0:00:00.006    0:00:00.006
   28 getty               456 SLEEP                    28   20    0:00:00.171    0:00:00.335
   29 errdemon            399 SLEEP                    26   20    0:00:00.056    0:00:00.048
   30 cron                420 SLEEP                    26   20    0:00:32.836    0:01:15.596
   31 biod                824 SLEEP                    26   20    0:00:00.006    0:00:00.010
   32 ldsync            78364 ZOMB                    980   10    0:00:00.003    0:00:00.032
   33 sh                  467 SLEEP                    30   20    1:04:27.454    1:42:55.811
   34 mountd              417 SLEEP                    26   20    0:00:00.090    0:00:00.110
   35 biod                825 SLEEP                    26   20    0:00:00.007    0:00:00.011
   36 biod                826 SLEEP                    26   20    0:00:00.007    0:00:00.014
   37 biod                827 SLEEP                    26   20    0:00:00.008    0:00:00.013
   38 pcnfsd              836 SLEEP                    26   20    0:00:00.032    0:00:00.047
   39 statd               844 SLEEP                    26   20    0:00:00.119    0:00:00.167
   40 lockd               853 SLEEP                    26   20    0:00:00.264    0:00:00.325
└─────────────────────────────────────────────────────────────────────────────────────────┘
Great, I have verification that the in-memory process table is corrupted for some reason. The next step is to figure out why.
What I’ve done was to modify the simulator to monitor every write into the process table and trigger when an invalid priority value gets written. This is actually easier said than done, as the memory access functions didn’t quite support this kind of surgery, at least initially. After some fiddling, however, I got a trigger and a trace:
exec 0x000F3B68:p3 (0x000F3B68:p3) A7 [0x0000005A,A6]
    A7 <== 0x0000000000F064EC (15754476)
exec 0x000F3B69:p2 (0x000F3B69:p2) S3 [0x0000002F,A6]
    S3 <== 0x013746B14EC15E47 (87616445101465159)
exec 0x000F3B6A:p1 (0x000F3B6A:p1) S2 [0x00000008,A6]
    S2 <== 0x0000000000168052 (1474642)
exec 0x000F3B6B:p0 (0x000F3B6B:p0) S6 [0x0000001A,A2]
    S6 <== 0x0000000108B19156 (4440822102)
exec 0x000F3B6B:p3 (0x000F3B6B:p3) S5 [0x0000001D,A2]
    S5 <== 0x0000000007C55D5B (130375003)
exec 0x000F3B6C:p2 (0x000F3B6C:p2) S1 [0x0000003E,A2]
    S1 <== 0xFFFFFFFF8460D811 (18446744071635523601)
exec 0x000F3B6D:p1 (0x000F3B6D:p1) A4 [0x00000059,A6]
    A4 <== 0x00000000000B0A15 (723477)
exec 0x000F3B6E:p0 (0x000F3B6E:p0) S4 RT
    S4 <== 0x013746B14EC2003C (87616445101506620)
exec 0x000F3B6E:p1 (0x000F3B6E:p1) [0x0000002F,A6] S4
exec 0x000F3B6F:p0 (0x000F3B6F:p0) [0x00000000,A7] S4
exec 0x000F3B6F:p3 (0x000F3B6F:p3) S3 S4-S3
    S3 <== 0x000000000000A1F5 (41461)
exec 0x000F3B70:p0 (0x000F3B70:p0) S2 S2+S3
    S2 <== 0x0000000000172247 (1516103)
exec 0x000F3B70:p1 (0x000F3B70:p1) S6 S6+S3
    S6 <== 0x0000000108B2334B (4440863563)
exec 0x000F3B70:p2 (0x000F3B70:p2) [0x00000001,A7] S5
exec 0x000F3B71:p1 (0x000F3B71:p1) [0x00000010,A2] S1
Of course here you can’t see how the corrupted value (in S1, highlighted) gets computed, but at least the code clearly attempts to write the priority field. So it’s not a rogue pointer, it’s a computation error. As it turns out, this piece of code is one of the few cases where I have source as well, so, here’s the original assembly for the same section:
A7      W@PWUTMS,A6     A7 <= user-space utms struct addr
S3      W@RTSAV,A6      S3 <= RT of last PWS update
S2      UNIXTIM,A6      S2 <= pw_unixt
S6      W@PSTIME,A2     S6 <= p_stime
S5      W@PCTIME,A2     S5 <= p_ctime
S1      W@PUPRI,A2      S1 <= p_upri
A4      W@PWMTMS,A6     A4 <= user-space mtms struct addr
S4      RT
W@RTSAV,A6      S4
W@UTMRT,A7      S4      utms_update = RT
S3      S4-S3           S3 <= delta system time
S2      S2+S3           S2 <= incremented pw_unixt
S6      S6+S3           S6 <= incremented p_stime
W@UTMUTM,A7     S5
W@PPRI,A2       S1      p_pri = p_upri
W@CPRI,A6       S1      curpri = p_pri
W@PSTIME,A2     S6
UNIXTIM,A6      S2
After some more digging, I’ve chanced upon the place where the invalid value is computed:
exec 0x00175314:p3 (0x00175314:p3) A7 B1
    A7 <== 0x0000000001FF57C7 (33511367)
exec 0x00175315:p0 (0x00175315:p0) A2 [0x00000001,A7]
    A2 <== 0x0000000000015D27 (89383)
exec 0x00175315:p3 (0x00175315:p3) A5 [0x00000008,A2]
    A5 <== 0x0000000000015D9E (89502)
exec 0x00175316:p2 (0x00175316:p2) A6 [0x00000012,A2]
    A6 <== 0x000000000000000A (10)                      <===== this is p_nice
exec 0x00175317:p1 (0x00175317:p1) A4 [0x0000003E,A2]
    A4 <== 0x0000000000000028 (40)                      <===== this is p_upri
exec 0x00175318:p0 (0x00175318:p0) S5 [0x0000003D,A2]
    S5 <== 0x3FED9FF43C9CD3EF (1.1917505403981579e-06)  <===== this is p_sharepri, a float!
exec 0x00175318:p3 (0x00175318:p3) S7 [0x00000000,A5]
    S7 <== 0x0000000000000001 (1)
exec 0x00175319:p2 (0x00175319:p2) S4 [0x000ACB8D,]     <===== this is shconsts.sc_highshpri
    S4 <== 0x7FF8A1624EBB3D44 (0.0024625246253338617)   <===== this is an invalid float!!!!
exec 0x0017531A:p1 (0x0017531A:p1) S6 0x0000000000008000
    S6 <== 0x0000000000008000 (32768)
exec 0x0017531B:p0 (0x0017531B:p0) S3 +A6               <===== p_nice
    S3 <== 0x000000000000000A (10)
exec 0x0017531B:p1 (0x0017531B:p1) S2 +A4               <===== p_upri
    S2 <== 0x0000000000000028 (40)
exec 0x0017531B:p2 (0x0017531B:p2) S0 S6&S7
    S0 <== 0x0000000000000000 (0)
exec 0x0017531B:p3 (0x0017531B:p3) T0 S2
    T0 <== 0x0000000000000028 (40)
exec 0x0017531C:p0 (0x0017531C:p0) A3 A2
    A3 <== 0x0000000000015D27 (89383)
It’s not obvious here, but S4 contains an invalid floating point value: exponents must be in the 02000 to 06000 range, and here we have 07777 as the exponent, clearly an overflow.
This overflow screws up some integer math later on and results in the process priority being corrupted.
It wasn’t hard to find the piece of code that computes this value; it was much harder to find the first time this corruption happens: it takes several hours and millions of passes through the same code before an invalid value shows up.
Nevertheless, I was on the right track: even though I feel pretty confident about my floating point implementation by now, there could be corner-cases I’m not handling well. In particular, under- and overflows are tricky and I’ve ignored them for a long time.
Luckily just recently I had a conversation with Chris Fenton about a curious difference between his (Verilog) implementation of the Cray floating point math and mine. The discussion boiled down to how to handle underflows in floating-point adds and I still had that in my head.
I went back and re-read the relevant chapters in the processor manuals and it turns out, I indeed didn’t handle underflows correctly: the processor simply zeroed out the result in case of an underflow, where my implementation just gave an invalid result.
Now, the above invalid float is actually an overflow, not an underflow, but it is still a lead. Especially considering that there are reciprocal calculations in nearby code, an underflow can get converted into an overflow. The troubling part is of course that an underflow corrected to 0 would result in a divide-by-zero once we get to the reciprocal part. Still, a bug is a bug, so let’s fix it; it’s possible that there are special-case checks in place to avoid the divide-by-zero problem.
Now changing floating-point code in the simulator is risky because it’s not heavily used and due to representational differences (these Cray machines used a non-IEEE compliant float format) spotting minute differences is hard. Luckily there’s one test program in UNICOS that exercises many of the floating-point functions so there’s at least some coverage.
Epilog
I’ve made the change, ran the simulator and tada! (well, tada after waiting three days) The bug disappeared: no zombie processes any more, and the OS has stayed up without issues for more than a week now.
The moral of the story? Multi-threaded development is hard, especially if you try to morph single-threaded code (written 5+ years ago) into a multi-threaded environment.
Floating point math is also hard, and so is precisely implementing every nuance of a several-hundred-page manual.
Finally, I’ve also learned that this trace feature I’ve just added is insanely useful; I should have built it a long time ago.
I’m sure it’ll be indispensable in the future too and soon I won’t be able to imagine how I tracked down any bugs without it before.
With a stable simulator, I can now tackle the next problem: ssh shell access. You see, while the simulator is online and you can ssh into it, it’s a bit of a hack at the moment: you actually ssh into the host running the simulator which in turn uses rlogin to get you in to UNICOS. It works, but just barely: it doesn’t give you scp access and even more importantly it doesn’t give you X11 forwarding, so you’re limited to a command line.
Getting sshd running on UNICOS is going to be my next project – and the next post, when I get around to writing it up.