Needle in the hay-stack

For the inpatient

I’ve fund the bug that caused the COS install process to halt with the fatal. Now it progresses a bit further and stops with a different error message. Turns out in my rush to get the simulator up and running I just stubbed out some floating-point operations, which got called here and resulted in invalid calculations. Lesson learned: ASSERTs are good!

A debug session

I thought I’ll write down how I’ve found this bug as it might be educational in how to debug a system like this. I’ll start the story from the point where I’ve convinced myself that the block-multiplexer channels have nothing to do with the failure.

What I’ve noticed was that while reads happened from the first hard drive (DKA), no writes attempts happened. Still the install script contained a directive to write the disk label. Also, no reads were attempted from any other disk drivers.

So my theory was that the error happens somewhere in the code between the point where it processes the information received from the drive and the where it creates and writes the disk label.

But how do I get close to this piece of code? The problem is that the event (the reading of the sector) happens on a peripheral attached to the DIOP, yet the effect of it – the failure – happens in the code running on the main CPU.

Turning on instruction tracing for the whole boot process is not really a nice option: it generates gigabytes of logs, which not only takes forever to generate, it would be a giant hay-stack to search through.

Event points

I wanted some more ‘clever’. The solution I came up with was events and event-points. When certain things happen in the simulator an event gets fired. This event is essentially a string, describing what have just happened. Sector such-and-such was read, instruction X got executed, etc.

Then, you would have event points, which have the same actions as trace-points, but would trigger on certain event strings, instead of instruction fetch addresses. Matching the event points with events happens using regular expressions, making this mechanism rather flexible.

Of course all this takes (run) time, so there better be not too many events flying around.

Events are global, that is it can be fired from any part of the system and captured by any other part of it. So, I can have the disk drive fire an event on a sector read, which is captured by an event point on the CPU, and turn on instruction tracing.

This worked remarkably well in getting me close to the point of failure. And it ‘only’ generated 178MB of logs.

First hints

The error message I got was this:

There are at least some string to go after. So, let’s see where does the error string show up! String-handling in the Cray was awkward to say the least: since they didn’t have byte-level addressing, strings are loaded into registers 8 character at a time. It’s hard to say, where the 8 character boundaries in the above string could be, but let’s try searching for ‘***’ which is 0x2a2a2a in hex.

2187414 CPU0: XA:0x48 exec 0x017839:p2 A5 A5-1
2187414 CPU0:                state-change A5 <== 0x31 (49)
2187414 CPU0: XA:0x48 exec 0x017839:p3 JAN 0x017837:p1
2187414 CPU0: ================================ JUMP =========================
2187414 CPU0: XA:0x48 exec 0x017837:p1 S2 0!S1
2187414 CPU0:                state-change S2 <== 0x100040030 (4295229488)
2187414 CPU0: XA:0x48 exec 0x017837:p2 S1 0x000001,A6
2187414 CPU0:                mem read 0x0344B2 (0x0382B2) value: 0x20202020202A2A2A
2187414 CPU0:                state-change S1 <== 0x20202020202A2A2A (2314885530819111466)
2187414 CPU0: XA:0x48 exec 0x017838:p0 0x3FFFFE,A3 S2
2187414 CPU0:                mem write 0x3FBD97 (0x3FFB97) value: 0x0000000100040030
2187414 CPU0: XA:0x48 exec 0x017838:p2 A6 A6+1
2187414 CPU0:                state-change A6 <== 0x344B2 (214194)
2187414 CPU0: XA:0x48 exec 0x017838:p3 A3 A3-1
2187414 CPU0:                state-change A3 <== 0x3FBD98 (4177304)
2187414 CPU0: XA:0x48 exec 0x017839:p0 A0 A7-1
2187414 CPU0:                state-change A0 <== 0x4 (4)
2187414 CPU0: XA:0x48 exec 0x017839:p1 A7 A7-1
2187414 CPU0:                state-change A7 <== 0x4 (4)
2187414 CPU0: XA:0x48 exec 0x017839:p2 A5 A5-1
2187414 CPU0:                state-change A5 <== 0x30 (48)
2187414 CPU0: XA:0x48 exec 0x017839:p3 JAN 0x017837:p1
2187414 CPU0: ================================ JUMP =========================
2187414 CPU0: XA:0x48 exec 0x017837:p1 S2 0!S1
2187414 CPU0:                state-change S2 <== 0x20202020202A2A2A (2314885530819111466)
2187414 CPU0: XA:0x48 exec 0x017837:p2 (0x017837:p2) S1 0x000001,A6
2187414 CPU0:                mem read 0x0344B3 (0x0382B3) value: 0x2A2A202020204641
2187414 CPU0:                state-change S1 <== 0x2A2A202020204641 (3038276220964849217)
2187414 CPU0: XA:0x48 exec 0x017838:p0 0x3FFFFE,A3 S2
2187414 CPU0:                mem write 0x3FBD96 (0x3FFB96) value: 0x20202020202A2A2A
2187414 CPU0: XA:0x48 exec 0x017838:p2 A6 A6+1

 

Bingo! If you look closer to the code (and it’s probably a bit easier in the disassembly) you’ll see that this is simply a loop copying a string from one location to another:

 

0x017836:p0 – A3 A2
0x017836:p1 – A0 A7-A5
0x017836:p2 – JAM 0x017837:p1                | Jump if A0 < 0
0x017837:p0 – A7 A5
0x017837:p1 – S2 0!S1                        | S2 0 or S1
0x017837:p2 – S1 [0x000001,A6]
0x017838:p0 – [0x3FFFFE,A3] S2
0x017838:p2 – A6 A6+1
0x017838:p3 – A3 A3-1
0x017839:p0 – A0 A7-1
0x017839:p1 – A7 A7-1
0x017839:p2 – A5 A5-1
0x017839:p3 – JAN 0x017837:p1                | Jump if A0 != 0

 

Here you see the source string pointer being (sort of) in A6, the destination in A3. The loop counter is in A7. You can see part of the loop setup code (loading of A3).

OK, so this piece of code copies the string, but it didn’t create the string nor did it decide whether to print it or not. At the same time, this was the first match for the string in the execution trace. That means that no one assembles the string, it must be a constant that’s stored somewhere in memory.That in turns means two things: one, this code copies the string form it’s constant location to a buffer (possibly for parameter substitution for formatted print), and two, that the decision happens elsewhere. It’s also very likely that routine is rather generic, in the sense that it doesn’t exclusively deal with printing this particular error message. Chances are, the decision to print the error happened somewhere higher-up in the call-chain. So let’s see what that chain looks like!

The beginning of this routine is at address  0x0177FF:p0. It is called from 0x0176BA:p0, which is in a routing starting at 0x017664:p0. This routine is called from 0x033E29:p3 in our case, from a routine starting at 0x033E1E:p0. It is called from 0x034423:p1. Now, let’s stop here for a minute, because this routine is interesting!

Normally a sub-routine returns to the caller with a J B0 instruction. If you look at the code around this call, you see something unusual:

 

0x034422:p1 – A1 0x0344AD
0x034422:p3 – A2 0x0344EA
0x034423:p1 – R 0x033E1E:p0                  @ Report error message
0x034423:p3 – A0 [0x0344FB,]
0x034424:p1 – A1 [0x0344FC,]
0x034424:p3 – A2 [0x0344FD,]
0x034425:p1 – A3 [0x0344FE,]
0x034425:p3 – A4 [0x0344FF,]
0x034426:p1 – A5 [0x034500,]
0x034426:p3 – A6 [0x034501,]
0x034427:p1 – A7 [0x034502,]
0x034427:p3 – S0 [0x034503,]
0x034428:p1 – S1 [0x034504,]
0x034428:p3 – S2 [0x034505,]
0x034429:p1 – S3 [0x034506,]
0x034429:p3 – S4 [0x034507,]
0x03442A:p1 – S5 [0x034508,]
0x03442A:p3 – S6 [0x034509,]
0x03442B:p1 – S7 [0x03450A,]
0x03442B:p3 – J 0x03442B:p3

 

The last instruction is an endless loop. So, we should never get here. Most likely we report a fatal error above, then just stay here and freeze the CPU or at least the process. If that analysis is correct, that means that the we’ve finally reached the main error-reporting routine. The caller of this guy must be the site of the error, or at least the site where we determined that something went awfully wrong, so wrong that we should halt the process.

The beginning of the routine is at 0x0343EB:p0 and the call-site is at 0x02FD89:p3. The code around there looks like this:

 

0x02FD86:p0 – S7 >1                          | form 1 ones from left
0x02FD86:p1 – S6 S6!S7                       | S6 S6 or S7
0x02FD86:p2 – [0x000403,A3] S6
0x02FD87:p0 – R 0x030FCB:p0
0x02FD87:p2 – R 0x03123E:p0
0x02FD88:p0 – JSZ 0x02FD8A:p1                @ If S0 != 0, call FATAL below
0x02FD88:p2 – B56 A7
0x02FD88:p3 – A7 0x000000
0x02FD89:p1 – B57 A0
0x02FD89:p2 – A0 B0
0x02FD89:p3 – R 0x0343EB:p0                  @ Report FATAL (never returns)
0x02FD8A:p1 – S1 [0x000008,A1]               @ Continue processing, if no FATAL
0x02FD8A:p3 – S2 >1                          | form 1 ones from left

 

What you see here is that we call a routine at R 0x03123E:p0. When it returns, we examine the content of S0, and if it’s non-0, we continue execution, otherwise jump to 0x02FD8A:p1. Right after the call to 0x0343EB:p0! If we didn’t jump, we simply set up a few register values and call 0x0343EB:p0, the supposed fatal error report routine.

This is very encouraging! It is exactly what you would expect: we do something, that goes wrong, detect it, and call the fatal error handler if the test fails.

I’ve also noticed something interesting: the location of the call is 0x02FD89:p3. If you print this number in octal, you get 02773047. This was the number that was reported in the error message as the value of P (the instruction pointer) at the time of the error. In retrospect, I could have started at this spot, the error message told me all along! So what about B0, also reported in the error message? Notice, how it’s value is saved into A0 just before the call? B0 is the ‘link’ register and it contains the return address for calls. So it’s value must be the return address of the last call, in our case 0x02FD88:p0. Converting it to octal: 02773040. Again, this is what was reported in the error message.

So, next time I get an error message, like this, I can decode it: the location reported in P is the place of the fatal error handler call, and the location in B0 is the return address of the last subroutine call, essentially pinpointing the where the last subroutine call happened. Good to know, would have been better to know earlier…

Moving on though, the interesting things must happen inside the routine at 0x03123E:p0. It returns a non-0 value in S0, where it clearly should return 0. Why?

A smoking gun

A quick look in the trace shows that the actual failure happens inside a function call to R 0x031161:p0. This routine contains the following code-fragment:

 

0x03116F:p2 – S2 [0x000001,A1]               @ We AIA from DRT table
0x031170:p0 – S3 <16
0x031170:p1 – S0 S2&S3
0x031170:p2 – JSZ 0x03117D:p2                | Jump if S0 == 0
0x031171:p0 – S1 [0x03115F,]
0x031171:p2 – S2 <6                          | form 6 ones from right
0x031171:p3 – S2 S1&S2
0x031172:p0 – A2 S2
0x031172:p1 – S2 -1
0x031172:p2 – S2 0,S2 > A2
0x031172:p3 – S1 S1 > 6
0x031173:p0 – A2 S1
0x031173:p1 – A2 A1+A2
0x031173:p2 – S1 S1 < 6
0x031173:p3 – A1 S1
0x031174:p0 – S0 S1-S4
0x031174:p1 – JSP 0x03117D:p2                | Jump if S0 >= 0
0x031174:p3 – S3 [0x000004,A2]
0x031175:p1 – A2 A2+1
0x031175:p2 – S3 #S3&S2                      | S3 not(S3) and (S2)
0x031175:p3 – S2 -1
0x031176:p0 – A3 ZS3                         | A3 = leading zero count of S3
0x031176:p1 – A1 A1+A3
0x031176:p2 – S1 A1                          | Assign with no sign-extension
0x031176:p3 – S0 0!S3                        | S0 0 or S3
0x031177:p0 – JSZ 0x031174:p0                | Jump if S0 == 0
0x031177:p2 – S0 S1-S4
0x031177:p3 – JSP 0x03117D:p2                | Jump if S0 >= 0
0x031178:p1 – S0 0
0x031178:p2 – [0x031186,] A1
0x031179:p0 – [0x031187,] S0
0x031179:p2 – A2 [0x031181,]
0x03117A:p0 – B0 A2
0x03117A:p1 – A2 [0x03117F,]
0x03117A:p3 – A3 [0x031180,]
0x03117B:p1 – S1 [0x031182,]
0x03117B:p3 – S2 [0x031183,]
0x03117C:p1 – S3 [0x031184,]
0x03117C:p3 – S4 [0x031185,]
0x03117D:p1 – J B0
0x03117D:p2 – S0 1
0x03117D:p3 – J 0x031178:p2

 

What you don’t see here is that register A1 is loaded with the address of the DRT table.

This table essentially contains a bit-map of ‘allocation units’. Each allocation unit corresponds to a track on the hard drive and shows if an allocation unit is available or not. Before the bit-map the table contains a few fields, one of which is called AIA and holds the total number of available allocation units. It is in the lower 16 bits of the first offset of the DRT table.

So what happens here is that we look at the AIA field and test if it’s 0. It is for us (as you can tell from the trace), so the jump at 0x031170:p2 happens and we quickly terminate the routine with S0 set to 1.

Hold on a sec!

So how do I know about the DRT table and AIA field? Well, I have received a very valuable piece of documentation from a gentlemen, called Brian the other day: SM-0045, the Cray COS Table Descriptions Internal Reference Manual. It is exactly what it claims to be: detailed description of most if not all tables and data-structures used by COS.

Another piece of the puzzle is the way the binary images for these early systems were laid out: they often put the symbol names in front of the structures themselves. The reason? Well, these systems were designed to be debugged using memory-dumps. So, having a way to easily find certain pieces of data in a memory dump was important.

Or a cold case…

Well, we might have a gun, but it’s not quite smoking. I knew what the problem was: the AIA field is incorrect. I didn’t know why, in fact I didn’t even know who supposed to set it.

Yet, there was another very curious fact I’ve noticed: the size of the bitmap in the DRT table (in the SZ field) was also set to 0. Why would the system think that it didn’t need any bitmaps at all?

If you look in the memory dump of an uninitialized system at the address where the DRT table is, you’ll find it is empty. If you look at the same spot after the crash, you’ll see it’s filled in. Consequently, this is a dynamically generated table, so there must be a routine that writes these incorrect values. But where?

I figured this routine is probably called fairly early during the boot, so maybe logging every instruction on the main CPU would not be too big of a burden. I started a new simulation, generated a huge trace file, and started digging. I at least had something to search for: a write to address 0x015A79 or 0x015A7A, where the SZ field was stored.

It didn’t take terribly long to find the following in the trace:

 

1872331 CPU0: XA:0x48 exec 0x030D28:p2 A0 A2
1872331 CPU0:                state-change A0 <== 0x0 (0)
1872331 CPU0: XA:0x48 exec 0x030D28:p3 JAN 0x030D27:p2
1872331 CPU0: —————————– NO JUMP ————————-
1872331 CPU0: XA:0x48 exec 0x030D29:p1 S6 0!S1
1872331 CPU0:                state-change S6 <== 0x4D442D312D323041 (5567624728571686977)
1872331 CPU0: XA:0x48 exec 0x030D29:p2 [0x000000,A6] S1
1872331 CPU0:                mem write 0x015A78 (0x019878) value: 0x4D442D312D323041
1872331 CPU0: XA:0x48 exec 0x030D2A:p0 S1 [0x000008,A7]
1872331 CPU0:                mem read 0x00E886 (0x012686) value: 0x8C40880000015A78
1872331 CPU0:                state-change S1 <== 0x8C40880000015A78 (10106227097400859256)
1872331 CPU0: XA:0x48 exec 0x030D2A:p2 S1 S1 < 14
1872331 CPU0:                state-change S1 <== 0x22000000569E0000 (2449958198742745088)
1872331 CPU0: XA:0x48 exec 0x030D2A:p3 S1 S1 > 60
1872331 CPU0:                state-change S1 <== 0x2 (2)
1872331 CPU0: XA:0x48 exec 0x030D2B:p0 A1 S1
1872331 CPU0:                state-change A1 <== 0x2 (2)
1872331 CPU0: XA:0x48 exec 0x030D2B:p1 A4 [0x030E6B,A1]
1872331 CPU0:                mem read 0x030E6D (0x034C6D) value: 0x0000000000030E73
1872331 CPU0:                state-change A4 <== 0x30E73 (200307)
1872331 CPU0: XA:0x48 exec 0x030D2B:p3 S1 [0x000008,A7]
1872331 CPU0:                mem read 0x00E886 (0x012686) value: 0x8C40880000015A78
1872331 CPU0:                state-change S1 <== 0x8C40880000015A78 (10106227097400859256)
1872331 CPU0: XA:0x48 exec 0x030D2C:p1 S1 S1 < 18
1872331 CPU0:                state-change S1 <== 0x2000000569E00000 (2305843032464818176)
1872331 CPU0: XA:0x48 exec 0x030D2C:p2 S1 S1 > 60
1872331 CPU0:                state-change S1 <== 0x2 (2)
1872331 CPU0: XA:0x48 exec 0x030D2C:p3 A1 S1
1872331 CPU0:                state-change A1 <== 0x2 (2)
1872331 CPU0: XA:0x48 exec 0x030D2D:p0 A4 A4+A1
1872331 CPU0:                state-change A4 <== 0x30E75 (200309)
1872331 CPU0: XA:0x48 exec 0x030D2D:p1 A4 [0x000000,A4]
1872331 CPU0:                mem read 0x030E75 (0x034C75) value: 0x00000000000242AC
1872331 CPU0:                state-change A4 <== 0x242AC (148140)
1872331 CPU0: XA:0x48 exec 0x030D2D:p3 S5 +FA4                        | Assign as unnormalized float
1872331 CPU0:                state-change S5 <== 0x40300000000242AC (4625196817309647532)
1872331 CPU0: XA:0x48 exec 0x030D2E:p0 S5 FS5                         | Normalize S5
1872331 CPU0:                state-change S5 <== 0x401290AB00000000 (4616911632121987072)
1872331 CPU0: XA:0x48 exec 0x030D2E:p1 S6 [0x00000A,A7]
1872331 CPU0:                mem read 0x00E888 (0x012688) value: 0x03370A2012000000
1872331 CPU0:                state-change S6 <== 0x3370A2012000000 (231665038690091008)
1872331 CPU0: XA:0x48 exec 0x030D2E:p3 S7 >13                         | form 13 ones from left
1872331 CPU0:                state-change S7 <== 0xFFF8000000000000 (18444492273895866368)
1872331 CPU0: XA:0x48 exec 0x030D2F:p0 S7 S7 > 27
1872331 CPU0:                state-change S7 <== 0x1FFF000000 (137422176256)
1872331 CPU0: XA:0x48 exec 0x030D2F:p1 S1 S6&S7
1872331 CPU0:                state-change S1 <== 0x12000000 (301989888)
1872331 CPU0: XA:0x48 exec 0x030D2F:p2 S1 S1 > 24
1872331 CPU0:                state-change S1 <== 0x12 (18)
1872331 CPU0: XA:0x48 exec 0x030D2F:p3 S6 [0x000001,A6]
1872331 CPU0:                mem read 0x015A79 (0x019879) value: 0x0000000000000000
1872331 CPU0:                state-change S6 <== 0x0 (0)
1872331 CPU0: XA:0x48 exec 0x030D30:p1 S7 >16                         | form 16 ones from left
1872331 CPU0:                state-change S7 <== 0xFFFF000000000000 (18446462598732840960)
1872331 CPU0: XA:0x48 exec 0x030D30:p2 S7 S7 > 16
1872331 CPU0:                state-change S7 <== 0xFFFF00000000 (281470681743360)
1872331 CPU0: XA:0x48 exec 0x030D30:p3 S1 S1 < 32
1872331 CPU0:                state-change S1 <== 0x1200000000 (77309411328)
1872331 CPU0: XA:0x48 exec 0x030D31:p0 S6 S1!S6&S7                    | S6 bit-wise S7==1 ? S1:S6
1872331 CPU0:                state-change S6 <== 0x1200000000 (77309411328)
1872331 CPU0: XA:0x48 exec 0x030D31:p1 S1 S1 > 32
1872331 CPU0:                state-change S1 <== 0x12 (18)
1872331 CPU0: XA:0x48 exec 0x030D31:p2 [0x000001,A6] S6
1872331 CPU0:                mem write 0x015A79 (0x019879) value: 0x0000001200000000
1872331 CPU0: XA:0x48 exec 0x030D32:p0 A4 S1
1872331 CPU0:                state-change A4 <== 0x12 (18)
1872331 CPU0: XA:0x48 exec 0x030D32:p1 S3 +FA4                        | Assign as unnormalized float
1872331 CPU0:                state-change S3 <== 0x4030000000000012 (4625196817309499410)
1872331 CPU0: XA:0x48 exec 0x030D32:p2 S6 0.6
1872331 CPU0:                state-change S6 <== 0x4030C00000000000 (4625407923542032384)
1872331 CPU0: XA:0x48 exec 0x030D32:p3 S3 FS3                         | Normalize S3
1872331 CPU0:                state-change S3 <== 0x4005900000000000 (4613251722985340928)
1872331 CPU0: XA:0x48 exec 0x030D33:p0 S4 /HS3                        | Reciproc
1872331 CPU0:                state-change S4 <== 0x3FFCE38E38E38E3C (4610810318499843644)
1872331 CPU0: XA:0x48 exec 0x030D33:p1 S5 S5*HS4                      | Half-precision floating point multiplication
1872331 CPU0:                state-change S5 <== 0x800F740000000000 (9227721704854257664)
1872331 CPU0: XA:0x48 exec 0x030D33:p2 S6 S5+FS6                      | floating point addition
1872331 CPU0:                state-change S6 <== 0x4030C00000000000 (4625407923542032384)
1872331 CPU0: XA:0x48 exec 0x030D33:p3 A4 S6
1872331 CPU0:                state-change A4 <== 0x0 (0)
1872331 CPU0: XA:0x48 exec 0x030D34:p0 S1 A4
1872331 CPU0:                state-change S1 <== 0x0 (0)
1872331 CPU0: XA:0x48 exec 0x030D34:p1 S3 0!S1
1872331 CPU0:                state-change S3 <== 0x0 (0)
1872331 CPU0: XA:0x48 exec 0x030D34:p2 S2 0x0000000000000040
1872331 CPU0:                state-change S2 <== 0x40 (64)
1872331 CPU0: XA:0x48 exec 0x030D35:p0 S1 S1+S2
1872331 CPU0:                state-change S1 <== 0x40 (64)
1872331 CPU0: XA:0x48 exec 0x030D35:p1 S1 S1 > 6
1872331 CPU0:                state-change S1 <== 0x1 (1)
1872331 CPU0: XA:0x48 exec 0x030D35:p2 S6 [0x000002,A6]
1872331 CPU0:                mem read 0x015A7A (0x01987A) value: 0x0000000000000000
1872331 CPU0:                state-change S6 <== 0x0 (0)
1872331 CPU0: XA:0x48 exec 0x030D36:p0 S7 <24                         | form 24 ones from right
1872331 CPU0:                state-change S7 <== 0xFFFFFF (16777215)
1872331 CPU0: XA:0x48 exec 0x030D36:p1 S6 S1!S6&S7                    | S6 bit-wise S7==1 ? S1:S6
1872331 CPU0:                state-change S6 <== 0x1 (1)
1872331 CPU0: XA:0x48 exec 0x030D36:p2 [0x000002,A6] S6
1872331 CPU0:                mem write 0x015A7A (0x01987A) value: 0x0000000000000001

1872331 CPU0: XA:0x48 exec 0x030D37:p0 A1 0x000004
1872331 CPU0:                state-change A1 <== 0x4 (4)
1872331 CPU0: XA:0x48 exec 0x030D37:p1 A3 A6+A1
1872331 CPU0:                state-change A3 <== 0x15A7C (88700)
1872331 CPU0: XA:0x48 exec 0x030D37:p2 S6 [0x000003,A6]
1872331 CPU0:                mem read 0x015A7B (0x01987B) value: 0x0000000000000000
1872331 CPU0:                state-change S6 <== 0x0 (0)

 

Notice the write with the value 0 happening at 0x030D35:p2! And notice how this value supposed to contain something in the lower-order bits (since it got assigned to A4 at 0x030D33:p3). Well, so let’s see how this value got calculated:

 

0x030D2A:p0 – S1 [0x000008,A7]               @ Determine number of sectors for the drive
0x030D2A:p2 – S1 S1 < 14
0x030D2A:p3 – S1 S1 > 60
0x030D2B:p0 – A1 S1
0x030D2B:p1 – A4 [0x030E6B,A1]
0x030D2B:p3 – S1 [0x000008,A7]
0x030D2C:p1 – S1 S1 < 18
0x030D2C:p2 – S1 S1 > 60
0x030D2C:p3 – A1 S1
0x030D2D:p0 – A4 A4+A1
0x030D2D:p1 – A4 [0x000000,A4]
0x030D2D:p3 – S5 +FA4                        @ S5  is the number of sectors on drive
0x030D2E:p0 – S5 FS5                         @ Normalize S5
0x030D2E:p1 – S6 [0x00000A,A7]               @ Get AU (blocks per allocation unit) field from EQT
0x030D2E:p3 – S7 >13
0x030D2F:p0 – S7 S7 > 27
0x030D2F:p1 – S1 S6&S7
0x030D2F:p2 – S1 S1 > 24
0x030D2F:p3 – S6 [0x000001,A6]               @ Store AU into DRT SPB field
0x030D30:p1 – S7 >16
0x030D30:p2 – S7 S7 > 16
0x030D30:p3 – S1 S1 < 32
0x030D31:p0 – S6 S1!S6&S7
0x030D31:p1 – S1 S1 > 32
0x030D31:p2 – [0x000001,A6] S6
0x030D32:p0 – A4 S1                          @ Move AU into S3 as an float
0x030D32:p1 – S3 +FA4                        | Assign as unnormalized float
0x030D32:p2 – S6 0.6
0x030D32:p3 – S3 FS3                         | Normalize S3
0x030D33:p0 – S4 /HS3                        @ Reciproc – S4 is now 1/AU, that is 1/18
0x030D33:p1 – S5 S5*HS4                      @ S5 is now #SECTORS / AU
0x030D33:p2 – S6 S5+FS6                      @ S6 is now 0.6 + #SECTORS / AU
0x030D33:p3 – A4 S6
0x030D34:p0 – S1 A4
0x030D34:p1 – S3 0!S1                        | S3 0 or S1
0x030D34:p2 – S2 0x0000000000000040          @ Round up value to the next multiple of 64
0x030D35:p0 – S1 S1+S2
0x030D35:p1 – S1 S1 > 6                      @ S1 now contains the number of WORDs needed in the DRT table
0x030D35:p2 – S6 [0x000002,A6]               @ Update the SZ (DRT map size in WORDs) field of the DRT table
0x030D36:p0 – S7 <24                         | form 24 ones from right
0x030D36:p1 – S6 S1!S6&S7                    | S6 bit-wise S7==1 ? S1:S6
0x030D36:p2 – [0x000002,A6] S6

 

It’s a bit hard to read, but essentially what’s happening is that the code first determines the number of sectors on the drive using a lookup based on the drive type and characteristics entries from the EQT table. Then it gets the number of sectors per track (which becomes the AU entry BTW) still from the EQT table. Finally it divides the total number of sectors with the sectors per track to get to the total number of tracks, which is the number of allocation units as well. From here, it returns back to the integer domain, divides the number by 64 (shift-right by 6 bits) to get the number of 64-bit words needed to store the bit-map, which is the value stored in the SZ field.

As a side-note: there’s absolutely no reason to use floating point math and division here. Just as there’s a lookup table for the number of sectors for each drive type/characteristics combination, there’s one to look up the number of tracks as well. In fact this table is right next to the first one! Sheesh, people!

Anyway, if we went through all this gymnastics, why did we end up with 0? Is there something wrong with my floating-point routines? Well, I’ve had some unit-tests for it, but who knows. Then, another thing hit me: most of the floating-point calculations happen in half-precision, something that’s very crudely documented if at all. I knew that my half-precision implementation was rather sketchy, but I wrote it a long time ago, so I couldn’t remember what it was like. So it was time to go back to the simulator source code and see what the implementation of those floating-point instructions above looked like. Well, it didn’t take long to find the problem. This was at the top of cray_types.h:

 

typedef float CHalfFloat_t;    // TODO: replace with proper CRAY half precision float type and overloaded operators

 

That would certainly do it! How to fix this? Reading through all the pages about half-precision floating-point arithmetic (there aren’t much) again, I came to the conclusion that it was just a faster and less-precise way of calculating certain operations, but the there wasn’t a different floating-point storage format. So, it would probably be OK at least initially to simulate all half-precision math with their full-precision equivalent. And most importantly, that’s easy to do:

 

typedef class CFloat_t CHalfFloat_t;    // TODO: replace with proper CRAY half precision float type and overloaded operators

 

A re-compile later, I had a huge grin on my face: problem disappeared. The right values got entered into the DRT, the bitmap had a reasonable size, and the boot continued.

Are we there yet?

No, not really unfortunately. The error got replaced by the following one, which appeared soon after:

image

A quick check at the logs show that the write actually happened, and the data is on the disk image. So why did the write fail?

Another few days of debugging solved this mystery too. What happened had nothing to do with the writes. The OS – I guess to verify the writes – reads each sector back. To calculate the cylinder/head/sector numbers from the logical sector index it uses (you guessed) floating-point arithmetic. Due to some rounding I’ve introduced in the floating point add routine, some of these calculations when astray producing negative sector numbers. When such a read request was sent over to the MIOP, it responded with an error message, which of course in turn confused COS.

I’m still not sure what the right way of calculating the floating-point additions is, but the fix (always truncate) seems to have solved the problem. Now I can completely ‘format’ the master hard-drive. I still get a very similar error message on the 4th hard-drive. Curiously this is a drive that’s part of a striped array. I think the reason for the failure is that you can’t have both the striped array and the individual drives on-line at the same time. Changing the install script to reflect that makes that error go away as well:

*INSTALL
*SKIPEFT
*CONFIG,DVN=MD-1-20A,AVAIL,RBN=Y,MSD=Y,WDL=Y,VOL=N
*CONFIG,DVN=29-1-21A,AVAIL,RBN=Y,WDL=Y,VOL=N
*CONFIG,DVN=29-1-22A,AVAIL,WDL=Y,VOL=N
*CONFIG,DVN=STRIPE-1,AVAIL,WDL=Y,VOL=N
*CONFIG,DVN=29-1-24A,NAVAIL
*CONFIG,DVN=29-1-25A,NAVAIL
*CONFIG,DVN=29-1-26A,NAVAIL

*CONFIG,DVN=29-1-30A,AVAIL,WDL=Y,VOL=N
*CONFIG,DVN=29-1-31A,AVAIL,WDL=Y,VOL=N
*CONFIG,DVN=29-1-32A,AVAIL,WDL=Y,VOL=N
*CONFIG,DVN=BMR-0-20,AVAIL,RBN=Y,SCR=Y,VOL=Y,CTL=Y,GDN=*BMR
*CONFIG,DVN=MT0,NAVAIL
*CONFIG,DVN=MT1,NAVAIL
*CONFIG,DVN=MT2,NAVAIL
*CONFIG,DVN=MT3,NAVAIL
*FSS
*PREEMPT,GRN=*BMR,OS=2,NOFLUSH
*DEVICE,LDV=STRIPE-1
*RESIDE,TL=5
*ENDFSS
*END

 

This still doesn’t solve all the problems. Soon after however (well it’s relative, we’re talking hour long run-times now), I get a crash:

image

Looking into the memory dump, we see the following crash buffer:

image

The values of P and B0 here are truncated to words, so they actually reference 0x00247E:p0…p3 and 0x002459:p0…p3 respectively. These crashes are reported in kernel mode, so these address are physical. Looking into the disassembly around the reported P we see this:

0x00247C:p3 0035150         – [,A0] B40,A1                   | Transfer A1 words from B registers starting from B40 to memory at address A0
0x00247D:p0 0052142         – S0 S1 < 34
0x00247D:p1 0017000:0110771 – JSM 0x00247E:p1                | Jump if S0 < 0
0x00247D:p3 0016000:0111004 – JSP 0x002481:p0                | Jump if S0 >= 0
0x00247E:p1 0072002         – S0 SM
0x00247E:p2 0003406         – SM6 1,TS
0x00247E:p3 0110100:0003771 – [0x0007F9,] A1
0x00247F:p1 0024100         – A1 B0
0x00247F:p2 0007000:0152774 – R 0x00357F:p0                  @ FATAL HANDLER CALL
0x002480:p0 0000000         – ERR0                           |
0x002480:p1 0000000         – ERR0                           |
0x002480:p2 0000000         – ERR0                           |
0x002480:p3 0035161         – [,A0] B49,A1                   | Transfer A1 words from B registers starting from B49 to memory at address A0
0x002481:p0 0052143         – S0 S1 < 35                     |

 

The fatal error handler is called at 0x00247F:p2, and it is called if the jump at 0x00247D:p1 happens. The bigger picture shows that this code is part of a series of error-checks after a call to 0x00248C:p0 at 0x002459:p0 which is what the B0 register content told us:

0x002459:p0 0007000:0111060 – R 0x00248C:p0

The dump also contains all the S registers and shows S1 to be 0x008C5A4020FFFFFF, which is (when shifted left by 34) in deed negative (in other words bit 29 is set).

What’s the meaning of all this? I don’t know yet. But at least it gives me something to start with. Having a looooong repro case is not going to make debugging this problem terribly fun. Maybe I can find a quicker (and related) failure. But for now, this is currently the end of this tail. Stay tuned for more from the land of the Cray!

Print Friendly, PDF & Email

4 thoughts on “Needle in the hay-stack

  1. There might be still some bugs in the floating point emulation, I got an error message about in when trying interactive job submission. Not sure what exactly was the cause, and I haven’t been able to reproduce it. Log:

    13414170 : CPU0: Undocumented instruction code 00001702
    13414379 CPU0: Floating point test failed! Expected result: 148.1067578666787767 actual result: 148.0000000000000000 relative difference: 0.0007213369370188 absolute difference: 0.1067578666787767 binary value: 0x4008940000000000
    13414379 CPU0: Binary input values: 0x3FFF9A209A84FBD0 0x40300000000001EC
    13414379 CPU0: Floating point values: 148.0000000000000000 = 0.3010299956639813 F* 492.0000000000000000
    13414420 : CPU0: Undocumented instruction code 00001706

    The two undocumented instructions (always those two) show up more times in the log; might be part of the reason for it. After the error, the station seemed to hang.

    • While this looks like an error, it’s actually fine. What’s happening here is that they try to calculate the head/sector/cylinder address of a logical sector address for a HDD access. They would need divides and modulo for that, but they only have reciprocal and multiply, and all in floating-point, so that’s what they use. I have never been able to figure out why the results aren’t exact in the expected results, but the correct answer is what the emulator reports (148.0). So, it’s ugly, should be dealt with, but not a bug.

      Now, why your station hangs, that’s a different question and might point to a bug. Which version of the code did you try? The latest (v0.92) should not have hanging problems at least not with the default configuration.

      • Ah, thanks. I’ve not been able to reproduce the hanging, might have been something else. (BTW, is there a way to get email notifications for answers? Or did I get one and missed it?)

        • Dirk,

          I’ve added a new plugin that you can use to configure email settings. Hopefully this will allow you to set up your notifications the way you like it.

Leave a Reply

Your email address will not be published. Required fields are marked *