A new year, a new target
My long-time friend, Chris, sent me an e-mail the other day: he got his hands on a few tapes that contain UNICOS 6.1.6 and 7.0.4 for YMP-el machines. This is a great find! These versions are several years older and are for a previous generation machine than the UNICOS 10 and the J90 I’ve been dealing with in the last year or so. Of course there was the ‘slight’ problem of getting the content of the tapes.
I’ll let him explain his saga on his blog if he wants to. I’ll just say here that it involved getting a tape reader off eBay, me mailing him a SCSI card, and him 3D printing a new capstan rubber ring. After all that, during the holidays he and Jeff, his accomplice in this endeavor, managed to extract the content and put it on my server.
So that’s the origin story of this episode: I have two large binary files on my machine, one for each UNICOS version. Now what?
Cracking the big blobs.
First things first: let’s see what’s in the files.
They had a ‘tar’ extension and, sure enough, they extracted fine using tar. There was a slight problem though: the extracted content was a mere 12MB, whereas the original file was close to 500MB. Clearly there’s more to this than what tar can see. With some poking around, I managed to identify a Cray file-system image (NC1FS) shortly after the end of the tar portion. After that came a relatively short cpio archive, some junk whose purpose I still don’t know, and finally a large cpio file. Splitting the binary into these constituents, the following picture emerged:
- The tar file at the top of the file contains the install files needed for the IOS. Chief among them, the kernel image that’s loaded from the IOS into the mainframe during boot.
- The file-system is a small root-fs intended for a RAM-disk and contains the minimal set of executables needed to bootstrap the install process: things like ‘fdisk’, ‘mount’ and ‘cpio’. I managed to mount this file-system in my existing J90 simulator, running UNICOS 10, and verify that it is intact.
- The small cpio file seems to contain configuration files for a default install (the newer systems I’m familiar with have these under /skl).
- Finally, the large cpio is a complete system image, including root and usr.
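Finding where each piece starts inside the big blob mostly comes down to scanning for known signatures. Here is a minimal sketch of that idea, assuming the cpio archives use the portable ASCII header, which starts with the characters "070707". Whether the UNICOS tapes use that exact cpio flavor is an assumption, and the names FindNextMagic, FindAllMagic and kCpioAsciiMagic are made up for this illustration.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Header signature of the portable ASCII cpio format.
// ASSUMPTION: the archives on the tape use this flavor of cpio.
constexpr std::string_view kCpioAsciiMagic = "070707";

// Offset of the first occurrence of `magic` at or after `from`,
// or std::string_view::npos when there is none.
constexpr std::size_t FindNextMagic(std::string_view image,
                                    std::string_view magic,
                                    std::size_t from = 0) {
    return image.find(magic, from);
}

// Collect every candidate offset. Each hit is only a candidate: the byte
// pattern can also occur by chance inside file data, so every offset
// still has to be checked by actually trying to extract from it.
std::vector<std::size_t> FindAllMagic(std::string_view image,
                                      std::string_view magic) {
    std::vector<std::size_t> offsets;
    if (magic.empty()) {
        return offsets;
    }
    for (std::size_t pos = FindNextMagic(image, magic);
         pos != std::string_view::npos;
         pos = FindNextMagic(image, magic, pos + 1)) {
        offsets.push_back(pos);
    }
    return offsets;
}
```

The image has to be wrapped in a std::string_view with an explicit length, since binary data contains NUL bytes. The first hit after the end of the file-system image would be the natural place to cut the blob.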
This sounds great, actually. It seems all the pieces are there to get a system working. In fact, there’s even a sample configuration file on the mini file-system:
CONFIG {
    AUTOGEN OFF;

    /* ------- IOS settings ------- */
    /* The number of IOS's must be set if more than 1. */
    /* CHANNEL sets channel number for IOS's beyond IOS 0. */
    IOS=2;
    CHANNEL 040 = MIOP IOS = 1;

    /* ------- CPU settings ------- */
    /* The number of CPU's must be set if more than 1. */
    CPUS = 2;

    /* ------- MEMORY setting -------*/
    /* You MUST set the memory size of your machine here */
    /* 32 million words = 33554432 */
    /* 64 million words = 67108864 */
    /* 128 million words = 134217728 */
    /* 256 million words = 268435456 */
    /* MEMORY = 33554432; */

    /* ------- Ldcache settings ------- */
    /* NLDCH is the number of ldcache headers. */
    /* LDCHCORE is the number of 4k blocks type MEM ldcache. */
    /* NLDCH = 1024; */
    /* LDCHCORE = 8192; */

    /* ------- Tape Config settings ------- */

    /*------------------------------------------------*/
    /*------- IOS 0 Physical Devices settings ------- */
    /*------------------------------------------------*/

    /* IOS 0 ESDI controller (MIOP) 0 */
    ESDI_000 := DD3 MIOP 0 UNIT=0 (
        0:iroot_e000 /* 6000 blocks */
        6000:root_e000 /* 90000 blocks */
        90000:fs1_e000 /* 238200 blocks */
    );
    ESDI_001 := DD3 MIOP 0 UNIT=1 (
        0:fs2_e001 /* 184200 blocks */
        184200:tmp_e001 /* 150000 blocks */
    );
    ESDI_002 := DD3 MIOP 0 UNIT=2 (
        0:fs3_e002 /* 184200 blocks */
        184200:tmp_e002 /* 150000 blocks */
    );
    ESDI_003 := DD3 MIOP 0 UNIT=3 (
        0:fs4_e003 /* 184200 blocks */
        184200:tmp_e003 /* 150000 blocks */
    );

    /* IOS 0 ESDI controller (MIOP) 1 */
    ESDI_010 := DD3 MIOP 1 UNIT=0 (
        0:usr_e010 /* 150000 blocks */
        150000:src_e010 /* 150000 blocks */
        300000:fs4_e010 /* 34200 blocks */
    );
    ESDI_011 := DD3 MIOP 1 UNIT=1 (
        0:fs3_e011 /* 184200 blocks */
        184200:tmp_e011 /* 150000 blocks */
    );
    ESDI_012 := DD3 MIOP 1 UNIT=2 (
        0:fs2_e012 /* 184200 blocks */
        184200:tmp_e012 /* 150000 blocks */
    );
    ESDI_013 := DD3 MIOP 1 UNIT=3 (
        0:fs1_e013 /* 184200 blocks */
        184200:tmp_e013 /* 150000 blocks */
    );

    /* IOS 0 DAS controller (MIOP) 0 */
    /* note: DAS MIOPs are referenced as MIOP 8 */
    DAS_000 := DDAS2 MIOP 8 UNIT=0 (
        0:dump_d000 /* 75000 blocks */
        75000:swap_d000 /* 350000 blocks */
        425000:bkroot_d000 /* 90000 blocks */
        515000:bkusr_d000 /* 150000 blocks */
        665000:core_d000 /* 150000 blocks */
        815000:local_d000 /* 250000 blocks */
        1065000:tmp_d000 /*1470360 blocks */
    );

    /*------------------------------------------------*/
    /*------- IOS 1 Physical Devices settings ------- */
    /*------------------------------------------------*/

    /* IOS 1 IPI controller (MIOP) 0 */
    /* note: IPI MIOPs are referenced as MIOP 10 */
    IPI_100 := DD4 MIOP 10 UNIT=0 IOS=1 (
        0:fs5_i100 /* 653000 blocks */
    );
    IPI_101 := DD4 MIOP 10 UNIT=1 IOS=1 (
        0:fs6_i101 /* 653000 blocks */
    );
    IPI_102 := DD4 MIOP 10 UNIT=2 IOS=1 (
        0:fs5_i102 /* 653000 blocks */
    );
    IPI_103 := DD4 MIOP 10 UNIT=3 IOS=1 (
        0:fs6_i103 /* 653000 blocks */
    );

    /* IOS 1 ESDI controller (MIOP) 0 */
    ESDI_100 := DD3 MIOP 0 UNIT=0 IOS=1 (
        0:fs7_e100 /* 334200 blocks */
    );
    ESDI_101 := DD3 MIOP 0 UNIT=1 IOS=1 (
        0:fs8_e101 /* 334200 blocks */
    );
    ESDI_102 := DD3 MIOP 0 UNIT=2 IOS=1 (
        0:fs9_e102 /* 334200 blocks */
    );
    ESDI_103 := DD3 MIOP 0 UNIT=3 IOS=1 (
        0:fs10_e103 /* 334200 blocks */
    );

    /* ------- Stripe Groups ------- */

    /*-----------------------------------------*/
    /* ------- Logical Device settings ------- */
    /*-----------------------------------------*/

    /* DAS on IOS 0 logical entries */
    iroot := ( /* 6000 blocks */
        iroot_e000
    );
    dump := (
        dump_d000 /* 75000 blocks */
    );
    swap := (
        swap_d000 /* 163840 blocks */
    );
    bkroot := (
        bkroot_d000 /* 90000 blocks */
    );
    bkusr := (
        bkusr_d000 /* 150000 blocks */
    );
    core := (
        core_d000 /* 150000 blocks */
    );
    local := (
        local_d000 /* 250000 blocks */
    );
    dastmp := (
        tmp_d000 /*1470360 blocks */
    );

    /* IPI on IOS 1 logical entries */
    fs5 := ( /* 1306000 blocks */
        fs5_i100
        fs5_i102
    );
    fs6 := ( /* 1306000 blocks */
        fs6_i103
        fs6_i101
    );

    /* ESDI on IOS 0 logical entries */
    root := ( /* 90000 blocks */
        root_e000
    );
    usr := (
        usr_e010 /* 150000 blocks */
    );
    tmp := ( /* 900000 blocks */
        tmp_e001
        tmp_e011
        tmp_e002
        tmp_e012
        tmp_e003
        tmp_e013
    );
    fs1 := ( /* 422400 blocks */
        fs1_e000
        fs1_e013
    );
    fs2 := ( /* 368400 blocks */
        fs2_e012
        fs2_e001
    );
    fs3 := ( /* 368400 blocks */
        fs3_e002
        fs3_e011
    );
    fs4 := ( /* 218400 blocks */
        fs4_e010
        fs4_e003
    );

    /* ESDI on IOS 1 logical entries */
    fs7 := ( /* 334200 blocks */
        fs7_e100
    );
    fs8 := ( /* 334200 blocks */
        fs8_e101
    );
    fs9 := ( /* 334200 blocks */
        fs9_e102
    );
    fs10 := ( /* 334200 blocks */
        fs10_e103
    );

    /*--------------------------------*/
    /* ------- System Devices ------- */
    /*--------------------------------*/
    SWAPDEV = swap;
    ROOTDEV = iroot;
    PIPEDEV = iroot;
}
Which is the first clue that we’re dealing with something old here: this style of configuration is what UNICOS 4.0 used on the X-MP. The UNICOS 10 image I’ve been working with recently used a different format. No big deal of course, just shows how much this OS evolved between version 6 (and 7) and 10.
Getting things going
Life would be too simple without problems, of course: now I have an OS for a machine that I don’t have a simulator for. Not only that, but the YMP-el is curiously undocumented. There are only a few documents online, none of which detail the instruction set, for example (only summaries are available). But how much could things have changed between it and the J90? If I were a betting man, I would put my money on ‘not too much’. Maybe they’ve added a few instructions here and there, but the two machines should be fairly close to one another.
With that thought, I put together a setup that mapped the small root FS to a virtual disk, loaded a modified version of the above configuration file and the kernel into the J90 emulator and let it loose.
Nothing…
The machine was just chugging along, no input, no output, no smoke. Nothing.
By enabling instruction traces, it became pretty clear why: the mainframe was waiting to hear from the IOS. This is behavior that I’ve seen on the X-MP as well: there, communication started with the IOS talking first (sending the time and date info). The newer J90 system went the other way, and the mainframe sent the first packet (which, by the way, makes way more sense).
Luckily I had some code lying around from my early UNICOS 4.0 experiments that simulated this behavior. I had yanked it out of the sources a long time ago, but that’s what revision control is for.
After some futzing around with the code, merging this ancient piece back in and patching it up for all the changes since then, I got a console. And a crash report with it:
OK, this is not bad, actually. It means that:
- The kernel booted
- It parsed the configuration file properly
- It found and mounted the root FS
- It spawned the first process
But why did it crash? Well, obviously – it even tells me as much – because init exited. It’s not supposed to do that: this is the first process created by the kernel and the one that creates all the rest. If it terminates, the kernel doesn’t know what to do.
Looking into the instruction trace, comparing it with the memory dump and the little fragments of source code I have (mostly for UNICOS 10), I realized that the crash was happening in paniktst.s. This piece of code is testing for whether the interrupt was due to an error-exit or not (the Crays don’t have vectored interrupts, so SW has to differentiate between all sources, including exceptions). Furthermore this particular routine was called if there was an interrupt while the processor was in monitor mode (privileged mode on modern platforms, a state where most interrupts are disabled).
So, let’s see what the state of the machine was during the last interrupt:
182067 CPU0: ================================ JUMP =========================
182067 CPU0: XA:0x15 exec 0x000801B1:p2 (0x000801B1:p2) CMR
182068 CPU0: XA:0x15 exec 0x000801B1:p3 (0x000801B1:p3) SM27 0
182069 CPU0: XA:0x15 exec 0x000801B2:p0 (0x000801B2:p0) A5 S3
182069 CPU0: XA:0x15 exec 0x000801B2:p1 (0x000801B2:p1) A0 [0x00000012,A5]
182069 CPU0: XA:0x15 exec 0x000801B3:p0 (0x000801B3:p0) A7 0x00000010
182069 CPU0: XA:0x15 exec 0x000801B3:p1 (0x000801B3:p1) A7 A7+A6
182069 CPU0: XA:0x15 exec 0x000801B3:p2 (0x000801B3:p2) XA A1
182069 CPU0: XA:0x1B exec 0x000801B3:p3 (0x000801B3:p3) B0 A0
182069 CPU0: XA:0x1B exec 0x000801B4:p0 (0x000801B4:p0) EX 0
182070 CPU0: at 00:00:04.508000 Handling interrupt with flags: .---.--.---.---.---.---.---.--.---.---.---.
182070 CPU0: Creating exchange packet with XP: 0x1B
182070 CPU0: *************** EXCHANGE PACKET SWAP BEGINS ***************
182070 CPU0: Current Exchange Packet:
182070 CPU0: -----------------------------------------------------------
182070 CPU0: Raw packet 0: 0x002006D100000000
182070 CPU0: Raw packet 1: 0x00000000000001B0
182070 CPU0: Raw packet 2: 0x0001FFFC000121DA
182070 CPU0: Raw packet 3: 0x0000000000221000
182070 CPU0: Raw packet 4: 0x0001FFFC00044026
182070 CPU0: Raw packet 5: 0x001B800100221400
182070 CPU0: Raw packet 6: 0x000009EB00000140
182070 CPU0: Raw packet 7: 0x0000000000000150
182070 CPU0: Raw packet 8: 0x0000000000000000
182070 CPU0: Raw packet 9: 0x00000000000003E5
182070 CPU0: Raw packet 10: 0x0000000000000000
182070 CPU0: Raw packet 11: 0x0000000000221400
182070 CPU0: Raw packet 12: 0x0001AF303BF970C4
182070 CPU0: Raw packet 13: 0x0000000000000000
182070 CPU0: Raw packet 14: 0x0000000000000000
182070 CPU0: Raw packet 15: 0x0000000000000001
182070 CPU0: PN:0
182070 CPU0: S:0x00
182070 CPU0: CSB:0x00000000
182070 CPU0: M:.--.ESVL.---.---.BDM.IOR.IFP.IUM.---.---.IMM.MM.
182070 CPU0: VNU:0
182070 CPU0: F:.---.--.---.---.---.---.---.--.---.---.---.
182070 CPU0: XA:0x1B
182070 CPU0: VL:64
182070 CPU0: CLN:1
182070 CPU0: P:0x000801B4:p1
182070 CPU0: IBA:0x00000000
182070 CPU0: ILA:0x01FFFC00
182070 CPU0: DBA:0x00000000
182070 CPU0: DLA:0x01FFFC00
182070 CPU0: A0:0x00000000
182070 CPU0: A1:0x000001B0
182070 CPU0: A2:0x000121DA
182070 CPU0: A3:0x00221000
182070 CPU0: A4:0x00044026
182070 CPU0: A5:0x00221400
182070 CPU0: A6:0x00000140
182070 CPU0: A7:0x00000150
182070 CPU0: S0:0x0000000000000000
182070 CPU0: S1:0x00000000000003E5
182070 CPU0: S2:0x0000000000000000
182070 CPU0: S3:0x0000000000221400
182070 CPU0: S4:0x0001AF303BF970C4
182070 CPU0: S5:0x0000000000000000
182070 CPU0: S6:0x0000000000000000
182070 CPU0: S7:0x0000000000000001
182070 CPU0: New Exchange Packet:
182070 CPU0: -----------------------------------------------------------
182070 CPU0: Raw packet 0: 0x0000000000000000
182070 CPU0: Raw packet 1: 0x0000222000000000
182070 CPU0: Raw packet 2: 0x0000222200000000
182070 CPU0: Raw packet 3: 0x0000222000000000
182070 CPU0: Raw packet 4: 0x0000222200000000
182070 CPU0: Raw packet 5: 0x001B000000000000
182070 CPU0: Raw packet 6: 0x000009F800000000
182070 CPU0: Raw packet 7: 0x0000000000000000
182070 CPU0: Raw packet 8: 0x0000000000000000
182070 CPU0: Raw packet 9: 0x0000000000000000
182070 CPU0: Raw packet 10: 0x0000000000000000
182070 CPU0: Raw packet 11: 0x0000000000000000
182070 CPU0: Raw packet 12: 0x0000000000000000
182070 CPU0: Raw packet 13: 0x0000000000000000
182070 CPU0: Raw packet 14: 0x0000000000000000
182070 CPU0: Raw packet 15: 0x0000000000000000
182070 CPU0: PN:0
182070 CPU0: S:0x00
182070 CPU0: CSB:0x00000000
182070 CPU0: M:.--.ESVL.---.---.BDM.IOR.IFP.IUM.ICM.---.---.--.
182070 CPU0: VNU:0
182070 CPU0: F:.---.--.---.---.---.---.---.--.---.---.---.
182070 CPU0: XA:0x1B
182070 CPU0: VL:0
182070 CPU0: CLN:0
182070 CPU0: P:0x00000000:p0
182070 CPU0: IBA:0x00222000
182070 CPU0: ILA:0x00222000
182070 CPU0: DBA:0x00222000
182070 CPU0: DLA:0x00222000
182070 CPU0: A0:0x00000000
182070 CPU0: A1:0x00000000
182070 CPU0: A2:0x00000000
182070 CPU0: A3:0x00000000
182070 CPU0: A4:0x00000000
182070 CPU0: A5:0x00000000
182070 CPU0: A6:0x00000000
182070 CPU0: A7:0x00000000
182070 CPU0: S0:0x0000000000000000
182070 CPU0: S1:0x0000000000000000
182070 CPU0: S2:0x0000000000000000
182070 CPU0: S3:0x0000000000000000
182070 CPU0: S4:0x0000000000000000
182070 CPU0: S5:0x0000000000000000
182070 CPU0: S6:0x0000000000000000
182070 CPU0: S7:0x0000000000000000
182070 CPU0: *************** EXCHANGE PACKET SWAP ENDS ***************
182072 CPU0: at 00:00:04.508000 Handling interrupt with flags: .---.--.---.---.---.---.PRE.--.---.---.---.
182072 CPU0: Creating exchange packet with XP: 0x1B
182072 CPU0: *************** EXCHANGE PACKET SWAP BEGINS ***************
182072 CPU0: Current Exchange Packet:
182072 CPU0: -----------------------------------------------------------
182072 CPU0: Raw packet 0: 0x0000000000000000
182072 CPU0: Raw packet 1: 0x0000222000000000
182072 CPU0: Raw packet 2: 0x0000222000000000
182072 CPU0: Raw packet 3: 0x0000222000000000
182072 CPU0: Raw packet 4: 0x0000222000000000
182072 CPU0: Raw packet 5: 0x001B000000000000
182072 CPU0: Raw packet 6: 0x000109F800000000
182072 CPU0: Raw packet 7: 0x0000000000000000
182072 CPU0: Raw packet 8: 0x0000000000000000
182072 CPU0: Raw packet 9: 0x0000000000000000
182072 CPU0: Raw packet 10: 0x0000000000000000
182072 CPU0: Raw packet 11: 0x0000000000000000
182072 CPU0: Raw packet 12: 0x0000000000000000
182072 CPU0: Raw packet 13: 0x0000000000000000
182072 CPU0: Raw packet 14: 0x0000000000000000
182072 CPU0: Raw packet 15: 0x0000000000000000
182072 CPU0: PN:0
182072 CPU0: S:0x00
182072 CPU0: CSB:0x00000000
182072 CPU0: M:.--.ESVL.---.---.BDM.IOR.IFP.IUM.ICM.---.---.--.
182072 CPU0: VNU:0
182072 CPU0: F:.---.--.---.---.---.---.PRE.--.---.---.---.
182072 CPU0: XA:0x1B
182072 CPU0: VL:0
182072 CPU0: CLN:0
182072 CPU0: P:0x00000000:p0
182072 CPU0: IBA:0x00222000
182072 CPU0: ILA:0x00222000
182072 CPU0: DBA:0x00222000
182072 CPU0: DLA:0x00222000
182072 CPU0: A0:0x00000000
182072 CPU0: A1:0x00000000
182072 CPU0: A2:0x00000000
182072 CPU0: A3:0x00000000
182072 CPU0: A4:0x00000000
182072 CPU0: A5:0x00000000
182072 CPU0: A6:0x00000000
182072 CPU0: A7:0x00000000
182072 CPU0: S0:0x0000000000000000
182072 CPU0: S1:0x0000000000000000
182072 CPU0: S2:0x0000000000000000
182072 CPU0: S3:0x0000000000000000
182072 CPU0: S4:0x0000000000000000
182072 CPU0: S5:0x0000000000000000
182072 CPU0: S6:0x0000000000000000
182072 CPU0: S7:0x0000000000000000
182072 CPU0: New Exchange Packet:
182072 CPU0: -----------------------------------------------------------
182072 CPU0: Raw packet 0: 0x002006D100000000
182072 CPU0: Raw packet 1: 0x00000000000001B0
182072 CPU0: Raw packet 2: 0x0001FFFC000121DA
182072 CPU0: Raw packet 3: 0x0000000000221000
182072 CPU0: Raw packet 4: 0x0001FFFC00044026
182072 CPU0: Raw packet 5: 0x001B800100221400
182072 CPU0: Raw packet 6: 0x000009EB00000140
182072 CPU0: Raw packet 7: 0x0000000000000150
182072 CPU0: Raw packet 8: 0x0000000000000000
182072 CPU0: Raw packet 9: 0x00000000000003E5
182072 CPU0: Raw packet 10: 0x0000000000000000
182072 CPU0: Raw packet 11: 0x0000000000221400
182072 CPU0: Raw packet 12: 0x0001AF303BF970C4
182072 CPU0: Raw packet 13: 0x0000000000000000
182072 CPU0: Raw packet 14: 0x0000000000000000
182072 CPU0: Raw packet 15: 0x0000000000000001
182072 CPU0: PN:0
182072 CPU0: S:0x00
182072 CPU0: CSB:0x00000000
182072 CPU0: M:.--.ESVL.---.---.BDM.IOR.IFP.IUM.---.---.IMM.MM.
182072 CPU0: VNU:0
182072 CPU0: F:.---.--.---.---.---.---.---.--.---.---.---.
182072 CPU0: XA:0x1B
182072 CPU0: VL:64
182072 CPU0: CLN:1
182072 CPU0: P:0x000801B4:p1
182072 CPU0: IBA:0x00000000
182072 CPU0: ILA:0x01FFFC00
182072 CPU0: DBA:0x00000000
182072 CPU0: DLA:0x01FFFC00
182072 CPU0: A0:0x00000000
182072 CPU0: A1:0x000001B0
182072 CPU0: A2:0x000121DA
182072 CPU0: A3:0x00221000
182072 CPU0: A4:0x00044026
182072 CPU0: A5:0x00221400
182072 CPU0: A6:0x00000140
182072 CPU0: A7:0x00000150
182072 CPU0: S0:0x0000000000000000
182072 CPU0: S1:0x00000000000003E5
182072 CPU0: S2:0x0000000000000000
182072 CPU0: S3:0x0000000000221400
182072 CPU0: S4:0x0001AF303BF970C4
182072 CPU0: S5:0x0000000000000000
182072 CPU0: S6:0x0000000000000000
182072 CPU0: S7:0x0000000000000001
182072 CPU0: *************** EXCHANGE PACKET SWAP ENDS ***************
There’s a lot of mumbo-jumbo here, but two things are important. First, there are two ‘exchanges’ right after one another. The first one exchanges out of monitor mode (a kernel-to-user transition), but right after that we’re dropped back into the kernel without executing even a single instruction. The reason? It’s a protection error (the PRE bit is set in the flags). This happens because the IBA and ILA (instruction base address and instruction limit address) registers are set to the same value. Essentially we’ve set up the processor with an empty instruction range, so it is allowed to execute nothing, which is exactly what it did.
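To make the failure mode concrete, here is a minimal sketch of an instruction-fetch range check. It assumes the limit is exclusive and is compared against the absolute word address (base plus program counter). The exact comparison done by the real hardware is not something the text confirms, and FetchWindow and FetchAllowed are names invented for this example.

```cpp
#include <cstdint>

// Instruction base/limit pair, as decoded from an exchange packet.
struct FetchWindow {
    std::uint32_t iba;  // instruction base address (absolute, in words)
    std::uint32_t ila;  // instruction limit address (absolute, in words)
};

// True when a fetch at program-relative word address `p` falls inside the
// window. ASSUMPTION: the limit is exclusive; the real check may differ.
constexpr bool FetchAllowed(FetchWindow window, std::uint32_t p) {
    // Widen to 64 bits so that base + offset cannot wrap around.
    return std::uint64_t{window.iba} + p < window.ila;
}
```

With the values from the trace, the kernel context (IBA 0x00000000, ILA 0x01FFFC00, P 0x000801B4) passes the check, while the user context (IBA and ILA both 0x00222000, P 0) fails on its very first fetch.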
And now, the second point: the raw value of the exchange package (that determines the value of IBA and ILA) contains different values for the protection registers than what they end up with:
182070 CPU0: Raw packet 1: 0x0000222000000000
182070 CPU0: Raw packet 2: 0x0000222200000000

182072 CPU0: IBA:0x00222000
182072 CPU0: ILA:0x00222000
This obviously ain’t right. Unfortunately the exchange packet layout is not documented in the printed material that I have access to, but I’ve seen Cray messing around with the fields for these two registers before. In fact, even machines within the same generation (dual- and quad-CPU X-MPs) have these fields mapped out differently. Luckily, I do have access to the kernel headers, in which (in xp.h) the exchange packet layout is documented. And there, in fact, is my answer: the fields are shifted by two bits compared to the J90. Just great!
Anyway, this is easy enough to fix, though it does mean I now have to make run-time decisions based on machine type, and of course it also means that I have to introduce the machine type to the simulator to begin with.
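A run-time decision of this kind could look like the following sketch. The bit positions are assumptions, not taken from xp.h: the J90 numbers are picked so that they reproduce the decoded values in the trace above, and the Y-MP-el numbers are one reading of "shifted by two bits" that happens to match the addresses seen in this post's traces. MachineType, AddressFieldLayout, LayoutFor and DecodeAddressField are illustrative names only.

```cpp
#include <cstdint>

enum class MachineType { J90, YmpEl };

// Where a base/limit field sits in a raw 64-bit exchange packet word and how
// the extracted field is scaled into a word address.
// ASSUMPTION: both layouts are inferred from trace values, not from xp.h.
struct AddressFieldLayout {
    unsigned fieldShift;  // right shift that brings the field down to bit 0
    unsigned scaleShift;  // left shift that turns the field into an address
};

constexpr AddressFieldLayout LayoutFor(MachineType machine) {
    return machine == MachineType::J90 ? AddressFieldLayout{34, 10}
                                       : AddressFieldLayout{32, 8};
}

// Decode a base or limit register from a raw exchange packet word.
// ASSUMPTION: everything above the field in the raw word is zero.
constexpr std::uint32_t DecodeAddressField(std::uint64_t rawWord,
                                           MachineType machine) {
    const AddressFieldLayout layout = LayoutFor(machine);
    return static_cast<std::uint32_t>((rawWord >> layout.fieldShift)
                                      << layout.scaleShift);
}
```

Under these assumptions, the raw words 0x0000222000000000 and 0x0000222200000000 both decode to 0x00222000 with the J90 layout (the empty window from the crash), but to 0x00222000 and 0x00222200 with the Y-MP-el layout, which leaves room for code to run.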
With this hurdle down – and the code changed – I’ve restarted the simulator. Which now doesn’t crash, but doesn’t boot either. At least not fully. It is just sitting there looking all boring and dumb.
Looking at the logs (I’m very glad that I implemented some rudimentary syscall extraction), it became pretty clear that the problem is that the third call to ‘close’ doesn’t return to the app as it should, instead it enters an idle loop.
Another nice feature of UNICOS, at least of the official builds, is that the symbols are not stripped from the binaries. That means that I can reconstruct the call-tree (I didn’t bother automating it, but the manual process isn’t horrible). As it turns out, close was called from fclose (well, duh), which was called from _cleanup, then _exithandle, and finally from exit.
The handle being closed belonged to stderr. It’s pretty clear that the system is waiting for something. But why would close do this? Why would it block?
That’s when the light went on: Buffering, that’s why! Buffered I/O needs to be flushed when the last handle to the file is closed (in this case stderr was redirected to stdout).
And now that I mention it… I haven’t seen any output from the utility being run, only from the kernel. I’ve seen some ‘write’ calls that seem to target stdout, but nothing came out. Which is another clue, by the way: initial kernel output is probably not buffered, and even if it is, it is polled, since at such an early stage the kernel can’t handle interrupts (monitor mode is largely uninterruptible). So what’s going on is that, for some reason, the handshake with the (simulated) console is not working, the output gets buffered up in the kernel, and when the file handle is closed the application gets blocked.
Looking at my code, it became pretty clear why: I don’t send any replies to console messages. None whatsoever. No wonder the poor OS didn’t know it was free to send more data…
With the appropriate fixes (and some additional cleanup), I finally have a prompt in single-user mode:
As you can see I’ve switched from the 6.x release to the 7.x release during this work, which is the next part of the story.
Getting UNICOS 6 to boot
The 6.x release still didn’t work. It crashed trying to execute an instruction that’s not valid on a Y-MP. This doesn’t happen in the kernel though; it happens pretty much immediately after creating the first process:
431928 : SYSCALL: 11 exec returning 0x0000000000000000 (0)
431928 CPU0: SYSCALL: 11 exec returning 0x0000000000000000 (0)
431929 CPU0: XA:0x1B exec 0x008763D7:p0 (0x000015D7:p0) S1 RT | Read the real-time-clock
431930 CPU0: XA:0x1B exec 0x008763D7:p1 (0x000015D7:p1) T56 S1 |
431930 CPU0: XA:0x1B exec 0x008763D7:p2 (0x000015D7:p2) T57 S7 |
431930 CPU0: XA:0x1B exec 0x008763D7:p3 (0x000015D7:p3) B57 A1 |
431930 CPU0: XA:0x1B exec 0x008763D8:p0 (0x000015D8:p0) S2 0 |
431930 CPU0: XA:0x1B exec 0x008763D8:p1 (0x000015D8:p1) SM S2 |
431930 CPU0: XA:0x1B exec 0x008763D8:p2 (0x000015D8:p2) A0 0x00400000:p0 | ******* X-MP ONLY ******** MIGHT NOT BE IMPLEMENTED ON YMP
This last instruction doesn’t exist on Y-MPs (only on X-MPs) so my simulator blows up.
As it turns out, this code sequence comes from /etc/init, so we’ve gotten as far as loading that, but apparently not much further. What can be going on though? All the documentation I’ve seen clearly states that these instructions don’t exist on Y-MPs, and for a good reason: with the expansion of the A registers from 24 to 32 bits, these instructions, which load a constant value into these registers, don’t have enough bits anymore to encode all the possible values.
Could it be that this is actually an X-MP version of the OS? No, that’s not possible either, as that would certainly crash the kernel much earlier. Plus, as you can see in the screenshot near the beginning, the kernel clearly states that it was compiled for Y-MP.
But, as I’ve said before, this is not the kernel anymore, it’s the first process, /etc/init. Could that be an X-MP binary? Luckily UNICOS binaries contain some info about their target architecture, and parsing that, sure enough, the binary is a 24-bit one. After some digging and guessing, it turns out Y-MPs supported a 24-bit (more or less) X-MP compatible mode. This mode was set on a per-process basis: it’s bit 35 of word 6 of the exchange packet. My simulator doesn’t know anything about that bit as this support was yanked from the J90, but let’s see if the raw exchange packets have that bit set the right way!
431928 CPU0: *************** EXCHANGE PACKET SWAP BEGINS ***************
431928 CPU0: Current Exchange Packet:
431928 CPU0: -----------------------------------------------------------
431928 CPU0: Raw packet 0: 0x002026BC00000000
431928 CPU0: Raw packet 1: 0x00000000000001B0
431928 CPU0: Raw packet 2: 0x0007FFFF00012818
431928 CPU0: Raw packet 3: 0x0000000000873E00
431928 CPU0: Raw packet 4: 0x0007FFFF00040640
431928 CPU0: Raw packet 5: 0x001B800100874200
431928 CPU0: Raw packet 6: 0x000009EB00000140
...
431928 CPU0: New Exchange Packet:
431928 CPU0: -----------------------------------------------------------
431928 CPU0: Raw packet 0: 0x0000575C00000000
431928 CPU0: Raw packet 1: 0x0000874E00008A6D
431928 CPU0: Raw packet 2: 0x000087DA00000000
431928 CPU0: Raw packet 3: 0x0000874E00000000
431928 CPU0: Raw packet 4: 0x000087DA00000000
431928 CPU0: Raw packet 5: 0x001B000000000000
431928 CPU0: Raw packet 6: 0x000009F000000000
Bingo! As we get out of the kernel, the extended address range mode bit (EAM) is set, and when we enter into the process, it’s cleared. So, we’re dealing with a 24-bit executable in a 32-bit OS.
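The check itself is a one-liner. The sketch below assumes bits are numbered from the least significant end, which is the numbering that makes bit 35 of word 6 agree with the two raw packets shown above. EamSet and AddressBits are names made up for this example.

```cpp
#include <cstdint>

// Position of the extended-addressing mode (EAM) flag in word 6 of the
// exchange packet.
// ASSUMPTION: bits are counted from the least significant bit (bit 0).
constexpr unsigned kEamBit = 35;

// True when the raw word 6 of an exchange packet has the EAM bit set.
constexpr bool EamSet(std::uint64_t word6) {
    return ((word6 >> kEamBit) & 1u) != 0;
}

// Width of the A registers as seen by the process being exchanged in.
constexpr unsigned AddressBits(std::uint64_t word6) {
    return EamSet(word6) ? 32u : 24u;
}
```

For the kernel's word 6 (0x000009EB00000140) this yields 32-bit addressing, and for the process's word 6 (0x000009F000000000) it yields 24-bit addressing.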
At first this seemed a much bigger headache than it actually turned out to be. You see, the way I’ve built the simulator was that I’ve created a custom type that’s either 24 or 32 bits long depending on the architecture. This type (called CAddr_t) is used literally everywhere, and for a good reason: it’s easy to make a mistake in sign-extending, wrapping around, or converting to/from 64-bit values. It’s better to keep those details wrapped in a type with the proper operator overloads than to sprinkle them around the code.
Except now, I need to simulate both address types, simultaneously, depending on a stupid bit set in the exchange packet! That’s not what the C++ type-system was designed to do. Not only that, but even in the real HW, the underlying machine used 32-bit addressing, it’s just the instruction-set that got swapped out. There are a lot of potential bugs here and a lot of porting/debugging before it will start working.
Or so I thought at first. But then I realized: I had actually wrapped most of the register accesses into a little helper class and some macros:
#define RefA0 RefAx_s(0, *this, false, CAddr_t(0))
#define RefAh RefAx_s(h, *this, true, CAddr_t(0))
#define RefAi RefAx_s(i, *this, false, CAddr_t(0))
#define RefAj RefAx_s(j, *this, true, CAddr_t(0))
#define RefAk RefAx_s(k, *this, true, CAddr_t(1))

#define RefA0Target RefAx_s(0, *this, false, CAddr_t(0))
#define RefAhTarget RefAx_s(h, *this, false, CAddr_t(0))
#define RefAiTarget RefAx_s(i, *this, false, CAddr_t(0))
#define RefAjTarget RefAx_s(j, *this, false, CAddr_t(0))
#define RefAkTarget RefAx_s(k, *this, false, CAddr_t(0))

struct RefAx_s: public FieldFormatter_i {
    RefAx_s(size_t aRegIdx, Cpu_c &aParent, bool aHasSpecialValue, CAddr_t aSpecialValue):
        mRegIdx(aRegIdx),
        mParent(aParent),
        mHasSpecialValue(aHasSpecialValue),
        mSpecialValue(aSpecialValue)
    {}
    RefAx_s &operator=(CAddr_t aValue) {
        LogLine_c LogLine = mParent.mLogger << setloglevel(LogLevel_SideEffects);
        if (LogLine.good()) {
            LogLine << SideEffectIndent << "state-change " << *this << " <== " << HexPrinter(aValue) << " (" << DecPrinter(aValue,0) << ")" << std::endl;
        }
        mParent.mState.A[mRegIdx] = aValue;
        return *this;
    }
    operator CAddr_t() const {
        if (mRegIdx == 0 && mHasSpecialValue) {
            return CAddr_t(mSpecialValue);
        } else {
            return mParent.mState.A[mRegIdx];
        }
    }
    virtual void Print(std::ostream &aStream) const {
        if (mRegIdx == 0 && mHasSpecialValue) {
            aStream << DecPrinter(mSpecialValue,0);
        } else {
            aStream << "A" << DecPrinter(mRegIdx,0);
        }
    }
protected:
    size_t mRegIdx;
    Cpu_c &mParent;
    bool mHasSpecialValue;
    CAddr_t mSpecialValue;
};
The reason I’ve done this is to have the ability to log register changes. This has proved an invaluable tool many, many times for tracking down tricky bugs and behaviors. In this case it can pay big dividends again: since all assignments and accesses are wrapped, the appropriate truncation can be done inside this class. Best of all, this class knows about the CPU (the mParent field above) and thus has access to the current operating mode (24- or 32-bit addressing). This does not take care of all cases (most importantly, it doesn’t handle sign-extension properly), but it deals with the bulk of the problems.
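The idea of truncating inside the wrapper can be sketched with a heavily simplified stand-in for the helper class. This is not the simulator's actual code: CpuState, ARegRef, AddressMask, WriteThenRead and the extendedAddressing flag are hypothetical names, logging is left out, and plain 32-bit integers take the place of CAddr_t.

```cpp
#include <cstdint>

// Hypothetical, stripped-down CPU state: eight A registers plus the
// per-process addressing mode taken from the exchange packet.
struct CpuState {
    std::uint32_t a[8] = {};
    bool extendedAddressing = true;  // true: 32-bit mode, false: 24-bit X-mode
};

// Mask selecting the bits of an A register that exist in the current mode.
constexpr std::uint32_t AddressMask(const CpuState& state) {
    return state.extendedAddressing ? 0xFFFFFFFFu : 0x00FFFFFFu;
}

// Proxy for one A register. Every write and every read goes through it, so
// the mode-dependent truncation lives in exactly one place.
class ARegRef {
public:
    constexpr ARegRef(CpuState& state, unsigned index)
        : mState(state), mIndex(index) {}

    constexpr ARegRef& operator=(std::uint32_t value) {
        mState.a[mIndex] = value & AddressMask(mState);
        return *this;
    }

    constexpr operator std::uint32_t() const {
        return mState.a[mIndex] & AddressMask(mState);
    }

private:
    CpuState& mState;
    unsigned mIndex;
};

// Small helper for demonstration: write a value to A1 in the given mode and
// read it back through the proxy.
constexpr std::uint32_t WriteThenRead(bool extendedAddressing,
                                      std::uint32_t value) {
    CpuState state{};
    state.extendedAddressing = extendedAddressing;
    ARegRef a1(state, 1);
    a1 = value;
    return a1;
}
```

Writing 0x12345678 in 24-bit mode reads back as 0x00345678, while in 32-bit mode the value survives intact. Sign-extension into the 64-bit S registers is a separate concern that a mask alone does not cover.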
I was also lucky enough that, even though I don’t have a detailed instruction set description for the Y-MPs, I have a summary of the ISA that lists all the important differences between the two operating modes.
So, it turns out it was only a few hours of work to get the simulator modified. Of course not without errors. I’ve found myself in a similar situation as before: init starts, but somehow hangs. What could be the problem this time?
A quick look at the syscall logs reveals that the last program executed was ‘sh’. Enabling logging shows that we’re getting stuck in this loop:
0x0000269F:p0 (0x0000269F:p0) 0024703 - A7 B3 |
0x0000269F:p1 (0x0000269F:p1) 0127700:0000165 - S7 [0x00000075,A7] |
0x0000269F:p3 (0x0000269F:p3) 0024602 - A6 B2 |
0x000026A0:p0 (0x000026A0:p0) 0126600:0000015 - S6 [0x0000000D,A6] |
0x000026A0:p2 (0x000026A0:p2) 0042577 - S5 1 |
0x000026A0:p3 (0x000026A0:p3) 0015400:0000010 - A5 16777224 | ******* X-mode ONLY ********
0x000026A1:p1 (0x000026A1:p1) 0060115 - S1 S1+S5 @ Increment S1 (loop counter)
0x000026A1:p2 (0x000026A1:p2) 0030656 - A6 A5+A6 |
0x000026A1:p3 (0x000026A1:p3) 0024502 - A5 B2 |
0x000026A2:p0 (0x000026A2:p0) 0135700:0000010 - [0x00000008,A5] S7 |
0x000026A2:p2 (0x000026A2:p2) 0135600:0000011 - [0x00000009,A5] S6 |
0x000026A3:p0 (0x000026A3:p0) 0075100 - T0 S1 @ Save S1 (loop counter)
0x000026A3:p1 (0x000026A3:p1) 0007000:0114611 - R 0x00002662:p1 @ calling 'nextpath'
0x000026A3:p3 (0x000026A3:p3) 0024704 - A7 B4 |
0x000026A4:p0 (0x000026A4:p0) 0071717 - S7 +A7 | Assign with sign-extension
0x000026A4:p1 (0x000026A4:p1) 0024602 - A6 B2 |
0x000026A4:p2 (0x000026A4:p2) 0136100:0000015 - [0x0000000D,A6] S1 |
0x000026A5:p0 (0x000026A5:p0) 0074100 - S1 T0 |
0x000026A5:p1 (0x000026A5:p1) 0046017 - S0 S1\S7 | S0 = S1 xor S7
0x000026A5:p2 (0x000026A5:p2) 0015000:0115174 - JSN 0x0000269F:p0 | Jump if S0 != 0
The loop counter is S1 and the terminal count seems to be in S7. What does the trace say about the loop test at the end?
6930648 CPU0: XA:0x1B exec 0x00889CA3:p3 (0x000026A3:p3) A7 B4 | state-change A7 <== 0x00FFFFFF (16777215)
6930648 CPU0: XA:0x1B exec 0x00889CA4:p0 (0x000026A4:p0) S7 +A7 | state-change S7 <== 0x0000000000FFFFFF (16777215)
6930648 CPU0: XA:0x1B exec 0x00889CA4:p1 (0x000026A4:p1) A6 B2 | state-change A6 <== 0x00003A93 (14995)
6930648 CPU0: XA:0x1B exec 0x00889CA4:p2 (0x000026A4:p2) [0x0000000D,A6] S1 | mem write 0x00003AA0 (0x008824A0) value: 0x0000000000000000 '........'
6930648 CPU0: XA:0x1B exec 0x00889CA5:p0 (0x000026A5:p0) S1 T0 | state-change S1 <== 0x000000000066CBC2 (6736834)
6930648 CPU0: XA:0x1B exec 0x00889CA5:p1 (0x000026A5:p1) S0 S1\S7 | state-change S0 <== 0x000000000099343D (10040381)
6930648 CPU0: XA:0x1B exec 0x00889CA5:p2 (0x000026A5:p2) JSN 0x0000269F:p0 | Jump if S0 != 0
So, S1 was 0x66cbc2 and S7 was 0xffffff. That’s an awfully long loop, since S1 gets incremented by 1 in every iteration. So, how did S7 end up with that strange value? It was assigned in the ‘S7 +A7’ instruction. A7 was 0xffffff, but this assignment is supposed to sign-extend! So S7 should have become 0xffffffffffffffff, not its current value. Clearly I’ve screwed up the simulation of this instruction. Not that the correct value helps in this particular case; it would just make the loop even longer. But this instruction might have been executed (incorrectly) many times before, so I can’t really trust this execution state anymore. I have to fix the bug and re-run. That’s exactly what I’ve done and (drum-roll):
I have boot!
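The sign-extension bug described above comes down to which bit is treated as the sign bit. A minimal sketch, assuming the instruction extends from the top bit of the current address width (24 bits in X-mode, 32 bits otherwise); SignExtend is an illustrative helper, not the simulator's code.

```cpp
#include <cstdint>

// Sign-extend the low `bits` bits of `value` to a full 64-bit word.
// Precondition: 1 <= bits <= 64.
constexpr std::uint64_t SignExtend(std::uint64_t value, unsigned bits) {
    const std::uint64_t mask =
        (bits >= 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << bits) - 1);
    const std::uint64_t signBit = std::uint64_t{1} << (bits - 1);
    const std::uint64_t low = value & mask;
    // Flipping the sign bit and then subtracting it propagates the sign into
    // the upper bits; unsigned arithmetic wraps, so this is well defined.
    return (low ^ signBit) - signBit;
}
```

Extending 0xFFFFFF from 24 bits gives 0xFFFFFFFFFFFFFFFF, the value S7 should have received. Extending the same value from 32 bits gives 0x0000000000FFFFFF, which is exactly the wrong value that shows up in the trace.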
What’s next?
This is a good place to end this post. I have both UNICOS 6 and 7 booting at least into the minimal root FS in RAM.
The fact that these executables are actually compiled for X-mode (that is, X-MP compatible) opens up a very interesting experiment. You see, I have a partial copy of UNICOS 4.0 for the X-MP. The problem with that has always been that I don’t have a root FS for it. I have the kernel, I have a config file, I even have the content of the /usr partition, but nothing for root. That’s a problem of course, as many of the crucial utilities, like ‘init’ and ‘sh’, live there. Booting without these is next to impossible. But now I have a chance: I can marry the root FS (and its utilities) from UNICOS 6 with the kernel and /usr from UNICOS 4 and get to a more or less functional system. It won’t be perfect (for example, mkfs and fsck won’t work for sure, and some tools that rely on new syscalls or since-fixed kernel bugs are not going to work), but it’s way more than what I have now. Best of all, if it works, this is for the X-MP, my original project goal. Finally, it’s not inconceivable that UNICOS 6 contains not only X-MP style binaries, but also a toolset to compile such binaries. If that’s the case, I can even attempt generating new binaries for the system.
Exciting times for sure. Here’s to a happy 2017!