The UNICOS File-system


Previously I’ve gotten as far as getting the main console more or less working and UNICOS checking in. After cleaning up some garbage on the screen due to some misunderstanding of how the sequence number and acknowledge system worked for the terminal, I have a much cleaner picture now:

Looking at the screen, there are two ominous warnings: one that the system apparently ‘cannot open root with disk inode’ and the other reporting the file-system to be full. I’ll let the first one slide for now, as the system apparently still boots, and look at the second problem: why is the file-system full and what to do about it?

Turning on logging of disk activity and looking at what’s going on, it’s pretty easy to identify the culprit: the OS tries to write the full kernel memory content onto the disk (sort of like a core-dump). But why would it dump the core? If the kernel had actually crashed, it wouldn’t have gotten as far as it did, and if it wasn’t the kernel that crashed, the dump wouldn’t be of the full memory, only of the process’s memory space.

Maybe that’s just what UNICOS does: it dumps the memory on boot for fun and giggles.

However, that possibility brings up the next problem: the file-system I have (ram_fs) clearly isn’t big enough to hold the dump.

(Update: As I found out much later, UNICOS only dumps the kernel memory space if it’s booting off of an IOS-attached disk as opposed to a ram_fs. In this case, however, I’ve emulated a virtual hard drive containing the ram_fs image, which confused the OS. Nevertheless, I didn’t figure out until much later how to even boot off of a RAM drive, let alone that this was the problem.)

Whatever the reason is, the solution seems to be to figure out how to re-size an existing file-system. This might turn out to be hard to do, but I have to reverse-engineer the file-system anyway: at the moment I can only exchange files with the OS running in the simulator through creating a FS on the host and mounting it inside the simulator. And exchanging files I must if I want to install the full OS: I need to transfer the install media.

So, how should I go about it?

Let’s go back to our trusted source of information: /usr/include/sys. There’s actually a whole directory here dealing with file-system stuff:

Of primary interest are nc1filsys.h, nc1ino.h and ncdir.h. They contain (in order) the layout of the super-block, something called the dynamic block, the inode structure and the directory structure.

OK, that was a mouthful. There are many things to tease out here, so let’s start!

The structure of the file-system is based on the traditional UNIX approach, but there are a few key differences. The whole file-system thinks of the disk as a series of 4kByte blocks. These blocks conveniently map to sectors on the hard drives used in the J-90, but that’s not necessarily a requirement. A file-system on the machine could be spread across multiple ‘partitions’ on multiple drives and could use various striping configurations, though those details are not terribly important for a SW simulator. The mapping of sector-ranges to partitions and file-systems is part of the parameter file, though some of this information is duplicated in the file-system itself. (In other words, there’s no partition table on the hard drives.)

Blocks are numbered consecutively, starting at 0, through all the partitions that constitute a file-system. At least I think so: I’ve seen sections of code that seem to iterate through all the partitions and subtract what appears to be the partition size to determine the physical sector corresponding to a logical block.
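To make that guess concrete, here is a minimal sketch of the iterate-and-subtract mapping as I understand it. The function name and the plain list of partition sizes are made up for illustration; the real kernel code works on the partition descriptors, and striping is ignored entirely.

```python
from typing import List, Tuple


def locate_block(logical_block: int, partition_sizes: List[int]) -> Tuple[int, int]:
    """Map a file-system-wide block number to (partition index, block in partition).

    partition_sizes is a hypothetical list of partition lengths, in blocks,
    in the order the partitions make up the file-system.
    """
    if logical_block < 0:
        raise ValueError("block numbers start at 0")
    remaining = logical_block
    for index, size in enumerate(partition_sizes):
        if remaining < size:
            return index, remaining
        # Not in this partition: subtract its size and try the next one.
        remaining -= size
    raise ValueError("block number is beyond the end of the file-system")
```

For a single-partition file-system, which is all the simulator needs, this collapses to the identity mapping.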

The most important information about the file-system is stored in the super-block. The primary copy of it is at block 1, with several copies sprinkled around the drive. This structure contains all the (more or less) static information about the file-system. The frequently changing info (like last mount time, locking, number of free inodes, etc.) is factored out into the dynamic block (nc1dblock). The dynamic block is also one block large and its location is recorded in the super-block.

These two together describe the file-system layout but not the content. For that, we’ll need a set of inodes and something called the FREEMAP. Each inode describes one entity (the content of a file or a directory) on the file-system. It is the key structure from which the blocks containing the content can be accessed. An inode entry is 256 bytes long, so 16 of them fit in a block. A set of blocks is set aside for inode storage when the file-system is created. These regions are described in the super-block for each partition that constitutes the file-system.

Given an inode number, its block offset can be determined by dividing the number by 16. This block offset can then be used in an iteration through the inode allocation regions in the super-block to convert it to an absolute block number. Within the block, the inode number modulo 16 (multiplied by 256) provides the offset of the struct.
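The lookup can be sketched in a few lines. This is only an illustration: the inode regions are passed in as a plain list of (first block, block count) pairs, which is a stand-in for whatever the super-block actually stores per partition, and the function name is my own.

```python
from typing import List, Tuple

BLOCK_SIZE = 4096                              # bytes per file-system block
INODE_SIZE = 256                               # bytes per on-disk inode
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE    # 16


def locate_inode(inode_number: int,
                 inode_regions: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (absolute block number, byte offset inside that block).

    inode_regions is a hypothetical list of (first_block, block_count)
    pairs, in the order the regions appear in the super-block.
    """
    block_offset, slot = divmod(inode_number, INODES_PER_BLOCK)
    for first_block, block_count in inode_regions:
        if block_offset < block_count:
            return first_block + block_offset, slot * INODE_SIZE
        # The inode lives in a later region: skip this one.
        block_offset -= block_count
    raise ValueError("inode number is beyond the allocated inode regions")
```

With a single region starting at block 10, inode 2 (the root) lands in block 10 at byte offset 512.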

The root inode number is two in all UNIX systems. UNICOS apparently has the feature of changing that default (there’s a field for that in the super-block), but I decided to not mess with it.

Contrary to the original UNIX file-system design, there’s no free block list. Instead, a bitmap is stored on the hard-drive, which records the state of each block on the file-system: 0 for free, 1 for occupied. This structure is called the FREEMAP, and its location and size are recorded in the super-block (s_mapoff and s_mapblks fields).
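Testing and setting a block in such a bitmap is a one-liner each. The sketch below assumes that block 0 corresponds to the most significant bit of the first byte; that bit order is my assumption for illustration and is not something stated above. The helper names are invented too.

```python
def is_block_used(freemap: bytes, block: int) -> bool:
    """Return True if the bitmap marks the given block as occupied (bit = 1)."""
    byte_index, bit_index = divmod(block, 8)
    # ASSUMPTION: bits are numbered from the most significant bit of each byte.
    mask = 0x80 >> bit_index
    return bool(freemap[byte_index] & mask)


def mark_block_used(freemap: bytearray, block: int) -> None:
    """Set the bit for the given block, marking it occupied."""
    byte_index, bit_index = divmod(block, 8)
    freemap[byte_index] |= 0x80 >> bit_index  # same assumed bit order as above
```

If the bit order turns out to be the other way around, only the two mask expressions need to change.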

Theoretically this information is not strictly necessary: one can iterate through all inodes, record all the allocated blocks, and what’s not allocated is, by definition, free. This is a lengthy process though, so understandably the OS caches the result. The fsck utility, among other things, checks and fixes any inconsistencies between the inodes and the FREEMAP.
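As a rough illustration of that consistency check (not of what fsck actually does), the sketch below rebuilds the set of used blocks from per-file extent lists and compares it against a bitmap given as a list of booleans. It deliberately ignores the blocks holding metadata such as the super-block, the inode regions and the bitmap itself, and all names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Extent = Tuple[int, int]  # (start block, block count)


def blocks_in_use(files: Iterable[List[Extent]]) -> Set[int]:
    """Collect every block referenced by any file's extent list."""
    used: Set[int] = set()
    for extents in files:
        for start, count in extents:
            used.update(range(start, start + count))
    return used


def freemap_mismatches(freemap: List[bool],
                       files: Iterable[List[Extent]]) -> List[int]:
    """Return the block numbers where the bitmap disagrees with the inodes."""
    used = blocks_in_use(files)
    return [block for block, flag in enumerate(freemap)
            if flag != (block in used)]
```

An empty result means the cached bitmap and the inodes agree, at least for file data.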

Inodes

As we’ve discussed, UNIX – pretty much all flavors of it – represents the content of every file (or directory) with an inode. The inode structure contains the list of blocks corresponding to the file. This structure is rather hairy, but the main use-case is fairly easy to grasp: the allocations for the file are held in an array of 8 entries: cdi_addr. Each allocation is a contiguous extent of sectors, so each entry has a start block and a block-count part. I’m sure for highly fragmented file-systems, indirect inodes also exist (when the 8 entries in the inode are insufficient to describe the whole file) but I didn’t bother figuring out how that works: due to the extent-based allocation, it’s pretty difficult to set up a scenario when 8 entries are insufficient. It certainly won’t be a problem for a FS created from scratch on the host.
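To show how an extent list is used, here is a small sketch that translates a block index within a file into a block number on the disk. The extents are modelled as (start block, block count) pairs; the actual bit layout of the cdi_addr entries is not reproduced here, and the function name is my own.

```python
from typing import List, Tuple

MAX_DIRECT_EXTENTS = 8  # number of entries in the inode's extent array


def file_block_to_disk_block(extents: List[Tuple[int, int]],
                             file_block: int) -> int:
    """Translate the n-th block of a file into an absolute block number.

    extents is a hypothetical decoded form of the inode's extent array:
    a list of (start_block, block_count) pairs, in file order.
    """
    if len(extents) > MAX_DIRECT_EXTENTS:
        raise ValueError("more extents than fit in one inode")
    remaining = file_block
    for start, count in extents:
        if remaining < count:
            return start + remaining
        # The block lies in a later extent.
        remaining -= count
    raise IndexError("file block is past the end of the file")
```

A file written in one go onto an empty file-system needs only a single extent, which is why the 8-entry limit is rarely a concern for images built on the host.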

Inodes also contain the access permissions and time-stamps for creation, modification etc. These details are not terribly important or interesting for the moment. The only thing to note is that UNICOS, on top of supporting the traditional UNIX-style permissions, has a whole new and different permission system. If it is enabled by default, I’m in trouble: I’ll have to figure out what the related fields mean. However, there’s no indication that it’s on by default.

Directories

Inodes are only capable of describing the content of something. To make the FS useful, we need to give names to these content ‘blobs’ and organize them. This is what directories achieve: they associate a file name with its content, that is, an inode.

So how are directories stored? Of course in an inode! While for normal files the content of the blocks the inode references is ‘just a bunch of bytes’ as far as the OS is concerned, for directories the format is defined: it is a set of cdirect entries. These entries are not much more than a mapping between a name and an inode (which then describes the content), with one important exception: there’s a field called cd_signature. After some debugging I realized that this field is a hash of sorts of the file-name. But what kind? There are so many to choose from. The only way to figure that out was to look at the instruction traces for the kernel trying to access a directory entry on the hard drive. From that work, the following algorithm emerged:

The code is a bit hacky, but does the job. The memcpy is needed to make sure that the file-name is zero-padded to 64-bit boundaries, and the SwapBytes call is there to rectify the endianness differences between the host (x86) and the target (Cray).
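The two preparation steps mentioned here, zero-padding and byte order, can be sketched on their own. This is not the signature algorithm itself, only the step that turns a name into 64-bit words; the function name is made up, and it assumes the first character of the name ends up in the most significant byte of the first word.

```python
import struct
from typing import List


def name_to_words(name: bytes) -> List[int]:
    """Zero-pad a file-name to a multiple of 8 bytes and return it as 64-bit words.

    ASSUMPTION: words are read big-endian, so the first character sits in
    the most significant byte. Reading them this way plays the role of the
    memcpy plus byte swap needed on a little-endian host.
    """
    padded_len = -(-len(name) // 8) * 8          # round up to a multiple of 8
    padded = name.ljust(padded_len, b"\0")       # zero-pad on the right
    return list(struct.unpack(">%dQ" % (padded_len // 8), padded))
```

Whatever mixing function produces cd_signature would then operate on the returned words.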

There are several other details of course that I haven’t figured out, but this is enough to implement a very basic file-system manipulation utility: one that can create a passable virtual hard drive, with a single partition on it, that contains a single file-system. The utility can also create files in the root directory of that file-system and copy their content from files on the host OS.

This utility created a functional, albeit one-way, communication channel between the host PC and the simulated target. It’s imperfect but good enough for the job. It could be extended to be more complete, potentially even to the point where the host can mount Cray FS (NC1FS) volumes, but that’s a lot of work for not much value. It would be way more interesting to bring networking up, but I’ll save that for a later post.

Back to the top

So where were we? The original problem I wanted to solve was that the file-system gets full with the OS trying to create a memory dump on a FS that’s clearly not large enough to hold one. So, armed with all this knowledge about the FS structure, what can we do?

Interestingly, the size of the FS is really only stored in a few places: the s_fsize member of the super-block, the fd_nblk field of the partition descriptors and the size of the FREEMAP (bmp_total field). Changing the first two fields is not a big deal, but changing the size of the FREEMAP is problematic: it can’t easily grow beyond the size of the block(s) it occupies. Luckily a single block (4kByte) worth of bitmap, which is the smallest allocation unit, supports disks up to 128MBytes in size, a significant extension over the 48MBytes of the initial RAM FS. So really, all it takes is patching up two or three fields to resize the partition to 128MBytes, which provides enough room for creating the dump and still leaves some extra space. Problem solved!
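The 128MByte figure follows directly from the block size: one 4kByte block holds 32768 bits, and each bit stands for one 4kByte block. A quick sanity check, with helper names of my own choosing:

```python
BLOCK_SIZE = 4096              # bytes per file-system block
BITS_PER_BYTE = 8
MBYTE = 1024 * 1024


def max_fs_bytes(map_blocks: int) -> int:
    """Largest file-system, in bytes, that a bitmap of map_blocks blocks can cover."""
    bits = map_blocks * BLOCK_SIZE * BITS_PER_BYTE   # one bit per data block
    return bits * BLOCK_SIZE


def blocks_for(size_bytes: int) -> int:
    """Number of file-system blocks in a file-system of the given size."""
    return size_bytes // BLOCK_SIZE


print(max_fs_bytes(1) // MBYTE)    # size covered by one bitmap block, in MBytes
print(blocks_for(48 * MBYTE))      # blocks in the original 48 MByte RAM FS
print(blocks_for(128 * MBYTE))     # blocks after the resize
```

So the original 48MByte image uses only 12288 of the 32768 bits available in its one-block bitmap, which is why the resize needs no changes to the bitmap’s own allocation.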

Are we done?

Yes, yes we are. I’ll stop this rather boring wall of text here. The next one, I promise, will be much more interesting.