General notes on raidSim
Mark Holland, 12-14-94

I. Intro

RaidSim is a disk array simulator that supports various flavors of RAID levels 0, 1, 3, 4, and 5, and parity declustering. It came out of the RAID project at Berkeley, but we've modified it extensively here at CMU to support parity declustering, floating data/parity, atomic RMWs in RAID5 small writes, disk-oriented reconstruction, and a few other features. The guts of raidSim were once the RAID driver in the Sprite operating system, which the Berkeley folks cut out and bolted into a simulation system.

I'm sorry to say that raidSim is very complex and difficult to understand. These notes, most of which apply to our CMU versions only, are intended to get you started both running it and hacking on it. There is really no way to get meaningful results out of raidSim without having studied the code at least a little bit.

II. Compiling

After decompressing and de-tarring the distribution file, copy the files from the subdirectory .md to the main raidSim source directory. RaidSim currently runs on alphas, suns, and DECstations (pmaxen).

If you're trying to run on another machine, the main thing you need to do is get setjmp/longjmp working. You should be able to use the library versions of these functions (libc.a). Then, you need to port the code in coproc.c and coproc.h to your machine type. There are just two routines here: CreateCoproc2 and SwitchCoproc. SwitchCoproc should not really require any porting. CreateCoproc2 is very simple: it creates a co-routine by copying a chunk of the calling co-routine's stack to the newly-created co-routine's stack, and then initializes a few registers (e.g. the SP and FP) in the context area, so that the new co-routine can be switched to.

After you've got this machine-dependent code set up, you should be able to do a "make depend" (relies on the existence of makedepend) to set up the makefile properly, and then a "make" to build the raidSim executable.

III. Running

One almost always runs raidSim via a "runtest" script, an example of which is included in the "test" subdirectory. More on this later.

Our version of raidSim uses two csh environment variables:

  DISKDB is the name of the disk geometry database file, e.g. /usr/users/holland/lib/disk.db.new. An example is provided with the source code. This file describes the geometry of all the disks that raidSim can simulate.

  BDDIR is the directory where the block design files live, e.g. /usr/users/holland/lib/bds. You need this database only if you're going to run simulations of declustered-parity RAIDs. If you're just going to simulate one of the RAID levels, you don't need it.

Additionally, the "runtest" script (described below) uses an env variable RSDIR, which gives the path name where the raidSim executable lives, e.g. /usr/users/holland/bin.

Make sure to set these variables before trying to run raidSim, or it will probably dump core.

Once you've set these up, cd to the test directory and run the "runtest" script. It should spew a bunch of stuff to the screen, and when it completes you should have a file called "mult.out" in the local directory. This contains all results of the simulation. I typically write awk scripts to parse this output and compute the actual numbers I'm interested in. mult.out also contains a bunch of useless lines that clutter it up and make it hard to read, so I use the following to clean it up:

  alias mclean 'egrep -v "^Disk|^Device|^\*\*\*\*|^RAID:REQ_ERR" mult.out > /tmp/f$$; rm -f mult.out; mv /tmp/f$$ mult.out'

runtest also leaves a few other files in the local directory. tmult.out is a temp file that you can delete. error.out contains a one-line summary of each error that occurred while raidSim was running. scriptFile and RAID1.config are described below.

In addition to the disk database, the raidSim executable requires two files. The first, which must be called RAID1.config, describes the configuration of the array you want to simulate.
The first non-comment line contains the array parameters. The first eight of these parameters are:

  numRow          -- the number of rows of disks, each a distinct parity group
  numCol          -- the number of columns of disks in each row, forming each parity group
  logBytesPerSect -- the base-2 log of the number of bytes per sector. This should always be 9 to specify 512 bytes/sector.
  sectPerSU       -- the number of sectors in a stripe unit
  SUPerDisk       -- the size of each disk, in stripe units
  rowsPerGroup    -- set this to 1 and ignore it. No longer supported.
  SUsPerPU        -- set this to 1 and ignore it unless you're _really_ interested in reconstruction under parity declustering. If this is the case, you have my sympathy, and you should read through the reconstruction code to know how to use it.
  parityConfig    -- one character that specifies the layout you want to use:
                       L == RAID level 5, left-symmetric variant
                       N == RAID level 4
                       S == RAID level 0
                       T == parity declustering (a.k.a. Clustered RAID)

RaidSim does support some other layouts, but I personally would not recommend using any of them, and disavow all responsibility for the correctness of the implementations and the meaningfulness of any performance results you might get from them.

Subsequent parameters on the configuration line are layout-specific. The only one you really need to know about is in the 'T' layout, in which the next (and last) field on the line is the name of the block design file, which must be located in the directory given by the BDDIR env variable.

The rest of the lines in RAID1.config contain major-minor number pairs for the disks in the system. These pairs are a holdover from raidSim's days as a Sprite device driver, and aren't used for anything important in simulation, but they must be there or raidSim will refuse to initialize. There must be at least as many pairs as disks that you want to simulate; extra pairs are allowed.

RAID1.config is created by the runtest file, so you don't have to fill it in by hand.
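
To make the meaning of the eight leading parameters concrete, here is a minimal C sketch that reads them from a line and computes the raw array capacity they imply. This is not raidSim's own parser: the names ArrayParams, parse_array_params, and raw_capacity_bytes are hypothetical, and the sketch assumes the fields are separated by whitespace, which these notes do not state.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for the first eight RAID1.config parameters.
 * The struct and function names are illustrative, not raidSim's. */
typedef struct {
    int  numRow;          /* rows of disks */
    int  numCol;          /* disks per row */
    int  logBytesPerSect; /* 9 means 512 bytes/sector */
    int  sectPerSU;       /* sectors per stripe unit */
    long SUPerDisk;       /* stripe units per disk */
    int  rowsPerGroup;    /* set to 1 */
    int  SUsPerPU;        /* set to 1 */
    char parityConfig;    /* layout character, e.g. 'L' */
} ArrayParams;

/* Parse the eight leading fields, ASSUMING they are whitespace-separated.
 * Returns 0 on success, -1 if fewer than eight fields could be read. */
int parse_array_params(const char *line, ArrayParams *p)
{
    assert(line != NULL && p != NULL);
    int n = sscanf(line, "%d %d %d %d %ld %d %d %c",
                   &p->numRow, &p->numCol, &p->logBytesPerSect,
                   &p->sectPerSU, &p->SUPerDisk, &p->rowsPerGroup,
                   &p->SUsPerPU, &p->parityConfig);
    return (n == 8) ? 0 : -1;
}

/* Raw capacity in bytes: disks * stripe units per disk * sectors per
 * stripe unit * bytes per sector. Parity overhead is not subtracted. */
unsigned long long raw_capacity_bytes(const ArrayParams *p)
{
    unsigned long long disks = (unsigned long long)p->numRow * p->numCol;
    unsigned long long bytesPerSect = 1ULL << p->logBytesPerSect;
    return disks * (unsigned long long)p->SUPerDisk * p->sectPerSU * bytesPerSect;
}
```

With made-up values such as one row of five disks, 8 sectors per stripe unit, and 1000 stripe units per disk, the raw capacity works out to 5 * 1000 * 8 * 512 bytes.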
The second file required by raidSim, which can have any name but which is typically called scriptFile, contains a description of the workload that you want to run. Each line takes one of two forms. The first form is a list of fields, of which the trailing ones are optional (the example below uses only the first four). In order, the fields are:

  1. the fraction of the total workload that this line describes
  2. an 'r' or a 'w' (without the single-quotes) for a read or a write
  3. the access size in KB
  4. the access alignment in KB
  5. a character describing the access size distribution:
       'd' means deterministic (always the given access size)
       'e' means exponentially distributed with mean equal to the given access size
  6. the probability that this access is within the "local region"
  7. the fraction of the array's data space defining the local region
  8. the offset into the array of the start of the local region

The second form is a value followed by the letter s. If the scriptFile contains a line in this form, it means that with the given probability the next access selected by any given process will be sequential with respect to the previous access, whatever it happened to be. There can be only one such line in any given scriptFile.

For example, the following script file says that I want to run a 50/50 read/write workload using random 8k accesses that are 8k aligned:

  50 r 8 8
  50 w 8 8

In addition, one can specify a trace file instead of a scriptFile. In this case, raidSim will read the traces out of the file instead of generating a synthetic workload based on a script. A trace file consists of a header and then a bunch of trace records as defined in rst.h. The header contains the number of independent processes in the trace, the number of traces for each process, and the file offsets for each trace.

The command line format to raidSim is:

  raidSim [options] scriptFile 0 numProc

where scriptFile is the name of either a script or trace file (trace files must end in ".rst"), the zero is historical and is ignored, and numProc is the number of concurrent processes that run accesses against the array. If the script file is actually a trace file, numProc is ignored and the actual number of processes is read out of the trace file header.
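
The way a script file turns into a stream of accesses can be illustrated with a small C sketch. It is not the code in script.c. It only shows one plausible way to pick a script line in proportion to its workload fraction and to force an offset onto the requested alignment. The names ScriptLine, select_line, and align_down_kb are hypothetical, and the caller is assumed to supply a uniform random number u in [0,1).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory form of the first four fields of a script line. */
typedef struct {
    double fraction; /* share of the total workload, e.g. 50 */
    char   op;       /* 'r' or 'w' */
    int    sizeKB;   /* access size in KB */
    int    alignKB;  /* access alignment in KB */
} ScriptLine;

/* Pick a line index in proportion to each line's fraction.
 * u is a uniform random number in [0,1) supplied by the caller. */
int select_line(const ScriptLine *lines, int n, double u)
{
    assert(lines != NULL && n > 0 && u >= 0.0 && u < 1.0);
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += lines[i].fraction;

    double target = u * total; /* point on the cumulative scale */
    double acc = 0.0;
    for (int i = 0; i < n; i++) {
        acc += lines[i].fraction;
        if (target < acc)
            return i;
    }
    return n - 1; /* guard against floating-point round-off */
}

/* Round a KB offset down to a multiple of the alignment. */
long align_down_kb(long offsetKB, int alignKB)
{
    assert(offsetKB >= 0 && alignKB > 0);
    return offsetKB - (offsetKB % alignKB);
}
```

For the two-line example script, any u below 0.5 selects the read line and any u at or above 0.5 selects the write line, and a randomly drawn offset of 21 KB would be rounded down to 16 KB to honor the 8k alignment.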

RaidSim takes a whole lot of command line options. Many of them represent ideas that I put into raidSim only to discover that they were dead ends. Consequently there are a bunch of flags that are essentially useless to anyone but myself, and I haven't included these in this list. The full set of options is defined in set_options() in main.c. For the options that take args, "s" = string, "n" = integer, and "f" = float.

  -b      : cause recon proc to stall instead of skip when recon is blocked [default]
  -db s   : specify the disk database path name
  -dt s   : specify the disk type as defined in the disk database file
  -do n   : n==1 => set disks to random rotational offset (de-sync spindles)
  -ds n   : n==1/0 => enable/disable distributed sparing (declustering only)
  -e n    : set number of floating reconstruction buffers
  -b f    : set relative error bound. In fault-free and degraded mode, raidSim runs until the error margin on the response time has fallen to this fraction of the mean.
  -hs n   : set maximum allowable head separation (sectors) for reconstruction
  -m n    : set minimum number of I/Os raidSim will run prior to termination
  -ma n   : set max num allowed asynchronous trace entry processes
  -p n    : n==1/0 => enable/disable prioritization of user accs over recon accs
  -f      : fail a disk before starting (set degraded mode)
  -fR     : run in reconfigured mode (declustering + dist sparing only)
  -fr     : force reconstruction on read accs
  -l f    : set user access rate, in accs/sec/disk
  -ld l   : trace all I/Os to the specified set of disks.
            l = a list of disk ids
  -nr     : suppress rotation of parity column (declustering only)
  -nn     : disallow floating to next sequential block (floating d/p only)
  -na     : interpret all accs in trace file as synchronous
  -P n    : n==1 => enable piggybacking during reconstruction
  -PO n   : n==1 => enable monitor piggybacking
  -qf     : use FIFO disk queueing instead of CVSCAN
  -R n    : n==1 => enable redirection of reads during reconstruction
  -RO n   : n==1 => enable monitored redirection during recon
  -W n    : n==1 => enable user writes to spare during recon
  -WO n   : n==1 => enable monitored user writes to spare during recon
  -u n    : n==1 => enable logging of disk utilizations to a file
  -w n    : set utilization monitoring window size to n accesses
  -ws n   : n==1 => enable write-on-failed-submit (decl+dist spare only)
  -s n    : set random seed value to n
  -ss n   : n==1 => enable shortest-seek optimization (decl w/ G=2 only)
  -S n    : set stack size for co-routines to n bytes
  -Mt     : enable test mode for Merchant/Yu layout
  -Mp n   : select from a pre-defined set of params for Merchant/Yu layout
  -Ms n   : set seed for linear congruential RNG in M/Y layout
  -D s n  : set debug variable "s" to value "n"
  -r      : reconstruction mode
  -rs     : print reconstruction schedule
  -rt     : enable printing of response time histogram

IV. Code structure

RaidSim has four main components:

  Synthetic Reference Generator
  RAID Striping Driver
  Disk Simulation Module
  Event-Driven Simulator

At the top level of abstraction is a synthetic reference generator. Accesses generated here are sent to the RAID striping driver component, which is by far the largest component of raidSim. The driver component is responsible for breaking down user-level accesses into sets of disk-level accesses, and sequencing through these low-level operations to effect the user-level operation. When the driver component wishes to schedule a physical I/O operation, it invokes the disk simulation component.
This component computes the completion time of the request based on the access and the current disk state, and then invokes the event-driven simulator component to cause the indicated amount of simulated time to pass. The event-driven simulator component, upon receiving a request, de-schedules the calling co-routine and places a wakeup event in the event queue. The event queue is sorted by simulation time. When the wakeup event reaches the front of the event queue, meaning that the indicated amount of simulated time has passed, the co-routine that called into the module is rescheduled and allowed to run.

The best way to get a general feel for how raidSim works is to trace an I/O through the system.

1. main() [main.c] starts up, sets the command line options, and calls InitInit() to read the configuration file and do all initialization.

2. main() invokes RaidSim() [raidSim.c], which forks one co-routine per user-level process, as specified on the command line.

3. Each copy of RaidSim() then goes into a loop where it asks the reference generator for a new access (SelectAction(), ConvertActionToAccess() [script.c]), and submits it to the RAID driver layer in DoIO() [raidSim.c].

4. DoIO() calls lseek() and then read() or write(), which have been #define'd to my_lseek() and myio() [pseudoIO.c]. myio() invokes an access by calling Dev_BlockDeviceIOSync() [devBlockDevice.c].

5. Dev_BlockDeviceIOSync() invokes Dev_BlockDeviceIO() [devBlockDevice.c], which gets the access started by invoking the start routine through the file system op switch [devFsOpTable.c], and then calls Sync_MasterWait() [sync.c] to wait for it to complete.

6. Dev_BlockDeviceIO() invokes either StripeBlockIOProc() or RaidBlockIOProc() [devRaid.c], using the former for non-redundant arrays (e.g. RAID0) and the latter for redundant arrays (e.g. RAID5). From here down we assume that we called RaidBlockIOProc().

7. RaidBlockIOProc() calls InitiateStripeIOs() [devRaidInitiate.c].
InitiateStripeIOs() breaks up the access into individual stripes, and calls InitiateSingleStripeIO() [devRaidInitiate.c] for each stripe.

8. InitiateSingleStripeIO() invokes either InitiateStripeRead() or InitiateStripeWrite() [devRaidInitiate.c], depending on whether the access is a read or a write.

   8a. For reads, InitiateStripeRead() checks to see if the access contains any data residing on a disk that has failed. If so, it invokes InitiateReconstructRead() [devRaidInitiate.c]. If not, it calls InitiateIORequests() [devRaidInitiate.c], which is the primary routine via which one gets physical I/Os started in raidSim.

   8b. For writes, there are a few cases:

       8b.1 No failures, < 1/2 the stripe being written: InitiateStripeWrite() calls InitiateReadModifyWrite(), which causes the write to occur via atomic RMWs on the affected data and parity units.

       8b.2 No failures and > 1/2 the stripe being written, or a data failure exists: InitiateStripeWrite() calls InitiateReconstructWrite(), which causes the write to occur by reading the unaccessed portion of the stripe, computing new parity, and directly overwriting the parity unit.

       8b.3 Parity has failed: invoke InitiateIORequests() to cause the data writes to go out, and ignore the parity.

   All of the above cases eventually filter down to InitiateIORequests(), which gets disk I/Os started. The data structure that you give to InitiateIORequests() contains a function pointer, which is a callback for this set of I/Os. This function gets invoked when all of the I/Os in the set have completed. There are many different callbacks that are used in the different cases. Some of these just finish up the I/O and release the waiting process; others invoke new I/Os, which have different callback functions, which may again invoke new I/Os, ad nauseam. This mechanism of invoking InitiateIORequests() with a callback is the way that raidSim sequences through the sets of physical I/Os that get done for each user-level I/O.

9. For each physical disk request in the input set, InitiateIORequests() calls Dev_BlockDeviceIO() [devBlockDevice.c]. This is actually a recursive call, but in the second call it's provided with different parameters, so it invokes a different set of I/O functions than before. Dev_BlockDeviceIO() invokes the BlockIOProc() in devDisk.c (there are a few different BlockIOProc()'s in raidSim), which does a co-routine fork to allow this I/O to proceed concurrently with the other I/Os in the set. At the call to BlockIOProc(), the I/O has dropped out of the RAID driver component and entered the disk component.

10. After the fork(), BlockIOProc() [devDisk.c] invokes either DoRead() or DoWrite() [devDisk.c].

11. DoRead() and DoWrite() then enter the disk queue for the indicated disk. This is done in a way that is very difficult to follow in the code. There is a generic "resource" module in raidSim [resource.c], which one can invoke to control access to some resource such as a disk. If the resource is busy when you try to acquire it, the calling co-routine will be de-scheduled and queued waiting for the resource. The queueing routine can be specified at the time the resource is created. There is a queue descriptor structure defined in genqueue.h that describes a queue. When the disk resources are created, a descriptor for either a FIFO queue or a CVSCAN queue is created and installed in each disk resource. So, to put yourself in a disk queue, you create a descriptor called a SchedRequest [cvscan.h] and fill it in with parameters about the disk I/O. You then place a pointer to this SchedRequest into your process descriptor structure by invoking the PutCurProcProp() macro [schedule.h]. Then, you call BeginResourceUse() [resource.c], which, if the resource is busy, will invoke the appropriate enqueue routine. This enqueue routine extracts the descriptor from your process area, and uses the information to put you into the queue at the right place.
BeginResourceUse() then de-schedules the calling co-routine. As co-routines release the disk resources, descriptors are yanked off the disk queue and co-routines are woken up. Therefore, when BeginResourceUse() returns, the indicated co-routine has arrived at the head of the disk queue and can continue.

12. DoRead() and DoWrite() call Access_time() [geometry.c] to compute the amount of time it will take to complete the indicated I/O on the indicated disk. There is also some nonsense in here where we invoke UseBusRead() or UseBusWrite(), but the bus contention module no longer works, and is not that accurate anyway. The bus contention module has been disabled in this version, and I'd strongly discourage anyone from turning it on.

13. After computing the access time, DoRead() and DoWrite() invoke Delay(), which causes the indicated amount of simulation time to pass. This is done by de-scheduling the current co-routine (recall that we did a fork() in step 9 above, so we're not de-scheduling the co-routine that invoked the access), and inserting an event into the event queue