NASD Programmer's Documentation
Cache

Understanding the cache in the drive is important if you intend to modify or extend drive functionality. In addition to maintaining in-core copies of on-disk blocks, the cache mechanism also defines the locking and serialization protocol for many operations.

Structures and definitions used by the cache may be found in nasd_cache.h.

Global state: nasd_odc_state_t

Much of the global state of the cache, and indeed of the drive itself, may be found in the nasd_odc_state_t structure. Drive code may access a global pointer, nasd_odc_state, which is a pointer to a structure of this type.

The disk field of this structure is a copy of the most recent nasd_od_disk_t available for describing this drive. When the disk state is written, as described in the section on I/O modules, it is the contents of this memory which are written.

The nvstate field is a pointer to a nasd_odc_nvstate_t, which is intended to be the contents of the header, if not the entire contents, of the NVRAM available on the drive. Although this structure is supported (and used) in software, at the time this document is written we have not constructed prototypes with NVRAM available for any purpose but boot firmware.

The parts array is an array of NASD_OD_MAXPARTS structures of type nasd_odc_icpart_t, which is a per-partition in-core structure. This structure contains a read/write lock (see threads) which is used to serialize inode-level operations on this partition. Changing the set of inodes in a partition (that is, creating and removing objects) requires holding this lock for writing. Examining the set of inodes in a partition, which includes mapping an inode number to a block number (as with nasd_odc_node_get()) requires holding this lock for reading. This structure also contains several fields related to the operation of the list-of-objects control object, which will be described in the section on control objects.

The section on inodes describes the inode hash table. The npt_sz field of the nasd_odc_state_t structure represents the number of blocks in each copy of the inode hash table. The cr_ind field of the nasd_odc_state_t structure represents a block offset in this table. Each time a new object is created on the disk, this cr_ind value is incremented modulo npt_sz. When the inode creation code starts looking for an empty hash table slot, it starts looking at (first_hash_table_block + cr_ind) % npt_sz. This ensures that a series of inode creations rotates through the blocks of the inode pagetable. The goal is to increase the performance of concurrent creations by avoiding serialization on busy pagetable blocks.
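The rotation described above can be sketched as follows. This is an illustrative model, not the drive's actual code; the structure and function names here are hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical sketch of the creation-index rotation: each inode
 * creation starts its search one hash-table block later than the
 * last, wrapping modulo npt_sz.
 */
typedef struct {
    unsigned npt_sz;  /* blocks in each copy of the inode hash table */
    unsigned cr_ind;  /* rotating creation index */
} fake_state_t;

/*
 * Return the table-relative block offset at which the next inode
 * creation should start searching, then advance cr_ind.
 */
static unsigned next_creation_offset(fake_state_t *st)
{
    unsigned off = st->cr_ind % st->npt_sz;
    st->cr_ind = (st->cr_ind + 1) % st->npt_sz;
    return off;
}
```

Successive calls visit every block of the table before repeating, so concurrent creators tend to land on different pagetable blocks.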

Cache entries: nasd_odc_ent_t

Entries in the block cache have type nasd_odc_ent_t. Among other fields, this contains several locks for protecting different components of the structure, a variety of flags words for use by various subsystems which use or manage cache blocks, sets of pointers for maintaining lists of blocks, and a description of the actual contents of the block. All blocks are represented identically in the cache - there are no separate types for inodes, indirect blocks, direct blocks, etc. The type field of the nasd_odc_ent_t indicates what kind of block this is. Valid values for this field are:
Type                      Meaning
NASD_ODC_T_NODE inode
NASD_ODC_T_IND indirect block
NASD_ODC_T_FREE unused block
NASD_ODC_T_REFCNT block from the refcount region of the disk
NASD_ODC_T_NPT inode pagetable block
NASD_ODC_T_DATA data block of an object
NASD_ODC_T_ANON any block within the data region
NASD_ODC_T_LAYOUT layout bookkeeping block (swappable)
NASD_ODC_T_LAYOUT_STATIC layout bookkeeping block (non-swappable)

The physical block number of the block (within its region) is stored in the blkno field. Additionally, the sector number of the first on-platter sector of this block is stored in real_sectno.

The data field of the nasd_odc_ent_t represents the actual data contents of the indicated block on disk. This field is a union of several pointer types, allowing various drive components to dereference the data as whatever type they need (and avoiding excessive casting within the drive code). Additionally, one of the members of this union (buf) has type char *, and it may be coerced to any other type which may be needed (it is a convention of the drive code that if any such coercion is necessary, it is performed upon the buf element of the union, and no other).
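The convention can be illustrated with a simplified union. Only buf corresponds to the actual field name from the drive; the other member names and types here are hypothetical.

```c
#include <assert.h>

/*
 * Illustrative sketch of the data-union convention: the union offers
 * typed views of a single data page, and any ad-hoc coercion goes
 * through the char * member (buf) only.
 */
typedef struct fake_inode {
    unsigned object_id;
} fake_inode_t;

typedef struct fake_indblk {
    unsigned blkno[128];
} fake_indblk_t;

typedef union {
    char          *buf;   /* canonical member for any further coercion */
    fake_inode_t  *node;  /* view the page as an inode block */
    fake_indblk_t *ind;   /* view the page as an indirect block */
} fake_data_u;
```

A component that knows the block's type simply dereferences the matching member; any type the union does not anticipate is reached by casting buf, never another member.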

Queues

Cache blocks are often stored in queues. These are structures of type nasd_odc_oq_t. Each of these structures contains a mutex (lock), a counter of the number of blocks in the queue (size), and a dummy cache entry of type nasd_odc_ent_t (head). This dummy element has no valid data field. It exists so that blocks in the queue may be stored in doubly-linked lists, and so that adds and removes do not need to special-case the beginning and/or end of the list; the list is circular, and the head element is always present.

The cache maintains two very important queues: the unused queue and the LRU queue. Blocks in the unused queue, nasd_odc_unusedq, are not currently assigned to any purpose. The contents of their data page are completely invalid. Blocks in the LRU queue, nasd_odc_lru, contain valid data. The tail of this queue is the least-recently-used block in the cache.

Queues are initialized by calling nasd_queue_init() with a pointer to the nasd_odc_oq_t to initialize. This function returns a nasd_status_t indicating whether or not the structure was successfully initialized. If the return is NASD_SUCCESS, initialization has completed successfully, and destruction of the mutex has been registered on the nasd_odc_shutdown list.

The mutex of a nasd_odc_oq_t should be taken with NASD_ODC_Q_LOCK() and released with NASD_ODC_Q_UNLOCK(). Both of these operations take as their sole argument a nasd_odc_oq_t *. In addition to managing the mutex, they keep track of who, if anyone, is the current mutex holder. This is useful for debugging purposes.

Different queues use different pointers within the nasd_odc_ent_t structure, so that a single entry may be maintained in multiple queues simultaneously. For each queue that a block may belong to, two pointers, next and prev, are maintained. These names are made unique by prepending a single letter. For example, blocks in the LRU queue are linked on fields named lnext and lprev.

Several other macros manipulate queues as well. NASD_ODC_Q_SIZE() returns the number of elements in a queue, given a pointer to the queue structure. NASD_ODC_Q_DEQ(ent,list) removes ent from a queue, where it was linked with list as the uniquifying character for the next and prev fields. For example, NASD_ODC_Q_DEQ(some_ent,l) removes some_ent from the LRU queue. It is the responsibility of the caller to ensure that some_ent was in the LRU queue in the first place. Likewise, NASD_ODC_Q_INS(some_queue,some_ent,list) inserts some_ent in the queue some_queue, linked with the uniquifier list. NASD_ODC_Q_DEQ_TAIL(queue,ent,list) behaves similarly, except that ent is assigned to be the tail element of queue at the beginning of the operation. Each of these macros takes and releases the queue lock to serialize its operation. If the caller is managing this locking itself, the _NOLOCK variants of the calls should be used. For example, NASD_ODC_Q_INS_NOLOCK() takes the same arguments as NASD_ODC_Q_INS() and performs the same tasks, except that it does not take or release the queue lock.
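One plausible way such uniquifier macros can be built is with preprocessor token pasting. The following is a hedged sketch of the idea, not the drive's actual macro definitions; names are simplified and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Each queue a block may belong to gets its own <letter>next/<letter>prev
 * pair; the macros paste the uniquifying letter onto the field names.
 */
typedef struct blk_s {
    struct blk_s *lnext, *lprev;  /* LRU-queue links */
    struct blk_s *unext, *uprev;  /* unused-queue links */
} blk_t;

typedef struct {
    blk_t head;   /* dummy element: the circular list is never empty */
    int   size;
} queue_t;

#define Q_INIT(q) do { \
    (q)->head.lnext = (q)->head.lprev = &(q)->head; \
    (q)->head.unext = (q)->head.uprev = &(q)->head; \
    (q)->size = 0; \
} while (0)

/* Insert ent at the head of q, linking through the fields named by list. */
#define Q_INS_NOLOCK(q, ent, list) do { \
    (ent)->list##next = (q)->head.list##next; \
    (ent)->list##prev = &(q)->head; \
    (q)->head.list##next->list##prev = (ent); \
    (q)->head.list##next = (ent); \
    (q)->size++; \
} while (0)

/* Remove ent from q; the caller must know which queue ent is on. */
#define Q_DEQ_NOLOCK(ent, q, list) do { \
    (ent)->list##prev->list##next = (ent)->list##next; \
    (ent)->list##next->list##prev = (ent)->list##prev; \
    (q)->size--; \
} while (0)
```

Because the links for each queue are distinct fields, one block can sit on several queues at once, and the dummy head means neither macro needs to special-case an empty list.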

The lock within nasd_odc_lru is also known as the LRU lock. This lock may be taken and released with NASD_ODC_LRU_LOCK() and NASD_ODC_LRU_UNLOCK(), respectively. These macros take no arguments.

References

To simplify bookkeeping, cache blocks contain reference counts to indicate how many concurrent logical users there are of a block. This allows multiple service threads and modules to access a single cache block without being aware of one another. The refcnt field of the nasd_odc_ent_t is this reference count, also known variously as the in-core refcount and the external refcount. Access to this field is protected by the LRU lock.

In addition to the external refcount, cache blocks contain another field, irefcnt, which is known as the internal refcount. This is used by the internals of the cache mechanism for its own purposes; "internal" refers to the internals of the cache. The primary use of this is for inode blocks. Whenever a block that is logically a member of an object but is not the inode block (i.e., NASD_ODC_T_DATA or NASD_ODC_T_IND) is cached non-anonymously (that is, not as NASD_ODC_T_ANON), a pointer to the inode block is maintained in the cache entry (node_ent). For each such pointer extant in a valid cache block, an internal reference is held on the inode block. This allows the cache to invalidate these pointers when necessary, and to avoid mistakenly dereferencing a different block in a reused cache entry as the inode block.

Block state and locking discipline

Each cache block contains two locks, a mutex and a readers/writers lock. The mutex is locked and released with NASD_ODC_LOCK_BLOCK() and NASD_ODC_UNLOCK_BLOCK(), which both take as their sole argument a pointer to a cache block (nasd_odc_ent_t *). If the NASD_ODC_RECORD_BLOCK_LOCKS option is enabled, then each entry will track where the lock was taken (if it is currently held) in the locker_file and locker_line fields. This is for debugging purposes only. The readers/writers lock is taken with NASD_ODC_RLOCK_BLOCK_DATA() for reading, and NASD_ODC_WLOCK_BLOCK_DATA() for writing. Similarly, it is released with NASD_ODC_RUNLOCK_BLOCK_DATA() and NASD_ODC_WUNLOCK_BLOCK_DATA(). If the NASD_ODC_RECORD_BLOCK_WLOCKS option is enabled, then the w_locker_file and w_locker_line fields of the cache entry track who, if anyone, currently holds this lock for writing. These macros also take as their sole argument a pointer to the cache block being manipulated.

After obtaining a block from the cache, before manipulating it further, one should take a read or a write lock on the block (depending on whether or not one intends to modify the data contents of the block).

To avoid deadlock, it is very important to strictly observe the restrictions on what locks should be taken in what order, and which locks may not be concurrently held with other locks, as described in this document.

data_flags

The basic state of the data in a block is represented by the flags word data_flags. After a block is acquired from the cache, before its data may be examined, altered, or used, the data_flags word must be checked. This word is protected by the block mutex. Values for this word include:
Value              Meaning
NASD_CD_BUSY This block is busy (I/O is in progress).
NASD_CD_INVALID Data contents of this block are not valid.
NASD_CD_NZ Data contents of block are logically zero, but not initialized; readers may treat the global array nasd_odc_zeroblk as the data contents of this block; writers should explicitly zero portions of the data they do not overwrite, and unset this flag.
NASD_CD_MBUSY The lookup operation on this block marked it busy, but the I/O has not been launched. The caller of the lookup operation is responsible for launching the I/O.
NASD_CD_DELETING The block is an inode which is being deleted.
NASD_CD_SECURITY The current state of the block is the result of security processing.
NASD_CD_ANONF An anonymous fetch of this block is in progress.

Changes in the contents of the data_flags word are heralded by broadcasting the condition cond in the cache block. Users of cache blocks may await such changes by waiting on cond, atomically releasing and retaking the block mutex (lock). Two common such operations are waiting for a block to not be busy, and waiting for a block to be neither busy nor invalid. The preferred way to accomplish these is to call nasd_odc_wait_not_busy() or nasd_odc_wait_not_busy_invalid(), respectively. In addition to blocking until the correct status is achieved, these operations will instruct I/O modules that support priority queues to elevate the priority of the I/Os which are keeping the blocks busy.
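Internally, such a wait plausibly amounts to a standard condition-variable loop. The following is a hedged sketch using POSIX primitives; the drive's actual NASD_* thread wrappers and the priority-elevation hook are not shown, and the structure here is a simplified stand-in for nasd_odc_ent_t.

```c
#include <assert.h>
#include <pthread.h>

#define CD_BUSY    0x01
#define CD_INVALID 0x02

typedef struct {
    pthread_mutex_t lock;        /* the block mutex */
    pthread_cond_t  cond;        /* broadcast on data_flags changes */
    unsigned        data_flags;
} fake_ent_t;

/*
 * Caller holds ent->lock; returns with ent->lock still held and
 * with neither CD_BUSY nor CD_INVALID set.  pthread_cond_wait()
 * atomically releases and retakes the mutex around each sleep.
 */
static void wait_not_busy_invalid(fake_ent_t *ent)
{
    while (ent->data_flags & (CD_BUSY | CD_INVALID)) {
        /* Real code would also elevate the priority of the pending I/O. */
        pthread_cond_wait(&ent->cond, &ent->lock);
    }
}
```

Whoever clears CD_BUSY or CD_INVALID must broadcast cond while holding (or after having held) the block mutex, so waiters re-check the flags and proceed.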

For example, let's say that we have a code fragment which wishes to read the first four bytes of data from a block. That might look like:

/*
 * Block acquired from cache here
 * 
 * Note that we do not check for NASD_CD_NZ below.
 * The reason for this, and an explanation of when that
 * check is and is not necessary, appears below.
 */

NASD_ODC_RLOCK_BLOCK_DATA(ent);

NASD_ODC_LOCK_BLOCK(ent);
nasd_odc_wait_not_busy_invalid(ent);
NASD_ODC_UNLOCK_BLOCK(ent);

printf("Bytes are: 0x%02x 0x%02x 0x%02x 0x%02x\n",
  ent->data.buf[0],
  ent->data.buf[1],
  ent->data.buf[2],
  ent->data.buf[3]);

NASD_ODC_RUNLOCK_BLOCK_DATA(ent);

/* indicate that we're done with the block here */

lru_flags

Another flags word in the cache entry structure is lru_flags. Operations on this word are protected by the LRU lock. The state represented by this word is entirely internal to the cache - code outside the cache mechanism itself should not be concerned with this. Values for this word include:
Value              Meaning
NASD_CL_ALLOC The data for this block is currently being allocated.
NASD_CL_NOALLOC The data allocation for this block failed.
NASD_CL_LRU_Q This block is enqueued on nasd_odc_lru.
NASD_CL_REMOVING This block is being ejected from the cache.
NASD_CL_DELETING This block is being deleted (valid only for inode blocks).
NASD_CL_FALLOC Force the allocation (recovery case for NASD_CL_NOALLOC).
NASD_CL_AERROR Forced allocation (NASD_CL_FALLOC) failed.
NASD_CL_WIRED This block may not be ejected from the cache.
The uses of these values will be explained in cache operation below.

dirty_flags

The current state of a block with respect to the dirty block tracking system is maintained in the dirty_flags field of its cache block. Access to this word is protected by a lock maintained by the dirty block tracking system (nasd_odc_dirtyq_lock). Values for this word are:
Value              Meaning
NASD_CR_DIRTY_Q This block is on the dirty list.
NASD_CR_DIRTYW_Q This block is on the dirty-write list.
NASD_CR_DIRTY This block is dirty.
The details of how dirty blocks are handled are described below in cache operation.

io_flags

This flags word exists for the use of the I/O module. Its contents, and the locking protocol for reading or altering them, are defined by that module. No one outside the I/O module should reference this field.

Cache operation

During initialization, the drive calls nasd_cache_init() to initialize the cache. At this time, nasd_odc_unusedq and nasd_odc_lru are initialized. Next, a hash table for looking up blocks that are currently in-core is initialized. This table has nasd_odc_buckets buckets, each of which has type nasd_odc_oq_t. The cache contains nasd_odc_size entries, which are statically allocated at this time and relegated to nasd_odc_unusedq. Finally, the dirty block tracker is initialized.

The most common way to retrieve a block from the cache is to call:

nasd_status_t nasd_odc_block_get(
  nasd_odc_ent_t   *node_ent,
  nasd_blkno_t      blkno,
  int               flags,
  nasd_odc_ent_t  **entp,
  int               type,
  nasd_odc_ent_t   *ichain)
node_ent is a pointer to an associated inode block, if any. This is relevant when finding object-data (NASD_ODC_T_DATA) or indirect (NASD_ODC_T_IND) blocks. blkno is the block number of this block. type is its block type. The block will be returned in *entp. ichain represents the I/O chain when I/O queueing is enabled; this will be explained below. Valid values for flags are:
Flag               Meaning
NASD_ODC_L_FORCE If an entry for the block is not found in the cache, create it.
NASD_ODC_L_BLOCK The operation may block.
NASD_ODC_L_LOAD If the block is invalid (not yet fetched), launch the I/O to validate it if necessary.
NASD_ODC_L_MLOAD If the block is invalid (not yet fetched), mark it as requiring the fetch, queue it on ichain, and set NASD_CD_MBUSY in the block's data_flags.

Fundamentally, this operation may be thought of as being broken into two phases: getting the block, and performing any necessary I/O-related activity. These two parts are implemented by:

nasd_status_t nasd_odc_block_get_part1(
  nasd_odc_ent_t   *node_ent,
  nasd_blkno_t      blkno,
  int               flags,
  nasd_odc_ent_t  **entp,
  int               type,
  nasd_odc_ent_t   *ichain,
  int              *crp)

nasd_status_t nasd_odc_block_get_part2(
  nasd_odc_ent_t   *node_ent,
  nasd_blkno_t      blkno,
  int               flags,
  nasd_odc_ent_t  **entp,
  int               type,
  nasd_odc_ent_t   *ichain,
  int              *crp)

with help from:

nasd_status_t nasd_odc_block_lookup(
  nasd_odc_ent_t   *node_ent,
  nasd_blkno_t      blkno,
  int               flags,
  nasd_odc_ent_t  **entp,
  int               type,
  int              *creatorp)
The first logical portion of this operation is finding the block in the cache, or adding it to the cache if it is not there and the user has specified NASD_ODC_L_FORCE. This activity is performed by nasd_odc_block_get_part1(). The first thing done by nasd_odc_block_get_part1() is a call to nasd_odc_block_lookup().

nasd_odc_block_lookup() begins by taking the LRU lock to serialize its work. Next, it checks the hash table to determine if an entry for the block exists. If the block is found, but is being ejected from the cache, then nasd_odc_block_lookup() will "rescue" the block from ejection if and only if NASD_ODC_L_BLOCK was specified. Otherwise, the lookup will fail with NASD_EJECTING as the status. If the block is currently cached as an anonymous block, and the caller has specified another type for it, then the type of the block is changed to the caller's type.

If nasd_odc_block_lookup() does not find the block in the hash table, and the caller has specified NASD_ODC_L_FORCE, then it will attempt to add this entry to the cache. First, it calls nasd_odc_get_thread_ent(), which is defined by the I/O module. This operation yields a nasd_odc_ent_t with no associated data. The I/O module should define this operation in such a manner as to be nonblocking, yet reliably yield a successful result. This entry is initialized with the correct block number and type, and inserted in the hash table. NASD_CL_ALLOC is set in the lru_flags for the block. Next, nasd_odc_block_lookup() calls nasd_odc_block_grab(). If there are blocks in nasd_odc_unusedq, nasd_odc_block_grab() will return one of these. If not, it will select a block for replacement (from the tail of nasd_odc_lru) and return that. nasd_odc_block_grab() is instructed by its caller as to whether or not it may block. nasd_odc_block_lookup() uses the presence or absence of NASD_ODC_L_BLOCK to set this parameter. After an eligible block is selected for replacement, its data page is removed and attached to the original entry added to the hash table by nasd_odc_block_lookup(). The newly-stripped cache entry is handed off to the I/O module by calling nasd_odc_put_thread_ent(). Before nasd_odc_block_lookup() returns any entries, it increments their external refcount.

The astute will note that nasd_odc_block_lookup() may yield a block in an undesirable state under certain circumstances. Specifically, it returns success if it finds a block in the cache. However, the block it found could be a block with no associated data, because another thread is in the process of allocating data for this block. It is the job of the rest of the code in nasd_odc_block_get_part1() to deal with this case.

When a block without an associated data page is returned, nasd_odc_block_get_part1() waits on the condition acond in the cache entry. When the allocation operation has completed, the nasd_odc_block_lookup() responsible for the allocation broadcasts this condition. If the allocation has succeeded, there will be a data page. If the allocation has failed, NASD_CL_NOALLOC will be set in the lru_flags of the cache block.

If NASD_CL_AERROR is set in lru_flags, the drive has experienced an internal failure which prevented this operation from succeeding. This is usually a sign of memory corruption, or some other serious error. This condition should ultimately result in a drive reset. If NASD_CL_AERROR is not set, this indicates that the nasd_odc_block_lookup() responsible for the allocation was called nonblocking, and that it would be necessary to block waiting for a page to complete the allocation successfully. If NASD_ODC_L_BLOCK was not specified to nasd_odc_block_get_part1(), the operation fails at this point, because further action would require long-term blocking. If NASD_ODC_L_BLOCK was specified, nasd_odc_block_get_part1() sets NASD_CL_FALLOC in the cache block's lru_flags, and uses nasd_odc_force_alloc() to block and obtain a page. If this fails, then the drive is experiencing some sort of severe internal error, and a drive reset should ultimately result. nasd_odc_block_get_part1() sets NASD_CL_AERROR on the cache block if this has happened.

The LRU lock serializes who performs the allocation when multiple threads are waiting for the allocation to complete on the same block. The external refcount tracks how many threads are currently examining the block. nasd_odc_block_get_part1() only gives up the external reference on the block added by nasd_odc_block_lookup() if it fails for some reason, so callers of nasd_odc_block_get_part1() or nasd_odc_block_get() get back a block with an external reference held on it.

The thread responsible for launching an I/O on a block is the thread that added the block to the cache. This is what the creatorp result of nasd_odc_block_lookup() is for. If *creatorp is nonzero after a call to nasd_odc_block_lookup(), then it is the caller's thread that is responsible for starting the I/O. nasd_odc_block_get_part1() manages transfers of this responsibility if the thread that added an entry to the cache has yielded responsibility for the block, as with a failed page allocation due to a lack of NASD_ODC_L_BLOCK. *crp as returned by nasd_odc_block_get_part1() indicates whether or not the caller is responsible for starting I/O on this block, if such is necessary.

nasd_odc_block_get_part2() takes a block resulting from a call to nasd_odc_block_get_part1() and determines if an I/O should be started. The same crp passed to nasd_odc_block_get_part1() may be passed to nasd_odc_block_get_part2(). If *crp is nonzero and the block is invalid (NASD_CD_INVALID), it is necessary to launch an I/O. If NASD_ODC_L_MLOAD is specified, the block will be marked NASD_CD_MBUSY and queued on ichain, which should be a pointer to a placeholder nasd_odc_ent_t. The queueing will use the inext and iprev fields to form a circular queue of I/Os. This mechanism is used by callers to retrieve many blocks from the cache, and batch necessary I/Os. If prefetching is enabled, nasd_odc_block_get_part2() may enqueue prefetch blocks on this chain as well. The caller is responsible for launching I/Os on these blocks as well. If NASD_ODC_L_MLOAD is not specified, but NASD_ODC_L_LOAD is specified, then nasd_odc_block_get_part2() will start the I/O itself. Note that callers who do not specify either NASD_ODC_L_LOAD or NASD_ODC_L_MLOAD are responsible themselves for either launching an I/O to validate the block, or otherwise making the data contents valid.
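The I/O chain can be sketched as follows: the placeholder entry heads a circular list threaded through inext/iprev, and after collecting blocks the caller walks the chain and launches each pending fetch. This is an illustrative model with simplified names, not the drive's actual code.

```c
#include <assert.h>

typedef struct fent_s {
    struct fent_s *inext, *iprev;  /* I/O-chain links */
    int            blkno;
} fent_t;

/* The placeholder entry starts as an empty circular chain. */
static void ichain_init(fent_t *head)
{
    head->inext = head->iprev = head;
}

/* Queue ent at the tail of the chain (as NASD_ODC_L_MLOAD would). */
static void ichain_append(fent_t *head, fent_t *ent)
{
    ent->inext = head;
    ent->iprev = head->iprev;
    head->iprev->inext = ent;
    head->iprev = ent;
}

/* Walk the chain and "launch" each queued I/O (here we just count). */
static int ichain_launch_all(fent_t *head)
{
    int n = 0;
    for (fent_t *e = head->inext; e != head; e = e->inext)
        n++;  /* real code would start the fetch for e here */
    return n;
}
```

Batching the chain this way lets a caller gather many invalid blocks first and hand the whole set to the I/O module at once, rather than launching one fetch per lookup.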

If NASD_CD_NZ is set on a block returned by the cache, the caller should treat the data field as if it were full of zeroes, even though the memory itself is uninitialized. If the caller modifies the data, this flag must be cleared, and any locations not explicitly overwritten must be zeroed.

When a block is no longer of use, its external reference is surrendered by calling nasd_odc_block_release() with a pointer to the cache block. When a block's external refcount goes to zero, the cache examines its state. If the block is dirty, it is enqueued on the list nasd_odc_dirtyq. Otherwise, it is enqueued at the head of nasd_odc_lru, because it is eligible for replacement. If a subsequent lookup makes the external refcount nonzero again, the block is removed from the dirty list or the LRU queue.
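The release-time decision reduces to a small state check. This is a simplified, hypothetical sketch of that logic only; the real code also manipulates the queues under the appropriate locks.

```c
#include <assert.h>

enum block_dest { STILL_HELD, ON_DIRTYQ, ON_LRU };

/*
 * Drop one external reference.  If this was the last reference,
 * a dirty block is destined for the dirty queue and a clean one
 * for the head of the LRU queue (eligible for replacement).
 */
static enum block_dest release_block(int *refcnt, int is_dirty)
{
    if (--(*refcnt) > 0)
        return STILL_HELD;
    return is_dirty ? ON_DIRTYQ : ON_LRU;
}
```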

The dirty list is maintained by the dirty block tracker. This is a simple state machine which periodically flushes dirty blocks to disk. Additionally, this state machine tracks what threads are waiting for which blocks to be written to disk, and can flush individual blocks, entire objects, or the entire in-core state to disk on demand.

A cache entry's data should only be modified by a holder of the entry's write lock. Before the entry is released, and while the write lock is still held, a call to nasd_odc_dirty_ent() with a pointer to the cache block will inform the dirty block tracker that a particular block is dirty.

To see how this fits together, let's revisit our earlier example. This time, we will fill in the code which gets the block from the cache and releases it, and instead of reading the first four bytes, we will initialize them.

nasd_odc_ent_t *ent;
nasd_status_t rc;
int state_changed;

state_changed = 0;

rc = nasd_odc_block_get(inode_ent, data_blkno,
  NASD_ODC_L_FORCE|NASD_ODC_L_BLOCK, &ent,
  NASD_ODC_T_DATA, NULL);
if (rc)
  return(rc);

NASD_ODC_WLOCK_BLOCK_DATA(ent);

NASD_ODC_LOCK_BLOCK(ent);
nasd_odc_wait_not_busy(ent);
nasd_odc_dirty_ent(ent);
if (ent->data_flags&NASD_CD_INVALID) {
  ent->data_flags &= ~NASD_CD_INVALID;
  state_changed = 1;
}
if (ent->data_flags&NASD_CD_NZ) {
  bzero((char *)&ent->data.buf[4], NASD_OD_BASIC_BLOCKSIZE-4);
  ent->data_flags &= ~NASD_CD_NZ;
  state_changed = 1;
}
NASD_ODC_UNLOCK_BLOCK(ent);

ent->data.buf[0] = 'n';
ent->data.buf[1] = 'a';
ent->data.buf[2] = 's';
ent->data.buf[3] = 'd';

NASD_ODC_WUNLOCK_BLOCK_DATA(ent);

/*
 * We must broadcast this condition if we changed the
 * data_flags state above. We defer the broadcast from
 * then to now as an optimization - otherwise, those interested
 * in using the block would wake up, only to discover that
 * we are still manipulating it.
 */
if (state_changed) {
  NASD_BROADCAST_COND(ent->cond);
}

nasd_odc_block_release(ent);

