Understanding the cache in the drive is important if you intend to modify or extend drive functionality. In addition to maintaining in-core copies of on-disk blocks, the cache mechanism also defines the locking and serialization protocol for many operations.
Structures and definitions used by the cache may be found in nasd_cache.h.
nasd_odc_state_t
Drive code may access a global pointer, nasd_odc_state, which points to a structure of type nasd_odc_state_t.
The disk field of this structure is a copy of the most recent nasd_od_disk_t available for describing this drive. When the disk state is written, as described in the section on I/O modules, it is the contents of this memory which are written.
The nvstate field is a pointer to a nasd_odc_nvstate_t, which is intended to be the contents of the header, if not the entire contents, of the NVRAM available on the drive. Although this structure is supported (and used) in software, at the time this document is written we have not constructed prototypes with NVRAM available for any purpose but boot firmware.
The parts array is an array of NASD_OD_MAXPARTS structures of type nasd_odc_icpart_t, which is a per-partition in-core structure. This structure contains a read/write lock (see threads) which is used to serialize inode-level operations on this partition. Changing the set of inodes in a partition (that is, creating and removing objects) requires holding this lock for writing. Examining the set of inodes in a partition, which includes mapping an inode number to a block number (as with nasd_odc_node_get()), requires holding this lock for reading. This structure also contains several fields related to the operation of the list-of-objects control object, which will be described in the section on control objects.
The section on inodes describes the inode hash table. The npt_sz field of the nasd_odc_state_t structure represents the number of blocks in each copy of the inode hash table. The cr_ind field of the nasd_odc_state_t structure represents a block offset in this table. Each time a new object is created on the disk, this cr_ind value is incremented modulo npt_sz. When the inode creation code starts looking for an empty hash table slot, it starts looking at (first_hash_table_block + cr_ind) % npt_sz. This ensures that a series of inode creations rotates through the blocks of the inode pagetable. The goal is to increase performance of concurrent creations by avoiding serialization on busy page table blocks.
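The rotation can be sketched in a few lines of C. This is a minimal illustration, not the drive's code: the structure, the helper names, and the choice of expressing probe positions relative to the start of the table are assumptions made for the example.

```c
/* Hypothetical stand-in for the two fields discussed above. */
struct demo_state {
    unsigned npt_sz;  /* blocks in each copy of the inode hash table */
    unsigned cr_ind;  /* rotating block offset into that table */
};

/* Advance cr_ind after an object creation, wrapping at npt_sz. */
static void demo_note_creation(struct demo_state *st)
{
    st->cr_ind = (st->cr_ind + 1) % st->npt_sz;
}

/*
 * Return the table-relative block to probe on the given attempt
 * (0 = first probe). Probing starts at cr_ind and wraps around, so
 * every block is visited exactly once in npt_sz attempts.
 */
static unsigned demo_probe_block(const struct demo_state *st, unsigned attempt)
{
    return (st->cr_ind + attempt) % st->npt_sz;
}
```

Because consecutive creations begin their search in different blocks, two concurrent creators are unlikely to contend for the same pagetable block.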
nasd_odc_ent_t
Each block in the cache is represented by a nasd_odc_ent_t. Among other fields, this contains several locks for protecting different components of the structure, a variety of flags words for use by various subsystems which use or manage cache blocks, sets of pointers for maintaining lists of blocks, and a description of the actual contents of the block. All blocks are represented identically in the cache - there are no separate types for inodes, indirect blocks, direct blocks, etc. The type field of the nasd_odc_ent_t indicates what kind of block this is. Valid values for this field are:
Type | Meaning |
---|---|
NASD_ODC_T_NODE | inode |
NASD_ODC_T_IND | indirect block |
NASD_ODC_T_FREE | unused block |
NASD_ODC_T_REFCNT | block from the refcount region of the disk |
NASD_ODC_T_NPT | inode pagetable block |
NASD_ODC_T_DATA | data block of an object |
NASD_ODC_T_ANON | any block within the data region |
NASD_ODC_T_LAYOUT | layout bookkeeping block (swappable) |
NASD_ODC_T_LAYOUT_STATIC | layout bookkeeping block (non-swappable) |
The physical block number of the block (within its region) is stored in the blkno field. Additionally, the sector number of the first on-platter sector of this block is stored in real_sectno.
The data field of the nasd_odc_ent_t represents the actual data contents of the indicated block on disk. This field is a union of several pointer types, one for each form in which drive components may wish to dereference the data (to avoid excessive casting within the drive code). Additionally, one of the members of this union (buf) has type char *, and this may be coerced to any other type which may be needed. It is a convention of the drive code that if any such coercion is necessary, it will be performed upon the buf element of the union, and no other.
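A union of this shape might look like the sketch below. Only buf and its char * type come from the text above; the other member names, the demo_inode type, and the helper are hypothetical and exist only to illustrate the convention.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical typed view of an inode block. */
struct demo_inode { uint32_t generation; };

union demo_data {
    struct demo_inode *node;  /* assumed name: typed view for inode blocks */
    uint32_t *blocks;         /* assumed name: typed view for indirect blocks */
    char *buf;                /* untyped view; by convention the only member cast */
};

/* Read the first 32-bit word through buf, copying to avoid alignment traps. */
static uint32_t demo_first_word(const union demo_data *d)
{
    uint32_t w;
    memcpy(&w, d->buf, sizeof w);
    return w;
}
```

Code that already knows the block type uses the matching typed member directly; anything else starts from buf.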
Queues of cache blocks have type nasd_odc_oq_t. Each of these structures contains a mutex (lock), a counter of the number of blocks in the queue (size), and a dummy nasd_odc_ent_t (head). This dummy element has no valid data field. It exists so that blocks in the queue may be stored in doubly-linked lists, and so that adds and removes do not need to special-case the beginning and/or end of the list; the list is circular, and the head element is always present.
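The benefit of the always-present dummy head is easiest to see in code. The following is a generic sketch of a circular doubly-linked list with a sentinel, not the drive's implementation; all names are illustrative, and the mutex is omitted for brevity.

```c
#include <stddef.h>

struct demo_ent {
    struct demo_ent *next, *prev;
    int blkno;
};

struct demo_queue {
    int size;              /* number of real entries */
    struct demo_ent head;  /* dummy element; never carries data */
};

/* An empty queue is a head that points at itself in both directions. */
static void demo_queue_init(struct demo_queue *q)
{
    q->size = 0;
    q->head.next = q->head.prev = &q->head;
}

/* Insert at the front; no empty-list special case is needed. */
static void demo_queue_ins(struct demo_queue *q, struct demo_ent *e)
{
    e->next = q->head.next;
    e->prev = &q->head;
    e->next->prev = e;
    q->head.next = e;
    q->size++;
}

/* Remove e, which the caller guarantees is on q. */
static void demo_queue_deq(struct demo_queue *q, struct demo_ent *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = e->prev = NULL;
    q->size--;
}

/* Return the tail entry, or NULL if the queue is empty. */
static struct demo_ent *demo_queue_tail(struct demo_queue *q)
{
    return (q->head.prev == &q->head) ? NULL : q->head.prev;
}
```

Neither the insert nor the remove tests for a NULL neighbor, because every real entry always has the head or another entry on each side.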
The cache maintains two very important queues: the unused queue and the LRU queue. Blocks in the unused queue, nasd_odc_unusedq, are not currently assigned to any purpose. The contents of their data page are completely invalid. Blocks in the LRU queue, nasd_odc_lru, contain valid data. The tail of this queue is the least-recently-used block in the cache.
Queues are initialized by calling nasd_queue_init() with a pointer to the nasd_odc_oq_t to initialize. This function returns a nasd_status_t indicating whether or not the structure was successfully initialized. If the return is NASD_SUCCESS, initialization has completed successfully, and destruction of the mutex has been registered on the nasd_odc_shutdown list.
The mutex of a nasd_odc_oq_t should be taken with NASD_ODC_Q_LOCK() and released with NASD_ODC_Q_UNLOCK(). Both of these operations take as their sole argument a nasd_odc_oq_t *. In addition to managing the mutex, they keep track of who, if anyone, is the current mutex holder. This is useful for debugging purposes.
Different queues use different pointers within the nasd_odc_ent_t structure, so that a single entry may be maintained in multiple queues simultaneously. For each queue that a block may belong to, two pointers, next and prev, are maintained. These names are made unique by prepending a single letter. For example, blocks in the LRU queue are linked on fields named lnext and lprev.
Several other macros manipulate queues as well. NASD_ODC_Q_SIZE() returns the number of elements in a queue, given a pointer to the queue structure. NASD_ODC_Q_DEQ(ent,list) removes ent from a queue, where it was linked with list as the uniquifying character for the next and prev fields. For example, NASD_ODC_Q_DEQ(some_ent,l) removes some_ent from the LRU queue. It is the responsibility of the caller to ensure that some_ent was in the LRU queue in the first place. Likewise, NASD_ODC_Q_INS(some_queue,some_ent,list) inserts some_ent in queue some_queue, linked with the uniquifier list. NASD_ODC_Q_DEQ_TAIL(queue,ent,list) behaves similarly to NASD_ODC_Q_DEQ(), except that ent is assigned to be the tail element of queue at the beginning of the operation. Each of these macros takes and releases the queue lock to serialize its operation. If the caller is managing this locking, then the _NOLOCK variants of the calls should be used. For example, NASD_ODC_Q_INS_NOLOCK() takes the same arguments as NASD_ODC_Q_INS() and performs the same tasks, except that it does not take or release the queue lock.
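One plausible way to build macros like these is preprocessor token pasting, where the uniquifier is glued onto next and prev. The sketch below assumes that technique; the DEMO_ macro names, the hnext/hprev pair, and the pthread mutex are stand-ins for illustration and are not the definitions in nasd_cache.h.

```c
#include <pthread.h>

struct demo_ent {
    struct demo_ent *lnext, *lprev;  /* LRU queue links */
    struct demo_ent *hnext, *hprev;  /* assumed second link pair */
};

struct demo_queue {
    pthread_mutex_t lock;
    int size;
    struct demo_ent head;            /* dummy element */
};

/* Initialize q as an empty circular list on the given link pair. */
#define DEMO_Q_INIT(q, list) do { \
    pthread_mutex_init(&(q)->lock, NULL); \
    (q)->size = 0; \
    (q)->head.list##next = (q)->head.list##prev = &(q)->head; \
} while (0)

/* Insert ent at the front of q; the caller already holds q's lock. */
#define DEMO_Q_INS_NOLOCK(q, ent, list) do { \
    (ent)->list##next = (q)->head.list##next; \
    (ent)->list##prev = &(q)->head; \
    (ent)->list##next->list##prev = (ent); \
    (q)->head.list##next = (ent); \
    (q)->size++; \
} while (0)

/* Locked variant: same work, bracketed by the queue mutex. */
#define DEMO_Q_INS(q, ent, list) do { \
    pthread_mutex_lock(&(q)->lock); \
    DEMO_Q_INS_NOLOCK(q, ent, list); \
    pthread_mutex_unlock(&(q)->lock); \
} while (0)
```

With this shape, DEMO_Q_INS(&lru, ent, l) touches only lnext and lprev, so the same entry can sit on a second queue through hnext and hprev at the same time.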
The lock within nasd_odc_lru is also known as the LRU lock. This lock may be taken and released with NASD_ODC_LRU_LOCK() and NASD_ODC_LRU_UNLOCK(), respectively. These macros take no arguments.
The refcnt field of the nasd_odc_ent_t is the block's reference count, also known variously as the in-core refcount and the external refcount. Access to this field is protected by the LRU lock.
In addition to the external refcount, cache blocks contain another field, irefcnt, which is known as the internal refcount. This is used by the internals of the cache mechanism for its own purposes; "internal" refers to the internals of the cache. The primary use of this is for inode blocks. Whenever a block that is logically a member of an object but not the inode block (that is, a NASD_ODC_T_DATA or NASD_ODC_T_IND block) is cached non-anonymously (that is, not as NASD_ODC_T_ANON), a pointer to the inode block is maintained in the cache entry (node_ent). For each such pointer extant in a valid cache block, an internal reference is held on the inode block. This allows the cache to invalidate these pointers when necessary, and to avoid mistakenly dereferencing a different block in a reused cache entry as the inode block.
Each cache block has a mutex and a readers/writers lock. The block mutex is taken and released with NASD_ODC_LOCK_BLOCK() and NASD_ODC_UNLOCK_BLOCK(), which both take as their sole argument a pointer to a cache block (nasd_odc_ent_t *). If the NASD_ODC_RECORD_BLOCK_LOCKS option is enabled, then each entry will track where the lock was taken (if it is currently held) in the locker_file and locker_line fields. This is for debugging purposes only. The readers/writers lock is taken with NASD_ODC_RLOCK_BLOCK_DATA() for reading, and NASD_ODC_WLOCK_BLOCK_DATA() for writing. Similarly, it is released with NASD_ODC_RUNLOCK_BLOCK_DATA() and NASD_ODC_WUNLOCK_BLOCK_DATA(). If the NASD_ODC_RECORD_BLOCK_WLOCKS option is enabled, then the w_locker_file and w_locker_line fields of the cache entry track who, if anyone, currently holds this lock for writing. These macros also take as their sole argument a pointer to the cache block being manipulated.
After obtaining a block from the cache, before manipulating it further, one should take a read or a write lock on the block (depending on whether or not one intends to modify the data contents of the block).
To avoid deadlock, it is very important to strictly observe the restrictions on what locks should be taken in what order, and which locks may not be concurrently held with other locks, as described in this document.
data_flags
After a block is acquired from the cache, before its data may be examined, altered, or used, the data_flags word must be checked. This word is protected by the block mutex. Values for this word include:
Value | Meaning |
---|---|
NASD_CD_BUSY | This block is busy (I/O is in progress). |
NASD_CD_INVALID | Data contents of this block are not valid. |
NASD_CD_NZ | Data contents of block are logically zero, but not initialized; readers may treat the global array nasd_odc_zeroblk as the data contents of this block; writers should explicitly zero portions of the data they do not overwrite, and unset this flag. |
NASD_CD_MBUSY | The lookup operation on this block marked it busy, but the I/O has not been launched. The caller of the lookup operation is responsible for launching the I/O. |
NASD_CD_DELETING | The block is an inode which is being deleted. |
NASD_CD_SECURITY | The current state of the block is the result of security processing. |
NASD_CD_ANONF | An anonymous fetch of this block is in progress. |
Changes in the contents of the data_flags word are heralded by broadcasting the condition cond in the cache block. Users of cache blocks may await such changes by waiting on cond, atomically releasing and retaking lock. Two such operations which are common are waiting for a block to not be busy, and waiting for a block to not be busy or invalid. The preferred way to accomplish this is to call nasd_odc_wait_not_busy() or nasd_odc_wait_not_busy_invalid(). In addition to blocking until the correct status is achieved, these operations will instruct I/O modules that support priority queues to elevate the priority of the I/Os which are keeping the blocks busy.
For example, let's say that we have a code fragment which wishes to read the first four bytes of data from a block. That might look like:
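A minimal sketch of such a fragment follows. It is self-contained and therefore uses stand-ins: a pthread mutex and condition variable play the roles of lock and cond, the flag values are assumed, and demo_wait_not_busy_invalid() is a hypothetical substitute for nasd_odc_wait_not_busy_invalid() that is assumed to be called with the block mutex held. Handling of NASD_CD_NZ and the block's read lock is omitted.

```c
#include <pthread.h>
#include <string.h>

#define DEMO_CD_BUSY    0x1u  /* assumed value; stands in for NASD_CD_BUSY */
#define DEMO_CD_INVALID 0x2u  /* assumed value; stands in for NASD_CD_INVALID */

struct demo_ent {
    pthread_mutex_t lock;   /* block mutex; protects data_flags */
    pthread_cond_t cond;    /* broadcast whenever data_flags changes */
    unsigned data_flags;
    char *buf;              /* data contents of the block */
};

/* Sleep until the block is neither busy nor invalid. Caller holds ent->lock. */
static void demo_wait_not_busy_invalid(struct demo_ent *ent)
{
    while (ent->data_flags & (DEMO_CD_BUSY | DEMO_CD_INVALID))
        pthread_cond_wait(&ent->cond, &ent->lock);
}

/* Copy the first four bytes of the block's data into out. */
static void demo_read_first4(struct demo_ent *ent, char out[4])
{
    pthread_mutex_lock(&ent->lock);
    demo_wait_not_busy_invalid(ent);
    pthread_mutex_unlock(&ent->lock);
    /* The drive would hold the block's read lock across this copy. */
    memcpy(out, ent->buf, 4);
}
```

The wait sits in a loop because a broadcast on cond only signals that data_flags changed, not that it reached the state the waiter needs.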
lru_flags
Operations on the lru_flags word are protected by the LRU lock. The state represented by this word is entirely internal to the cache - code outside the cache mechanism itself should not be concerned with this. Values for this word include:
Value | Meaning |
---|---|
NASD_CL_ALLOC | The data for this block is currently being allocated. |
NASD_CL_NOALLOC | The data allocation for this block failed. |
NASD_CL_LRU_Q | This block is enqueued on nasd_odc_lru. |
NASD_CL_REMOVING | This block is being ejected from the cache. |
NASD_CL_DELETING | This block is being deleted (valid only for inode blocks). |
NASD_CL_FALLOC | Force the allocation (recovery case for NASD_CL_NOALLOC). |
NASD_CL_AERROR | Forced allocation (NASD_CL_FALLOC) failed. |
NASD_CL_WIRED | This block may not be ejected from the cache. |
dirty_flags
The dirty state of a block is tracked in the dirty_flags field of its cache block. Access to this word is protected by a lock maintained by the dirty block tracking system (nasd_odc_dirtyq_lock). Values for this word are:
Value | Meaning |
---|---|
NASD_CR_DIRTY_Q | This block is on the dirty list. |
NASD_CR_DIRTYW_Q | This block is on the dirty-write list. |
NASD_CR_DIRTY | This block is dirty. |
io_flags
The io_flags field belongs to the I/O module. Its contents, and the locking protocol for reading or altering them, are defined by that module. No one outside the I/O module should reference this field.
nasd_cache_init() is called to initialize the cache. At this time, nasd_odc_unusedq and nasd_odc_lru are initialized. Next, a hash table for looking up blocks that are currently in-core is initialized. This table has nasd_odc_buckets buckets, each of which has type nasd_odc_oq_t. The cache contains nasd_odc_size blocks, which are statically allocated at this time, and relegated to nasd_odc_unusedq. Finally, the dirty block tracker is initialized.
The most common way to retrieve a block from the cache is to call nasd_odc_block_get(). Of its arguments, node_ent is a pointer to an associated inode block, if any. This is relevant when finding object-data (NASD_ODC_T_DATA) or indirect (NASD_ODC_T_IND) blocks. blkno is the block number of this block. type is its block type. The block will be returned in *entp. ichain represents the I/O chain when I/O queueing is enabled; this will be explained below. Valid values for flags are:
Flag | Meaning |
---|---|
NASD_ODC_L_FORCE | If an entry for the block is not found in the cache, create it. |
NASD_ODC_L_BLOCK | The operation may block. |
NASD_ODC_L_LOAD | If the block is invalid (not yet fetched), launch the I/O to validate it if necessary. |
NASD_ODC_L_MLOAD | If the block is invalid (not yet fetched), mark it as requiring the fetch, queue it on ichain, and set NASD_CD_MBUSY in the block's data_flags. |
Fundamentally, this operation may be thought of as being broken into two phases: getting the block, and performing any necessary I/O-related activity. These two parts are implemented by nasd_odc_block_get_part1() and nasd_odc_block_get_part2(), with help from nasd_odc_block_lookup().
The first logical portion of this operation is finding the block in the cache, or adding it to the cache if it is not there and the user has specified NASD_ODC_L_FORCE. This activity is performed by nasd_odc_block_get_part1(). The first thing done by nasd_odc_block_get_part1() is a call to nasd_odc_block_lookup().
nasd_odc_block_lookup() begins by taking the LRU lock to serialize its work. Next, it checks the hash table to determine if an entry for the block exists. If the block is found, but is being ejected from the cache, then nasd_odc_block_lookup() will "rescue" the block from ejection if and only if NASD_ODC_L_BLOCK was specified. Otherwise, the lookup will fail with NASD_EJECTING as the status. If the block is currently cached as an anonymous block, and the caller has specified another type for it, then the type of the block is changed to the caller's type.
If nasd_odc_block_lookup() does not find the block in the hash table, and the caller has specified NASD_ODC_L_FORCE, then it will attempt to add this entry to the cache. First, it calls nasd_odc_get_thread_ent(), which is defined by the I/O module. This operation yields a nasd_odc_ent_t with no associated data. The I/O module should define this operation in such a manner as to be nonblocking, yet reliably yield a successful result. This entry is initialized with the correct block number and type, and inserted in the hash table. NASD_CL_ALLOC is set in the lru_flags for the block. Next, nasd_odc_block_lookup() calls nasd_odc_block_grab(). If there are blocks in nasd_odc_unusedq, nasd_odc_block_grab() will return one of these. If not, it will select a block for replacement (from the tail of nasd_odc_lru) and return that. nasd_odc_block_grab() is instructed by its caller as to whether or not it may block. nasd_odc_block_lookup() uses the presence or absence of NASD_ODC_L_BLOCK to set this parameter. After an eligible block is selected for replacement, its data page is removed and attached to the original entry added to the hash table by nasd_odc_block_lookup(). The newly-stripped cache entry is handed off to the I/O module by calling nasd_odc_put_thread_ent(). Before nasd_odc_block_lookup() returns any entries, it increments their external refcount.
The astute will note that nasd_odc_block_lookup() may yield a block in an undesirable state under certain circumstances. Specifically, it returns success if it finds a block in the cache. However, the block it found could be a block with no associated data, because another thread is in the process of allocating data for this block. It is the job of the rest of the code in nasd_odc_block_get_part1() to deal with this case. When a block without an associated data page is returned, nasd_odc_block_get_part1() waits on the condition acond in the cache entry. When the allocation operation has completed, the nasd_odc_block_lookup() responsible for the allocation broadcasts this condition. If the allocation has succeeded, there will be a data page.
If the allocation has failed, NASD_CL_NOALLOC will be set in the lru_flags of the cache block. If NASD_CL_AERROR is set in lru_flags, the drive has experienced an internal failure which prevented this operation from succeeding. This is usually a sign of memory corruption, or some other serious error. This condition should ultimately result in a drive reset. If NASD_CL_AERROR is not set, this indicates that the nasd_odc_block_lookup() responsible for the allocation was called nonblocking, and it would be necessary to block waiting for a page to complete the allocation successfully. If NASD_ODC_L_BLOCK was not specified to nasd_odc_block_get_part1(), the operation fails at this point, because further action would require long-term blocking. If NASD_ODC_L_BLOCK is specified, nasd_odc_block_get_part1() sets NASD_CL_FALLOC in the cache block's lru_flags, and uses nasd_odc_force_alloc() to block and obtain a page. If this fails, then the drive is experiencing some sort of severe internal error, and a drive reset should ultimately result. nasd_odc_block_get_part1() sets NASD_CL_AERROR on the cache block if this has happened.
The LRU lock serializes who performs the allocation when multiple threads are waiting for the allocation to complete on the same block. The external refcount tracks how many threads are currently examining the block. nasd_odc_block_get_part1() only gives up the external reference on the block added by nasd_odc_block_lookup() if it fails for some reason, so callers of nasd_odc_block_get_part1() or nasd_odc_block_get() get back a block with an external reference held on it.
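The decisions taken after waking from the wait on acond can be summarized as a small pure function. This is a reading of the prose above rather than the drive's code; the enum, the flag values, and the function name are all hypothetical.

```c
#define DEMO_CL_NOALLOC 0x1u  /* assumed value; stands in for NASD_CL_NOALLOC */
#define DEMO_CL_AERROR  0x2u  /* assumed value; stands in for NASD_CL_AERROR */

enum demo_alloc_action {
    DEMO_USE_PAGE,        /* allocation succeeded; continue with the block */
    DEMO_KEEP_WAITING,    /* allocation still in progress; wait again */
    DEMO_FAIL_NOBLOCK,    /* a page needs blocking, and the caller forbade it */
    DEMO_FORCE_ALLOC,     /* set FALLOC and block for a page */
    DEMO_INTERNAL_ERROR   /* forced allocation failed; drive reset expected */
};

/*
 * has_page:  nonzero if the entry now has a data page attached
 * lru_flags: the entry's lru_flags, read under the LRU lock
 * may_block: nonzero if the caller passed the "may block" lookup flag
 */
static enum demo_alloc_action
demo_after_alloc_wait(int has_page, unsigned lru_flags, int may_block)
{
    if (has_page)
        return DEMO_USE_PAGE;
    if (lru_flags & DEMO_CL_AERROR)
        return DEMO_INTERNAL_ERROR;
    if (lru_flags & DEMO_CL_NOALLOC)
        return may_block ? DEMO_FORCE_ALLOC : DEMO_FAIL_NOBLOCK;
    return DEMO_KEEP_WAITING;
}
```

Checking the error flag before the plain failure flag mirrors the text: an earlier forced allocation that failed is fatal regardless of what the current caller is willing to do.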
The thread responsible for launching an I/O on a block is the thread that added the block to the cache. This is what the creatorp result of nasd_odc_block_lookup() is for. If *creatorp is nonzero after a call to nasd_odc_block_lookup(), then it is the caller's thread that is responsible for starting the I/O. nasd_odc_block_get_part1() manages transfers of this responsibility if the thread that added an entry to the cache has yielded responsibility for the block, as with a failed page allocation due to a lack of NASD_ODC_L_BLOCK. *crp as returned by nasd_odc_block_get_part1() indicates whether the caller is responsible for starting I/O on this block, if such is necessary.
nasd_odc_block_get_part2() takes a block resulting from a call to nasd_odc_block_get_part1() and determines if an I/O should be started. The same crp passed to nasd_odc_block_get_part1() may be passed to nasd_odc_block_get_part2(). If *crp is nonzero and the block is invalid (NASD_CD_INVALID), it is necessary to launch an I/O. If NASD_ODC_L_MLOAD is specified, the block will be marked NASD_CD_MBUSY and queued on ichain, which should be a pointer to a placeholder nasd_odc_ent_t. The queueing will use the inext and iprev fields to form a circular queue of I/Os. This mechanism is used by callers to retrieve many blocks from the cache, and batch necessary I/Os. If prefetching is enabled, nasd_odc_block_get_part2() may enqueue prefetch blocks on this chain as well. The caller is responsible for launching I/Os on these blocks as well. If NASD_ODC_L_MLOAD is not specified, but NASD_ODC_L_LOAD is specified, then nasd_odc_block_get_part2() will start the I/O itself. Note that callers who do not specify either NASD_ODC_L_LOAD or NASD_ODC_L_MLOAD are responsible themselves for either launching an I/O to validate the block, or otherwise making the data contents valid.
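The flag handling described above reduces to a short decision table. The sketch below encodes it as a pure function; the enum, the flag values, and the function name are assumptions for illustration, and prefetching is left out.

```c
#define DEMO_L_LOAD  0x1u  /* assumed value; stands in for NASD_ODC_L_LOAD */
#define DEMO_L_MLOAD 0x2u  /* assumed value; stands in for NASD_ODC_L_MLOAD */

enum demo_io_action {
    DEMO_IO_NONE,             /* nothing to do for this caller */
    DEMO_IO_QUEUE_ON_ICHAIN,  /* mark MBUSY, queue on ichain; caller launches */
    DEMO_IO_START_NOW,        /* the get operation starts the I/O itself */
    DEMO_IO_CALLER_VALIDATES  /* caller must validate the block some other way */
};

/*
 * creator: nonzero if *crp says this thread is responsible for the I/O
 * invalid: nonzero if the block's data is not yet valid
 * flags:   the lookup flags the caller supplied
 */
static enum demo_io_action
demo_part2_action(int creator, int invalid, unsigned flags)
{
    if (!creator || !invalid)
        return DEMO_IO_NONE;
    if (flags & DEMO_L_MLOAD)
        return DEMO_IO_QUEUE_ON_ICHAIN;
    if (flags & DEMO_L_LOAD)
        return DEMO_IO_START_NOW;
    return DEMO_IO_CALLER_VALIDATES;
}
```

The queue-on-ichain case is tested first because the text gives it precedence: the immediate load applies only when the batched form was not requested.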
If NASD_CD_NZ is set on a block returned by the cache, the caller should treat the data field as if it were full of zeroes, even though the memory itself is uninitialized. If the caller modifies the data, this flag must be cleared, and any locations not explicitly overwritten must be zeroed.
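A writer's obligation under this rule can be sketched as follows. The block size, the flag value, the structure, and the function name are assumptions for the example; locking is omitted.

```c
#include <string.h>

#define DEMO_CD_NZ      0x4u  /* assumed value; stands in for NASD_CD_NZ */
#define DEMO_BLOCK_SIZE 16    /* small illustrative block size */

struct demo_block {
    unsigned data_flags;
    char buf[DEMO_BLOCK_SIZE];
};

/*
 * Overwrite len bytes at offset off. The caller guarantees that
 * off + len <= DEMO_BLOCK_SIZE. If the block is logically zero,
 * zero everything outside the written range and clear the flag.
 */
static void demo_write(struct demo_block *b, size_t off,
                       const char *src, size_t len)
{
    if (b->data_flags & DEMO_CD_NZ) {
        memset(b->buf, 0, off);
        memset(b->buf + off + len, 0, DEMO_BLOCK_SIZE - off - len);
        b->data_flags &= ~DEMO_CD_NZ;
    }
    memcpy(b->buf + off, src, len);
}
```

After the first write, the page holds real zeroes wherever the writer did not store data, so later readers no longer need to consult the flag.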
When a block is no longer of use, its external reference is surrendered by calling nasd_odc_block_release() with a pointer to the cache block. When a block's external refcount goes to zero, the cache examines its state. If the block is dirty, it is enqueued on the list nasd_odc_dirtyq. Otherwise, it is enqueued at the head of nasd_odc_lru, because it is eligible for replacement. If a later lookup makes the external refcount nonzero again, the block is removed from the dirty list or the LRU.
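The refcount transitions can be modeled with a toy state field in place of real queue membership. Everything in this sketch (types, names, the single dirty bit) is hypothetical and only mirrors the rules in the paragraph above.

```c
enum demo_where { DEMO_IN_USE, DEMO_ON_DIRTYQ, DEMO_ON_LRU };

struct demo_block {
    int refcnt;             /* external refcount */
    int dirty;              /* nonzero if the block has unwritten changes */
    enum demo_where where;  /* which list, if any, the block sits on */
};

/* Drop one external reference; on the last one, pick a list. */
static void demo_release(struct demo_block *b)
{
    if (--b->refcnt > 0)
        return;
    b->where = b->dirty ? DEMO_ON_DIRTYQ : DEMO_ON_LRU;
}

/* Take an external reference; the first one pulls the block off its list. */
static void demo_reference(struct demo_block *b)
{
    if (b->refcnt++ == 0)
        b->where = DEMO_IN_USE;
}
```

Only the transitions through zero move the block, so a block with any outstanding external reference is never a replacement candidate.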
The dirty list is maintained by the dirty block tracker. This is a simple state machine which periodically flushes dirty blocks to disk. Additionally, this state machine tracks what threads are waiting for which blocks to be written to disk, and can flush individual blocks, entire objects, or the entire in-core state to disk on demand.
A cache entry's data should only be modified by a holder of the entry's write lock. Before the entry is released, and while the write lock is still held, a call to nasd_odc_dirty_ent() with a pointer to the cache block will inform the dirty block tracker that a particular block is dirty.
To see how this fits together, let's revisit our earlier example. This time, we will fill in the code which gets the block from the cache and returns it, and instead of reading the first four bytes, we will initialize them.