Measurement, Modelling, and Analysis of Data Access Patterns

Contacts: Prof. M. Satyanarayanan, Lily Mummert, Maria Ebling

Goals

Accurate characterization of data usage in real computer systems is an important component of the Storage and Computer Systems Integration thrust area. Specifically, our goals in this work are to

o obtain a clear understanding of data usage patterns in distributed Unix environments,
o obtain insights into a number of fundamental research questions pertaining to file access in distributed systems,
o develop a body of techniques for efficient collection, post-processing, and indexing of file reference traces, and
o influence the other projects in this thrust area of the DSSC.

In the course of pursuing these goals, we have made a number of technology advances which are important contributions in themselves:

o a validated trace format that has been demonstrated to be of sufficient generality for a large number of important analyses,
o software for the generation and collection of file reference traces,
o a collection of comprehensive, high-quality, long-term file reference traces, and
o software for synthetic generation of file references, both as workloads and as benchmarks.

We have focused our efforts on distributed Unix file systems, since that is the dominant form of data usage in our environment. We look forward to collaborating with our industrial partners in conducting similar analyses in environments where data usage differs considerably.

Accomplishments

In the three years from January 1990 to December 1992, we have met or exceeded all our original goals. Specifically, we have developed a set of tools for obtaining high-quality traces, used these tools to collect long-term traces, and analyzed a number of critical questions about distributed file systems. We have also designed and made substantial progress toward the implementation of a synthetic file reference generator, and expect to complete this effort by late 1993. In the sections below, we describe each of these accomplishments in more detail.

Tracing Tools

We have built dfstrace, a system to collect long-term file reference data at the system call level in a distributed Unix workstation environment. The design of dfstrace is unique in that it pays particular attention to efficiency, portability, and the logistics of long-term data collection in such an environment. The major components of dfstrace are a set of kernel hooks, a kernel buffer mechanism, a user-level agent, a set of collection servers, and a post-processing library.

Kernel Hooks

Trace data is generated by client workstations running kernels instrumented at the system call level. We have added hooks in the file system call code, so that tracing remains transparent to users and applications. Relevant pieces of data are passed to a logging routine, which creates a record and writes it into a circular memory buffer. The agent extracts blocks of data from the buffer through a simple device driver interface.

Collection Machinery

The data is extracted by a user-level process, or agent, buffered locally in memory, and then sent to one of a small number of data collection servers, or collectors. A collector stages the data on disk; in the background, the data is sent to tape. The data is post-processed at a later time to obtain a usable set of traces for analysis. Multiple servers may be used to balance load and maintain availability. All of the communication intelligence is in user code.
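To make the flavor of this arrangement concrete, the following user-level C sketch shows a fixed-size circular buffer of trace records, a logging routine of the kind a kernel hook would call, and an extraction routine of the kind the agent invokes through the device driver interface. The record layout and all identifiers (trace_record, trace_log, trace_extract) are illustrative assumptions, not the actual dfstrace interfaces.

/*
 * A minimal, user-level sketch of the kind of circular trace buffer
 * described above.  All names and the record layout are hypothetical;
 * the real dfstrace kernel code is not shown here.
 */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define TRACE_BUF_RECORDS 1024          /* capacity of the in-memory ring */

struct trace_record {
    struct timeval time;                /* when the call occurred */
    int            opcode;              /* which system call (open, close, ...) */
    int            pid;                 /* calling process */
    char           path[64];            /* truncated pathname argument */
};

static struct trace_record ring[TRACE_BUF_RECORDS];
static unsigned head, tail;             /* producer and consumer cursors */
static unsigned long dropped;           /* records lost when the ring is full */

/* Called from each instrumented file system call ("kernel hook"). */
static void trace_log(int opcode, int pid, const char *path)
{
    if (head - tail == TRACE_BUF_RECORDS) {   /* ring full: count the loss */
        dropped++;
        return;
    }
    struct trace_record *r = &ring[head % TRACE_BUF_RECORDS];
    gettimeofday(&r->time, NULL);
    r->opcode = opcode;
    r->pid    = pid;
    strncpy(r->path, path, sizeof(r->path) - 1);
    r->path[sizeof(r->path) - 1] = '\0';
    head++;
}

/* Called by the user-level agent (through the device driver interface
 * in the real system) to drain a block of records. */
static int trace_extract(struct trace_record *out, int max)
{
    int n = 0;
    while (n < max && tail != head) {
        out[n++] = ring[tail % TRACE_BUF_RECORDS];
        tail++;
    }
    return n;                            /* number of records extracted */
}

int main(void)
{
    trace_log(1 /* open */, 42, "/usr/bin/cc");
    trace_log(2 /* close */, 42, "/usr/bin/cc");

    struct trace_record block[16];
    int n = trace_extract(block, 16);
    printf("extracted %d records, dropped %lu\n", n, dropped);
    return 0;
}

In the real system the buffer and logging routine live in the kernel; the sketch only illustrates the producer/consumer structure between the hooks and the agent.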
The agent and collector do not interpret the data; their operation is therefore independent of exactly what data is being collected.

Indexing of Traces

As the body of data we have collected grows larger, summary information of various kinds for each trace becomes necessary, so that a user confronted with 20 GB of this data has some idea where to begin. We have built an on-line index for the traces that contains, for each trace, information such as composition by system call, access characteristics, and activity levels. Although the traces themselves are archived on tape, the on-line index makes it relatively easy to identify which specific tape or tapes one is interested in.

Post-Processing Library

The goals of the post-processing library are to provide a convenient programmer interface to the traces and to implement common operations. The underlying structure of the trace is hidden behind a simple interface. Traces may be filtered in various ways, such as by opcode or user, and the library will present the user only with the records that fit the specification. The library is structured to accommodate traces of various formats (those of other researchers, as well as various versions of ours) while maintaining a consistent interface to the programmer. It also performs a good deal of bookkeeping on the trace, such as building and tracking process trees so that groups of processes may be studied in aggregate, and simulating the kernel open file table.

Trace-based Analysis

The trace data has already been valuable in determining the disk requirements for portable machines disconnected from the network, in estimating the resources required to resolve inconsistent replicas of files arising from network partitions, and in analyzing the geometry of disks. Over the summer of 1991, an undergraduate student funded through the NSF REU program conducted a comparative study of our data with earlier data gathered by other researchers.

Disk Requirements for Portable Computers

To obtain an understanding of the cache size requirements for disconnected operation of portable computers, we used the traces we had collected in simulations of the Coda cache manager. From our data, it appears that a disk of 50-60 MB should be adequate for operating disconnected for a typical workday. Of course, user activity that is drastically different from what was recorded in our traces could produce significantly different results. The actual disk size needed for disconnected operation has to be larger, since both the explicit and implicit sources of hoarding information are imperfect. Full details of this analysis have been reported in [Kistler92].

Log Space Requirements for Directory Resolution

We have used the traces to estimate the space requirements for log-based directory resolution. Full details of this work have been reported in [Kumar93]. Since a log grows linearly with work done during a partition, any realistic estimate of log size has to be derived from empirical data. The traces were used as input to a simulation of the logging component of the resolution subsystem. The simulator assumes that all activity in a trace occurs while partitioned, and maintains a history of log growth at 15-minute intervals for each volume in the system. At the end of the simulation, the average and peak log growth rates for each volume can be obtained from its history.
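The sketch below illustrates the kind of bookkeeping such a simulator performs: log growth is charged to 15-minute buckets for a volume, and average and peak hourly growth rates are computed from the resulting history. The record sizes, the one-day history capacity, and the identifiers are hypothetical; this is an illustration of the technique, not the simulator itself.

/*
 * Illustrative sketch (not the actual simulator) of tracking per-volume
 * log growth in 15-minute intervals and summarizing it as average and
 * peak hourly growth rates.
 */
#include <stdio.h>

#define INTERVAL_SECS   (15 * 60)       /* one history bucket = 15 minutes */
#define MAX_INTERVALS   (24 * 4)        /* enough buckets for one day */

struct volume_history {
    long bytes[MAX_INTERVALS];          /* log bytes appended per interval */
    int  used;                          /* number of intervals seen so far */
};

/* Charge 'bytes' of log growth to the interval containing 'time_secs'
 * (seconds from the start of the trace). */
static void log_growth(struct volume_history *v, long time_secs, long bytes)
{
    int i = (int)(time_secs / INTERVAL_SECS);
    if (i < MAX_INTERVALS) {
        v->bytes[i] += bytes;
        if (i + 1 > v->used)
            v->used = i + 1;
    }
}

/* Average growth rate in bytes per hour over the whole history. */
static double average_hourly(const struct volume_history *v)
{
    long total = 0;
    for (int i = 0; i < v->used; i++)
        total += v->bytes[i];
    return v->used ? (double)total * 4.0 / v->used : 0.0;
}

/* Peak growth in any window of four consecutive intervals (one hour). */
static long peak_hourly(const struct volume_history *v)
{
    long peak = 0;
    for (int i = 0; i + 4 <= v->used; i++) {
        long sum = v->bytes[i] + v->bytes[i+1] + v->bytes[i+2] + v->bytes[i+3];
        if (sum > peak)
            peak = sum;
    }
    return peak;
}

int main(void)
{
    struct volume_history vol = {0};

    /* Hypothetical mutating operations: (time in seconds, log record size). */
    log_growth(&vol, 10 * 60, 120);
    log_growth(&vol, 40 * 60, 80);
    log_growth(&vol, 3 * 3600, 256);     /* a burst later in the day */

    printf("average: %.1f bytes/hour, peak hour: %ld bytes\n",
           average_hourly(&vol), peak_hourly(&vol));
    return 0;
}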
Our analysis shows that long-term log growth is relatively low, averaging about 94 bytes per hour. Focusing only on the long-term average log growth rate can be misleading, however, since user activity is often bursty. To estimate the log length induced by peak activity, we examined the statistical distribution of hourly log growth rates for all volumes in our simulation. Over 94% of all data points observed were less than 1 KB, and over 99.5% were less than 10 KB. Since hourly growth is less than 10 KB in 99.5% of our data points, and since an hour-long partition could have straddled two consecutive hours of peak activity, we infer that a 20 KB log will be adequate for most hour-long partitions in our environment. More generally, a partition of N hours could have straddled N+1 consecutive hours of peak activity, so a log of 10(N+1) KB would be necessary. If a Coda server were to hold 100 volumes (a typical number at AFS installations), the total log space needed on the server would be 100 x 10(N+1) KB, or roughly (N+1) MB.

Analyzing Disk Geometry

We have extended the high-level traces described above to obtain low-level I/O traces, and have written a program that extracts disk geometry and performance characteristics for a wide variety of disks. This kind of tool is valuable for all measurement studies that employ disks, because it allows the performance of the disk to be diagnosed independent of the application and operating system. The kind of information that this tool can produce includes sector organization, defect spare locations, controller overhead, zone layout in disks with a variable number of sectors per track, track and cylinder skewing, and so on. The tool works by reading or writing sectors in a fixed pattern (such as 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...) while timing the response from the disk.
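A simplified sketch of this interrogation technique is shown below: it reads sector 0 followed by sector k, for increasing k, and times each pair, so that jumps in the measured times expose track boundaries, skew, and similar features. The 512-byte sector size, the number of probes, and the use of a raw device path supplied on the command line are assumptions of the sketch, not details of the actual tool.

/*
 * Simplified sketch: read sector 0, then sector k, for k = 0, 1, 2, ...,
 * timing each pair of requests.  All parameters are illustrative.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define SECTOR_SIZE 512
#define NPAIRS      64                  /* how many (0, k) pairs to time */

static double now(void)                 /* seconds, with microsecond resolution */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static void read_sector(int fd, long sector, char *buf)
{
    if (lseek(fd, (off_t)sector * SECTOR_SIZE, SEEK_SET) < 0 ||
        read(fd, buf, SECTOR_SIZE) != SECTOR_SIZE) {
        perror("read_sector");
        exit(1);
    }
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s raw-disk-device\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror(argv[1]);
        return 1;
    }

    char buf[SECTOR_SIZE];
    for (long k = 0; k < NPAIRS; k++) {
        double start = now();
        read_sector(fd, 0, buf);        /* reference request: sector 0 */
        read_sector(fd, k, buf);        /* probe request: sector k */
        printf("pair (0,%ld): %.3f ms\n", k, (now() - start) * 1000.0);
    }
    close(fd);
    return 0;
}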
Synthetic Reference Generation

File traces, while accurately characterizing the context in which they were collected, suffer from a number of limitations. First, there is no obvious way to scale a trace so that it represents a workload of higher or lower intensity. Second, traces tend to be voluminous and often have to be stored off-line on tape because of disk storage limitations; this is especially true of long-term traces. Third, traces can only be made of actually observed workloads. There is no way to perturb a trace to represent a slightly different workload, or a radically different one. Thus the range of "what if" questions that can be answered with a trace is limited.

To address these limitations, we have designed a synthetic reference generator, SynRGen. SynRGen provides a simple and extensible mechanism for modelling a wide variety of usage environments. Both locality of reference and data sharing among users can be modelled in SynRGen. Besides its use in stress-testing a file system, SynRGen can be used as a parameterized benchmark for evaluating local or distributed Unix file systems. SynRGen is easily portable to any file system implementing Unix semantics, and to a variety of hardware architectures and Unix platforms.

SynRGen builds a program that stochastically combines micromodels of file reference sequences. The micromodels consist of actual code, written by a modeler who observes distinctive signatures of applications in actual file reference traces. The input to SynRGen for the stochastic combination of micromodels takes the form of configuration files. We use three types of configuration files: system, volume, and user class. The system configuration file describes an entire usage environment, the volume configuration file describes the physical characteristics of a particular type of volume, and the user class configuration file describes the behavior of a particular class of users.

The mksynrgen program parses the system configuration file and builds a shell script. The mkvol program parses the volume configuration files. The mkclass program parses the user class configuration file and produces a C program which models the user behaviors described in the file.

Modelling Locality

Locality of reference occurs in both time and space. The principle of temporal locality states that recently accessed data is likely to be accessed again in the near future, while the principle of spatial locality states that data near recently accessed data is likely to be referenced soon. Locality of reference to file data can be further broken down into that which occurs within a file and that which occurs between files. SynRGen allows the experimenter to incorporate arbitrary locality behavior by providing code that models such behavior. In the case of temporal locality, SynRGen supports modelling of locality across files but not within a file. There is rarely a need for the latter, since the virtual memory system typically filters out such locality from the file system. SynRGen also allows modelling of spatial locality of reference within a file. Spatial locality across files is not modelled, since the notion is itself not well-formed.

Use of SynRGen

SynRGen may be used at three different levels; each successive level requires more knowledge about SynRGen and about the environment being modelled. The first level requires little knowledge about either: the experimenter simply runs an instance of SynRGen provided in the distribution, and may vary only the run-time values of parameters in the configuration files. The second level requires more knowledge about SynRGen's configuration files and behaviors, as well as detailed knowledge of the environment being modelled; the experimenter creates a configuration file that more closely models a particular environment using the user behaviors provided in the distribution, for example by changing the probability distribution function of either the volume choices or the action choices. At the third level, a modeler can model a new user behavior or a new type of volume; this level requires the most knowledge about SynRGen, and detailed information about the new user behavior or the new type of volume is essential.

Plans

Our efforts for 1993-94 will focus on bringing our work on modelling, measurement, and analysis of file usage to a close, and on using the tools developed in this project in new work on evaluating and improving the use of prefetching for availability in disconnected and weakly-connected environments. Specifically, we plan to complete our implementation of SynRGen and to build a collection of micromodels for it. We also plan to use it for workload generation in real file systems.
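To make the notion of a micromodel concrete, the sketch below shows, in deliberately simplified form, the kind of code a micromodel consists of and how a driver can combine micromodels stochastically. The two behaviors, their probabilities, and the file names are illustrative assumptions; this is not one of SynRGen's actual micromodels.

/*
 * Purely illustrative sketch of the micromodel idea: each micromodel is
 * ordinary C code issuing a distinctive sequence of file references, and
 * a driver picks micromodels stochastically.  The behaviors, probabilities,
 * and file names below are hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>

/* "Editor-like" signature: read a file, then rewrite it in full. */
static void edit_micromodel(const char *path)
{
    char buf[4096];
    FILE *f = fopen(path, "r");
    size_t n = f ? fread(buf, 1, sizeof(buf), f) : 0;
    if (f) fclose(f);

    f = fopen(path, "w");               /* write the "edited" version back */
    if (f) {
        fwrite(buf, 1, n, f);
        fputs("edited\n", f);
        fclose(f);
    }
}

/* "Compiler-like" signature: read a source file, write a derived file. */
static void compile_micromodel(const char *src, const char *obj)
{
    char buf[4096];
    FILE *in = fopen(src, "r");
    size_t n = in ? fread(buf, 1, sizeof(buf), in) : 0;
    if (in) fclose(in);

    FILE *out = fopen(obj, "w");
    if (out) {
        fwrite(buf, 1, n, out);         /* stand-in for generated output */
        fclose(out);
    }
}

int main(void)
{
    srand(1);                            /* fixed seed: repeatable workload */

    for (int i = 0; i < 100; i++) {
        double p = rand() / (double)RAND_MAX;
        if (p < 0.7)                     /* 70% of actions: editing */
            edit_micromodel("demo.txt");
        else                             /* 30% of actions: compiling */
            compile_micromodel("demo.c", "demo.o");
    }
    printf("issued 100 stochastic micromodel invocations\n");
    return 0;
}

In SynRGen itself, the analogue of this driver is the C program produced by mkclass, with the probability distributions over volume and action choices coming from the configuration files rather than being hard-coded.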