Q/U Release v1.0 README

Overview:
--------------------------------------------------------------------------------
This release contains a prototype implementing the Query/Update (Q/U)
protocol. For more information on the protocol, see our SOSP 2005 paper. This
prototype is the one used for that paper's experiments. While it is far from
perfect, we are releasing it in the hope that it can foster further Q/U work
and comparisons.

July 2007
-Michael Abd-El-Malek, Gregory R. Ganger, Garth R. Goodson, Michael K. Reiter,
 Jay J. Wylie

License:
--------------------------------------------------------------------------------
The Q/U source code is released under a BSD-like license. See the attached
LICENSE file. Note that we include source code from outside sources:
-RPC code from Sun (under the Sun license, similar to BSD)
-StateThreads package from SGI (dual-licensed; we choose the Mozilla Public
 License)
-MD5 hash functions from RSA Data Security (BSD-like license)
-SHA1 hash functions from Dr Brian Gladman (BSD-like license)

v1.2 Release Notes:
--------------------------------------------------------------------------------
-Fixed a bug discovered by Atul Singh (thank you!). For debugging purposes, we
 would often convert an IP address to a hostname. But our logging mechanism
 performed a reverse DNS lookup even when logging was disabled. In some
 environments, reverse DNS lookups were not cached locally and added
 substantial latency. The reverse DNS lookup is now disabled.
-Updated the code so that it compiles in 64-bit environments.

v1.1 Release Notes:
--------------------------------------------------------------------------------
-Changed the default benchmark configuration scripts so that query operations
 work: set RANDOMLY_SELECT_FID=0 to ensure that the objects used in query
 operations were created during the format phase.
-Fixed the controller scripts to use paths relative to SOURCE_DIR, instead of
 assuming the user is executing from the tools/scripts directory.

v1.0 Release Notes:
--------------------------------------------------------------------------------
-This release was tested with GCC version 4.1.2 on 2.6 Linux kernels. We have
 also compiled and run this tree on Mac OS X.
-A note about naming conventions: you will mostly see the Q/U protocol files,
 data structures, etc., named with "md" or "cw", not "qu". This stems from
 Q/U's origins: it was intended as a metadata protocol to complement the
 existing PASIS Read/Write (R/W) protocol.
-The prototype release does not include a backing store for the Q/U servers.
 Hence, an experiment's working set must fit in memory. By default, the Q/U
 servers use a 512MB data cache and a 64MB history cache. These sizes can be
 changed via the -M and -H command-line options to pasis_md_fe (see the
 example at the end of these notes).
-The Q/U server (often referred to as "the frontend") operates entirely in
 memory. Originally, our design intended to handle large objects by splitting
 them into blocks. So if you look at
 src/s4/md_fe/pasis_md_memory_versioning.cpp, you will see that we can split
 objects into blocks. But we have not exploited that much, and our experiments
 mostly run with small object sizes.
-The NFS server writes its data locally, to /tmp. One could easily change the
 NFS server to write the data (not just the metadata) using the Q/U protocol.
 However, keep in mind that the Q/U protocol provides read-modify-write
 semantics that are _stronger_ than the semantics data storage needs (just
 read-write). One could use the PASIS Read/Write protocol for data storage, or
 any of a myriad of other atomic register protocols. Nevertheless, if one
 chose to make the NFS server use the Q/U protocol for data storage, one could
 realize substantial savings by using read/write witnesses. (Contact us if
 you're interested.)
-The prototype performs all necessary cryptographic hashes and consistency
 checks. But while it can detect Byzantine behavior, it does not always handle
 it: we often assert upon catching (unexpected) Byzantine behavior. So this
 prototype is not appropriate for experiments in which servers actually
 perform Byzantine actions.
-We currently implement the following quorum types: none, threshold, and tree
 quorums. Our threshold quorum implementation is a bit simplistic: we only
 pick consecutive machines for a quorum (e.g., machines 12, 13, 14). This
 should be improved to get better load balancing; for now, we just run with
 many objects with randomly assigned IDs, on which the quorum choice is based.
 This is usually good enough for load balancing.
-Similarly, our handling of multi-object repair (which can occur during client
 or server failures, concurrency, or the use of non-preferred quorums) is not
 thoroughly tested. We have heavily tested single-object repair, but our
 experiments did not heavily exercise the multi-object repair cases.
-The source code constant PASIS_MAX_NUM_SERVERS is the maximum number of
 servers. It is currently set to 30. If you want a higher b (and hence a
 higher N), change this constant in pasisio.h. pasisio.h also specifies which
 hash function to use (MD5 or SHA1).
-Kerberos is currently disabled. But we don't think this affects our results:
 we already hash the operation (including the arguments, return values,
 timestamps, etc.). A full implementation would then also hash over the rest
 of the message, which is not much data.
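As an example of the -M and -H options mentioned above, the following would
start a Q/U server with a 1GB data cache and a 128MB history cache. This is a
sketch only: we are assuming here that both options take sizes in megabytes,
so check pasis_md_fe's usage output for the exact argument format.

   # Assumed invocation; the -M/-H units are our assumption.
   ./src/s4/md_fe/pasis_md_fe -M 1024 -H 128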
Directory structure:
--------------------------------------------------------------------------------
apps              Test programs
benchmarks        Two benchmark programs: one for the synthetic objects, and
                  one for the metadata service
common/lib        Common utilities such as precise timers, caches, and Mac OS
                  X compatibility code
common/sunrpc     RPC code (some from glibc, some rewritten)
ed_algs           Hash functions (CRC, MD5, SHA1)
kerberos          Kerberos code. Unused for now. See the Kerberos note above.
md_fs             Our NFS server that uses local disk for data storage and the
                  Q/U protocol for metadata storage
pasisio           Q/U client code
s4/md_fe          Q/U server code
tools             Imported code needed for compilation and/or running
  rpcgen          glibc Sun RPC stub generator
  state_threads   StateThreads user-level threading package

To Compile (from the pasis/src directory):
--------------------------------------------------------------------------------
If running under Mac OS X, first edit configure.in and change the value of
ST_OS from "LINUX" to "DARWIN".

autoconf           Generates the configure script from configure.in
./configure        Runs configure. Configure options:
                     --enable-debugging    Enables debugging
                     --disable-debugging   Disables debugging
                     --enable-optimize=#   Enables optimization (0 for debug,
                                           3 for optimized)
                     --help                Help
make               Compiles all libraries and applications
make clean         Removes generated files (mostly .o files, but also some
                   rpcgen'ed files)
make clean_config  Removes all configured files (Makefiles and config*)
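For example, to produce the optimized build recommended for performance
experiments (all options as listed above):

   autoconf
   ./configure --disable-debugging --enable-optimize=3
   make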
Running experiments:
--------------------------------------------------------------------------------
At a high level, there are two types of experiments you can run: using the
synthetic object service described in our paper (on which the 'counter'
experiments are based), or using our NFS server, which uses Q/U servers for
the metadata service and stores data locally. We have an automated
experimental infrastructure for running experiments with the synthetic
objects, since most of our experiments used them. Running the NFS server
experiments is slightly less automated.

We first describe how to run the synthetic operations. You can run these
experiments in one of two ways: start the clients and servers manually (i.e.,
start src/s4/md_fe/pasis_md_fe on the server machines and
src/benchmarks/pasisio_md_bench on the client machines), or use our scripts,
which automate this process. We recommend the latter approach, for two
reasons: 1) it saves you from having to start each client and server by hand
for every experiment, and 2) the scripts can sweep over a large parameter
space: e.g., varying b from 0 to 5, changing the number of outstanding
requests, the number of clients, etc. In other words, a single invocation of
a "controller" script can run hundreds or thousands of experiments. So we
explain how to run the synthetic object operations using our scripts, as
described in our paper.

-To get the best performance, make sure you compiled the tree with optimized
 settings. (Use --disable-debugging and --enable-optimize=3 for ./configure.)
-Set up password-less ssh for the user running the controller script.
 cw_benchmark_controller.pl uses ssh to remotely start a daemon
 (cw_daemon.pl) that executes experiment commands (e.g., to start clients and
 servers). This is obviously a security issue if your machines are publicly
 accessible.
-Look at src/tools/scripts/cw_benchmark_controller.pl. This Perl script has
 two sections: a configuration section at the top, and a control section at
 the bottom that is responsible for running the experiments.
-Here are the variables you will want to edit (see the sketch after this
 list):
   SOURCE_DIR - where your source tree is.
   RESULTS_DIR - where your experiment results will reside.
   NUM_RUNS - how many runs to perform for each experiment.
   POSSIBLE_B - the different values of 'b' (the number of Byzantine server
     faults tolerated) for which you wish to run experiments.
   POSSIBLE_OPERATIONS - which operations to benchmark. The operation
     definitions come later in the file (see $query_config, $update_config,
     etc.). Basically, you can define either a query or an update operation
     (see OPERATION_TYPE). For update operations, you can specify how many
     objects are involved (i.e., to get multi-object operations). You can
     specify the argument and return sizes. Leave NUM_READ_CHUNKS and
     NUM_WRITTEN_CHUNKS at 1, for read and write operations, respectively.
     You can also add some client and server compute time (in ms). Overall,
     you do not have to change the operation definitions. If you want to
     repeat our SOSP05 experiments, just use the existing definitions and
     specify "@possible_operations = (1);" to use the small-update operation
     exclusively.
   POSSIBLE_NUM_CLIENTS - the different numbers of clients for which you wish
     to run experiments.
   POSSIBLE_NUM_OUTSTANDING - the different numbers of outstanding operations
     per client for which you wish to run experiments.
   ALL_CLIENTS and ALL_SERVERS - lists of machine names to use for the
     clients and servers. If these machines have a common prefix, you can put
     that prefix in the $rack variable and just put the rest in the
     all_clients and all_servers lists.
   QUORUM_CONFIG - in case you want to use quorums besides threshold quorums.
   BENCHMARK_CONFIG - in case you want to change the experiment run times.
     When running with multiple clients, make sure that WARMUP_TIME and
     SLOWDOWN_TIME are big enough (more than a few seconds).
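To make the above concrete, here is a sketch of what an edited configuration
section might look like. This is illustrative only: the variable names are
taken from the descriptions above, but their exact spelling, case, and syntax
follow the script itself, and the values here are arbitrary examples.

   # Configuration section of src/tools/scripts/cw_benchmark_controller.pl
   # (illustrative sketch -- consult the script for the real syntax)
   $SOURCE_DIR  = "/home/you/pasis";       # where your source tree is
   $RESULTS_DIR = "/home/you/results";     # where results will be written
   $NUM_RUNS    = 3;                       # runs per experiment
   @POSSIBLE_B  = (0, 1, 2);               # values of b to sweep
   @possible_operations = (1);             # small-update op, as in SOSP05
   @POSSIBLE_NUM_CLIENTS = (1, 2, 4);      # client counts to sweep
   @POSSIBLE_NUM_OUTSTANDING = (1, 4);     # outstanding ops per client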
-Above CONTROLLER_CONFIG, there are comments on tweaks you can make to run
 without calculating authenticators, to run without transferring
 authenticators, to force a history read before every update (i.e., two
 rounds per update, instead of one in the "common" case where the OHS is
 cached), or to use random quorums instead of preferred quorums.
-Once you have configured cw_benchmark_controller.pl, run it. It will first
 ssh into each client and server and start src/tools/scripts/cw_daemon.pl. It
 then iterates over all the experiments and executes each one. The results of
 each experiment will be in RESULTS_DIR. Files whose names end in _0_cmd.out
 contain the output from each client, and files whose names end in _md_fe.out
 contain the output from the servers. You can look at the individual response
 times and throughput from the clients, or you can run
 src/tools/scripts/cw_summarize_run.pl to get the overall throughput for a
 particular experiment.
-The above describes how to run our vanilla throughput-scalability
 experiments with small updates. If you want to run concurrency experiments,
 talk to us, since we did not include the scripts for those.
-The cw_benchmark_controller-nfs-micro.pl script can gather latency numbers
 for the Q/U NFS metadata service. Disclaimer: it has not been tested since
 it was cleaned up as part of the code release. If you have difficulty with
 it, please contact us.
-If you would like to run full experiments with the Q/U NFS server, start the
 Q/U servers, then start the NFS server.
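For reference, once the tree is built and cw_benchmark_controller.pl is
configured, a scripted synthetic-object run boils down to the following.
(cw_summarize_run.pl's exact arguments are not documented here; consult the
script.)

   # Runs all configured experiments, leaving output in RESULTS_DIR.
   perl src/tools/scripts/cw_benchmark_controller.pl

   # Summarize the overall throughput for a particular experiment.
   perl src/tools/scripts/cw_summarize_run.pl ...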