Q/U Release v1.0 README

Overview:
--------------------------------------------------------------------------------
This release contains a prototype implementing the Query/Update (Q/U)
protocol. For more information on the protocol, see our SOSP 2005 paper. This
prototype is the one used for that paper's experiments. While it is far from
perfect, we are releasing it in the hope that it can foster further Q/U work
and comparisons.

July 2007
-Michael Abd-El-Malek, Gregory R. Ganger, Garth R. Goodson, Michael K. Reiter,
 Jay J. Wylie

License:
--------------------------------------------------------------------------------
The Q/U source code is released under a BSD-like license. See the attached
LICENSE file. Note that we include source code from outside sources:
-RPC code from Sun (under the Sun license, similar to BSD)
-StateThreads package from SGI (dual-licensed; we choose the Mozilla Public
 License)
-MD5 hash functions from RSA Data Security (BSD-like license)
-SHA1 hash functions from Dr Brian Gladman (BSD-like license)

v1.2 Release Notes:
--------------------------------------------------------------------------------
-Fixed a bug discovered by Atul Singh (thank you!). For debugging purposes, we
 would often convert an IP address to a hostname. But our logging mechanism
 performed a reverse DNS lookup even when logging was disabled. In some
 environments, reverse DNS lookups were not cached locally and added
 substantial latency. The reverse DNS lookup is now disabled.
-Updated the code so that it compiles in 64-bit environments.

v1.1 Release Notes:
--------------------------------------------------------------------------------
-Changed the default benchmark configuration scripts so that query operations
 work: set RANDOMLY_SELECT_FID=0 to ensure that the objects used in query
 operations were created during the format phase.
-Fixed the controller scripts to use paths relative to SOURCE_DIR, instead of
 assuming the user is executing from the tools/scripts directory.

v1.0 Release Notes:
--------------------------------------------------------------------------------
-This release was tested with GCC version 4.1.2 on 2.6 Linux kernels. We have
 also compiled and run this tree on Mac OS X.
-A note about naming conventions: you will mostly see the Q/U protocol files,
 data structures, etc., named with "md" or "cw", not "qu". This stems from
 Q/U's origins: it was intended as a metadata protocol to complement the
 existing PASIS Read/Write (R/W) protocol.
-The prototype release does not include a backing store for the Q/U servers.
 Hence, an experiment's working set must fit in memory. By default, the Q/U
 servers use a 512MB data cache and a 64MB history cache. These sizes can be
 changed via the -M and -H command-line options to pasis_md_fe (see the
 example at the end of these notes).
-The Q/U server (often referred to as "the frontend") operates entirely in
 memory. Originally, our design intended to handle large objects by splitting
 them into blocks. So if you look at
 src/s4/md_fe/pasis_md_memory_versioning.cpp, you will see that we can split
 objects into blocks. But we have not exploited that much, and our experiments
 mostly run with small object sizes.
-The NFS server writes its data locally, to /tmp. One could easily change the
 NFS server to write the data (not just the metadata) using the Q/U protocol.
 However, keep in mind that the Q/U protocol provides read-modify-write
 semantics that are _stronger_ than the semantics data storage needs (just
 read-write). One could use the PASIS Read/Write protocol for data storage, or
 any of a myriad of other atomic register protocols. Nevertheless, if one
 chose to make the NFS server use the Q/U protocol for data storage, one could
 realize substantial savings by using read/write witnesses. (Contact us if
 you're interested.)
-The prototype performs all necessary cryptographic hashes and consistency
 checks. But while it can detect Byzantine behavior, it does not always handle
 it: we often assert upon catching (unexpected) Byzantine behavior. So this
 prototype is not appropriate for experiments in which servers actually
 perform Byzantine actions.
-We currently implement the following quorum types: none, threshold, and tree
 quorums. Our threshold quorum implementation is a bit simplistic: we only
 pick consecutive machines for a quorum (e.g., machines 12, 13, 14). This
 should be improved to get better load balancing; for now, we just run with
 many objects with randomly assigned IDs, on which the quorum choice is based.
 This is usually good enough for load balancing.
-Similarly, our handling of multi-object repair (which can occur during client
 or server failures, concurrency, or the use of non-preferred quorums) is not
 thoroughly tested. We have heavily tested single-object repair, but our
 experiments did not heavily exercise the multi-object repair cases.
-The source code constant PASIS_MAX_NUM_SERVERS is the maximum number of
 servers. It is currently set to 30. If you want a higher b (and hence a
 higher N), change this constant in pasisio.h. pasisio.h also specifies which
 hash function to use (MD5 or SHA1).
-Kerberos is currently disabled. But we don't think this affects our results:
 we already hash the operation (including the arguments, return values,
 timestamps, etc.). A full implementation would then also hash over the rest
 of the message, which is not much data.
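As an example of the -M and -H options mentioned above, the following would
start a Q/U server with a 1GB data cache and a 128MB history cache. This is a
sketch only: we are assuming here that both options take sizes in megabytes,
so check pasis_md_fe's usage output for the exact argument format.

   # Assumed invocation; the -M/-H units are our assumption.
   ./src/s4/md_fe/pasis_md_fe -M 1024 -H 128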
Directory structure:
--------------------------------------------------------------------------------
apps              Test programs
benchmarks        Two benchmark programs: one for the synthetic objects, and
                  one for the metadata service
common/lib        Common utilities such as precise timers, caches, and Mac OS
                  X compatibility code
common/sunrpc     RPC code (some from glibc, some rewritten)
ed_algs           Hash functions (CRC, MD5, SHA1)
kerberos          Kerberos code. Unused for now. See the Kerberos note above.
md_fs             Our NFS server that uses local disk for data storage and the
                  Q/U protocol for metadata storage
pasisio           Q/U client code
s4/md_fe          Q/U server code
tools             Imported code needed for compilation and/or running
  rpcgen          glibc Sun RPC stub generator
  state_threads   StateThreads user-level threading package

To Compile (from the pasis/src directory):
--------------------------------------------------------------------------------
If running under Mac OS X, first edit configure.in and change the value of
ST_OS from "LINUX" to "DARWIN".

autoconf           Generates the configure script from configure.in
./configure        Runs configure. Configure options:
                     --enable-debugging    Enables debugging
                     --disable-debugging   Disables debugging
                     --enable-optimize=#   Enables optimization (0 for debug,
                                           3 for optimized)
                     --help                Help
make               Compiles all libraries and applications
make clean         Removes generated files (mostly .o files, but also some
                   rpcgen'ed files)
make clean_config  Removes all configured files (Makefiles and config*)
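For example, to produce the optimized build recommended for performance
experiments (all options as listed above):

   autoconf
   ./configure --disable-debugging --enable-optimize=3
   make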
Running experiments:
--------------------------------------------------------------------------------
At a high level, there are two types of experiments you can run: using the
synthetic object service described in our paper (on which the 'counter'
experiments are based), or using our NFS server, which uses Q/U servers for
the metadata service and stores data locally. We have an automated
experimental infrastructure for running experiments with the synthetic
objects, since most of our experiments used them. Running the NFS server
experiments is slightly less automated.

We first describe how to run the synthetic operations. You can run these
experiments in one of two ways: start the clients and servers manually (i.e.,
start src/s4/md_fe/pasis_md_fe on the server machines and
src/benchmarks/pasisio_md_bench on the client machines), or use our scripts,
which automate this process. We recommend the latter approach, for two
reasons: 1) it saves you from having to start each client and server by hand
for every experiment, and 2) the scripts can sweep over a large parameter
space: e.g., varying b from 0 to 5, changing the number of outstanding
requests, the number of clients, etc. In other words, a single invocation of
a "controller" script can run hundreds or thousands of experiments. So we
explain how to run the synthetic object operations using our scripts, as
described in our paper.

-To get the best performance, make sure you compiled the tree with optimized
 settings. (Use --disable-debugging and --enable-optimize=3 for ./configure.)
-Set up password-less ssh for the user running the controller script.
 cw_benchmark_controller.pl uses ssh to remotely start a daemon
 (cw_daemon.pl) that executes experiment commands (e.g., to start clients and
 servers). This is obviously a security issue if your machines are publicly
 accessible.
-Look at src/tools/scripts/cw_benchmark_controller.pl. This Perl script has
 two sections: a configuration section at the top, and a control section at
 the bottom that is responsible for running the experiments.
-Here are the variables you will want to edit (see the sketch after this
 list):
   SOURCE_DIR - where your source tree is.
   RESULTS_DIR - where your experiment results will reside.
   NUM_RUNS - how many runs to perform for each experiment.
   POSSIBLE_B - the different values of 'b' (the number of Byzantine server
     faults tolerated) for which you wish to run experiments.
   POSSIBLE_OPERATIONS - which operations to benchmark. The operation
     definitions come later in the file (see $query_config, $update_config,
     etc.). Basically, you can define either a query or an update operation
     (see OPERATION_TYPE). For update operations, you can specify how many
     objects are involved (i.e., to get multi-object operations). You can
     specify the argument and return sizes. Leave NUM_READ_CHUNKS and
     NUM_WRITTEN_CHUNKS at 1, for read and write operations, respectively.
     You can also add some client and server compute time (in ms). Overall,
     you do not have to change the operation definitions. If you want to
     repeat our SOSP05 experiments, just use the existing definitions and
     specify "@possible_operations = (1);" to use the small-update operation
     exclusively.
   POSSIBLE_NUM_CLIENTS - the different numbers of clients for which you wish
     to run experiments.
   POSSIBLE_NUM_OUTSTANDING - the different numbers of outstanding operations
     per client for which you wish to run experiments.
   ALL_CLIENTS and ALL_SERVERS - lists of machine names to use for the
     clients and servers. If these machines have a common prefix, you can put
     that prefix in the $rack variable and just put the rest in the
     all_clients and all_servers lists.
   QUORUM_CONFIG - in case you want to use quorums besides threshold quorums.
   BENCHMARK_CONFIG - in case you want to change the experiment run times.
     When running with multiple clients, make sure that WARMUP_TIME and
     SLOWDOWN_TIME are big enough (more than a few seconds).
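To make the above concrete, here is a sketch of what an edited configuration
section might look like. This is illustrative only: the variable names are
taken from the descriptions above, but their exact spelling, case, and syntax
follow the script itself, and the values here are arbitrary examples.

   # Configuration section of src/tools/scripts/cw_benchmark_controller.pl
   # (illustrative sketch -- consult the script for the real syntax)
   $SOURCE_DIR  = "/home/you/pasis";       # where your source tree is
   $RESULTS_DIR = "/home/you/results";     # where results will be written
   $NUM_RUNS    = 3;                       # runs per experiment
   @POSSIBLE_B  = (0, 1, 2);               # values of b to sweep
   @possible_operations = (1);             # small-update op, as in SOSP05
   @POSSIBLE_NUM_CLIENTS = (1, 2, 4);      # client counts to sweep
   @POSSIBLE_NUM_OUTSTANDING = (1, 4);     # outstanding ops per client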
-Above CONTROLLER_CONFIG, there are comments on tweaks you can make to run
 without calculating authenticators, to run without transferring
 authenticators, to force a history read before every update (i.e., two
 rounds per update, instead of one in the "common" case where the OHS is
 cached), or to use random quorums instead of preferred quorums.
-Once you have configured cw_benchmark_controller.pl, run it. It will first
 ssh into each client and server and start src/tools/scripts/cw_daemon.pl. It
 then iterates over all the experiments and executes each one. The results of
 each experiment will be in RESULTS_DIR. Files whose names end in _0_cmd.out
 contain the output from each client, and files whose names end in _md_fe.out
 contain the output from the servers. You can look at the individual response
 times and throughput from the clients, or you can run
 src/tools/scripts/cw_summarize_run.pl to get the overall throughput for a
 particular experiment.
-The above describes how to run our vanilla throughput-scalability
 experiments with small updates. If you want to run concurrency experiments,
 talk to us, since we did not include the scripts for those.
-The cw_benchmark_controller-nfs-micro.pl script can gather latency numbers
 for the Q/U NFS metadata service. Disclaimer: it has not been tested since
 it was cleaned up as part of the code release. If you have difficulty with
 it, please contact us.
-If you would like to run full experiments with the Q/U NFS server, start the
 Q/U servers, then start the NFS server.
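For reference, once the tree is built and cw_benchmark_controller.pl is
configured, a scripted synthetic-object run boils down to the following.
(cw_summarize_run.pl's exact arguments are not documented here; consult the
script.)

   # Runs all configured experiments, leaving output in RESULTS_DIR.
   perl src/tools/scripts/cw_benchmark_controller.pl

   # Summarize the overall throughput for a particular experiment.
   perl src/tools/scripts/cw_summarize_run.pl ...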