PARALLEL DATA LAB 

PDL Abstract

Routing, Disjoint Paths, and Classification

Carnegie Mellon University Parallel Data Lab Ph.D. Dissertation CMU-PDL-06-109, August 2006.

Shuheng Zhou

Parallel Data Laboratory
School of Computer Science & Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213

http://www.pdl.cmu.edu/

In this thesis, we study two classes of problems: routing and classification. Routing problems include those that concern the tradeoff between routing table size and short-path forwarding (Part I), and the classic Edge Disjoint Paths problem (Part II). Both have applications in communication networks, especially in overlay network, and in large and high-speed networks, such as optical networks. The third part of this thesis concerns a type of classification problem that is motivated by a computational biology problem, where it is desirable that a small amount of genotype data from each individual is sufficient to classify individuals according to their populations of origin.

In hierarchical routing, we obtain “near-optimal” routing table size and path stretch through a randomized hierarchical decomposition scheme in the metric space induced by a graph. We say that a metric ( X,d) has doubling dimension dim(X) at most α if every set of diameter D can be covered by 2α sets of diameter D/2. (A doubling metric is one whose doubling dimension dim(X) is a constant.) For a connected graph G, whose shortest path distances dG induce the doubling metric (X, dG), we show how to perform (1+τ) -stretch routing on G for any 0< t ≤ 1 with routing tables of size at most (α/τ)O(α) logΔlogδ bits with only (α/τ)O(α) logΔ entries, where Δ is the diameter of G and δ is the maximum degree of G. Hence, the number of routing table entries is just τ-O(1)logΔ for doubling metrics.

The Edge Disjoint Paths (EDP) problem in undirected graphs refers to the following: Given a graph G with n nodes and a set Τ of pairs of terminals, connect as many terminal pairs as possible using paths that are mutually edge disjoint. This leads to a variety of classic NP-complete problems, for which approximability is not well understood. We show a polylogarithmic approximation algorithm for the undirected EDP problem in general graphs with a moderate restriction on graph connectivity: we require the global minimum cut of G to be Ω(log5n). Previously, constant or polylogarithmic approximation algorithms were known for trees with parallel edges, expanders, grids and grid-like graphs, and, most recently, even-degree planar graphs. These graphs either have special structure (e.g., they exclude minors) or there are large numbers of short disjoint paths. Our algorithm extends previous techniques in that it applies to graphs with high diameters and asymptotically large minors.

In the classification problem, we are given a set of 2N diploid individuals from population P1 and P2 (with no admixture), and a small amount of multilocus genotype data from the same set of K loci for all 2N individuals, and we aim to partition P1 and P2 perfectly. Each population Pa, where a∈{1,2}, is characterized by a set of allele frequencies at each locus. In our model, given the population of origin of each individual, the genotypes are assumed to be generated by drawing alleles independently at random across the K loci, each from its own distribution. For example, each SNP (or Single Nucleotide Polymorphism) has two alleles, which we denote with bit 1 and bit 0 respectively. In addition, each locus contains two bits (one from each parent) that are assumed to be two random draws from the same Bernoulli distribution.

We use and , for all k = 1, ..., K to denote frequency of an allele mapping to bit 1 at locus k in P1 and P2, respectively. We use γ = (-)2 / K as the dissimilarity measure between P1 and P2. We compute the number of loci K that we need to perform different tasks, versus N and γ, and prove several theorems. Ultimately, we show that with probability 1-1/poly(N), given that K= Ω(logN log logN /2) and K= Ω(logN / γ), we can recognize the perfect partition (P1,P2) from among all other balanced partitions of the 2N individuals. We proved this theorem for two cases: either we are given two random draws for each attribute along each dimension, or only one.

FULL THESIS: pdf/ ps