*Carnegie
Mellon University Parallel Data Lab Ph.D. Dissertation CMU-PDL-06-109, August
2006. *

**Shuheng Zhou ^{}**

Parallel Data Laboratory

School of Computer Science & Electrical and Computer Engineering

Carnegie Mellon University

Pittsburgh, PA 15213

In this thesis, we study two classes of problems: routing and classification. Routing problems include those that concern the tradeoff between routing table size and short-path forwarding (Part I), and the classic Edge Disjoint Paths problem (Part II). Both have applications in communication networks, especially in overlay network, and in large and high-speed networks, such as optical networks. The third part of this thesis concerns a type of classification problem that is motivated by a computational biology problem, where it is desirable that a small amount of genotype data from each individual is sufficient to classify individuals according to their populations of origin.

In hierarchical routing, we obtain “near-optimal” routing table size and path
stretch through a randomized hierarchical decomposition scheme in the metric
space induced by a graph. We say that a metric ( *X,d*)
has *doubling dimension* dim(*X*)
at most α if every set of diameter D can be covered by 2^{α} sets of diameter *D*/2. (A *doubling metric* is one whose doubling dimension dim(*X*)
is a
constant.) For a connected graph *G, * whose shortest path distances *d _{G}* induce the
doubling metric (

The Edge Disjoint Paths (EDP) problem in undirected graphs refers to the following:
Given a graph *G* with *n* nodes and a set Τ of pairs of terminals, connect
as many terminal pairs as possible using paths that are mutually edge disjoint.
This leads to a variety of classic NP-complete problems, for which approximability is not well understood. We show a polylogarithmic approximation algorithm
for the undirected EDP problem in general graphs with a moderate restriction on
graph connectivity: we require the global minimum cut of *G* to be Ω(log^{5}*n*). Previously,
constant or polylogarithmic approximation algorithms were known for trees
with parallel edges, expanders, grids and grid-like graphs, and, most recently, even-degree
planar graphs. These graphs either have special structure (e.g., they exclude
minors) or there are large numbers of short disjoint paths. Our algorithm extends
previous techniques in that it applies to graphs with high diameters and asymptotically
large minors.

In the classification problem, we are given a set of 2*N* diploid individuals from
population *P*_{1} and *P*_{2} (with no admixture), and a small amount of multilocus genotype
data from the same set of *K* loci for all 2*N* individuals, and we aim to partition *P*_{1} and *P*_{2} perfectly. Each population *P*_{a}, where *a*∈{1,2}, is characterized by a
set of allele frequencies at each locus. In our model, given the population of origin
of each individual, the genotypes are assumed to be generated by drawing alleles
independently at random across the *K* loci, each from its own distribution. For example,
each SNP (or Single Nucleotide Polymorphism) has two alleles, which we
denote with bit 1 and bit 0 respectively. In addition, each locus contains two bits
(one from each parent) that are assumed to be two random draws from the same
Bernoulli distribution.

We use and , for all *k *= 1, ...,* K* to denote frequency of an allele mapping to bit
1 at locus k in *P*_{1} and *P*_{2}, respectively. We use γ = (-)^{2} */* *K* as the dissimilarity
measure between P_{1} and P_{2}. We compute the number of loci *K* that we need to perform different tasks, versus *N* and γ, and prove several theorems. Ultimately,
we show that with probability
1-1/poly(*N*), given that *K=* Ω(log*N* log log*N* */* Nγ^{2}) and *K=* Ω(log*N / * γ), we can *recognize* the perfect partition (*P*_{1},*P*_{2}) from among all other
balanced partitions of the 2*N* individuals. We proved this theorem for two cases:
either we are given two random draws for each attribute along each dimension, or
only one.