# CMU Randomized Algorithms

Randomized Algorithms, Carnegie Mellon: Spring 2011

## Lecture #18: Oblivious routing on a hypercube

In celebration of Les Valiant’s winning of the Turing award, today we discussed a classic result of his on the problem of oblivious routing on the hypercube. (Valiant, “A scheme for fast parallel communication”, SIAM J. Computing 1982, and Valiant and Brebner, “Universal schemes for parallel communication”, STOC 1981). The high-level idea is that rather than routing immediately to your final destination, go to a random intermediate point first. The analysis is really beautiful — you define the right quantities and it just magically works out nicely. See today’s class notes. We also discussed the Butterfly and Benes networks as well. (See the excellent notes of Satish Rao for more on them).

At the end, we also briefly discussed Martingales: their definition, Azuma’s inequality, and McDiarmid’s inequality (which doesn’t talk about Martingales directly but is very convenient and can be proven using Azuma). The discussion at the end was extremely rushed but the key point was: suppose you have a complicated random variable $\phi$ you care about like “the running time of my randomized algorithm” where the random variable is a function of a series of random choices $z_1, z_2, ....$ Then the sequence $X_0, X_1, \ldots$ where $X_i = E[\phi | z_1, \ldots, z_i]$ is a Martingale. E.g., the expected running time of quicksort given that the first $i-1$ pivots are $z_1, \ldots z_{i-1}$ is the expected value, over the possible choices of $z_i$ of the expected running time of quicksort given that the first $i$ pivots are $z_1, \ldots, z_i$.

## Lecture #16: Odds and ends

Some examples for the CKR decomposition scheme

Varun asked why the CKR algorithm differs in so many ways from Bartal’s procedure:

• it chooses a single radius instead of choosing an independent radius for each piece it carves out,
• it chooses the radius from a uniform distribution,
• it picks the next vertex randomly (instead of arbitrarily), and moreover,
• it also grows regions from vertices that have been captured in previously carved out pieces.

Here are some examples that might clarify the situation:

Suppose we choose a random radius R uniformly from [r/4,r/2], but independently for each center. Then in the following example, each time a leaf is picked, the unit-length edge (u,v) will be cut with probability 2/r. If there are n leaves, the edge will be cut with probability about Theta(n/r) >> O(log n)/r.

Example 1

Now, suppose we choose a single radius R uniformly from [r/4,r/2], but choose the centers in arbitrary (adversarial) order. In the following example, if we choose the centers in the order l_{r/2-1}, …, l_2, l_1, then we cut the edge (u,v) for sure (with probability 1). Note that in this case, the random order means there’s a good chance that early on in the process, we pick some vertex l_i with some small value of $i$. Since this leaf is close to the edge (u,v), it will put both u and v in the same set.

Example 2

So yeah, once we go with a uniformly random choice of R (instead of choosing it from a geometric distribution), we run into trouble if we re-sample R independently each time time we grow a region, or if we choose the next vertex to grow from in an worst-case fashion.

Finally, what about the growing balls from centers that have already been carved out? That is important for the analysis, because it makes the process depend very mildly on the past evolution. Would the algorithm break down if we did not do that? I am blanking on a bad example right now, but I think there should be one. Let me know if you see it.

Some citations for k-server

In STOC 2008, Cote, Meyerson, and Poplawski gave a randomized algorithm for the k-server problem on certain special kinds of HSTs that achieved a poly-logarithmic competitive ratio. In SODA 2009, Bansal, Buchbinder, and Naor abstracted out certain “convexity” properties, which if we could prove, would give a polylogarithmic competitive ratio for general HSTs, and hence for all metric spaces using the Theorems we saw in class. (See also their ICALP 2009 paper.) It’d be great to make progress on this problem.

## Lecture #16: Distance-preserving trees (part II)

1. Embeddings into Distributions over Trees

In this section, we prove the following theorem using tree embeddings (and then, in the following section, we improve it further to ${O(\log n)}$).

Theorem 1 Given any metric ${(V,d)}$ with ${|V| = n}$ and aspect ratio ${\Delta}$, there exists a efficiently sampleable distribution ${\mathcal{D}}$ over spanning trees of ${V}$ such that for all ${u,v\in V}$:

1. For all ${T \in \textrm{Support}(\mathcal{D})}$, ${d_T(u,v) \geq d(u,v)}$, and
2. ${\mathop{\mathbb E}_{T\sim \mathcal{D}}[d_T(u,v)] \leq O(\log n \log \Delta) \; d(u,v)}$.

To prove this theorem, we will use the idea of a low diameter decomposition. Given a metric space ${(V, d)}$ on ${|V| = n}$ points and a parameter ${r \in {\mathbb R}_+}$, a (randomized) low-diameter decomposition is an efficiently sampleable probability distribution over partitions of ${V}$ into ${S_1 \uplus S_2 \uplus S_3 \uplus \dots \uplus S_t}$ such that

1. (Low Radius/Diameter) For all ${S_i}$, there exists ${c_i \in V}$ such that for all ${u \in S_i}$, ${d(c_i, u) \leq r/2}$. Hence, for any ${u, v \in S_i}$, ${d(u,v) \leq r}$.
2. (Low Cutting Probability) For each pair ${u,v}$, ${\Pr[u,v \text{ lie in different } S_i's ] \leq \beta \; \frac{d(u,v)}{r}}$ with ${\beta = O(\log n)}$.

We’ll show how to construct such a decomposition in the next section (next lecture), and use such a decomposition to prove Theorem 1.

Consider the following recursive algorithm, which takes as input a pair ${(U,i)}$ where ${U \subseteq V}$ is a set of vertices of diameter at most ${2^i}$, and returns a rooted tree ${(T,r)}$.

TreeEmbed${(U,i)}$:

1. Apply the low-diameter decomposition to ${(U,d)}$ with the parameter ${r = 2^{i-1}}$ to get the partition ${S_1,\ldots,S_t}$.
2. Recurse: Let ${(T_j,root_j) \leftarrow}$ TreeEmbed(${S_j,i-1}$). As a base case, when ${S_i}$ is a single point, simply return that point.
3. For every tree ${T_j}$ with ${j > 1}$, add the edge ${(root_1,root_j)}$ with length ${2^i}$. This is a new tree which we denote ${T}$.
4. Return the tree/root pair ${(T,root_1)}$.

Recall that since the low diameter decomposition is randomized, this algorithm defines a distribution over trees over ${U}$. To build the tree for ${V}$, we first rescale so that for all ${u,v \in V}$, ${d(u,v) \geq 1}$ and ${d(u,v) \leq \Delta \approx 2^\delta}$. We define the distribution ${\mathcal{D}}$ as the one obtained by calling TreeEmbed${(V,\delta)}$.

Lemma 2 For all ${x,y \in V}$, ${d_T(x,y) \geq d(x,y)}$ for all ${T \in support(\mathcal{D})}$.

Proof: Fix ${x}$ and ${y}$, and let ${i}$ be such that ${d(x,y) \in (2^{i-1},2^i]}$. Consider the invocation of TreeEmbed${(U,i)}$ such that ${x \in U}$. First, we examine the case in which ${y \in U}$. By the definition of the low diameter decomposition, since ${d(x,y) > 2^{i-1}}$, ${x}$ and ${y}$ will fall into separate parts of the partition obtained in Step 1, and so we will have ${d_T(x,y) \geq 2^i}$, the length of the edge placed between different subtrees. In the case in which ${y \not\in U}$, then it must be that ${x}$ and ${y}$ have been separated at a higher level ${i'}$ of the recursion, are consequently separated by a higher subtree edge, and hence ${d_T(x,y) \geq 2^{i'} > 2^i}$. $\Box$

Lemma 3 For all ${x,y \in V}$, ${E_{T \sim \mathcal{D}}[d_T(x,y)] \leq d(x,y)\cdot O(\log \Delta \, \log n)}$

Proof: We begin with two easy subclaims. Suppose ${(T_U,root) \leftarrow}$ TreeEmbed${(U,i)}$:

1. Claim 1: ${d_{T_U}(root,x) \leq 2^{i+1}}$ for all ${x \in U}$. By induction, ${x}$ lies in some piece ${S_j}$ of the partition having diameter at most ${2^{i-1}}$ and hence inductively is at distance at most ${2^{i-1+1}}$ from its root ${root_j}$. That root is connected to the root ${root}$ by an intertree edge of weight ${2^i}$, giving us ${2^{i+1}}$ in total.
2. Claim 2: If ${x,y \in U}$, then ${d_{T_U}(x,y) \leq 2\cdot 2^{i+1}}$. From the previous claim, each ${x}$ and ${y}$ is at distance at most ${2^{i+1}}$ from ${root}$, distances are symmetric, and the triangle inequality applies.

We now have from the definition:

$\displaystyle \begin{array}{rcl} d_T(x,y) &\leq& \sum_{i=\delta}^0\Pr[(x,y)\textrm{ first separated at level i}]\cdot 4\cdot 2^i \\ &\leq& \sum_{i=\delta}^0 \beta\;\frac{d(x,y)}{2^{i-1}} \cdot 2^i\cdot 4 \\ &=& (\delta +1)\; 8\beta\, d(x,y) \end{array}$

where the first inequality follows from our subclaims, the second follows from the property of the low diameter decomposition. Setting ${\beta = O(\log n)}$ and ${\delta = O(\log \Delta)}$ completes the proof. $\Box$

The two lemmas above prove Theorem 1. How do we implement these low diameter decompositions? And how can we get the promised ${O(\log n)}$? Keep reading…

2. Low Diameter Decompositions

Recall the definition of a (randomized) low-diameter decomposition from above: given a metric ${(V,d)}$ and a bound ${r}$, we want a partition with pieces of radius at most ${r/2}$, and want vertices to be separated with “small” probability ${\beta \frac{d(x,y)}{r}}$ (i.e., proportional to their distance, and inversely proportional to ${r}$).

Before we proceed, think about how you’d get such a decomposition for a line metric, or a tree metric, with ${\beta = O(1)}$; moreover, you cannot hope to get subconstant ${\beta = o(1)}$ for even the line. So the theorem says that general graphs lose a factor ${O(\log n)}$ more, which is not bad at all! (And this factor is existentially optimal, we will show a tight example.)

2.1. Algorithm I: Geometric Region Growing

To make our life easier, we’ll assume that all distances in the metric are at least ${r/n^2}$. (We can enforce this via a pre-processing without much effort, I’ll come back to it.)

The algorithm just picks a “truncated” geometric distance ${R}$, carves out a piece of radius ${R}$ around some vertex, and repeats until the metric is eaten up.

Geom-Regions${(V, d, r)}$:

1. Choose ${R \sim \mathtt{Geom}(\frac{4 \ln n}{r})}$; if ${R > r/2}$, then set ${R \gets r/2}$.
2. Pick an arbitrary vertex ${u \in V}$, and set ${S_1 \gets \{ v \in V \mid d(u,v) \leq R \}}$
3. Return ${\{ S_1 \} \cup }$ Geom-Regions${(V \setminus S_1, d, r)}$.

Clearly, the radius bound is maintained by the fact that ${X \leq r/2}$ with probability ${1}$.

What’s the chance that ${x,y}$ lie in separate parts? So let’s view this process as picking a vertex ${u}$ and starting with a ball of radius zero around it; then we flip a coin with bias ${p = r/(4 \ln n)}$, increasing the radius by one after each tails, until either we see a heads or we reach ${r/2}$ tails, when we cut out the piece. And then we pick another vertex, and repeat the process.

Consider the first time when one of these lies in the current ball. Note that either this ball will eventually contain both of them, or will separate them. And to separate them, it must make a cut within the next ${d(x,y)}$ steps. The chance of this is at most the chance of seeing a heads from a bias-${p}$ coin in ${d(x,y)}$ steps, plus the chance that a ${\mathtt{Geom}(p)}$ r.v. sees more than ${(2 \ln n)/p}$ tails in a row. Using a naive union bound for the former, we get

$\displaystyle d(x,y) \cdot p + (1 - p)^{(2 \ln n)/p} \leq \frac{d(x,y)}{r} \cdot 2 \ln n + \frac{1}{n^2}.$

We now use the fact that all distances are at least ${r/n^2}$ to claim that ${1/n^2 \leq \frac{d(x,y)}{r}}$ and hence the probability of ${x,y}$ separated is at most ${(4 \ln n + 1) d(x,y)/r}$, which proves the second property of the decomposition.

Finally, the loose ends: to enforce the minimum-distance condition that ${d(x,y) \geq r/n^2}$, just think of the metric as a complete graph with edge-lengths ${\ell_{xy} = d(x,y)}$, contract all edges ${(x,y)}$ with ${d(x,y) < r/n^2}$, and recompute edge lengths to get the new metric ${d' \leq d}$. Running the decomposition Geom-Regions${(V, d', r/2)}$ on this shrunk metric, and then unshrinking the edges, will ensure that each pair is separated with probability either ${0}$ (if it has length ${< r/n^2}$), or probability at most ${(8 \ln n + 2) d(x,y)/r}$. And finally, since the output had radius at most ${r/4}$ according to ${d'}$, any path has at most ${n}$ nodes and its length can change by at most ${n\cdot r/n^2 \leq r/4}$ for ${n \geq 4}$, the new radius is at most ${r/2}$!.

Another advantage of this shrinking preprocessing: a pair ${x,y}$ is separated only when ${r \leq d(x,y) \cdot n^2}$, and it is separated for sure when ${r \geq d(x,y)}$. Using this observation in the calculation from the previous section can change the ${O(\log n \log \Delta)}$ to just ${O(\log n \min(\log n, \log \Delta))}$. But to get the ultimate ${O(\log n)}$ guarantee, we’ll need a different decomposition procedure.

2.2. Algorithm II: The CKR Decomposition

Theorem 4 (The Better Decomposition) There exists an efficiently sampleable probability distribution ${\mathcal{D}}$ over partitions with parts having radius at most ${r/2}$ such that

$\displaystyle \Pr[u,v\ \textrm{separated by the partition}] \leq \frac{4\;d(u,v)}{r}\log\left(\frac{|Ball(u,r/2)|}{|Ball(u,r/8)|}\right)$

where ${Ball(x,r) = \{ y : d(x,y) \leq r \}}$.

The procedure for the decomposition is a little less intuitive, but very easy to state:

CKR Decomposition${(V, d, r)}$:

1. Choose ${R \in_R [\frac{r}4, \frac{r}{2}]}$ uniformly at random.
2. Choose a random permutation ${\pi\!: V \rightarrow V}$ uniformly at random.
3. Consider the vertices one by one, in the order given by ${\pi}$. When we consider ${w}$, we assign all the yet-unassigned vertices ${v}$ with ${d(w,v) \leq R}$ to ${w}$‘s partition.

For example, suppose the ordering given by ${\pi}$ is ${v_1, v_2, v_3, v_4}$. The figure below illustrates the coverage when the vertices are visited by this process.

This construction directly implies the low-radius property, restated in the following claim.

Lemma 5 (Low Radius) The output of the algorithm has the property that for all ${S_i}$, there exists ${c_i \in V}$ such that for all ${u \in S_i}$, ${d(c_i, u) \leq r/2}$.

The real work is in showing that for each pair ${(u,v)}$, it is separated with small probability. Before proving this, let us state two definitions useful for the proof. For the analysis only: suppose we re-number the vertices ${w_1, w_2, \dots, w_n}$ in order of the distance from the closer of ${u,v}$.

• (Settling) At some time instant in this procedure, one (or both) of ${u}$ or ${v}$ gets assigned to some ${w_i}$. We say that ${w_i}$ settles the pair ${(u,v)}$
• (Cutting) At the moment the pair is settled, if only one vertex of this pair is assigned, then we say that ${w_i}$ cuts the pair ${(u,v)}$.

According to these definitions, each pair is settled at exactly one time instant in the procedure, and it may or may not be cut at that time. Of course, once the pair is settled (with or without being cut), it is never cut in the future.

Now to bound the separation probability. Consider ${w_j}$, and let ${d(w_j,u)=a_j}$ and ${d(w_j,v)=b_j}$. Assume ${a_j (the other case is identical). If ${w_j}$ cuts ${(u,v)}$ when the random values are ${R}$ and ${\pi}$, the following two properties must hold:

1. The random variable ${R}$ must lie in the interval ${[a_j,b_j]}$ (else either none or both endpoints of ${e}$ would get marked).
2. The node ${w_j}$ must come before ${w_1,\dots, w_{j-1}}$ in the permutation ${\pi}$.

Suppose not, and one of them came before ${w_j}$ in the permutation. Since all these vertices are closer to the pair than ${w_j}$ is, then for the current value of ${R}$, they would have settled the pair (either capturing one or both of the endpoints) at some previous time point, and hence ${w_j}$ would not settle—and hence not cut—the pair ${(u,v)}$.

With these two properties, we establish

$\displaystyle \begin{array}{rl} \Pr[\text{pair } (u,v) \text{ is separated}] &= \sum_j \Pr[w_j\text{ cuts the pair } (u,v)] \\ &\leq \sum_j \Pr[R \in [a_j, b_j]\text{ and } w_j\text{ comes before }w_1, \dots, w_{j-1}\text{ in }\pi] \\ &\leq \sum_j \frac{d(u,v)}{(r/2 - r/4)} \cdot \frac1j \leq \frac{4\,d(u,v)}{r}\cdot H_n = \frac{d(u,v)}{r}\;O(\log n) \end{array}$

But we wanted to do better than that! No worries, the fix is easy, but clever. First, note that if ${d(u,v) \geq r/8}$ then the probability of separating ${u,v}$ is at most ${1 \leq 8\,d(u,v)/r}$. So suppose ${d(u,v) < r/8}$. Now, for ${w_j}$ to cut ${(u,v)}$, it is not enough for ${R \in [a_j, b_j]}$ and ${w_j}$ comes before all ${w_i}$ for ${i < j}$. It also must be the case that ${w_j}$ be at most ${r/2}$ from the closer of the pair (say ${u}$) to even reach one of the vertices, let alone separate then. And at least ${r/4}$ from the further one (say ${v}$) so that some setting of ${R}$ would have a chance to separate the two. So the distance of ${w_j}$ from ${u}$ must be at most ${r}$, and at least ${r/4 - d(u,v) \geq r/8}$, and the same for its distance from ${v}$. If we restrict the harmonic sum in the final expression over just the vertices that satisfy these bounds, we get the bound

$\displaystyle \frac{d(u,v)}{r/4} \left( \frac1{|B(u, r/8)| + 1} + \frac1{|B(u, r/8)| + 2} + \cdots + \frac1{|B(u, r)|} \right),$

and hence the bound in Theorem 4.

Theorem 6 (FRT 2003) Using the decomposition procedure from Theorem 4 in the TreeEmbed algorithm, we get that for all ${x,y \in V}$:

$\displaystyle E_T[d_T(x,y)] \leq d(x,y)\cdot O(\log n)$

The proof for the TreeEmbed algorithm remains essentially unchanged, except for the final calculations:

$\displaystyle \begin{array}{rcl} E_T[d_T(x,y)] &\leq& \sum_{i=\delta}^0\Pr[(x,y)\textrm{ separated at level i}]\cdot 4\cdot 2^i \\ &\leq& \sum_{i=\delta}^0 \frac{4\,d(x,y)}{2^{i-1}}\cdot\log\left(\frac{|B(x,2^i)|}{|B(x,2^{i-3})|}\right)\cdot 2^i\cdot 4 \\ &=& 32\,d(x,y)\sum_{i=0}^\delta(\log(|B(x,2^i)|) - \log(|B(x,2^{i-3})|)) \\ &=&32\,d(x,y)(\log(|B(x,2^\delta)|) + \log(|B(x,2^{\delta-1})|) + \log(|B(x,2^{\delta-2})|) \\ &\leq& 96\,d(x,y)\log n \end{array}$

where the last equality follows from observing that we have a telescoping sum.

Citations: The ${O(\min(\log \Delta, \log n) \log n)}$ construction was due to Yair Bartal (1996); this substantially improved on the first non-trivial guarantee of ${\exp(\sqrt{\log n \log\log n})}$ due to Alon, Karp, Peleg and West (1992). The low-diameter decomposition is also from Bartal. The ${O(\log n)}$ algorithm is by Fakcharoenphol, Rao, and Talwar (2003), based on the improved decomposition scheme due to Calinescu, Karloff and Rabani (2000).

3. Lower Bounds

Let us show two lower bounds: first, that no randomized low-diameter decomposition can achieve better than ${\beta = O(\log n)}$ for general metrics. And that no random tree embeddings can do better than ${\alpha = O(\log n)}$ either.

3.1. Lower Bounds for Decompositions

First, given a graph ${G = (V,E)}$ with unit length edges, if we apply a ${\beta}$ decomposition with parameter ${r}$ to the graph metric ${d_G}$, we will cut each edge with probability ${\beta/r}$. The expected number of cut edges will be ${\beta m/r}$. So, for each ${r}$ the probabilistic method says there exists a diameter-${r}$ partition that cuts at most ${\beta m/r}$ edges.

Let ${G}$ be a graph with ${n}$ nodes and ${cn}$ edges (with ${c > 1}$), where the girth of the graph (the length of the shortest simple cycle) is at least ${g = c'\log n}$ (for constant ${c' < 1}$). Such graphs are known to exist, this can be shown by the probabilistic method.

Now, if we set ${r = \frac{c'}{3} \log n = g/3}$ and consider any diameter-${r}$ partition: we claim no set ${S}$ in this partition can induce a cycle. Indeed, since every cycle is of length ${g}$, two furthest points in the cycle would be ${g/2 = \frac{c'}{2} \log n > r}$ distance from each other. So all sets induce a forest, which means the number of internal edges is at most ${n-1 < n}$. This means at least ${(c-1)n}$ edges are cut.

Cool. For every diameter-${r}$ partition, at least ${(c-1)n}$ edges are cut because of the large girth property. But there exists one that cuts at most ${\beta\, m/r = \beta\, cn/r}$ edges, because we have a good decomposition algorithm. So now we put the two facts together.

$\displaystyle (c-1)n \leq \beta\, \frac{cn}{r} \implies \beta \geq \frac{c-1}{c}\, r = \Omega(\log n).$

3.2. Lower Bounds for Random Tree Embeddings

Suppose there is a distribution ${\mathcal{D}}$ that achieves expected stretch ${\alpha}$ for the large-girth graphs above. Let’s use this to obtain a low-diameter decomposition with cutting parameter ${\beta = O(\alpha)}$; this will mean ${\alpha = \Omega(\beta) = \Omega(\log n)}$.

Sample a tree ${T}$ from the distribution, pick an arbitrary vertex ${v}$, pick a random value ${R \in_R [0, r/2)}$. Delete all edges that contain points at distance exactly in ${\{ R, r/2+R, r+R, 3r/2+R, \ldots \}}$ from ${v}$. The remaining forest has components with radius at most ${r/2}$, and diameter ${r}$ in the tree. Since distances on the original graph are only smaller, the diameter of each part will only be less in the original graph.

Moreover, given the tree ${T}$, a pair will be separated with probability at most ${\frac{d_T(x,y)}{r/2}}$. Taking expectations, the total probability of ${x,y}$ separated is at most

$\displaystyle E_T[ \frac{2\, d_T(x,y)}{r} ] \leq 2\alpha \; \frac{d(x,y)}{r}.$

So we have a decomposition scheme with parameter ${2 \alpha}$. And combining this with the previous lower bound on any decomposition scheme, we get ${\alpha = \Omega(\log n)}$.

## Lecture #15: Distance-preserving trees (part I)

1. Metric Spaces

A metric space is a set ${V}$ of points, with a distance function ${d: V \times V \rightarrow {\mathbb R}_{\geq 0}}$ that satisfies ${d(x,x) = 0}$ for all ${x \in V}$, symmetry (i.e., ${d(x,y) = d(y,x)}$), and the triangle inequality (i.e., ${d(x,y) + d(y,z) \geq d(x,z)}$ for all ${x,y,z \in V}$). Most of the computer science applications deal with finite metrics, and then ${n}$ denotes the number of points ${|V|}$.

There are many popular problems which are defined on metric spaces:

• The Traveling Salesman Problem (TSP): the input is a metric space, and the goal is to find a tour ${v_1, v_2, \ldots, v_n, v_{n+1} = v_0}$ on all the ${n}$ nodes whose total length ${\sum_{i = 1}^n d(v_i, v_{i+1})}$ is as small as possible. This problem is sometimes defined on non-metrics as well, but most of the time we consider the metric version.

The best approximation algorithm for the problem is a ${(3/2 - \epsilon)}$-approximation due to Oveis-Gharan, Saberi and Singh (2010). Their paper uses randomization to beat the ${3/2}$-approximation of Cristofides (1976), and make progress on this long-standing open problem. The best hardness result for this problem is something like ${1.005}$ due to Papadimitriou and Vempala.

• The ${k}$-Center/${k}$-Means/${k}$-median problems: the input is a metric space ${(V,d)}$, and the goal is to choose some ${k}$ positions ${F}$ from ${V}$ as “facilities”, to minimize some objective function. In ${k}$-center, we minimize ${\max_{v \in V} d(v, F)}$, the largest distance from any client to its closest facility; here, we define the distance from a point ${v}$ to a set ${S}$ as ${d(v,S) := \min_{s \in S} d(v,s)}$. In ${k}$-median, we minimize ${\sum_{v \in V} d(v, F)}$, the total (or equivalently, the average) distance from any client to its closest facility. In ${k}$-means, we minimize ${\sum_{v \in V} d(v, F)^2}$, the average squared distance from any client to its closest facility. (Note: to see why these problems are called what they are, consider what happens for the ${1}$-means/medians problem on the line.)The best algorithms for ${k}$-center give us a ${2}$-approximation, and this is the best possible unless P=NP. The best ${k}$-median algorithm gives an ${(3+\epsilon)}$-approximation, whereas the best hardness known for the version of the problem stated above is ${(1 + 1/e)}$ unless P=NP. For ${k}$-means, gap between the best algorithm and hardness results is worse for general metric spaces. For geometric spaces, better algorithms are known for ${k}$-means/medians.
• The ${k}$-server problem: this is a classic online problem, where the input is a metric space (given up-front); a sequence of requests ${\sigma_1, \sigma_2, \cdots}$ arrives online, each request being some point in the metric space. The algorithm maintains ${k}$ servers, one each at some ${k}$ positions in the metric space. When the request ${\sigma_t}$ arrives, one of the servers must be moved to ${\sigma_t}$ to serve the request. The cost incurred by the algorithm in this step is the distance moved by the server, and the total cost is the sum of these per-step costs. The goal is to give a strategy that minimizes the total cost of the algorithm.The best algorithm for ${k}$-server is a ${2k-1}$-competitive deterministic algorithm due to Koutsoupias and Papadimitriou. Since ${k}$-server contains paging as a special case (why?), no deterministic algorithm can do better than ${k}$-competitive. It is a long-standing open problem whether we can do better than ${2k-1}$CC deterministically—but far more interesting is the question of whether randomization can help beat ${2k-1}$; the best lower bound against oblivious adversaries is ${\Omega(\log k)}$, again from the paging problem.

1.1. Approximating Metrics by Trees: Attempt I

A special kind of metric space is a tree metric: here we are given a tree ${T = (V,E)}$ where each edge ${e \in E}$ has a length ${\ell_e}$. This defines a metric ${(V, d_T)}$, where the distance ${d_T(x,y)}$ is the length of the (unique) shortest path between ${x}$ and ${y}$, according to the edge lengths ${\ell_e}$. In general, given any graph ${G = (V,E)}$ with edge lengths, we get a metric ${(V, d_G)}$.

Tree metrics are especially nice because we can use the graph theoretic idea that it is “generated” by a tree to understand the structure of the metric better, and hence give better algorithms for problems on tree metrics. For instance:

• TSP on tree metrics can be solved exactly: just take an Euler tour of the points in the tree.
• ${k}$-median can be solved exactly on tree metrics using dynamic programming.
• ${k}$-server on trees admits a simple ${k}$-competitive deterministic algorithm.

So if all metrics spaces were well-approximable by trees (e.g., if there were some small factor ${\alpha}$ such that for every metric ${M = (V,d)}$ we could find a tree ${T}$ such that

$\displaystyle d(x,y) \leq d_T(x,y) \leq \alpha d(x,y) \ \ \ \ \ (1)$

for every ${x,y \in V}$, then we would have an ${\alpha}$-approximation for TSP and ${k}$-median, and an ${\alpha k}$-competitive algorithm for ${k}$-server on all metrics. Sadly, this is not the case: for the metric generated by the cycle graph ${C_n}$, the best factor we can get in~(1) is ${\alpha = n-1}$. This is what we would get if we just approximated the tree by a line.

So even for simple metrics like that generated by the cycle (on which we can solve these problems really easily), this approach hits a dead-end really fast. Pity.

1.2. Approximating Metrics by Trees: Attempt II

Here’s where randomization will come to our help: let’s illustrate the idea on a cycle. Suppose we delete a uniformly random edge of the cycle, we get a tree ${T}$ (in fact, a line). Note that the distances in the line are at least those in the cycle.

How much more? For two vertices ${x, y}$ adjacent in the cycle, the edge ${(x,y)}$ still exists in the tree with probability ${1 - 1/n}$, in which case ${d_T(x,y) = d(x,y)}$; else, with probability ${1/n}$, ${x}$ and ${y}$ lie at distance ${n-1}$ from each other. So the expected distance between the endpoints of an edge ${(x,y)}$ of the cycle is

$\displaystyle (1 - 1/n) \cdot 1 + 1/n \cdot n-1 = 2(1-1/n) \cdot d(x,y)$

And indeed, this also holds for any pair ${x, y\in V}$ (check!),

$\displaystyle E_T[ d_T(x,y) ] \leq 2(1-1/n) \cdot d(x,y)$

But is this any good for us?

Suppose we wanted to ${k}$-median on the cycle, and let ${F^*}$ be the optimal solution. For each ${x}$, let ${f_x^*}$ be the closest facility in ${F^*}$ to ${x}$; hence the cost of the solution is:

$\displaystyle OPT = \sum_{x \in V} d(x, f_x).$

By the expected stretch guarantee, we get that

$\displaystyle \sum_{x \in V} E_T[ d_T(x, f_x) ] < 2\, OPT.$

I.e., the expected cost of this solution ${F^*}$ on the random tree is at most ${2\, OPT}$. And hence, if ${OPT_T}$ is the cost of the optimal solution on ${T}$, we get

$\displaystyle E_T[ OPT_T ] \leq 2\, OPT$

Great—we know that the optimal solution on the random tree does not cost too much. And we know we can find the optimal solution on trees in poly-time.

Let’s say ${F_T \subseteq V}$ is the optimal solution for the tree ${T}$, where the closest facility in ${F_T}$ to ${x}$ is ${f_x^T}$, giving ${OPT_T = \sum_x d_T(x, f_x^T)}$. How does this solution ${F_T}$ perform back on the cycle? Well, each distance in the cycle is less than that in the tree ${T}$, so the expected cost of solution ${F_T}$ on the cycle will be

$\displaystyle E \left[ \sum_x d(x,f_x^T) \right] = \sum_x E[ d(x,f_x^T) ] \leq \sum_x E[ d_T(x, f_x^T) ] = E_T [ OPT_T ] \leq 2\, OPT .$

And we have a randomized ${2}$-approximation for ${k}$-median on the cycle!

1.3. Popping the Stack

To recap, here’s the algorithm: pick a random tree ${T}$ from some nice distribution. Find an optimal solution ${F_T}$ for the problem, using distances according to the tree ${T}$, and output this set as the solution for the original metric.

And what did we use to show this was a good solution? That we had a distribution over trees such that

• every tree in the distribution had distances no less than that in the original metric, and
• the expected tree distance between any pair ${x, y \in V}$ satisfies ${E_T[ d_T(x,y) ] \leq \alpha \cdot d(x,y)}$ for some small ${\alpha}$; here ${\alpha = 2}$.

And last but not least

• that the objective function was linear in the distances, and so we could use linearity of expectations.

Note that TSP, ${k}$-median, ${k}$-server, and many other metric problems have cost functions that are linear in the distances, so as long as the metrics we care about can be “embedded into random trees” with small ${\alpha}$, we can translate algorithms on trees for these problems into (randomized) algorithms for general metrics! This approach gets used all the time, and is worth remembering. (BTW, note that this general approach does not work for non-linear objective functions, like ${k}$-center, or ${k}$-means.)

But can we get a small ${\alpha}$ in general? In the next section, we show that for any ${n}$-point metric with aspect ratio ${\frac{\max d(x,y)}{\min d(x,y)} = \Delta}$, we can get ${\alpha = O(\log n \log \Delta)}$; and we indicate how to improve this to ${O(\log n)}$, which is the best possible!

2. Embeddings into Trees

In this section, we prove the following theorem using tree embeddings (and then, in the following section, we improve it further to ${O(\log n)}$).

Theorem 1 Given any metric ${(V,d)}$ with ${|V| = n}$ and aspect ratio ${\Delta}$, there exists a efficiently sampleable distribution ${\mathcal{D}}$ over spanning trees of ${V}$ such that for all ${u,v\in V}$:

1. For all ${T \in \textrm{Support}(\mathcal{D})}$, ${d_T(u,v) \geq d(u,v)}$, and
2. ${\mathop{\mathbb E}_{T\sim \mathcal{D}}[d_T(u,v)] \leq O(\log n \log \Delta) \; d(u,v)}$.

To prove this theorem, we will use the idea of a low diameter decomposition. Given a metric space ${(V, d)}$ on ${|V| = n}$ points and a parameter ${r \in {\mathbb R}_+}$, a (randomized) low-diameter decomposition is an efficiently sampleable probability distribution over partitions of ${V}$ into ${S_1 \uplus S_2 \uplus S_3 \uplus \dots \uplus S_t}$ such that

1. (Low Radius/Diameter) For all ${S_i}$, there exists ${c_i \in V}$ such that for all ${u \in S_i}$, ${d(c_i, u) \leq r/2}$. Hence, for any ${u, v \in S_i}$, ${d(u,v) \leq r}$.
2. (Low Cutting Probability) For each pair ${u,v}$, ${\Pr[u,v \text{ lie in different } S_i's ] \leq \beta \; \frac{d(u,v)}{r}}$ with ${\beta = O(\log n)}$.

We’ll show how to construct such a decomposition in the next section (next lecture), and use such a decomposition to prove Theorem 1.

## Lecture #13: Learning Theory II

Today we talked about online learning.Â  We discussed the Weighted Majority and Randomized Weighted Majority algorithms for the problem of “combining expert advice”, showing for instance that the RWM algorithm satisfies the bound $E[\# mistakes] \leq (1+\epsilon)OPT + \frac{1}{\epsilon}\log n$, where $n$ is the number of “experts” andÂ  $OPT$ is the number of mistakes of the best expert in hindsight.Â  Also, this can be used when the experts are not predictors but rather just different options (like whether to play Rock, Paper, or Scissors in the Rock-Paper-Scissors game).Â  In this case, “# mistakes” becomes “total cost” and all costs are scaled to be in the range [0,1] each round.

We then discussed the “multiarmed bandit” problem, which is like the experts problem except you only find out the payoff for the expert you chose and not for those you didn’t choose.Â  For motivation, we discussed this in the context of the problem of selling lemonade to an online series of buyers, where the “experts” correspond to different possible prices you might choose for selling your lemonade.Â  We then went through an analysis of the EXP3 algorithm (though we did a simpler version of the analysis that gets a $T^{2/3}$ dependence on $T$ in the regret bound rather than the optimal $T^{1/2}$).

See the lecture notes (2nd half)

## Lecture #12: Learning Theory 1

Today we talked about the problem of learning a target function from examples, where examples are drawn from some distribution D, and the goal is to use a labeled sample S (a set of examples drawn from D and labeled by the target f) to produce a function h such that $Pr_{x \sim D}[h(x)\neq f(x)]$ is low. We gave a simple efficient algorithm for learning decision-lists in this setting, a basic “Occam’s razor” bound, and then a more interesting bound using the notion of shatter coefficients and a “ghost sample” argument. See 1st half of these lecture notes.

• One way to interpret the basic Occam bound is that in principle, anything you can represent in a polynomial number of bits you can learn from a polynomial number of examples (if running time is not a concern). Also “data compression implies learning”: if you can take a set of m examples and find a prediction rule that is correct on the sample and requires < m/10 bits to write down, then you can be confident it will have low error on future points.
• On the other hand, we would really like to learn from as few examples as possible, which is the reason for wanting bounds based on more powerful notions of the “underlying complexity” of the target function, such as shatter coefficients. Other very interesting bounds are based on a notion called “Rademacher complexity” which is even tighter.

## Lecture #11: Online algorithms

Today we discussed randomized online algorithms, and in particular, algorithms for the ski-rental (elevator-or-stairs) and paging problems. See lecture notes as well as Chapter 13 of the MR book. Also, Claire Mathieu has a very nice set of notes on the randomized ski-rental problem.

## Lecture #10: Polynomial Identity Testing, and Parallel Matchings

1. Matrix multiplication checking

The matrix multiplication checking problem is to verify the process of matrix multiplication: Given three ${n \times n}$ matrices ${A}$, ${B}$, and ${C}$, is it the case that ${AB = C}$? The fastest known deterministic algorithm is to actually multiply ${A}$ and ${B}$ and compare the result to ${C}$—this takes ${O(n^\omega)}$ time, where ${\omega}$ is the exponent of matrix multiplication, and currently ${\omega = 2.376}$ due to an algorithm of Coppersmith and Winograd. Note that an easy lower bound on the running time of any randomized algorithm for matrix multiplication verification is ${\Omega(n^2)}$ since the input has to at least be read (see Lecture 4 for more details on this). We will now give a randomized algorithm (in co-${RP}$) which takes only ${O(n^2)}$ time.

Let us introduce some notation for the rest of the course: ${x \in_{R} X}$ means “choose ${x}$ uniformly at random from the set ${X}$”. Our checking problem algorithm is as follows:

• Pick a vector ${x \in_R \{0,1\}^n}$.
• Compare ${ABx}$ with ${Cx}$. This takes ${O(n^2)}$ time, since we can first compute ${y = Bx}$, and then compute ${Ay = ABx}$. Each matrix-vector product takes only ${O(n^2)}$ time.
• If ${ABx = Cx}$, then output Yes, otherwise output No.

(We’re imagining working over the reals; if the matrices are over the field ${\mathbb{F}}$, the computations should also carried out over the field ${\mathbb{F}}$. The proofs remain unchanged.) Now if ${AB=C}$, our algorithm always outputs the correct answer. If ${AB\neq C}$, the algorithm may output the wrong answer. We now bound the probability of such an error.

First, we need a simple lemma:

Lemma 1 Given ${n}$-digit strings ${a,b \in {\mathbb R}^n}$ and ${x \in_R \{0,1\}^n}$, ${\Pr[a \cdot x = b \cdot x] \le \frac{1}{2}}$.

Proof: Suppose ${a_i \ne b_i}$. Let ${\alpha = \sum_{j \ne i} a_j x_j}$ and ${\beta = \sum_{j \ne i} b_j x_j}$. We can write ${a \cdot x = \alpha + a_i x_i}$ and ${b \cdot x = \beta + b_i x_i}$. This gives us

$\displaystyle a \cdot x - b \cdot x = (\alpha - \beta) + (a_i - b_i)x_i.$

We can invoke the Principle of Deferred Decisions (see Section 3.5 of M&R) to assume that we’ve first picked all the values ${x_j}$ for ${j \ne i}$. Then we can write

$\displaystyle \Pr[a \cdot x - b \cdot x = 0] = \Pr\left[x_i = \frac{\alpha - \beta}{b_i - a_i}\right] \le \frac{1}{2},$

where we use the fact that ${(\alpha - \beta)/(b_i - a_i)}$ can only be either ${0}$ or ${1}$ (or neither), and a randomly chosen ${x_j}$ will take that value with probability at most half. $\Box$

Theorem 2 (Freivalds) If ${AB \ne C}$, our algorithm fails with probability at most ${\frac{1}{2}}$.

Proof: If ${AB\neq C}$, then there is at least one row in ${AB}$, say ${(AB)_i}$, that differs from the corresponding row ${C_i}$ in C. Apply Lemma 1 with ${a=(AB)_i}$ and ${b=C_i}$. The probability that ${a\cdot x=b\cdot x}$ is at most ${1/2}$. For the algorithm to output Yes, we must have ${a\cdot x=b\cdot x}$. Therefore, the probability of failure for the algorithm is at most ${1/2}$. $\Box$

2. Polynomial identity checking

In the polynomial identity checking problem, we are given two multi-variate polynomials ${f(x_1, \ldots, x_n)}$ and ${g(x_1, \ldots, x_n)}$ each with degree ${d}$; again we are computing in some field ${\mathbb{F}}$. We may not be given the polynomials explicity, so we may not be able to read the polynomials in poly-time — we just have “black-box” access for evaluating a polynomial. Given these two polynomials, the problems is to determine if the polynomials are equal: i.e., if ${f = g}$, or equivalently, ${f - g = 0}$. Letting ${Q = f - g}$, it suffices to check if a given polynomial is identically zero. There is no known poly-time algorithm for this problem. But we will now show that it is in co-${RP}$.

First consider the univariate case. We can pick ${d+1}$ distinct values at random from ${\mathbb{F}}$. If ${Q(x) = 0}$ for all ${d+1}$ values for ${x}$, then ${Q = 0}$. This follows from the basic and super-useful fact, that for any field ${\mathbb{F}}$, a polynomial of degree at most ${d}$ over that field can have at most ${d}$ roots.

This approach does not directly apply to the multivariate case; in fact, the polynomial over two variables ${f(x,y) = xy - 3}$ over the reals has an infinite number of roots. Over the finite field ${\mathbb{F}_q}$, the degree-${d}$ polynomial over ${n}$ variables

$\displaystyle Q(x_1, x_2, \ldots, x_n) = (x_1 - 1)(x_1 - 2) \cdots (x_1 - d)$

has ${dq^{n-1}}$ roots (when ${d \leq q = |\mathbb{F}|}$).

However, things still work out. Roughly speaking, we can handle the multivariate case by fixing ${n-1}$ variables and applying the result from the univariate case. Consider the following algorithm, which assumes we have some subset ${S \subset \mathbb{F}}$ with ${|S| \ge 2d}$.

• Pick ${r_1, \ldots, r_n \in_R S}$.
• Evaluate ${Q(r_1, \ldots, r_n)}$.
• If 0, return ${Q = 0}$.

Theorem 3 (Schwartz (1980), Zippel (1979)) If, in the above algorithm, the polynomial ${Q \ne 0}$, we have

$\displaystyle \Pr[Q(r_1, \ldots, r_n) = 0] \le \frac{d}{|S|}.$

Proof: By induction on ${n}$. The base case is the univariate case described above. With ${Q \ne 0}$, we want to compute ${\Pr[Q(r_1, \ldots, r_n) = 0]}$. Let ${k}$ be the largest power of ${x_1}$. We can rewrite

$\displaystyle Q(x_1, \ldots, x_n) = x_1^k A(x_2, \ldots, x_n) + B(x_1, \ldots, x_n)$

for some polynomials ${A}$ and ${B}$. Now we consider two events. Let ${\mathcal{E}_1}$ be the event that ${Q(r_1,\cdots,r_n)}$ evaluates to ${0}$, and ${\mathcal{E}_2}$ be the event that ${A(r_2,\cdots,r_n)}$ evaluates to ${0}$.

We can rewrite the probability that ${Q(r_2,\cdots,r_n)}$ is ${0}$ as:

$\displaystyle \begin{array}{rcl} \Pr[Q(r) = 0] = \Pr[\mathcal{E}_1] & = & \Pr[ \mathcal{E}_1 \mid \mathcal{E}_2 ] \Pr[\mathcal{E}_2] + \Pr[\mathcal{E}_1 \mid \lnot \mathcal{E}_2 ] \Pr[\lnot \mathcal{E}_2 ] \\ & \leq & \Pr[\mathcal{E}_2] + \Pr[\mathcal{E}_1 \mid \lnot \mathcal{E}_2] \end{array}$

Let us first bound the probability of ${\mathcal{E}_2}$, or the probability that ${A(r_2,\cdots,r_n)=0}$. The polynomial ${A}$ has degree ${d-k}$ and fewer varaibles, so we can use the inductive hypothesis to obtain

$\displaystyle \Pr[\mathcal{E}_2]=\Pr[A(r_2, \ldots, r_n) = 0] \le \frac{d-k}{|S|}.$

Similarly, given ${\lnot \mathcal{E}_2}$ (or ${A(r_2,\cdots,r_n)\neq 0}$), the univariate polynomial ${Q(x_1, r_2, \ldots, r_n)}$ has degree ${k}$. Therefore, again by inductive hypothesis,

$\displaystyle \Pr[\mathcal{E}_1 \mid \lnot \mathcal{E}_2]=\Pr[Q(x_1, r_2, \ldots, r_n) = 0 \; | \; A(r_2, \ldots, r_n) \ne 0] \le \frac{k}{|S|}.$

We can substitute into the expression above to get

$\displaystyle \begin{array}{rcl} \Pr[Q(r) = 0] & \leq & \Pr[\mathcal{E}_2] + \Pr[\mathcal{E}_1 \mid \lnot \mathcal{E}_2] \\ & \le & \frac{d-k}{|S|} + \frac{k}{|S|} = \frac{d}{|S|} \end{array}$

This completes the inductive step. $\Box$

Polynomial identity testing is a powerful tool, both in algorithms and in complexity theory. We will use it to find matchings in parallel, but it arises all over the place. Also, as mentioned above, there is no poly-time deterministic algorithm currently known for this problem. A result of Impagliazzo and Kabanets (2003) shows that proving that the polynomial identity checking problem is in ${P}$ would imply that either ${NEXP}$ cannot have poly-size non-uniform circuits, or Permanent cannot have poly-size non-uniform circuits. Since we are far from proving such strong lower bounds, the Impagliazzo-Kabanets result suggest that deterministic algorithms for polynomial identity checking may require us to develop significantly new techniques.

For more on the polynomial identity checking problem, see Section 7.2 in M&R. Dick Lipton’s blog has an interesting post on the history of the theorem, as well as comparisons of the results of Schwartz, Zippel, and DeMillo-Lipton. One of the comments points out that the case when ${S = \mathbb{F}_q}$ was known at least as far back as Ore in 1922; his proof appears as Theorem 6.13 in the book Finite Fields by Lidl and Niederreiter; a different proof by Dana Moshkowitz appears here.

3. Perfect matchings in bipartite graphs

We will look at a simple sequential algorithm to determine whether a perfect matching exists in a given bipartite graph or not. The algorithm is based on polynomial identity testing from the previous section.

A bipartite graph ${G=(U,V,E)}$ is specified by two disjoint sets ${U}$ and ${V}$ of vertices, and a set ${E}$ of edges between them. A perfect matching is a subset of the edge set ${E}$ such that every vertex has exactly one edge incident on it. Since we are interested in perfect matchings in the graph ${G}$, we shall assume that ${|U|=|V|=n}$. Let ${U=\{u_1,u_2,\cdots,u_n\}}$ and ${V=\{v_1,v_2,\cdots,v_n\}}$. The algorithm we study today has no error if ${G}$ does not have a perfect matching (no instance), and errs with probability at most ${\frac12}$ if ${G}$ does have a perfect matching (yes instance). This is unlike the algorithms we saw in the previous lecture, which erred on no instances.

Definition 4 The Tutte matrix of bipartite graph ${G=(U,V,E)}$ is an ${n\times n}$ matrix ${M}$ with the entry at row ${i}$ and column ${j}$,

$\displaystyle M_{i,j}=\left\{ \begin{array}{ll} 0 & if (u_i,v_j)\notin E\\ x_{i,j} & if (u_i,v_j)\in E\\ \end{array} \right.$

(Apparently, Tutte came up a matrix for general graphs, and this one for bipartite graphs is due to Jack Edmonds, but we’ll stick with calling it the Tutte matrix.)

The determinant of the Tutte matrix is useful in testing whether a graph has a perfect matching or not, as the following lemma shows. Note that we do not think of this determinant as taking on some numeric value, but purely as a function of the variables ${x_{i,j}}$.

Lemma 5 ${\mathrm{det}(M) \neq 0 \iff}$ There exists a perfect matching in ${G}$

Proof: We have the following expression for the determinant :

$\displaystyle \mathrm{det}(M)=\sum_{\pi\in S_n} (-1)^{sgn(\pi)} \prod_{i=1}^{n} M_{i,\pi(i)}$

where ${S_n}$ is the set of all permutations on ${[n]}$, and ${sgn(\pi)}$ is the sign of the permutation ${\pi}$. There is a one to one correspondence between a permutation ${\pi\in S_n}$ and a (possible) perfect matching ${\{(u_1,v_{\pi(1)}),(u_2,v_{\pi(2)}),\cdots ,(u_n,v_{\pi(n)})\}}$ in ${G}$. Note that if this perfect matching does not exist in ${G}$ (i.e. some edge ${(u_i,v_{\pi(i)})\notin E}$) then the term corresponding to ${\pi}$ in the summation is 0. So we have

$\displaystyle \mathrm{det}(M) = \sum_{\pi\in P} (-1)^{sgn(\pi)} \prod_{i=1}^n x_{i,\pi(i)}$

where ${P}$ is the set of perfect matchings in ${G}$. This is clearly zero if ${P=\emptyset}$, i.e., if ${G}$ has no perfect matching. If ${G}$ has a perfect matching, there is a ${\pi\in P}$ and the term corresponding to ${\pi}$ is ${\prod_{i=1}^n x_{i,\pi(i)} \ne 0}$. Additionally, there is no other term in the summation that contains the same set of variables. Therefore, this term is not cancelled by any other term. So in this case, ${\mathrm{det}(M)\ne 0}$. $\Box$

This lemma gives us an easy way to test a bipartite graph for a perfect matching — we use the polynomial identity testing algorithm of the previous lecture on the Tutte matrix of ${G}$. We accept if the determinant is not identically 0, and reject otherwise. Note that ${\mathrm{det}(M)}$ has degree at most ${n}$. So we can test its identity on the field ${Z_p}$, where ${p}$ is a prime number larger than ${2n}$. From the analysis of the polynomial testing algorithm, we have the following :

• ${G}$ has no perfect matching ${\implies \Pr[accept]=0}$.
• ${G}$ has a perfect matching ${\implies \Pr[accept]\ge \frac12}$.

The above algorithm shows that Perfect Matching for bipartite graphs is in RP. (The non-bipartite case may appear as a homework exercise.) Also, this algorithm for checking the existence of a perfect matching can be easily converted to one that actually computes a perfect matching as follows:

1. Pick ${(u_i,v_j)\in E}$.
2. Check if ${G\backslash \{u_i,v_j\}}$ has a perfect matching.
3. If “Yes”, output ${(u_i,v_j)}$ to be in the matching and recurse on ${G\backslash \{u_i,v_j\}}$, the graph obtained after the removal of vertices ${u_i}$ and ${v_j}$.

4. If “No”, recurse on ${G-(u_i,v_j)}$, the graph obtained after removing the edge ${(u_i,v_j)}$.

Note that this algorithm seems inherently sequential; it’s not clear how to speed up its running time considerably by using multiple processors. We’ll consider the parallel algorithm in the next section.

Some citations: the idea of using polynomial identity testing to test for matchings is due to Lovász. The above algorithm to find the matching runs in time ${mn^\omega}$, where ${n^\omega}$ is the time to multiply two ${n\times n}$-matrices. (It is also the time to compute determinants, and matrix inverses.) Rabin and Vazirani showed how to compute perfect matchings in general graphs in time ${O(n^{\omega + 1})}$, where ${n}$. Recent work of Mucha and Sankowski, and Harvey show how to use these ideas (along with many other cool ones) to find perfect matchings in general graphs in time ${n^\omega}$.

4. A parallel algorithm for finding perfect matchings

However, we can give a slightly different algorithm (for a seemingly harder problem), that indeed runs efficiently on parallel processors. The model here is that there are polynomially many processors run in parallel, and we want to solve the problem in poly-logarithmic depth using polynomial work. We will use the fact that there exist efficient parallel algorithms for computing the determinant of a matrix to obtain our parallel algorithm for finding perfect matchings.

We could try the following “strawman” parallel algorithm:

Use a processor for every edge ${(u_i,v_j)}$ that tests (in parallel) if edge ${(u_i,v_j)}$ is in some perfect matching or not. For each edge ${(u_i,v_j)}$ that lies in some perfect matching, the processor outputs the edge, else it outputs nothing.

We are immediately faced with the problem that there may be several perfect matchings in the graph, and the resulting output is not a matching. The algorithm may in fact return all the edges in the graph. It will only work if there is a unique perfect matching.

So instead of testing whether an edge ${(u_i,v_j)}$ is in some perfect matching or not, we want to test whether an edge ${(u_i,v_j)}$ is in a specific perfect matching or not. The way we do this is to put random weights on the edges of the graph and test for the minimum weight perfect matching. Surprisingly, one can prove that the minimum weight perfect matching is unique with a good probability, even when the weights are chosen from a set of integers from a relatively small range.

Lemma 6 Let ${S=\{e_1,\cdots ,e_m\}}$ and ${S_1,\cdots S_k\subseteq S}$. For every element ${e_i}$ there is a weight ${w_i}$ picked u.a.r. from ${\{0,1,\cdots ,2m-1\}}$. The weight of subset ${S_j}$ is ${w(S_j)=\sum_{e_i\in S_j} w_i}$. Then

$\displaystyle \Pr[ \text{ minimum weight set among } S_1,\cdots ,S_k \text{ is unique } ] \ge\frac12.$

Proof: We will estimate the probability that the minimum weight set is not unique. Let us define an element ${e_i}$ to be tied if

$\displaystyle \min_{S_j | e_i\in S_j} w(S_j) = \min_{S_j | e_i\notin S_j} w(S_j)$

It is easy to see that there exists a tied element if and only if the minimum weight subset is not unique. Below we bound the probability that a fixed element ${e_i}$ is tied. The result will then follow using a union bound.\par We use the principle of deferred decisions. Let us fix the weights ${w_1,\cdots,w_m}$ of all the elements except ${w_i}$. We want to bound ${\Pr_{w_i}[ e_i \text{ is tied } \mid w_1,\cdots,w_{i-1},w_{i+1},w_m]}$. Let

$\displaystyle W^- = \min_{S_j | e_i\notin S_j} w(S_j) \qquad \text{and} \qquad W^+ = \min_{S_j | e_i\in S_j} w(S_j)$

with ${w_i}$ assigned the value 0. It is easy to see that ${e_i}$ is tied iff ${W^-=W^+ + w_i}$. So,

$\displaystyle \begin{array}{rcl} & & Pr_{w_i}[ e_i \text{ is tied } \mid w_1,\cdots,w_{i-1},w_{i+1},w_m] \\ &=& Pr_{w_i}[w_i=W^- - W^+ \mid w_1,\cdots,w_{i-1},w_{i+1},w_m] \\ &\le& \frac{1}{2m}. \end{array}$

The last inequality is because there is at most on value for ${w_i}$ for which ${W^-=W^+ + w_i}$. This holds irrespective of the particular values of the other ${w_{i'}}$s. So ${\Pr[e_i \text{ is tied } ] \le \frac{1}{2m}}$, and

$\displaystyle \Pr[ \exists \text{ a tied element }] \le\sum_{i=1}^m \Pr[e_i \text{ is tied} ] \le \frac12.$

Thus ${\Pr[ \text{ minimum weight set is unique } ] \ge\frac12}$. $\Box$

Now we can look at the parallel algorithm for finding a perfect matching. For each edge ${(u_i,v_j)}$, we pick a random weight ${w_{i,j}}$, from ${[2m-1]}$, where ${m=|E|}$ is the number of edges in ${G}$. Let the sets ${S_j}$ denote all the perfect matchings in ${G}$. Then the Isolation Lemma implies that there is a unique minimum weight perfect matching with at least a half probability. We assign the value ${x_{i,j}=2^{w_{i,j}}}$ to the variables in the Tutte matrix ${M}$. Let ${D}$ denote the resulting matrix. We use the determinant of ${D}$ to determine the weight of the min-weight perfect matching, if it is unique, as suggested by the following lemma.

Lemma 7 Let ${W_0}$ be the weight of the minimum weight perfect matching in ${G}$. Then,

• ${G}$ has no perfect matching ${\implies}$ ${\mathrm{det}(D)=0}$.
• ${G}$ has a unique min-weight perfect matching ${\implies}$ ${\mathrm{det}(D)\ne 0}$ and the largest power of 2 dividing ${\mathrm{det}(D)}$ is ${W_0}$.

• ${G}$ has more than one min-weight perfect matching ${\implies}$ either ${\mathrm{det}(D)=0}$ or the largest power of 2 dividing ${\mathrm{det}(D)}$ is at least ${W_0}$.

Proof: If ${G}$ has no perfect matching, it is clear from lemma 5 that ${\mathrm{det}(D)=0}$. \par Now consider that case when ${G}$ has a unique min-weight perfect matching. From the expression of the determinant, we have

$\displaystyle \mathrm{det}(D)=\sum_{\pi\in P} (-1)^{sgn(\pi)} \prod_{i=1}^n 2^{w_{i,\pi(i)}}=\sum_{\pi\in P} (-1)^{sgn(\pi)}2^{\sum_{i=1}^n w_{i,\pi(i)}}=\sum_{\pi\in P} (-1)^{sgn(\pi)} 2^{w(\pi)}$

where ${w(\pi)}$ is the weight of the perfect matching corresponding to ${\pi}$ and ${P}$ is the set of all perfect matchings in ${G}$. Since there is exactly one perfect matching of weight ${W_0}$ and other perfect matchings have weight at least ${W_0+1}$, this evaluates to an expression of the form ${\pm 2^{W_0} \pm 2^{W_0+1} \cdots \pm \textrm{other powers of 2 larger than } W_0}$. Clearly, this is non-zero, and the largest power of 2 dividing this is ${W_0}$.\par Now consider the case when ${G}$ has more than one min-weight perfect matchings. In this case, if the determinant is non-zero, every term in the sumation is a power of 2, at least ${2^{W_0}}$. So ${2^{W_0}}$ divides ${\mathrm{det}(D)}$. $\Box$

We refer to the submatrix of ${D}$ obtained by removing the ${i}$-th row and ${j}$-th column by ${D_{i,j}}$. Note that this is a matrix corresponding to the bipartite graph ${G\backslash \{u_i,v_j\}}$. The parallel algorithm would run as follows.

1. Pick random weights ${w_{i,j}}$ for the edges of ${G}$. (In the following steps, we assume that we’ve isolated the min-weight perfect matching.)
2. Compute the weight ${W_0}$ of the min-weight perfect matching from ${\mathrm{det}(D)}$ (using the parallel algorithm for computing the determinant): this is just the highest power of ${2}$ that divides ${\mathrm{det}(D)}$.
3. If ${\mathrm{det}(D)=0}$, output “no perfect matching”.
4. For each edge ${(u_i,v_j)\in E}$ do, in parallel,:
1. Evaluate ${\mathrm{det}(D_{i,j})}$.
2. If ${\mathrm{det}(D_{i,j})=0}$, output nothing.
3. Else, find the largest power of 2, ${W_{i,j}}$, dividing ${\mathrm{det}(D_{i,j})}$.
4. If ${W_{i,j} + w_{i,j} = W_0}$, output ${(u_i,v_j)}$.
5. Else, output nothing.

It is clear that, if ${G}$ has no perfect matching, this algorithm returns the correct answer. Now suppose ${G}$ has a unique minimum weight perfect matching, we claim Lemma 7 ensures that precisely all the edges in the unique min-weight perfect matching are output. To see this, consider an edge ${(u_i,v_j)}$ not in the unique min weight perfect matching. From the lemma, ${\mathrm{det}(D_{i,j})}$ is either zero (so the edge will not be output), or ${W_{i,j}}$ is at least as large as the min-weight perfect matching in ${G\backslash \{u_i,v_j\}}$. Since the min-weight perfect matching is unique and does not contain edge ${(u_i,v_j)}$, this implies ${w_{i,j} + W_{i,j}}$ will be strictly larger than ${W_0}$, and this edge will not be output in this case either. Finally, if an edge ${(u_i,v_j)}$ is in the unique min-weight perfect matching, removing this edge from the matching gives us the unique min-weight perfect matching in ${G\backslash \{u_i,v_j\}}$. So, in this case ${W_{i,j} = W_0 - w_{i,j}}$ and the edge is output.

Thus, if ${G}$ has a perfect matching, this algorithm will isolate one with probability at least ${\frac12}$, and will output it—hence we get an RNC algorithm that succeeds with probability at least ${1/2}$ on “Yes” instances, and never makes mistakes on “No” instances.

Finally, some more citations. This algorithm is due to Mulmuley, Vazirani and Vazirani; the first RNC algorithm for matchings had been given earlier by Karp, Upfal, and Wigderson. It is an open question whether we can find perfect matchings deterministically in parallel using poly-logarithmic depth and polynomial work, even for bipartite graphs.

## Lecture #8: Oh, and one more thing

I forgot to mention something about the two choices paradigm: recall from HW #2 that if you throw ${m}$ balls into ${n}$ bins randomly and ${m \gg n}$, the maximum load is about ${\frac{m}{n} + O(\sqrt{\frac{m}{n} \log n})}$. In fact, you can show that this variance term is about right—with high probability, the highest loaded bin will indeed be about ${O(\sqrt{\frac{m}{n} \log n})}$ above the average.

On the other hand, if you throw ${m}$ balls into ${n}$ bins using two-choices, then you can show that the highest load is about ${\frac{m}{n} + \log \log n + O(1)}$ with high probability. So not only do we gain in the low-load case (when ${m \approx n}$), we get more control over the variance in the high load case (when ${m \gg n}$): the additive gap between the average and the maximum loads is now independent of the number of balls! The proofs to show this require new ideas: check out the paper Balanced Allocations: the Heavily Loaded Case by Berenbrink, Czumaj, Steger and Vöcking for more details.

Here is a recent paper of Peres, Talwar and Weider that gives an analysis of the ${(1 + \epsilon)}$-choice process (where you invoke the two choices paradigm only on ${\epsilon}$ fraction of the balls). It also refers to more recent work in the area (weighted balls, weighted bins, etc), in case you’re interested.

## Lecture #8: Balls and Bins

1. Balls and Bins

The setting is simple: ${n}$ balls, ${n}$ bins. When you consider a ball, you pick a bin independently and uniformly at random, and add the ball to that bin. In HW #2 you proved:

Theorem 1 The max-loaded bin has ${O(\frac{\log n}{\log \log n})}$ balls with probability at least ${1 - 1/n}$.

One could use a Chernoff bound to prove this, but here is a more direct calculation of this theorem: the chance that bin ${i}$ has at least ${k}$ balls is at most

$\displaystyle \binom{n}{k} \left( \frac1n \right)^k \leq \frac{n^k}{k!} \cdot \frac1{n^k} \leq \frac{1}{k!} \leq 1/k^{k/2}$

which is (say) ${\leq 1/n^2}$ for ${k^* = \frac{8 \log n}{\log \log n}}$. To see this, note that

$\displaystyle k^{k/2} \geq (\sqrt{\log n})^{4 \log n/\log\log n} \geq 2^{2 \log n} = n^2.$

So union bounding over all the bins, the chance of some bin having more than ${k^*}$ balls is ${1/n}$. (I’ve been sloppy with constants, you can do better constants by using Stirling’s approximation.)

Here is a semantically identical way of looking at this calculation: let ${X_i}$ be the indicator r.v. for bin ${i}$ having ${k^*}$ or more balls. Then ${E[X_i] \leq 1/n^2}$. And hence if ${X = \sum_i X_i}$, then ${E[X] \leq 1/n}$. So by Markov, ${\Pr[ X > 1 ] \leq E[X] \leq 1/n}$. In other words, we again have

$\displaystyle \Pr[ \text{ max load is more than } \frac{8\log n}{\log \log n} ] \rightarrow 0.$

This idea of bounding the expectation of some variable ${X}$, and using that to upper bound some quantity (in this case the max-load) is said to use the first moment method.

1.1. Tightness of the Bound

In fact, ${\Theta(\frac{\log n}{\log\log n})}$ is indeed the right answer for the max-load with ${n}$ balls and ${n}$ bins.

Theorem 2 The max-loaded bin has ${\Omega(\frac{\log n}{\log \log n})}$ balls with probability at least ${1 - 1/n^{1/3}}$.

Here is one way to show this, via the second moment method. To begin, let us now lower bound the probability that bin ${i}$ has at least ${k}$ balls:

$\displaystyle \binom{n}{k} \left( \frac1n \right)^k \left(1 - \frac1n \right)^{n-k} \geq \left(\frac{n}{k}\right)^k \cdot \frac1{n^k} \cdot e^{-1} \geq 1/ek^k,$

which for ${k^{**} = \frac{\log n}{3\log \log n}}$ is at least ${1/en^{1/3}}$, since ${k^k \leq (\log n)^{\log n/3\log\log n} = n^{-1/3}}$. And so we expect ${\Omega(n^{2/3})}$ bins to have at least ${k^{**}}$ balls.

Let us define some random variables: if ${X_i}$ is the indicator for bin ${i}$ having at least ${k^{**}}$ balls, and ${X}$ is the expected number of bins with at least ${k^{**}}$ balls, we get that

$\displaystyle E[X_i] \geq 1/en^{1/3} \quad \text{ and } \quad E[X] = \Omega(n^{2/3}).$

Alas, in general, just knowing that ${E[X] \rightarrow \infty}$ will not imply ${\Pr[ X \geq 1 ] \rightarrow 1}$. Indeed, consider a random variable that is ${0}$ w.p. ${1-1/n^{1/3}}$, and ${n}$ otherwise—while its expectation is ${n^{2/3}}$, ${X}$ is more and more likely to be zero as ${n}$ increases. So we need some more information about ${X}$ to prove our claim. And that comes from the second moment.

Let’s appeal to Chebyshev’s inequality:

$\displaystyle \Pr[ X = 0 ] \leq \Pr[ |X -\mu| \geq \mu ] \leq \frac{\mathrm{Var}(X)}{\mu^2} = \frac{\sum_i \mathrm{Var}(X_i) + \sum_{i \neq j} \mathrm{Cov}(X_i,X_j)}{E[X]^2}.$

You have probably seen covariance before: ${\mathrm{Cov}(Y, Z) := E[(Y - E[Y])(Z - E[Z])]}$. But since the bins are negatively correlated (some bin having more balls makes it less likely for another bin to do so), the covariance ${\mathrm{Cov}(X_i, X_j) \leq 0}$. Moreover, since ${X_i \in \{0,1\}}$, ${\mathrm{Var}(X_i) \leq E[X_i] \leq 1}$; by the above calculations, ${E[X]^2 \geq n^{4/3}}$. So summarizing, we get

$\displaystyle \Pr[ X = 0 ] \leq \frac{\sum_i \mathrm{Var}(X_i) + \sum_{i \neq j} \mathrm{Cov}(X_i,X_j)}{E[X]^2} \leq \frac{n}{E[X]^2} \leq n^{-1/3}.$

In other words, there is a ${1 - 1/n^{1/3}}$ chance that some bin contains more than ${k^{**}}$ balls:

$\displaystyle \Pr[ \text{ max load is less than } \frac{\log n}{3\log \log n} ] \rightarrow 0.$

(Later, you will see how to use martingale arguments and Azuma-Hoeffding bounds to give guarantees on the max-load of bins. You can also use the “Poisson approximation” to show such a result, that’s yet another cool technique.)

1.2. So, in Summary

If you want to show that some non-negative random variable is zero with high probability, show that it’s expectation is tends to zero, and use Markov—the first moment method. If you want to show that it is non-zero with high probability, show that the variance divided by the squared mean tends to zero, and use Chebyshev—the second moment method.

1.3. Taking it to the Threshold

Such calculations often arise when you have a random process, and a random variable ${X}$ defined in terms of a parameter ${k}$. Often you want to show that ${X}$ is zero whp when ${k}$ lies much below some “threshold” ${\tau}$, and that ${X}$ is non-zero whp when ${k}$ is far above ${\tau}$. The first things you should try are to see if the first and second moment methods give you rough answers. E.g., take ${n}$ vertices and add each of the ${\binom{n}{2}}$ edges independently with probability ${1/2}$ (also called the Erdös-Rényi graph ${G(n,1/2)}$), and define ${X}$ to be the expected number of cliques on ${k}$ vertices. Show that ${\tau = 2 \log n}$ is such a threshold for ${X}$.

2. The Power of Two Choices

The setting now is: ${n}$ balls, ${n}$ bins. However, when you consider a ball, you pick two bins (or in general, ${d}$ bins) independently and uniformly at random, and put the ball in the less loaded of the two bins. The main theorem is:

Theorem 3 The two-choices process gives a maximum load of ${\frac{\ln \ln n}{\ln 2} + O(1)}$ with probability at least ${1 - O(\frac{\log^2 n}{n})}$.

The intuition behind the proof is the following picture:

The actual proof is not far from this intuition. The following lemma says that if at most ${\alpha}$ fraction of the bins have at least ${i}$ balls, then the fraction of bins having ${i+1}$ balls can indeed be upper bounded by ${Bin(n, \alpha^2)}$, where ${Bin(n,p)}$ is the Binomial random variable.

Lemma 4 If ${N_i}$ is the number of bins with load at least ${i}$, then ${\Pr[ N_{i+1} > t \mid N_i \leq \alpha n ] \leq \frac{\Pr[ Bin(n,\alpha^2) > t ]}{\Pr[ N_i \leq \alpha n ]}}$.

Proof: For the proof, let us consider the “heights” of balls: this is the position of the ball when it comes in, if it is the first ball in its bin then its height is ${1}$, etc. Observe that if there are ${t}$ bins with load ${i+1}$, then there must be at least ${t}$ balls with height ${i+1}$. I.e., if ${B_j}$ is the number of balls with height at least ${j}$, then ${N_j \leq B_j}$, and so we’ll now upper bound ${\Pr[ B_{i+1} > t \mid N_i \leq \alpha n ] = \frac{\Pr[ B_{i+1} > t \cap N_i \leq \alpha n ]}{\Pr[ N_i \leq \alpha n]}}$.

Consider the following experiment: just before a ball comes in, an adversary is allowed to “mark” at most ${\alpha n}$ bins. Call a ball marked if both its random bins are marked. Note that when we condition on ${N_i \leq \alpha n}$, we know that the final number of bins with load at least ${i}$ is at most ${\alpha n}$. In this case, we can imagine the adversary marking the bins with load at least ${i}$ (and maybe some more). Now the chance that a ball is marked is at least the chance that it has height ${i+1}$ and there are at most ${\alpha n}$ bins with height at least ${i}$. Hence, if ${M}$ is the number of marked balls, we get

$\displaystyle \frac{\Pr[ B_{i+1} > t \cap N_i \leq \alpha n ] }{ \Pr[ N_i \leq \alpha n]} \leq^{(*)} \frac{\Pr[ M > t ]}{\Pr[ N_i \leq \alpha n ]} = \frac{\Pr[ Bin(n,\alpha^2) > t ]}{\Pr[ N_i \leq \alpha n ]}.$

The second equality follows from the fact that ${M \sim Bin(n, \alpha^2)}$. $\Box$

If you’d like to be more precise about proving (*) above, see the details in the notes from the Mitzenmacher-Upfal. (CMU/Pitt access only.)

Now we can use Chernoff to prove tail bounds on the Binomial distribution.

Lemma 5 If ${\alpha^2 \geq 6 \frac{\ln n}{n}}$, then

$\displaystyle \Pr[ Bin(n, \alpha^2) > 2n\alpha^2 ] \leq 1/n^2.$

Moreover, if ${\alpha^2 < 6 \frac{\ln n}{n}}$, then

$\displaystyle \Pr[ Bin(n, \alpha^2) > 12 \ln n] \leq 1/n^2.$

Proof: We’re interested in ${X = \sum_{i = 1}^n X_i}$ where each ${X_i = 1}$ w.p. ${p = \alpha^2}$, and ${0}$ otherwise. The expectation ${\mu = np \geq 6 \ln n}$. And the chance that this number exceeds ${(1+1)\mu}$ is at most

$\displaystyle \exp( -\frac{\mu^2}{2\mu + \mu} ) \leq \exp( -\mu/3 ) \leq 1/n^2,$

which proves the first part. For the second part, ${\mu < 6 \ln n}$, and the probability that ${X}$ exceeds ${12 \ln n \geq \mu + 6 \ln n}$ is at most

$\displaystyle \exp( -\frac{(6 \ln n)^2}{2\mu + 6 \ln n} ) \leq \exp( -2\ln n ) \leq 1/n^2,$

as claimed. $\Box$

So, now let us define ${\alpha_i}$ to be the fraction of bins we’re aiming to show have load at least ${i}$. Define ${\alpha_4 = 1/4}$, and ${\alpha_{i+1} = 2\alpha_i^2}$. (The reason it is ${2\alpha_i^2}$ instead of ${\alpha_i^2}$, which is the expectation, is for some breathing room to apply Chernoff: in particular, the factor ${2}$ comes from the first part of Lemma 5.)

For each ${i \geq 4}$, let ${\mathcal{E}_i}$ be the good event “${N_i \leq n\alpha_i}$”; recall that ${N_i}$ is the number of bins with load at least ${i}$. We want to lower bound the probability that this good event does not happen.

Lemma 6 If ${\alpha_i^2 \geq 6 \frac{\ln n}{n}}$, then

$\displaystyle \Pr[ \lnot \mathcal{E}_{i+1} ] \leq i/n^2.$

Proof: We prove this by induction. The base case is when ${i = 4}$, when at most ${n/4}$ bins can have load at least ${4}$ (by Markov). So ${\Pr[ \lnot \mathcal{E}_4 ] = 0 < 4/n^2}$.

For the induction,

$\displaystyle \Pr[ \lnot \mathcal{E}_{i+1} ] \leq \Pr[ \lnot \mathcal{E}_{i+1} \mid \mathcal{E}_i ]\Pr[ \mathcal{E}_i ] + \Pr[ \lnot \mathcal{E}_i ].$

By Lemma 4 the former term is at most ${\frac{\Pr[ B(n, \alpha_i^2) \geq \alpha_{i+1}]}{\Pr[ \mathcal{E}_i ]} \cdot \Pr[ \mathcal{E}_i ]}$, which by Lemma 5 is at most ${1/n^2}$. And by induction, ${\Pr[\lnot \mathcal{E}_i] \leq i/n^2}$, which means ${\Pr[ \lnot \mathcal{E}_{i+1}] \leq (i+1)/n^2}$. $\Box$

Consider ${i^* = \min\{i \mid \alpha_i^2 < 6 \frac{\ln n}{n}\}}$. By the Lemma 6, ${\Pr[ \lnot \mathcal{E}_{i^*} ] \leq i^*/n^2 \leq 1/n}$. Hence, with probability ${1 - 1/n}$, we have the number of bins with more than ${i^*}$ balls in them is at most ${n\alpha_{i^*}}$.

We’re almost done, but there’s one more step to do. If this number ${n \alpha_{i^*}}$ were small, say ${O(\log n)}$, then we could have done a union bound, but this number may still be about ${\Omega(\sqrt{n \log n})}$. So apply Lemma 4 and the second part of Lemma 5 once more to get:

$\displaystyle \begin{array}{rcl} \Pr[ N_{i^* +1 } > 12 \ln n] &\leq& \Pr[ N_{i^* +1 } > 12 \ln n \mid \mathcal{E}_{i^*} ] \Pr[ \mathcal{E}_{i^*}] + \Pr[ \lnot \mathcal{E}_{i^*} ] \\ &\leq& \Pr[ Bin(n, \alpha_{i^*}^2) > 12 \ln n \mid \mathcal{E}_{i^*} ] \Pr[ \mathcal{E}_{i^*}] + \Pr[ \lnot \mathcal{E}_{i^*} ] \\ &\leq& 1/n^2 + \Pr[ \lnot \mathcal{E}_{i^*} ] \leq \frac{n+1}{n^2} \end{array}$

Finally, since ${N_{i^* + 1}}$ is so small whp, use Lemma 4 and a union bound to say that

$\displaystyle \begin{array}{rcl} \Pr[ N_{i^* + 2} > 1 ] &\leq& \Pr[ B(n, \frac{(12 \ln n)^2}{n}) > 1 ] + \Pr[ N_{i^* + 1} > 12 \ln n ] \\ &\leq& E[ B(n, \frac{(12 \ln n)^2}{n}) ] + \frac{n+1}{n^2} \\ &\leq& O(\frac{\log^2 n}{n}). \end{array}$

Finally, the calculations in Section 2 show that ${i^* = \frac{\ln \ln n}{\ln 2} + O(1)}$, which completes the proof.

2.1. A Calculation

Since ${\log_2 \alpha_4 = -2}$, and ${\log_2 \alpha_{i+1} = 1 + 2 \log_2 \alpha_i}$, we calculate

$\displaystyle \log_2 \alpha_{i} = - 2^{i-4} - 1.$

So, for ${\log_2 \alpha_i \approx - \frac12 \log_2 n}$, it suffices to set

$\displaystyle i = \log_2 \log_2 n + 3 = \frac{\ln \ln n}{\ln 2} + O(1).$

3. A Random Graphs Proof

Another way to show that the maximum load is ${O(\log\log n)}$—note that the constant is worse—is to use an first-priciples analysis based on properties of random graphs. We build a random graph ${G}$ as follows: the ${n}$ vertices of ${G}$ correspond to the ${n}$ bins, and the edges correspond to balls—each time we probe two bins we connect them with an edge in ${G}$. For technical reasons, we’ll just consider what happens if we throw fewer balls (only ${n/512}$ balls) into ${n}$ bins—also, let’s imagine that each ball chooses two distinct bins each time.

Theorem 7 If we throw ${\frac{n}{512}}$ balls into ${n}$ bins using the best-of-two-bins method, the maximum load of any bin is ${O(\log\log n)}$ whp.

Hence for ${n}$ balls and ${n}$ bins, the maximum load should be at most ${512}$ times as much, whp. (It’s as though after every ${n/512}$ balls, we forget about the current loads and zero out our counters—not zeroing out these counters can only give us a more evenly balanced allocation; I’ll try to put in a formal proof later.)

To prove the theorem, we need two results about the random graph ${G}$ obtained by throwing in ${n/512}$ random edges into ${n}$ vertices. Both the proofs are simple but surprisingly effective counting arguments, they appear at the end.

Lemma 8 The size of ${G}$‘s largest connected component is ${O(\log n)}$ whp.

Lemma 9 There exists a suitably large constant ${K > 0}$ such that for all subsets ${S}$ of the vertex set with ${|S| \ge K}$, the induced graph ${G[S]}$ contains at most ${5|S|/2}$ edges, and hence has average degree at most ${5}$, whp.

Given the graph ${G}$, suppose we repeatedly perform the following operation in rounds:

In each round, remove all vertices of degree ${\leq 10}$ in the current graph.

We stop when there are no more vertices of small degree.

Lemma 10 This process ends after ${O(\log\log n)}$ rounds whp, and the number of remaining vertices in each remaining component is at most ${K}$.

Proof: Condition on the events in the two previous lemmas. Any component ${C}$ of size at least ${K}$ in the current graph has average degree at most ${5}$; by Markov at least half the vertices have degree at most ${10}$ and will be removed. So as long as we have at least ${K}$ nodes in a component, we halve its size. But the size of each component was ${O(\log n)}$ to begin, so this takes ${O(\log \log n)}$ rounds before each component has size at most ${K}$. $\Box$

Lemma 11 If a node/bin survives ${i}$ rounds before it is deleted, its load due to edges that have already been deleted is at most ${10i}$. If a node/bin is never deleted, its load is at most ${10i^* + K}$, where ${i^*}$ is the total number of rounds.

Proof: Consider the nodes removed in round ${1}$: their degree was at most ${10}$, so even if all those balls went to such nodes, their final load would be at most ${10}$. Now, consider any node ${x}$ that survived this round. While many edges incident to it might have been removed in this round, we claim that at most ${10}$ of those would have contributed to ${x}$‘s load. Indeed, the each of the other endpoints of those edges went to bins with final load at most ${10}$. So at most ${10}$ of them would choose ${x}$ as their less loaded bin before it is better for them to go elsewhere.

Now, suppose ${y}$ is deleted in round ${2}$: then again its load can be at most ${20}$: ten because it survived the previous round, and 10 from its own degree in this round. OTOH, if ${y}$ survives, then consider all the edges incident to ${y}$ that were deleted in previous rounds. Each of them went to nodes that were deleted in rounds ${1}$ or ${2}$, and hence had maximum load at most ${20}$. Thus at most ${20}$ of these edges could contribute to ${y}$‘s load before it was better for them to go to the other endpoint. The same inductive argument holds for any round ${i \leq i^*}$.

Finally, the process ends when each component has size at most ${K}$, so the degree of any node is at most ${K}$. Even if all these edges contribute to the load of a bin, it is only ${10i^* + K}$. $\Box$

By Lemma 10, the number of rounds is ${i^* = O(\log \log n)}$ whp, so by Lemma 11 the maximum load is also ${O(\log \log n)}$ whp.

3.1. Missing Proofs of Lemmas

Lemma 12 The size of ${G}$‘s largest connected component is ${O(\log n)}$ whp.

Proof: We have a graph with ${n}$ vertices and ${m=\frac{n}{512}}$ edges where we connect vertices at random.

$\displaystyle \begin{array}{rcl} \Pr[ k+1\text{ vertices connected } ] &\le& \Pr[ \text{ at least } k \text{ edges fall within } k+1 \text{ nodes }] \\ &\le& \binom{m}{k}\left(\frac{\binom{k+1}{2}}{\binom{n}{2}}\right)^k = \binom{m}{k}\left(\frac{k(k+1)}{n(n-1)}\right)^k \\ &\leq& \binom{m}{k}\left(\frac{4k}{n}\right)^{2k}. \end{array}$

Since ${k(k+1) \leq 2k^2}$ and ${n(n-1) \geq n^2/2}$. Now the probability that any such set exists can bounded above by the union bound

$\displaystyle \begin{array}{rcl} \Pr[ \exists \text{ a connected set of size }k+1 ] &\le& \binom{n}{k+1}\binom{m}{k}\left(\frac{4k}{n}\right)^{2k}\\ &\le& n\left(\frac{ne}{k}\right)^k\left(\frac{ne}{512k}\right)^k\left(\frac{4k}{n}\right)^{2k}\\ &\le& n\left(\frac{e^2}{16}\right)^k \le \frac{1}{2n} \quad\text{ if }\; k=\Theta(\log n ) \end{array}$

which proves the claim. $\Box$

Lemma 13 There exists a suitably large constant ${K > 0}$ such that for all subsets ${S}$ of the vertex set with ${|S| \ge K}$, the induced graph ${G[S]}$ contains at most ${5|S|/2}$ edges, and hence has average degree at most ${5}$, whp.

Proof:

$\displaystyle \Pr[ \text{ a fixed set of }k\text{ nodes gets } > \frac{5k}{2} \text{ edges }] \leq \binom{m}{5k/2}\left(\frac{4k}{n}\right)^{2\cdot 5k/2} = \binom{m}{5k/2}\left(\frac{4k}{n}\right)^{5k}.$

By a union bound over all sets, the probability that such a set exists is

$\displaystyle \begin{array}{rcl} \Pr[ \exists \text{ a bad set } ] &\le& \sum_{k \geq K} \binom{n}{k}\binom{m}{5k/2}\left(\frac{4k}{n}\right)^{5k}\\ &\le& \sum_{k \geq K} \left(\frac{ne}{k}\right)^k\left(\frac{ne}{512(5k/2)}\right)^{5k/2}\left(\frac{k}{n}\right)^{5k} = \sum_{k \geq K} \left(\frac{k}{n}\right)^{3k/2}\alpha^k, \end{array}$

where ${\alpha = \frac{e^{7/2}}{80^{5/2}} < 1/2}$. Now, we can break this sum into two: for small values of ${k}$, the ${(k/n)^k}$ term would be very small, else the ${\alpha^k}$ term would be small. Indeed, for ${k \geq 2\log_2 n}$, we know that ${\alpha^k \leq 1/n^{2}}$, so

$\displaystyle \sum_{k = 2 \log n}^n \left(\frac{k}{n}\right)^{3k/2} \alpha^k \leq \sum_{k = 2 \log n}^n \alpha^k \leq 1/n.$

Now for the rest:

$\displaystyle \sum_{k = K}^{2 \log n} \left(\frac{k}{n}\right)^{3k/2} \alpha^k \leq \sum_{k = K}^{2 \log n} \left(\frac{k}{n}\right)^{3k/2} \leq 2 \log n \cdot \left( \frac{ 2 \log n }{n} \right)^{3K/2} \leq 1/n^4,$

for ${K = 3}$, say. $\Box$

Bibliographic Notes: The layered induction appears in Balanced Allocations Azar, Broder, Karlin, and Upfal. The random graph analysis is in the paper Efficient PRAM Simulation on a Distributed Memory Machine by Karp, Luby, and Meyer auf der Heide; I learned it from Satish Rao. The Always-go-left algorithm and analysis is due to How Asymmetry Helps Load Balancing by Berthold Vöcking.

Update: Here’s a survey on the various proof techniques by Mitzenmacher, Sitaraman and Richa.