# CMU Randomized Algorithms

Randomized Algorithms, Carnegie Mellon: Spring 2011

## Monthly Archives: February 2011

## HW #4 Open Thread

Please ask your HW#4 questions here. Here’s one:

*For Exercise 2, Part A, the algorithm says to “Choose 2m+1 elements S uniformly at random…”. Does this mean with replacement, or without replacement? In other words, are we choosing 2m+1 elements one at a time, each with equal probability of picking any of the elements from A, such that S might be a multiset? Or, are we choosing a set of 2m+1 elements from A, where each set has probability 1 / binom(n, 2m+1) of being picked?*

It does not matter; you can do it either way. Though the analysis for the process where you choose independently with replacement is slightly simpler.

**Update: **the phrasing for problem #2(b) was ambiguous, and has been fixed. (Thanks Kevin!)

**Update #2: **the indices in the algorithm in problem #2 were off by 1 and have been fixed. (Thanks Favonia!)

## Lecture #14: Game Theory

Today we discussed some key concepts in game theory and connections between (some of) them and results in online learning.

We began by discussing 2-player zero-sum games. The Minimax Theorem states that these games have a well-defined value V such that (a) there exists a mixed strategy p for the row-player that guarantees the row player makes at *least* V (in expectation) no matter what column the column-player chooses, and (b) there exists a mixed strategy q for the column-player that guarantees the row player makes *at most* V (in expectation) no matter what row the row-player chooses. We then saw how we could use the results we proved about the Randomized Weighted Majority algorithm to prove this theorem.

We next discussed general-sum games and the notion of Nash equilibria. In a zero-sum game, a Nash equilibrium requires the two players to both be playing minimax optimal strategies, but in general there could be Nash equilibria of multiple quality-levels for each player. We then proved the existence of Nash equilibria. However, unlike in the case of zero-sum games, the proof gives no idea of how to *find* a Nash equilibrium or even to approach one. In fact, doing so in a 2-player nxn game is known to be PPAD-hard.

Finally we discussed the notion of *correlated* equilibria and the connection of these to *swap-regret*, a generalization of the kind of regret notion we discussed last time. In particular, any set of algorithms with swap-regret sublinear in T, when played against each other, will have an empirical distribution of play that approaches (the set of) correlated equilibria. See lecture notes.

## HW3 Graders

Hi, if you volunteered to help grade HW3, please send me mail (if you haven’t already done so). Thanks! –Anupam

## Lecture #13: Learning Theory II

Today we talked about online learning. We discussed the Weighted Majority and Randomized Weighted Majority algorithms for the problem of “combining expert advice”, showing for instance that the RWM algorithm satisfies the bound , where is the number of “experts” and is the number of mistakes of the best expert in hindsight. Also, this can be used when the experts are not predictors but rather just different options (like whether to play Rock, Paper, or Scissors in the Rock-Paper-Scissors game). In this case, “# mistakes” becomes “total cost” and all costs are scaled to be in the range [0,1] each round.

We then discussed the “multiarmed bandit” problem, which is like the experts problem except you only find out the payoff for the expert you chose and not for those you didn’t choose. For motivation, we discussed this in the context of the problem of selling lemonade to an online series of buyers, where the “experts” correspond to different possible prices you might choose for selling your lemonade. We then went through an analysis of the EXP3 algorithm (though we did a simpler version of the analysis that gets a dependence on in the regret bound rather than the optimal ).

See the lecture notes (2nd half)

## Lecture #12: Learning Theory 1

Today we talked about the problem of learning a target function from examples, where examples are drawn from some distribution D, and the goal is to use a labeled sample S (a set of examples drawn from D and labeled by the target f) to produce a function h such that is low. We gave a simple efficient algorithm for learning decision-lists in this setting, a basic “Occam’s razor” bound, and then a more interesting bound using the notion of shatter coefficients and a “ghost sample” argument. See 1st half of these lecture notes.

A few additional comments:

- One way to interpret the basic Occam bound is that in principle, anything you can represent in a polynomial number of bits you can learn from a polynomial number of examples (if running time is not a concern). Also “data compression implies learning”: if you can take a set of m examples and find a prediction rule that is correct on the sample and requires < m/10 bits to write down, then you can be confident it will have low error on future points.
- On the other hand, we would really like to learn from as few examples as possible, which is the reason for wanting bounds based on more powerful notions of the “underlying complexity” of the target function, such as shatter coefficients. Other very interesting bounds are based on a notion called “Rademacher complexity” which is even tighter.
- For more info, see notes for 15-859(B) machine learning theory

## Lecture #11: Online algorithms

Today we discussed randomized online algorithms, and in particular, algorithms for the ski-rental (elevator-or-stairs) and paging problems. See lecture notes as well as Chapter 13 of the MR book. Also, Claire Mathieu has a very nice set of notes on the randomized ski-rental problem.

## Homework #3 Open Thread

Homework #3‘s now online.

Please post questions and clarifications in the comments…

## Lecture #10: Polynomial Identity Testing, and Parallel Matchings

**1. Matrix multiplication checking **

The *matrix multiplication checking* problem is to verify the process of matrix multiplication: Given three matrices , , and , is it the case that ? The fastest known deterministic algorithm is to actually multiply and and compare the result to —this takes time, where is the exponent of matrix multiplication, and currently due to an algorithm of Coppersmith and Winograd. Note that an easy lower bound on the running time of any randomized algorithm for matrix multiplication verification is since the input has to at least be read (see Lecture 4 for more details on this). We will now give a randomized algorithm (in co-) which takes only time.

Let us introduce some notation for the rest of the course: means “choose uniformly at random from the set ”. Our checking problem algorithm is as follows:

- Pick a vector .
- Compare with . This takes time, since we can first compute , and then compute . Each matrix-vector product takes only time.
- If , then output
*Yes*, otherwise output*No*.

(We’re imagining working over the reals; if the matrices are over the field , the computations should also carried out over the field . The proofs remain unchanged.) Now if , our algorithm always outputs the correct answer. If , the algorithm may output the wrong answer. We now bound the probability of such an error.

First, we need a simple lemma:

*Proof:* Suppose . Let and . We can write and . This gives us

We can invoke the Principle of Deferred Decisions (see Section 3.5 of M&R) to assume that we’ve first picked all the values for . Then we can write

where we use the fact that can only be either or (or neither), and a randomly chosen will take that value with probability at most half.

Theorem 2 (Freivalds)If , our algorithm fails with probability at most .

*Proof:* If , then there is at least one row in , say , that differs from the corresponding row in C. Apply Lemma 1 with and . The probability that is at most . For the algorithm to output *Yes*, we must have . Therefore, the probability of failure for the algorithm is at most .

**2. Polynomial identity checking **

In the *polynomial identity checking* problem, we are given two multi-variate polynomials and each with degree ; again we are computing in some field . We may not be given the polynomials explicity, so we may not be able to read the polynomials in poly-time — we just have “black-box” access for evaluating a polynomial. Given these two polynomials, the problems is to determine if the polynomials are equal: i.e., if , or equivalently, . Letting , it suffices to check if a given polynomial is identically zero. There is no known poly-time algorithm for this problem. But we will now show that it is in co-.

First consider the univariate case. We can pick distinct values at random from . If for all values for , then . This follows from the basic and super-useful fact, that for any field , a polynomial of degree at most over that field can have at most roots.

This approach does not directly apply to the multivariate case; in fact, the polynomial over two variables over the reals has an infinite number of roots. Over the finite field , the degree- polynomial over variables

has roots (when ).

However, things still work out. Roughly speaking, we can handle the multivariate case by fixing variables and applying the result from the univariate case. Consider the following algorithm, which assumes we have some subset with .

- Pick .
- Evaluate .
- If 0, return .

Theorem 3 (Schwartz (1980), Zippel (1979))If, in the above algorithm, the polynomial , we have

*Proof:* By induction on . The base case is the univariate case described above. With , we want to compute . Let be the largest power of . We can rewrite

for some polynomials and . Now we consider two events. Let be the event that evaluates to , and be the event that evaluates to .

We can rewrite the probability that is as:

Let us first bound the probability of , or the probability that . The polynomial has degree and fewer varaibles, so we can use the inductive hypothesis to obtain

Similarly, given (or ), the univariate polynomial has degree . Therefore, again by inductive hypothesis,

We can substitute into the expression above to get

This completes the inductive step.

Polynomial identity testing is a powerful tool, both in algorithms and in complexity theory. We will use it to find matchings in parallel, but it arises all over the place. Also, as mentioned above, there is no poly-time deterministic algorithm currently known for this problem. A result of Impagliazzo and Kabanets (2003) shows that proving that the polynomial identity checking problem is in would imply that either cannot have poly-size non-uniform circuits, or *Permanent* cannot have poly-size non-uniform circuits. Since we are far from proving such strong lower bounds, the Impagliazzo-Kabanets result suggest that deterministic algorithms for polynomial identity checking may require us to develop significantly new techniques.

For more on the polynomial identity checking problem, see Section 7.2 in M&R. Dick Lipton’s blog has an interesting post on the history of the theorem, as well as comparisons of the results of Schwartz, Zippel, and DeMillo-Lipton. One of the comments points out that the case when was known at least as far back as Ore in 1922; his proof appears as Theorem 6.13 in the book *Finite Fields* by Lidl and Niederreiter; a different proof by Dana Moshkowitz appears here.

**3. Perfect matchings in bipartite graphs **

We will look at a simple sequential algorithm to determine whether a perfect matching exists in a given bipartite graph or not. The algorithm is based on polynomial identity testing from the previous section.

A bipartite graph is specified by two disjoint sets and of vertices, and a set of edges between them. A *perfect matching* is a subset of the edge set such that every vertex has exactly one edge incident on it. Since we are interested in perfect matchings in the graph , we shall assume that . Let and . The algorithm we study today has no error if does not have a perfect matching (no instance), and errs with probability at most if does have a perfect matching (yes instance). This is unlike the algorithms we saw in the previous lecture, which erred on no instances.

Definition 4The Tutte matrix of bipartite graph is an matrix with the entry at row and column ,

(Apparently, Tutte came up a matrix for general graphs, and this one for bipartite graphs is due to Jack Edmonds, but we’ll stick with calling it the Tutte matrix.)

The determinant of the Tutte matrix is useful in testing whether a graph has a perfect matching or not, as the following lemma shows. Note that we do not think of this determinant as taking on some numeric value, but purely as a function of the variables .

*Proof:* We have the following expression for the determinant :

where is the set of all permutations on , and is the sign of the permutation . There is a one to one correspondence between a permutation and a (possible) perfect matching in . Note that if this perfect matching does not exist in (*i.e.* some edge ) then the term corresponding to in the summation is 0. So we have

where is the set of perfect matchings in . This is clearly zero if , *i.e.*, if has no perfect matching. If has a perfect matching, there is a and the term corresponding to is . Additionally, there is no other term in the summation that contains the same set of variables. Therefore, this term is not cancelled by any other term. So in this case, .

This lemma gives us an easy way to test a bipartite graph for a perfect matching — we use the polynomial identity testing algorithm of the previous lecture on the Tutte matrix of . We accept if the determinant is not identically 0, and reject otherwise. Note that has degree at most . So we can test its identity on the field , where is a prime number larger than . From the analysis of the polynomial testing algorithm, we have the following :

- has no perfect matching .
- has a perfect matching .

The above algorithm shows that Perfect Matching for bipartite graphs is in RP. (The non-bipartite case may appear as a homework exercise.) Also, this algorithm for checking the existence of a perfect matching can be easily converted to one that actually computes a perfect matching as follows:

- Pick .
- Check if has a perfect matching.
- If “Yes”, output to be in the matching and recurse on , the graph obtained after the removal of vertices and .
- If “No”, recurse on , the graph obtained after removing the edge .

Note that this algorithm seems inherently sequential; it’s not clear how to speed up its running time considerably by using multiple processors. We’ll consider the parallel algorithm in the next section.

Some citations: the idea of using polynomial identity testing to test for matchings is due to Lovász. The above algorithm to find the matching runs in time , where is the time to multiply two -matrices. (It is also the time to compute determinants, and matrix inverses.) Rabin and Vazirani showed how to compute perfect matchings in general graphs in time , where . Recent work of Mucha and Sankowski, and Harvey show how to use these ideas (along with many other cool ones) to find perfect matchings in general graphs in time .

**4. A parallel algorithm for finding perfect matchings **

However, we can give a slightly different algorithm (for a seemingly harder problem), that indeed runs efficiently on parallel processors. The model here is that there are polynomially many processors run in parallel, and we want to solve the problem in poly-logarithmic depth using polynomial work. We will use the fact that there exist efficient parallel algorithms for computing the determinant of a matrix to obtain our parallel algorithm for finding perfect matchings.

We could try the following “strawman” parallel algorithm:

Use a processor for every edge that tests (in parallel) if edge is in some perfect matching or not. For each edge that lies in some perfect matching, the processor outputs the edge, else it outputs nothing.

We are immediately faced with the problem that there may be several perfect matchings in the graph, and the resulting output is not a matching. The algorithm may in fact return all the edges in the graph. It will only work if there is a *unique* perfect matching.

So instead of testing whether an edge is in some perfect matching or not, we want to test whether an edge is in a specific perfect matching or not. The way we do this is to put random weights on the edges of the graph and test for the minimum weight perfect matching. Surprisingly, one can prove that the minimum weight perfect matching is unique with a good probability, even when the weights are chosen from a set of integers from a relatively small range.

Lemma 6Let and . For every element there is a weight picked u.a.r. from . The weight of subset is . Then

*Proof:* We will estimate the probability that the minimum weight set is *not* unique. Let us define an element to be *tied* if

It is easy to see that there exists a tied element if and only if the minimum weight subset is not unique. Below we bound the probability that a fixed element is tied. The result will then follow using a union bound.\par We use the principle of deferred decisions. Let us fix the weights of all the elements except . We want to bound . Let

with assigned the value 0. It is easy to see that is tied iff . So,

The last inequality is because there is at most on value for for which . This holds irrespective of the particular values of the other s. So , and

Thus .

Now we can look at the parallel algorithm for finding a perfect matching. For each edge , we pick a random weight , from , where is the number of edges in . Let the sets denote all the perfect matchings in . Then the Isolation Lemma implies that there is a unique minimum weight perfect matching with at least a half probability. We assign the value to the variables in the Tutte matrix . Let denote the resulting matrix. We use the determinant of to determine the weight of the min-weight perfect matching, if it is unique, as suggested by the following lemma.

Lemma 7Let be the weight of the minimum weight perfect matching in . Then,

- has no perfect matching .
- has a unique min-weight perfect matching and the largest power of 2 dividing is .
- has more than one min-weight perfect matching either or the largest power of 2 dividing is at least .

*Proof:* If has no perfect matching, it is clear from lemma 5 that . \par Now consider that case when has a unique min-weight perfect matching. From the expression of the determinant, we have

where is the weight of the perfect matching corresponding to and is the set of all perfect matchings in . Since there is exactly one perfect matching of weight and other perfect matchings have weight at least , this evaluates to an expression of the form . Clearly, this is non-zero, and the largest power of 2 dividing this is .\par Now consider the case when has more than one min-weight perfect matchings. In this case, if the determinant is non-zero, every term in the sumation is a power of 2, at least . So divides .

We refer to the submatrix of obtained by removing the -th row and -th column by . Note that this is a matrix corresponding to the bipartite graph . The parallel algorithm would run as follows.

- Pick random weights for the edges of . (In the following steps, we assume that we’ve isolated the min-weight perfect matching.)
- Compute the weight of the min-weight perfect matching from (using the parallel algorithm for computing the determinant): this is just the highest power of that divides .
- If , output “no perfect matching”.
- For each edge do, in parallel,:
- Evaluate .
- If , output nothing.
- Else, find the largest power of 2, , dividing .
- If , output .
- Else, output nothing.

It is clear that, if has no perfect matching, this algorithm returns the correct answer. Now suppose has a unique minimum weight perfect matching, we claim Lemma 7 ensures that precisely all the edges in the unique min-weight perfect matching are output. To see this, consider an edge not in the unique min weight perfect matching. From the lemma, is either zero (so the edge will not be output), or is at least as large as the min-weight perfect matching in . Since the min-weight perfect matching is unique and does not contain edge , this implies will be strictly larger than , and this edge will not be output in this case either. Finally, if an edge is in the unique min-weight perfect matching, removing this edge from the matching gives us the unique min-weight perfect matching in . So, in this case and the edge is output.

Thus, if has a perfect matching, this algorithm will isolate one with probability at least , and will output it—hence we get an RNC algorithm that succeeds with probability at least on “Yes” instances, and never makes mistakes on “No” instances.

Finally, some more citations. This algorithm is due to Mulmuley, Vazirani and Vazirani; the first RNC algorithm for matchings had been given earlier by Karp, Upfal, and Wigderson. It is an open question whether we can find perfect matchings *deterministically* in parallel using poly-logarithmic depth and polynomial work, even for bipartite graphs.

## Lecture #8: Oh, and one more thing

I forgot to mention something about the two choices paradigm: recall from HW #2 that if you throw balls into bins randomly and , the maximum load is about . In fact, you can show that this variance term is about right—with high probability, the highest loaded bin will indeed be about above the average.

On the other hand, if you throw balls into bins using two-choices, then you can show that the highest load is about with high probability. So not only do we gain in the low-load case (when ), we get more control over the variance in the high load case (when ): the additive gap between the average and the maximum loads is now independent of the number of balls! The proofs to show this require new ideas: check out the paper *Balanced Allocations: the Heavily Loaded Case* by Berenbrink, Czumaj, Steger and Vöcking for more details.

Here is a recent paper of Peres, Talwar and Weider that gives an analysis of the -choice process (where you invoke the two choices paradigm only on fraction of the balls). It also refers to more recent work in the area (weighted balls, weighted bins, etc), in case you’re interested.

## Lecture #8: Balls and Bins

**1. Balls and Bins **

The setting is simple: balls, bins. When you consider a ball, you pick a bin independently and uniformly at random, and add the ball to that bin. In HW #2 you proved:

Theorem 1The max-loaded bin has balls with probability at least .

One could use a Chernoff bound to prove this, but here is a more direct calculation of this theorem: the chance that bin has at least balls is at most

which is (say) for . To see this, note that

So union bounding over all the bins, the chance of some bin having more than balls is . (I’ve been sloppy with constants, you can do better constants by using Stirling’s approximation.)

Here is a semantically identical way of looking at this calculation: let be the indicator r.v. for bin having or more balls. Then . And hence if , then . So by Markov, . In other words, we again have

This idea of bounding the expectation of some variable , and using that to upper bound some quantity (in this case the max-load) is said to use the *first moment method*.

** 1.1. Tightness of the Bound **

In fact, is indeed the right answer for the max-load with balls and bins.

Theorem 2The max-loaded bin has balls with probability at least .

Here is one way to show this, via the *second moment method*. To begin, let us now lower bound the probability that bin has at least balls:

which for is at least , since . And so we expect bins to have at least balls.

Let us define some random variables: if is the indicator for bin having at least balls, and is the expected number of bins with at least balls, we get that

Alas, in general, just knowing that will not imply . Indeed, consider a random variable that is w.p. , and otherwise—while its expectation is , is more and more likely to be zero as increases. So we need some more information about to prove our claim. And that comes from the second moment.

Let’s appeal to Chebyshev’s inequality:

You have probably seen *covariance* before: . But since the bins are negatively correlated (some bin having more balls makes it less likely for another bin to do so), the covariance . Moreover, since , ; by the above calculations, . So summarizing, we get

In other words, there is a chance that some bin contains more than balls:

(Later, you will see how to use martingale arguments and Azuma-Hoeffding bounds to give guarantees on the max-load of bins. You can also use the “Poisson approximation” to show such a result, that’s yet another cool technique.)

** 1.2. So, in Summary **

If you want to show that some non-negative random variable is zero with high probability, show that it’s expectation is tends to zero, and use Markov—the *first moment method*. If you want to show that it is non-zero with high probability, show that the variance divided by the squared mean tends to zero, and use Chebyshev—the *second moment method*.

** 1.3. Taking it to the Threshold **

Such calculations often arise when you have a random process, and a random variable defined in terms of a parameter . Often you want to show that is zero whp when lies much below some “threshold” , and that is non-zero whp when is far above . The first things you should try are to see if the first and second moment methods give you rough answers. E.g., take vertices and add each of the edges independently with probability (also called the Erdös-Rényi graph ), and define to be the expected number of cliques on vertices. Show that is such a threshold for .

**2. The Power of Two Choices **

The setting now is: balls, bins. However, when you consider a ball, you pick *two* bins (or in general, bins) independently and uniformly at random, and put the ball in the less loaded of the two bins. The main theorem is:

Theorem 3The two-choices process gives a maximum load of with probability at least .

The intuition behind the proof is the following picture:

The actual proof is not far from this intuition. The following lemma says that if at most fraction of the bins have at least balls, then the fraction of bins having balls can indeed be upper bounded by , where is the Binomial random variable.

Lemma 4If is the number of bins with load at least , then .

*Proof:* For the proof, let us consider the “heights” of balls: this is the position of the ball when it comes in, if it is the first ball in its bin then its height is , etc. Observe that if there are bins with load , then there must be at least balls with height . I.e., if is the number of balls with height at least , then , and so we’ll now upper bound .

Consider the following experiment: just before a ball comes in, an adversary is allowed to “mark” at most bins. Call a ball marked if both its random bins are marked. Note that when we condition on , we know that the final number of bins with load at least is at most . In this case, we can imagine the adversary marking the bins with load at least (and maybe some more). Now the chance that a ball is marked is at least the chance that it has height and there are at most bins with height at least . Hence, if is the number of marked balls, we get

The second equality follows from the fact that .

*If you’d like to be more precise about proving (*) above, see the details in the notes from the Mitzenmacher-Upfal. (CMU/Pitt access only.)*

Now we can use Chernoff to prove tail bounds on the Binomial distribution.

Moreover, if , then

*Proof:* We’re interested in where each w.p. , and otherwise. The expectation . And the chance that this number exceeds is at most

which proves the first part. For the second part, , and the probability that exceeds is at most

as claimed.

So, now let us define to be the fraction of bins we’re aiming to show have load at least . Define , and . (The reason it is instead of , which is the expectation, is for some breathing room to apply Chernoff: in particular, the factor comes from the first part of Lemma 5.)

For each , let be the good event “”; recall that is the number of bins with load at least . We want to lower bound the probability that this good event does not happen.

*Proof:* We prove this by induction. The base case is when , when at most bins can have load at least (by Markov). So .

For the induction,

By Lemma 4 the former term is at most , which by Lemma 5 is at most . And by induction, , which means .

Consider . By the Lemma 6, . Hence, with probability , we have the number of bins with more than balls in them is at most .

We’re almost done, but there’s one more step to do. If this number were small, say , then we could have done a union bound, but this number may still be about . So apply Lemma 4 and the second part of Lemma 5 once more to get:

Finally, since is so small whp, use Lemma 4 and a union bound to say that

Finally, the calculations in Section 2 show that , which completes the proof.

** 2.1. A Calculation **

Since , and , we calculate

So, for , it suffices to set

**3. A Random Graphs Proof **

Another way to show that the maximum load is —note that the constant is worse—is to use an first-priciples analysis based on properties of random graphs. We build a random graph as follows: the vertices of correspond to the bins, and the edges correspond to balls—each time we probe two bins we connect them with an edge in . For technical reasons, we’ll just consider what happens if we throw fewer balls (only balls) into bins—also, let’s imagine that each ball chooses two distinct bins each time.

Theorem 7If we throw balls into bins using the best-of-two-bins method, the maximum load of any bin is whp.

Hence for balls and bins, the maximum load should be at most times as much, whp. (It’s as though after every balls, we forget about the current loads and zero out our counters—not zeroing out these counters can only give us a more evenly balanced allocation; I’ll try to put in a formal proof later.)

To prove the theorem, we need two results about the random graph obtained by throwing in random edges into vertices. Both the proofs are simple but surprisingly effective counting arguments, they appear at the end.

Lemma 9There exists a suitably large constant such that for all subsets of the vertex set with , the induced graph contains at most edges, and hence has average degree at most , whp.

Given the graph , suppose we repeatedly perform the following operation in rounds:

In each round, remove all vertices of degree in the current graph.

We stop when there are no more vertices of small degree.

Lemma 10This process ends after rounds whp, and the number of remaining vertices in each remaining component is at most .

*Proof:* Condition on the events in the two previous lemmas. Any component of size at least in the current graph has average degree at most ; by Markov at least half the vertices have degree at most and will be removed. So as long as we have at least nodes in a component, we halve its size. But the size of each component was to begin, so this takes rounds before each component has size at most .

Lemma 11If a node/bin survives rounds before it is deleted, its load due to edges that have already been deleted is at most . If a node/bin is never deleted, its load is at most , where is the total number of rounds.

*Proof:* Consider the nodes removed in round : their degree was at most , so even if all those balls went to such nodes, their final load would be at most . Now, consider any node that survived this round. While many edges incident to it might have been removed in this round, we claim that at most of those would have contributed to ‘s load. Indeed, the each of the other endpoints of those edges went to bins with final load at most . So at most of them would choose as their less loaded bin before it is better for them to go elsewhere.

Now, suppose is deleted in round : then again its load can be at most : ten because it survived the previous round, and 10 from its own degree in this round. OTOH, if survives, then consider all the edges incident to that were deleted in previous rounds. Each of them went to nodes that were deleted in rounds or , and hence had maximum load at most . Thus at most of these edges could contribute to ‘s load before it was better for them to go to the other endpoint. The same inductive argument holds for any round .

Finally, the process ends when each component has size at most , so the degree of any node is at most . Even if all these edges contribute to the load of a bin, it is only .

By Lemma 10, the number of rounds is whp, so by Lemma 11 the maximum load is also whp.

** 3.1. Missing Proofs of Lemmas **

*Proof:* We have a graph with vertices and edges where we connect vertices at random.

Since and . Now the probability that any such set exists can bounded above by the union bound

which proves the claim.

Lemma 13There exists a suitably large constant such that for all subsets of the vertex set with , the induced graph contains at most edges, and hence has average degree at most , whp.

*Proof:*

By a union bound over all sets, the probability that such a set exists is

where . Now, we can break this sum into two: for small values of , the term would be very small, else the term would be small. Indeed, for , we know that , so

Now for the rest:

for , say.

**Bibliographic Notes:** The layered induction appears in Balanced Allocations Azar, Broder, Karlin, and Upfal. The random graph analysis is in the paper Efficient PRAM Simulation on a Distributed Memory Machine by Karp, Luby, and Meyer auf der Heide; I learned it from Satish Rao. The *Always-go-left* algorithm and analysis is due to How Asymmetry Helps Load Balancing by Berthold Vöcking.

**Update:** Here’s a survey on the various proof techniques by Mitzenmacher, Sitaraman and Richa.