CMU Randomized Algorithms

Randomized Algorithms, Carnegie Mellon: Spring 2011


Lecture #8: Balls and Bins

1. Balls and Bins

The setting is simple: {n} balls, {n} bins. When you consider a ball, you pick a bin independently and uniformly at random, and add the ball to that bin. In HW #2 you proved:

Theorem 1 The max-loaded bin has {O(\frac{\log n}{\log \log n})} balls with probability at least {1 - 1/n}.

One could use a Chernoff bound to prove this, but here is a more direct calculation of this theorem: the chance that bin {i} has at least {k} balls is at most

\displaystyle  \binom{n}{k} \left( \frac1n \right)^k \leq \frac{n^k}{k!} \cdot \frac1{n^k} \leq \frac{1}{k!} \leq 1/k^{k/2}

which is (say) {\leq 1/n^2} for {k^* = \frac{8 \log n}{\log \log n}}. To see this, note that

\displaystyle  k^{k/2} \geq (\sqrt{\log n})^{4 \log n/\log\log n} \geq 2^{2 \log n} = n^2.

So union bounding over all the bins, the chance that some bin has more than {k^*} balls is at most {1/n}. (I’ve been sloppy with the constants; you can get better constants by using Stirling’s approximation.)

Here is a semantically identical way of looking at this calculation: let {X_i} be the indicator r.v. for bin {i} having {k^*} or more balls. Then {E[X_i] \leq 1/n^2}. And hence if {X = \sum_i X_i}, then {E[X] \leq 1/n}. So by Markov, {\Pr[ X \geq 1 ] \leq E[X] \leq 1/n}. In other words, we again have

\displaystyle  \Pr[ \text{ max load is more than } \frac{8\log n}{\log \log n} ] \rightarrow 0.

This idea of bounding the expectation of some variable {X}, and using that to upper bound some quantity (in this case the max-load) is said to use the first moment method.
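
If you want to get a feel for these numbers, here is a quick simulation sketch (nothing rigorous; it just throws {n} balls into {n} bins a few times and compares the observed maximum load to {\log n/\log\log n}):

    import math
    import random

    def max_load(n, trials=10):
        """Throw n balls into n bins uniformly at random; return the max load from each trial."""
        results = []
        for _ in range(trials):
            loads = [0] * n
            for _ in range(n):
                loads[random.randrange(n)] += 1
            results.append(max(loads))
        return results

    if __name__ == "__main__":
        for n in [10**3, 10**4, 10**5]:
            obs = max_load(n)
            print(f"n={n}: observed max loads {sorted(obs)}, "
                  f"log n / log log n = {math.log(n) / math.log(math.log(n)):.2f}")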

1.1. Tightness of the Bound

In fact, {\Theta(\frac{\log n}{\log\log n})} is indeed the right answer for the max-load with {n} balls and {n} bins.

Theorem 2 The max-loaded bin has {\Omega(\frac{\log n}{\log \log n})} balls with probability at least {1 - 1/n^{1/3}}.

Here is one way to show this, via the second moment method. To begin, let us now lower bound the probability that bin {i} has at least {k} balls:

\displaystyle  \binom{n}{k} \left( \frac1n \right)^k \left(1 - \frac1n \right)^{n-k} \geq \left(\frac{n}{k}\right)^k \cdot \frac1{n^k} \cdot e^{-1} \geq 1/ek^k,

which for {k^{**} = \frac{\log n}{3\log \log n}} is at least {1/(en^{1/3})}, since {k^k \leq (\log n)^{\log n/3\log\log n} = n^{1/3}}. And so we expect {\Omega(n^{2/3})} bins to have at least {k^{**}} balls.

Let us define some random variables: let {X_i} be the indicator for bin {i} having at least {k^{**}} balls, and let {X = \sum_i X_i} be the number of bins with at least {k^{**}} balls. By the above calculation we get that

\displaystyle  E[X_i] \geq 1/en^{1/3} \quad \text{ and } \quad E[X] = \Omega(n^{2/3}).

Alas, in general, just knowing that {E[X] \rightarrow \infty} will not imply {\Pr[ X \geq 1 ] \rightarrow 1}. Indeed, consider a random variable that is {0} w.p. {1-1/n^{1/3}}, and {n} otherwise: while its expectation is {n^{2/3}}, it is zero with probability tending to {1} as {n} increases. So we need some more information about {X} to prove our claim. And that comes from the second moment.

Let’s appeal to Chebyshev’s inequality:

\displaystyle  \Pr[ X = 0 ] \leq \Pr[ |X -\mu| \geq \mu ] \leq \frac{\mathrm{Var}(X)}{\mu^2} = \frac{\sum_i \mathrm{Var}(X_i) + \sum_{i \neq j} \mathrm{Cov}(X_i,X_j)}{E[X]^2}.

You have probably seen covariance before: {\mathrm{Cov}(Y, Z) := E[(Y - E[Y])(Z - E[Z])]}. But since the bins are negatively correlated (some bin having more balls makes it less likely for another bin to do so), the covariance {\mathrm{Cov}(X_i, X_j) \leq 0}. Moreover, since {X_i \in \{0,1\}}, {\mathrm{Var}(X_i) \leq E[X_i] \leq 1}; by the above calculations, {E[X]^2 \geq n^{4/3}}. So summarizing, we get

\displaystyle  \Pr[ X = 0 ] \leq \frac{\sum_i \mathrm{Var}(X_i) + \sum_{i \neq j} \mathrm{Cov}(X_i,X_j)}{E[X]^2} \leq \frac{n}{E[X]^2} \leq n^{-1/3}.

In other words, there is a {1 - 1/n^{1/3}} chance that some bin contains at least {k^{**}} balls:

\displaystyle  \Pr[ \text{ max load is less than } \frac{\log n}{3\log \log n} ] \rightarrow 0.

(Later, you will see how to use martingale arguments and Azuma-Hoeffding bounds to give guarantees on the max-load of bins. You can also use the “Poisson approximation” to show such a result, that’s yet another cool technique.)

1.2. So, in Summary

If you want to show that some non-negative random variable is zero with high probability, show that its expectation tends to zero, and use Markov—the first moment method. If you want to show that it is non-zero with high probability, show that its variance divided by its squared mean tends to zero, and use Chebyshev—the second moment method.

1.3. Taking it to the Threshold

Such calculations often arise when you have a random process, and a random variable {X} defined in terms of a parameter {k}. Often you want to show that {X} is zero whp when {k} lies much below some “threshold” {\tau}, and that {X} is non-zero whp when {k} is far above {\tau}. The first thing to try is to see whether the first and second moment methods give you rough answers. E.g., take {n} vertices and add each of the {\binom{n}{2}} edges independently with probability {1/2} (also called the Erdős-Rényi graph {G(n,1/2)}), and define {X} to be the number of cliques on {k} vertices. Show that {\tau = 2 \log_2 n} is such a threshold for {X}.
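
As a hint for this exercise, here is a quick numerical sketch: the expected number of {k}-cliques in {G(n,1/2)} is {\binom{n}{k} 2^{-\binom{k}{2}}}, and you can watch it crash from huge to tiny as {k} crosses the threshold (which is {2\log_2 n} up to lower-order terms; for concrete values of {n} the lower-order corrections are quite visible).

    import math

    def expected_k_cliques(n, k):
        """E[# k-cliques in G(n,1/2)] = C(n,k) * 2^(-C(k,2))."""
        return math.comb(n, k) * 2.0 ** (-math.comb(k, 2))

    if __name__ == "__main__":
        n = 1000
        print("2 log2 n =", 2 * math.log2(n))
        for k in range(10, 25):
            print(k, expected_k_cliques(n, k))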

2. The Power of Two Choices

The setting now is: {n} balls, {n} bins. However, when you consider a ball, you pick two bins (or in general, {d} bins) independently and uniformly at random, and put the ball in the less loaded of the two bins. The main theorem is:

Theorem 3 The two-choices process gives a maximum load of {\frac{\ln \ln n}{\ln 2} + O(1)} with probability at least {1 - O(\frac{\log^2 n}{n})}.
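
Before the proof, here is a quick simulation sketch contrasting the two processes, just to see the {\log n/\log\log n} versus {\log\log n} behavior (a single trial of each, nothing rigorous):

    import math
    import random

    def one_choice(n):
        loads = [0] * n
        for _ in range(n):
            loads[random.randrange(n)] += 1
        return max(loads)

    def two_choices(n):
        loads = [0] * n
        for _ in range(n):
            i, j = random.randrange(n), random.randrange(n)
            # place the ball in the less loaded of the two sampled bins
            if loads[i] <= loads[j]:
                loads[i] += 1
            else:
                loads[j] += 1
        return max(loads)

    if __name__ == "__main__":
        n = 10**5
        print("one choice max load :", one_choice(n),
              "(compare log n/log log n =", round(math.log(n) / math.log(math.log(n)), 1), ")")
        print("two choices max load:", two_choices(n),
              "(compare log2 log2 n =", round(math.log2(math.log2(n)), 1), ")")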

The intuition behind the proof is a repeated squaring phenomenon: if a {\beta} fraction of the bins have load at least {i}, then a ball reaches height {i+1} only when both its choices land in such bins, which happens with probability {\beta^2}. So the fraction of bins with load at least {i+1} should be about {\beta^2}, and this repeated squaring drives the fraction below {1/n} within about {\log\log n} levels.

The actual proof is not far from this intuition. The following lemma says that if at most an {\alpha} fraction of the bins have load at least {i}, then the number of bins with load at least {i+1} can indeed be (stochastically) upper bounded using {Bin(n, \alpha^2)}, where {Bin(n,p)} is the Binomial random variable.

Lemma 4 If {N_i} is the number of bins with load at least {i}, then {\Pr[ N_{i+1} > t \mid N_i \leq \alpha n ] \leq \frac{\Pr[ Bin(n,\alpha^2) > t ]}{\Pr[ N_i \leq \alpha n ]}}.

Proof: For the proof, let us consider the “heights” of balls: this is the position of the ball in its bin when it comes in; if it is the first ball in its bin then its height is {1}, etc. Observe that if there are {t} bins with load at least {i+1}, then there must be at least {t} balls with height {i+1}. I.e., if {B_j} is the number of balls with height at least {j}, then {N_j \leq B_j}, and so we’ll now upper bound {\Pr[ B_{i+1} > t \mid N_i \leq \alpha n ] = \frac{\Pr[ B_{i+1} > t \cap N_i \leq \alpha n ]}{\Pr[ N_i \leq \alpha n]}}.

Consider the following experiment: just before a ball comes in, an adversary is allowed to “mark” at most {\alpha n} bins. Call a ball marked if both its random bins are marked. Note that when we condition on {N_i \leq \alpha n}, we know that the final number of bins with load at least {i} is at most {\alpha n}. In this case, we can imagine the adversary marking the bins with load at least {i} (and maybe some more). Now the chance that a ball is marked is at least the chance that it has height at least {i+1} and there are at most {\alpha n} bins with load at least {i}. Hence, if {M} is the number of marked balls, we get

\displaystyle  \frac{\Pr[ B_{i+1} > t \cap N_i \leq \alpha n ] }{ \Pr[ N_i \leq \alpha n]} \leq^{(*)} \frac{\Pr[ M > t ]}{\Pr[ N_i \leq \alpha n ]} = \frac{\Pr[ Bin(n,\alpha^2) > t ]}{\Pr[ N_i \leq \alpha n ]}.

The final equality follows from the fact that {M \sim Bin(n, \alpha^2)} when the adversary marks exactly {\alpha n} bins (marking fewer bins can only decrease {M}). \Box

If you’d like to be more precise about proving (*) above, see the details in the corresponding section of Mitzenmacher-Upfal. (CMU/Pitt access only.)

Now we can use Chernoff to prove tail bounds on the Binomial distribution.

Lemma 5 If {\alpha^2 \geq 6 \frac{\ln n}{n}}, then

\displaystyle  \Pr[ Bin(n, \alpha^2) > 2n\alpha^2 ] \leq 1/n^2.

Moreover, if {\alpha^2 < 6 \frac{\ln n}{n}}, then

\displaystyle  \Pr[ Bin(n, \alpha^2) > 12 \ln n] \leq 1/n^2.

Proof: We’re interested in {X = \sum_{i = 1}^n X_i} where each {X_i = 1} w.p. {p = \alpha^2}, and {0} otherwise. The expectation {\mu = np \geq 6 \ln n}. And the chance that this number exceeds {(1+1)\mu} is at most

\displaystyle  \exp( -\frac{\mu^2}{2\mu + \mu} ) \leq \exp( -\mu/3 ) \leq 1/n^2,

which proves the first part. For the second part, {\mu < 6 \ln n}, and the probability that {X} exceeds {12 \ln n \geq \mu + 6 \ln n} is at most

\displaystyle  \exp( -\frac{(6 \ln n)^2}{2\mu + 6 \ln n} ) \leq \exp( -2\ln n ) \leq 1/n^2,

as claimed. \Box

So, now let us define {\alpha_i} to be the fraction of bins we’re aiming to show have load at least {i}. Define {\alpha_4 = 1/4}, and {\alpha_{i+1} = 2\alpha_i^2}. (The reason it is {2\alpha_i^2} instead of {\alpha_i^2}, which is the expectation, is for some breathing room to apply Chernoff: in particular, the factor {2} comes from the first part of Lemma 5.)

For each {i \geq 4}, let {\mathcal{E}_i} be the good event “{N_i \leq n\alpha_i}”; recall that {N_i} is the number of bins with load at least {i}. We want to upper bound the probability that this good event fails.

Lemma 6 If {\alpha_i^2 \geq 6 \frac{\ln n}{n}}, then

\displaystyle  \Pr[ \lnot \mathcal{E}_{i+1} ] \leq (i+1)/n^2.

Proof: We prove by induction that {\Pr[ \lnot \mathcal{E}_i ] \leq i/n^2}. The base case is {i = 4}: since there are only {n} balls, at most {n/4} bins can have load at least {4} (by Markov). So {\Pr[ \lnot \mathcal{E}_4 ] = 0 \leq 4/n^2}.

For the induction,

\displaystyle  \Pr[ \lnot \mathcal{E}_{i+1} ] \leq \Pr[ \lnot \mathcal{E}_{i+1} \mid \mathcal{E}_i ]\Pr[ \mathcal{E}_i ] + \Pr[ \lnot \mathcal{E}_i ].

By Lemma 4 the former term is at most {\frac{\Pr[ Bin(n, \alpha_i^2) > \alpha_{i+1} n]}{\Pr[ \mathcal{E}_i ]} \cdot \Pr[ \mathcal{E}_i ] = \Pr[ Bin(n, \alpha_i^2) > 2n\alpha_i^2 ]}, which by Lemma 5 is at most {1/n^2}. And by induction, {\Pr[\lnot \mathcal{E}_i] \leq i/n^2}, which means {\Pr[ \lnot \mathcal{E}_{i+1}] \leq (i+1)/n^2}. \Box

Consider {i^* = \min\{i \mid \alpha_i^2 < 6 \frac{\ln n}{n}\}}. By Lemma 6, {\Pr[ \lnot \mathcal{E}_{i^*} ] \leq i^*/n^2 \leq 1/n}. Hence, with probability {1 - 1/n}, the number of bins with at least {i^*} balls in them is at most {n\alpha_{i^*}}.

We’re almost done, but there’s one more step to do. If this number {n \alpha_{i^*}} were small, say {O(\log n)}, then we could have done a union bound, but this number may still be as large as {\Theta(\sqrt{n \log n})}. So apply Lemma 4 and the second part of Lemma 5 once more to get:

\displaystyle  \begin{array}{rcl}  \Pr[ N_{i^* +1 } > 12 \ln n] &\leq& \Pr[ N_{i^* +1 } > 12 \ln n \mid \mathcal{E}_{i^*} ] \Pr[ \mathcal{E}_{i^*}] + \Pr[ \lnot \mathcal{E}_{i^*} ] \\ &\leq& \Pr[ Bin(n, \alpha_{i^*}^2) > 12 \ln n ] + \Pr[ \lnot \mathcal{E}_{i^*} ] \\ &\leq& 1/n^2 + \Pr[ \lnot \mathcal{E}_{i^*} ] \leq \frac{n+1}{n^2} \end{array}

Finally, since {N_{i^* + 1}} is so small whp, use Lemma 4 and a union bound to say that

\displaystyle  \begin{array}{rcl}  \Pr[ N_{i^* + 2} \geq 1 ] &\leq& \Pr[ Bin(n, \left(\frac{12 \ln n}{n}\right)^2) \geq 1 ] + \Pr[ N_{i^* + 1} > 12 \ln n ] \\ &\leq& E[ Bin(n, \left(\frac{12 \ln n}{n}\right)^2) ] + \frac{n+1}{n^2} \\ &\leq& O(\frac{\log^2 n}{n}). \end{array}

Finally, the calculation in Section 2.1 below shows that {i^* = \frac{\ln \ln n}{\ln 2} + O(1)}, which completes the proof.

2.1. A Calculation

Since {\log_2 \alpha_4 = -2}, and {\log_2 \alpha_{i+1} = 1 + 2 \log_2 \alpha_i}, we calculate

\displaystyle  \log_2 \alpha_{i} = - 2^{i-4} - 1.

So, for {\log_2 \alpha_i \approx - \frac12 \log_2 n}, it suffices to set

\displaystyle  i = \log_2 \log_2 n + 3 = \frac{\ln \ln n}{\ln 2} + O(1).
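
Here is a tiny numerical check of this calculation: iterate {\alpha_{i+1} = 2\alpha_i^2} starting from {\alpha_4 = 1/4} until {\alpha_i^2 < 6 \ln n/n}, and compare the resulting {i^*} with {\log_2 \log_2 n + 3}.

    import math

    def i_star(n):
        """Iterate alpha_{i+1} = 2 alpha_i^2 from alpha_4 = 1/4 until alpha_i^2 < 6 ln n / n."""
        alpha, i = 0.25, 4
        while alpha * alpha >= 6 * math.log(n) / n:
            alpha, i = 2 * alpha * alpha, i + 1
        return i

    if __name__ == "__main__":
        for n in [10**4, 10**6, 10**9]:
            print(n, i_star(n), "vs log2 log2 n + 3 =", round(math.log2(math.log2(n)) + 3, 2))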

3. A Random Graphs Proof

Another way to show that the maximum load is {O(\log\log n)}—note that the constant is worse—is to use a first-principles analysis based on properties of random graphs. We build a random graph {G} as follows: the {n} vertices of {G} correspond to the {n} bins, and the edges correspond to balls—each time we probe two bins we connect them with an edge in {G}. For technical reasons, we’ll just consider what happens if we throw fewer balls (only {n/512} balls) into {n} bins—also, let’s imagine that each ball chooses two distinct bins each time.

Theorem 7 If we throw {\frac{n}{512}} balls into {n} bins using the best-of-two-bins method, the maximum load of any bin is {O(\log\log n)} whp.

Hence for {n} balls and {n} bins, the maximum load should be at most {512} times as much, whp. (It’s as though after every {n/512} balls, we forget about the current loads and zero out our counters—not zeroing out these counters can only give us a more evenly balanced allocation; I’ll try to put in a formal proof later.)

To prove the theorem, we need two results about the random graph {G} obtained by throwing in {n/512} random edges into {n} vertices. Both proofs are simple but surprisingly effective counting arguments; they appear at the end.

Lemma 8 The size of {G}'s largest connected component is {O(\log n)} whp.

Lemma 9 There exists a suitably large constant {K > 0} such that for all subsets {S} of the vertex set with {|S| \ge K}, the induced graph {G[S]} contains at most {5|S|/2} edges, and hence has average degree at most {5}, whp.

Given the graph {G}, suppose we repeatedly perform the following operation in rounds:

In each round, remove all vertices of degree {\leq 10} in the current graph.

We stop when there are no more vertices of small degree.
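
Here is a small sketch of this peeling process on a simulated instance; it also reports the largest component size from Lemma 8. (With only {n/512} edges the graph is very sparse, so the process typically dies out within a round or two, comfortably inside the {O(\log\log n)} bound.)

    import random
    from collections import defaultdict

    def random_choice_graph(n):
        """Vertices are bins; each of the n/512 balls contributes an edge between its two (distinct) choices."""
        adj = defaultdict(set)
        for _ in range(n // 512):
            u, v = random.sample(range(n), 2)
            adj[u].add(v)
            adj[v].add(u)
        return dict(adj)

    def largest_component(adj):
        seen, best = set(), 0
        for s in adj:
            if s in seen:
                continue
            stack, size = [s], 0
            seen.add(s)
            while stack:
                u = stack.pop()
                size += 1
                for w in adj[u]:
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
            best = max(best, size)
        return best

    def peel(adj, threshold=10):
        """Repeatedly delete every vertex of degree <= threshold; count the rounds.
        (Isolated bins are ignored: they vanish in the first round anyway.)"""
        alive = set(adj)
        rounds = 0
        while True:
            low = {v for v in alive if len(adj[v] & alive) <= threshold}
            if not low:
                break
            alive -= low
            rounds += 1
        return rounds, len(alive)

    if __name__ == "__main__":
        adj = random_choice_graph(2**20)
        print("largest component:", largest_component(adj))
        print("(rounds, survivors):", peel(adj))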

Lemma 10 This process ends after {O(\log\log n)} rounds whp, and the number of remaining vertices in each remaining component is at most {K}.

Proof: Condition on the events in the two previous lemmas. Any component {C} of size at least {K} in the current graph has average degree at most {5}; by Markov at least half the vertices have degree at most {10} and will be removed. So as long as we have at least {K} nodes in a component, we halve its size. But the size of each component was {O(\log n)} to begin, so this takes {O(\log \log n)} rounds before each component has size at most {K}. \Box

Lemma 11 If a node/bin survives {i} rounds before it is deleted, its load due to edges that have already been deleted is at most {10i}. If a node/bin is never deleted, its load is at most {10i^* + K}, where {i^*} is the total number of rounds.

Proof: Consider the nodes removed in round {1}: their degree was at most {10}, so even if all those balls went to such nodes, their final load would be at most {10}. Now, consider any node {x} that survived this round. While many edges incident to it might have been removed in this round, we claim that at most {10} of those would have contributed to {x}'s load. Indeed, each of the other endpoints of those edges went to bins with final load at most {10}. So at most {10} of them would choose {x} as their less loaded bin before it is better for them to go elsewhere.

Now, suppose {y} is deleted in round {2}: then again its load can be at most {20}: {10} because it survived the previous round, and {10} from its own degree in this round. OTOH, if {y} survives, then consider all the edges incident to {y} that were deleted in previous rounds. Each of them went to nodes that were deleted in rounds {1} or {2}, and hence had maximum load at most {20}. Thus at most {20} of these edges could contribute to {y}'s load before it was better for them to go to the other endpoint. The same inductive argument holds for any round {i \leq i^*}.

Finally, the process ends when each component has size at most {K}, so the degree of any node is at most {K}. Even if all these edges contribute to the load of a bin, it is only {10i^* + K}. \Box

By Lemma 10, the number of rounds is {i^* = O(\log \log n)} whp, so by Lemma 11 the maximum load is also {O(\log \log n)} whp.

3.1. Missing Proofs of Lemmas

Lemma 12 The size of {G}'s largest connected component is {O(\log n)} whp.

Proof: We have a graph with {n} vertices and {m=\frac{n}{512}} edges where we connect vertices at random.

\displaystyle  \begin{array}{rcl}  \Pr[ k+1\text{ vertices connected } ] &\le& \Pr[ \text{ at least } k \text{ edges fall within } k+1 \text{ nodes }] \\ &\le& \binom{m}{k}\left(\frac{\binom{k+1}{2}}{\binom{n}{2}}\right)^k = \binom{m}{k}\left(\frac{k(k+1)}{n(n-1)}\right)^k \\ &\leq& \binom{m}{k}\left(\frac{4k}{n}\right)^{2k}. \end{array}

The last inequality holds since {k(k+1) \leq 2k^2} and {n(n-1) \geq n^2/2}, so the ratio is at most {4k^2/n^2 \leq (4k/n)^2}. Now the probability that any such set exists can be bounded above by the union bound:

\displaystyle  \begin{array}{rcl}  \Pr[ \exists \text{ a connected set of size }k+1 ] &\le& \binom{n}{k+1}\binom{m}{k}\left(\frac{4k}{n}\right)^{2k}\\ &\le& n\left(\frac{ne}{k}\right)^k\left(\frac{ne}{512k}\right)^k\left(\frac{4k}{n}\right)^{2k}\\ &\le& n\left(\frac{e^2}{32}\right)^k \le \frac{1}{2n} \quad\text{ if }\; k=\Theta(\log n ) \end{array}

which proves the claim. \Box

Lemma 13 There exists a suitably large constant {K > 0} such that for all subsets {S} of the vertex set with {|S| \ge K}, the induced graph {G[S]} contains at most {5|S|/2} edges, and hence has average degree at most {5}, whp.

Proof:

\displaystyle  \Pr[ \text{ a fixed set of }k\text{ nodes gets } > \frac{5k}{2} \text{ edges }] \leq \binom{m}{5k/2}\left(\frac{4k}{n}\right)^{2\cdot 5k/2} = \binom{m}{5k/2}\left(\frac{4k}{n}\right)^{5k}.

By a union bound over all sets, the probability that such a set exists is

\displaystyle  \begin{array}{rcl}  \Pr[ \exists \text{ a bad set } ] &\le& \sum_{k \geq K} \binom{n}{k}\binom{m}{5k/2}\left(\frac{4k}{n}\right)^{5k}\\ &\le& \sum_{k \geq K} \left(\frac{ne}{k}\right)^k\left(\frac{ne}{512(5k/2)}\right)^{5k/2}\left(\frac{4k}{n}\right)^{5k} = \sum_{k \geq K} \left(\frac{k}{n}\right)^{3k/2}\alpha^k, \end{array}

where {\alpha = \frac{4^5 e^{7/2}}{1280^{5/2}} = \frac{e^{7/2}}{80^{5/2}} < 1/2}. Now, we can break this sum into two: for small values of {k}, the {(k/n)^{3k/2}} term is very small, and for large {k} the {\alpha^k} term is small. Indeed, for {k \geq 2\log_2 n}, we know that {\alpha^k \leq 1/n^{2}}, so

\displaystyle  \sum_{k = 2 \log n}^n \left(\frac{k}{n}\right)^{3k/2} \alpha^k \leq \sum_{k = 2 \log n}^n \alpha^k \leq 1/n.

Now for the rest:

\displaystyle  \sum_{k = K}^{2 \log n} \left(\frac{k}{n}\right)^{3k/2} \alpha^k \leq \sum_{k = K}^{2 \log n} \left(\frac{k}{n}\right)^{3k/2} \leq 2 \log n \cdot \left( \frac{ 2 \log n }{n} \right)^{3K/2} \leq 1/n^4,

for {K = 3}, say. \Box

Bibliographic Notes: The layered induction appears in Balanced Allocations by Azar, Broder, Karlin, and Upfal. The random graph analysis is from the paper Efficient PRAM Simulation on a Distributed Memory Machine by Karp, Luby, and Meyer auf der Heide; I learned it from Satish Rao. The Always-go-left algorithm and its analysis are from How Asymmetry Helps Load Balancing by Berthold Vöcking.

Update: Here’s a survey on the various proof techniques by Mitzenmacher, Sitaraman and Richa.

Lecture #7: The Local Lemma

1. The Local Lemma: Various Forms

The symmetric version of the local lemma we saw in class:

Theorem 1 (Symmetric Form I)
Given a collection of events {B_1, B_2, \ldots, B_m} such that {\Pr[B_i] \leq p}, and each {B_i} is independent of all but {d} of these events, if {epd < 1} then {\Pr[ \cap \overline{B_i} ] > 0}.

Very often, the sample space is defined by some experiment which involves sampling from a bunch of independent random variables, and the bad events each depend on some subset of these random variables. In such cases we can state the local lemma thus:

Theorem 2 (Symmetric Form II)

Suppose {X_1, X_2, \ldots, X_n} are independent random variables, and {B_1, B_2, \ldots, B_m} are events such that each {B_i} depends only on some subset {\{X_j : j \in S_i\}} of these variables. Moreover, suppose {\Pr[ B_i ] \leq p} and each {S_i} intersects at most {d} of the other {S_j}'s. If {epd < 1} then {\Pr[ \cap \overline{B_i} ] > 0}.

Note that both the examples from class (the {k}-SAT and Leighton-Maggs-Rao results) fall into this setting.

Finally, here’s the asymmetric form of the local lemma:

Theorem 3 (Asymmetric Form)

Given events {B_1, B_2, \ldots, B_m} with each {B_i} independent of all but the set {\Gamma_i} of these events, suppose there exist {x_i \in (0,1)} such that

\displaystyle  \Pr[ B_i ] \leq x_i \prod_{j \in \Gamma_i \setminus \{i\}} (1 - x_j).

Then {\Pr[ \cap \overline{B_i} ] \geq \prod_i (1 - x_i) > 0}.

Occasionally one needs to use the asymmetric form of the local lemma: one example is Uri Feige’s result showing a constant integrality gap for the Santa Claus problem, and the resulting approximation algorithm due to Haeupler, Saha and Srinivasan.

1.1. Proofs of the Local Lemma

The original proof of the local lemma was based on an inductive argument. This was a non-constructive proof, and the work of Beck gave the first techniques for making some of these existential proofs algorithmic.

In 2009, Moser, and then Moser and Tardos, gave new, intuitive, and more algorithmic proofs of the lemma for the case where there is an underlying set of independent random variables, and the bad events are defined over subsets of these variables. (E.g., the version of the Local Lemma given in Theorem 2, and its asymmetric counterpart.) Check out notes on the proofs of the Local Lemma by Joel Spencer and Uri Feige. The paper of Haeupler, Saha and Srinivasan gives algorithmic versions for some cases where the number of events is exponentially large.
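
To make "algorithmic" concrete, here is a sketch of the Moser-Tardos resampling algorithm specialized to {k}-SAT (my rendering of the idea, not code from their paper): start from a uniformly random assignment, and while some clause is violated, pick one and resample all of its variables uniformly at random. Their theorem bounds the expected total number of resampling steps by {\sum_i x_i/(1-x_i)}.

    import random

    def moser_tardos_ksat(clauses, n, max_steps=10**6):
        """clauses: list of clauses, each a list of literals (+i for x_i, -i for not x_i),
        over variables 1..n. Resample a violated clause until none remain (or give up)."""
        assign = {i: random.random() < 0.5 for i in range(1, n + 1)}

        def violated(clause):
            return not any(assign[abs(lit)] == (lit > 0) for lit in clause)

        for _ in range(max_steps):
            bad = [c for c in clauses if violated(c)]
            if not bad:
                return assign
            # pick some violated clause and resample all of its variables
            for lit in random.choice(bad):
                assign[abs(lit)] = random.random() < 0.5
        return None

    if __name__ == "__main__":
        # toy instance: (x1 or x2 or x3) and (not x1 or x2 or not x3)
        print(moser_tardos_ksat([[1, 2, 3], [-1, 2, -3]], n=3))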

1.2. Lower Bounds

The local lemma implies that if {d < 2^k/e} then the formula is satisfiable. This is complemented by the existence of unsatisfiable E{k}-SAT formulas with degree {d = 2^k(\frac1e + O(\frac{1}{\sqrt{k}}))}: this is proved in a paper of Gebauer, Szabo and Tardos (SODA 2011). This shows that the factor of {e} in the local lemma cannot be reduced, even for the special case of E{k}-SAT.

The fact that the factor {e} was tight for the symmetric form of the local lemma was known earlier, due to a result of Shearer (1985).

2. Local Lemma: The E{k}-SAT Version

Let me be clearer, and tease apart the existence question from the algorithmic one. (I’ve just sketched the main ideas in the “proofs”, will try to fill in details later; let me know if you see any bugs.)

Theorem 4
If {\varphi} is a E{k}-SAT formula with {m} clauses, {n} variables, and where the degree of each clause is at most {d \le 2^{k-3}}, then {\varphi} is satisfiable.

Proof: Assume there is no satisfying assignment. Then the algorithm we saw in class will run forever, no matter what random bits it reads. Let us fix {M = m \log m + 1}. So for every string {R} of {n+Mk} bits the algorithm reads from the random source, it will run for at least {M} iterations.

But now one can encode the string {R} thus: use {m \log m} bits to encode the clauses at the roots of the recursion trees, {M(\log d + 2)} bits to encode the clauses lying within these recursion trees, and {n} bits for the final settings of the variables. As we argued, this is a lossless encoding: we can recover the {n+Mk} bits from it. How long is this encoding? It is {M(\log d + 2) + n + m \log m} bits, which is strictly less than {n+Mk} for {M = m \log m + 1} and {d \leq 2^{k-3}}.
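
To spell out the arithmetic (all logs base {2}): since {d \leq 2^{k-3}} we have {\log d + 2 \leq k-1}, and so

\displaystyle  M(\log d + 2) + n + m \log m \leq M(k-1) + n + m \log m = n + Mk - (M - m \log m) = n + Mk - 1 < n + Mk.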

So this would give us a way to encode every string of length {n+Mk} into strings of shorter lengths. But since for every length {\ell}, there are {2^\ell} strings of length {\ell} and {1 + 2 + \ldots + 2^{\ell - 1} = 2^{\ell} - 1} strings of length strictly less than {\ell}, this is impossible. So this contradicts our assumption that there is no satisfying assignment.\Box

Now we can alter the proof to show that the expected running time of the algorithm is small:

Theorem 5
If {\varphi} is a E{k}-SAT formula with {m} clauses, {n} variables, and where the degree of each clause is at most {d \le 2^{k-3}}, then the algorithm FindSat finds a satisfying assignment within an expected {O(m \log m)} iterations.

Proof: Assume that we run for at least {M + t} iterations with probability at least {1/2^s}. (Again, think of {M = m \log m}.) Then for at least a {1/2^s} fraction of the {2^{n+(M+t)k}} random strings, we can compress the string into one of length {(M+t)(\log d + 2) + n + m \log m}.

But if we have any set of {2^{n+(M+t)k} \cdot 2^{-s}} strings, we must use at least {n + (M+t)k-s} bits to represent at least one of them. So

\displaystyle  n + (M+t)k - s \leq n + (M+t)(\log d + 2) + M.

If {d \leq 2^{k-3}}, we have {k - \log d - 2 \geq 1}, and

\displaystyle  (M+t)(k - \log d - 2) - s \leq M

or

\displaystyle  M+t-s \leq M \implies s \geq t.

So we get that the probability of taking more than {M+t} steps is at most {1/2^t}, which implies an expected running time of {M + O(1)}. \Box

And Lecture #6 references

Some references from class yesterday: a short-and-sweet analysis showing that Johnson’s algorithm for MAX-SAT satisfies at least {(m + OPT)/3 \geq 2OPT/3} clauses is by Engebretsen; this simplifies the original proof by Chen, Friesen and Zhang. Recall that this analysis is tight: take the clauses {(x_1 \lor x_2), (x_1 \lor x_3), (\lnot x_1)} and consider the variables in the order {x_1, x_2, x_3}; we could break ties badly, set {x_1 = 1}, and satisfy only {2} clauses whereas OPT is {3}.

While it would seem that we’ve done all we can with this algorithm, the latest SODA conference has two interesting papers on the subject.

Several questions remain: can we pin down the performance of Johnson’s algorithm with a random order more precisely? It would be interesting to understand these papers, and perhaps to simplify the algorithm or the analysis in the latter. Can it be viewed as implicitly solving a linear program?

Update: And here are some lecture notes for MAX-3-SAT from the previous time Avrim taught the course.

Notes on Lecture 6

1. The Derandomization for Independent Sets

Here’s the derandomization of the probabilistic argument for Turan’s theorem. It’s just a little trickier than I’d figured, but it also leads to a different algorithm! (See Section 2.)

Let’s recall the randomized version: Given graph {G} and permutation {\pi}, call a vertex {v} a winner if it appears before all its neighbors in the order {\pi}. Let {W_v} be the indicator of {v} being a winner according to permutation {\pi}, and hence {E_{\pi}[W_v] = 1/(d_G(v) + 1)}. Now the number of winners {W = \sum_{v \in V} W_v}, and hence {E_{\pi}[W] = \sum_{v \in V} 1/(d_G(v) +1)}. The greedy algorithm which considers the vertices in the order {\pi} will at least pick all the winners. Hence,

\displaystyle  E_\pi[Greedy] \geq E_\pi[W] = \sum_v 1/(d_v +1).

The issue now is that we don’t have an exact expression for the expected size of the independent set found by greedy that we can track; we only have a lower bound on it, namely the expected number of winners. But that’s OK: we’ll just track this lower bound instead, which we will call the “pessimistic estimator”. Let’s now pin this down.

First, given any set of vertices {S}, if we run the random permutation algorithm on the induced graph {G[S]}, we can compute the expected number of winners exactly — it’s just {\sum_{v \in S} 1/(d_{G[S]}(v) + 1)}, where {d_{G[S]}(v)} is the degree of {v} in graph {G[S]}.

So let’s do the following: having added some nodes {I_t \subseteq V} to our independent set, let {R_t = V \setminus (I_t \cup N(I_t))} be the “rest” of the nodes which are not already chosen in {I_t} nor already ruled out because they are adjacent to some node in {I_t}. Now define the “pessimistic estimator” to be

\displaystyle  \begin{array}{rcl}  \mu(t) &=& |I_t| + E[ \text{ number of winners on } G[R_t]\; ] \\ &=& |I_t| + \sum_{v \in R_t} \frac{1}{d_{G[R_t]}(v) + 1}. \end{array}

Note that we start off with {I_0} empty, so we get {\mu(0) = \sum_v 1/(d_G(v) + 1) \geq n/(d_{avg} + 1)}. And at the end, when {R_t = \emptyset}, the estimator is just {|I_t|}. So if we could ensure that at every step the value never decreases, we’d be done. One can do this in (at least) two ways; here’s the approach some of you mentioned at the end of lecture.

1.1. Showing a Good Vertex to Add

Consider all the possible vertices in {R_t}: we’ll show that one of them can be added to {I_t} to cause {\mu(t+1) \geq \mu(t)}. For a vertex {x}, let {N^+(x) = (\{x\} \cup N(x)) \cap R_t} be its neighbors (including itself) in {R_t}. If we add {x} to the I.S., the new value would be

\displaystyle  \begin{array}{rcl}  \mu(I_t \cup \{x\}) &=& |I_t| + 1 + \sum_{v \in R_t \setminus N^+(x)} \frac{1}{d_{G[R_t \setminus N^+(x)]}(v) + 1} \\ &\geq& |I_t| + 1 + \sum_{v \in R_t} \frac{1}{d_{G[R_t]}(v) + 1} - \sum_{v \in N^+(x)} \frac{1}{d_{G[R_t]}(v) + 1}. \end{array}

The inequality is because the degrees in {G[R_t \setminus N^+(x)]} can never be larger than those in {G[R_t]}. Now rewriting,

\displaystyle   \mu(I_t \cup \{x\}) - \mu(t) \geq 1 - \sum_{v \in N^+(x)} \frac{1}{d_{G[R_t]}(v) + 1}. \ \ \ \ \ (1)

If we average over all {x \in R_t}, we get

\displaystyle   E[\mu(I_t \cup \{x\})] - \mu(t) \geq 1 - \frac{1}{|R_t|} \sum_{x \in R_t} \sum_{v \in N^+(x)} \frac{1}{d_{G[R_t]}(v) + 1}, \ \ \ \ \ (2)

which is non-negative: each {v \in R_t} (and there are at most {|R_t|} of them) lies in {N^+(x)} for exactly {d_{G[R_t]}(v) + 1} choices of {x}, so the double sum is exactly {|R_t|}. And hence, there is at least one {x} to add so that {\mu(I_t \cup \{x\}) \geq \mu(I_t)}.

The part that originally threw me off in the calculations was that averaging over the {x}'s could potentially lead to a greater potential {E[\mu(I_t \cup \{x\})]} than what we started off with, {\mu(I_t)}. But that is only better for us, so all’s well.

1.2. So To Summarize

The final idea is this: for the probabilistic analysis, we didn’t know how to compute the expectation we wanted (i.e., the expected number of vertices picked by random-permutation greedy). So we used a lower bound—the expected number of winners.

And then, that lower bound was useful for the derandomization. We show that we could add in one vertex at a time to the independent set {I_t} while never decreasing {|I_t|} plus the expected number of winners in the rest.

2. BTW, A Simpler Algorithm

If we go back and stare at inequality (1) for a few minutes, we realize that this analysis also gives an analysis for the following greedy algorithm:

While there is a vertex left in the graph, add the vertex with lowest remaining degree into {I}, and remove it and all its neighbors.

(Do you see why?) We’ve shown that this will give an independent set of size at least {n/(\overline{d} + 1)}.
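
In code, the greedy algorithm is just the following sketch (here the graph is assumed to be given as a dictionary mapping each vertex to the set of its neighbors):

    def greedy_min_degree_independent_set(adj):
        """adj: dict mapping each vertex to the set of its neighbors.
        Repeatedly add the vertex of smallest remaining degree to the independent
        set, and delete it together with its neighbors."""
        remaining = {v: set(nbrs) for v, nbrs in adj.items()}
        independent = []
        while remaining:
            v = min(remaining, key=lambda u: len(remaining[u]))
            independent.append(v)
            removed = remaining[v] | {v}
            for u in removed:
                remaining.pop(u, None)
            for nbrs in remaining.values():
                nbrs -= removed
        return independent

    if __name__ == "__main__":
        # a 5-cycle: average degree 2, so we should get at least 5/3, i.e. 2 vertices
        cycle5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
        print(greedy_min_degree_independent_set(cycle5))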

3. Pessimistic Estimators for Low Congestion

The general method of pessimistic estimators is attributed to Prabhakar Raghavan, whose book is the text for the course, and who’s currently the head of Yahoo! Research. (He did this work while he was a graduate student in the mid-80s.)

His original paper used pessimistic estimators to derandomize, among other problems, the randomized-rounding algorithm for low-congestion routing example we saw in the previous lecture (due to Raghavan and Thompson). Aravind Srinivasan has some great notes on the derandomization; here’s a high-level summary.

Prabhakar’s use of these estimators is very neat: looking at the proof, it’s not obvious what to use as the pessimistic estimator, since the failure probability comes out of a Chernoff bound, which seems somewhat mysterious and awkward to handle directly. So he unrolls the first few steps of the Chernoff-bound calculation, which you recall goes as follows: if {X_e} is the congestion on an edge {e}, then

\displaystyle  \begin{array}{rcl}  \sum_e \Pr[ X_e > k C ] &=& \sum_e \Pr[ X_e - kC > 0 ] \\ &= &\sum_e \Pr[ k^{X_e - kC} > 1 ] \\ &\leq &\sum_e E[ k^{X_e - kC}] \\ & &\ldots \\ &\leq &\sum_e 1/m^c \leq 1/m^{c-1}.\end{array}

where the sum over edges represents the union bound, we use Markov’s inequality in the middle, all the m.g.f. calculations are omitted here, and {k > 1} is suitably chosen to ensure the bound of {1/m^c}.

He then notes that the expression {\sum_e E[ k^{X_e - kC}]} upper bounds the actual probability of error, and can also be tracked efficiently. In particular, fixing flow-paths for some of the source-sink pairs, we can calculate the expectation over the random choices for the rest of the pairs, since these {X_e}'s are sums of independent random variables. And we can make choices that cause this quantity not to increase. So if the estimator starts off below {1/m^{c-1} < 1}, we ensure it always stays below that quantity.

And since the final probabilities are integers {0} or {1} (once we’ve chosen one flow-path for each source-sink pair, either an edge has more than {kC} congestion, or it does not), and they sum to less than {1}, they must all be zero. This gives a deterministic algorithm with congestion at most {kC}.

Again, Aravind’s notes, or the original paper, give you more details, and the more general properties that suffice for pessimistic estimators.

Belated Notes on Lecture #2

(Sorry for the delayed notes: it made sense to wait until we’d seen Chernoff bounds before I posted these. — AG)

We saw in Lecture #2 that the expected number of comparisons for the randomized version of quicksort was at most {2nH_n}; the proof was via defining just the right random variables {X_{ij}} (“do we ever compare the elements of rank {i} and rank {j}?”). We computed that {E[X_{ij}] = 2/(j-i+1)}, and hence the expected number of comparisons is {E[\sum_{i < j} X_{ij}] \leq 2nH_n}.

However, often it helps to get a quick ballpark estimate before going for the best analysis: here’s a quick-and-dirty way to see that the number of comparisons for Quicksort is at most {c n \log n} with probability at least {1 - 1/n^{c'}}, for some constant {c'} that depends on {c}. Not only does this imply that the expected number of comparisons is {O(n \log n)} (why?), but also that the number of comparisons is very unlikely to be much larger than the expectation.

Here’s the theorem for today:

Lemma 1 With probability at least {1 - n^{-2}}, the number of levels of recursion in Quicksort is at most {28 \ln n}.

Proof:
First, let level {1} comparisons be those done when we partition the entire input into two parts, level {2} comparisons be those when we partition these two parts into two parts each, and so on. Note that the number of comparisons made in level {i} is at most {n}, so it suffices to show that the number of levels before all the parts have only a single element each is {O(\log n)} with high probability.

Let’s prove this claim. Consider a single element {e}. Let {S_i} be the set containing {e} at level {i}. Note that {S_1} is the entire input. We call level {i} to be {e}-good if either {|S_i| = 1}, or the level-{i} pivot falls in the middle half of {S_i} and hence causes {|S_{i+1}|} to be at most {3/4 |S_i|}. Since we pick a pivot uniformly at random, the chance of any level {i} being {e}-good is at least {1/2}, independent of all other levels. Hence, the probability of seeing at least {t} {e}-good levels within the top {\tau} levels is at least the probability of seeing {t} heads in {\tau} flips of an unbiased coin. (I.e., the number of {e}-good levels among the top {\tau} levels stochastically dominates a {Bin(\tau, 1/2)} random variable.)

Note that if there are {3.5 \ln n \geq \log_{4/3} n} {e}-good levels, then since each good level shrinks the set to at most {3/4} of its previous size, the element {e} must lie in a set of unit size. Setting {\tau = 28 \ln n}, the chance of seeing fewer than {3.5 \ln n} good levels is at most the chance of seeing fewer than {3.5 \ln n} heads when we expect to see {14 \ln n} heads. By a multiplicative Chernoff bound on the lower tail, the chance of this is at most

\displaystyle  \exp(-\frac{(3/4)^2 \times 14 \ln n}{2}) \leq n^{-3}

Thus the branch of the recursion containing the element {e} terminates before {28 \ln n} levels except with probability {1/n^3}. There are {n} possible elements, and we can take a trivial union bound over these {n} elements: with probability at least {1 - 1/n^2}, the Quicksort recursion must have bottomed out before {28 \ln n} levels.
\Box

Note that this shows the total number of comparisons is bounded by {28 n \ln n} with probability {1 - 1/n^2}.
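
If you want to see these quantities, here is a quick-and-dirty simulation sketch (it assumes distinct keys); the observed depth will typically be far below {28 \ln n}, which is consistent, since the lemma only claims an upper bound:

    import math
    import random

    def quicksort_stats(items):
        """Randomized quicksort on distinct keys; returns (comparisons, recursion depth)."""
        comparisons = 0
        max_depth = 0

        def sort(arr, depth):
            nonlocal comparisons, max_depth
            max_depth = max(max_depth, depth)
            if len(arr) <= 1:
                return arr
            pivot = random.choice(arr)
            comparisons += len(arr) - 1        # each other element is compared to the pivot
            left = [x for x in arr if x < pivot]
            right = [x for x in arr if x > pivot]
            return sort(left, depth + 1) + [pivot] + sort(right, depth + 1)

        sort(items, 1)
        return comparisons, max_depth

    if __name__ == "__main__":
        n = 10**5
        comps, depth = quicksort_stats(list(range(n)))
        print("comparisons:", comps, " (2 n ln n is about", round(2 * n * math.log(n)), ")")
        print("levels     :", depth, " (28 ln n is about", round(28 * math.log(n)), ")")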

But what about the expected number of comparisons? One can bound this in several ways. Here’s one: with probability {1- 1/n^2} the “good event” happens and the number of comparisons is at most {28 n \ln n}. If the good event does not happen (w.p. {1/n^2}), we know that Quicksort takes {{n \choose 2}} comparisons in the worst case, and hence the expected number of comparisons is at most

\displaystyle  28 n \ln n + \frac{1}{n^2} {n \choose 2} \leq 28 n \ln n + 1.

This is much worse than the bound of {\mu = 2nH_n} we showed in class, but it has two advantages: (a) it was quick and dirty, as promised, and (b) it also shows a concentration result—Quicksort is unlikely to do many more comparisons than you’d expect.

In fact, one can prove much better bounds: a result of McDiarmid and Hayward from SODA 1992 shows that if {Q} is r.v. denoting the number of comparisons for Quicksort, then {Q} falls outside the interval {(1 \pm \epsilon) 2nH_n} with probability

\displaystyle  n^{-2\epsilon(\ln \ln n + O(\ln \ln \ln n))}.

That’s pretty well-concentrated! (One class project could be to understand this paper, and the concentration results they use.)

Notes on Lecture #5

A couple of things I remembered about Chernoff bounds when sitting in Avrim’s lecture, which I thought I’d post here.

The first thing is a (slightly different) bound that you may find easier to remember. To begin, let us state the multiplicative bounds mentioned in the handout, altered ever so slightly:

Theorem 1

Let {X_1, X_2, \ldots, X_n} be independent {\{0,1\}} variables, and let {X = \sum_i X_i}. Let {\mu = E[X] = \sum_i E[X_i]}. Then for any {\beta \in [0,1]},

\displaystyle  \Pr[ X \leq (1 - \beta) \mu ] \leq \exp( -\frac{\beta^2}{2} \mu)  \ \ \ \ \ (1)


and

\displaystyle  \Pr[ X \geq (1 + \beta) \mu ] \leq \exp( -\frac{\beta^2}{3} \mu).  \ \ \ \ \ (2)


Moreover, for all {\beta \geq 0},

\displaystyle  \Pr[ X \geq (1 + \beta) \mu ] \leq (e^{\beta}/(1+\beta)^{1 + \beta})^\mu.  \ \ \ \ \ (3)


Now, here’s the version of the general multiplicative bounds I find easiest to remember:

Theorem 2
Let {X_1, X_2, \ldots, X_n} be independent variables taking values in the range {[0,1]}, and let {X = \sum_i X_i}. Let {\mu = E[X] = \sum_i E[X_i]}. Then for any {\beta \in [0,1]},

\displaystyle  \Pr[ X \leq (1 - \beta) \mu ] \leq \exp( -\frac{\beta^2}{2} \mu). \ \ \ \ \ (4)


Moreover, for any {\beta \geq 0}, we have

\displaystyle  \Pr[ X \geq (1 + \beta) \mu ] \leq \exp( -\frac{\beta^2}{2+\beta} \mu). \ \ \ \ \ (5)


Note that we’ve relaxed the assumption that the {X_i}'s are {\{0,1\}}-valued, and are now allowing them to lie in {[0,1]}. (This is often super-useful; you may want to go over the proof of Theorem 1 in the book or the handout and think about how you would prove this version.)

Moreover, (4) is exactly the same as (1). And if we constrain {\beta \in [0,1]}, the exponent on the right of (5) is at most {-\beta^2 \mu /3}, which is the same as in (2), so (5) is a better bound than (2) for {\beta}‘s in this range. So we’re doing at least as well as the first two tail bounds in Theorem 1.

Cool! So does this subsume (3) as well? Sadly, no. We lose something in the translation. Here’s an example. Let the {X_i}‘s be i.i.d., taking on value {1} with probability {1/n} and {0} otherwise, so that {\mu = 1}. Now, what is the value of {\beta} which gives us a tail probability of {1/n^{10}}? If we set {\beta = c\log n/\log \log n} for suitably large {c} then (3) gives us {1/n^{10}}, but for (5) to fall below {1/n^{10}} we need {\beta = c \log n} for some constant {c}.
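
Here is a tiny numerical sketch of that gap at {\mu = 1}: it just evaluates the right-hand sides of (3) and (5) for a few values of {\beta}.

    import math

    def bound_3(beta, mu=1.0):
        """Right-hand side of (3): (e^beta / (1+beta)^(1+beta))^mu."""
        return math.exp(mu * (beta - (1 + beta) * math.log(1 + beta)))

    def bound_5(beta, mu=1.0):
        """Right-hand side of (5): exp(-beta^2 mu / (2 + beta))."""
        return math.exp(-beta * beta * mu / (2 + beta))

    if __name__ == "__main__":
        for beta in [5, 10, 20, 40, 80]:
            print(f"beta={beta:3d}:  (3) gives {bound_3(beta):.3e},  (5) gives {bound_5(beta):.3e}")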

However, often the bound from (5) is in the “right ballpark”, and I find it easy to remember. Chances are, if I use a Chernoff bound in lecture, it’ll be the one in Theorem 2.

Secondly, some jargon: these bounds are often called concentration inequalities (since they show that “most of the probability mass is concentrated around the mean”) and tail bounds (they show that the “tails of the distribution are small”) or large deviation bounds (since they show that “large deviations from the mean are unlikely”). There are tons of other tail bounds, some relaxing the requirements of independence, others relaxing the boundedness requirements, others looking at functions other than the sum of the random variables—you’ll see more in the homeworks and future lectures.

Another Algorithm for SAT

BTW, when solving HW#1 Problem #2, it may be interesting to try to solve the problem for {k}-SAT. The algorithm remains the same, and you should be able to show a success probability of approximately {(\frac{k}{2(k-1)})^n}. As a sanity check, for {k=3} this is again {(3/4)^n}.

 

So here’s a different randomized algorithm to solve {k}-SAT in less than {2^n} time. This one is from a paper titled Satisfiability Coding Lemma due to Ramamohan Paturi, Pavel Pudlak and Francis Zane. Hence, we’ll call this the PPZ algorithm.

Pick a random permutation {\pi} on the variables. Pick a uniformly random “start” assignment {A_0: [n] \rightarrow \{0,1\}} for the variables. Now consider the variables in the order {\pi}, and do the following:

If the current variable {x_i} is the only remaining variable for some clause, set it to satisfy that clause. (If one such clause forces {x_i} to be true and another forces it to be false, you can abort the process at this point.) If the current variable {x_i} only occurs in clauses with more than one remaining variable, set it equal to {A_0(i)}.

Now, setting this variable to {0} or {1} will cause some clauses to be satisfied: remove these clauses. In other clauses, this variable will just disappear. E.g., if a clause was {(x_1 \lor \lnot x_7 \lor x_9)} and we set {x_1} to {0}, then the new clause will just be {(\lnot x_7 \lor x_9)}. Do all these simplifications, and then go on to the next variable.

That’s it. To recap, we consider the variables in a random order; if the current variable is forced (due to some clause of length {1}), we set it to the forced value, else we set it randomly. Continue until we have a complete assignment.
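
Here is a sketch of a single run of the algorithm in code (my transcription of the description above, with clauses given as lists of signed variable indices; it is meant to illustrate the process, not to be an optimized solver):

    import random

    def ppz_once(clauses, n):
        """One run of PPZ. clauses: list of clauses, each a list of literals
        (+i for x_i, -i for not x_i), over variables 1..n.
        Returns a satisfying assignment, or None if this run fails."""
        order = list(range(1, n + 1))
        random.shuffle(order)
        a0 = {i: random.random() < 0.5 for i in range(1, n + 1)}
        assign = {}
        live = [list(c) for c in clauses]              # clauses not yet satisfied
        for x in order:
            forced = {lit > 0 for c in live if len(c) == 1 and abs(c[0]) == x for lit in c}
            if len(forced) == 2:
                return None                            # forced both ways: abort this run
            value = forced.pop() if forced else a0[x]
            assign[x] = value
            new_live = []
            for c in live:
                if any(abs(lit) == x and (lit > 0) == value for lit in c):
                    continue                           # clause satisfied, drop it
                new_live.append([lit for lit in c if abs(lit) != x])
            live = new_live
        return assign if not live else None            # leftover (empty) clauses mean failure

    if __name__ == "__main__":
        # toy 3-SAT instance; repeat until success (the theorem bounds the
        # expected number of runs by 2^(n - n/k))
        f = [[1, 2, 3], [-1, 2, 3], [1, -2, 3], [1, 2, -3]]
        result = None
        while result is None:
            result = ppz_once(f, n=3)
        print(result)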

A theorem due to PPZ says:

Theorem 1
Given a satisfiable {k}-SAT formula, the probability that this algorithm finds a satisfying assignment is at least {2^{-n(1-1/k)}}.

Hence for {k = 3}, this gives an algorithm with expected running time approximately {2^{0.66n} \approx 1.58^n}. Not quite as good as Schöning’s algorithm, but not too bad! And the proof is also very clean. In this post we’ll just consider the case where there is a unique satisfying assignment {A^*:[n] \rightarrow \{0,1\}}. (The general case is not much harder; see Section 4.1 in the PPZ paper.)

So yeah, there’s a unique satisfying assignment {A^*}. This means that for every variable {x_i}, changing the value of {A^*(i)} to {\lnot A^*(i)} will give us an unsatisfying assignment, and some clause will complain because it is not satisfied any more; if there are several, pick any one. Call this a critical clause for variable {x_i}, and name it {C_i}. OK, {n} variables, each with its very own critical clause.

Why these critical clauses? Suppose in the clause {C_i}, we’ve set the {k-1} other variables to be the same as in {A^*}. Then the only way to satisfy {C_i} will be to also set {x_i} equal to {A^*(i)}. And if it turns out that, in the random order in which we consider the variables, {x_i} comes after all the other variables in {C_i}, this is precisely what the PPZ algorithm will do! Indeed, it will not have to make a random choice for {x_i}: our hand will be forced to choose {A^*(i)}.

So, for a random permutation {\pi}, call a variable {x_i} lucky if, among the {k} variables in its critical clause, {x_i} places last in the ordering {\pi}. Let {L(\pi)} be the set of lucky variables for permutation {\pi}. Then, assuming that we lucked out and chose {A_0(j) = A^*(j)} for all the variables {x_j \in [n] \setminus L(\pi)}, the algorithm will certainly set every variable to its value under {A^*}. So, for the permutation {\pi}, the chance (over the choice of {A_0}) that we find the satisfying assignment {A^*} is at least {2^{-(n-|L(\pi)|)}}.

Finally, what is {E[2^{-(n-|L(\pi)|)}]}, the expectation taken over random permutations {\pi}? Rewriting, it is {E[2^{|L(\pi)|}]/2^n}. By convexity, the numerator is at least {2^{E[|L(\pi)|]}}. And what is the expectation? See, there are {n} variables. Each one has a {1/k} chance to be lucky and placed last in its critical clause, and so the expected size of {L(\pi)} is {n/k}. So our success probability is at least {2^{n/k - n}}, proving the theorem for the unique assignment case.

 

A couple of notes: Firstly, you can think of the PPZ algorithm as a randomized non-backtracking Davis-Putnam procedure. You’ll probably hear about DPLL algorithms (named after Martin Davis, Hilary Putnam, George Logemann, and Donald Loveland) sooner or later, if you haven’t already; so if you’re interested, check out these notes by Satish Rao.

Secondly, the PPZ algorithm can be improved. After their paper, Paturi, Pudlak and Zane teamed up with Mike Saks to give us the PPSZ paper (here, here). Their new algorithm first does some simple preprocessing (a few steps of resolution, since you ask), and then runs the PPZ algorithm. The new PPSZ analysis beats Schöning’s performance of {(\frac{2(k-1)}{k})^n} for {k = 4} and higher.

Moreover, as we mentioned in this post, the two algorithms work better in somewhat orthogonal cases: Schöning’s algorithm works better when there are many “well-spread” satisfying assignments, and PPSZ works better with “few” solutions. And the papers of Iwama and Tamaki, and Hertli et al., combine these two algorithms carefully to get the current world record algorithms. Can you beat this world record?

Lecture 1 notes and references

The reference to the best known algorithm for 3SAT we mentioned in class is this paper: Improving PPSZ for 3-SAT using Critical Variables, by Hertli, Moser, and Scheder.

It gives an algorithm with an expected runtime of 1.321^n; the idea is to cleverly combine Schöning’s algorithm (the one you saw today and will analyze in HW1) and a previous algorithm of Paturi, Pudlak, Saks and Zane (PPSZ).

At a super-high level, Schöning’s algorithm does better if there are lots of (well-separated) solutions that the randomized process can make a beeline for. The PPSZ algorithm, another very simple randomized algorithm that we may see in a later post, works better if there are few satisfying assignments. The idea in the paper of Hertli and others is to trade off the two guarantees to do better than either one individually; this idea of combining Schöning and PPSZ goes back to a paper of Iwama and Tamaki (SODA ’04).