# CMU Randomized Algorithms

Randomized Algorithms, Carnegie Mellon: Spring 2011

## And Lecture #6 references

Some references from class yesterday: a short-and-sweet analysis for Johnson’s algorithm for MAX-SAT satisfying ${(m + OPT)/3 \geq 2OPT/3}$ clauses is by Engebretsen; this simplifies the original proof by Chen, Friesen and Zhang. Recall that this analysis is tight if you take the clauses ${(x_1 \lor x_2), (x_1 \lor x_3), (\lnot x_1)}$ and consider the variables in the order ${x_1, x_2, x_3}$: we could break ties badly and set ${x_1 = 1}$ and get ${2}$ whereas OPT is ${3}$.

While it would seem that we’ve done all we can with this algorithm, the latest SODA conference has two interesting papers on the subject:

• The paper Randomized greedy: new variants of some classic approximation algorithms by Costello, Shapira, Tetali, which shows that considering the variables in a random order in Johnson’s algorithm achieves a performance of ${(2/3 + \epsilon)OPT}$ for some constant ${\epsilon > 0}$.
• Another paper Randomized Variants of Johnson’s Algorithm for MAX SAT by Poloczek and Schnitger shows that this random order idea cannot achieve ${0.75}$, sadly. However, they give a variant of Johnson’s algorithm that does get ${0.75 OPT}$.

Several questions remain: can we pin down better the performance of Johnson’s algorithm using a random order? It would be interesting to understand these papers, and perhaps to simplify the algorithm or the analysis in the latter? Can it be viewed as implicitly solving a linear program?

Update: And here are some lecture notes for MAX-3-SAT from the previous time Avrim taught the course.

## Notes on Lecture 6

1. The Derandomization for Independent Sets

Here’s the derandomization of the probabilistic argument for Turan’s theorem. It’s just a little trickier than I’d figured, but it also leads to a different algorithm! (See Section 2.)

Let’s recall the randomized version: Given graph ${G}$ and permutation ${\pi}$, call a vertex ${v}$ a winner if it appears before all its neighbors in the order ${\pi}$. Let ${W_v}$ be the indicator of ${v}$ being a winner according to permutation ${\pi}$, and hence ${E_{\pi}[W_v] = 1/(d_G(v) + 1)}$. Now the number of winners ${W = \sum_{v \in V} W_v}$, and hence ${E_{\pi}[W] = \sum_{v \in V} 1/(d_G(v) +1)}$. The greedy algorithm which considers the vertices in the order ${\pi}$ will at least pick all the winners. Hence,

$\displaystyle E_\pi[Greedy] \geq E_\pi[W] = \sum_v 1/(d_v +1).$

The issue now is that we now don’t have an exact expression for the expected size of the independent set found by greedy that we can track, we only have a lower bound on it, namely the expected number of winners. But that’s OK: we’ll just track this lower bound instead, which we will call the “pessimistic estimator”. Let’s now pin this down.

First, given any set of vertices ${S}$, if we run the random permutation algorithm on the induced graph ${G[S]}$, we can compute the expected number of winners exactly — it’s just ${\sum_{v \in S} 1/(d_{G[S]}(v) + 1)}$, where ${d_{G[S]}(v)}$ is the degree of ${v}$ in graph ${G[S]}$.

So let’s do the following: having added some nodes ${I_t \subseteq V}$ to our independent set, let ${R_t = V \setminus (I_t \cup N(I_t))}$ be the “rest” of the nodes which are not already chosen in ${I_t}$ nor already ruled out because they are adjacent to some node in ${I_t}$. Now define the “pessimistic estimator” to be

$\displaystyle \begin{array}{rcl} \mu(t) &=& |I_t| + E[ \text{ number of winners on } G[R_t]\; ] \\ &=& |I_t| + \sum_{v \in R_t} \frac{1}{d_{G[R_t]}(v) + 1}. \end{array}$

Note that we start off with ${I_0}$ empty, then we get ${\mu(0) = \sum_v 1/(d_G(v) + 1) \geq n/(d_{avg} + 1)}$. And finally, ${R_t = \emptyset}$, then this is just ${|I_t|}$. So if we could ensure that at every step the value never decreases, we’d be done. One can do it in two ways (at least), here’s the approach some of you mentioned at the end of lecture.

1.1. Showing a Good Vertex to Add

Consider all the possible vertices in ${R_t}$: we’ll show one of them can be added to ${I_t}$ to cause ${\mu(t+1) \geq \mu(t)}$. For vertex ${x}$, let ${N^+_x = (\{x\} \cup N(x)) \cap R_t}$ be its neighbors (including itself) in ${R_t}$. If we add in ${x}$ to the I.S., the new value would be

$\displaystyle \begin{array}{rcl} \mu(I_t \cup \{x\}) &=& |I_t| + 1 + \sum_{v \in R_t \setminus N^+(x)} \frac{1}{d_{G[R_t \setminus N^+(x)]}(v) + 1} \\ &\geq& |I_t| + 1 + \sum_{v \in R_t} \frac{1}{d_{G[R_t]}(v) + 1} - \sum_{v \in N^+(x)} \frac{1}{d_{G[R_t]}(v) + 1}. \end{array}$

The inequality is because the degrees in ${G[R_t \setminus N^+(x)]}$ can never be larger than those in ${G[R_t]}$. Now rewriting,

$\displaystyle \mu(I_t \cup \{x\}) - \mu(t) \geq 1 - \sum_{v \in N^+(x)} \frac{1}{d_{G[R_t]}(v) + 1}. \ \ \ \ \ (1)$

If we average over all ${x \in R_t}$, and we get

$\displaystyle E[\mu(I_t \cup \{x\})] - \mu(t) \geq 1 - \frac{1}{R_t} \sum_{x \in R_t} \sum_{v \in N^+(x)} \frac{1}{d_{G[R_t]}(v) + 1}, \ \ \ \ \ (2)$

which is non-negative since for each ${v}$ (of which there are at most ${|R_t|}$, there are ${d_{G[R_t]}(v) + 1}$ many ${x}$‘s. And hence, there is at least one ${x}$ to add so that ${\mu(I_t \cup \{x\} \geq mu(I_t)}$.

The part that originally threw me off in the calculations was that averaging over the ${x}$‘s could potentially lead to a greater potential ${E[\mu(I_t \cup \{x\})]}$ than what we started off ${\mu(I_t)}$. But that is only better for us, so all’s well.

1.2. So To Summarize

The final idea is this: for the probabilistic analysis, we didn’t know how to compute the expectation we wanted (i.e., the expected number of vertices picked by random-permutation greedy). So we used a lower bound—the expected number of winners.

And then, that lower bound was useful for the derandomization. We show that we could add in one vertex at a time to the independent set ${I_t}$ while never decreasing ${|I_t|}$ plus the expected number of winners in the rest.

2. BTW, A Simpler Algorithm

If we go back and stare at inequality (1) for a few minutes, we realize that this analysis also gives an analysis for the following greedy algorithm:

While there is a vertex left in the graph, add the vertex with lowest remaining degree into ${I}$, and remove it and all its neighbors.

(Do you see why?) We’ve shown that this will give an independent set of size at least ${n/(\overline{d} + 1)}$.

3. Pessimistic Estimators for Low Congestion

The general method of pessimistic estimators is attributed to Prabhakar Raghavan, whose book is the text for the course, and who’s currently the head of Yahoo! Research. (He did this work while he was a graduate student in the mid-80s.)

His original paper used pessimistic estimators to derandomize, among other problems, the randomized-rounding algorithm for low-congestion routing example we saw in the previous lecture (due to Raghavan and Thompson). Aravind Srinivasan has some great notes on the derandomization; here’s a high-level summary.

Prabhakar’s use of these estimators is very neat, since looking at the proof it’s not obvious what to use as the pessimistic estimator, since there’s the Chernoff bound, which seems somewhat mysterious and awkward to handle. So he unrolls the first few steps of the Chernoff-bound calculation, which you recall goes as follows: if ${X_e}$ is the congestion on an edge ${e}$, then

$\displaystyle \begin{array}{rcl} \sum_e \Pr[ X_e > k C ] &=& \sum_e \Pr[ X_e - kC > 0 ] \\ &= &\sum_e \Pr[ k^{X_e - kC} > 1 ] \\ &\leq &\sum_e E[ k^{X_e - kC}] \\ & &\ldots \\ &\leq &\sum_e 1/m^c \leq 1/m^{c-1}.\end{array}$

where the sum over edges represents the union bound, we use Markov’s inequality in the middle, all the m.g.f. calculations are omitted here, and ${k > 0}$ is suitably chosen to ensure the bound of ${1/m^c}$.

He then notes that the expression ${\sum_e E[ k^{X_e - kC}]}$ upper bounds the actual probability of error, and also can be tracked efficiently. In particular, fixing flow-paths for some of the source-sink pairs, we can calculate the expectation over random choices for the rest of the pairs, since these ${X_e}$‘s are sums of independent random variables. And we can make choices that causes this quantity to not increase. So if the estimator starts off below ${1/m^{c-1} < 1}$, we ensure it always stays below that quantity.

And since the final probabilities are integers ${0}$ or ${1}$ (since we’ve chose one flow-path for each source-sink pair, either an edge has more than ${kC}$ congestion, or it does not), and they sum to less than ${1}$, they must all be zero. This gives the deterministic algorithm with congestion at most ${kC}$.

Again, Aravind’s notes, or the original paper, give you more details, and a more general properties that suffice for pessimistic estimators.

## Homework #2 open thread

Ask yr HW2 questions here…

## Heads-up for today’s lecture

Today’s lecture will not have notes (like the great ones Avrim’s been handing out), sorry! So you might want to bring along some paper and pencils to take your own notes. 🙂

Also, if you have a minute, look over the randomized rounding algorithm/proof for low-congestion routing we saw in the previous lecture: we’ll try to talk about its derandomization today (among other things).

## Belated Notes on Lecture #2

(Sorry for the delayed notes: it made sense to wait until we’d seen Chernoff bounds before I posted these. — AG)

We saw in Lecture #2 that the expected number of comparisons for the randomized version of quicksort was at most ${2nH_n}$; the proof was via defining just the right random variables ${X_{ij}}$ (“do we compare ${i}$ and ${j}$ at any step”), we could compute that ${E[X_{ij}] = 2/(j-i+1)}$, and hence the expected number of comparisons is ${E[\sum_{i < j} X_{ij}] \leq 2nH_n}$.

However, often it helps to get a quick ballpark estimate before going for the best analysis: here’s a quick-and-dirty way to see that the number of comparisons for Quicksort is ${c n \log n}$ with probability at least ${1 - 1/n^{c'}}$ for some other constant ${c'}$ that depends on ${c}$. Not only does this imply that the expected number of comparisons is ${O(n \log n)}$ (why?), but also that the number of comparisons is very unlikely to be much larger than the expectation.

Here’s the theorem for today:

Lemma 1 With probability at least ${1 - n^{-2}}$, the number of levels of recursion in Quicksort is at most ${28 \ln n}$.

Proof:
First, let level ${1}$ comparisons be those done when we partition the entire input into two parts, level ${2}$ comparisons be those when we partition these two parts into two parts each, and so on. Note that the number of comparisons made in level ${i}$ is at most ${n}$, so it suffices to show that the number of levels before all the parts have only a single element each is ${O(\log n)}$ with high probability.

Let’s prove this claim. Consider a single element ${e}$. Let ${S_i}$ be the set containing ${e}$ at level ${i}$. Note that ${S_1}$ is the entire input. We call level ${i}$ to be ${e}$-good if either ${|S_i| = 1}$, or the level-${i}$ pivot falls in the middle half of ${S_i}$ and hence causes ${|S_{i+1}|}$ to be at most ${3/4 |S_i|}$. Since we pick a pivot uniformly at random, the chance of any level ${i}$ being ${e}$-good is at least ${1/2}$, independent of all other levels. Hence, probability of seeing at least ${t}$ ${e}$-good levels within the top ${\tau}$ levels is at least the probability of seeing ${t}$ heads in ${\tau}$ flips of an unbiased coin. (I.e., the number of ${e}$-good levels stochastically dominates the Geom(${\tau, 1/2}$) random variable.)

Note that if there are ${3.5 \ln n \geq \log_{4/3} n}$ ${e}$-good levels, then since a good level reduces the size of the set by at least ${3/4}$, the element ${e}$ must lie in a set of unit size. Setting ${\tau = 28 \ln n}$, the chance of seeing fewer than ${3.5 \ln n}$ good levels is at most the chance of seeing fewer than ${3.5 \ln n}$ heads when we expect to see ${14 \ln n}$ heads. By a multiplicative Chernoff bound on the lower tail, the chance of this is at most

$\displaystyle \exp(-\frac{(3/4)^2 \times 14 \ln n}{2}) \leq n^{-3}$

Thus the branch of the recursion containing the element ${e}$ terminates before ${28 \ln n}$ levels except with probability ${1/n^3}$. There are ${n}$ possible elements, and we can take a trivial union bound over these ${n}$ elements: with probability at least ${1 - 1/n^2}$, the Quicksort recursion must have bottomed out before ${28 \ln n}$ levels.
$\Box$

Note that this shows the total number of comparisons is bounded by ${28 n \ln n}$ with probability ${1 - 1/n^2}$.

But what about the expected number of comparisons? One can bound this in several ways. Here’s one: with probability ${1- 1/n^2}$ the “good event” happens and the number of comparisons is at most ${28 n \ln n}$. If the good event does not happen (w.p. ${1/n^2}$), we know that Quicksort takes ${{n \choose 2}}$ comparisons in the worst case, and hence the expected number of comparisons is at most

$\displaystyle 28 n \ln n + \frac{1}{n^2} {n \choose 2} \leq 28 n \ln n + 1.$

This is much worse than the bound of ${\mu = 2nH_n}$ we showed in class, but it has two advantages: (a) it was quick and dirty, as promised, and (b) it also shows a concentration result—Quicksort is unlikely to do many more comparisons than you’d expect.

In fact, one can prove much better bounds: a result of McDiarmid and Hayward from SODA 1992 shows that if ${Q}$ is r.v. denoting the number of comparisons for Quicksort, then ${Q}$ falls outside the interval ${(1 \pm \epsilon) 2nH_n}$ with probability

$\displaystyle n^{-2\epsilon(\ln \ln n + O(\ln \ln \ln n))}.$

That’s pretty well-concentrated! (One class project could be to understand this paper, and the concentration results they use.)

## Notes on Lecture #5

A couple of things I remembered about Chernoff bounds when sitting in Avrim’s lecture, which thought I’d post here.

The first thing is a (slightly different) bound that you may find easier to remember. To begin, let us state the multiplicative bounds mentioned in the handout, altered ever so slightly:

Theorem 1

Let ${X_1, X_2, \ldots, X_n}$ be independent ${\{0,1\}}$ variables, and let ${X = \sum_i X_i}$. Let ${\mu = E[X] = \sum_i E[X_i]}$. Then for any ${\beta \in [0,1]}$,

$\displaystyle \Pr[ X \leq (1 - \beta) \mu ] \leq \exp( -\frac{\beta^2}{2} \mu) \ \ \ \ \ (1)$

and

$\displaystyle \Pr[ X \geq (1 + \beta) \mu ] \leq \exp( -\frac{\beta^2}{3} \mu). \ \ \ \ \ (2)$

Moreover, for all ${\beta \geq 0}$,

$\displaystyle \Pr[ X \geq (1 + \beta) \mu ] \leq (e^{\beta}/(1+\beta)^{1 + \beta})^\mu. \ \ \ \ \ (3)$

Now, here’s the version of the general multiplicative bounds I find easiest to remember:

Theorem 2
Let ${X_1, X_2, \ldots, X_n}$ be independent variables taking values in the range ${[0,1]}$, and let ${X = \sum_i X_i}$. Let ${\mu = E[X] = \sum_i E[X_i]}$. Then for any ${\beta \in [0,1]}$,

$\displaystyle \Pr[ X \leq (1 - \beta) \mu ] \leq \exp( -\frac{\beta^2}{2} \mu). \ \ \ \ \ (4)$

Moreover, for any ${\beta \geq 0}$, we have

$\displaystyle \Pr[ X \geq (1 + \beta) \mu ] \leq \exp( -\frac{\beta^2}{2+\beta} \mu). \ \ \ \ \ (5)$

Note that we’ve relaxed the assumption that ${X_i}$‘s are ${\{0,1\}}$, and are allowing ourselves to lie in ${[0,1]}$. (This is often super-useful, you may want to go over the proof of Theorem 1 in the book or the handout and think about how you may prove it.)

Moreover, (4) is exactly the same as (1). And if we constrain ${\beta \in [0,1]}$, the exponent on the right of (5) is at most ${-\beta^2 \mu /3}$, which is the same as in (2), so (5) is a better bound than (2) for ${\beta}$‘s in this range. So we’re doing at least as well as the first two tail bounds in Theorem 1.

Cool! So does this subsume (3) as well? Sadly, no. We lose something in the translation. Here’s an example. Let the ${X_i}$‘s be i.i.d., taking on value ${1}$ with probability ${1/n}$ and ${0}$ otherwise, so that ${\mu = 1}$. Now, what is the value of ${\beta}$ which gives us a tail probability of ${1/n^{10}}$? If we set ${\beta = c\log n/\log \log n}$ for suitably large ${c}$ then (3) gives us ${1/n^{10}}$, but for (5) to fall below ${1/n^{10}}$ we need ${\beta = c \log n}$ for some constant ${c}$.

However, often the bound from (5) is in the “right ballpark”, and I find it easy to remember. Chances are, if I use a Chernoff bound in lecture, it’ll be the one in Theorem 2.

Secondly, some jargon: these bounds are often called concentration inequalities (since they show that “most of the probability mass is concentrated around the mean”) and tail bounds (they show that the “tails of the distribution are small”) or large deviation bounds (since they show that “large deviations from the mean are unlikely”). There are tons of other tail bounds, some relaxing the requirements of independence, others relaxing the boundedness requirements, others looking at functions other than the sum of the random variables—you’ll see more in the homeworks and future lectures.

## Homework #1 Open Thread

Jeremiah just pointed out that Problem 2(a) analyzing the 3SAT algorithm was missing a couple terms from the summation. The online version has been fixed.

## Another Algorithm for SAT

BTW, when solving HW#1 Problem #2, it may be interesting to try solve the problem for ${k}$-SAT. The algorithm remains the same, and you should be able to show a success probability of approximately ${(\frac{k}{2(k-1)})^n}$. As a sanity check, for ${k=3}$ this is again ${(3/4)^n}$.

So here’s a different randomized algorithm to solve ${k}$-SAT in less than ${2^n}$ time. This one is from a paper titled Satisfiability Coding Lemma due to Ramamohan Paturi, Pavel Pudlak and Francis Zane. Hence, we’ll call this the PPZ algorithm.

Pick a random permutation ${\pi}$ on the variables. Pick a uniformly random “start” assignment ${A_0: [n] \rightarrow \{0,1\}}$ for the variables. Now consider the variables in the order ${\pi}$, and do the following:

If the current variable ${x_i}$ is the only remaining variable for some clause, set it to satisfy that clause. (If ${x_i}$ occurs positively in some clause and negatively in some other clause, you can abort the process at this point.) If the current variable ${x_i}$ only occurs in clauses with more than one variable, set it equal to ${A_0(i)}$.

Now, setting this variable to ${0}$ or ${1}$ will cause some clauses to be satisfied: remove these clauses. In other clauses, this variable will just disappear. E.g., if a clause was ${(x_1 \lor \lnot x_7 \lor x_9)}$ and we set ${x_1}$ to ${0}$, then the new clause will just be ${(\lnot x_7 \lor x_9)}$. Do all these simplifications, and then go on to the next variable.

That’s it. To recap, we just pick a random variable, if it is forced (due to some clause of length ${1}$), set it to the forced value, else set it randomly. Continue until we have a complete assignment.

A theorem due to PPZ says:

Theorem 1
Given a satisfying ${k}$-SAT formula, the probability that this algorithm finds a satisfying assignment is at least ${2^{-n(1-1/k)}}$.

Hence for ${k = 3}$, this gives an algorithm with expected running time approximately ${2^{0.66n} \approx 1.58^n}$. Not quite as good as Sch\”{o}ning’s algorithm, but not too bad! And the proof is also very clean. In this post we’ll just consider the case where there is a unique satisfying assignment ${A^*:[n] \rightarrow \{0,1\}}$. (The general case is not much harder, see Section~4.1 in the PPZ paper.)

So yeah, there’s a unique satisfying assignment ${A^*}$. This means that for every variable ${x_i}$, changing the value of ${A^*(i)}$ to ${\lnot A^*(i)}$ will give us an unsatisfying assignment, and some clause will complain because it is not satisfied any more; if there are several, pick any one. Call this a critical clause for variable ${x_i}$, and name it ${C_i}$. OK, ${n}$ variables, each with its very own critical clause.

Why these critical clauses? Suppose in the clause ${C_i}$, we’ve set the ${k-1}$ other variables to be the same as in ${A^*}$. Then the only way to satisfy ${C_i}$ will be to also set ${x_i}$ equal to ${A^*(i)}$. And, if it turns out that in the random order we’re considering the variables, if ${x_i}$ came after all the other variables in ${C_i}$, this is precisely what the PPZ algorithm will do! Indeed, it will not have to make a random choice for ${x_i}$: our hand will be forced to choose ${A^*(i)}$.

So suppose for a random permutation, call a variable ${x_i}$ lucky if among the ${k}$ variables in their critical clause, ${x_i}$ places last in the ordering ${\pi}$. And ${L(\pi)}$ be the set of lucky variables for permutation ${\pi}$. Then, assuming that we lucked out and chose ${A_0(j) == A^*(j)}$ for all the variables ${x_j \in [n] \setminus L(\pi)}$, the algorithm will certainly also set ${x_j}$ to ${A^*(j)}$. So, for the permutation ${\pi}$, the chance (over the choice of ${A_0}$) that we find the satisfying assignment ${A^*}$ is ${2^{-(n-I(\pi))}}$.

Finally, what is ${E[2^{-(n-|L(\pi)|)}]}$, the expectation taken over random permutations ${\pi}$? Rewriting, it is ${E[2^{|L(\pi)|}]/2^n}$. By convexity, the numerator is at least ${2^{E[|L(\pi)|]}}$. And what is the expectation? See, there are ${n}$ variables. Each one has a ${1/k}$ chance to be lucky and placed last in its critical clause, and so the expected size of ${L(\pi)}$ is ${n/k}$. So our success probability is at least ${2^{n/k - n}}$, proving the theorem for the unique assignment case.

A couple of notes: Firstly, you can think of the PPZ algorithm as a randomized non-backtracking Davis-Putnam procedure. You’ll probably hear about DPLL algorithms (named after Martin Davis, Hilary Putnam, George Logemann, and Donald Loveland) sooner or later, if you haven’t already; so if you’re interested, check out these notes by Satish Rao.

Secondly, the PPZ algorithm can be improved. After their paper, Paturi, Pudlak and Zane teamed up with Mike Saks to give us the PPSZ paper (here, here). Their new algorithm first does some simple preprocessing (a few steps of resolution, since you ask), and then runs the PPZ algorithm. The new PPSZ analysis beats Sch\”{o}ning’s performance of ${(\frac{2(k-1)}{k})^n}$ for ${k = 4}$ and higher.

Moreover, as we mentioned in this post, the two algorithms work better in somewhat orthogonal cases: Sch\”{o}ning’s algorithm works better when there are many “well-spread” satisfying assignments, and PPSZ works better with “few” solutions. And the papers of Iwama and Tamaki, and Hertli et al., combine these two algorithms carefully to get the current world record algorithms. Can you beat this world record?

## Lecture 1 notes and references

The reference to the best known algorithm for 3SAT we mentioned in class is this paper: Improving PPSZ for 3-SAT using Critical Variables, by Hertli, Moser, and Scheder.

It gives a algorithm with an expected runtime of 1.321^n, and the idea to cleverly combine Schoening’s algorithm (the one you saw today and will analyze in HW1) and a previous algorithm of Paturi, Pudlak, Saks and Zane (PPSZ).

At a super-high level, Schoening’s algorithm does better if there are lots of (well-separated) solutions that the randomized process can make a beeline for. The PPSZ algorithm, another very simple randomized algorithm that we may see in a later post, works better if there are few satisfying assignments. The idea in the paper of Hertli and others is how to trade off the two guarantees to do better than either one individually; this idea of combining Schoening and PPSZ goes back to a paper of Iwama and Tamaki (SODA ’04).