Randomized Algorithms, Carnegie Mellon: Spring 2011
The final is now available on the webpage, or here. Good luck!
Update: Remember it’s due 48 hours after you start, or Friday May 6, 11:59pm, whichever comes first.
Fixes: 1(b): “feasible solution to the LP plus the odd cycle inequalities.” And you only need to show that the value is as claimed whp.
Thanks for the great presentations yesterday, everyone!
The final will be posted on the course webpage by Friday 4/29 evening at the latest; I will post something on the blog once we’ve done so. You can take it in any contiguous 48-hour period of your choice — just download it when you are ready, and hand in your solutions within 48 hours of that. Slip it under my door (preferably), or email it to me otherwise. We’ll stop accepting solutions at 11:59pm on Friday 5/6.
The course FCEs are now online: please give us your feedback!!
Hey, it may be useful for today’s lecture if you have a quick read over Karger’s randomized algorithm for min-cuts. Just the basic algorithm and analysis—we won’t need the improved Karger-Stein variant.
HW#6 is out, it’s a short one. Due next Wednesday April 27th.
Today we talked about random walks on graphs, and the result that in any connected undirected graph G, for any given start vertex u, the expected time for a random walk to visit all the nodes of G (called the cover time of the graph) is at most 2m(n-1), where n is the number of vertices of G and m is the number of edges.
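Not from the lecture, but the $2m(n-1)$ bound is easy to sanity-check empirically. Here is a small sketch (the choice of a cycle graph and all constants are mine):

```python
import random

def cover_time(adj, start, rng):
    """One random walk; counts steps until every vertex has been visited."""
    seen = {start}
    u, steps = start, 0
    while len(seen) < len(adj):
        u = rng.choice(adj[u])   # move to a uniformly random neighbor
        seen.add(u)
        steps += 1
    return steps

# Cycle on n vertices: m = n edges, so the bound is 2*m*(n-1).
n = 12
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
rng = random.Random(0)
trials = 300
avg = sum(cover_time(adj, 0, rng) for _ in range(trials)) / trials
m = n  # a cycle has n edges
print(avg, 2 * m * (n - 1))  # the empirical average sits well below the bound
```

On a cycle the true cover time is about $n(n-1)/2$, so there is a comfortable factor-4 gap below the $2m(n-1)$ bound.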
In the process, we proved that for any G, if we think of the walk as being, at any point in time, on some edge heading in some direction, then in the stationary distribution each (edge, direction) pair is equally likely, with probability 1/(2m). (Actually, since we didn’t need it, we didn’t prove this stationary distribution is unique. However, if G is connected, it is not hard to prove by contradiction that it is.) We then used that to prove that the expected gap between successive traversals of any given directed edge (u,v) is 2m. See the notes.
We also gave a few examples to show this is existentially tight. For instance, on a line (n vertices, n-1 edges) the expected time to reach the other end of the line is Θ(n²). Also, on a “lollipop graph” (a clique of n/2 vertices connected to a line of n/2 vertices), the expected time to get from the clique to the end of the line is Θ(n³). Since this is not in the notes, here is a quick argument. First of all, if you are in the clique, then each step has probability about 2/n of taking you to the node connecting the clique and the handle (let’s call this “node 0”). When you are at node 0, your next step has probability 2/n of taking you to the next point on the handle (let’s call this “node 1”). So, when you are in the clique, it takes Θ(n²) steps in expectation to get to node 1. Now, think about the following experiment. Say you go into a fair casino with 1 dollar and bet on fair games (betting a dollar each time) until either you lose all your money or you have n/2 dollars in your pocket. Whenever you lose all your money, you go back to your car to get another dollar, and if you get n/2 dollars, you go upstairs to the restaurant. What is the expected number of trips to your car before you go to the restaurant? The answer is n/2, because it’s a fair casino: in expectation, all the money in your pocket when you head to the restaurant came from your car. Now, think of node 0 as your car, node 1 as entering the casino, and the end of the lollipop stick (node n/2) as the restaurant: you make about n/2 trips “to the car”, each costing Θ(n²) steps, for a total of Θ(n³).
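The casino step can be checked exactly: the chance that a fair ±1 walk starting with 1 dollar reaches $N$ dollars before going broke is $1/N$, so the expected number of trips to the car is $N$. A small exact computation (the function name and the choice $N = 10$ are mine):

```python
from fractions import Fraction

def ruin_probabilities(N):
    """p[i] = probability a fair +/-1 walk from bankroll i hits N before 0.
    Solves p[i] = (p[i-1] + p[i+1]) / 2 exactly by Gauss-Jordan elimination."""
    n = N - 1                      # unknowns p[1..N-1]
    A = [[Fraction(0)] * n for _ in range(n)]
    b = [Fraction(0)] * n
    for i in range(1, N):
        A[i - 1][i - 1] = Fraction(2)
        if i - 1 >= 1:
            A[i - 1][i - 2] = Fraction(-1)
        if i + 1 <= N - 1:
            A[i - 1][i] = Fraction(-1)
        else:
            b[i - 1] = Fraction(1)  # boundary condition p[N] = 1
    for c in range(n):
        piv = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(n):
            if r != c and A[r][c] != 0:
                t = A[r][c] / A[c][c]
                for k in range(c, n):
                    A[r][k] -= t * A[c][k]
                b[r] -= t * b[c]
    return [Fraction(0)] + [b[i] / A[i][i] for i in range(n)] + [Fraction(1)]

N = 10                       # plays the role of "n/2 dollars" in the post
p = ruin_probabilities(N)
print(p[1], 1 / p[1])        # 1/N, and expected trips to the car = N
```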
We ended our discussion by talking about resistive networks, and using the connection to give another proof of the cover time bound. In particular, we have $C_{uv} = 2m R_{uv}$, where $C_{uv}$ is the commute time between u and v (the expected time to walk from u to v and back), and $R_{uv}$ is the effective resistance between u and v.
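As a quick check of $C_{uv} = 2m R_{uv}$ on a path graph (my example, not from the lecture): on the path $0, 1, \ldots, k$ we have $m = k$ edges and effective resistance $R_{0k} = k$, so the commute time between the endpoints should be $2k^2$. The expected time $t_i$ to step from $i$ to $i+1$ satisfies $t_0 = 1$ and $t_i = 2 + t_{i-1}$, which lets us compute the hitting time exactly:

```python
from fractions import Fraction

def hitting_time_path(k):
    """Expected steps for a walk on the path 0..k to go from 0 to k.
    Uses the recurrence t[0] = 1, t[i] = 2 + t[i-1] for the expected
    time to move one node to the right."""
    t = [Fraction(1)]
    for _ in range(1, k):
        t.append(2 + t[-1])      # t[i] = 2i + 1
    return sum(t)                # sums to k^2

k = 9                            # path with k edges: m = k, R_eff(0,k) = k
h = hitting_time_path(k)
commute = 2 * h                  # the path is symmetric, so C(0,k) = 2h
print(commute, 2 * k * k)        # both equal 2 m R_uv = 2 * k * k
```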
In this lecture we talked about Rademacher bounds in machine learning. These are never worse than VC-dimension bounds, and can sometimes be quite a bit tighter. Rademacher bounds say that to bound how much you are overfitting (the gap between your error on the training set and your true error on the distribution), you can do the following: see how much you would overfit on random labels (how much better than 50% the empirical error of the best function in your class is when you give random labels to your dataset), and then double that quantity (and add a low-order term). See the notes.
HW5 has been posted; it’s due on Monday April 4th.
Update: for Exercise #2, the expression is indeed correct. (It is and not —you want to show that the solution found using the tree is lousy when used on the original metric.)
A simpler problem, if you’re stuck, is the furthest pair problem. Here, you are given a metric and want to output a pair of points whose distance is the largest. A natural (yet lousy) algorithm would be: embed the metric into a random tree while maintaining distances in expectation, find a furthest pair in the tree, and output this pair. Show an example where this algorithm sucks.
1. Some definitions
Recall that a sequence of r.v.s (denoted by $Z_0, Z_1, \ldots, Z_n$) is a martingale if each $Z_i$ satisfies $E[|Z_i|] < \infty$, and $E[Z_{i+1} \mid Z_0, \ldots, Z_i] = Z_i$.
Somewhat more generally, given a sequence $X_1, X_2, \ldots, X_n$ of random variables, a martingale with respect to $X_1, \ldots, X_n$ is another sequence of r.v.s (denoted by $Z_0, Z_1, \ldots, Z_n$) if each $Z_i$ satisfies
- $E[|Z_i|] < \infty$,
- there exist functions $g_i$ such that $Z_i = g_i(X_1, X_2, \ldots, X_i)$, and
- $E[Z_{i+1} \mid X_1, \ldots, X_i] = Z_i$.
One can define things even more generally, but for the purposes of this course, let’s just proceed with this. (If you’d like more details, check out, say, the books by Grimmett and Stirzaker, or Durrett, or many others.)
1.1. The Azuma-Hoeffding Inequality
Theorem (Azuma-Hoeffding). If $Z_0, Z_1, \ldots, Z_n$ is a martingale with $|Z_k - Z_{k-1}| \le c_k$ for each $k$, then
$\Pr[\,|Z_n - Z_0| \ge \lambda\,] \le 2\exp\left(-\frac{\lambda^2}{2\sum_k c_k^2}\right).$
(Apparently Bernstein had essentially figured this one out as well, in addition to the Chernoff-Hoeffding bounds, back in 1937.) The proof of this bound can be found in most texts; we’ll skip it here. BTW, if you just want the upper or lower tail, replace the $2$ by $1$ on the right hand side.
2. The Doob Martingale
Most often, the case we will be concerned with is where our entire space is defined by a sequence of random variables $X_1, X_2, \ldots, X_n$, where each $X_i$ takes values in the set $\Omega_i$. Moreover, we will be interested in some bounded function $f : \Omega_1 \times \cdots \times \Omega_n \to \mathbb{R}$, and will want to understand how $f(X_1, \ldots, X_n)$ behaves, when $(X_1, \ldots, X_n)$ is drawn from the underlying distribution. (Very often these $X_i$’s will be drawn from a “product distribution”—i.e., they will be independent of each other, but they need not be.) Specifically, we ask:
How concentrated is $f(X_1, \ldots, X_n)$ around its mean $E[f(X_1, \ldots, X_n)]$?
To this end, define for every $i \in \{0, 1, \ldots, n\}$, the random variable
$Z_i = E[\, f(X_1, \ldots, X_n) \mid X_1, \ldots, X_i \,].$
(At this point, it is useful to remember the definition of a random variable as a function from the sample space to the reals: so this r.v. $Z_i$ is also such a function, obtained by taking averages of $f$ over parts of the sample space.)
How does the random variable $Z_0$ behave? It’s just the constant $E[f]$: the expected value of the function given random settings for $X_1$ through $X_n$. What about $Z_1$? It is a function that depends only on its first variable, namely $Z_1(x_1) = E[f \mid X_1 = x_1]$—instead of averaging $f$ over the entire sample space, we partition the space according to the value of the first variable, and average over each part in the partition. And $Z_2$ is a function of $X_1, X_2$, which averages over the other variables. And so on to $Z_n$, which is the same as the function $f$. So, as we go from $Z_0$ to $Z_n$, the random variables go from the constant function $E[f]$ to the function $f$ itself.
Of course, we’re defining this for a reason: $(Z_0, Z_1, \ldots, Z_n)$ is a martingale with respect to $(X_1, \ldots, X_n)$; it is called the Doob martingale for $f$.
Proof: The first two properties of $(Z_i)$ being a martingale with respect to $(X_i)$ follow from $f$ being bounded, and the definition of $Z_i$ itself. For the last property,
$E[Z_{i+1} \mid X_1, \ldots, X_i] = E\big[\, E[f \mid X_1, \ldots, X_{i+1}] \;\big|\; X_1, \ldots, X_i \,\big] = E[f \mid X_1, \ldots, X_i] = Z_i.$
The first equality is the definition of $Z_{i+1}$, the second from the tower property $E[\, E[V \mid U, W] \mid U \,] = E[V \mid U]$ for random variables $U, V, W$, and the last from the definition of $Z_i$.
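Since the proof is all tower property, here is a tiny exact check by enumeration (my toy example: three fair bits, and an arbitrary bounded $f$):

```python
from fractions import Fraction
from itertools import product

def f(x):
    """Any bounded function of three bits works; this one is (# of ones)^2."""
    return (x[0] + x[1] + x[2]) ** 2

def Z(prefix):
    """Z_i = E[f | X_1..X_i = prefix], averaging over the remaining fair bits."""
    rest = 3 - len(prefix)
    vals = [f(prefix + tail) for tail in product((0, 1), repeat=rest)]
    return Fraction(sum(vals), len(vals))

# Martingale property: averaging Z_{i+1} over the next fair bit recovers Z_i,
# exactly, for every prefix of every length.
ok = all(Z(p) == (Z(p + (0,)) + Z(p + (1,))) / 2
         for i in range(3) for p in product((0, 1), repeat=i))
print(ok)   # True
```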
Assuming that $f$ was bounded was not necessary; one can work with weaker assumptions—see the texts for more details.
Before we continue on this thread, let us show some Doob martingales which arise in CS/Math-y applications.
- Throw $m$ balls into $n$ bins, and let $f$ be some function of the loads: the number of empty bins, the max load, the load of the second-most-loaded bin, or some similar function. Let $X_i$ be the index of the bin into which ball $i$ lands, and $Z_i = E[f \mid X_1, \ldots, X_i]$. For $i \in \{0, 1, \ldots, m\}$, $(Z_i)$ is a martingale with respect to $(X_i)$.
- Consider the random graph $G_{n,p}$: $n$ vertices, each of the $\binom{n}{2}$ possible edges chosen independently with probability $p$. Let $f$ be the chromatic number of the graph, the minimum number of colors to properly color the graph. There are two natural Doob martingales associated with this, depending on how we choose the variables $X_i$.
In the first one, let $X_i$ be the $i$-th edge, which gives us a martingale sequence of length $\binom{n}{2}$. This is called the edge-exposure martingale. For the second one, let $X_i$ be the collection of edges going from the $i$-th vertex to the vertices $1, 2, \ldots, i-1$: the new martingale has length $n$ and is called the vertex-exposure martingale.
- Consider a run of quicksort on a particular input: let $f$ be the number of comparisons. Let $X_1$ be the first pivot, $X_2$ the second, etc. Then $Z_i = E[f \mid X_1, \ldots, X_i]$ is a Doob martingale with respect to $(X_i)$.
BTW, are these $X_i$’s independent of each other? Naively, a pivot might depend on the size of the current set, which makes it dependent on the past. One way you can make the $X_i$’s independent is by letting each $X_i$ be, say, an independent random permutation of all $n$ elements, and when you want to choose the $i$-th pivot, picking the first element of the current set according to the permutation $X_i$. (Or, you could let $X_i$ be an independent uniformly random real in $[0,1]$ and use that to pick a random element from the current set, etc.)
- Suppose we have $r$ red and $b$ blue balls in a bin. We draw $m$ balls without replacement from this bin: what is the number of red balls drawn? Let $X_i$ be the indicator for whether the $i$-th ball is red, and let $f = X_1 + X_2 + \cdots + X_m$ be the number of red balls. Then $Z_i = E[f \mid X_1, \ldots, X_i]$ is a martingale with respect to $(X_i)$.
However, in this example, the $X_i$’s are not independent. Nonetheless, the sequence $(Z_i)$ is a Doob martingale. (As in the quicksort example, one can define it with respect to a different set of variables which are independent of each other.)
So yeah, if we want to study the concentration of $f$ around $E[f]$, we can now apply Azuma-Hoeffding to the Doob martingale, which gives us the concentration of $Z_n$ (i.e., $f$) around $Z_0$ (i.e., $E[f]$). Good, good.
Next step: to apply Azuma-Hoeffding to the Doob martingale $(Z_i)$, we need to bound $|Z_{i+1} - Z_i| \le c_{i+1}$ for all $i$. This just says that if we can go from $E[f]$ to $f$ in a “small” number of steps ($n$ of them), and each time we’re not smoothing out “too aggressively” (the $c_i$’s are small), then $f$ is concentrated about its mean.
2.1. Independence and Lipschitz-ness
One case where it’s easy to bound the $c_i$’s is when the $X_i$’s are independent of each other, and also $f$ is not too sensitive in any coordinate—namely, changing the $i$-th coordinate alone does not change the value of $f$ by more than $c_i$ (we say $f$ is $c_i$-Lipschitz in coordinate $i$). In this case the Doob martingale satisfies $|Z_i - Z_{i-1}| \le c_i$. Let’s see this in detail.
Proof: Let us use $X_{i..j}$ to denote the sequence $(X_i, X_{i+1}, \ldots, X_j)$, etc. Recall that
$Z_i = E[\, f(X_{1..n}) \mid X_{1..i} \,] = E_{X_{(i+1)..n}}[\, f(X_{1..i}, X_{(i+1)..n}) \,],$
where the last equality is from independence. Similarly for $Z_{i-1}$. Hence
$|Z_i - Z_{i-1}| = \Big| E_{X_{(i+1)..n}}\big[\, f(X_{1..i}, X_{(i+1)..n}) - E_{X_i'}[\, f(X_{1..(i-1)}, X_i', X_{(i+1)..n}) \,] \,\big] \Big| \le c_i,$
where the inequality is from the fact that changing the $i$-th coordinate from $X_i$ to $X_i'$ cannot change the function value by more than $c_i$, and that the values of $X_i'$ have total probability $1$.
Now applying Azuma-Hoeffding, we immediately get:
Corollary 5: if the $X_i$’s are independent and $f$ is $c_i$-Lipschitz in each coordinate $i$, then
$\Pr[\,|f - E[f]| \ge \lambda\,] \le 2\exp\left(-\frac{\lambda^2}{2\sum_i c_i^2}\right).$
(Disclosure: I am cheating. McDiarmid’s inequality has better constants: the constant $2$ in the denominator of the exponent moves to the numerator.) And armed with this inequality, we can give concentration results for some applications we mentioned above.
- For the balls and bins example, say $f$ is the number of empty bins when $m$ balls are thrown into $n$ bins: hence $E[f] = n(1 - \frac{1}{n})^m$. Also, changing the location of any single ball changes $f$ by at most $1$. So $f$ is $1$-Lipschitz in each coordinate, and hence
$\Pr[\,|f - E[f]| \ge \lambda\,] \le 2\exp\left(-\frac{\lambda^2}{2m}\right).$
Hence, whp, $f = n(1 - \frac{1}{n})^m \pm O(\sqrt{m \log n})$.
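A quick seeded simulation of this (my parameters: $m = n = 1000$, so $E[f] \approx n/e$, and a deviation of $150$ has probability around $2e^{-150^2/2000}$, astronomically small):

```python
import random

n = 1000                    # throw n balls into n bins
rng = random.Random(42)
loads = [0] * n
for _ in range(n):
    loads[rng.randrange(n)] += 1
empty = loads.count(0)      # the 1-Lipschitz function f

mean = n * (1 - 1 / n) ** n           # E[f], roughly n/e ~ 367.7
lam = 150                             # deviation bound to check against
print(empty, mean)
```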
- For the case where $f$ is the chromatic number of a random graph $G_{n,p}$, and we define the edge-exposure martingale, clearly $f$ is $1$-Lipschitz: adding or deleting a single edge changes the chromatic number by at most $1$. Hence
$\Pr[\,|f - E[f]| \ge \lambda\,] \le 2\exp\left(-\frac{\lambda^2}{2\binom{n}{2}}\right).$
This is not very interesting, since the right hand side is $o(1)$ only when $\lambda = \omega(n)$—but the chromatic number itself lies between $1$ and $n$, so we get almost no concentration at all.
Instead, we could use the vertex-exposure martingale, where at the $i$-th step we expose the $i$-th vertex and its edges going to the vertices $1, \ldots, i-1$. Even with respect to these variables, the function $f$ is $1$-Lipschitz, and hence
$\Pr[\,|f - E[f]| \ge \lambda\,] \le 2\exp\left(-\frac{\lambda^2}{2n}\right).$
And hence the chromatic number of the random graph is concentrated to within $\pm O(\sqrt{n \log n})$ around its mean, whp.
3. Concentration for Random Geometric TSP
McDiarmid’s inequality is convenient to use, but Lipschitz-ness often does not get us as far as we’d like (even with independence). Sometimes you need to bound $|Z_{i+1} - Z_i|$ directly to get the full power of Azuma-Hoeffding. Here’s one example:
Let $X_1, X_2, \ldots, X_n$ be points picked independently and uniformly at random from the unit square $[0,1]^2$. Let $f = f(X_1, \ldots, X_n)$ be the length of the shortest traveling salesman tour on these $n$ points. How closely is $f$ concentrated around its mean $E[f]$?
In the HW, you will show that $E[f] = \Theta(\sqrt{n})$; in fact, one can pin down $E[f]$ up to the leading constant. (See the work of Rhee and others.)
3.1. Using McDiarmid: a weak first bound
Note that $f$ is $O(1)$-Lipschitz in each coordinate: moving a single point changes the tour length by at most $2\sqrt{2}$. By Corollary 5 we get that
$\Pr[\,|f - E[f]| \ge \lambda\,] \le 2\exp\left(-\frac{\lambda^2}{O(n)}\right).$
If we want the deviation probability to be $1/\mathrm{poly}(n)$, we would have to set $\lambda = \Theta(\sqrt{n \log n})$. Not so great, since this is pretty large compared to the expectation $E[f] = \Theta(\sqrt{n})$ itself—we’d like a tighter bound.
3.2. So let’s be more careful: an improved bound
And in fact, we’ll get a better bound using the very same Doob martingale $Z_i = E[f \mid X_1, \ldots, X_i]$ associated with $f$:
But instead of just using the $O(1)$-Lipschitzness of $f$, let us bound $|Z_i - Z_{i-1}|$ better.
Lemma 6: For each $i < n$, $|Z_i - Z_{i-1}| \le c_i := O\big(\min\big(1, \frac{1}{\sqrt{n-i}}\big)\big)$, and $|Z_n - Z_{n-1}| = O(1)$.
Before we prove this lemma, let us complete the concentration bound for TSP using it. These $c_i$’s give us $\sum_i c_i^2 = O\big(\sum_{i < n} \frac{1}{n-i}\big) + O(1) = O(\log n)$, and hence Azuma-Hoeffding gives:
$\Pr[\,|f - E[f]| \ge \lambda\,] \le 2\exp\left(-\Omega\left(\frac{\lambda^2}{\log n}\right)\right),$
i.e., whp the tour length deviates from its mean by only $O(\log n)$.
3.3. Some useful lemmas
To prove Lemma 6, we’ll need a simple geometric lemma:
Lemma 7: Fix any point $x \in [0,1]^2$, and let $Y_1, \ldots, Y_k$ be drawn independently and uniformly from $[0,1]^2$. Then $E[\min_j d(x, Y_j)] = O(1/\sqrt{k})$.
Proof: Define the random variable $W = \min_j d(x, Y_j)$. Hence, $W > r$ exactly when every $Y_j$ misses the ball $B(x, r)$. For $r \le \sqrt{2}$, the area of $B(x, r) \cap [0,1]^2$ is at least $\alpha r^2$ for some constant $\alpha > 0$. So for such $r$, the chance that the $k$ points all miss this ball, and hence $\Pr[W > r]$, is at most
$(1 - \alpha r^2)^k \le e^{-\alpha r^2 k}.$
Of course, for $r > \sqrt{2}$, $\Pr[W > r] = 0$. And hence
$E[W] = \int_0^{\sqrt{2}} \Pr[W > r]\, dr \le \int_0^{\infty} e^{-\alpha r^2 k}\, dr = O(1/\sqrt{k}).$
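The $O(1/\sqrt{k})$ behavior is easy to see numerically; in a quick seeded experiment (my parameters, with $x$ taken to be the center of the square), $\sqrt{k} \cdot E[\min_j d(x, Y_j)]$ stays roughly constant as $k$ grows:

```python
import math
import random

rng = random.Random(1)

def avg_min_dist(k, trials=200):
    """Average distance from the square's center to the nearest of k
    uniform random points (the quantity bounded in Lemma 7)."""
    total = 0.0
    for _ in range(trials):
        pts = [(rng.random(), rng.random()) for _ in range(k)]
        total += min(math.hypot(x - 0.5, y - 0.5) for x, y in pts)
    return total / trials

# If E[min dist] = Theta(1/sqrt(k)), these scaled values stay bounded.
scaled = {k: avg_min_dist(k) * math.sqrt(k) for k in (25, 100, 400)}
print(scaled)   # all roughly 0.5
```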
Secondly, here is another lemma about how the TSP behaves:
Lemma 8: For any finite point set $S \subseteq [0,1]^2$ and any point $y$, $\mathrm{TSP}(S) \le \mathrm{TSP}(S \cup \{y\}) \le \mathrm{TSP}(S) + 2\, d(y, S)$.
Proof: For the first inequality, shortcut past $y$ in the optimal tour of $S \cup \{y\}$ and use the triangle inequality; for the second, take the optimal tour of $S$ and add a round-trip detour of length $2\, d(y, S)$ to $y$ from its nearest point in $S$.
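A quick numeric check of this sandwich—adding a point $x$ to a set $S$ changes the optimal tour length by at least $0$ and at most $2\,d(x, S)$—via brute-force TSP on a small random instance (instance size and seed are my choices):

```python
import itertools
import math
import random

def tsp(points):
    """Brute-force shortest tour length; fine for <= 8 points."""
    first, rest = points[0], points[1:]
    best = float("inf")
    for perm in itertools.permutations(rest):
        tour = [first, *perm, first]
        best = min(best, sum(math.dist(a, b) for a, b in zip(tour, tour[1:])))
    return best

rng = random.Random(7)
S = [(rng.random(), rng.random()) for _ in range(7)]
x = (rng.random(), rng.random())
d = min(math.dist(x, y) for y in S)       # d(x, S)
lo, hi = tsp(S), tsp(S + [x])
print(lo <= hi <= lo + 2 * d)             # the lemma, on this instance
```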
3.4. Proving Lemma 6
OK, now to the proof of Lemma 6. Recall that we want to bound $|Z_i - Z_{i-1}|$; since $f$ is $O(1)$-Lipschitz, we get $|Z_i - Z_{i-1}| = O(1)$ immediately. For the second bound of $O(1/\sqrt{n-i})$, note that
$Z_{i-1} = E\big[\, f(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n) \mid X_1, \ldots, X_i \,\big],$
where $X_i'$ is an independent copy of the random variable $X_i$. Hence
$|Z_i - Z_{i-1}| \le E\big[\, |f(\ldots, X_i, \ldots) - f(\ldots, X_i', \ldots)| \;\big|\; X_1, \ldots, X_i \,\big].$
Then, if we define the set $S = \{X_{i+1}, \ldots, X_n\}$ and $\widehat{S} = S \cup \{X_1, \ldots, X_{i-1}\}$, then we get
$|f(\ldots, X_i, \ldots) - f(\ldots, X_i', \ldots)| \le 2\, d(X_i, \widehat{S}) + 2\, d(X_i', \widehat{S}) \le 2\, d(X_i, S) + 2\, d(X_i', S),$
where the first inequality uses Lemma 8 and the second uses the fact that the minimum distance to a set only increases when the set gets smaller. But now we can invoke Lemma 7 to bound the expectation of each of the terms $d(X_i, S)$ and $d(X_i', S)$ by $O(1/\sqrt{n-i})$, since $S$ consists of $n - i$ independent uniform points. This completes the proof of Lemma 6.
3.5. Some more about Geometric TSP
For constant dimension $d \ge 3$, one can consider the same problems in $[0,1]^d$: the expected TSP length is now $\Theta(n^{1-1/d})$, and using similar arguments, you can show that deviations of $\lambda$ have probability $\exp\big(-\Omega(\lambda^2 / n^{1-2/d})\big)$.
The result we just proved was by Rhee and Talagrand, but it was not the last result about TSP concentration. Rhee and Talagrand subsequently improved this bound to show that the TSP has subgaussian tails: $\Pr[\,|f - E[f]| \ge \lambda\,] \le c\, e^{-\lambda^2/c}$ for a universal constant $c$!
We’ll show a proof of this using Talagrand’s inequality, in a later lecture.
If you’re interested in this line of research, here is a survey article by Michael Steele on concentration properties of optimization problems in Euclidean space, and another one by Alan Frieze and Joe Yukich on many aspects of probabilistic TSP.
As mentioned in a previous post, McDiarmid and Hayward use martingales to give extremely strong concentration results for QuickSort. The book by Dubhashi and Panconesi (preliminary version here) sketches this result, and also contains many other examples and extensions of the use of martingales.
Apart from giving us powerful concentration results, martingales and “stopping times” combine to give very surprising and powerful results: see this survey by Yuval Peres at SODA 2010, or these course notes by Yuval and Eyal Lubetzky.
In celebration of Les Valiant’s winning of the Turing award, today we discussed a classic result of his on the problem of oblivious routing on the hypercube. (Valiant, “A scheme for fast parallel communication”, SIAM J. Computing 1982, and Valiant and Brebner, “Universal schemes for parallel communication”, STOC 1981). The high-level idea is that rather than routing immediately to your final destination, go to a random intermediate point first. The analysis is really beautiful — you define the right quantities and it just magically works out nicely. See today’s class notes. We also discussed the Butterfly and Benes networks as well. (See the excellent notes of Satish Rao for more on them).
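The routing scheme itself is tiny to write down. Here is a sketch of the two phases—bit-fixing to a uniformly random intermediate node, then bit-fixing to the destination (function names and the left-to-right bit order are my choices; any fixed order works for bit-fixing):

```python
import random

def bit_fix_path(s, t, dim):
    """Bit-fixing route on the dim-dimensional hypercube: scan the bit
    positions in a fixed order and flip each bit where s and t differ."""
    path, cur = [s], s
    for i in range(dim):
        if (cur ^ t) >> i & 1:
            cur ^= 1 << i
            path.append(cur)
    return path

def valiant_route(s, t, dim, rng):
    """Valiant's trick: route s -> random r, then r -> t, each by bit-fixing."""
    r = rng.randrange(1 << dim)
    return bit_fix_path(s, r, dim) + bit_fix_path(r, t, dim)[1:]

rng = random.Random(3)
dim = 10
path = valiant_route(5, 1000, dim, rng)
# every hop flips exactly one bit, and the endpoints are right
print(path[0], path[-1], len(path) <= 2 * dim + 1)
```

The point of the analysis is that this oblivious choice of a random intermediate destination avoids the bad permutations that congest any deterministic oblivious scheme.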
At the end, we also briefly discussed martingales: their definition, Azuma’s inequality, and McDiarmid’s inequality (which doesn’t talk about martingales directly but is very convenient and can be proven using Azuma). The discussion at the end was extremely rushed, but the key point was: suppose you have a complicated random variable $f$ you care about, like “the running time of my randomized algorithm”, where $f$ is a function of a series of random choices $X_1, X_2, \ldots, X_n$. Then the sequence $Z_i = E[f \mid X_1, \ldots, X_i]$ is a martingale. E.g., the expected running time of quicksort given the first $i$ pivots is the expected value, over the possible choices of the next pivot, of the expected running time of quicksort given the first $i+1$ pivots.
1. An equivalent view of estimating $F_2$
Again, you have a data stream of elements $e_1, e_2, \ldots$, each element drawn from the universe $U$. This stream defines a frequency vector $f = (f_i)_{i \in U}$, where $f_i$ is the number of times element $i$ is seen. Consider the following algorithm for computing $F_2 = \sum_i f_i^2$.
Take a (suitably random) hash function $h : U \to \{-1, +1\}$. Maintain a counter $C$, which starts off at zero. Every time an element $e$ comes in, update the counter: $C \leftarrow C + h(e)$. And when queried, we reply with the value $C^2$.
Hence, having seen the stream that results in the frequency vector $f$, the counter will have the value $C = \sum_i f_i\, h(i)$. Does $C^2$ at least have the right expectation? It does:
$E[C^2] = \sum_{i,j} f_i f_j\, E[h(i)\, h(j)] = \sum_i f_i^2 = F_2,$
since $h(i)^2 = 1$ and $E[h(i)\, h(j)] = 0$ for $i \ne j$.
And what about the variance? Recall that $\mathrm{Var}(C^2) = E[C^4] - E[C^2]^2$, so let us calculate
$E[C^4] = \sum_{p,q,r,s} f_p f_q f_r f_s\, E[h(p)\, h(q)\, h(r)\, h(s)] = \sum_i f_i^4 + 6 \sum_{i < j} f_i^2 f_j^2,$
and hence
$\mathrm{Var}(C^2) = 4 \sum_{i < j} f_i^2 f_j^2 \le 2 F_2^2.$
What does Chebyshev say then?
$\Pr[\,|C^2 - F_2| \ge \epsilon F_2\,] \le \frac{\mathrm{Var}(C^2)}{\epsilon^2 F_2^2} \le \frac{2}{\epsilon^2}.$
Not that hot: in fact, this bound is usually more than $1$.
But if we take a collection of $k$ such independent counters $C_1, C_2, \ldots, C_k$, and given a query, take their average $A = \frac{1}{k}\sum_j C_j^2$, and return that: the expectation of the average remains the same, but the variance falls by a factor of $k$. And we get
$\Pr[\,|A - F_2| \ge \epsilon F_2\,] \le \frac{2}{k \epsilon^2}.$
So, our probability of error on any query is at most $\delta$ if we take $k = \frac{2}{\epsilon^2 \delta}$.
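The expectation and variance formulas above can be verified exactly on a toy frequency vector by enumerating every sign assignment $h$ (full enumeration means full independence, which is stronger than the 4-wise independence the proof needs, but it checks the same identities; the vector below is my example):

```python
from itertools import product

freq = [3, 1, 4, 2]                  # frequency vector f of a small stream
F2 = sum(f * f for f in freq)        # 9 + 1 + 16 + 4 = 30

# enumerate every sign assignment h : {1..4} -> {-1,+1}
second, fourth = [], []
for signs in product((-1, 1), repeat=len(freq)):
    C = sum(s * f for s, f in zip(signs, freq))   # the counter value
    second.append(C * C)
    fourth.append(C ** 4)

E_C2 = sum(second) / len(second)     # exact E[C^2]
Var = sum(fourth) / len(fourth) - E_C2 ** 2       # exact Var(C^2)
cross = 4 * sum(freq[i] ** 2 * freq[j] ** 2
                for i in range(len(freq)) for j in range(i + 1, len(freq)))
print(E_C2, F2)       # unbiased: E[C^2] = F2
print(Var, cross)     # Var(C^2) = 4 * sum_{i<j} f_i^2 f_j^2 <= 2 F2^2
```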
1.1. Hey, those calculations look familiar
Sure. This is just a restatement of what we did in lecture. There we took a matrix $M$ and filled it with random $\pm 1$ values—hence each row of $M$ corresponds to a hash function from $U$ to $\{-1, +1\}$. And taking $k$ rows in the matrix corresponds to the variance reduction step at the end.
1.2. Limited Independence
How much randomness do you need for the hash functions? Indeed, hash functions which are $4$-wise independent suffice for the above proofs to go through. And how does one get a $4$-wise independent hash function? Watch this blog (and the HWs) for details.
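One standard construction (not necessarily the one the HWs have in mind): a uniformly random polynomial of degree at most $3$ over a prime field is $4$-wise independent, since the Vandermonde system at any $4$ distinct points is invertible. For a tiny prime this can be verified by brute force:

```python
from itertools import product

p = 5   # small prime field, just to verify the claim by enumeration

def hash_family():
    """All degree-<=3 polynomials over Z_p -- the whole 4-wise family."""
    for coeffs in product(range(p), repeat=4):
        yield lambda x, a=coeffs: (a[0] + a[1]*x + a[2]*x*x + a[3]*x**3) % p

# pick any 4 distinct inputs; every output 4-tuple should occur equally often
inputs = (0, 1, 2, 3)
counts = {}
for h in hash_family():
    key = tuple(h(x) for x in inputs)
    counts[key] = counts.get(key, 0) + 1
print(len(counts), set(counts.values()))   # p^4 distinct tuples, each once
```

Each of the $p^4$ output tuples arises from exactly one polynomial, which is precisely $4$-wise independence of $(h(0), h(1), h(2), h(3))$.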