CMU Randomized Algorithms

Randomized Algorithms, Carnegie Mellon: Spring 2011

Final exam

The final is now posted on the webpage, or here. Good luck!

Update: Remember it’s due 48 hours after you start, or Friday May 6, 11:59pm, whichever comes first.

Fixes: 1(b): “feasible solution to the LP plus the odd cycle inequalities.” And you only need to show the value is m(1 - O(\varepsilon)) whp.

FCEs, Final, etc.

Thanks for the great presentations yesterday, everyone!

The final will be posted on the course webpage Friday 4/29 evening at the latest, I will post something on the blog once we’ve done so. You can take it in any contiguous 48 hour period of your choice — just download it when you are ready, and hand in your solutions within 48 hours of that. Slip it under my door (preferably), or email it to me otherwise. We’ll stop accepting solutions at 11:59pm on Friday 5/6.

The course FCEs are now online: please give us your feedback!!

Karger’s min-cut algorithm

Hey, it may be useful for today’s lecture if you have a quick read over Karger’s randomized algorithm for min-cuts. Just the basic algorithm and analysis—we won’t need the improved Karger-Stein variant.

HW #6 Open Thread

HW#6 is out, it’s a short one. Due next Wednesday April 27th.

Lecture 21: Random walks on graphs

Today we talked about random walks on graphs, and the result that in any connected undirected graph G, for any given start vertex u, the expected time for a random walk to visit all the nodes of G (called the cover time of the graph) is at most 2m(n-1), where n is the number of vertices of G and m is the number of edges.

In the process, we proved that for any G, if we think of the walk as being, at any point in time, on some edge heading in some direction, then in the stationary distribution each edge/direction pair is equally likely, with probability 1/(2m). (We didn’t prove that the stationary distribution is unique, since we didn’t need to; however, if G is connected, it is not hard to prove by contradiction that it is.) We then used this to prove that the expected gap between successive visits to any given directed edge (u,v) is 2m. See the notes.
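
Here is a quick simulation sketch of the cover-time bound, in case you want to see it numerically (not from the lecture; the graph, a simple path, and the start vertex and trial count are arbitrary illustrative choices):

```python
import random

def estimate_cover_time(adj, start, trials=2000):
    """Average number of steps for a random walk from `start` to visit every vertex."""
    n = len(adj)
    total = 0
    for _ in range(trials):
        seen = {start}
        v, steps = start, 0
        while len(seen) < n:
            v = random.choice(adj[v])
            seen.add(v)
            steps += 1
        total += steps
    return total / trials

# Example graph: a path on n vertices (so m = n-1 edges), started from one end.
n = 20
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
m = sum(len(nbrs) for nbrs in adj.values()) // 2

print("estimated cover time:", round(estimate_cover_time(adj, start=0), 1))
print("2m(n-1) bound:       ", 2 * m * (n - 1))
```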

We also gave a few examples to show this is existentially tight. For instance, on a line ({n} vertices, {n-1} edges) we have an expected \Omega(n^2) time to reach the other end of the line. Also, on a “lollipop graph” (a clique of n/2 vertices connected to a line of n/2 vertices), the expected time to get from the clique to the end of the line is \Omega(n^3). Since this is not in the notes, here is a quick argument. First of all, if you are in the clique, then each step has probability about 2/n of taking you to the node connecting the clique and the handle (let’s call this “node 0”). When you are at node 0, your next step has probability 2/n of taking you to the next point on the handle (let’s call this “node 1”). So, when you are in the clique, it takes \Omega(n^2) steps in expectation to get to node 1.

Now, think about the following experiment. Say you go into a fair casino with 1 dollar and bet on fair games (betting a dollar each time) until either you lose all your money or you have n/2 dollars in your pocket. Whenever you lose all your money, you go back to your car to get another dollar, and if you get n/2 dollars, you go upstairs to the restaurant. What is the expected number of trips to your car before you go to the restaurant? The answer is n/2, because it’s a fair casino, so in expectation all the money in your pocket when you head to the restaurant came from your car.

Now, think of node 0 as your car, node 1 as entering the casino, and the end of the lollipop stick (node n/2) as the restaurant: each trip back to the car costs \Omega(n^2) steps in expectation, and there are n/2 trips in expectation, which gives the \Omega(n^3) bound.
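
And here is a small simulation sketch of the lollipop bound (again just illustrative: the sizes and trial counts are arbitrary, and the constant in front of n^3 is not meant to be meaningful):

```python
import random

def lollipop_adjacency(n):
    """Clique on vertices 0..n//2-1; vertex n//2-1 ('node 0') starts a path ending at n-1."""
    k = n // 2
    adj = {v: [u for u in range(k) if u != v] for v in range(k)}   # the clique
    for v in range(k - 1, n - 1):                                  # the handle
        adj.setdefault(v, []).append(v + 1)
        adj.setdefault(v + 1, []).append(v)
    return adj

def average_hitting_time(adj, start, target, trials=50):
    total = 0
    for _ in range(trials):
        v, steps = start, 0
        while v != target:
            v = random.choice(adj[v])
            steps += 1
        total += steps
    return total / trials

for n in (16, 24, 32):
    adj = lollipop_adjacency(n)
    t = average_hitting_time(adj, start=0, target=n - 1)   # from inside the clique to the tip
    print(f"n={n:2d}   estimated hitting time {t:9.0f}   ratio t/n^3 = {t / n**3:.3f}")
```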

We ended our discussion by talking about resistive networks, and using the connection to give another proof of the cover time of a graph. In particular, we have C_{uv} = 2m R_{uv} where C_{uv} is the commute-time between u and v, and R_{uv} is the effective resistance between u and v.
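
As a quick numerical sanity check of {C_{uv} = 2m R_{uv}} (an illustrative sketch only: the small graph is an arbitrary choice, and it assumes numpy is available for the Laplacian pseudoinverse):

```python
import random
import numpy as np

# A small arbitrary connected graph: a 4-cycle plus one chord, unit resistance per edge.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n, m = 4, len(edges)
adj = {v: [] for v in range(n)}
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)

# Effective resistance via the Laplacian pseudoinverse: R_uv = (e_u - e_v)^T L^+ (e_u - e_v).
L = np.zeros((n, n))
for a, b in edges:
    L[a, a] += 1; L[b, b] += 1
    L[a, b] -= 1; L[b, a] -= 1
u, v = 1, 3
e = np.zeros(n); e[u], e[v] = 1.0, -1.0
R_uv = float(e @ np.linalg.pinv(L) @ e)

# Estimate the commute time C_uv = E[time u -> v -> u] by simulating random walks.
def hitting_time(start, target):
    x, steps = start, 0
    while x != target:
        x = random.choice(adj[x])
        steps += 1
    return steps

trials = 20000
C_est = sum(hitting_time(u, v) + hitting_time(v, u) for _ in range(trials)) / trials
print(f"simulated C_uv = {C_est:.2f},   2m * R_uv = {2 * m * R_uv:.2f}")
```

The two printed numbers should agree up to simulation noise (both around 10 for this particular graph).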

Lecture #20: Rademacher bounds

In this lecture we talked about Rademacher bounds in machine learning. These are never worse than VC-dimension bounds, and sometimes they can be quite a bit tighter. Rademacher bounds say that to bound how much you are overfitting (the gap between your error on the training set and your true error on the distribution), you can do the following. See how much you can fit random labels (how much lower than 50% the empirical error of the best function in your class is when you give random labels to your dataset) and then double that quantity (and add a low-order term). See the notes.
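
To make the recipe concrete, here is a toy sketch of my own (not from the notes): it measures how much the class of 1-D threshold classifiers can beat 50% on uniformly random labels, which is exactly the quantity the recipe above says to double. The dataset and parameters are arbitrary.

```python
import random

def best_threshold_advantage(xs, labels):
    """Best advantage over 1/2 on (xs, labels) by a 1-D threshold rule sign(x - t),
       allowing either orientation of the labels."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    pos_right = sum(1 for l in labels if l == 1)   # +1 labels to the right of the cut
    pos_left = 0                                   # +1 labels to the left of the cut
    best = 0
    for cut in range(n + 1):
        correct = pos_right + (cut - pos_left)     # predict +1 right of the cut, -1 left of it
        best = max(best, correct, n - correct)     # n - correct: the flipped orientation
        if cut < n and labels[order[cut]] == 1:    # the point crossing the cut is a +1
            pos_right -= 1
            pos_left += 1
    return best / n - 0.5

random.seed(0)
n, draws = 200, 200
xs = [random.random() for _ in range(n)]
# Average, over random +-1 labelings, of the best achievable advantage over 50%:
# this is the quantity the Rademacher recipe doubles. For thresholds it is O(1/sqrt(n)).
avg = sum(best_threshold_advantage(xs, [random.choice((-1, 1)) for _ in range(n)])
          for _ in range(draws)) / draws
print(f"average best advantage on random labels: {avg:.3f}   (1/sqrt(n) = {n ** -0.5:.3f})")
```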

HW #5 Open Thread

HW5 has been posted; it’s due on Monday April 4th.

Update: for Exercise #2, the expression E_{T \gets \mathcal{D}} [\max_{v \in V} d(v, F_T)] is indeed correct. (It is d and not d_T —you want to show that the solution F_T found using the tree is lousy when used on the original metric.)

A simpler problem, if you’re stuck, is the furthest pair problem. Here, you are given a metric and want to output a pair of points whose distance is the largest. A natural (yet lousy) algorithm would be: embed the metric into a random tree while maintaining distances in expectation, find a furthest pair in the tree, and output this pair. Show an example where this algorithm sucks.

Lecture #19: Martingales

1. Some definitions

Recall that a sequence of r.v.s {Z_0, Z_1, Z_2, \ldots} (denoted by {(Z_i)}) is a martingale if each {Z_i} satisfies {E[|Z_i|] < \infty}, and

\displaystyle  E[Z_i \mid Z_0,...,Z_{i-1}] = Z_{i-1}.

Somewhat more generally, given a sequence {(X_i)} of random variables, a martingale with respect to {(X_i)} is a sequence of r.v.s {Z_0, Z_1, Z_2, \ldots} (denoted by {(Z_i)}) such that each {Z_i} satisfies

  • {E[|Z_i|] < \infty},
  • there exist functions {g_i} such that {Z_i = g_i(X_1, X_2, \ldots, X_i)}, and
  • {E[Z_i \mid X_1, \ldots ,X_{i-1}] = Z_{i-1}}.

One can define things even more generally, but for the purposes of this course, let’s just proceed with this. (If you’d like more details, check out, say, the books by Grimmett and Stirzaker, or Durrett, or many others.)

1.1. The Azuma-Hoeffding Inequality

Theorem 1 (Azuma-Hoeffding) If {(Z_i)} is a martingale such that {|Z_i - Z_{i-1}| \leq c_i} for each {i}, then

\displaystyle  \Pr[|Z_n - Z_0| \geq \lambda] \leq 2\exp\left\{-\frac{\lambda^2}{2 \sum_i c_i^2} \right\}.

(Apparently Bernstein had essentially figured this one out as well, in addition to the Chernoff-Hoeffding bounds, back in 1937.) The proof of this bound can be found in most texts; we’ll skip it here. BTW, if you just want the upper or lower tail, replace {2e^{blah}} by {e^{blah}} on the right hand side.
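
As a quick numerical sanity check (a toy example, not part of the lecture): the simple {\pm 1} random walk is a martingale with {c_i = 1}, so we can compare its empirical tail to the Azuma-Hoeffding bound.

```python
import math
import random

n, trials, lam = 400, 5000, 40     # walk length, number of walks, deviation threshold lambda

# Z_i = sum of the first i independent +-1 steps is a martingale with |Z_i - Z_{i-1}| = c_i = 1.
exceed = 0
for _ in range(trials):
    z_n = sum(random.choice((-1, 1)) for _ in range(n))
    if abs(z_n) >= lam:
        exceed += 1

empirical = exceed / trials
azuma = 2 * math.exp(-lam ** 2 / (2 * n))    # 2 exp(-lambda^2 / (2 sum_i c_i^2))
print(f"empirical Pr[|Z_n - Z_0| >= {lam}] = {empirical:.4f},   Azuma-Hoeffding bound = {azuma:.4f}")
```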

2. The Doob Martingale

Most often, the case we will be concerned with is where our entire space is defined by a sequence of random variables {X_1, X_2, \ldots, X_n}, where each {X_i} takes values in the set {\Omega}. Moreover, we will be interested in some bounded function {f: \Omega^n \rightarrow {\mathbb R}}, and will want to understand how {f(X_1, X_2, \ldots, X_n)} behaves, when {(X_i)} is drawn from the underlying distribution. (Very often these {X_i}‘s will be drawn from a “product distribution”—i.e., they will be independent of each other, but they need not be.) Specifically, we ask:

How concentrated is {f} around its mean {E[f] := E_{X_1, X_2, \ldots, X_n}[f(X_1, X_2, \ldots, X_n)]}?

To this end, define for every {i \in \{0,1, \ldots, n\}}, the random variable

\displaystyle  Z_i := E[ f(X) \mid X_1, X_2, \ldots,X_i ].

(At this point, it is useful to remember the definition of a random variable as a function from the sample space to the reals: so this r.v. {Z_i} is also such a function, obtained by taking averages of {f} over parts of the sample space.)

How does the random variable {Z_0} behave? It’s just the constant {E[f]}: the expected value of the function {f} given random settings for {X_1} through {X_n}. What about {Z_1}? It is a function that depends only on its first variable, namely {Z_1(x_1) = E_{X_2, \ldots, X_n}[f(x_1, X_2, \ldots, X_n)]}—instead of averaging {f} over the entire sample space, we partition {\Omega} according to the value of the first variable, and average over each part in the partition. And {Z_2} is a function of {x_1, x_2}, obtained by averaging over the remaining variables. And so on to {Z_n}, which is the same as the function {f}. So, as we go from {0} to {n}, the random variables {Z_i} go from the constant function {E[f]} to the function {f}.

Of course, we’re defining this for a reason: {(Z_i)} is a martingale with respect to {(X_i)}.

Lemma 2 For a bounded function {f}, the sequence {(Z_i)_{i = 0}^n} is a martingale with respect to {(X_i)}. (It’s called the Doob martingale for {f}.)

Proof: The first two properties of {(Z_i)} being a martingale with respect to {(X_i)} follow from {f} being bounded, and the definition of {Z_i} itself. For the last property,

\displaystyle  \begin{array}{rl}  E[Z_i \mid X_1, \ldots X_{i-1}] &= E[\, E[f \mid X_1, X_2, \ldots, X_i] \mid X_1, \ldots X_{i-1} ] \\ &= E[ f \mid X_1, \ldots X_{i-1} ] = Z_{i-1}. \end{array}

The first equality is the definition of {Z_i}, the second from the fact that {E[ U \mid V ] = E[ E[ U \mid V, W ] \mid V ]} for random variables {U, V, W}, and the last from the definition of {Z_{i-1}}. \Box

Assuming that {f} is bounded is not necessary; one can work with weaker assumptions—see the texts for more details.

Before we continue on this thread, let us show some Doob martingales which arise in CS/Math-y applications.

  1. Throw {m} balls into {n} bins, and let {f} be some function of the load: the number of empty bins, the max load, the load of the second-most-loaded bin, or some similar function. Let {\Omega = [n]}, and {X_i} be the index of the bin into which ball {i} lands. For {Z_i = E[ f \mid X_1, \ldots, X_i ]}, {(Z_i)} is a martingale with respect to {(X_i)}. (A small simulation sketch of this martingale appears right after this list.)
  2. Consider the random graph {G_{n,p}}: {n} vertices, each of the {\binom{n}{2}} edges chosen independently with probability {p}. Let {\chi} be the chromatic number of the graph, the minimum number of colors to properly color the graph. There are two natural Doob martingales associated with this, depending on how we choose the variables {X_i}.

    In the first one, let {X_i} indicate whether the {i^{th}} edge is present, which gives us a martingale sequence of length {\binom{n}{2}}. This is called the edge-exposure martingale. For the second one, let {X_i} be the collection of edges going from the vertex {i} to vertices {1, 2, \ldots, i-1}: the new martingale has length {n} and is called the vertex-exposure martingale.

  3. Consider a run of quicksort on a particular input: let {Q} be the number of comparisons. Let {X_1} be the first pivot, {X_2} the second, etc. Then {Z_i = E[ Q \mid X_1, \ldots, X_i ]} is a Doob martingale with respect to {(X_i)}.

    BTW, are these {X_i}‘s independent of each other? Naively, they might depend on the size of the current set, which makes it dependent on the past. One way you can make these independent is by letting these {X_i}‘s be, say, random independent permutations on all {n} elements, and when you want to choose the {i^{th}} pivot, pick the first element from the current set according to the permutation {X_i}. (Or, you could let {X_i} be a random independent real in {[0,1]} and use that to pick a random element from the current set, etc.)

  4. Suppose we have {r} red and {b} blue balls in a bin. We draw {n} balls without replacement from this bin: what is the number of red balls drawn? Let {X_i} be the indicator for whether the {i^{th}} ball is red, and let {f = X_1 + X_2 + \ldots + X_n} be the number of red balls drawn. Then {Z_i = E[ f \mid X_1, \ldots, X_i ]} is a martingale with respect to {(X_i)}.

    However, in this example, the {X_i}‘s are not independent. Nonetheless, the sequence is a Doob martingale. (As in the quicksort example, one can define it with respect to a different set of variables which are independent of each other.)
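
Here is a small Monte Carlo sketch of the Doob martingale from the balls-and-bins example above (illustrative only: the values of m, n, and the sample counts are arbitrary, and since the conditional expectations are only estimated, sampling noise can push the successive differences slightly above 1):

```python
import random

m, n, samples = 15, 10, 2000       # m balls, n bins, Monte Carlo samples per estimate

def empty_bins(placements):
    return n - len(set(placements))

def Z(i, prefix):
    """Monte Carlo estimate of Z_i = E[ f | X_1, ..., X_i = prefix ], f = number of empty bins."""
    total = 0
    for _ in range(samples):
        rest = [random.randrange(n) for _ in range(m - i)]
        total += empty_bins(prefix + rest)
    return total / samples

random.seed(1)
X = [random.randrange(n) for _ in range(m)]    # one realized sequence of ball placements
Zs = [Z(i, X[:i]) for i in range(m + 1)]

print("Z_0 (approx E[f]):", round(Zs[0], 2), "  exact E[f] =", round(n * (1 - 1 / n) ** m, 2))
print("Z_m (= f on this outcome):", round(Zs[-1], 2), "  f =", empty_bins(X))
print("max |Z_i - Z_{i-1}|:", round(max(abs(a - b) for a, b in zip(Zs[1:], Zs)), 2))
```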

So yeah, if we want to study the concentration of {f} around {E[f]}, we can now apply Azuma-Hoeffding to the Doob martingale, which gives us the concentration of {Z_n} (i.e., {f}) around {Z_0} (i.e., {E[f]}). Good, good.

Next step: to apply Azuma-Hoeffding to the Doob martingale {(Z_i)}, we need to bound {|Z_i - Z_{i-1}|} for all {i}. This just says that if we can go from {f} to {Ef} in a “small” number of steps ({n}), and at each step we’re not smoothing out “too aggressively” ({|Z_i - Z_{i-1}| \leq c_i}), then {f} is concentrated about its mean.

2.1. Independence and Lipschitz-ness

One case when it’s easy to bound the {|Z_i - Z_{i-1}|}‘s is when the {X_i}‘s are independent of each other, and also {f} is not too sensitive in any coordinate—namely, changing any coordinate does not change the value of {f} by much. Let’s see this in detail.

Definition 3 Given values {(c_i)_{i =1}^n}, the function {f} is \underline{{(c_i)}-Lipschitz} if for all {i \in [n]}, all {x_1, \ldots, x_n \in \Omega}, and all {x_i' \in \Omega}, it holds that

\displaystyle  |f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n ) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n ) | \leq c_i.

If {c_i = c} for all {i}, then we just say {f} is {c}-Lipschitz.

Lemma 4 If {f} is {(c_i)}-Lipschitz and {X_1, X_2, \ldots, X_n} are independent, then the Doob martingale of {f} with respect to {(X_i)} satisfies

\displaystyle  |Z_i - Z_{i-1}| \leq c_i.

Proof: Let us use {X_{(i:j)}} to denote the sequence {X_i, \ldots, X_j}, etc. Recall that

\displaystyle  \begin{array}{rl}  Z_i &= E[ f \mid X_{(1:i)} ] = \sum_{a_{i+1}, \ldots, a_n} f(X_{(1:i)}, a_{(i+1:n)}) \Pr[ X_{(i+1:n)} = a_{(i+1:n)} \mid X_{(1:i)} ] \\ &= \sum_{a_{i+1}, \ldots, a_n} f(X_{(1:i)}, a_{(i+1:n)}) \Pr[ X_{(i+1:n)} = a_{(i+1:n)} ] \end{array}

where the last equality is from independence. Similarly for {Z_{i-1}}. Hence

\displaystyle  \begin{array}{rl}  | Z_i - Z_{i-1} | &\leq \sum_{a_{i+1}, \ldots, a_n} \bigg| f(X_{(1:i)}, a_{(i+1:n)}) - \sum_{a_i'} \Pr[X_i = a_i'] f(X_{(1:i-1)}, a_i', a_{(i+1:n)}) \bigg| \cdot \Pr[ X_{(i+1:n)} = a_{(i+1:n)} ] \\ &\le \sum_{a_{i+1}, \ldots, a_n} c_i \cdot \Pr[ X_{(i+1:n)} = a_{(i+1:n)} ] = c_i. \end{array}

where the first inequality is just the triangle inequality, and the second uses the fact that changing the {i^{th}} coordinate from {a_i} to {a_i'} cannot change the function value by more than {c_i}, together with {\sum_{a_i'} \Pr[X_i = a_i'] = 1}. \Box

Now applying Azuma-Hoeffding, we immediately get:

Corollary 5 (McDiarmid’s Inequality) If {f} is {(c_i)}-Lipschitz and {X_1, X_2, \ldots, X_n} are independent, then

\displaystyle  \begin{array}{rl}  \Pr[ f - E[f] \geq \lambda ] &\leq \exp\left( - \frac{ \lambda^2 }{2 \sum_i c_i^2 } \right), \\ \Pr[ f - E[f] \leq -\lambda ] &\leq \exp\left( - \frac{ \lambda^2 }{2 \sum_i c_i^2 } \right). \end{array}

(Disclosure: I am cheating. McDiarmid’s inequality has better constants, the constant {2} in the denominator moves to the numerator.) And armed with this inequality, we can give concentration results for some applications we mentioned above.

  1. For the {m} balls and {n} bins example, say {f} is the number of empty bins: hence {Ef = n(1-1/n)^m \approx n\,e^{-m/n}}. Also, changing the location of the {i^{th}} ball changes {f} by at most {1}. So {f} is {1}-Lipschitz, and hence

    \displaystyle  \Pr [ | f - Ef | \geq \lambda ] \leq 2 \exp\left( - \frac{\lambda^2}{2m} \right).

    Hence, whp, {f \approx n\,e^{-m/n} \pm O(\sqrt{m \log n})}. (A quick simulation sketch checking this appears after this list.)

  2. For the case where {\chi} is the chromatic number of a random graph {G_{n,p}}, and we define the edge-exposure martingale {Z_i = E[ \chi \mid E_1, E_2, \ldots, E_i ]}, clearly {\chi} is {1}-Lipschitz. Hence

    \displaystyle  \Pr [ | \chi - E\chi | \geq \lambda ] \leq 2 \exp\left( - \frac{\lambda^2}{2\binom{n}{2}} \right)

    This is not very interesting, since the right hand side is {< 1} only when {\lambda \approx n}—but the chromatic number itself lies in {[1,n]}, so we get almost no concentration at all.

    Instead, we could use a vertex-exposure martingale, where at the {i^{th}} step we expose the vertex {i} and its edges going to vertices {1, 2, \ldots, i-1}. Even with respect to these variables, the function {\chi} is {1}-Lipschitz, and hence

    \displaystyle  \Pr [ | \chi - E\chi | \geq \lambda ] \leq 2 \exp\left( - \frac{\lambda^2}{2n} \right)

    And hence the chromatic number of the random graph {G_{n,p}} is concentrated to within {\approx \sqrt{n}} around its mean.
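
Here is the quick simulation sketch promised in the first example (illustrative, with arbitrary parameters): the empirical spread of the number of empty bins sits comfortably inside the {O(\sqrt{m \log n})} scale from the bound (in fact it is quite a bit smaller, since McDiarmid is not tight here).

```python
import math
import random

m, n, trials = 1000, 500, 1000
values = []
for _ in range(trials):
    occupied = set(random.randrange(n) for _ in range(m))
    values.append(n - len(occupied))            # f = number of empty bins

mean = sum(values) / trials
std = math.sqrt(sum((v - mean) ** 2 for v in values) / trials)
print(f"E[f] formula n(1-1/n)^m = {n * (1 - 1 / n) ** m:.1f},   sample mean = {mean:.1f}")
print(f"sample std dev = {std:.1f}   vs   sqrt(m log n) = {math.sqrt(m * math.log(n)):.1f}")
```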

3. Concentration for Random Geometric TSP

McDiarmid’s inequality is convenient to use, but Lipschitz-ness often does not get us as far as we’d like (even with independence). Sometimes you need to bound {|Z_i - Z_{i-1}|} directly to get the full power of Azuma-Hoeffding. Here’s one example:

Let {X_1, X_2, \ldots, X_n} be {n} points picked independently and uniformly at random from the unit square {[0,1]^2}. Let {\tau: ([0,1]^2)^n \rightarrow {\mathbb R}} be the length of the shortest traveling salesman tour on {n} points. How closely is {\tau} concentrated around its mean {E[\tau(X_1, X_2, \ldots, X_n)]}?

In the HW, you will show that {E\tau = \Theta(n^{1/2})}; in fact, one can pin down {E\tau} up to the leading constant. (See the work of Rhee and others.)

3.1. Using McDiarmid: a weak first bound

Note that {\tau} is {2\sqrt{2}}-Lipschitz. By Corollary 5 we get that

\displaystyle  \Pr[ |\tau - E\tau| \geq \lambda ] \leq 2 \exp( -\frac{\lambda^2}{16n}).

If we want the deviation probability to be {1/poly(n)}, we would have to set {\lambda = \Omega(\sqrt{n \log n})}. Not so great, since this is pretty large compared to the expectation itself—we’d like a tighter bound.

3.2. So let’s be more careful: an improved bound

And in fact, we’ll get a better bound using the very same Doob martingale {(Z_i)} associated with {\tau}:

\displaystyle  Z_i = E[ \tau(X_1, X_2, \ldots, X_n) \mid X_1, X_2, \ldots, X_i ].

But instead of just using the {O(1)}-Lipschitzness of {\tau}, let us bound {|Z_i - Z_{i-1}|} better.

Lemma 6

\displaystyle  |Z_i - Z_{i-1}| \leq \min\left\{ 2 \sqrt{2}, \frac{O(1)}{\sqrt{n-i}} \right\}.

Before we prove this lemma, let us complete the concentration bound for TSP using this. Setting {c_i = \min\{2\sqrt{2}, O(1/\sqrt{n-i})\}} gives us {\sum_i c_i^2 = O(\log n)}, and hence Azuma-Hoeffding gives:

\displaystyle  \Pr[ |\tau - E\tau| \geq \lambda ] \leq 2 \exp\left( -\frac{\lambda^2}{2\sum_{i} c_i^2}\right) \leq 2 \exp\left( - \frac{\lambda^2}{O(\log n)} \right).

So

\displaystyle  \Pr[ | \tau - E\tau | \leq O(\log n) ] \geq 1 - 1/poly(n).

Much better!

3.3. Some useful lemmas

To prove Lemma 6, we’ll need a simple geometric lemma:

Lemma 7 Let {x \in [0,1]^2}, and let {A} be a set of {k} points picked uniformly at random from {[0,1]^2}. Then the expected distance from {x} to its closest point in {A} is {O(1/\sqrt{k})}.

Proof: Define the random variable {W = d(x,A)}. Note that {W \geq r} exactly when {B(x,r) \cap A = \emptyset}. For {r \in [0,\sqrt{2}]}, the area of {B(x, r) \cap [0,1]^2} is at least {c_0 r^2} for some constant {c_0}.

Define {r_0 = 1/\sqrt{c_0 k}}. For any {r = \lambda r_0 \in [0,\sqrt{2}]}, the chance that all {k} points miss the ball {B(x,r)}, and hence {\Pr[ W \geq r = \lambda r_0 ]}, is at most

\displaystyle  (1 - c_0 r^2)^k = (1 - \lambda^2/k)^k \leq e^{-\lambda^2}.

Of course, for {r > \sqrt{2}}, {\Pr[W \geq r ] = 0}. And hence

\displaystyle  E[W] = \int_{r \geq 0} \Pr[ W \geq r ] dr = \sum_{\lambda \in {\mathbb Z}_{\geq 0}} \int_{r \in [\lambda r_0, (\lambda+1)r_0]} \Pr[ W \geq r ] dr \leq \sum_{\lambda \in {\mathbb Z}_{\geq 0}} (\lambda+1)r_0 \cdot e^{-\lambda^2} \leq O(r_0).

\Box
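
A quick numerical check of the {O(1/\sqrt{k})} scaling in Lemma 7 (illustrative only; the fixed point {x} and the trial counts are arbitrary choices):

```python
import math
import random

def expected_nearest_distance(k, trials=300):
    """Monte Carlo estimate of E[ d(x, A) ] for k uniform points A in the unit square."""
    x = (0.3, 0.7)                    # an arbitrary fixed point
    total = 0.0
    for _ in range(trials):
        total += min(math.hypot(x[0] - random.random(), x[1] - random.random())
                     for _ in range(k))
    return total / trials

for k in (10, 40, 160, 640):
    d = expected_nearest_distance(k)
    print(f"k = {k:3d}   E[d(x,A)] ~ {d:.4f}   d * sqrt(k) = {d * math.sqrt(k):.3f}")
```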

Secondly, here is another lemma about how the TSP behaves:

Lemma 8 For any set {A = \{x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n\}} of {n-1} points, and any two points {x_i, x_i'}, we get

\displaystyle  |\tau(A + x_i) - \tau(A + x_i')| \leq 2(d(x_i, A) + d(x_i', A)).

Proof: Follows from the fact that {\tau(A + x) \in [TSP(A), TSP(A) + 2d(x, A)]}, for any {x}. \Box

3.4. Proving Lemma 6

OK, now to the proof of Lemma 6. Recall that we want to bound {|Z_i - Z_{i-1}|}; since {\tau} is {2\sqrt{2}}-Lipschitz, we get {|Z_i - Z_{i-1}| \leq 2\sqrt{2}} immediately. For the second bound of {O(1/\sqrt{n - i})}, note that

\displaystyle  \begin{array}{rl}  Z_{i-1} &= E[ \tau(X_1, X_2, \ldots, X_{i-1}, X_i, \ldots, X_n) \mid X_1, X_2, \ldots, X_{i-1} ] \\ &= E[ \tau(X_1, X_2, \ldots, X_{i-1}, \widehat{X}_i, \ldots, X_n) \mid X_1, X_2, \ldots, X_{i-1} ] \\ &= E[ \tau(X_1, X_2, \ldots, X_{i-1}, \widehat{X}_i, \ldots, X_n) \mid X_1, X_2, \ldots, X_{i} ] \end{array}

where {\widehat{X}_i} is an independent copy of the random variable {X_i}. Hence

\displaystyle  Z_i - Z_{i-1} = E[ \tau(X_1, \ldots, X_{i-1}, X_i, \ldots, X_n) - \tau(X_1, \ldots, X_{i-1}, \widehat{X}_i, \ldots, X_n) \mid X_1, X_2, \ldots, X_{i} ].

Now, if we define the sets {S = \{X_1, X_2, \ldots, X_{i-1}\}} and {T = \{X_{i+1}, \ldots, X_n\}}, we get

\displaystyle  \begin{array}{rl}  |Z_i - Z_{i-1}| &= \big| E[ TSP(S \cup T \cup \{X_i\}) - TSP(S \cup T \cup \{\widehat{X}_i\}) \mid X_1, X_2, \ldots, X_{i} ] \big| \\ &\leq E[ 2( d(X_i, S\cup T) + d(\widehat{X}_i, S\cup T)) \mid X_1, X_2, \ldots, X_{i} ] \\ &\leq E_{\widehat{X}_i, T}[ 2( d(X_i, T) + d(\widehat{X}_i, T)) \mid X_{i} ]. \end{array}

where the first inequality uses Lemma 8 (along with {|E[\cdot]| \leq E[|\cdot|]}) and the second uses the fact that the minimum distance to a set only increases when the set gets smaller. But now we can invoke Lemma 7 to bound each of the terms by {O(1)/\sqrt{|T|} = O(1)/\sqrt{n-i}}. This completes the proof of Lemma 6.

3.5. Some more about Geometric TSP

For constant dimension {d>2}, one can consider the same problems in {[0,1]^d}: the expected TSP length is now {\Theta(n^{1 - 1/d})}, and using similar arguments, you can show that deviations of {\Omega(t n^{1/2 - 1/d})} have probability {\leq e^{-t^2}}.

The result we just proved was by Rhee and Talagrand, but it was not the last word on TSP concentration. Rhee and Talagrand subsequently improved this bound, showing that the TSP has subgaussian tails!

\displaystyle  \Pr[ |\tau - E\tau| \geq \lambda ] \leq ce^{-\lambda^2/O(1)} .

We’ll show a proof of this using Talagrand’s inequality, in a later lecture.

If you’re interested in this line of research, here is a survey article by Michael Steele on concentration properties of optimization problems in Euclidean space, and another one by Alan Frieze and Joe Yukich on many aspects of probabilistic TSP.

4. Citations

As mentioned in a previous post, McDiarmid and Hayward use martingales to give extremely strong concentration results for QuickSort. The book by Dubhashi and Panconesi (preliminary version here) sketches this result, and also contains many other examples and extensions of the use of martingales.

Other resources for concentration using martingales: this survey by Colin McDiarmid, or this article by Fan Chung and Linyuan Lu.

Apart from giving us powerful concentration results, martingales and “stopping times” combine to give very surprising and powerful results: see this survey by Yuval Peres at SODA 2010, or these course notes by Yuval and Eyal Lubetzky.

Lecture #18: Oblivious routing on a hypercube

In celebration of Les Valiant’s winning of the Turing award, today we discussed a classic result of his on the problem of oblivious routing on the hypercube. (Valiant, “A scheme for fast parallel communication”, SIAM J. Computing 1982, and Valiant and Brebner, “Universal schemes for parallel communication”, STOC 1981). The high-level idea is that rather than routing immediately to your final destination, you go to a random intermediate point first. The analysis is really beautiful — you define the right quantities and it just magically works out nicely. See today’s class notes. We also discussed the Butterfly and Benes networks. (See the excellent notes of Satish Rao for more on them.)
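
To see the effect of the random intermediate destination, here is a simplified simulation sketch (my own illustration, not the lecture's analysis): it only counts static congestion, i.e., how many source-destination paths use each directed edge, rather than simulating queues and delivery times, and it uses the "transpose" permutation as a standard hard instance for direct bit-fixing. The dimension and seed are arbitrary.

```python
import random
from collections import Counter

d = 14                       # hypercube dimension, N = 2^d nodes
N = 1 << d
half = d // 2

def bit_fix_path(src, dst):
    """Directed edges visited when fixing the differing bits in order 0, 1, ..., d-1."""
    cur, path = src, []
    for i in range(d):
        if (cur ^ dst) & (1 << i):
            nxt = cur ^ (1 << i)
            path.append((cur, nxt))
            cur = nxt
    return path

def transpose(x):
    """A hard permutation for direct bit-fixing: swap the low and high halves of the label."""
    return (x >> half) | ((x & ((1 << half) - 1)) << half)

# Direct, deterministic bit-fixing on the transpose permutation.
direct = Counter()
for x in range(N):
    direct.update(bit_fix_path(x, transpose(x)))

# Valiant's trick: route to a uniformly random intermediate node first, then to the destination.
random.seed(0)
two_phase = Counter()
for x in range(N):
    r = random.randrange(N)
    two_phase.update(bit_fix_path(x, r))
    two_phase.update(bit_fix_path(r, transpose(x)))

print("max edge congestion, direct bit-fixing:       ", max(direct.values()))
print("max edge congestion, via random intermediate: ", max(two_phase.values()))
```

On a typical run, the direct scheme's worst edge should carry on the order of \sqrt{N} paths, while the randomized two-phase scheme's worst edge should carry only a handful.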

At the end, we also briefly discussed Martingales: their definition, Azuma’s inequality, and McDiarmid’s inequality (which doesn’t talk about Martingales directly but is very convenient and can be proven using Azuma). The discussion at the end was extremely rushed, but the key point was: suppose you have a complicated random variable \phi you care about, like “the running time of my randomized algorithm”, where the random variable is a function of a series of random choices z_1, z_2, .... Then the sequence X_0, X_1, \ldots where X_i = E[\phi | z_1, \ldots, z_i] is a Martingale. E.g., the expected running time of quicksort given that the first i-1 pivots are z_1, \ldots, z_{i-1} is the expected value, over the possible choices of z_i, of the expected running time of quicksort given that the first i pivots are z_1, \ldots, z_i.


Lecture #17: Dimension reduction (continued)

1. An equivalent view of estimating {F_2}

Again, you have a data stream of elements {\sigma_1, \sigma_2, \ldots}, each element {\sigma_j} drawn from the universe {[D]}. This stream defines a frequency vector {\vec{x} \in {\mathbb R}^D}, where {x_i} is the number of times element {i} is seen. Consider the following algorithm for computing {F_2 := \| x \|_2^2 = \sum_i x_i^2}.

Take a (suitably random) hash function {h: [D] \rightarrow \{-1, +1\}}. Maintain a counter {C}, which starts off at zero. Every time an element {i} comes in, increment the counter: {C \rightarrow C + h(i)}. And when queried, we reply with the value {C^2}.

Hence, having seen the stream that results in the frequency vector {x \in {\mathbb Z}_{\geq 0}^D}, the counter will have the value {C = \sum_{i} x_i \, h(i)}. Does {E[C^2]} at least have the right expectation? It does:

\displaystyle   E[C^2] = E[ \sum_{i,j} h(i) h(j) x_ix_j ] = \sum_i x_i^2.

And what about the variance? Recall that {\mathrm{Var}(C^2) = E[(C^2)^2] - E[C^2]^2}, so let us calculate

\displaystyle   E[(C^2)^2] = E[ \sum_{p,q,r,s} h(p) h(q) h(r) h(s) x_px_qx_rx_s ] = \sum_p x_p^4 + 6\,\sum_{p < q} x_p^2 x_q^2.

So

\displaystyle  \text{Var}(C^2) = \sum_p x_p^4 + 6\sum_{p < q} x_p^2x_q^2 - (\sum_p x_p^2)^2 = 4 \sum_{p < q} x_p^2x_q^2 \leq 2 E[C^2]^2.

What does Chebyshev say then?

\displaystyle  \Pr[ | C^2 - E[C^2] | > \epsilon E[C^2] ] \leq \frac{\text{Var}(C^2)}{(\epsilon E[C^2])^2} \leq \frac{2}{\epsilon^2}.

Not that hot: in fact, this bound exceeds {1} whenever {\epsilon \leq \sqrt{2}}.

But suppose we keep a collection of {k} such independent counters {C_1, C_2, \ldots, C_k}, and when queried, return the average of their squares, {\overline{Y} = \frac1k \sum_i C_i^2}. The expectation {E[\overline{Y}] = F_2} remains the same, but the variance falls by a factor of {k}. And we get

\displaystyle  \Pr[ | \overline{Y} - E[\overline{Y}] | > \epsilon E[\overline{Y}] ] \leq \frac{\text{Var}(\overline{Y})}{(\epsilon E[\overline{Y}])^2} \leq \frac{2}{k \epsilon^2}.

So, our probability of error on any query is at most {\delta} if we take {k = \frac{2}{\epsilon^2 \delta}}.
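
Here is a minimal sketch of the scheme just described (illustrative: the universe size, stream, and number of counters are arbitrary, and it uses fully independent random signs for simplicity, even though 4-wise independence suffices, as discussed below).

```python
import random
from collections import Counter

random.seed(42)
D, stream_len, k = 1000, 20000, 100        # universe size, stream length, number of counters
stream = [random.randrange(D) for _ in range(stream_len)]

# k independent hash functions h_j: [D] -> {-1,+1}; fully random here for simplicity
# (4-wise independence is all the expectation/variance calculations above actually need).
signs = [[random.choice((-1, 1)) for _ in range(D)] for _ in range(k)]

# One pass over the stream, maintaining C_j = sum_i x_i h_j(i).
C = [0] * k
for item in stream:
    for j in range(k):
        C[j] += signs[j][item]

estimate = sum(c * c for c in C) / k       # average of the squared counters
true_F2 = sum(v * v for v in Counter(stream).values())
print(f"true F_2 = {true_F2},  estimate = {estimate:.0f},  relative error = {abs(estimate - true_F2) / true_F2:.3f}")
```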

1.1. Hey, those calculations look familiar{\ldots}

Sure. This is just a restatement of what we did in lecture. There we took a matrix {M} and filled it with random {\{-1, +1\}} values—hence each row of {M} corresponds to a hash function from {[D]} to {\{-1,+1\}}. And taking {k} rows in the matrix corresponds to the variance reduction step at the end.

1.2. Limited Independence

How much randomness do you need for the hash functions? Indeed, hash functions which are {4}-wise independent suffice for the above proofs to go through. And how does one get a {4}-wise independent hash function? Watch this blog (and the HWs) for details.