CMU Randomized Algorithms

Randomized Algorithms, Carnegie Mellon: Spring 2011

Category Archives: Uncategorized

Final exam

on the webpage, or here. Good luck!

Update: Remember it’s due 48 hours after you start, or Friday May 6, 11:59pm, whichever comes first.

Fixes: 1(b): “feasible solution to the LP plus the odd cycle inequalities.” Also, you only need to show the value is m(1 - O(\varepsilon)) whp.

FCEs, Final, etc.

Thanks for the great presentations yesterday, everyone!

The final will be posted on the course webpage Friday 4/29 evening at the latest, I will post something on the blog once we’ve done so. You can take it in any contiguous 48 hour period of your choice — just download it when you are ready, and hand in your solutions within 48 hours of that. Slip it under my door (preferably), or email it to me otherwise. We’ll stop accepting solutions at 11:59pm on Friday 5/6.

The course FCEs are now online: please give us your feedback!!

Lecture 21: Random walks on graphs

Today we talked about random walks on graphs, and the result that in any connected undirected graph G, for any given start vertex u, the expected time for a random walk to visit all the nodes of G (called the cover time of the graph) is at most 2m(n-1), where n is the number of vertices of G and m is the number of edges.

In the process, we proved that for any G, if we think of the walk as at any point in time being on some edge heading in some direction, then in the stationary distribution each edge/direction is equally likely, with probability 1/(2m). (Since we didn’t need it, we didn’t prove that the stationary distribution is unique; however, if G is connected, it is not hard to prove uniqueness by contradiction.) We then used that to prove that the expected gap between successive traversals of any given directed edge (u,v) is 2m. See the notes.

We also gave a few examples to show this is existentially tight. For instance, on a line (n vertices, n-1 edges) we have an expected \Omega(n^2) time to reach the other end of the line. Also on a “lollipop graph” (a clique of n/2 vertices connected to a line of n/2 vertices), the expected time to get from the clique to the end of the line is \Omega(n^3). Since this is not in the notes, here is a quick argument. First of all, if you are in the clique, then each step has 2/n probability of taking you to the node connecting the clique and the handle (let’s call this “node 0”). When you are at node 0, your next step has probability 2/n to take you to the next point on the handle (let’s call this “node 1”). So, when you are in the clique, it takes \Omega(n^2) steps to get to node 1. Now, think about the following experiment. Say you go into a fair casino with 1 dollar and bet on fair games (betting a dollar each time) until either you lose all your money or you have n/2 dollars in your pocket. Whenever you lose all your money, you go back to your car to get another dollar, and if you get n/2 dollars, you go upstairs to the restaurant. What is the expected number of trips to your car before you go to the restaurant? The answer is n/2 because it’s a fair casino so in expectation, all the money in your pocket when you head to the restaurant came from your car. Now, think of node 0 as your car, node 1 as entering the casino, and the end of the lollipop stick (node n/2) as the restaurant.
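
To see the two growth rates side by side, here is a minimal simulation sketch (mine, not from the notes) estimating the expected time to walk from one end of a path to the other, and from inside the clique of a lollipop graph to the end of its handle. The graph encodings and trial counts below are arbitrary choices.

```python
import random

def walk_time(adj, start, target, trials=30):
    """Average number of steps for a random walk to first hit `target` from `start`."""
    total = 0
    for _ in range(trials):
        v, steps = start, 0
        while v != target:
            v = random.choice(adj[v])
            steps += 1
        total += steps
    return total / trials

def path_graph(n):
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def lollipop_graph(n):
    """Clique on vertices 0..n//2-1, with a path hanging off vertex n//2-1 out to vertex n-1."""
    k = n // 2
    adj = {i: [j for j in range(k) if j != i] for i in range(k)}   # the clique
    for i in range(k, n):                                          # the handle
        adj.setdefault(i, [])
        adj[i - 1].append(i)
        adj[i].append(i - 1)
    return adj

for n in (16, 32, 48):
    print(n,
          round(walk_time(path_graph(n), 0, n - 1)),        # grows roughly like n^2
          round(walk_time(lollipop_graph(n), 0, n - 1)))    # grows roughly like n^3
```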

We ended our discussion by talking about resistive networks, and using the connection to give another proof of the cover time of a graph. In particular, we have C_{uv} = 2m R_{uv} where C_{uv} is the commute-time between u and v, and R_{uv} is the effective resistance between u and v.

Lecture #20: Rademacher bounds

In this lecture we talked about Rademacher bounds in machine learning. These are never worse and sometimes can be quite a bit tighter than VC-dimension bounds. Rademacher bounds say that to bound how much you are overfitting (the gap between your error on the training set and your true error on the distribution), you can do the following. See how much you would overfit on random labels (how much better than 50% is the empirical error of the best function in your class when you give random labels to your dataset) and then double that quantity (and add a low-order term). See the notes.
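
As a toy illustration of this recipe (my own example, not from the lecture): for a small finite class of threshold functions on the line, estimate the expected best correlation of the class with random {\pm 1} labels—this correlation is exactly twice the “how much better than 50%” quantity above.

```python
import random

def empirical_rademacher(points, hypotheses, trials=2000):
    """Estimate E_sigma[ max_h (1/m) sum_i sigma_i * h(x_i) ] for a finite class of +/-1 functions."""
    m = len(points)
    preds = [[h(x) for x in points] for h in hypotheses]   # precompute +/-1 predictions
    total = 0.0
    for _ in range(trials):
        sigma = [random.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * p for s, p in zip(sigma, row)) / m for row in preds)
    return total / trials

# Toy class: threshold functions h_t(x) = +1 if x >= t else -1, over a grid of thresholds t.
points = [random.random() for _ in range(100)]
hypotheses = [lambda x, t=t: 1 if x >= t else -1 for t in [i / 20 for i in range(21)]]

print(empirical_rademacher(points, hypotheses))   # shrinks as the sample size m grows
```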

HW #5 Open Thread

HW5 has been posted; it’s due on Monday April 4th.

Update: for Exercise #2, the expression E_{T \gets \mathcal{D}} [\max_{v \in V} d(v, F_T)] is indeed correct. (It is d and not d_T —you want to show that the solution F_T found using the tree is lousy when used on the original metric.)

A simpler problem, if you’re stuck, is the furthest pair problem. Here, you are given a metric and want to output a pair of points whose distance is the largest. A natural (yet lousy) algorithm would be: embed the metric into a random tree while maintaining distances in expectation, find a furthest pair in the tree, and output this pair. Show an example where this algorithm sucks.

Lecture #19: Martingales

1. Some definitions

Recall that a sequence of r.v.s {Z_0, Z_1, Z_2, \ldots} (denoted by {(Z_i)}) is a martingale if each {Z_i} satisfies {E[|Z_i|] < \infty}, and

\displaystyle  E[Z_i \mid Z_0,...,Z_{i-1}] = Z_{i-1}.

Somewhat more generally, given a sequence {(X_i)} of random variables, a martingale with respect to {(X_i)} is another sequence of r.v.s {Z_0, Z_1, Z_2, \ldots} (denoted by {(Z_i)}) such that each {Z_i} satisfies

  • {E[|Z_i|] < \infty},
  • there exist functions {g_i} such that {Z_i = g_i(X_1, X_2, \ldots, X_i)}, and
  • {E[Z_i \mid X_1, \ldots ,X_{i-1}] = Z_{i-1}}.

One can define things even more generally, but for the purposes of this course, let’s just proceed with this. (If you’d like more details, check out, say, the books by Grimmett and Stirzaker, or Durrett, or many others.)

1.1. The Azuma-Hoeffding Inequality

Theorem 1 (Azuma-Hoeffding) If {(Z_i)} is a martingale such that for each {i}, {|Z_i - Z_{i-1}| \leq c_i}, then

\displaystyle  \Pr[|Z_n - Z_0| \geq \lambda] \leq 2\exp\left\{-\frac{\lambda^2}{2 \sum_i c_i^2} \right\}.

(Apparently Bernstein had essentially figured this one out as well, in addition to the Chernoff-Hoeffding bounds, back in 1937.) The proof of this bound can be found in most texts; we’ll skip it here. BTW, if you just want the upper or lower tail, replace {2e^{blah}} by {e^{blah}} on the right hand side.
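
As a quick sanity check (my own, not from the text): take the simplest martingale, {Z_i = X_1 + \cdots + X_i} with independent uniform {\pm 1} steps (so {c_i = 1}), and compare the empirical tail of {|Z_n - Z_0|} with the Azuma-Hoeffding bound. The parameters below are arbitrary.

```python
import math, random

def azuma_check(n=400, lam=40.0, trials=10000):
    # Martingale: Z_i = X_1 + ... + X_i with X_j uniform in {-1,+1}, so |Z_i - Z_{i-1}| <= 1.
    hits = 0
    for _ in range(trials):
        z = sum(random.choice((-1, 1)) for _ in range(n))
        if abs(z) >= lam:
            hits += 1
    empirical = hits / trials
    bound = 2 * math.exp(-lam ** 2 / (2 * n))   # 2 exp(-lambda^2 / (2 sum_i c_i^2))
    print(f"empirical tail = {empirical:.4f}, Azuma bound = {bound:.4f}")

azuma_check()
```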

2. The Doob Martingale

Most often, the case we will be concerned with is where our entire space is defined by a sequence of random variables {X_1, X_2, \ldots, X_n}, where each {X_i} takes values in the set {\Omega}. Moreover, we will be interested in some bounded function {f: \Omega^n \rightarrow {\mathbb R}}, and will want to understand how {f(X_1, X_2, \ldots, X_n)} behaves, when {(X_i)} is drawn from the underlying distribution. (Very often these {X_i}‘s will be drawn from a “product distribution”—i.e., they will be independent of each other, but they need not be.) Specifically, we ask:

How concentrated is {f} around its mean {E[f] := E_{X_1, X_2, \ldots, X_n}[f(X_1, X_2, \ldots, X_n)]}?

To this end, define for every {i \in \{0,1, \ldots, n\}}, the random variable

\displaystyle  Z_i := E[ f(X) \mid X_1, X_2, \ldots,X_i ].

(At this point, it is useful to remember the definition of a random variable as a function from the sample space to the reals: so this r.v. {Z_i} is also such a function, obtained by taking averages of {f} over parts of the sample space.)

How does the random variable {Z_0} behave? It’s just the constant {E[f]}: the expected value of the function {f} given random settings for {X_1} through {X_n}. What about {Z_1}? It is a function that depends only on its first variable, namely {Z_1(x_1) = E_{X_2, \ldots, X_n}[f(x_1, X_2, \ldots, X_n)]}—instead of averaging {f} over the entire sample space, we partition {\Omega} according to value of the first variable, and average over each part in the partition. And {Z_2} is a function of {x_1, x_2}, averages over the other variables. And so on to {Z_n}, which is the same as the function {f}. So, as we go from {0} to {n}, the random variables {Z_i} go from the constant function {E[f]} to the function {f}.

Of course, we’re defining this for a reason: {(Z_i)} is a martingale with respect to {(X_i)}.

Lemma 2 For a bounded function {f}, the sequence {(Z_i)_{i = 0}^n} is a martingale with respect to {(X_i)}. (It’s called the Doob martingale for {f}.)

Proof: The first two properties of {(Z_i)} being a martingale with respect to {(X_i)} follow from {f} being bounded, and the definition of {Z_i} itself. For the last property,

\displaystyle  \begin{array}{rl}  E[Z_i \mid X_1, \ldots X_{i-1}] &= E[\, E[f \mid X_1, X_2, \ldots, X_i] \mid X_1, \ldots X_{i-1} ] \\ &= E[ f \mid X_1, \ldots X_{i-1} ] = Z_{i-1}. \end{array}

The first equality is the definition of {Z_i}, the second follows from the fact that {E[ U \mid V ] = E[ E[ U \mid V, W ] \mid V ]} for random variables {U, V, W}, and the last from the definition of {Z_{i-1}}. \Box

Assuming that {f} is bounded is not necessary; one can work with weaker assumptions—see the texts for more details.

Before we continue on this thread, let us show some Doob martingales which arise in CS/Math-y applications.

  1. Throw {m} balls into {n} bins, and let {f} be some function of the loads: the number of empty bins, the max load, the load of the second-most-loaded bin, or some similar function. Let {\Omega = [n]}, and {X_i} be the index of the bin into which ball {i} lands. For {Z_i = E[ f \mid X_1, \ldots, X_i ]}, {(Z_i)} is a martingale with respect to {(X_i)}. (For the empty-bins case, see the sketch after this list.)
  2. Consider the random graph {G_{n,p}}: {n} vertices, each of the {\binom{n}{2}} edges chosen independently with probability {p}. Let {\chi} be the chromatic number of the graph, the minimum number of colors to properly color the graph. There are two natural Doob martingales associated with this, depending on how we choose the variables {X_i}.

    In the first one, let {X_i} indicate whether the {i^{th}} potential edge is present, which gives us a martingale sequence of length {\binom{n}{2}}. This is called the edge-exposure martingale. For the second one, let {X_i} be the collection of edges going from the vertex {i} to vertices {1, 2, \ldots, i-1}: the new martingale has length {n} and is called the vertex-exposure martingale.

  3. Consider a run of quicksort on a particular input: let {Q} be the number of comparisons. Let {X_1} be the first pivot, {X_2} the second, etc. Then {Z_i = E[ Q \mid X_1, \ldots, X_i ]} is a Doob martingale with respect to {(X_i)}.

    BTW, are these {X_i}‘s independent of each other? Naively, they might depend on the size of the current set, which makes it dependent on the past. One way you can make these independent is by letting these {X_i}‘s be, say, random independent permutations on all {n} elements, and when you want to choose the {i^{th}} pivot, pick the first element from the current set according to the permutation {X_i}. (Or, you could let {X_i} be a random independent real in {[0,1]} and use that to pick a random element from the current set, etc.)

  4. Suppose we have {r} red and {b} blue balls in a bin. We draw {n} balls without replacement from this bin: what is the number of red balls drawn? Let {X_i} be the indicator for whether the {i^{th}} ball is red, and let {f = X_1 + X_2 + \ldots + X_n} be the number of red balls. Then {Z_i = E[ f \mid X_1, \ldots, X_i ]} is a martingale with respect to {(X_i)}.

    However, in this example, the {X_i}‘s are not independent. Nonetheless, the sequence is a Doob martingale. (As in the quicksort example, one can define it with respect to a different set of variables which are independent of each other.)

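To make example 1 concrete (as promised in that item), the Doob martingale for the number of empty bins can be written in closed form: given the first {i} ball placements, a bin that has already been hit contributes {0}, and each untouched bin ends up empty with probability {(1-1/n)^{m-i}}. Here is a minimal sketch (my own, not from the notes) computing one sample path of the {Z_i}’s:

```python
import random

def doob_empty_bins(m, n):
    """One sample path of the Doob martingale Z_i = E[#empty bins | first i ball placements]."""
    untouched = n
    seen = set()
    zs = [n * (1 - 1 / n) ** m]           # Z_0 = E[f]
    for i in range(1, m + 1):
        ball = random.randrange(n)        # X_i: the bin of the i-th ball
        if ball not in seen:
            seen.add(ball)
            untouched -= 1
        # untouched bins end up empty with prob (1 - 1/n)^(m - i); hit bins never are
        zs.append(untouched * (1 - 1 / n) ** (m - i))
    return zs

zs = doob_empty_bins(m=50, n=30)
print(zs[0], zs[-1])   # Z_0 = E[f] = 30*(29/30)^50, and Z_m = actual number of empty bins
# Martingale check: with u untouched bins after i-1 balls,
# E[Z_i | past] = [(u/n)*(u-1) + (1 - u/n)*u] * (1 - 1/n)^(m-i) = u*(1 - 1/n)^(m-i+1) = Z_{i-1}.
```
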
So yeah, if we want to study the concentration of {f} around {E[f]}, we can now apply Azuma-Hoeffding to the Doob martingale, which gives us the concentration of {Z_n} (i.e., {f}) around {Z_0} (i.e., {E[f]}). Good, good.

Next step: to apply Azuma-Hoeffding to the Doob martingale {(Z_i)}, we need to bound {|Z_i - Z_{i-1}|} for all {i}. This just says that if we can go from {f} to {Ef} in a “small” number of steps ({n}), and each time we’re not smoothing out “too aggressively” ({|Z_i - Z_{i-1}| \leq c_i}), then {f} is concentrated about its mean.

2.1. Independence and Lipschitz-ness

One case when it’s easy to bound the {|Z_i - Z_{i-1}|}‘s is when the {X_i}‘s are independent of each other, and also {f} is not too sensitive in any coordinate—namely, changing any coordinate does not change the value of {f} by much. Let’s see this in detail.

Definition 3 Given values {(c_i)_{i =1}^n}, the function {f} is \underline{{(c_i)}-Lipschitz} if for every {x_1, \ldots, x_n \in \Omega}, every {i \in [n]}, and every {x_i' \in \Omega}, it holds that

\displaystyle  |f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n ) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n ) | \leq c_i.

If {c_i = c} for all {i}, then we just say {f} is {c}-Lipschitz.

Lemma 4 If {f} is {(c_i)}-Lipschitz and {X_1, X_2, \ldots, X_n} are independent, then the Doob martingale of {f} with respect to {(X_i)} satisfies

\displaystyle  |Z_i - Z_{i-1}| \leq c_i.

Proof: Let us use {X_{(i:j)}} to denote the sequence {X_i, \ldots, X_j}, etc. Recall that

\displaystyle  \begin{array}{rl}  Z_i &= E[ f \mid X_{(1:i)} ] = \sum_{a_{i+1}, \ldots, a_n} f(X_{(1:i)}, a_{(i+1:n)}) \Pr[ X_{(i+1:n)} = a_{(i+1:n)} \mid X_{(1:i)} ] \\ &= \sum_{a_{i+1}, \ldots, a_n} f(X_{(1:i)}, a_{(i+1:n)}) \Pr[ X_{(i+1:n)} = a_{(i+1:n)} ] \end{array}

where the last equality is from independence. Similarly for {Z_{i-1}}. Hence

\displaystyle  \begin{array}{rl}  | Z_i - Z_{i-1} | &= \sum_{a_{i+1}, \ldots, a_n} \bigg| f(X_{(1:i)}, a_{(i+1:n)}) - \sum_{a_i'} \Pr[X_i = a_i'] f(X_{(1:i-1)}, a_i', a_{(i+1:n)}) \bigg| \cdot \Pr[ X_{(i+1:n)} = a_{(i+1:n)} ] \\ &\le \sum_{a_{i+1}, \ldots, a_n} c_i \cdot \Pr[ X_{(i+1:n)} = a_{(i+1:n)} ] = c_i. \end{array}

where the inequality is from the fact that changing the {i^{th}} coordinate from {a_i} to {a_i'} cannot change the function value by more than {c_i}, and that {\sum_{a_i'} \Pr[X_i = a_i'] = 1}. \Box

Now applying Azuma-Hoeffding, we immediately get:

Corollary 5 (McDiarmid’s Inequality) If {f} is {(c_i)}-Lipschitz and {X_1, X_2, \ldots, X_n} are independent, then

\displaystyle  \begin{array}{rl}  \Pr[ f - E[f] \geq \lambda ] &\leq \exp\left( - \frac{ \lambda^2 }{2 \sum_i c_i^2 } \right), \\ \Pr[ E[f] - f \geq \lambda ] &\leq \exp\left( - \frac{ \lambda^2 }{2 \sum_i c_i^2 } \right). \end{array}

(Disclosure: I am cheating. McDiarmid’s inequality has better constants: the constant {2} in the denominator moves to the numerator.) And armed with this inequality, we can give concentration results for some applications we mentioned above.

  1. For the {m} balls and {n} bins example, say {f} is the number of empty bins: hence {Ef = n(1-1/n)^m \approx n\,e^{-m/n}}. Also, changing the location of the {i^{th}} ball changes {f} by at most {1}. So {f} is {1}-Lipschitz, and hence

    \displaystyle  \Pr [ | f - Ef | \geq \lambda ] \leq 2 \exp\left( - \frac{\lambda^2}{2m} \right).

    Hence, whp, {f \approx n\,e^{-m/n} \pm O(\sqrt{m \log n})}.

  2. For the case where {\chi} is the chromatic number of a random graph {G_{n,p}}, and we define the edge-exposure martingale {Z_i = E[ \chi \mid E_1, E_2, \ldots, E_i ]}, clearly {\chi} is {1}-Lipschitz. Hence

    \displaystyle  \Pr [ | \chi - E\chi | \geq \lambda ] \leq 2 \exp\left( - \frac{\lambda^2}{2\binom{n}{2}} \right)

    This is not very interesting, since the right hand side is {< 1} only when {\lambda \approx n}—but the chromatic number itself lies in {[1,n]}, so we get almost no concentration at all.

    Instead, we could use a vertex-exposure martingale, where at the {i^{th}} step we expose the vertex {i} and its edges going to vertices {1, 2, \ldots, i-1}. Even with respect to these variables, the function {\chi} is {1}-Lipschitz, and hence

    \displaystyle  \Pr [ | \chi - E\chi | \geq \lambda ] \leq 2 \exp\left( - \frac{\lambda^2}{2n} \right)

    And hence the chromatic number of the random graph {G_{n,p}} is concentrated to within {\approx \sqrt{n}} around its mean.

3. Concentration for Random Geometric TSP

McDiarmid’s inequality is convenient to use, but Lipschitz-ness often does not get us as far as we’d like (even with independence). Sometimes you need to bound {|Z_i - Z_{i-1}|} directly to get the full power of Azuma-Hoeffding. Here’s one example:

Let {X_1, X_2, \ldots, X_n} be {n} points picked independently and uniformly at random from the unit square {[0,1]^2}. Let {\tau: ([0,1]^2)^n \rightarrow {\mathbb R}} be the length of the shortest traveling salesman tour on {n} points. How closely is {\tau} concentrated around its mean {E[\tau(X_1, X_2, \ldots, X_n)]}?

In the HW, you will show that {E\tau = \Theta(n^{1/2})}; in fact, one can pin down {E\tau} up to the leading constant. (See the work of Rhee and others.)

3.1. Using McDiarmid: a weak first bound

Note that {\tau} is {2\sqrt{2}}-Lipschitz. By Corollary 5 we get that

\displaystyle  \Pr[ |\tau - E\tau| \geq \lambda ] \leq 2 \exp( -\frac{\lambda^2}{16n}).

If we want the deviation probability to be {1/poly(n)}, we would have to set {\lambda = \Omega(\sqrt{n \log n})}. Not so great, since this is pretty large compared to the expectation itself—we’d like a tighter bound.

3.2. So let’s be more careful: an improved bound

And in fact, we’ll get a better bound using the very same Doob martingale {(Z_i)} associated with {\tau}:

\displaystyle  Z_i = E[ \tau(X_1, X_2, \ldots, X_n) \mid X_1, X_2, \ldots, X_i ].

But instead of just using the {O(1)}-Lipschitzness of {\tau}, let us bound {|Z_i - Z_{i-1}|} better.

Lemma 6

\displaystyle  |Z_i - Z_{i-1}| \leq \min\left\{ 2 \sqrt{2}, \frac{O(1)}{\sqrt{n-i}} \right\}.

Before we prove this lemma, let us complete the concentration bound for TSP using this. Setting {c_i = O(1/\sqrt{n-i})} gives us {\sum_i c_i^2 = O(\log n)}, and hence Azuma-Hoeffding gives:

\displaystyle  \Pr[ |\tau - E\tau| \geq \lambda ] \leq 2 \exp\left( -\frac{\lambda^2}{2\sum_{i} c_i^2}\right) \leq 2 \exp\left( - \frac{\lambda^2}{O(\log n)} \right).

So

\displaystyle  \Pr[ | \tau - E\tau | \leq O(\log n) ] \geq 1 - 1/poly(n).

Much better!

3.3. Some useful lemmas

To prove Lemma 6, we’ll need a simple geometric lemma:

Lemma 7 Let {x \in [0,1]^2}, and let {A} be a set of {k} points picked independently and uniformly at random from {[0,1]^2}. Then the expected distance from {x} to the closest point in {A} is {O(1/\sqrt{k})}.

Proof: Define the random variable {W = d(x,A)}. Note that {W \geq r} exactly when {B(x,r) \cap A = \emptyset}. For {r \in [0,\sqrt{2}]}, the area of {B(x, r) \cap [0,1]^2} is at least {c_0 r^2} for some constant {c_0}.

Define {r_0 = 1/\sqrt{c_0 k}}. For any {r = \lambda r_0 \in [0,\sqrt{2}]}, the chance that all {k} points miss the ball {B(x,r)}, and hence {\Pr[ W \geq \lambda r_0 ]}, is at most

\displaystyle  (1 - c_0 r^2)^k = (1 - \lambda^2/k)^k \leq e^{-\lambda^2}.

Of course, for {r > \sqrt{2}}, {\Pr[W \geq r ] = 0}. And hence

\displaystyle  E[W] = \int_{r \geq 0} \Pr[ W \geq r ] dr = \sum_{\lambda \in {\mathbb Z}_{\geq 0}} \int_{r \in [\lambda r_0, (\lambda+1)r_0]} \Pr[ W \geq r ] dr \leq \sum_{\lambda \in {\mathbb Z}_{\geq 0}} (\lambda+1)r_0 \cdot e^{-\lambda^2} \leq O(r_0).

\Box
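
Here is a quick Monte Carlo check of Lemma 7 (my own sketch, not from the notes): the expected distance from a fixed point to the nearest of {k} uniform points in the unit square, scaled by {\sqrt{k}}, should stay roughly constant as {k} grows.

```python
import math, random

def expected_nearest_distance(k, trials=2000):
    """Monte Carlo estimate of E[d(x, A)] for x = (0.5, 0.5) and A = k uniform points in [0,1]^2."""
    x = (0.5, 0.5)
    total = 0.0
    for _ in range(trials):
        total += min(math.hypot(x[0] - random.random(), x[1] - random.random())
                     for _ in range(k))
    return total / trials

for k in (10, 40, 160, 640):
    print(k, expected_nearest_distance(k) * math.sqrt(k))   # roughly constant in k
```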

Secondly, here is another lemma about how the TSP behaves:

Lemma 8 For any set of {n-1} points, {A = \{x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n\}}, we get

\displaystyle  |\tau(A + x_i) - \tau(A + x_i')| \leq 2(d(x_i, A) + d(x_i', A)).

Proof: Follows from the fact that {\tau(A + x) \in [\tau(A), \tau(A) + 2d(x, A)]}, for any {x}. \Box

3.4. Proving Lemma 6

OK, now to the proof of Lemma 6. Recall that we want to bound {|Z_i - Z_{i-1}|}; since {\tau} is {2\sqrt{2}}-Lipschitz, we get {|Z_i - Z_{i-1}| \leq 2\sqrt{2}} immediately. For the second bound of {O(1/\sqrt{n - i})}, note that

\displaystyle  \begin{array}{rl}  Z_{i-1} &= E[ \tau(X_1, X_2, \ldots, X_{i-1}, X_i, \ldots, X_n) \mid X_1, X_2, \ldots, X_{i-1} ] \\ &= E[ \tau(X_1, X_2, \ldots, X_{i-1}, \widehat{X}_i, \ldots, X_n) \mid X_1, X_2, \ldots, X_{i-1} ] \\ &= E[ \tau(X_1, X_2, \ldots, X_{i-1}, \widehat{X}_i, \ldots, X_n) \mid X_1, X_2, \ldots, X_{i} ] \end{array}

where {\widehat{X}_i} is an independent copy of the random variable {X_i}. Hence

\displaystyle  |Z_i - Z_{i-1}| = E[ \tau(X_1, \ldots, X_{i-1}, X_i, \ldots, X_n) - \tau(X_1, \ldots, X_{i-1}, \widehat{X}_i, \ldots, X_n) \mid X_1, X_2, \ldots, X_{i} ].

Now, if we define the sets {S = \{X_1, X_2, \ldots, X_{i-1}\}} and {T = \{X_{i+1}, \ldots, X_n\}}, we get

\displaystyle  \begin{array}{rl}  |Z_i - Z_{i-1}| &= E[ \tau(S \cup T \cup \{X_i\}) - \tau(S \cup T \cup \{\widehat{X}_i\}) \mid X_1, X_2, \ldots, X_{i} ] \\ &\leq E[ 2( d(X_i, S\cup T) + d(\widehat{X}_i, S\cup T)) \mid X_1, X_2, \ldots, X_{i} ] \\ &\leq E_{\widehat{X}_i, T}[ 2( d(X_i, T) + d(\widehat{X}_i, T)) \mid X_{i} ]. \end{array}

where the first inequality uses Lemma 8 and the second uses the fact that the minimum distance to a set only increases when the set gets smaller. But now we can invoke Lemma 7 to bound each of the terms by {O(1)/\sqrt{|T|} = O(1)/\sqrt{n-i}}. This completes the proof of Lemma 6.

3.5. Some more about Geometric TSP

For constant dimension {d>2}, one can consider the same problems in {[0,1]^d}: the expected TSP length is now {\Theta(n^{1 - 1/d})}, and using similar arguments, you can show that deviations of {\Omega(t n^{1/2 - 1/d})} have probability {\leq e^{-t^2}}.

The result we just proved is due to Rhee and Talagrand, but it was not the last word on TSP concentration: Rhee and Talagrand subsequently improved this bound to show that the TSP has subgaussian tails!

\displaystyle  \Pr[ |\tau - E\tau| \geq \lambda ] \leq ce^{-\lambda^2/O(1)} .

We’ll show a proof of this using Talagrand’s inequality, in a later lecture.

If you’re interested in this line of research, here is a survey article by Michael Steele on concentration properties of optimization problems in Euclidean space, and another one by Alan Frieze and Joe Yukich on many aspects of probabilistic TSP.

4. Citations

As mentioned in a previous post, McDiarmid and Hayward use martingales to give extremely strong concentration results for QuickSort. The book by Dubhashi and Panconesi (preliminary version here) sketches this result, and also contains many other examples and extensions of the use of martingales.

Other resources for concentration using martingales: this survey by Colin McDiarmid, or this article by Fan Chung and Linyuan Lu.

Apart from giving us powerful concentration results, martingales and “stopping times” combine to give very surprising and powerful results: see this survey by Yuval Peres at SODA 2010, or these course notes by Yuval and Eyal Lubetzky.

Lecture #17: Dimension reduction (continued)

1. An equivalent view of estimating {F_2}

Again, you have a data stream of elements {\sigma_1, \sigma_2, \ldots}, each element {\sigma_j} drawn from the universe {[D]}. This stream defines a frequency vector {\vec{x} \in {\mathbb R}^D}, where {x_i} is the number of times element {i} is seen. Consider the following algorithm for computing {F_2 := \| \vec{x} \|_2^2 = \sum_i x_i^2}.

Take a (suitably random) hash function {h: [D] \rightarrow \{-1, +1\}}. Maintain counter {C}, which starts off at zero. Every time an element {i} comes in, increment the counter {C \rightarrow C + h(i)}. And when queried, we reply with the value {C^2}.

Hence, having seen the stream that results in the frequency vector {x \in {\mathbb Z}_{\geq 0}^D}, the counter will have the value {C = \sum_{i} x_i \, h(i)}. Does {C^2} at least have the right expectation? It does:

\displaystyle   E[C^2] = E[ \sum_{i,j} h(i) h(j) x_ix_j ] = \sum_i x_i^2.

And what about the variance? Recall that {\mathrm{Var}(C^2) = E[(C^2)^2] - E[C^2]^2}, so let us calculate

\displaystyle   E[(C^2)^2] = E[ \sum_{p,q,r,s} h(p) h(q) h(r) h(s) x_px_qx_rx_s ] = \sum_p x_p^4 + 6\,\sum_{p < q} x_p^2 x_q^2.

So

\displaystyle  \text{Var}(C^2) = \sum_p x_p^4 + 6\sum_{p < q} x_p^2x_q^2 - (\sum_p x_p^2)^2 = 4 \sum_{p < q} x_p^2x_q^2 \leq 2 E[C^2]^2.

What does Chebyshev say then?

\displaystyle  \Pr[ | C^2 - E[C^2] | > \epsilon E[C^2] ] \leq \frac{\text{Var}(C^2)}{(\epsilon E[C^2])^2} \leq \frac{2}{\epsilon^2}.

Not that hot: in fact, this is usually more than {1}.

But suppose we take a collection of {k} such independent counters {C_1, C_2, \ldots, C_k}, and given a query, return the average of their squares {\overline{C^2} := \frac1k \sum_i C_i^2}. The expectation remains the same, but the variance falls by a factor of {k}, and we get

\displaystyle  \Pr[ | \overline{C^2} - E[\overline{C^2}] | > \epsilon E[\overline{C^2}] ] \leq \frac{\text{Var}(\overline{C^2})}{(\epsilon E[\overline{C^2}])^2} \leq \frac{2}{k \epsilon^2}.

So, our probability of error on any query is at most {\delta} if we take {k = \frac{2}{\epsilon^2 \delta}}.

1.1. Hey, those calculations look familiar{\ldots}

Sure. This is just a restatement of what we did in lecture. There we took a matrix {M} and filled it with random {\{-1, +1\}} values—hence each row of {M} corresponds to a hash function from {[D]} to {\{-1,+1\}}. And taking {k} rows in the matrix corresponds to the variance reduction step at the end.

1.2. Limited Independence

How much randomness do you need for the hash functions? Indeed, hash functions which are {4}-wise independent suffice for the above proofs to go through. And how does one get a {4}-wise independent hash function? Watch this blog (and the HWs) for details.
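
One standard construction (sketched here as an aside; the specific choices below, such as the Mersenne prime and mapping a field element to a sign via its low-order bit, are my own, and the sign map carries a negligible {O(1/p)} bias): evaluate a random degree-{3} polynomial over a prime field {\mathbb F_p} with {p > D}; its values at any {4} distinct points are independent and uniform.

```python
import random

P = (1 << 31) - 1   # Mersenne prime, assumed larger than the universe size D

class FourwiseSignHash:
    """h: [D] -> {-1,+1} from a 4-wise independent family (random cubic polynomial mod P)."""
    def __init__(self):
        self.coeffs = [random.randrange(P) for _ in range(4)]   # a0 + a1 x + a2 x^2 + a3 x^3

    def __call__(self, x):
        v = 0
        for a in reversed(self.coeffs):    # Horner's rule mod P
            v = (v * x + a) % P
        return 1 if v & 1 else -1          # low-order bit -> sign (bias only O(1/P))

h = FourwiseSignHash()
print([h(i) for i in range(10)])
```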

Lecture #17: Dimension Reduction

Today we’ll talk about dimensionality reduction, and some related topics in data streaming.

1. Dimension Reduction

Suppose we are given a set of {n} points {\{x_1, x_2, \ldots, x_n\}} in {{\mathbb R}^D}. How small can we make {D} and still maintain the Euclidean distances between the points? Clearly, we can always make {D = n-1}, since any set of {n} points lies in an {(n-1)}-dimensional affine subspace. And this is (existentially) tight: e.g., the case when {x_2 - x_1, x_3 - x_1, \ldots, x_n - x_1} are all orthogonal vectors.

But what if we were OK with the distances being approximately preserved? In HW#3, you saw that while there could only be {D} orthogonal unit vectors in {{\mathbb R}^D}, there could be as many as {\exp(c \varepsilon^2 D)} unit vectors which are {\varepsilon}-orthogonal—i.e., whose mutual inner products all lie in {[-\varepsilon, \varepsilon]}. Near-orthogonality allows us to pack exponentially more vectors!

Put another way, note that

\displaystyle  \| \vec{a} - \vec{b} \|_2^2 = \langle \vec{a} - \vec{b}, \vec{a} - \vec{b} \rangle = \langle \vec{a}, \vec{a} \rangle + \langle \vec{b}, \vec{b} \rangle - 2\langle \vec{a}, \vec{b} \rangle = \| \vec{a} \|_2^2 + \| \vec{b} \|_2^2 - 2\langle \vec{a}, \vec{b} \rangle.

And hence the squared Euclidean distance between any pair of the points defined by these {\varepsilon}-orthogonal vectors falls in {2(1 \pm \varepsilon)}. So, if we wanted {n} points exactly at unit (Euclidean) distance from each other, we would need {n-1} dimensions. (Think of a triangle in {2}-dims.) But if we wanted to pack in {n} points which were at distance {(1 \pm \varepsilon)} from each other, we could pack them into

\displaystyle  O\left(\frac{\log n}{\varepsilon^2}\right)

dimensions.

1.1. The Johnson Lindenstrauss lemma

The Johnson Lindenstrauss “flattening” lemma says that such a claim is true not just for equidistant points, but for any set of {n} points in Euclidean space:

Lemma 1 Let {\varepsilon \in (0, 1/2)}. Given any set of points {X = \{x_1, x_2, \ldots, x_n\}} in {{\mathbb R}^D}, there exists a map {A: {\mathbb R}^D \rightarrow {\mathbb R}^k} with {k = O(\varepsilon^{-2} \log n)} such that

\displaystyle  1 - \varepsilon \leq \frac{\| A(x_i) - A(x_j) \|_2^2 }{ \| x_i - x_j \|_2^2} \leq 1 + \varepsilon.

Note that the target dimension {k} is independent of the original dimension {D}, and depends only on the number of points {n} and the accuracy parameter {\varepsilon}.

This lemma is tight up to the constant term: it is easy to see that we need at least {\Omega(\frac1\varepsilon \log n)} using a packing argument. Noga Alon showed a lower bound of {\Omega(\frac{\log n}{\varepsilon^2 \log 1/\varepsilon})}.

1.2. The construction

The JL lemma is pretty surprising, but the construction of the map is perhaps even more surprising: it is a super-simple random construction. Let {M} be a {k \times D} matrix, such that every entry of {M} is filled with an i.i.d. draw from a standard normal {N(0, 1)} distribution (a.k.a. the “Gaussian” distribution). For {x \in {\mathbb R}^D}, define

\displaystyle  A(x) = \frac{1}{\sqrt{k}} Mx.

That’s it. You hit the vector {x} with a Gaussian matrix {M}, and scale it down by {\sqrt{k}}. That’s the map {A}. Note that it is a linear map: {A(x) + A(y) = A(x+y)}. So suppose we could show the following lemma:

Lemma 2 Let {\varepsilon \in (0, 1/2)}. If {A} is constructed as above with {k = c \varepsilon^{-2} \log \delta^{-1}}, and {x \in {\mathbb R}^D} is a unit vector, then

\displaystyle  \Pr[ \| A(x) \|_2^2 \in 1 \pm \varepsilon ] \geq 1 - \delta.

Then we’d get a proof of Lemma 1. Indeed, set {\delta = 1/n^2}, and hence {k = O(\varepsilon^{-2} \log n)}. Now for each {x_i, x_j \in X} we get that the squared length of {x_i - x_j} is maintained to within {1 \pm \varepsilon} with probability at least {1 - 1/n^2}. By a union bound, all {\binom{n}{2}} pairs of distances in {\binom{X}{2}} are maintained with probability at least {1 - \binom{n}{2}\frac1{n^2} \geq 1/2}. This proves Lemma 1.
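
Here is a small numerical illustration of the construction (my own sketch, not part of the original post): project a few random points with a Gaussian matrix and look at the worst pairwise distortion. The dimensions and the constant {8} in the choice of {k} are arbitrary.

```python
import numpy as np

def jl_map(points, k, seed=0):
    """A(x) = (1/sqrt(k)) M x, with M a k x D matrix of i.i.d. N(0,1) entries."""
    D = points.shape[1]
    M = np.random.default_rng(seed).standard_normal((k, D))
    return points @ M.T / np.sqrt(k)

n, D, eps = 60, 2000, 0.25
X = np.random.default_rng(1).standard_normal((n, D))   # n arbitrary points in R^D
Y = jl_map(X, k=int(8 * np.log(n) / eps**2))

# worst-case ratio of squared pairwise distances, after vs before the projection
i, j = np.triu_indices(n, 1)
ratios = np.sum((Y[i] - Y[j])**2, axis=1) / np.sum((X[i] - X[j])**2, axis=1)
print(ratios.min(), ratios.max())   # should lie roughly within [1 - eps, 1 + eps]
```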

A few comments about this construction:

  • The above proof not only shows the existence of a good map; it also shows that a random map as above works with constant probability! In other words, it is a Monte-Carlo randomized algorithm for dimension reduction. (Since we can efficiently check that the distances are preserved to within the prescribed bounds, we can convert this into a Las Vegas algorithm.)
  • The algorithm (at least the Monte Carlo version) does not even look at the set of points {X}: it works for any set {X} with high probability. Hence, we can pick this map {A} before the points in {X} arrive.

  • Given a set {X \subseteq {\mathbb R}^D}, one can get deterministic poly-time algorithms constructing a dimension reduction map {A: {\mathbb R}^D \rightarrow {\mathbb R}^k} for {k = O(\varepsilon^{-2} \log |X|)}: the first one was given in this paper of Lars Engebretsen, Piotr Indyk and Ryan O’Donnell; another construction is due to D. Sivakumar.

A SODA 2011 paper of T.S. Jayram and David Woodruff shows that this dependence of {O(\varepsilon^{-2} \log \delta^{-1})} is the best possible. Note that if we use this approach (the union bound over all pairs) to prove JL, then {O(\varepsilon^{-2} \log n)} is the best bound possible. (An earlier version of these notes incorrectly claimed that the Jayram-Woodruff paper also showed an unconditional lower bound for JL; thanks to Jelani for pointing out the mistake.)

1.3. The proof

Now, on to the proof of Lemma 2. Here’s the main idea. Imagine that the vector we’re considering is just the elementary unit vector {e_1 = (1, 0, \ldots, 0)}. Then {M\, e_1} is just a vector of i.i.d. Gaussian values, and we’re interested in its length—the sum of squares of these Gaussians. If these were bounded r.v.s, we’d be done—but they are not. However, their tails are very small, so things should work out.

But what’s a Gaussian {N(0,1)}? Well, it looks like this:

(Figure: a sketch of the standard Gaussian density.)

Which is not too different from this (bounded) random variable, if you squint a bit:

(Figure: a sketch of a bounded random variable with a similar bell shape.)

Which has constant mean. So, if we take a sum of a bunch of such random variables (actually of their squares), it should behave pretty much like its mean (which is {\propto k}), because of a Chernoff-like argument. And so the expected length is close to {\sqrt{k}}, which explains the division by {\sqrt{k}}.

Now we just need to make all this precise, and remove the assumption that the vector was just {e_1}. That’s what the rest of the formal proof does: it has a few steps, but each of them is fairly elementary.

1.4. The proof, this time for real

We’ll be using basic facts about Gaussians, let’s just recall them. The probability density function for the Gaussian {N(\mu, \sigma^2)} is

\displaystyle  f(x) = \textstyle \frac{1}{\sqrt{2 \pi} \sigma } e^{-\frac{ (x - \mu)^2 }{2\sigma^2} }.

We also use the following; the proof just needs some elbow grease.

Proposition 3 If {Y_1 \sim N(\mu_1, \sigma_1^2)} and {Y_2 \sim N(\mu_2, \sigma_2^2)}, then

\displaystyle  Y_1 + Y_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2).

Recall that we want to argue about the squared length of {A(x) \in {\mathbb R}^k}. To start off, observe that each coordinate of the vector {Mx} behaves like

\displaystyle  Y \sim \langle G_1, G_2, \ldots, G_D \rangle \cdot x = \sum x_i \, G_i

where the {G_i}‘s are i.i.d. {N(0,1)} r.v.s. But then the proposition tells us that {Y \sim N(0, x_1^2 + x_2^2 + \ldots + x_D^2)}. And since {x} is a unit length vector, this is simply {N(0,1)}. So, each of the {k} coordinates of {Mx} behaves just like an independent Gaussian!

What is the squared length of {A(x) = \frac1{\sqrt{k}} Mx}, then? It is

\displaystyle  Z := \sum_{i = 1}^k \frac{1}{k} \cdot Y_i^2

where each {Y_i \sim N(0,1)}, independent of the others. And since {E[Y_i^2] = \text{Var}(Y_i) + E[Y_i]^2 = 1}, we get {E[Z] = 1}.

Now to show that {Z} does not deviate too much from {1}. And {Z} is the sum of a bunch of independent and identical random variables. If only the {Y_i}‘s were all bounded, we could have used a Chernoff bound and be done. But these are not bounded, so this is finally where we’ll need to do a little work. (Note: we could take the easy way out, observe that the squares of Gaussians are chi-squared r.v.s, the sum of {k} of them is chi-squared with {k} degrees of freedom, and the internets conveniently has tail bounds for these things. But we digress.)

So let’s start down ye olde Chernoff path, for the upper tail, say: for every {t > 0},

\displaystyle  \Pr[ Z \geq 1 + \varepsilon ] \leq \Pr[ e^{tkZ} \geq e^{tk(1+\varepsilon)} ] \leq \frac{E[ e^{tkZ} ]}{e^{tk(1+\varepsilon)}} = \prod_i \frac{ E[ e^{tY_i^2} ] }{ e^{t(1+\varepsilon)} }.

And what is {E[e^{tY^2}]} for {Y \sim N(0,1)}? Let’s calculate it: for {t < 1/2},

\displaystyle  E[ e^{tY^2} ] = \frac1{\sqrt{2\pi}} \int_y e^{ty^2} e^{-y^2/2}\, dy = \frac1{\sqrt{2\pi}} \int_z e^{-z^2/2} \frac{dz}{\sqrt{1 - 2t}} = \frac{1}{\sqrt{1 - 2t}}.

So our current bound on the upper tail is that for all {t \in (0, 1/2)} we have

\displaystyle  \begin{array}{rl}  \Pr[ Z \geq (1+\varepsilon) ] &\leq \left( \frac{1}{e^{t(1+\varepsilon)} \sqrt{1 - 2t}} \right)^k. \end{array}

Let’s just focus on part of this expression:

\displaystyle  \begin{array}{rl}  \left( \frac{1}{e^{t} \sqrt{1 - 2t}} \right) &= \exp\left( - t - \frac12 \log (1 - 2t) \right) \\ &= \exp\left( (2t)^2/4 + (2t)^3/6 + \cdots \right) \leq \exp\left( t^2( 1 + 2t + 2t^2 + \cdots ) \right) \\ &= \exp(t^2/(1 - 2t)). \end{array}

Plugging this back, we get

\displaystyle  \begin{array}{rl}  \Pr[ Z \geq (1+\varepsilon) ] &\leq \left( \frac{1}{e^{t(1+\varepsilon)} \sqrt{1 - 2t}} \right)^k \\ &\leq \exp( kt^2/(1-2t) - kt\varepsilon ) \leq e^{-k\varepsilon^2/8}, \end{array}

if we set {t = \varepsilon/4} and use the fact that {1-2t \geq 1/2} for {\varepsilon \leq 1/2}. (Note: this setting of {t} also satisfies {t \in (0,1/2)}, which we needed from our previous calculations.)

Almost done: let’s take stock of the situation. We observed that {\| A(x) \|_2^2} was distributed like a sum of squares of Gaussians, and using that we proved that

\displaystyle  \Pr[ \| A(x) \|_2^2 > 1 + \varepsilon ] \leq \exp( - k\varepsilon^2/8 ) \leq \delta/2

for {k = \frac{8}{\varepsilon^2} \ln \frac{2}{\delta}}. A similar calculation bounds the lower tail, and finishes the proof of Lemma 2.

Citations: The JL Lemma was first proved in this paper of Bill Johnson and Joram Lindenstrauss. There have been several proofs after theirs, usually trying to tighten their results, or simplify the algorithm/proof (see citations in some of the newer papers): the proof follows some combinations of the proofs in this STOC ’98 paper of Piotr Indyk and Rajeev Motwani, and this paper by Sanjoy Dasgupta and myself.

2. The data stream model

The JL map we considered was a linear map, and that has many advantages. One of them is that we can use it in a distributed context: if {t} players each have a vector {\vec{y}_i} and each knows the JL matrix {A}, then to compute {A(\sum_i \vec{y}_i)} each person can just compute {A\vec{y}_i}, send their answers out, and then someone can sum up the answers to get {\sum_i A(\vec{y}_i) = A(\sum_i \vec{y}_i)}. Since these vectors {A\vec{y}_i} are smaller than {\vec{y}_i} (they lie in {{\mathbb R}^k} instead of {{\mathbb R}^D}), this can result in significant savings in communication. (We need all players to know the matrix {A}, but if they have shared randomness they can generate this matrix themselves.)

This same idea is useful in the context of data streaming: suppose you have a data stream of a large number of elements {\sigma_1, \sigma_2, \ldots,} whizzing past you, each element {\sigma_j} drawn from the universe {[D]}. This stream defines a frequency vector {\vec{x} \in {\mathbb R}^D}, where {x_i} is the number of times element {i} is seen. People working on data streams want to calculate statistics of this vector {\vec{x}}—e.g., how many non-zeroes does it have? What is the {\ell_1} length of this? (Duh! it’s just the length of the data stream.) What is {\sum_i x_i^2 = \| \vec{x} \|_2^2}? Etc.

The Space Crunch. All this can be trivially done if we use {D} space to actually store the vector {\vec{x}}. Suppose we do not want to store the frequency vector explicitly, but are OK with approximate answers. We can use JL or similar schemes to approximately calculate {\sum_i x_i^2}. Suppose {A} is a random {k \times D} Gaussian matrix, then by the guarantee of the JL lemma, the estimate {\|A\vec{x}\|_2^2 \in (1 \pm \varepsilon) \|\vec{x}\|_2^2} with probability {1 - \delta}, if {k = \Omega(1/\varepsilon^2 \log 1/\delta)}. (Note: this is the error for a single query—so we’re not guaranteeing the counts at all times are close, just at the time the query is made.)

And the algorithm is simple: maintain a vector {y \in {\mathbb R}^k}, initially zero. When the element {j \in [D]} comes by, add in the {j^{th}} column of {A} to {y}. Finally, answer with {\|y\|_2^2}. (If you have to answer {t} queries, choose {k} appropriately larger.)

Of course, you’ve realized I am cheating. In order to save space we used JL. But the JL matrix itself uses {kD} entries, which is a lot of space, much more than the {D} entries of the frequency vector {\vec{x}}! Also, we now need to maintain a matrix of reals, whereas {\vec{x}} just has integers!

We can handle both issues. The former issue can directly be handled by using a pseudorandom generator that “fools” low-space computation—we will not talk about this solution in this lecture. Instead we’ll give a different (though weaker) solution which handles both issues: it will use less space, and will maintain only integer values (if the input has integers).

3. Using random signs instead of Gaussians

While Gaussians have all kinds of nice properties, they are real-valued distributions and hence require attention to precision. How about populating {A} with draws from other, simpler distributions? How about setting each {M_{ij} \in_R \{-1,+1\}}, and letting {A = \frac{1}{\sqrt{k}} M}? (A random sign is also called a Rademacher random variable, btw, the name Bernoulli being already taken for a random bit in {\{0,1\}}.)

Now, we want to study the properties of

\displaystyle  Z := \sum_{i = 1}^k \left( \sum_{j = 1}^D A_{ij} \cdot x_j \right)^2. \ \ \ \ \ (1)

To keep subscripts to a minimum, consider the inner expression

\displaystyle  Y = (\sum_j X_j \cdot x_j)^2

where each {X_i \in_R \{-1, 1\}}. Then

\displaystyle  \begin{array}{rcl}  E[Y] &= E[(\sum_j X_j x_j)(\sum_l X_l x_l)] \\ &= E[ \sum_j X_j^2 x_j^2 + \sum_{j \neq l} X_jX_l x_jx_l ] \\ &= \sum_j E[X_j^2] x_j^2 + \sum_{j \neq l} E[X_jX_l] x_jx_l = \sum_j x_j^2. \end{array}

if the {X_j}‘s are pairwise independent, since {X_j^2 = 1} and {E[X_jX_l] = E[X_j]E[X_l]= 0} by independence. Plugging this into (1) and recalling that {A_{ij} \in \{-\frac{1}{\sqrt{k}},+\frac{1}{\sqrt{k}}\}}, we get

\displaystyle  E[ Z ] = \sum_{i = 1}^k \frac1k \sum_j x_j^2 = \| x \|_2^2.

Just what we like! To show that {Z} is indeed close to its mean, we will use Chebyshev, and this requires us to compute the variance of {Z}.

If the rows of {A} are independent, then {\text{Var}(Z)} is the sum of the variances from each row, which in terms of the variable {Y} defined above is:

\displaystyle  \text{Var}(Z) = \sum_{i = 1}^k \frac{1}{k^2} \text{Var}(Y) = \frac{\text{Var}(Y)}{k}.

But {\text{Var}(Y) = E[Y^2] - E[Y]^2}, and we know what {E[Y]^2} is. For the other term,

\displaystyle  \begin{array}{rl}  E[Y^2] &= E[ \sum_{p,q,r,s} X_pX_qX_rX_s x_px_qx_rx_s ] \\ &= \sum_p E[ X_p^4 x_p^4] + 6\sum_{p < q} E[X_p^2X_q^2 x_p^2x_q^2] + \text{ other terms} \\ &= \sum_p x_p^4 + 6\sum_{p < q} x_p^2x_q^2 \end{array}

(The other terms disappear because of {4}-wise independence.) And plugging this into the definition of {\text{Var}(Y)}, we get

\displaystyle \text{Var}(Y) = \sum_p x_p^4 + 6\sum_{p < q} x_p^2x_q^2 - (\sum_p x_p^2)^2 = 4 \sum_{p < q} x_p^2x_q^2 \leq 2 E[Z]^2.

Interesting: the variance of {Y} is at most twice its squared mean—that’s good, since the variance of {Z} (which is the final answer, obtained by averaging {k} such variables) is {1/k} as much, since averaging reduces the variance. So {\text{Var}(Z) \leq \frac2k E[Z]^2}. And finally, we can set {k = \frac{2}{\varepsilon^2 \delta}} and use Chebyshev to get

\displaystyle  \Pr[ Z \not\in (1 \pm \varepsilon) E[Z] ] \leq \frac{\text{Var}(Z)}{(\varepsilon E[Z])^2} \leq \delta.

Great! So take a {k \times D} matrix {M} with {k = \frac{2}{\varepsilon^2 \delta}} independent rows, each row having {\{-1,+1\}} values drawn from a {4}-wise independent sample space (so {A = \frac{1}{\sqrt{k}} M} as before). We maintain a {k}-dimensional vector {y}, and whenever an element {j \in [D]} comes by in the stream, we just add the {j^{th}} column of {M} to {y}. And when we want the answer, we reply with {\frac{1}{k} \| y\|_2^2 = \|A\vec{x}\|_2^2}—this will be within {(1 \pm \varepsilon)} of {\|\vec{x}\|_2^2} with probability at least {1 - \delta}.

Why {4}-wise independence? Well, the calculation of {\text{Var}(Z)} only used the fact that any four entries of each row behaved independently of each other. And it is possible to generate {D} values from {\{-1,+1\}} which are {4}-wise independent, using hash functions that require only {O(\log D)} bits of space. (We’ll talk more about this later in the course.) So the total space usage is: {O(k \log D)} bits to store the hash functions, {O(k \log (LD)) = O(\varepsilon^{-2} \delta^{-1} \log (LD))} bits to store the vector {y} if the frequency of each element is at most {L}, and that’s it.
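
Here is a minimal sketch (mine, not from the official notes) of the resulting streaming data structure. For clarity it stores an explicit matrix of fully independent random signs; in the space-efficient version each row's signs would instead come from a {4}-wise independent hash (e.g., a random cubic polynomial over a prime field), so only the hash seeds need to be stored.

```python
import random
from collections import Counter

class F2Sketch:
    """Tug-of-war style sketch: k counters, each a random +/-1 combination of the frequencies.
    Uses fully random signs for simplicity; 4-wise independent signs per row also suffice."""
    def __init__(self, k, D, seed=0):
        rnd = random.Random(seed)
        self.signs = [[rnd.choice((-1, 1)) for _ in range(D)] for _ in range(k)]
        self.y = [0] * k

    def update(self, j):
        # element j arrives in the stream: add the j-th column of the sign matrix to y
        for i, row in enumerate(self.signs):
            self.y[i] += row[j]

    def estimate(self):
        # (1/k) * ||y||^2 estimates F_2 = sum_j x_j^2
        return sum(c * c for c in self.y) / len(self.y)

# toy stream over the universe [D]
D = 100
stream = [random.randrange(D) for _ in range(10000)]
sketch = F2Sketch(k=400, D=D)
for j in stream:
    sketch.update(j)

true_f2 = sum(c * c for c in Counter(stream).values())
print(sketch.estimate(), true_f2)
```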

Citations: This scheme is due to the Gödel prize winning paper of Noga Alon, Yossi Matias, and Mario Szegedy. There has been a lot of interesting work on moment estimation: see, e.g., this STOC 2011 paper of Daniel Kane, Jelani Nelson, Ely Porat and David Woodruff on getting lower bounds for {\ell_p}-norms of the vector {x}, and the many references therein.

4. Subgaussian Behavior

In the previous section, we saw that if each row of the matrix {A} was drawn from a {4}-wise independent sample space (and hence generating any column of {A} could be done in {O(k \log D)} space), setting {k = O(\varepsilon^{-2} \delta^{-1})} would suffice to give answers within {(1\pm \varepsilon)} with probability at least {1- \delta}. Note that the number of rows went from {O(\varepsilon^{-2} \log \delta^{-1})} to {O(\varepsilon^{-2} \delta^{-1})}; this increase is typical of cases where we only use the second moment (and limited independence) instead of all the moments (complete independence).

So suppose we did have the luxury of full independence, could we match the JL bound using Rademacher matrices? Or does moving to the {\{-1, 1\}} case already lose something in the performance? It turns out we can also prove Lemma 2 for a Rademacher matrix, losing only constants—we’ll now prove this.

Let’s look over the proof in Section 1, and see what we need to do. We take an arbitrary unit vector {x}, and define

\displaystyle  Y \sim \langle R_1, R_2, \ldots, R_D\rangle \cdot x = \sum_i x_i\, R_i \qquad \text{and} \qquad Z = \frac1k \sum_{i=1}^k Y_i^2

for {R_i \in_R \{-1,1\}}, and {Y_i}‘s being i.i.d and {Y_i \sim Y}. If we could show that

  • {E[Z] = \| x\|_2^2}, and
  • {E[ e^{tY^2} ] \leq \frac{1}{\sqrt{1 - ct}}} for some constant {c},

then the rest of the proof of Section 1 does not use any other facts about Gaussians. And the first fact {E[Z] = \| x\|_2^2} follows by the calculations from the previous section, so all we need to do is to bound the moment generating function for {Y^2}!

We can do this by explicit calculations, but instead let’s give a useful abstraction:

Definition 4 A random variable {V} is said to be subgaussian with parameter {c} if for all real {s}, we have {E[ e^{s V} ] \leq e^{c s^2}}.

(You can define subgaussian-ness alternatively as in these notes by Roman Vershynin, which also show the two definitions are equivalent for symmetric distributions.) A simple calculation shows that if {G \sim N(0,1)} then {E[ e^{sG} ] = e^{s^2/2}}—good to know that the Gaussian is also subgaussian!

The following lemma gives a slick way to bound the mgf for the square of a subgaussian, now that we’ve done the hard work for the Gaussians.

Lemma 5 If {V} is subgaussian with parameter {c}, then {E[ e^{sV^2} ] \leq \frac{1}{\sqrt{1 - 4c\, s}}} for {s > 0}.

Proof: Well, suppose {G \sim N(0,1)} is an independent Gaussian, then

\displaystyle  E_V[ e^{sV^2} ] = E_{G,V}[ e^{\sqrt{2s}V \, G} ]

by the calculation we just did for Gaussians. (Note that we’ve just introduced a Gaussian into the mix, without any provocation! But it will all work out.) Let’s just rewrite that as

\displaystyle  E_{G,V}[ e^{\sqrt{2s}V \, G} ] = E_G[ E_V[ e^{(\sqrt{2s}G) V} ] ].

Using the {c}-subgaussian behavior of {V} we bound this by

\displaystyle  E_G[ e^{c(\sqrt{2s}|G|)^2} ] = E_G[ e^{2cs G^2} ].

Finally, the Gaussian calculation from Section 1.4 gives this to be {\frac{1}{\sqrt{1 - 4cs}}}. \Box

Good. Now if {Y} were subgaussian, we’d be done. We know that {Y} is a weighted sum of Rademacher variables. A Rademacher random variable is indeed {1/2}-subgaussian:

\displaystyle  E[ e^{s R} ] = \frac{ e^s + e^{-s} }{2} = \cosh{s} = 1 + \frac{s^2}{2!} + \frac{s^4}{4!} + \cdots \leq e^{s^2/2}.

And if the {V_i}‘s are independent and {c}-subgaussian, and {\| x\|_2 = 1}, then {V = \sum_i x_i V_i} has

\displaystyle  E[ e^{sV} ] = E[ e^{\sum_i (sx_i) V_i} ] \leq \prod_i e^{c(sx_i)^2} = e^{cs^2 \sum_i x_i^2} = e^{cs^2}.

To summarize: {R_i}‘s are {1/2}-subgaussian, so {Y = \sum_i x_iR_i} is too. And hence {E[ e^{tY^2} ] \leq \frac{1}{\sqrt{1 - 2t}}} for {\{-1,+1\}}-random variables as well. This, in turn, completes the proof that the Rademacher matrix also has the JL property! Note that the JL matrix {A} now just requires us to pick {kD = O(D \varepsilon^{-2} \log \delta^{-1})} random bits (instead of {kD} random Gaussians); also, there are fewer precision issues to worry about. One can consider other distributions to stick into the matrix {A}—all you need to show is that {Z} has the right mean, and that the entries are subgaussian.

Citations: The scheme of using Rademacher matrices instead of Gaussian matrices for JL was first proposed in this paper by Dimitris Achlioptas. The idea of extending it to subgaussian distributions appears in this paper of Indyk and Naor, and this paper of Matousek. The paper of Klartag and Mendelson generalizes this even further.

BTW, one can define subgaussian distributions as ones that satisfy {E[e^{sV}] \leq e^{cs^2}} only for {s > 0}, or as variables for which {\Pr[ V \geq \lambda ] \leq e^{-c\lambda^2}} for {\lambda > 0} (the upper tail is subgaussian), and prove JL bounds—see, e.g., the paper of Matousek—but it does not matter for distributions symmetric about {0} with bounded variance, since these definitions are then essentially the same.

Fast J-L: Do we really need to plug non-zero values into every entry of the matrix {A}? What if most of {A} were filled with zeroes? The first problem is that if {x} is a very sparse vector, then {Ax} might be zero with high probability. Achlioptas showed that having a random two-thirds of the entries of {A} be zero still works fine. The paper of Nir Ailon and Bernard Chazelle showed that if you first hit {x} with a suitable matrix {P} which causes {Px} to be “well-spread-out” whp, then {\|APx\| \approx \|x\|} still holds for a much sparser {A}. Moreover, this {P} requires much less randomness, and furthermore, the computations can be done faster too! There has been much work on fast and sparse versions of JL: see, e.g., this SODA 11 paper of Ailon and Edo Liberty, and this arxiv preprint by Daniel Kane and Jelani Nelson. Jelani has some notes on the Fast JL Transform.

Compressed Sensing: Finally, the J-L lemma is closely related to compressed sensing: how to reconstruct a sparse signal using very few measurements. See these notes by Jiri Matousek, or these by Baraniuk and others for a proof of the beautiful connection. I will say more about this connection in a later post.

Lecture #14: Game Theory

Today we discussed some key concepts in game theory and connections between (some of) them and results in online learning.

We began by discussing 2-player zero-sum games. The Minimax Theorem states that these games have a well-defined value V such that (a) there exists a mixed strategy p for the row-player that guarantees the row player makes at least V (in expectation) no matter what column the column-player chooses, and (b) there exists a mixed strategy q for the column-player that guarantees the row player makes at most V (in expectation) no matter what row the row-player chooses. We then saw how we could use the results we proved about the Randomized Weighted Majority algorithm to prove this theorem.
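
As an illustration of that connection (my own sketch, not the lecture notes): run an exponential-weights update, in the spirit of Randomized Weighted Majority, for the row player against a best-responding column player; the time-averaged strategy and payoff then approach the minimax strategy and the value of the game. The learning rate and iteration count below are arbitrary choices.

```python
import math

def approx_minimax(A, T=5000, eta=0.05):
    """A[i][j] = payoff to the row player. Returns (approx. row mixed strategy, approx. value)."""
    n, m = len(A), len(A[0])
    w = [1.0] * n
    avg_p = [0.0] * n
    total_payoff = 0.0
    for _ in range(T):
        s = sum(w)
        p = [wi / s for wi in w]
        # column player best-responds (minimizes the row player's expected payoff)
        j = min(range(m), key=lambda col: sum(p[i] * A[i][col] for i in range(n)))
        total_payoff += sum(p[i] * A[i][j] for i in range(n))
        # multiplicative update: rows that did well against column j gain weight
        w = [w[i] * math.exp(eta * A[i][j]) for i in range(n)]
        avg_p = [a + pi / T for a, pi in zip(avg_p, p)]
    return avg_p, total_payoff / T

# Rock-paper-scissors: value 0, minimax strategy (1/3, 1/3, 1/3)
A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
print(approx_minimax(A))
```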

We next discussed general-sum games and the notion of Nash equilibria. In a zero-sum game, a Nash equilibrium requires the two players to both be playing minimax optimal strategies, but in general there could be Nash equilibria of multiple quality-levels for each player. We then proved the existence of Nash equilibria. However, unlike in the case of zero-sum games, the proof gives no idea of how to find a Nash equilibrium or even to approach one. In fact, doing so in a 2-player nxn game is known to be PPAD-hard.

Finally we discussed the notion of correlated equilibria and the connection of these to swap-regret, a generalization of the kind of regret notion we discussed last time. In particular, any set of algorithms with swap-regret sublinear in T, when played against each other, will have an empirical distribution of play that approaches (the set of) correlated equilibria.  See lecture notes.

Lecture 1 notes and references

The reference to the best known algorithm for 3SAT we mentioned in class is this paper: Improving PPSZ for 3-SAT using Critical Variables, by Hertli, Moser, and Scheder.

It gives an algorithm with an expected runtime of 1.321^n; the idea is to cleverly combine Schoening’s algorithm (the one you saw today and will analyze in HW1) with a previous algorithm of Paturi, Pudlak, Saks and Zane (PPSZ).

At a super-high level, Schoening’s algorithm does better if there are lots of (well-separated) solutions that the randomized process can make a beeline for. The PPSZ algorithm, another very simple randomized algorithm that we may see in a later post, works better if there are few satisfying assignments. The idea in the paper of Hertli and others is to trade off the two guarantees to do better than either one individually; this idea of combining Schoening and PPSZ goes back to a paper of Iwama and Tamaki (SODA ’04).
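
For reference, here is a minimal sketch (mine) of Schoening’s random-walk algorithm, the one you will analyze in HW1: pick a uniformly random assignment, then for about 3n steps repeatedly pick an unsatisfied clause and flip a uniformly random variable in it; restart with a fresh random assignment if no satisfying assignment is found. The input format below is my own choice.

```python
import random

def schoening(clauses, n, restarts=1000):
    """clauses: list of 3-tuples of nonzero ints; literal v > 0 means x_v, v < 0 means not x_v."""
    for _ in range(restarts):
        assign = {v: random.random() < 0.5 for v in range(1, n + 1)}
        for _ in range(3 * n):
            unsat = [c for c in clauses
                     if not any((lit > 0) == assign[abs(lit)] for lit in c)]
            if not unsat:
                return assign                        # satisfying assignment found
            lit = random.choice(random.choice(unsat))
            assign[abs(lit)] = not assign[abs(lit)]  # flip one variable of a random unsat clause
    return None   # likely unsatisfiable: each restart succeeds w.p. roughly (3/4)^n if satisfiable

# tiny example: (x1 or x2 or x3) and (not x1 or x2 or not x3) and (x1 or not x2 or x3)
print(schoening([(1, 2, 3), (-1, 2, -3), (1, -2, 3)], n=3))
```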