CMU Randomized Algorithms

Randomized Algorithms, Carnegie Mellon: Spring 2011

Lecture #17: Dimension Reduction

Today we’ll talk about dimensionality reduction, and some related topics in data streaming.

1. Dimension Reduction

Suppose we are given a set of {n} points {\{x_1, x_2, \ldots, x_n\}} in {{\mathbb R}^D}. How small can we make {D} and still maintain the Euclidean distances between the points? Clearly, we can always make {D = n-1}, since any set of {n} points lies in an {(n-1)}-dimensional subspace. And this is (existentially) tight: e.g., the case when {x_2 - x_1, x_3 - x_1, \ldots, x_n - x_1} are all orthogonal vectors.

But what if we were OK with the distances being approximately preserved? In HW#3, you saw that while there could only be {D} orthogonal unit vectors in {{\mathbb R}^D}, there could be as many as {\exp(c \varepsilon^2 D)} unit vectors which are {\varepsilon}-orthogonal—i.e., whose mutual inner products all lie in {[-\varepsilon, \varepsilon]}. Near-orthogonality allows us to pack exponentially more vectors!

Put another way, note that

\displaystyle  \| \vec{a} - \vec{b} \|_2^2 = \langle \vec{a} - \vec{b}, \vec{a} - \vec{b} \rangle = \langle \vec{a}, \vec{a} \rangle + \langle \vec{b}, \vec{b} \rangle - 2\langle \vec{a}, \vec{b} \rangle = \| \vec{a} \|_2^2 + \| \vec{b} \|_2^2 - 2\langle \vec{a}, \vec{b} \rangle.

And hence the squared Euclidean distance between any pair of the points defined by these {\varepsilon}-orthogonal vectors falls in {2(1 \pm \varepsilon)}. So, if we wanted {n} points exactly at unit (Euclidean) distance from each other, we would need {n-1} dimensions. (Think of a triangle in {2}-dims.) But if we wanted to pack in {n} points which were at distance {(1 \pm \varepsilon)} from each other, we could pack them into

\displaystyle  O\left(\frac{\log n}{\varepsilon^2}\right)

dimensions.


1.1. The Johnson Lindenstrauss lemma

The Johnson Lindenstrauss “flattening” lemma says that such a claim is true not just for equidistant points, but for any set of {n} points in Euclidean space:

Lemma 1 Let {\varepsilon \in (0, 1/2)}. Given any set of points {X = \{x_1, x_2, \ldots, x_n\}} in {{\mathbb R}^D}, there exists a map {A: {\mathbb R}^D \rightarrow {\mathbb R}^k} with {k = O(\varepsilon^{-2} \log n)} such that for every pair {x_i, x_j \in X},

\displaystyle  1 - \varepsilon \leq \frac{\| A(x_i) - A(x_j) \|_2^2 }{ \| x_i - x_j \|_2^2} \leq 1 + \varepsilon.

Note that the target dimension {k} is independent of the original dimension {D}, and depends only on the number of points {n} and the accuracy parameter {\varepsilon}.

This lemma is tight up to the constant term: it is easy to see that we need at least {\Omega(\frac1\varepsilon \log n)} dimensions using a packing argument. Noga Alon showed a lower bound of {\Omega(\frac{\log n}{\varepsilon^2 \log 1/\varepsilon})}.

1.2. The construction

The JL lemma is pretty surprising, but the construction of the map is perhaps even more surprising: it is a super-simple random construction. Let {M} be a {k \times D} matrix, such that every entry of {M} is filled with an i.i.d. draw from a standard normal {N(0, 1)} distribution (a.k.a. the “Gaussian” distribution). For {x \in {\mathbb R}^D}, define

\displaystyle  A(x) = \frac{1}{\sqrt{k}} Mx.

That’s it. You hit the vector {x} with a Gaussian matrix {M}, and scale it down by {\sqrt{k}}. That’s the map {A}. Note that it is a linear map: {A(x) + A(y) = A(x+y)}. So suppose we could show the following lemma:
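To make the construction concrete, here is a minimal numerical sketch (not from the lecture; the use of numpy, the parameter values, and the function name are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_jl_matrix(D, k):
    """The map A: multiply by a k x D matrix of i.i.d. N(0,1) entries, scaled by 1/sqrt(k)."""
    return rng.standard_normal((k, D)) / np.sqrt(k)

D, k = 10_000, 800                # k should be on the order of eps^{-2} log n
A = make_jl_matrix(D, k)

x = rng.standard_normal(D)
x /= np.linalg.norm(x)            # a unit vector in R^D
print(np.linalg.norm(A @ x) ** 2) # typically within a few percent of 1
```

The same matrix {A} is used for every point, so by linearity pairwise distances (which are just norms of differences) are preserved as well.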

Lemma 2 Let {\varepsilon \in (0, 1/2)}. If {A} is constructed as above with {k = c \varepsilon^{-2} \log \delta^{-1}}, and {x \in {\mathbb R}^D} is a unit vector, then

\displaystyle  \Pr[ \| A(x) \|_2^2 \in 1 \pm \varepsilon ] \geq 1 - \delta.

Then we’d get a proof of Lemma 1. Indeed, set {\delta = 1/n^2}, and hence {k = O(\varepsilon^{-2} \log n)}. Now for each {x_i, x_j \in X} we get that the squared length of {x_i - x_j} is maintained to within {1 \pm \varepsilon} with probability at least {1 - 1/n^2}. By a union bound, all {\binom{n}{2}} pairs of distances in {\binom{X}{2}} are maintained with probability at least {1 - \binom{n}{2}\frac1{n^2} \geq 1/2}. This proves Lemma 1.

A few comments about this construction:

  • The above proof shows not only the existence of a good map: we also get that a random map as above works with constant probability! In other words, this gives a Monte-Carlo randomized algorithm for dimension reduction. (Since we can efficiently check that the distances are preserved to within the prescribed bounds, we can convert this into a Las Vegas algorithm.)
  • The algorithm (at least the Monte Carlo version) does not even look at the set of points {X}: it works for any set {X} with high probability. Hence, we can pick this map {A} before the points in {X} arrive.

  • Given a set {X \subseteq {\mathbb R}^D}, one can get deterministic poly-time algorithms constructing a dimension reduction map {A: {\mathbb R}^D \rightarrow {\mathbb R}^k} for {k = O(\varepsilon^{-2} \log |X|)}: the first one was given in this paper of Lars Engebretsen, Piotr Indyk and Ryan O’Donnell; another construction is due to D. Sivakumar.

A SODA 2011 paper of T.S. Jayram and David Woodruff shows that this dependence of {O(\varepsilon^{-2} \log \delta^{-1})} is the best possible. Note that if we use this approach with the union bound to prove JL, then {O(\varepsilon^{-2} \log n)} is the best bound possible. (An earlier version of these notes incorrectly claimed that the Jayram-Woodruff paper also showed an unconditional lower bound for JL; thanks to Jelani for pointing out the mistake.)

1.3. The proof

Now, on to the proof of Lemma 2. Here’s the main idea. Imagine that the vector we’re considering is just the elementary unit vector {e_1 = (1, 0, \ldots, 0)}. Then {M\, e_1} is just a vector of i.i.d. standard Gaussian entries, and we’re interested in its squared length—the sum of squares of these Gaussians. If these were bounded r.v.s, we’d be done—but they are not. However, their tails are very small, so things should work out.

But what’s a Gaussian {N(0,1)}? Its density is the familiar bell curve (figure omitted), which, if you squint a bit, is not too different from a bounded random variable spread over an interval of constant width around zero (figure omitted). The square of such a bounded random variable has constant mean. So, if we take a sum of a bunch of such random variables (actually of their squares), it should behave pretty much like its mean (which is {\propto k}), because of a Chernoff-like argument. And so the expected length of {Mx} is close to {\sqrt{k}}, which explains the division by {\sqrt{k}}.

Now we just need to make all this precise, and remove the assumption that the vector was just {e_1}. That’s what the rest of the formal proof does: it has a few steps, but each of them is fairly elementary.

1.4. The proof, this time for real

We’ll be using basic facts about Gaussians, let’s just recall them. The probability density function for the Gaussian {N(\mu, \sigma^2)} is

\displaystyle  f(x) = \textstyle \frac{1}{\sqrt{2 \pi} \sigma } e^{-\frac{ (x - \mu)^2 }{2\sigma^2} }.

We also use the following; the proof just needs some elbow grease.

Proposition 3 If {Y_1 \sim N(\mu_1, \sigma_1^2)} and {Y_2 \sim N(\mu_2, \sigma_2^2)} are independent, then

\displaystyle  Y_1 + Y_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2).

Recall that we want to argue about the squared length of {A(x) \in {\mathbb R}^k}. To start off, observe that each coordinate of the vector {Mx} behaves like

\displaystyle  Y \sim \langle G_1, G_2, \ldots, G_D \rangle \cdot x = \sum x_i \, G_i

where the {G_i}‘s are i.i.d. {N(0,1)} r.v.s. But then the proposition tells us that {Y \sim N(0, x_1^2 + x_2^2 + \ldots + x_D^2)}. And since {x} is a unit length vector, this is simply {N(0,1)}. So, each of the {k} coordinates of {Mx} behaves just like an independent Gaussian!

What is the squared length of {A(x) = \frac1{\sqrt{k}} Mx}, then? It is

\displaystyle  Z := \sum_{i = 1}^k \frac{1}{k} \cdot Y_i^2

where each {Y_i \sim N(0,1)}, independent of the others. And since {E[Y_i^2] = \text{Var}(Y_i) + E[Y_i]^2 = 1}, we get {E[Z] = 1}.

Now to show that {Z} does not deviate too much from {1}. And {Z} is the sum of a bunch of independent and identically distributed random variables. If only the {Y_i}'s were all bounded, we could have used a Chernoff bound and been done. But these are not bounded, so this is finally where we’ll need to do a little work. (Note: we could take the easy way out, observe that the squares of Gaussians are chi-squared r.v.s, the sum of {k} of them is chi-squared with {k} degrees of freedom, and the internets conveniently have tail bounds for these things. But we digress.)

So let’s start down the ye olde Chernoff path, for the upper tail, say: for every {t > 0},

\displaystyle  \Pr[ Z \geq 1 + \varepsilon ] \leq \Pr[ e^{tkZ} \geq e^{tk(1+\varepsilon)} ] \leq \frac{E[ e^{tkZ} ]}{e^{tk(1+\varepsilon)}} = \prod_{i=1}^k \frac{E[ e^{tY_i^2} ]}{e^{t(1+\varepsilon)}}.

And what is {E[e^{tY^2}]} for {Y \sim N(0,1)}? Let’s calculate it: for {t < 1/2}, substituting {z = y\sqrt{1-2t}},

\displaystyle  E[e^{tY^2}] = \frac1{\sqrt{2\pi}} \int_y e^{ty^2} e^{-y^2/2}\, dy = \frac1{\sqrt{2\pi}} \int_z e^{-z^2/2} \frac{dz}{\sqrt{1 - 2t}} = \frac{1}{\sqrt{1 - 2t}}.

So our current bound on the upper tail is that for all {t \in (0, 1/2)} we have

\displaystyle  \begin{array}{rl}  \Pr[ Z \geq (1+\varepsilon) ] &\leq \left( \frac{1}{e^{t(1+\varepsilon)} \sqrt{1 - 2t}} \right)^k. \end{array}

Let’s just focus on part of this expression:

\displaystyle  \begin{array}{rl}  \left( \frac{1}{e^{t} \sqrt{1 - 2t}} \right) &= \exp\left( - t - \tfrac12 \log (1 - 2t) \right) \\ &= \exp\left( (2t)^2/4 + (2t)^3/6 + \cdots \right) \leq \exp\left( t^2( 1 + (2t) + (2t)^2 + \cdots ) \right) \\ &= \exp(t^2/(1 - 2t)). \end{array}

Plugging this back, we get

\displaystyle  \begin{array}{rl}  \Pr[ Z \geq (1+\varepsilon) ] &\leq \left( \frac{1}{e^{t(1+\varepsilon)} \sqrt{1 - 2t}} \right)^k \\ &\leq \exp( kt^2/(1-2t) - kt\varepsilon ) \leq e^{-k\varepsilon^2/8}, \end{array}

if we set {t = \varepsilon/4} and use the fact that {1-2t \geq 1/2} for {\varepsilon \leq 1/2}. (Note: this setting of {t} also satisfies {t \in (0,1/2)}, which we needed from our previous calculations.)

Almost done: let’s take stock of the situation. We observed that {\| A(x) \|_2^2} was distributed like a sum of squares of Gaussians, and using that we proved that

\displaystyle  \Pr[ \| A(x) \|_2^2 > 1 + \varepsilon ] \leq \exp( - k\varepsilon^2/8 ) \leq \delta/2

for {k = \frac{8}{\varepsilon^2} \ln \frac{2}{\delta}}. A similar calculation bounds the lower tail, and finishes the proof of Lemma 2.

Citations: The JL Lemma was first proved in this paper of Bill Johnson and Joram Lindenstrauss. There have been several proofs after theirs, usually trying to tighten their results, or simplify the algorithm/proof (see citations in some of the newer papers): the proof above follows some combination of the proofs in this STOC ’98 paper of Piotr Indyk and Rajeev Motwani, and this paper by Sanjoy Dasgupta and myself.

2. The data stream model

The JL map we considered was a linear map, and that has many advantages. One of them is that we can use it in a distributed context: if {t} players each have a vector {\vec{y}_i} and each knows the JL matrix {A}, then to compute {A(\sum_i \vec{y}_i)} each person can just compute {A\vec{y}_i}, send their answers out, and then someone can sum up the answers to get {\sum_i A(\vec{y}_i) = A(\sum_i \vec{y}_i)}. Since these vectors {A\vec{y}_i} are smaller than {\vec{y}_i} (they lie in {{\mathbb R}^k} instead of {{\mathbb R}^D}), this can result in significant savings in communication. (We need all players to know the matrix {A}, but if they have shared randomness they can generate this matrix themselves.)

This same idea is useful in the context of data streaming: suppose you have a data stream of a large number of elements {\sigma_1, \sigma_2, \ldots,} whizzing past you, each element {\sigma_j} drawn from the universe {[D]}. This stream defines a frequency vector {\vec{x} \in {\mathbb R}^D}, where {x_i} is the number of times element {i} is seen. People working on data streams want to calculate statistics of this vector {\vec{x}}—e.g., how many non-zeroes does it have? What is the {\ell_1} length of this? (Duh! it’s just the length of the data stream.) What is {\sum_i x_i^2 = \| \vec{x} \|_2^2}? Etc.

The Space Crunch. All this can be trivially done if we use {D} space to actually store the vector {\vec{x}}. Suppose we do not want to store the frequency vector explicitly, but are OK with approximate answers. We can use JL or similar schemes to approximately calculate {\sum_i x_i^2}. Suppose {A} is a random {k \times D} Gaussian matrix, then by the guarantee of the JL lemma, the estimate {\|A\vec{x}\|_2^2 \in (1 \pm \varepsilon) \|\vec{x}\|_2^2} with probability {1 - \delta}, if {k = \Omega(1/\varepsilon^2 \log 1/\delta)}. (Note: this is the error for a single query—so we’re not guaranteeing the counts at all times are close, just at the time the query is made.)

And the algorithm is simple: maintain a vector {y \in {\mathbb R}^k}, initially zero. When the element {j \in [D]} comes by, add in the {j^{th}} column of {A} to {y}. Finally, answer with {\|y\|_2^2}. (If you have to answer {t} queries, choose {k} appropriately larger.)
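Here is a minimal sketch of that streaming algorithm (illustrative Python; the class name and parameters are my own). Note that it stores the matrix {A} explicitly, which is exactly the objection raised next:

```python
import numpy as np

class GaussianF2Sketch:
    """Streaming estimate of ||x||_2^2, where x is the frequency vector of the stream."""
    def __init__(self, D, k, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, D)) / np.sqrt(k)  # the k x D JL matrix
        self.y = np.zeros(k)                               # y = A x, maintained incrementally

    def update(self, j):
        self.y += self.A[:, j]          # element j in [D] arrives: add the j-th column of A

    def estimate(self):
        return float(self.y @ self.y)   # ||A x||_2^2, close to ||x||_2^2 with high probability

sk = GaussianF2Sketch(D=1000, k=400)
for j in [3, 3, 7, 3, 999]:
    sk.update(j)
print(sk.estimate())                    # true value: 3^2 + 1^2 + 1^2 = 11
```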

Of course, you’ve realized I am cheating. In order to save space we used JL. But the JL matrix itself uses {kD} entries, which is a lot of space, much more than the {D} entries of the frequency vector {\vec{x}}! Also, we now need to maintain a matrix of reals, whereas {\vec{x}} just has integers!

We can handle both issues. The former issue can directly be handled by using a pseudorandom generator that “fools” low-space computation—we will not talk about this solution in this lecture. Instead we’ll give a different (though weaker) solution which handles both issues: it will use less space, and will maintain only integer values (if the input has integers).

3. Using random signs instead of Gaussians

While Gaussians have all kinds of nice properties, they are real-valued distributions and hence require attention to precision. How about populating {A} with draws from other, simpler distributions? How about setting each {M_{ij} \in_R \{-1,+1\}}, and letting {A = \frac{1}{\sqrt{k}} M}? (A random sign is also called a Rademacher random variable, btw, the name Bernoulli being already taken for a random bit in {\{0,1\}}.)

Now, we want to study the properties of

\displaystyle  Z := \sum_{i = 1}^k \left( \sum_{j = 1}^D A_{ij} \cdot x_j \right)^2. \ \ \ \ \ (1)

To keep subscripts to a minimum, consider the inner expression

\displaystyle  Y = (\sum_j X_j \cdot x_j)^2

where each {X_j \in_R \{-1, 1\}}. Then

\displaystyle  \begin{array}{rcl}  E[Y] &= E[(\sum_j X_j x_j)(\sum_l X_l x_l)] \\ &= E[ \sum_j X_j^2 x_j^2 + \sum_{j \neq l} X_jX_l x_jx_l ] \\ &= \sum_j E[X_j^2] x_j^2 + \sum_{j \neq l} E[X_jX_l] x_jx_l = \sum_j x_j^2. \end{array}

if the {X_j}‘s are pairwise independent, since {X_j^2 = 1} and {E[X_jX_l] = E[X_j]E[X_l]= 0} by independence. Plugging this into (1) and recalling that {A_{ij} \in \{-\frac{1}{\sqrt{k}},+\frac{1}{\sqrt{k}}\}}, we get

\displaystyle  E[ Z ] = \sum_{i = 1}^k \frac1k \sum_j x_j^2 = \| x \|_2^2.

Just what we like! To show that {Z} is indeed close to its mean, we will use Chebyshev, and this requires us to compute the variance of {Z}.

If the rows of {A} are independent, then {\text{Var}(Z)} is the sum of the variances from each row, which in terms of the variable {Y} defined above is:

\displaystyle  \text{Var}(Z) = \sum_{i = 1}^k \frac{1}{k^2} \text{Var}(Y) = \frac{\text{Var}(Y)}{k}.

But {\text{Var}(Y) = E[Y^2] - E[Y]^2}, and we know what {E[Y]^2} is. For the other term,

\displaystyle  \begin{array}{rl}  E[Y^2] &= E[ \sum_{p,q,r,s} X_pX_qX_rX_s x_px_qx_rx_s ] \\ &= \sum_p E[ X_p^4 x_p^4] + 6\sum_{p < q} E[X_p^2X_q^2 x_p^2x_q^2] + \text{ other terms} \\ &= \sum_p x_p^4 + 6\sum_{p < q} x_p^2x_q^2 \end{array}

(The other terms disappear because of {4}-wise independence.) And plugging this into the definition of {\text{Var}(Y)}, we get

\displaystyle \text{Var}(Y) = \sum_p x_p^4 + 6\sum_{p < q} x_p^2x_q^2 - (\sum_p x_p^2)^2 = 4 \sum_{p < q} x_p^2x_q^2 \leq 2 E[Z]^2.

Interestingly, the variance of {Y} is at most twice its squared mean—that’s good, since the variance of {Z} (which was the final answer, obtained by taking the average of {k} such variables) is {1/k} times as much, because averaging reduces the variance. So {\text{Var}(Z) \leq \frac2k E[Z]^2}. And finally, we can set {k = \frac{2}{\varepsilon^2 \delta}} and use Chebyshev to get

\displaystyle  \Pr[ Z \not\in (1 \pm \varepsilon) E[Z] ] \leq \frac{\text{Var}(Z)}{(\varepsilon E[Z])^2} \leq \delta.

Great! So take a {k \times D} matrix {A} whose {k = \frac{2}{\varepsilon^2 \delta}} rows are independent, each row having {\{-1,+1\}} values drawn from a {4}-wise independent sample space. We maintain a {k}-dimensional vector {y}, and whenever an element {j \in [D]} comes by in the stream, we just add the {j^{th}} column of {A} to {y}. When we want the answer, we reply with {\frac{1}{\sqrt{k}} \| y\|} (so {\frac1k \|y\|^2} is the estimate of {\|\vec{x}\|_2^2})—this will be correct with probability at least {1 - \delta}.

Why {4}-wise independence? Well, the calculation of {\text{Var}(Z)} only used the fact that any four entries of each row behaved independently of each other. And it is possible to generate {D} values from {\{-1,+1\}} which are {4}-wise independent, using hash functions that require only {O(\log D)} bits of space. (We’ll talk more about this later in the course.) So the total space usage is: {O(k \log D)} bits to store the hash functions, {O(k \log (LD)) = O(\varepsilon^{-2} \delta^{-1} \log (LD))} to store the vector {y} if the frequency of each element is at most {L}, and that’s it.
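Putting the pieces together, here is a minimal Python sketch of this sign-based estimator, with each row’s signs generated by a random degree-3 polynomial over a prime field (one standard way to get 4-wise independence). The class names, the prime, and the parity trick are illustrative assumptions; the parity map is only (1/p)-close to unbiased, which exact constructions over GF(2^m) or via BCH codes avoid.

```python
import numpy as np

P = 2_147_483_647                          # a prime larger than the universe size D

class FourWiseSigns:
    """One row of signs sigma(j) in {-1,+1}, 4-wise independent.

    A random degree-3 polynomial over Z_P is stored as just 4 coefficients
    (O(log D) bits); the sign is the parity of the hash value.
    """
    def __init__(self, rng):
        self.coeffs = rng.integers(0, P, size=4)

    def sign(self, j):
        h = 0
        for c in self.coeffs:              # Horner's rule mod P
            h = (h * j + int(c)) % P
        return 1 if h % 2 == 0 else -1

class AMSSketch:
    """Estimate sum_j x_j^2 with k independent rows of 4-wise independent signs."""
    def __init__(self, k, seed=0):
        rng = np.random.default_rng(seed)
        self.rows = [FourWiseSigns(rng) for _ in range(k)]
        self.y = np.zeros(k)

    def update(self, j, count=1):
        self.y += count * np.array([row.sign(j) for row in self.rows])

    def estimate(self):
        return float(np.mean(self.y ** 2))  # average of k unbiased estimators of ||x||_2^2

sk = AMSSketch(k=200)
for j in [3, 3, 7, 3, 999]:
    sk.update(j)
print(sk.estimate())                        # should be near the true value 11
```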

Citations: This scheme is due to the Gödel prize winning paper of Noga Alon, Yossi Matias, and Mario Szegedy. There has been a lot of interesting work on moment estimation: see, e.g., this STOC 2011 paper of Daniel Kane, Jelani Nelson, Ely Porat and David Woodruff on getting lower bounds for {\ell_p}-norms of the vector {x}, and the many references therein.

4. Subgaussian Behavior

In the previous section, we saw that if each row of the matrix {A} was drawn from a {4}-wise independent sample space (and hence generating any column of {A} could be done in {O(k \log n)} space), setting {k = O(\varepsilon^{-2} \delta^{-1})} would suffice to give answers within {(1\pm \varepsilon)} with probability at least {1- \delta}. Note that the number of rows went from {O(\varepsilon^{-2} \log \delta^{-1})} to {O(\varepsilon^{-2} \delta^{-1})}; this increase is typical of cases where we only use the second moment (and limited independence) instead of all the moments (complete independence).

So suppose we did have the luxury of full independence, could we match the JL bound using Rademacher matrices? Or does moving to the {\{-1, 1\}} case already lose something in the performance? It turns out we can also prove Lemma 2 for a Rademacher matrix, losing only constants—we’ll now prove this.

Let’s look over the proof in Section 1, and see what we need to do. We take an arbitrary unit vector {x}, and define

\displaystyle  Y \sim \langle R_1, R_2, \ldots, R_D\rangle \cdot x = \sum_i x_i\, R_i \qquad \text{and} \qquad Z = \frac1k \sum_{i=1}^k Y_i^2

for {R_i \in_R \{-1,1\}}, and the {Y_i}'s being i.i.d. with {Y_i \sim Y}. If we could show that

  • {E[Z] = \| x\|_2^2}, and
  • {E[ e^{tY^2} ] \leq \frac{1}{\sqrt{1 - ct}}} for some constant {c},

then the rest of the proof of Section 1 does not use any other facts about Gaussians. And the first fact {E[Z] = \| x\|_2^2} follows by the calculations from the previous section, so all we need to do is to bound the moment generating function for {Y^2}!

We can do this by explicit calculations, but instead let’s give a useful abstraction:

Definition 4 A random variable {V} is said to be subgaussian with parameter {c} if for all real {s}, we have {E[ e^{s V} ] \leq e^{c s^2}}.

(You can define subgaussian-ness alternatively as in these notes by Roman Vershynin, which also show the two definitions are equivalent for symmetric distributions.) A simple calculation shows that for {G \sim N(0,1)} we have {E[ e^{sG} ] = e^{s^2/2}}—good to know that the Gaussian is also subgaussian!

The following lemma gives a slick way to bound the mgf for the square of a subgaussian, now that we’ve done the hard work for the Gaussians.

Lemma 5 If {V} is subgaussian with parameter {c}, then {E[ e^{sV^2} ] \leq \frac{1}{\sqrt{1 - 4c\, s}}} for {s > 0}.

Proof: Well, suppose {G \sim N(0,1)} is an independent Gaussian, then

\displaystyle  E_V[ e^{sV^2} ] = E_{G,V}[ e^{\sqrt{2s}V \, G} ]

by the calculation we just did for Gaussians. (Note that we’ve just introduced a Gaussian into the mix, without any provocation! But it will all work out.) Let’s just rewrite that as

\displaystyle  E_{G,V}[ e^{\sqrt{2s}V \, G} ] = E_G[ E_V[ e^{(\sqrt{2s}G) V} ] ].

Using the {c}-subgaussian behavior of {V} we bound this by

\displaystyle  E_G[ e^{c(\sqrt{2s}|G|)^2} ] = E_G[ e^{2cs G^2} ].

Finally, the Gaussian moment-generating-function calculation from Section 1.4 (with {t = 2cs}) gives this to be {\frac{1}{\sqrt{1 - 4cs}}}, provided {4cs < 1}. \Box

Good. Now if {Y} were subgaussian, we’d be done. We know that {Y} is a weighted sum of Rademacher variables. A Rademacher random variable is indeed {1/2}-subgaussian:

\displaystyle  E[ e^{s R} ] = \frac{ e^s + e^{-s} }{2} = \cosh{s} = 1 + \frac{s^2}{2!} + \frac{s^4}{4!} + \cdots \leq e^{s^2/2}.

And if {V_i}‘s are independent and {c}-subgaussian, and {\| x\|_2 = 1}, then {V = \sum_i x_i V_i} has

\displaystyle  E[ e^{sV} ] = E[ e^{\sum_i (sx_i) V_i} ] \leq \prod_i e^{c(sx_i)^2} = e^{cs^2 \sum_i x_i^2} = e^{cs^2}.

To summarize: {R_i}‘s are {1/2}-subgaussian, so {Y = \sum_i x_iR_i} is too. And hence {E[ e^{tY^2} ] \leq \frac{1}{\sqrt{1 - 2t}}} for {\{-1,+1\}}-random variables as well. This, in turn, completes the proof that the Rademacher matrix also has the JL property! Note that the JL matrix {A} now just requires us to pick {kD = O(D \varepsilon^{-2} \log \delta^{-1})} random bits (instead of {kD} random Gaussians); also, there are fewer precision issues to worry about. One can consider other distributions to stick into the matrix {A}—all you need to show is that {Z} has the right mean, and that the entries are subgaussian.
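For comparison with the Gaussian sketch from Section 1.2, here is the sign-matrix version; the only change is the entry distribution (again an illustrative numpy sketch, not the lecture’s code):

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_jl_matrix(D, k):
    """A k x D matrix of i.i.d. uniform +-1 entries, scaled by 1/sqrt(k)."""
    return rng.choice([-1.0, 1.0], size=(k, D)) / np.sqrt(k)

D, k = 10_000, 800
A = rademacher_jl_matrix(D, k)
x = rng.standard_normal(D)
x /= np.linalg.norm(x)
print(np.linalg.norm(A @ x) ** 2)   # again typically within a few percent of 1
```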

Citations: The scheme of using Rademacher matrices instead of Gaussian matrices for JL was first proposed in this paper by Dimitris Achlioptas. The idea of extending it to subgaussian distributions appears in this paper of Indyk and Naor, and this paper of Matousek. The paper of Klartag and Mendelson generalizes this even further.

BTW, one can define subgaussian distributions as ones that satisfy {E[e^{sV}] \leq e^{cs^2}} only for {s > 0}, or as variables for which {\Pr[ V \geq \lambda ] \leq e^{-c\lambda^2}} for {\lambda > 0} (the upper tail is subgaussian), and prove JL bounds—see, e.g., the paper of Matousek—but it does not matter for distributions symmetric about {0} with bounded variance, since these definitions are then essentially the same.

Fast J-L: Do we really need to plug in non-zero values into every entry of the matrix {A}? What if most of {A} is filled with zeroes? The first problem is that if {x} is a very sparse vector, then {Ax} might be zero with high probability. Achlioptas showed that having a random two-thirds of the entries of {A} being zero still works fine; the paper of Nir Ailon and Bernard Chazelle showed that if you first hit {x} with a suitable matrix {P} which causes {Px} to be “well-spread-out” whp, then {\|APx\| \approx \|x\|} would still hold for a much sparser {A}. Moreover, this {P} requires much less randomness, and furthermore, the computations can be done faster too! There has been much work on fast and sparse versions of JL: see, e.g., this SODA 11 paper of Ailon and Edo Liberty, and this arxiv preprint by Daniel Kane and Jelani Nelson. Jelani has some notes on the Fast JL Transform.

Compressed Sensing: Finally, the J-L lemma is closely related to compressed sensing: how to reconstruct a sparse signal using very few measurements. See these notes by Jiri Matousek, or these by Baraniuk and others for a proof of the beautiful connection. I will say more about this connection in a later post.

Lecture #16: Odds and ends

Some examples for the CKR decomposition scheme

Varun asked why the CKR algorithm differs in so many ways from Bartal’s procedure:

  • it chooses a single radius instead of choosing an independent radius for each piece it carves out,
  • it chooses the radius from a uniform distribution,
  • it picks the next vertex randomly (instead of arbitrarily), and moreover,
  • it also grows regions from vertices that have been captured in previously carved out pieces.

Here are some examples that might clarify the situation:

Suppose we choose a random radius R uniformly from [r/4,r/2], but independently for each center. Then in the following example, each time a leaf is picked, the unit-length edge (u,v) will be cut with probability 2/r. If there are n leaves, the edge will be cut with probability about Theta(n/r) >> O(log n)/r.

Example 1

Now, suppose we choose a single radius R uniformly from [r/4,r/2], but choose the centers in arbitrary (adversarial) order. In the following example, if we choose the centers in the order l_{r/2-1}, …, l_2, l_1, then we cut the edge (u,v) for sure (with probability 1). Note that in this case, the random order means there’s a good chance that early on in the process, we pick some vertex l_i with some small value of i. Since this leaf is close to the edge (u,v), it will put both u and v in the same set.

Example 2

So yeah, once we go with a uniformly random choice of R (instead of choosing it from a geometric distribution), we run into trouble if we re-sample R independently each time we grow a region, or if we choose the next vertex to grow from in a worst-case fashion.

Finally, what about growing balls from centers that have already been carved out? That is important for the analysis, because it makes the process depend very mildly on the past evolution. Would the algorithm break down if we did not do that? I am blanking on a bad example right now, but I think there should be one. Let me know if you see it.

Some citations for k-server

In STOC 2008, Cote, Meyerson, and Poplawski gave a randomized algorithm for the k-server problem on certain special kinds of HSTs that achieved a poly-logarithmic competitive ratio. In SODA 2009, Bansal, Buchbinder, and Naor abstracted out certain “convexity” properties, which if we could prove, would give a polylogarithmic competitive ratio for general HSTs, and hence for all metric spaces using the Theorems we saw in class. (See also their ICALP 2009 paper.) It’d be great to make progress on this problem.

Lecture #16: Distance-preserving trees (part II)


1. Embeddings into Distributions over Trees


In this section, we prove the following theorem using tree embeddings (and then, in the following section, we improve it further to {O(\log n)}).

Theorem 1 Given any metric {(V,d)} with {|V| = n} and aspect ratio {\Delta}, there exists an efficiently sampleable distribution {\mathcal{D}} over spanning trees of {V} such that for all {u,v\in V}:

  1. For all {T \in \textrm{Support}(\mathcal{D})}, {d_T(u,v) \geq d(u,v)}, and
  2. {\mathop{\mathbb E}_{T\sim \mathcal{D}}[d_T(u,v)] \leq O(\log n \log \Delta) \; d(u,v)}.


To prove this theorem, we will use the idea of a low diameter decomposition. Given a metric space {(V, d)} on {|V| = n} points and a parameter {r \in {\mathbb R}_+}, a (randomized) low-diameter decomposition is an efficiently sampleable probability distribution over partitions of {V} into {S_1 \uplus S_2 \uplus S_3 \uplus \dots \uplus S_t} such that

  1. (Low Radius/Diameter) For all {S_i}, there exists {c_i \in V} such that for all {u \in S_i}, {d(c_i, u) \leq r/2}. Hence, for any {u, v \in S_i}, {d(u,v) \leq r}.
  2. (Low Cutting Probability) For each pair {u,v}, {\Pr[u,v \text{ lie in different } S_i's ] \leq \beta \; \frac{d(u,v)}{r}} with {\beta = O(\log n)}.

We’ll show how to construct such a decomposition in the next section (next lecture), and use such a decomposition to prove Theorem 1.

Consider the following recursive algorithm, which takes as input a pair {(U,i)} where {U \subseteq V} is a set of vertices of diameter at most {2^i}, and returns a rooted tree {(T, root)}.


  1. Apply the low-diameter decomposition to {(U,d)} with the parameter {r = 2^{i-1}} to get the partition {S_1,\ldots,S_t}.
  2. Recurse: Let {(T_j,root_j) \leftarrow} TreeEmbed({S_j,i-1}). As a base case, when {S_j} is a single point, simply return that point.
  3. For every tree {T_j} with {j > 1}, add the edge {(root_1,root_j)} with length {2^i}. This is a new tree which we denote {T}.
  4. Return the tree/root pair {(T,root_1)}.

Recall that since the low diameter decomposition is randomized, this algorithm defines a distribution over trees over {U}. To build the tree for {V}, we first rescale so that for all {u,v \in V}, {d(u,v) \geq 1} and {d(u,v) \leq \Delta \approx 2^\delta}. We define the distribution {\mathcal{D}} as the one obtained by calling TreeEmbed{(V,\delta)}.
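In code, the recursion looks roughly as follows (a minimal Python sketch; the decompose callback stands in for whichever low-diameter decomposition is used, and its interface is an assumption of this sketch):

```python
def tree_embed(U, d, i, decompose):
    """TreeEmbed(U, i): return (edges, root) for a random tree over the points U.

    d(u, v) is the metric; decompose(U, d, r) is any randomized low-diameter
    decomposition returning parts of radius at most r/2 (e.g. the Geom-Regions
    or CKR procedures of the next section).  edges is a list of
    (endpoint, endpoint, length) triples.
    """
    U = list(U)
    if len(U) == 1:                          # base case: a single point
        return [], U[0]
    edges = []
    parts = decompose(U, d, 2 ** (i - 1))    # each part has diameter at most 2^{i-1}
    roots = []
    for S in parts:
        sub_edges, sub_root = tree_embed(S, d, i - 1, decompose)
        edges += sub_edges
        roots.append(sub_root)
    for r_j in roots[1:]:
        edges.append((roots[0], r_j, 2 ** i))  # join subtree roots to root_1 with edges of length 2^i
    return edges, roots[0]

# usage: rescale d so all distances lie in [1, Delta], set delta = ceil(log2(Delta)),
# and call tree_embed(V, d, delta, decompose).
```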

Lemma 2 For all {x,y \in V}, {d_T(x,y) \geq d(x,y)} for all {T \in support(\mathcal{D})}.

Proof: Fix {x} and {y}, and let {i} be such that {d(x,y) \in (2^{i-1},2^i]}. Consider the invocation of TreeEmbed{(U,i)} such that {x \in U}. First, we examine the case in which {y \in U}. By the definition of the low diameter decomposition, since {d(x,y) > 2^{i-1}}, {x} and {y} will fall into separate parts of the partition obtained in Step 1, and so we will have {d_T(x,y) \geq 2^i}, the length of the edge placed between different subtrees. In the case in which {y \not\in U}, then it must be that {x} and {y} have been separated at a higher level {i'} of the recursion, are consequently separated by a higher subtree edge, and hence {d_T(x,y) \geq 2^{i'} > 2^i}. \Box

Lemma 3 For all {x,y \in V}, {E_{T \sim \mathcal{D}}[d_T(x,y)] \leq d(x,y)\cdot O(\log \Delta \, \log n)}


Proof: We begin with two easy subclaims. Suppose {(T_U,root) \leftarrow} TreeEmbed{(U,i)}:

  1. Claim 1: {d_{T_U}(root,x) \leq 2^{i+1}} for all {x \in U}. By induction, {x} lies in some piece {S_j} of the partition having diameter at most {2^{i-1}} and hence inductively is at distance at most {2^{i-1+1}} from its root {root_j}. That root is connected to the root {root} by an intertree edge of weight {2^i}, giving us {2^{i+1}} in total.
  2. Claim 2: If {x,y \in U}, then {d_{T_U}(x,y) \leq 2\cdot 2^{i+1}}. From the previous claim, each {x} and {y} is at distance at most {2^{i+1}} from {root}, distances are symmetric, and the triangle inequality applies.

We now have from the definition:

\displaystyle  \begin{array}{rcl}  E_T[d_T(x,y)] &\leq& \sum_{i=\delta}^0\Pr[(x,y)\textrm{ first separated at level } i]\cdot 4\cdot 2^i \\ &\leq& \sum_{i=\delta}^0 \beta\;\frac{d(x,y)}{2^{i-1}} \cdot 2^i\cdot 4 \\ &=& (\delta +1)\; 8\beta\, d(x,y) \end{array}

where the first inequality follows from our subclaims, the second follows from the property of the low diameter decomposition. Setting {\beta = O(\log n)} and {\delta = O(\log \Delta)} completes the proof. \Box

The two lemmas above prove Theorem 1. How do we implement these low diameter decompositions? And how can we get the promised {O(\log n)}? Keep reading…


2. Low Diameter Decompositions


Recall the definition of a (randomized) low-diameter decomposition from above: given a metric {(V,d)} and a bound {r}, we want a partition with pieces of radius at most {r/2}, and want vertices to be separated with “small” probability {\beta \frac{d(x,y)}{r}} (i.e., proportional to their distance, and inversely proportional to {r}).

Before we proceed, think about how you’d get such a decomposition for a line metric, or a tree metric, with {\beta = O(1)}; moreover, you cannot hope to get subconstant {\beta = o(1)} even for the line. So the theorem says that general graphs lose a factor {O(\log n)} more, which is not bad at all! (And this factor is existentially optimal; we will show a tight example.)


2.1. Algorithm I: Geometric Region Growing


To make our life easier, we’ll assume that all distances in the metric are at least {r/n^2}. (We can enforce this via a pre-processing step without much effort; I’ll come back to it.)

The algorithm just picks a “truncated” geometric distance {R}, carves out a piece of radius {R} around some vertex, and repeats until the metric is eaten up.

Geom-Regions{(V, d, r)}:

  1. Choose {R \sim \mathtt{Geom}(\frac{4 \ln n}{r})}; if {R > r/2}, then set {R \gets r/2}.
  2. Pick an arbitrary vertex {u \in V}, and set {S_1 \gets \{ v \in V \mid d(u,v) \leq R \}}
  3. Return {\{ S_1 \} \cup } Geom-Regions{(V \setminus S_1, d, r)}.

Clearly, the radius bound is maintained by the fact that {R \leq r/2} with probability {1}.
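Here is a minimal Python sketch of the Geom-Regions procedure (the function signature, the use of numpy’s geometric sampler, and the clamping of the bias are illustrative assumptions):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def geom_regions(V, d, r):
    """Partition V into pieces of radius at most r/2: a sketch of Geom-Regions.

    d(u, v) is the metric.  Each piece is a ball of truncated geometric radius
    around an arbitrary not-yet-covered vertex.
    """
    n = len(V)
    if n <= 1:
        return [set(V)]
    p = min(1.0, 4.0 * math.log(n) / r)      # the geometric's success probability (clamped for tiny r)
    remaining = set(V)
    parts = []
    while remaining:
        u = next(iter(remaining))            # an arbitrary remaining vertex
        R = min(int(rng.geometric(p)), r / 2)   # truncated geometric radius
        piece = {v for v in remaining if d(u, v) <= R}
        parts.append(piece)
        remaining -= piece
    return parts
```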

What’s the chance that {x,y} lie in separate parts? Let’s view this process as picking a vertex {u} and starting with a ball of radius zero around it; then we repeatedly flip a coin with bias {p = \frac{4 \ln n}{r}} (the parameter of the geometric distribution above), increasing the radius by one after each tails, until either we see a heads or we reach {r/2} tails, at which point we cut out the piece. And then we pick another vertex, and repeat the process.

Consider the first time when one of these lies in the current ball. Note that either this ball will eventually contain both of them, or will separate them. And to separate them, it must make a cut within the next {d(x,y)} steps. The chance of this is at most the chance of seeing a heads from a bias-{p} coin in {d(x,y)} steps, plus the chance that a {\mathtt{Geom}(p)} r.v. sees more than {(2 \ln n)/p} tails in a row. Using a naive union bound for the former, we get

\displaystyle  d(x,y) \cdot p + (1 - p)^{(2 \ln n)/p} \leq \frac{d(x,y)}{r} \cdot 4 \ln n + \frac{1}{n^2}.

We now use the fact that all distances are at least {r/n^2} to claim that {1/n^2 \leq \frac{d(x,y)}{r}} and hence the probability of {x,y} separated is at most {(4 \ln n + 1) d(x,y)/r}, which proves the second property of the decomposition.

Finally, the loose ends: to enforce the minimum-distance condition that {d(x,y) \geq r/n^2}, just think of the metric as a complete graph with edge-lengths {\ell_{xy} = d(x,y)}, contract all edges {(x,y)} with {d(x,y) < r/n^2}, and recompute edge lengths to get the new metric {d' \leq d}. Running the decomposition Geom-Regions{(V, d', r/2)} on this shrunk metric, and then unshrinking the edges, will ensure that each pair is separated either with probability {0} (if it has length {< r/n^2}), or with probability at most {(8 \ln n + 2) d(x,y)/r}. And finally, since the output had radius at most {r/4} according to {d'}, and any path has at most {n} nodes so its length can change by at most {n\cdot r/n^2 \leq r/4} (for {n \geq 4}), the new radius is at most {r/2}!

Another advantage of this shrinking preprocessing: a pair {x,y} is separated only when {r \leq d(x,y) \cdot n^2}, and it is separated for sure when {r < d(x,y)} (since the pieces have diameter at most {r}). Using this observation in the calculation from the previous section can change the {O(\log n \log \Delta)} to just {O(\log n \min(\log n, \log \Delta))}. But to get the ultimate {O(\log n)} guarantee, we’ll need a different decomposition procedure.


2.2. Algorithm II: The CKR Decomposition


Theorem 4 (The Better Decomposition) There exists an efficiently sampleable probability distribution {\mathcal{D}} over partitions with parts having radius at most {r/2} such that

\displaystyle \Pr[u,v\ \textrm{separated by the partition}] \leq \frac{4\;d(u,v)}{r}\log\left(\frac{|Ball(u,r/2)|}{|Ball(u,r/8)|}\right)

where {Ball(x,r) = \{ y : d(x,y) \leq r \}}.


The procedure for the decomposition is a little less intuitive, but very easy to state:

CKR Decomposition{(V, d, r)}:

  1. Choose {R \in_R [\frac{r}4, \frac{r}{2}]} uniformly at random.
  2. Choose a random permutation {\pi\!: V \rightarrow V} uniformly at random. 
  3. Consider the vertices one by one, in the order given by {\pi}. When we consider {w}, we assign all the yet-unassigned vertices {v} with {d(w,v) \leq R} to {w}‘s partition.

For example, suppose the ordering given by {\pi} is {v_1, v_2, v_3, v_4}: first {v_1} grabs every vertex within distance {R} of it, then {v_2} grabs every yet-unassigned vertex within distance {R} of it, and so on. (The figure illustrating this coverage is omitted here.)
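Here is a minimal Python sketch of the CKR procedure (names and data structures are illustrative assumptions; note that every vertex in the random order gets to grab, even if it has itself already been assigned to an earlier piece):

```python
import numpy as np

rng = np.random.default_rng(0)

def ckr_decomposition(V, d, r):
    """CKR decomposition into pieces of radius at most r/2 (a minimal sketch).

    d(u, v) is the metric.  A single radius R in [r/4, r/2] and a single random
    ordering are shared by all centers; each center grabs every still-unassigned
    vertex within distance R of it.
    """
    V = list(V)
    R = rng.uniform(r / 4, r / 2)
    order = rng.permutation(len(V))
    assigned = {}                              # vertex -> index (in `order`) of its center
    for idx in order:
        w = V[idx]
        for v in V:
            if v not in assigned and d(w, v) <= R:
                assigned[v] = idx
    parts = {}
    for v, idx in assigned.items():
        parts.setdefault(idx, set()).add(v)
    return list(parts.values())
```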

This construction directly implies the low-radius property, restated in the following claim.

Lemma 5 (Low Radius) The output of the algorithm has the property that for all {S_i}, there exists {c_i \in V} such that for all {u \in S_i}, {d(c_i, u) \leq r/2}.


The real work is in showing that for each pair {(u,v)}, it is separated with small probability. Before proving this, let us state two definitions useful for the proof. For the analysis only: suppose we re-number the vertices {w_1, w_2, \dots, w_n} in order of increasing distance from the closer of {u,v}.

  • (Settling) At some time instant in this procedure, one (or both) of {u} or {v} gets assigned to some {w_i}. We say that {w_i} settles the pair {(u,v)}.
  • (Cutting) At the moment the pair is settled, if only one vertex of this pair is assigned, then we say that {w_i} cuts the pair {(u,v)}.

According to these definitions, each pair is settled at exactly one time instant in the procedure, and it may or may not be cut at that time. Of course, once the pair is settled (with or without being cut), it is never cut in the future.

Now to bound the separation probability. Consider {w_j}, and let {d(w_j,u)=a_j} and {d(w_j,v)=b_j}. Assume {a_j<b_j} (the other case is identical). If {w_j} cuts {(u,v)} when the random values are {R} and {\pi}, the following two properties must hold:

  1. The random variable {R} must lie in the interval {[a_j,b_j]} (else either neither or both of {u} and {v} would get assigned to {w_j}). 
  2. The node {w_j} must come before {w_1,\dots, w_{j-1}} in the permutation {\pi}.

Suppose not, and one of them came before {w_j} in the permutation. Since all these vertices are closer to the pair than {w_j} is, then for the current value of {R}, they would have settled the pair (either capturing one or both of the endpoints) at some previous time point, and hence {w_j} would not settle—and hence not cut—the pair {(u,v)}.

With these two properties, we establish

\displaystyle  \begin{array}{rl}  \Pr[\text{pair } (u,v) \text{ is separated}] &= \sum_j \Pr[w_j\text{ cuts the pair } (u,v)] \\ &\leq \sum_j \Pr[R \in [a_j, b_j]\text{ and } w_j\text{ comes before }w_1, \dots, w_{j-1}\text{ in }\pi] \\ &\leq \sum_j \frac{d(u,v)}{(r/2 - r/4)} \cdot \frac1j \leq \frac{4\,d(u,v)}{r}\cdot H_n = \frac{d(u,v)}{r}\;O(\log n) \end{array}

But we wanted to do better than that! No worries, the fix is easy, but clever. First, note that if {d(u,v) \geq r/8} then the probability of separating {u,v} is at most {1 \leq 8\,d(u,v)/r}. So suppose {d(u,v) < r/8}. Now, for {w_j} to cut {(u,v)}, it is not enough that {R \in [a_j, b_j]} and {w_j} comes before all {w_i} for {i < j}. It must also be the case that {w_j} is at most {r/2} from the closer of the pair (say {u}) to even reach one of the vertices, let alone separate them. And at least {r/4} from the further one (say {v}), so that some setting of {R} would have a chance to separate the two. So the distance of {w_j} from {u} must be at most {r}, and at least {r/4 - d(u,v) \geq r/8}, and the same for its distance from {v}. If we restrict the harmonic sum in the final expression to just the vertices that satisfy these bounds, we get the bound

\displaystyle  \frac{d(u,v)}{r/4} \left( \frac1{|B(u, r/8)| + 1} + \frac1{|B(u, r/8)| + 2} + \cdots + \frac1{|B(u, r)|} \right),

and hence the bound in Theorem 4.

Theorem 6 (FRT 2003) Using the decomposition procedure from Theorem 4 in the TreeEmbed algorithm, we get that for all {x,y \in V}:

\displaystyle E_T[d_T(x,y)] \leq d(x,y)\cdot O(\log n)


The proof for the TreeEmbed algorithm remains essentially unchanged, except for the final calculations:

\displaystyle  \begin{array}{rcl}  E_T[d_T(x,y)] &\leq& \sum_{i=\delta}^0\Pr[(x,y)\textrm{ first separated at level } i]\cdot 4\cdot 2^i \\ &\leq& \sum_{i=\delta}^0 \frac{4\,d(x,y)}{2^{i-1}}\cdot\log\left(\frac{|B(x,2^i)|}{|B(x,2^{i-3})|}\right)\cdot 2^i\cdot 4 \\ &=& 32\,d(x,y)\sum_{i=0}^\delta\left(\log(|B(x,2^i)|) - \log(|B(x,2^{i-3})|)\right) \\ &\leq& 32\,d(x,y)\left(\log(|B(x,2^\delta)|) + \log(|B(x,2^{\delta-1})|) + \log(|B(x,2^{\delta-2})|)\right) \\ &\leq& 96\,d(x,y)\log n \end{array}

where the penultimate step follows because the sum telescopes (and dropping the subtracted terms only increases it), and the last step uses {|B(x,\cdot)| \leq n}.

Citations: The {O(\min(\log \Delta, \log n) \log n)} construction was due to Yair Bartal (1996); this substantially improved on the first non-trivial guarantee of {\exp(\sqrt{\log n \log\log n})} due to Alon, Karp, Peleg and West (1992). The low-diameter decomposition is also from Bartal. The {O(\log n)} algorithm is by Fakcharoenphol, Rao, and Talwar (2003), based on the improved decomposition scheme due to Calinescu, Karloff and Rabani (2000).


3. Lower Bounds


Let us show two lower bounds: first, that no randomized low-diameter decomposition can achieve better than {\beta = O(\log n)} for general metrics; and second, that no random tree embedding can do better than {\alpha = O(\log n)} either.


3.1. Lower Bounds for Decompositions


First, given a graph {G = (V,E)} with unit length edges, if we apply a {\beta} decomposition with parameter {r} to the graph metric {d_G}, we will cut each edge with probability {\beta/r}. The expected number of cut edges will be {\beta m/r}. So, for each {r} the probabilistic method says there exists a diameter-{r} partition that cuts at most {\beta m/r} edges.

Let {G} be a graph with {n} nodes and {cn} edges (with {c > 1}), where the girth of the graph (the length of the shortest simple cycle) is at least {g = c'\log n} (for constant {c' < 1}). Such graphs are known to exist; this can be shown by the probabilistic method.

Now, if we set {r = \frac{c'}{3} \log n = g/3} and consider any diameter-{r} partition: we claim no set {S} in this partition can induce a cycle. Indeed, since every cycle has length at least {g}, two furthest points in the cycle would be at distance at least {g/2 = \frac{c'}{2} \log n > r} from each other. So all sets induce a forest, which means the number of internal edges is at most {n-1 < n}. This means at least {(c-1)n} edges are cut.

Cool. For every diameter-{r} partition, at least {(c-1)n} edges are cut because of the large girth property. But there exists one that cuts at most {\beta\, m/r = \beta\, cn/r} edges, because we have a good decomposition algorithm. So now we put the two facts together.

\displaystyle  (c-1)n \leq \beta\, \frac{cn}{r} \implies \beta \geq \frac{c-1}{c}\, r = \Omega(\log n).


3.2. Lower Bounds for Random Tree Embeddings


Suppose there is a distribution {\mathcal{D}} that achieves expected stretch {\alpha} for the large-girth graphs above. Let’s use this to obtain a low-diameter decomposition with cutting parameter {\beta = O(\alpha)}; this will mean {\alpha = \Omega(\beta) = \Omega(\log n)}.

Sample a tree {T} from the distribution, pick an arbitrary vertex {v}, pick a random value {R \in_R [0, r/2)}. Delete all edges that contain points at distance exactly in {\{ R, r/2+R, r+R, 3r/2+R, \ldots \}} from {v}. The remaining forest has components with radius at most {r/2}, and hence diameter at most {r}, in the tree. Since distances in the original graph are only smaller, the diameter of each part will only be smaller in the original graph.

Moreover, given the tree {T}, a pair will be separated with probability at most {\frac{d_T(x,y)}{r/2}}. Taking expectations, the total probability of {x,y} separated is at most

\displaystyle  E_T[ \frac{2\, d_T(x,y)}{r} ] \leq 2\alpha \; \frac{d(x,y)}{r}.

So we have a decomposition scheme with parameter {2 \alpha}. And combining this with the previous lower bound on any decomposition scheme, we get {\alpha = \Omega(\log n)}.


Lecture #15: Distance-preserving trees (part I)


1. Metric Spaces


A metric space is a set {V} of points, with a distance function {d: V \times V \rightarrow {\mathbb R}_{\geq 0}} that satisfies {d(x,x) = 0} for all {x \in V}, symmetry (i.e., {d(x,y) = d(y,x)}), and the triangle inequality (i.e., {d(x,y) + d(y,z) \geq d(x,z)} for all {x,y,z \in V}). Most of the computer science applications deal with finite metrics, and then {n} denotes the number of points {|V|}.

There are many popular problems which are defined on metric spaces:

  • The Traveling Salesman Problem (TSP): the input is a metric space, and the goal is to find a tour {v_1, v_2, \ldots, v_n, v_{n+1} = v_1} on all the {n} nodes whose total length {\sum_{i = 1}^n d(v_i, v_{i+1})} is as small as possible. This problem is sometimes defined on non-metrics as well, but most of the time we consider the metric version. 

    The classic {3/2}-approximation of Christofides (1976) is still the best known for general metrics; a paper of Oveis Gharan, Saberi and Singh (2010) uses randomization to beat {3/2} (by a tiny {\epsilon}) for the special case of graphic TSP, making progress on this long-standing open problem. The best hardness result for this problem is something like {1.005} due to Papadimitriou and Vempala.

  • The {k}-Center/{k}-Means/{k}-Median problems: the input is a metric space {(V,d)}, and the goal is to choose some {k} positions {F} from {V} as “facilities”, to minimize some objective function. In {k}-center, we minimize {\max_{v \in V} d(v, F)}, the largest distance from any client to its closest facility; here, we define the distance from a point {v} to a set {S} as {d(v,S) := \min_{s \in S} d(v,s)}. In {k}-median, we minimize {\sum_{v \in V} d(v, F)}, the total (or equivalently, the average) distance from any client to its closest facility. In {k}-means, we minimize {\sum_{v \in V} d(v, F)^2}, the sum of squared distances from clients to their closest facilities. (Note: to see why these problems are called what they are, consider what happens for the {1}-means/medians problem on the line.) The best algorithms for {k}-center give us a {2}-approximation, and this is the best possible unless P=NP. The best {k}-median algorithm gives a {(3+\epsilon)}-approximation, whereas the best hardness known for the version of the problem stated above is {(1 + 1/e)} unless P=NP. For {k}-means, the gap between the best algorithm and hardness results is worse for general metric spaces. For geometric spaces, better algorithms are known for {k}-means/medians.
  • The {k}-server problem: this is a classic online problem, where the input is a metric space (given up-front); a sequence of requests {\sigma_1, \sigma_2, \cdots} arrives online, each request being some point in the metric space. The algorithm maintains {k} servers, one each at some {k} positions in the metric space. When the request {\sigma_t} arrives, one of the servers must be moved to {\sigma_t} to serve the request. The cost incurred by the algorithm in this step is the distance moved by the server, and the total cost is the sum of these per-step costs. The goal is to give a strategy that minimizes the total cost of the algorithm. The best algorithm for {k}-server is a {(2k-1)}-competitive deterministic algorithm due to Koutsoupias and Papadimitriou. Since {k}-server contains paging as a special case (why?), no deterministic algorithm can do better than {k}-competitive. It is a long-standing open problem whether we can do better than {2k-1} deterministically, but far more interesting is the question of whether randomization can help beat {2k-1}; the best lower bound against oblivious adversaries is {\Omega(\log k)}, again from the paging problem.


1.1. Approximating Metrics by Trees: Attempt I


A special kind of metric space is a tree metric: here we are given a tree {T = (V,E)} where each edge {e \in E} has a length {\ell_e}. This defines a metric {(V, d_T)}, where the distance {d_T(x,y)} is the length of the (unique) shortest path between {x} and {y}, according to the edge lengths {\ell_e}. In general, given any graph {G = (V,E)} with edge lengths, we get a metric {(V, d_G)}.

Tree metrics are especially nice because we can use the graph theoretic idea that it is “generated” by a tree to understand the structure of the metric better, and hence give better algorithms for problems on tree metrics. For instance:

  • TSP on tree metrics can be solved exactly: just take an Euler tour of the points in the tree.
  • {k}-median can be solved exactly on tree metrics using dynamic programming.
  • {k}-server on trees admits a simple {k}-competitive deterministic algorithm.

So if all metric spaces were well-approximable by trees (e.g., if there were some small factor {\alpha} such that for every metric {M = (V,d)} we could find a tree {T} such that

\displaystyle  d(x,y) \leq d_T(x,y) \leq \alpha d(x,y) \ \ \ \ \ (1)

for every {x,y \in V}), then we would have an {\alpha}-approximation for TSP and {k}-median, and an {\alpha k}-competitive algorithm for {k}-server on all metrics. Sadly, this is not the case: for the metric generated by the cycle graph {C_n}, the best factor we can get in (1) is {\alpha = n-1}. This is what we would get if we just approximated the cycle by a line.

So even for simple metrics like that generated by the cycle (on which we can solve these problems really easily), this approach hits a dead-end really fast. Pity.


1.2. Approximating Metrics by Trees: Attempt II


Here’s where randomization will come to our help: let’s illustrate the idea on a cycle. If we delete a uniformly random edge of the cycle, we get a tree {T} (in fact, a line). Note that the distances in the line are at least those in the cycle.

How much more? For two vertices {x, y} adjacent in the cycle, the edge {(x,y)} still exists in the tree with probability {1 - 1/n}, in which case {d_T(x,y) = d(x,y)}; else, with probability {1/n}, {x} and {y} lie at distance {n-1} from each other. So the expected distance between the endpoints of an edge {(x,y)} of the cycle is

\displaystyle  (1 - 1/n) \cdot 1 + \tfrac1n \cdot (n-1) = 2(1-1/n) = 2(1-1/n) \cdot d(x,y).

And indeed, this also holds for any pair {x, y\in V} (check!),

\displaystyle  E_T[ d_T(x,y) ] \leq 2(1-1/n) \cdot d(x,y)
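Here is a quick numerical sanity check of this bound on a small cycle (illustrative Python, not part of the lecture; it enumerates all {n} choices of the deleted edge):

```python
import numpy as np

n = 12                                    # a small cycle C_n with unit-length edges

def cycle_dist(x, y):
    return min(abs(x - y), n - abs(x - y))

def path_dist_after_deleting(e, x, y):
    """Distance between x and y after deleting the cycle edge (e, e+1 mod n)."""
    # relabel so the deleted edge sits between vertices n-1 and 0 of the path 0,1,...,n-1
    xr, yr = (x - e - 1) % n, (y - e - 1) % n
    return abs(xr - yr)

x, y = 0, 3
expected = np.mean([path_dist_after_deleting(e, x, y) for e in range(n)])
print(expected / cycle_dist(x, y))        # 1.5 here, and never more than 2(1 - 1/n)
```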

But is this any good for us?

Suppose we wanted to solve {k}-median on the cycle, and let {F^*} be the optimal solution. For each {x}, let {f_x^*} be the closest facility in {F^*} to {x}; hence the cost of the solution is:

\displaystyle  OPT = \sum_{x \in V} d(x, f_x^*).

By the expected stretch guarantee, we get that

\displaystyle  \sum_{x \in V} E_T[ d_T(x, f_x^*) ] < 2\, OPT.

I.e., the expected cost of this solution {F^*} on the random tree is at most {2\, OPT}. And hence, if {OPT_T} is the cost of the optimal solution on {T}, we get

\displaystyle  E_T[ OPT_T ] \leq 2\, OPT

Great—we know that the optimal solution on the random tree does not cost too much. And we know we can find the optimal solution on trees in poly-time.

Let’s say {F_T \subseteq V} is the optimal solution for the tree {T}, where the closest facility in {F_T} to {x} is {f_x^T}, giving {OPT_T = \sum_x d_T(x, f_x^T)}. How does this solution {F_T} perform back on the cycle? Well, each distance in the cycle is less than that in the tree {T}, so the expected cost of solution {F_T} on the cycle will be

\displaystyle  E \left[ \sum_x d(x,f_x^T) \right] = \sum_x E[ d(x,f_x^T) ] \leq \sum_x E[ d_T(x, f_x^T) ] = E_T [ OPT_T ] \leq 2\, OPT .

And we have a randomized {2}-approximation for {k}-median on the cycle!


1.3. Popping the Stack


To recap, here’s the algorithm: pick a random tree {T} from some nice distribution. Find an optimal solution {F_T} for the problem, using distances according to the tree {T}, and output this set as the solution for the original metric.

And what did we use to show this was a good solution? That we had a distribution over trees such that

  • every tree in the distribution had distances no less than that in the original metric, and
  • the expected tree distance between any pair {x, y \in V} satisfies {E_T[ d_T(x,y) ] \leq \alpha \cdot d(x,y)} for some small {\alpha}; here {\alpha = 2}.

And last but not least

  • that the objective function was linear in the distances, and so we could use linearity of expectations.

Note that TSP, {k}-median, {k}-server, and many other metric problems have cost functions that are linear in the distances, so as long as the metrics we care about can be “embedded into random trees” with small {\alpha}, we can translate algorithms on trees for these problems into (randomized) algorithms for general metrics! This approach gets used all the time, and is worth remembering. (BTW, note that this general approach does not work for non-linear objective functions, like {k}-center, or {k}-means.)

But can we get a small {\alpha} in general? In the next section, we show that for any {n}-point metric with aspect ratio {\frac{\max d(x,y)}{\min d(x,y)} = \Delta}, we can get {\alpha = O(\log n \log \Delta)}; and we indicate how to improve this to {O(\log n)}, which is the best possible!


2. Embeddings into Trees


In this section, we prove the following theorem using tree embeddings (and then, in the following section, we improve it further to {O(\log n)}).

Theorem 1 Given any metric {(V,d)} with {|V| = n} and aspect ratio {\Delta}, there exists an efficiently sampleable distribution {\mathcal{D}} over spanning trees of {V} such that for all {u,v\in V}:

  1. For all {T \in \textrm{Support}(\mathcal{D})}, {d_T(u,v) \geq d(u,v)}, and
  2. {\mathop{\mathbb E}_{T\sim \mathcal{D}}[d_T(u,v)] \leq O(\log n \log \Delta) \; d(u,v)}.


To prove this theorem, we will use the idea of a low diameter decomposition. Given a metric space {(V, d)} on {|V| = n} points and a parameter {r \in {\mathbb R}_+}, a (randomized) low-diameter decomposition is an efficiently sampleable probability distribution over partitions of {V} into {S_1 \uplus S_2 \uplus S_3 \uplus \dots \uplus S_t} such that

  1. (Low Radius/Diameter) For all {S_i}, there exists {c_i \in V} such that for all {u \in S_i}, {d(c_i, u) \leq r/2}. Hence, for any {u, v \in S_i}, {d(u,v) \leq r}.
  2. (Low Cutting Probability) For each pair {u,v}, {\Pr[u,v \text{ lie in different } S_i's ] \leq \beta \; \frac{d(u,v)}{r}} with {\beta = O(\log n)}.

We’ll show how to construct such a decomposition in the next section (next lecture), and use such a decomposition to prove Theorem 1.


HW #4 Open Thread

Please ask your HW#4 questions here. Here’s one:

For Exercise 2, Part A, the algorithm says to “Choose 2m+1 elements S uniformly at random…”. Does this mean with replacement, or without replacement? In other words, are we choosing 2m+1 elements one at a time, each with equal probability of picking any of the elements from A, such that S might be a multiset? Or, are we choosing a set of 2m+1 elements from A, where each set has probability 1 / binom(n, 2m+1) of being picked?

It does not matter; you can do it either way. Though the analysis for the process where you choose independently with replacement is slightly simpler.

Update: the phrasing for problem #2(b) was ambiguous, and has been fixed. (Thanks Kevin!)

Update #2: the indices in the algorithm in problem #2 were off by 1 and have been fixed. (Thanks Favonia!)

Lecture #14: Game Theory

Today we discussed some key concepts in game theory and connections between (some of) them and results in online learning.

We began by discussing 2-player zero-sum games. The Minimax Theorem states that these games have a well-defined value V such that (a) there exists a mixed strategy p for the row-player that guarantees the row player makes at least V (in expectation) no matter what column the column-player chooses, and (b) there exists a mixed strategy q for the column-player that guarantees the row player makes at most V (in expectation) no matter what row the row-player chooses. We then saw how we could use the results we proved about the Randomized Weighted Majority algorithm to prove this theorem.

We next discussed general-sum games and the notion of Nash equilibria. In a zero-sum game, a Nash equilibrium requires the two players to both be playing minimax optimal strategies, but in general there could be Nash equilibria of multiple quality-levels for each player. We then proved the existence of Nash equilibria. However, unlike in the case of zero-sum games, the proof gives no idea of how to find a Nash equilibrium or even to approach one. In fact, doing so in a 2-player nxn game is known to be PPAD-hard.

Finally we discussed the notion of correlated equilibria and the connection of these to swap-regret, a generalization of the kind of regret notion we discussed last time. In particular, any set of algorithms with swap-regret sublinear in T, when played against each other, will have an empirical distribution of play that approaches (the set of) correlated equilibria.  See lecture notes.

HW3 Graders

Hi, if you volunteered to help grade HW3, please send me mail (if you haven’t already done so). Thanks! –Anupam

Lecture #13: Learning Theory II

Today we talked about online learning.  We discussed the Weighted Majority and Randomized Weighted Majority algorithms for the problem of “combining expert advice”, showing for instance that the RWM algorithm satisfies the bound E[\# mistakes] \leq (1+\epsilon)OPT + \frac{1}{\epsilon}\log n, where n is the number of “experts” and  OPT is the number of mistakes of the best expert in hindsight.  Also, this can be used when the experts are not predictors but rather just different options (like whether to play Rock, Paper, or Scissors in the Rock-Paper-Scissors game).  In this case, “# mistakes” becomes “total cost” and all costs are scaled to be in the range [0,1] each round.
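Here is a minimal sketch of Randomized Weighted Majority on cost vectors in [0,1] (illustrative Python; the multiplicative update w_i <- w_i (1-eps)^{cost_i} is one standard variant, and the parameter values are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def rwm(costs, eps):
    """Randomized Weighted Majority on a T x n matrix of per-round costs in [0,1].

    Returns the algorithm's expected total cost, to be compared with the best expert.
    """
    T, n = costs.shape
    w = np.ones(n)
    total = 0.0
    for t in range(T):
        p = w / w.sum()                 # play expert i with probability p_i
        total += p @ costs[t]           # expected cost this round
        w *= (1 - eps) ** costs[t]      # multiplicative-weights update
    return total

costs = rng.random((1000, 10))          # 10 "experts", 1000 rounds
best = costs.sum(axis=0).min()
print(rwm(costs, eps=0.1), best)        # RWM's cost is close to (1+eps)*best + (ln n)/eps
```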

We then discussed the “multiarmed bandit” problem, which is like the experts problem except you only find out the payoff for the expert you chose and not for those you didn’t choose.  For motivation, we discussed this in the context of the problem of selling lemonade to an online series of buyers, where the “experts” correspond to different possible prices you might choose for selling your lemonade.  We then went through an analysis of the EXP3 algorithm (though we did a simpler version of the analysis that gets a T^{2/3} dependence on T in the regret bound rather than the optimal T^{1/2}).

See the lecture notes (2nd half)



Lecture #12: Learning Theory 1

Today we talked about the problem of learning a target function from examples, where examples are drawn from some distribution D, and the goal is to use a labeled sample S (a set of examples drawn from D and labeled by the target f) to produce a function h such that Pr_{x \sim D}[h(x)\neq f(x)] is low. We gave a simple efficient algorithm for learning decision-lists in this setting, a basic “Occam’s razor” bound, and then a more interesting bound using the notion of shatter coefficients and a “ghost sample” argument. See 1st half of these lecture notes.

A few additional comments:

  • One way to interpret the basic Occam bound is that in principle, anything you can represent in a polynomial number of bits you can learn from a polynomial number of examples (if running time is not a concern). Also “data compression implies learning”: if you can take a set of m examples and find a prediction rule that is correct on the sample and requires < m/10 bits to write down, then you can be confident it will have low error on future points.
  • On the other hand, we would really like to learn from as few examples as possible, which is the reason for wanting bounds based on more powerful notions of the “underlying complexity” of the target function, such as shatter coefficients. Other very interesting bounds are based on a notion called “Rademacher complexity” which is even tighter.
  • For more info, see notes for 15-859(B) machine learning theory

Lecture #11: Online algorithms

Today we discussed randomized online algorithms, and in particular, algorithms for the ski-rental (elevator-or-stairs) and paging problems. See lecture notes as well as Chapter 13 of the MR book. Also, Claire Mathieu has a very nice set of notes on the randomized ski-rental problem.