Randomized Algorithms, Carnegie Mellon: Spring 2011
Today we talked about random walks on graphs, and the result that in any connected undirected graph G, for any given start vertex u, the expected time for a random walk to visit all the nodes of G (called the cover time of the graph) is at most 2m(n-1), where n is the number of vertices of G and m is the number of edges.
In the process, we proved that for any connected G, if we think of the walk as at any point in time being on some edge heading in some direction, then in the stationary distribution each edge/direction pair is equally likely, with probability 1/(2m). (Since we didn't need it, we didn't prove that this stationary distribution is unique; however, if G is connected, it is not hard to prove by contradiction that it is.) We then used that to prove that the expected gap between successive traversals of any given directed edge (u,v) is 2m. See the notes.
We also gave a few examples to show this is existentially tight. For instance, on a line (n vertices, n-1 edges) the expected time to reach the other end of the line is $\Theta(n^2)$. Also on a “lollipop graph” (a clique of n/2 vertices connected to a line of n/2 vertices), the expected time to get from the clique to the end of the line is $\Theta(n^3)$. Since this is not in the notes, here is a quick argument. First of all, if you are in the clique, then each step has probability about 2/n of taking you to the node connecting the clique and the handle (let's call this “node 0”). When you are at node 0, your next step has probability about 2/n to take you to the next point on the handle (let's call this “node 1”). So, when you are in the clique, it takes $\Theta(n^2)$ steps in expectation to get to node 1. Now, think about the following experiment. Say you go into a fair casino with 1 dollar and bet on fair games (betting a dollar each time) until either you lose all your money or you have n/2 dollars in your pocket. Whenever you lose all your money, you go back to your car to get another dollar, and if you get n/2 dollars, you go upstairs to the restaurant. What is the expected number of trips to your car before you go to the restaurant? The answer is n/2, because it's a fair casino, so in expectation all the money in your pocket when you head to the restaurant came from your car. Now, think of node 0 as your car, node 1 as entering the casino, and the end of the lollipop stick (node n/2) as the restaurant: each trip to the car costs $\Theta(n^2)$ expected steps, and there are n/2 trips in expectation, giving $\Theta(n^3)$ overall.
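To make the line example concrete, here is a tiny simulation (my own sanity check, not from the lecture) estimating the expected time for a walk started at one end of an n-vertex path to reach the other end; the exact answer is $(n-1)^2$.

```python
import random

def path_hitting_time(n, trials=2000, seed=0):
    """Estimate the expected time for a random walk on the path
    0 - 1 - ... - (n-1) to go from vertex 0 to vertex n-1."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        pos, steps = 0, 0
        while pos != n - 1:
            # at the endpoint the move is forced; otherwise step +-1 u.a.r.
            pos = 1 if pos == 0 else pos + rng.choice((-1, 1))
            steps += 1
        total += steps
    return total / trials

# The exact expectation is (n-1)^2 = 81 for n = 10.
print(path_hitting_time(10))
```

The same harness, run on a lollipop graph instead of a path, exhibits the $\Theta(n^3)$ growth described above.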
We ended our discussion by talking about resistive networks, and using the connection to give another proof of the cover time bound. In particular, we have $C_{uv} = 2m R_{uv}$, where $C_{uv}$ is the commute time between u and v (the expected time to go from u to v and back), and $R_{uv}$ is the effective resistance between u and v.
In this lecture we talked about Rademacher bounds in machine learning. These are never worse and sometimes can be quite a bit tighter than VC-dimension bounds. Rademacher bounds say that to bound how much you are overfitting (the gap between your error on the training set and your true error on the distribution), you can do the following. See how much you would overfit on random labels (how much better than 50% is the empirical error of the best function in your class when you give random labels to your dataset) and then double that quantity (and add a low-order term). See the notes.
HW5 has been posted; it’s due on Monday April 4th.
Update: for Exercise #2, the expression is indeed correct as stated—you want to show that the solution found using the tree is lousy when used on the original metric.
A simpler problem, if you're stuck, is the furthest pair problem. Here, you are given a metric and want to output a pair of points whose distance is the largest. A natural (yet lousy) algorithm would be: embed the metric into a random tree while maintaining distances in expectation, find a furthest pair in the tree, and output this pair. Show an example where this algorithm sucks.
1. Some definitions
Recall that a sequence of r.v.s $Z_0, Z_1, Z_2, \ldots$ is called a martingale if each $Z_i$ satisfies $E[|Z_i|] < \infty$, and
$E[\, Z_{i+1} \mid Z_0, Z_1, \ldots, Z_i \,] = Z_i.$
Somewhat more generally, given a sequence $X_1, X_2, \ldots$ of random variables, a martingale with respect to $X_1, X_2, \ldots$ is another sequence of r.v.s $Z_0, Z_1, Z_2, \ldots$ if each $Z_i$ satisfies $E[|Z_i|] < \infty$, is a function of $X_1, \ldots, X_i$, and
$E[\, Z_{i+1} \mid X_1, X_2, \ldots, X_i \,] = Z_i.$
One can define things even more generally, but for the purposes of this course, let's just proceed with this. (If you'd like more details, check out, say, the books by Grimmett and Stirzaker, or Durrett, or many others.)
1.1. The Azuma-Hoeffding Inequality
Theorem 1 (Azuma-Hoeffding) If $Z_0, Z_1, \ldots, Z_n$ is a martingale such that for each $i$, $|Z_i - Z_{i-1}| \le c_i$, then
$\Pr[\, |Z_n - Z_0| \ge \lambda \,] \le 2 \exp\left( - \frac{\lambda^2}{2 \sum_{i=1}^n c_i^2} \right).$
(Apparently Bernstein had essentially figured this one out as well, in addition to the Chernoff-Hoeffding bounds, back in 1937.) The proof of this bound can be found in most texts; we'll skip it here. BTW, if you just want the upper or lower tail, replace the leading factor of $2$ by $1$ on the right hand side.
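As a quick numerical illustration (mine, not from the notes), we can compare the Azuma-Hoeffding bound against the empirical tail of the simplest martingale, a sum of $n$ independent $\pm 1$ steps, where each $c_i = 1$ and $\sum_i c_i^2 = n$:

```python
import math
import random

def empirical_tail(n, lam, trials=5000, seed=1):
    """Empirical Pr[|Z_n - Z_0| >= lam] for Z_n = sum of n fair +-1 steps."""
    rng = random.Random(seed)
    hits = sum(
        abs(sum(rng.choice((-1, 1)) for _ in range(n))) >= lam
        for _ in range(trials)
    )
    return hits / trials

n = 100
lam = 2 * math.sqrt(n)                     # a "two sigma" deviation
azuma = 2 * math.exp(-lam ** 2 / (2 * n))  # the bound, with sum c_i^2 = n
print(empirical_tail(n, lam), "<=", azuma)
```

The bound is loose here (the true two-sided tail is around 0.05, versus the bound's roughly 0.27), which is typical: Azuma trades tightness for generality.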
2. The Doob Martingale
Most often, the case we will be concerned with is where our entire space is defined by a sequence of random variables $X_1, X_2, \ldots, X_n$, where each $X_i$ takes values in the set $\Omega_i$. Moreover, we will be interested in some bounded function $f : \Omega_1 \times \cdots \times \Omega_n \to \mathbb{R}$, and will want to understand how $f(X_1, \ldots, X_n)$ behaves, when $(X_1, \ldots, X_n)$ is drawn from the underlying distribution. (Very often these $X_i$'s will be drawn from a “product distribution”—i.e., they will be independent of each other, but they need not be.) Specifically, we ask:
How concentrated is $f(X_1, \ldots, X_n)$ around its mean $E[f] := E[f(X_1, \ldots, X_n)]$?
To this end, define for every $i \in \{0, 1, \ldots, n\}$ the random variable
$Z_i := E[\, f(X_1, \ldots, X_n) \mid X_1, \ldots, X_i \,].$
(At this point, it is useful to remember the definition of a random variable as a function from the sample space to the reals: so each $Z_i$ is also such a function, obtained by taking averages of $f$ over parts of the sample space.)
How does the random variable $Z_0$ behave? It's just the constant $E[f]$: the expected value of the function $f$ given random settings for $X_1$ through $X_n$. What about $Z_1$? It is a function that depends only on its first variable, namely $X_1$—instead of averaging over the entire sample space, we partition according to the value of the first variable, and average over each part in the partition. And $Z_2$ is a function of $X_1, X_2$, and averages over the other variables. And so on to $Z_n$, which is the same as the function $f$. So, as we go from $Z_0$ to $Z_n$, the random variables go from the constant function $E[f]$ to the function $f$.
Lemma 2 For a bounded function $f$, the sequence $Z_0, Z_1, \ldots, Z_n$ is a martingale with respect to $X_1, X_2, \ldots, X_n$. (It's called the Doob martingale for $f$.)
Proof: The first two properties of being a martingale with respect to $X_1, \ldots, X_n$ follow from $f$ being bounded, and the definition of $Z_i$ itself. For the last property,
$E[\, Z_{i+1} \mid X_1, \ldots, X_i \,] = E\big[\, E[f \mid X_1, \ldots, X_{i+1}] \;\big|\; X_1, \ldots, X_i \,\big] = E[\, f \mid X_1, \ldots, X_i \,] = Z_i.$
The first equality is the definition of $Z_{i+1}$, the second from the fact that $E[\, E[U \mid V, W] \mid V \,] = E[U \mid V]$ for random variables $U, V, W$, and the last from the definition of $Z_i$.
Assuming that $f$ was bounded was not necessary; one can work with weaker assumptions—see the texts for more details.
Before we continue on this thread, let us show some Doob martingales which arise in CS/Math-y applications.
In the first one, let $X_i$ be the indicator for the presence of the $i$-th potential edge, which gives us a martingale sequence of length $\binom{n}{2}$. This is called the edge-exposure martingale. For the second one, let $X_i$ be the collection of edges going from the $i$-th vertex to vertices $1, \ldots, i-1$: the new martingale has length $n$ and is called the vertex-exposure martingale.
BTW, are these $X_i$'s independent of each other? Naively, they might depend on the size of the current set, which makes it dependent on the past. One way you can make these independent is by letting the $X_i$'s be, say, random independent permutations of all the elements, and when you want to choose the pivot, pick the first element of the current set according to the permutation $X_i$. (Or, you could let $X_i$ be a random independent real in $[0,1]$ and use that to pick a random element from the current set, etc.)
However, in this example, the $X_i$'s are not independent. Nonetheless, the sequence $Z_0, Z_1, \ldots$ is a Doob martingale. (As in the quicksort example, one can define it with respect to a different set of variables which are independent of each other.)
So yeah, if we want to study the concentration of $f$ around $E[f]$, we can now apply Azuma-Hoeffding to the Doob martingale, which gives us the concentration of $Z_n$ (i.e., $f$) around $Z_0$ (i.e., $E[f]$). Good, good.
Next step: to apply Azuma-Hoeffding to the Doob martingale $Z_0, \ldots, Z_n$, we need to bound $|Z_i - Z_{i-1}| \le c_i$ for all $i$. Which just says that if we can go from the constant $E[f]$ to the function $f$ in a “small” number of steps ($n$ of them), and each time we're not smoothing out “too aggressively” (the $c_i$'s are small), then $f$ is concentrated about its mean.
2.1. Independence and Lipschitz-ness
One case when it's easy to bound the $c_i$'s is when the $X_i$'s are independent of each other, and also $f$ is not too sensitive in any coordinate—namely, changing any one coordinate does not change the value of $f$ by much. Let's see this in detail.
Definition 3 Given values $c_1, c_2, \ldots, c_n$, the function $f$ is $(c_1, \ldots, c_n)$-Lipschitz if for all $i$, for all $x_1 \in \Omega_1, \ldots, x_n \in \Omega_n$, and for all $x_i' \in \Omega_i$, it holds that
$\big| f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \big| \le c_i.$
If $c_i = c$ for all $i$, then we just say $f$ is $c$-Lipschitz.
Lemma 4 If $f$ is $(c_1, \ldots, c_n)$-Lipschitz and $X_1, \ldots, X_n$ are independent, then the Doob martingale of $f$ with respect to $X_1, \ldots, X_n$ satisfies $|Z_i - Z_{i-1}| \le c_i$.
Proof: Let us use $x_{(i)}$ to denote the sequence $x_1, \ldots, x_i$, etc. Recall that
$Z_i = E[\, f \mid X_{(i)} = x_{(i)} \,] = \sum_{x_{i+1}, \ldots, x_n} f(x_{(i)}, x_{i+1}, \ldots, x_n)\, \Pr[X_{i+1} = x_{i+1}] \cdots \Pr[X_n = x_n],$
where the last equality is from independence. Similarly for $Z_{i-1}$. Hence $|Z_i - Z_{i-1}|$ is at most
$\max_{x_i, x_i' \in \Omega_i} \sum_{x_{i+1}, \ldots, x_n} \big| f(x_{(i-1)}, x_i, x_{i+1}, \ldots, x_n) - f(x_{(i-1)}, x_i', x_{i+1}, \ldots, x_n) \big|\, \Pr[X_{i+1} = x_{i+1}] \cdots \Pr[X_n = x_n] \le c_i,$
where the inequality is from the fact that changing the $i$-th coordinate from $x_i$ to $x_i'$ cannot change the function value by more than $c_i$, and that the probabilities sum to $1$.
Now applying Azuma-Hoeffding, we immediately get:
Corollary 5 (McDiarmid's Inequality) If $f$ is $(c_1, \ldots, c_n)$-Lipschitz and $X_1, \ldots, X_n$ are independent, then
$\Pr[\, |f - E[f]| \ge \lambda \,] \le 2 \exp\left( - \frac{\lambda^2}{2 \sum_i c_i^2} \right).$
(Disclosure: I am cheating. McDiarmid's inequality has better constants: the factor of $2$ in the denominator of the exponent moves to the numerator.) And armed with this inequality, we can give concentration results for some applications we mentioned above.
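For a concrete toy example (mine, not from the lecture) of a 1-Lipschitz function of independent coordinates: the number of distinct values among $n$ uniform draws from an $n$-element set changes by at most 1 when any single draw is changed, so Corollary 5 applies with $\sum_i c_i^2 = n$:

```python
import math
import random

def distinct_count_deviations(n=200, trials=3000, seed=2):
    """f = number of distinct values among n uniform draws from {0,...,n-1}.
    Changing one draw changes f by at most 1, so f is 1-Lipschitz per
    coordinate and Corollary 5 gives Pr[|f - Ef| >= lam] <= 2 exp(-lam^2/(2n))."""
    rng = random.Random(seed)
    samples = [len({rng.randrange(n) for _ in range(n)}) for _ in range(trials)]
    mean = sum(samples) / trials
    lam = 2 * math.sqrt(n)
    emp = sum(abs(s - mean) >= lam for s in samples) / trials
    bound = 2 * math.exp(-lam ** 2 / (2 * n))
    return emp, bound

emp, bound = distinct_count_deviations()
print(emp, "<=", bound)
```

Here too the bound is far from tight (the true fluctuations of this particular $f$ are much smaller than $\sqrt{n}$), but it holds, and it required nothing beyond the Lipschitz property.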
Hence, whp, the chromatic number deviates from its mean by at most $O(n)$: using the edge-exposure martingale, $\chi$ is $1$-Lipschitz in each of the $m = \binom{n}{2}$ edge variables, so deviations of $\lambda$ have probability at most $2e^{-\lambda^2/2m}$.
This is not very interesting, since the right hand side is small only when $\lambda = \Omega(n)$—but the chromatic number itself lies in $[1, n]$, so we get almost no concentration at all.
Instead, we could use a vertex-exposure martingale, where at the $i$-th step we expose the $i$-th vertex and its edges going to vertices $1, \ldots, i-1$. Even with respect to these $n$ variables, the function $\chi$ is $1$-Lipschitz, and hence
$\Pr[\, |\chi - E[\chi]| \ge \lambda \sqrt{n} \,] \le 2 e^{-\lambda^2/2}.$
And hence the chromatic number of the random graph is concentrated to within $O(\sqrt{n})$ around its mean.
3. Concentration for Random Geometric TSP
McDiarmid's inequality is convenient to use, but Lipschitz-ness often does not get us as far as we'd like (even with independence). Sometimes you need to bound the martingale differences $|Z_i - Z_{i-1}|$ directly to get the full power of Azuma-Hoeffding. Here's one example:
Let $X_1, \ldots, X_n$ be points picked independently and uniformly at random from the unit square $[0,1]^2$. Let $f(X_1, \ldots, X_n) = \mathsf{TSP}(X_1, \ldots, X_n)$ be the length of the shortest traveling salesman tour on these points. How closely is $f$ concentrated around its mean $E[f]$?
In the HW, you will show that $E[f] = \Theta(\sqrt{n})$; in fact, one can pin down $E[f]$ up to the leading constant. (See the work of Rhee and others.)
3.1. Using McDiarmid: a weak first bound
Note that $f$ is $O(1)$-Lipschitz: moving a single point changes the tour length by at most a constant. By Corollary 5 we get that
$\Pr[\, |f - E[f]| \ge \lambda \,] \le 2 e^{-\lambda^2/O(n)}.$
If we want the deviation probability to be a small constant, we would have to set $\lambda = \Theta(\sqrt{n})$. Not so great, since this is comparable to the expectation $\Theta(\sqrt{n})$ itself—we'd like a tighter bound.
3.2. So let’s be more careful: an improved bound
And in fact, we'll get a better bound using the very same Doob martingale $Z_i = E[\, f(X_1, \ldots, X_n) \mid X_1, \ldots, X_i \,]$ associated with $f = \mathsf{TSP}$. But instead of just using the $O(1)$-Lipschitzness of $f$, let us bound $|Z_i - Z_{i-1}|$ better.

Lemma 6 $|Z_i - Z_{i-1}| \le O\left( \min\left( 1, \frac{1}{\sqrt{n-i+1}} \right) \right)$.

Before we prove this lemma, let us complete the concentration bound for TSP using it. Setting $c_i = O(1/\sqrt{n-i+1})$ gives us $\sum_i c_i^2 = O(\log n)$, and hence Azuma-Hoeffding gives:
$\Pr[\, |f - E[f]| \ge \lambda \,] \le 2 \exp\left( - \frac{\lambda^2}{O(\log n)} \right).$
Much better!
3.3. Some useful lemmas
To prove Lemma 6, we’ll need a simple geometric lemma:
Lemma 7 Let $x \in [0,1]^2$. Pick $n$ points $X_1, \ldots, X_n$ uniformly at random from $[0,1]^2$; then the expected distance of $x$ to its closest point in $\{X_1, \ldots, X_n\}$ is $O(1/\sqrt{n})$.
Proof: Define the random variable $W := \min_i d(x, X_i)$. Hence, $W > r$ exactly when all the $X_i$ miss the ball $B(x, r)$. For $r \le \sqrt{2}$, the area of $B(x, r) \cap [0,1]^2$ is at least $c' r^2$ for some constant $c'$.
So for such $r$, the chance that the $n$ points all miss this ball, and hence $\Pr[W > r]$, is at most
$(1 - c' r^2)^n \le e^{-c' n r^2}.$
Of course, for $r > \sqrt{2}$, $\Pr[W > r] = 0$. And hence
$E[W] = \int_0^\infty \Pr[W > r]\, dr \le \int_0^{\sqrt{2}} e^{-c' n r^2}\, dr = O(1/\sqrt{n}).$
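Lemma 7's $O(1/\sqrt{n})$ scaling is easy to see in simulation (my own sketch): the expected distance from a fixed point to the nearest of $n$ random points should roughly halve when $n$ quadruples.

```python
import math
import random

def avg_nearest_dist(n, trials=400, seed=3):
    """Average distance from the center (0.5, 0.5) to the closest of
    n uniformly random points in the unit square."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(math.hypot(rng.random() - 0.5, rng.random() - 0.5)
                     for _ in range(n))
    return total / trials

d100, d400 = avg_nearest_dist(100), avg_nearest_dist(400)
print(d100, d400, d400 / d100)   # the ratio should be near 1/2
```

The choice of the center as the reference point is arbitrary; any fixed $x$ not too close to the boundary gives the same scaling.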
Secondly, here is another lemma about how the TSP behaves:
Lemma 8 For any set of points $S$ and any point $x$, we get $\mathsf{TSP}(S \cup \{x\}) \le \mathsf{TSP}(S) + 2\, d(x, S)$.
Proof: Follows from the fact that we can take the optimal tour on $S$ and add a round-trip detour to $x$ from its closest point in $S$, which costs $2\, d(x, S)$.
3.4. Proving Lemma 6
OK, now to the proof of Lemma 6. Recall that we want to bound $|Z_i - Z_{i-1}|$; since $f$ is $O(1)$-Lipschitz, we get $|Z_i - Z_{i-1}| = O(1)$ immediately. For the second bound of $O(1/\sqrt{n-i+1})$, note that
$Z_{i-1} = E[\, f(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n) \mid X_1, \ldots, X_i \,],$
where $X_i'$ is an independent copy of the random variable $X_i$. Hence
$|Z_i - Z_{i-1}| \le E\big[\, \big| f(\ldots, X_i, \ldots) - f(\ldots, X_i', \ldots) \big| \;\big|\; X_1, \ldots, X_i \,\big].$
Then, if we define the sets $S := \{X_1, \ldots, X_{i-1}\} \cup \{X_{i+1}, \ldots, X_n\}$ and $S' := \{X_{i+1}, \ldots, X_n\}$, then we get
$\big| f(S \cup \{X_i\}) - f(S \cup \{X_i'\}) \big| \le 2\, d(X_i, S) + 2\, d(X_i', S) \le 2\, d(X_i, S') + 2\, d(X_i', S'),$
where the first inequality uses Lemma 8 and the second uses the fact that the minimum distance to a set only increases when the set gets smaller. But now we can invoke Lemma 7 to bound the expectation of each of the terms by $O(1/\sqrt{n-i})$, since $S'$ consists of $n-i$ uniformly random points. This completes the proof of Lemma 6.
3.5. Some more about Geometric TSP
For constant dimension $d$, one can consider the same problems in $[0,1]^d$: the expected TSP length is now $\Theta(n^{1-1/d})$, and using similar arguments, you can show concentration around the mean as well (for $d \ge 3$, deviations of $\lambda$ have probability at most $\exp(-\lambda^2 / O(n^{1-2/d}))$).
The result we just proved was by Rhee and Talagrand, but it was not the last word on TSP concentration. Rhee and Talagrand subsequently improved this bound to show that the TSP has subgaussian tails:
$\Pr[\, |f - E[f]| \ge \lambda \,] \le C\, e^{-\lambda^2/C}$
for some constant $C$!
We’ll show a proof of this using Talagrand’s inequality, in a later lecture.
If you’re interested in this line of research, here is a survey article by Michael Steele on concentration properties of optimization problems in Euclidean space, and another one by Alan Frieze and Joe Yukich on many aspects of probabilistic TSP.
4. Citations
As mentioned in a previous post, McDiarmid and Hayward use martingales to give extremely strong concentration results for QuickSort. The book by Dubhashi and Panconesi (preliminary version here) sketches this result, and also contains many other examples and extensions of the use of martingales.
Other resources for concentration using martingales: this survey by Colin McDiarmid, or this article by Fan Chung and Linyuan Lu.
Apart from giving us powerful concentration results, martingales and “stopping times” combine to give very surprising and powerful results: see this survey by Yuval Peres at SODA 2010, or these course notes by Yuval and Eyal Lubetzky.
In celebration of Les Valiant’s winning of the Turing award, today we discussed a classic result of his on the problem of oblivious routing on the hypercube. (Valiant, “A scheme for fast parallel communication”, SIAM J. Computing 1982, and Valiant and Brebner, “Universal schemes for parallel communication”, STOC 1981). The high-level idea is that rather than routing immediately to your final destination, go to a random intermediate point first. The analysis is really beautiful — you define the right quantities and it just magically works out nicely. See today’s class notes. We also discussed the Butterfly and Benes networks as well. (See the excellent notes of Satish Rao for more on them).
At the end, we also briefly discussed Martingales: their definition, Azuma's inequality, and McDiarmid's inequality (which doesn't talk about Martingales directly but is very convenient and can be proven using Azuma). The discussion at the end was extremely rushed but the key point was: suppose you have a complicated random variable you care about, like “the running time of my randomized algorithm”, where the random variable is a function $f$ of a series of random choices $X_1, X_2, \ldots, X_n$. Then the sequence $Z_i = E[\, f \mid X_1, \ldots, X_i \,]$ is a Martingale. E.g., the expected running time of quicksort given that the first $i$ pivots are $x_1, \ldots, x_i$ is the expected value, over the possible choices of the next pivot $x_{i+1}$, of the expected running time of quicksort given that the first $i+1$ pivots are $x_1, \ldots, x_{i+1}$.
1. An equivalent view of estimating $F_2$
Again, you have a data stream of elements $a_1, a_2, \ldots$, each element drawn from the universe $[n]$. This stream defines a frequency vector $f = (f_1, \ldots, f_n)$, where $f_i$ is the number of times element $i$ is seen. Consider the following algorithm to compute the second moment $F_2 := \sum_i f_i^2$.
Take a (suitably random) hash function $h : [n] \to \{-1, +1\}$. Maintain a counter $C$, which starts off at zero. Every time an element $i$ comes in, increment the counter: $C \leftarrow C + h(i)$. And when queried, we reply with the value $C^2$.
Hence, having seen the stream that results in the frequency vector $f$, the counter will have the value $C = \sum_{i \in [n]} f_i\, h(i)$. Does $C^2$ at least have the right expectation? It does: as long as the hash values are unbiased and pairwise independent,
$E[C^2] = \sum_{i,j} f_i f_j\, E[h(i)\, h(j)] = \sum_i f_i^2 = F_2.$
And what about the variance? Recall that $\mathrm{Var}(C^2) = E[C^4] - E[C^2]^2$, so let us calculate (using $4$-wise independence of the hash values)
$E[C^4] = \sum_{i,j,k,l} f_i f_j f_k f_l\, E[h(i)h(j)h(k)h(l)] = \sum_i f_i^4 + 3 \sum_{i \ne j} f_i^2 f_j^2 \le 3 F_2^2.$
So
$\mathrm{Var}(C^2) = E[C^4] - F_2^2 \le 2 F_2^2.$
What does Chebyshev say then?
$\Pr[\, |C^2 - F_2| \ge \lambda F_2 \,] \le \frac{\mathrm{Var}(C^2)}{\lambda^2 F_2^2} \le \frac{2}{\lambda^2}.$
Not that hot: for reasonable values of $\lambda$ this bound is more than $1$.
But if we take a collection of $k$ such independent counters $C_1, \ldots, C_k$, and given a query, take their average $\bar{C} := \frac{1}{k} \sum_j C_j^2$ and return it, the expectation of the average remains $F_2$, but the variance falls by a factor of $k$. And we get
$\Pr[\, |\bar{C} - F_2| \ge \lambda F_2 \,] \le \frac{2}{k \lambda^2}.$
So, our probability of error on any query is at most $\delta$ if we take $k = \frac{2}{\lambda^2 \delta}$.
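Here's a small simulation of this estimator (my own sketch; it uses fully independent random signs standing in for the hash functions) checking that $C^2$ is unbiased for $F_2$ and that averaging $k$ counters concentrates it:

```python
import random

def f2_estimate(freq, k=2000, seed=4):
    """Average of k independent sign-sketch estimates of F2 = sum_i f_i^2.
    Each counter is C = sum_i f_i * s_i with s_i uniform in {-1,+1}."""
    rng = random.Random(seed)
    n = len(freq)
    total = 0.0
    for _ in range(k):
        signs = [rng.choice((-1, 1)) for _ in range(n)]
        c = sum(f * s for f, s in zip(freq, signs))
        total += c * c
    return total / k

freq = [3, 1, 4, 1, 5, 9, 2, 6]   # a made-up frequency vector
print(f2_estimate(freq), "vs F2 =", sum(f * f for f in freq))
```

Since $\mathrm{Var}(C^2) \le 2F_2^2$, the standard deviation of the average is about $F_2 \sqrt{2/k}$, a few percent of $F_2$ here.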
1.1. Hey, those calculations look familiar
Sure. This is just a restatement of what we did in lecture. There we took a $k \times n$ matrix $A$ filled with random $\{-1,+1\}$ values—hence each row of $A$ corresponds to a hash function from $[n]$ to $\{-1,+1\}$. And taking $k$ rows in the matrix corresponds to the variance reduction step at the end.
1.2. Limited Independence
How much randomness do you need for the hash functions? Indeed, hash functions which are $4$-wise independent suffice for the above proofs to go through. And how does one get a $4$-wise independent hash function? Watch this blog (and the HWs) for details.
Today we’ll talk about dimensionality reduction, and some related topics in data streaming.
1. Dimension Reduction
Suppose we are given a set of $n$ points $x_1, \ldots, x_n$ in $\mathbb{R}^D$. How small can we make the target dimension $k$ and still maintain the Euclidean distances between the points after mapping them into $\mathbb{R}^k$? Clearly, we can always make $k = n-1$, since any set of $n$ points lies in an $(n-1)$-dimensional affine subspace. And this is (existentially) tight: e.g., the case when $x_1, \ldots, x_n$ are all orthogonal unit vectors.
But what if we were OK with the distances being approximately preserved? In HW#3, you saw that while there can only be $D$ orthogonal unit vectors in $\mathbb{R}^D$, there can be as many as $2^{\Omega(\varepsilon^2 D)}$ unit vectors which are $\varepsilon$-orthogonal—i.e., whose mutual inner products all lie in $[-\varepsilon, \varepsilon]$. Near-orthogonality allows us to pack exponentially more vectors!
Put another way, note that for unit vectors
$\|x_i - x_j\|_2^2 = \|x_i\|_2^2 + \|x_j\|_2^2 - 2\langle x_i, x_j \rangle = 2 - 2\langle x_i, x_j \rangle.$
And hence the squared Euclidean distance between any pair of the points defined by these $\varepsilon$-orthogonal vectors falls in $2(1 \pm \varepsilon)$. So, if we wanted $n$ points exactly at unit (Euclidean) distance from each other, we would need $n-1$ dimensions. (Think of a triangle in $2$-dims.) But if we wanted to pack in $n$ points which were at distance $(1 \pm \varepsilon)$ from each other, we could pack them into
$O\left( \frac{\log n}{\varepsilon^2} \right)$
dimensions.
1.1. The Johnson Lindenstrauss lemma
The Johnson Lindenstrauss “flattening” lemma says that such a claim is true not just for equidistant points, but for any set of points in Euclidean space:
Lemma 1 Let $\varepsilon \in (0, 1/2)$. Given any set of $n$ points $X = \{x_1, \ldots, x_n\}$ in $\mathbb{R}^D$, there exists a map $A : \mathbb{R}^D \to \mathbb{R}^k$ with $k = O(\varepsilon^{-2} \log n)$ such that for all $x_i, x_j \in X$,
$(1 - \varepsilon)\, \|x_i - x_j\|_2^2 \;\le\; \|A(x_i) - A(x_j)\|_2^2 \;\le\; (1 + \varepsilon)\, \|x_i - x_j\|_2^2.$
Note that the target dimension $k$ is independent of the original dimension $D$, and depends only on the number of points $n$ and the accuracy parameter $\varepsilon$.
This lemma is tight up to the constant term: it is easy to see that we need at least $\Omega(\varepsilon^{-2})$ dimensions using a packing argument. Noga Alon showed a lower bound of $\Omega\left( \frac{\log n}{\varepsilon^2 \log(1/\varepsilon)} \right)$.
1.2. The construction
The JL lemma is pretty surprising, but the construction of the map is perhaps even more surprising: it is a super-simple random construction. Let $A$ be a $k \times D$ matrix, such that every entry of $A$ is filled with an i.i.d. draw from a standard normal $N(0,1)$ distribution (a.k.a. the “Gaussian” distribution). For $x \in \mathbb{R}^D$, define
$A(x) := \frac{1}{\sqrt{k}}\, A x.$
That's it. You hit the vector $x$ with a Gaussian matrix $A$, and scale it down by $\sqrt{k}$. That's the map $A(\cdot)$. Note that it is a linear map: $A(x) + A(y) = A(x+y)$. So suppose we could show the following lemma:
Lemma 2 Let $\varepsilon \in (0, 1/2)$. If $A$ is constructed as above with $k = O(\varepsilon^{-2} \log(1/\delta))$, and $x \in \mathbb{R}^D$ is a unit vector, then
$\Pr[\, \|A(x)\|_2^2 \in 1 \pm \varepsilon \,] \ge 1 - \delta.$
Then we'd get a proof of Lemma 1. Indeed, set $\delta = 1/n^2$, and hence $k = O(\varepsilon^{-2} \log n)$. Now for each pair $x_i, x_j$, apply Lemma 2 to the unit vector $(x_i - x_j)/\|x_i - x_j\|_2$: by linearity, the squared length of $x_i - x_j$ is maintained to within $(1 \pm \varepsilon)$ with probability at least $1 - 1/n^2$. By a union bound, all $\binom{n}{2}$ pairs of distances in $X$ are maintained with probability at least $1 - \binom{n}{2} \frac{1}{n^2} \ge 1/2$. This proves Lemma 1.
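The construction is simple enough to check numerically; here is a toy sketch (my code, with arbitrarily chosen sizes) that projects 10 random points from $\mathbb{R}^{100}$ down to $k = 500$ dimensions and looks at the worst distortion of pairwise squared distances:

```python
import math
import random
from itertools import combinations

rng = random.Random(5)
D, k, npts = 100, 500, 10      # arbitrary toy sizes

# random points in R^D, and a k x D matrix of i.i.d. standard Gaussians
points = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(npts)]
A = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(k)]

def jl_map(x):
    """The map x -> (1/sqrt(k)) A x."""
    return [sum(a * xi for a, xi in zip(row, x)) / math.sqrt(k) for row in A]

mapped = [jl_map(p) for p in points]

def sqdist(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

ratios = [sqdist(mapped[i], mapped[j]) / sqdist(points[i], points[j])
          for i, j in combinations(range(npts), 2)]
print(min(ratios), max(ratios))   # all ratios should be close to 1
```

With $k = 500$ the relative error of each squared distance has standard deviation about $\sqrt{2/k} \approx 0.06$, so all 45 ratios should land well inside $1 \pm 0.4$.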
A few comments about this construction:
A SODA 2011 paper of T.S. Jayram and David Woodruff shows that this dependence of $k = \Omega(\varepsilon^{-2} \log(1/\delta))$ is the best possible. Note that if we use this approach with the union bound to prove JL, then $k = O(\varepsilon^{-2} \log n)$ is the best bound possible. (An earlier version of these notes incorrectly claimed that the Jayram-Woodruff paper also showed an unconditional lower bound for JL; thanks to Jelani for pointing out the mistake.)
1.3. The proof
Now, on to the proof of Lemma 2. Here's the main idea. Imagine that the vector we're considering is just the elementary unit vector $e_1 = (1, 0, \ldots, 0)$. Then $A e_1$ is just a vector of $k$ independent and identical Gaussian values, and we're interested in its length—the sum of squares of these Gaussians. If these were bounded r.v.s, we'd be done—but they are not. However, their tails are very small, so things should work out.
But what's a Gaussian? Well, it looks like this:
[figure: the bell-shaped density of a standard Gaussian]
Which is not too different from this (bounded) random variable, if you squint a bit:
[figure: a bounded random variable of a similar shape]
Which has constant mean. So, if we take a sum of a bunch of such random variables (actually of their squares), it should behave pretty much like its mean, because of a Chernoff-like argument. And so the expected squared length of $A e_1$ is close to $k$, which explains the division by $\sqrt{k}$.
Now we just need to make all this precise, and remove the assumption that the vector was just $e_1$. That's what the rest of the formal proof does: it has a few steps, but each of them is fairly elementary.
1.4. The proof, this time for real
We'll be using basic facts about Gaussians, so let's just recall them. The probability density function for the Gaussian $N(\mu, \sigma^2)$ is
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$
We also use the following proposition; the proof just needs some elbow grease.

Proposition 3 If $g_1 \sim N(\mu_1, \sigma_1^2)$ and $g_2 \sim N(\mu_2, \sigma_2^2)$ are independent, then $g_1 + g_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
Recall that we want to argue about the squared length of $A(x) = \frac{1}{\sqrt{k}} A x$. To start off, observe that each coordinate of the vector $A x$ behaves like
$Y := \langle g, x \rangle = \sum_i g_i\, x_i,$
where the $g_i$'s are i.i.d. $N(0,1)$ r.v.s. But then the proposition tells us that $Y \sim N(0, \sum_i x_i^2)$. And since $x$ is a unit length vector, this is simply $N(0,1)$. So, each of the $k$ coordinates of $A x$ behaves just like an independent standard Gaussian!
What is the squared length of $A(x)$, then? It is
$Z := \|A(x)\|_2^2 = \frac{1}{k} \sum_{i=1}^k Y_i^2,$
where each $Y_i \sim N(0,1)$, independent of the others. And since $E[Y_i^2] = \mathrm{Var}(Y_i) + E[Y_i]^2 = 1$, we get $E[Z] = 1$.
Now to show that $Z$ does not deviate too much from $1$. And $Z$ is the average of a bunch of independent and identical random variables $Y_i^2$. If only the $Y_i^2$'s were all bounded, we could have used a Chernoff bound and be done. But these are not bounded, so this is finally where we'll need to do a little work. (Note: we could take the easy way out, observe that the squares of Gaussians are chi-squared r.v.s, that the sum of $k$ of them is chi-squared with $k$ degrees of freedom, and that the internets conveniently have tail bounds for these things. But we digress.)
So let's start down the ye olde Chernoff path, for the upper tail, say:
$\Pr[\, Z \geq 1 + \varepsilon \,] \;\leq\; \Pr[\, e^{tkZ} \geq e^{tk(1+\varepsilon)} \,] \;\leq\; \frac{E[e^{tkZ}]}{e^{tk(1+\varepsilon)}} \;=\; \prod_{i=1}^k \frac{E[e^{t Y_i^2}]}{e^{t(1+\varepsilon)}}$
for every $t > 0$. And what is $E[e^{tY^2}]$ for $Y \sim N(0,1)$? Let's calculate it: substituting $z = y\sqrt{1-2t}$,
$\frac{1}{\sqrt{2\pi}} \int_y e^{ty^2}\, e^{-y^2/2}\, dy \;=\; \frac{1}{\sqrt{2\pi}} \int_z e^{-z^2/2}\, \frac{dz}{\sqrt{1-2t}} \;=\; \frac{1}{\sqrt{1-2t}}$
for $t < 1/2$. So our current bound on the upper tail is that for all $t \in (0, 1/2)$ we have
$\Pr[\, Z \geq 1 + \varepsilon \,] \;\le\; \left( \frac{1}{e^{t(1+\varepsilon)} \sqrt{1-2t}} \right)^k.$
Let's just focus on part of this expression:
$\frac{1}{e^{t(1+\varepsilon)} \sqrt{1-2t}} \;=\; \exp\left( -t(1+\varepsilon) - \tfrac{1}{2} \ln(1-2t) \right).$
Plugging this back, we get
$\Pr[\, Z \geq 1 + \varepsilon \,] \;\le\; e^{-k\varepsilon^2/8},$
if we set $t = \varepsilon/4$ and use the fact that $-\ln(1-x) \le x + x^2$ for $x \in (0, 1/2)$. (Note: this setting of $t$ also satisfies $t \in (0, 1/2)$, which we needed from our previous calculations.)
Almost done: let's take stock of the situation. We observed that $Z = \|A(x)\|_2^2$ was distributed like an average of squares of Gaussians, and using that we proved that
$\Pr[\, Z \geq 1 + \varepsilon \,] \;\le\; e^{-k\varepsilon^2/8},$
which is at most $\delta/2$ for $k = \frac{8}{\varepsilon^2} \ln \frac{2}{\delta}$. A similar calculation bounds the lower tail, and finishes the proof of Lemma 2.
Citations: The JL Lemma was first proved in this paper of Bill Johnson and Joram Lindenstrauss. There have been several proofs after theirs, usually trying to tighten their results, or simplify the algorithm/proof (see citations in some of the newer papers): the proof follows some combinations of the proofs in this STOC ’98 paper of Piotr Indyk and Rajeev Motwani, and this paper by Sanjoy Dasgupta and myself.
2. The data stream model
The JL map we considered was a linear map, and that has many advantages. One of them is that we can use it in a distributed context: if $s$ players each have a vector $x_i \in \mathbb{R}^D$ and each knows the JL matrix $A$, then to compute $A(\sum_i x_i)$ each person can just compute $A(x_i)$, send their answers out, and then someone can sum up the answers to get $\sum_i A(x_i) = A(\sum_i x_i)$. Since these vectors are smaller than the originals (they lie in $\mathbb{R}^k$ instead of $\mathbb{R}^D$), this can result in significant savings in communication. (We need all players to know the matrix $A$, but if they have shared randomness they can generate this matrix themselves.)
This same idea is useful in the context of data streaming: suppose you have a data stream of a large number of elements whizzing past you, each element drawn from the universe $[D]$. This stream defines a frequency vector $x \in \mathbb{Z}_{\ge 0}^D$, where $x_i$ is the number of times element $i$ is seen. People working on data streams want to calculate statistics of this vector $x$—e.g., how many non-zeroes does it have? What is $\|x\|_1$? (Duh! it's just the length of the data stream.) What is $\|x\|_2^2$? Etc.
The Space Crunch. All this can be trivially done if we use $O(D)$ space to actually store the vector $x$. Suppose we do not want to store the frequency vector explicitly, but are OK with approximate answers. We can use JL or similar schemes to approximately calculate $\|x\|_2^2$. Suppose $A$ is a $k \times D$ random Gaussian matrix; then by the guarantee of Lemma 2, the estimate $\|A(x)\|_2^2$ lies in $(1 \pm \varepsilon)\|x\|_2^2$ with probability $1-\delta$, if $k = \Omega(\varepsilon^{-2} \log(1/\delta))$. (Note: this is the error for a single query—so we're not guaranteeing the counts at all times are close, just at the time the query is made.)
And the algorithm is simple: maintain a vector $y \in \mathbb{R}^k$, initially zero. When the element $i$ comes by, add the $i$-th column of $A$ (scaled by $1/\sqrt{k}$) to $y$. Finally, answer with $\|y\|_2^2$. (If you have to answer $q$ queries, choose $k$ appropriately larger.)
Of course, you've realized I am cheating. In order to save space we used JL. But the JL matrix $A$ itself has $kD$ entries, which is a lot of space, much more than the $D$ entries of the frequency vector $x$! Also, we now need to maintain a matrix of reals, whereas $x$ just has integers!
We can handle both issues. The former issue can directly be handled by using a pseudorandom generator that “fools” low-space computation—we will not talk about this solution in this lecture. Instead we’ll give a different (though weaker) solution which handles both issues: it will use less space, and will maintain only integer values (if the input has integers).
3. Using random signs instead of Gaussians
While Gaussians have all kinds of nice properties, they are real-valued distributions and hence require attention to precision. How about populating $A$ with draws from other, simpler distributions? How about setting each $A_{ij} \in_R \{-1,+1\}$, and letting $A(x) := \frac{1}{\sqrt{k}} A x$? (A random sign is also called a Rademacher random variable, btw, the name Bernoulli being already taken for a random bit in $\{0,1\}$.)
Now, we want to study the properties of
$\|A(x)\|_2^2 = \frac{1}{k} \sum_{j=1}^k \Big( \sum_i A_{ji}\, x_i \Big)^2. \qquad (1)$
To keep subscripts to a minimum, consider the inner expression
$Y := \sum_i a_i x_i,$
where each $a_i \in_R \{-1,+1\}$. Then
$E[Y^2] = \sum_{i,j} x_i x_j\, E[a_i a_j] = \sum_i x_i^2$
if the $a_i$'s are pairwise independent, since $E[a_i^2] = 1$ and, for $i \ne j$, $E[a_i a_j] = E[a_i]\, E[a_j] = 0$ by independence. Plugging this into (1) and recalling that $x$ is a unit vector, we get
$E[\|A(x)\|_2^2] = \frac{1}{k} \sum_{j=1}^k E[Y^2] = \|x\|_2^2 = 1.$
Just what we like! To show that $\|A(x)\|_2^2$ is indeed close to its mean, we will use Chebyshev, and this requires us to compute the variance of $\|A(x)\|_2^2$.
If the rows of $A$ are independent, then the variance of $\|A(x)\|_2^2$ is the sum of the variances from each row, which in terms of the variable $Y$ defined above is
$\mathrm{Var}(\|A(x)\|_2^2) = \frac{1}{k^2} \sum_{j=1}^k \mathrm{Var}(Y^2) = \frac{E[Y^4] - E[Y^2]^2}{k}.$
But $E[Y^2] = 1$, so we know what $E[Y^2]^2$ is. For the other term,
$E[Y^4] = \sum_{p,q,r,s} x_p x_q x_r x_s\, E[a_p a_q a_r a_s] = \sum_p x_p^4 + 3 \sum_{p \ne q} x_p^2 x_q^2.$
(The other terms disappear because of $4$-wise independence.) And plugging this into the definition of $\mathrm{Var}(Y^2)$, we get
$\mathrm{Var}(Y^2) = E[Y^4] - E[Y^2]^2 \le 3 \Big( \sum_p x_p^2 \Big)^2 - \Big( \sum_p x_p^2 \Big)^2 = 2.$
Interestingly, the variance is just twice the squared mean—that's good, since the variance of $Z := \|A(x)\|_2^2$ (which was the final answer, obtained by taking the average of $k$ such variables) is only $2/k$, since averaging reduces the variance by a factor of $k$. And finally, we can set $k = \frac{2}{\varepsilon^2 \delta}$ and use Chebyshev to get
$\Pr[\, |Z - 1| \ge \varepsilon \,] \le \frac{\mathrm{Var}(Z)}{\varepsilon^2} \le \delta.$
Great! So, we take a matrix $A$ whose rows are independent, each row having $\{-1,+1\}$ values drawn from a $4$-wise independent sample space. We maintain a $k$-dimensional vector $y$, and whenever an element $i$ comes by in the stream, we just add the $i$-th column of $A$ to $y$. And when we want the answer, we reply with $\|y\|_2^2 / k$—this will be correct to within $(1 \pm \varepsilon)$ with probability at least $1 - \delta$.
Why $4$-wise independence? Well, the calculation of $E[Y^4]$ only used the fact that any four entries of each row behaved independently of each other. And it is possible to generate $D$ values from $\{-1,+1\}$ which are $4$-wise independent, using hash functions that require only $O(\log D)$ bits of space each. (We'll talk more about this later in the course.) So the total space usage is: $O(k \log D)$ bits to store the hash functions, $O(k \log(mD))$ bits to store the vector $y$ if the frequency of each element is at most $m$, and that's it.
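One standard way to get a 4-wise independent family is to evaluate a random degree-3 polynomial over a prime field. Below is a sketch (my own, and slightly cheating: taking the low bit of the field value gives signs with a tiny $O(1/p)$ bias, which is fine for illustration; exact sign constructions usually work over $GF(2^b)$ instead):

```python
import random

P = 2 ** 31 - 1   # a Mersenne prime

def fourwise_sign_hash(seed):
    """A hash [n] -> {-1,+1} built from a random degree-3 polynomial over
    Z_P (a 4-wise independent family of field values). The low bit of the
    field value gives signs with only O(1/P) bias -- fine for illustration."""
    rng = random.Random(seed)
    a, b, c, d = (rng.randrange(P) for _ in range(4))
    def h(x):
        v = (((a * x + b) * x + c) * x + d) % P   # Horner evaluation
        return 1 if v & 1 else -1
    return h

def ams_f2(freq, k=500):
    """Average of k counters C_j = sum_i f_i h_j(i); returns the mean of C_j^2."""
    total = 0.0
    for j in range(k):
        h = fourwise_sign_hash(j)
        cj = sum(f * h(i) for i, f in enumerate(freq))
        total += cj * cj
    return total / k

freq = [3, 1, 4, 1, 5, 9, 2, 6]   # a made-up frequency vector
print(ams_f2(freq), "vs F2 =", sum(f * f for f in freq))
```

Each hash needs only the four coefficients, i.e., $O(\log P)$ bits, matching the space accounting above.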
Citations: This scheme is due to the Gödel prize winning paper of Noga Alon, Yossi Matias, and Mario Szegedy. There has been a lot of interesting work on moment estimation: see, e.g., this STOC 2011 paper of Daniel Kane, Jelani Nelson, Ely Porat and David Woodruff on getting lower bounds for -norms of the vector , and the many references therein.
4. Subgaussian Behavior
In the previous section, we saw that if each row of the matrix $A$ was drawn from a $4$-wise independent sample space (and hence generating any column of $A$ could be done in small space), setting $k = O(\varepsilon^{-2} \delta^{-1})$ would suffice to give answers within $(1 \pm \varepsilon)$ with probability at least $1-\delta$. Note that the number of rows went from $O(\varepsilon^{-2} \log(1/\delta))$ to $O(\varepsilon^{-2}/\delta)$; this increase is typical of cases where we only use the second moment (and limited independence) instead of all the moments (complete independence).
So suppose we did have the luxury of full independence, could we match the JL bound using Rademacher matrices? Or does moving to the $\{-1,+1\}$ case already lose something in the performance? It turns out we can also prove Lemma 2 for a Rademacher matrix, losing only constants—we'll now prove this.
Let's look over the proof in Section 1, and see what we need to do. We take an arbitrary unit vector $x$, and define
$Y_j := \sum_i A_{ji}\, x_i$
for $j = 1, \ldots, k$, the $A_{ji}$'s being i.i.d. and uniform in $\{-1,+1\}$. If we could show that $E[Y_j^2] = 1$, and that $E[e^{t Y_j^2}] \le \frac{1}{\sqrt{1-2t}}$ for $t \in (0, 1/2)$,
then the rest of the proof of Section 1 does not use any other facts about Gaussians. And the first fact follows by the calculations from the previous section, so all we need to do is to bound the moment generating function for $Y_j^2$!
We can do this by explicit calculations, but instead let’s give a useful abstraction:
Definition 4 A random variable $V$ is said to be subgaussian with parameter $c$ if for all real $t$, we have $E[e^{tV}] \le e^{c^2 t^2/2}$.
(You can define subgaussian-ness alternatively as in these notes by Roman Vershynin, which also show the two definitions are equivalent for symmetric distributions.) A simple calculation shows that if $g \sim N(0, \sigma^2)$ then $E[e^{tg}] = e^{\sigma^2 t^2/2}$—good to know that the Gaussian is also subgaussian!
The following lemma gives a slick way to bound the mgf for the square of a subgaussian, now that we've done the hard work for the Gaussians.

Lemma 5 If $V$ is subgaussian with parameter $c$, then $E[e^{tV^2}] \le \frac{1}{\sqrt{1 - 2c^2 t}}$ for $0 < t < \frac{1}{2c^2}$.

Proof: Well, suppose $g \sim N(0,1)$ is an independent Gaussian; then for any fixed value of $V$,
$E_g[\, e^{\sqrt{2t}\, V g} \,] = e^{t V^2}$
by the calculation we just did for Gaussians. (Note that we've just introduced a Gaussian into the mix, without any provocation! But it will all work out.) Let's just rewrite that:
$E_V[\, e^{tV^2} \,] = E_V\big[\, E_g[\, e^{\sqrt{2t}\, V g} \,] \,\big] = E_g\big[\, E_V[\, e^{(\sqrt{2t}\, g)\, V} \,] \,\big].$
Using the $c$-subgaussian behavior of $V$, we bound this by
$E_g\big[\, e^{c^2 (\sqrt{2t}\, g)^2 / 2} \,\big] = E_g[\, e^{c^2 t\, g^2} \,].$
Finally, the calculation we did above for $E[e^{tY^2}]$ gives this to be $\frac{1}{\sqrt{1 - 2c^2 t}}$, as long as $c^2 t < 1/2$.
Good. Now if the $Y_j$'s were subgaussian, we'd be done. We know that each $Y_j$ is a weighted sum of Rademacher variables. A Rademacher random variable $a \in_R \{-1,+1\}$ is indeed $1$-subgaussian:
$E[e^{ta}] = \frac{e^t + e^{-t}}{2} = \cosh(t) \le e^{t^2/2}.$
And if the $V_i$'s are independent and $c$-subgaussian, and $\sum_i x_i^2 = 1$, then $V := \sum_i x_i V_i$ satisfies
$E[e^{tV}] = \prod_i E[e^{t x_i V_i}] \le \prod_i e^{c^2 t^2 x_i^2/2} = e^{c^2 t^2/2},$
so $V$ is $c$-subgaussian as well.
To summarize: the $A_{ji}$'s are $1$-subgaussian, so each $Y_j$ is too, and hence $E[e^{tY_j^2}] \le \frac{1}{\sqrt{1-2t}}$ holds for these $\{-1,+1\}$-based random variables as well. This, in turn, completes the proof that the Rademacher matrix also has the JL property! Note that the JL matrix now just requires us to pick $kD$ random bits (instead of $kD$ random Gaussians); also, there are fewer precision issues to worry about. One can consider other distributions to stick into the matrix $A$—all you need to show is that each $Y_j^2$ has the right mean, and that the entries are subgaussian.
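The one-line inequality used above for a single Rademacher sign, $E[e^{ta}] = \cosh(t) \le e^{t^2/2}$, can be sanity-checked numerically (this is a check, not a proof; the actual proof compares Taylor series, using $(2i)! \ge 2^i\, i!$ term by term):

```python
import math

# cosh(t) = sum_i t^(2i)/(2i)!  while  e^(t^2/2) = sum_i t^(2i)/(2^i i!),
# and (2i)! >= 2^i * i! for every i, so cosh(t) <= e^(t^2/2) for all t.
for t in [x / 10 for x in range(-50, 51)]:
    assert math.cosh(t) <= math.exp(t * t / 2) + 1e-12
print("cosh(t) <= exp(t^2/2) verified on a grid over [-5, 5]")
```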
Citations: The scheme of using Rademacher matrices instead of Gaussian matrices for JL was first proposed in this paper by Dimitris Achlioptas. The idea of extending it to subgaussian distributions appears in this paper of Indyk and Naor, and this paper of Matousek. The paper of Klartag and Mendelson generalizes this even further.
BTW, one can define subgaussian distributions as ones that satisfy $E[e^{tV}] \le e^{c^2t^2/2}$ only for small enough $|t|$, or as variables whose upper tail is subgaussian ($\Pr[V \ge \lambda] \le e^{-\Omega(\lambda^2)}$), and prove JL bounds—see, e.g., the paper of Matousek—but it does not matter for distributions symmetric about $0$ with bounded variance, since these definitions are then essentially the same.
Fast J-L: Do we really need to plug non-zero values into every entry of the matrix $A$? What if most of $A$ were filled with zeroes? The first problem is that if $x$ is a very sparse vector, then $Ax$ might be zero with high probability. Achlioptas showed that having a random two-thirds of the entries of $A$ be zero still works fine; the paper of Nir Ailon and Bernard Chazelle showed that if you first hit $x$ with a suitable matrix which causes it to be “well-spread-out” whp, then the JL guarantee still holds for a much sparser $A$. Moreover, this requires much less randomness, and furthermore, the computations can be done faster too! There has been much work on fast and sparse versions of JL: see, e.g., this SODA 11 paper of Ailon and Edo Liberty, and this arxiv preprint by Daniel Kane and Jelani Nelson. Jelani has some notes on the Fast JL Transform.
Compressed Sensing: Finally, the J-L lemma is closely related to compressed sensing: how to reconstruct a sparse signal using very few measurements. See these notes by Jiri Matousek, or these by Baraniuk and others for a proof of the beautiful connection. I will say more about this connection in a later post.
Some examples for the CKR decomposition scheme
Varun asked why the CKR algorithm differs in so many ways from Bartal’s procedure:
Here are some examples that might clarify the situation:
Suppose we choose a random radius R uniformly from [r/4, r/2], but independently for each center. Then in the following example, each time a leaf is picked as a center, the unit-length edge (u,v) is cut with probability 2/r. If there are n leaves, the edge will be cut with probability Theta(min(n/r, 1)), which is much more than the O(log n)/r we are aiming for.
Now, suppose we choose a single radius R uniformly from [r/4,r/2], but choose the centers in an arbitrary (adversarial) order. In the following example, if we choose the centers in the order l_{r/2-1}, …, l_2, l_1, then we cut the edge (u,v) for sure (with probability 1). Note that, by contrast, with a random order there is a good chance that early on in the process we pick some vertex l_i with a small value of i. Since such a leaf is close to the edge (u,v), it will put both u and v in the same set.
So yeah: once we go with a uniformly random choice of R (instead of choosing it from a geometric distribution), we run into trouble if we re-sample R independently each time we grow a region, or if we choose the next vertex to grow from in a worst-case fashion.
Finally, what about growing balls from centers that have already been carved out? That is important for the analysis, because it makes the process depend only very mildly on the past evolution. Would the algorithm break down if we did not do that? I am blanking on a bad example right now, but I think there should be one. Let me know if you see it.
Some citations for k-server
In STOC 2008, Cote, Meyerson, and Poplawski gave a randomized algorithm for the k-server problem on certain special kinds of HSTs that achieved a poly-logarithmic competitive ratio. In SODA 2009, Bansal, Buchbinder, and Naor abstracted out certain “convexity” properties which, if we could prove them, would give a polylogarithmic competitive ratio for general HSTs, and hence for all metric spaces, using the theorems we saw in class. (See also their ICALP 2009 paper.) It’d be great to make progress on this problem.
1. Embeddings into Distributions over Trees
In this section, we prove the following theorem using tree embeddings (and then, in the following section, we improve the guarantee further to $O(\log n)$).
Theorem 1 Given any metric $(V, d)$ with $|V| = n$ and aspect ratio $\Delta = \frac{\max_{x \neq y} d(x,y)}{\min_{x \neq y} d(x,y)}$, there exists an efficiently sampleable distribution $\mathcal{D}$ over spanning trees of the complete graph on $V$ such that for all $x, y \in V$:
- For all trees $T$ in the support of $\mathcal{D}$, $d_T(x,y) \ge d(x,y)$, and
- $E_{T \sim \mathcal{D}}[d_T(x,y)] \le O(\log n \cdot \log \Delta) \cdot d(x,y)$.
To prove this theorem, we will use the idea of a low diameter decomposition. Given a metric space $(V, d)$ on $n$ points and a parameter $\Delta > 0$, a (randomized) low-diameter decomposition is an efficiently sampleable probability distribution over partitions of $V$ into parts $S_1, S_2, \ldots, S_t$ such that
- each part $S_j$ has diameter at most $\Delta$, and
- $\Pr[x, y \text{ lie in different parts}] \le \beta \cdot \frac{d(x,y)}{\Delta}$ for some small $\beta$ (think $\beta = O(\log n)$).
We’ll show how to construct such a decomposition in the next section (next lecture), and use such a decomposition to prove Theorem 1.
Consider the following recursive algorithm, which takes as input a pair $(S, i)$, where $S$ is a set of vertices of diameter at most $2^i$, and returns a rooted tree $(T, r)$.
TreeEmbed$(S, i)$:
- Apply the low-diameter decomposition to $S$ with the parameter $\Delta = 2^{i-1}$ to get the partition $S_1, S_2, \ldots, S_t$.
- Recurse: Let $(T_j, r_j) \leftarrow$ TreeEmbed$(S_j, i-1)$. As a base case, when $S$ is a single point, simply return that point.
- For every tree $T_j$ with $j \ge 2$, add the edge $(r_1, r_j)$ with length $2^i$. This gives a new tree, which we denote $T$.
- Return the tree/root pair $(T, r_1)$.
Recall that since the low diameter decomposition is randomized, this algorithm defines a distribution over trees on the vertex set $V$. To build the tree for $(V, d)$, we first rescale the metric so that $1 < d(x,y) \le \Delta$ for all $x \ne y$, and set $\delta = \lceil \log_2 \Delta \rceil$. We define the distribution $\mathcal{D}$ as the one obtained by calling TreeEmbed$(V, \delta)$.
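Here is a runnable sketch of the whole recursion (Python; the simple random ball-carving standing in for the low-diameter decomposition is a placeholder with the right diameter guarantee, not the specific decomposition constructed later), which also checks that the resulting tree distances dominate the original metric on a random planar point set:

```python
import math
import random

def ldd(S, dist, delta, rng):
    """Stand-in randomized low-diameter decomposition: carve balls of
    radius R in [delta/4, delta/2] around centers in random order.
    Every part is contained in a ball of radius R, so it has
    diameter at most delta."""
    order = list(S)
    rng.shuffle(order)
    R = rng.uniform(delta / 4, delta / 2)
    unassigned, parts = set(S), []
    for c in order:
        part = [v for v in unassigned if dist(c, v) <= R]
        if part:
            parts.append(part)
            unassigned -= set(part)
    return parts

def tree_embed(S, i, dist, adj, rng):
    """TreeEmbed(S, i): S has diameter <= 2**i.  Adds edges of length
    2**i between the roots of the recursively built subtrees, and
    returns the root of the tree on S."""
    if len(S) == 1 or max(dist(x, y) for x in S for y in S) == 0:
        return S[0]
    roots = [tree_embed(p, i - 1, dist, adj, rng)
             for p in ldd(S, dist, 2 ** (i - 1), rng)]
    for rj in roots[1:]:
        adj.setdefault(roots[0], []).append((rj, 2 ** i))
        adj.setdefault(rj, []).append((roots[0], 2 ** i))
    return roots[0]

def tree_dist(adj, s):
    """Distances from s in the (tree) adjacency list, by DFS."""
    d, stack = {s: 0.0}, [s]
    while stack:
        u = stack.pop()
        for v, w in adj.get(u, []):
            if v not in d:
                d[v] = d[u] + w
                stack.append(v)
    return d

rng = random.Random(0)
pts = [(rng.random() * 100, rng.random() * 100) for _ in range(30)]
dist = lambda a, b: math.dist(pts[a], pts[b])
V = list(range(30))
top = math.ceil(math.log2(max(dist(x, y) for x in V for y in V)))
adj = {}
root = tree_embed(V, top, dist, adj, rng)
dt = {u: tree_dist(adj, u) for u in V}
# domination: tree distances are at least the original distances
print(all(dt[u][v] >= dist(u, v) - 1e-9 for u in V for v in V))  # → True
```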
Lemma 2 For all trees $T$ in the support of $\mathcal{D}$, $d_T(x,y) \ge d(x,y)$ for all $x, y \in V$.
Proof: Fix $x$ and $y$, and let $j$ be such that $2^{j-1} < d(x,y) \le 2^j$. Consider the invocation of TreeEmbed$(S, j)$ such that $x \in S$. First, we examine the case in which $y \in S$ as well. By the definition of the low diameter decomposition, since $d(x,y) > 2^{j-1}$, $x$ and $y$ will fall into separate parts of the partition obtained in Step 1, and so we will have $d_T(x,y) \ge 2^j \ge d(x,y)$, where $2^j$ is the length of the edges placed between different subtrees. In the case in which $y \notin S$, it must be that $x$ and $y$ have been separated at a higher level $i > j$ of the recursion, are consequently separated by a higher subtree edge, and hence $d_T(x,y) \ge 2^i \ge d(x,y)$.
Lemma 3 For all $x, y \in V$, $E_{T \sim \mathcal{D}}[d_T(x,y)] \le O(\log n \cdot \log \Delta) \cdot d(x,y)$.
Proof: We begin with two easy subclaims. Suppose TreeEmbed$(S, i)$ is the highest-level invocation whose decomposition separates $x$ and $y$. Then:
- since each point hangs off its root by edges of geometrically decreasing lengths, $d_T(x,y) \le 2\,(2^i + 2^{i-1} + \cdots) \le 2^{i+2}$; and
- by the separation guarantee of the decomposition (run with parameter $2^{i-1}$), this happens with probability at most $\beta \cdot d(x,y)/2^{i-1}$.
We now have from the definition:
$E[d_T(x,y)] \;\le\; \sum_{i \le \delta} \Pr[x,y \text{ separated at level } i] \cdot 2^{i+2} \;\le\; \sum_{i \le \delta} \frac{\beta\, d(x,y)}{2^{i-1}} \cdot 2^{i+2} \;=\; 8\beta\delta \cdot d(x,y),$
where the first inequality follows from our subclaims, and the second follows from the property of the low diameter decomposition. Setting $\beta = O(\log n)$ and $\delta = O(\log \Delta)$ completes the proof.
The two lemmas above prove Theorem 1. How do we implement these low diameter decompositions? And how can we get the promised $O(\log n)$? Keep reading…
2. Low Diameter Decompositions
Recall the definition of a (randomized) low-diameter decomposition from above: given a metric $(V, d)$ and a bound $\Delta$, we want a partition with pieces of diameter at most $\Delta$, and want vertices to be separated with “small” probability (i.e., proportional to their distance, and inversely proportional to $\Delta$: at most $\beta \cdot d(x,y)/\Delta$).
Before we proceed, think about how you’d get such a decomposition for a line metric, or a tree metric, with $\beta = O(1)$; moreover, you cannot hope to get sub-constant $\beta$ even for the line. So the theorem below says that general metrics lose only an $O(\log n)$ factor more, which is not bad at all! (And this factor is existentially optimal; we will show a tight example.)
2.1. Algorithm I: Geometric Region Growing
To make our life easier, we’ll assume that all distances in the metric are at least $1$. (We can enforce this via a pre-processing step without much effort; I’ll come back to it.)
The algorithm just picks a “truncated” geometric random radius $R$, carves out a piece of radius $R$ around some vertex, and repeats until the metric is eaten up.
Geom-Regions$(V)$:
- Choose $R \sim \text{Geometric}(p)$ with $p = \min(1/2,\, 4\ln n/\Delta)$; if $R > \Delta/2$, then set $R = \Delta/2$.
- Pick an arbitrary vertex $v \in V$, and set $S = B(v, R) = \{u \in V : d(v, u) \le R\}$.
- Return $\{S\} \cup$ Geom-Regions$(V \setminus S)$.
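In code, the carving loop looks like this (a Python sketch; the bias $p = \min(1/2, 4\ln n/\Delta)$ and the random planar point set are illustrative choices):

```python
import math
import random

def geom_regions(V, dist, delta, rng):
    """Carve V into pieces: sample a truncated Geometric(p) radius,
    cut out a ball around an arbitrary remaining vertex, repeat."""
    n = len(V)
    p = min(0.5, 4 * math.log(n) / delta)
    remaining, parts = list(V), []
    while remaining:
        R, cap = 0, delta / 2
        while R < cap and rng.random() > p:   # tails: grow radius by one
            R += 1                             # heads or cap: stop growing
        c = remaining[0]                       # arbitrary center
        part = [v for v in remaining if dist(c, v) <= R]
        parts.append((c, part))
        remaining = [v for v in remaining if dist(c, v) > R]
    return parts

rng = random.Random(1)
pts = [(rng.random() * 50, rng.random() * 50) for _ in range(40)]
dist = lambda a, b: math.dist(pts[a], pts[b])
delta = 20.0
parts = geom_regions(range(40), dist, delta, rng)
# every piece has radius at most delta/2 around its center
print(all(dist(c, v) <= delta / 2 for c, part in parts for v in part))  # → True
```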
Clearly, the radius bound is maintained by the fact that $R \le \Delta/2$ with probability $1$.
What’s the chance that $x, y$ lie in separate parts? Let’s view this process as picking a vertex and starting with a ball of radius zero around it; then we flip a coin with bias $p$, increasing the radius by one after each tails, until either we see a heads or we reach $\Delta/2$ tails, at which point we cut out the piece. And then we pick another vertex, and repeat the process.
Consider the first time when one of $x, y$ lies in the current ball. Note that this ball will either eventually contain both of them, or will separate them. And to separate them, it must make a cut within the next $d(x,y)$ steps. The chance of this is at most the chance of seeing a heads from a bias-$p$ coin in $d(x,y)$ steps, plus the chance that a $\text{Geometric}(p)$ r.v. sees more than $\Delta/2$ tails in a row (which is when the truncation kicks in). Using a naive union bound for the former, we get
$\Pr[x, y \text{ separated}] \;\le\; p \cdot d(x,y) + (1-p)^{\Delta/2}.$
We now use the fact that all distances are at least $1$ to claim that $(1-p)^{\Delta/2} \le e^{-p\Delta/2} = 1/n^2 \le d(x,y)/\Delta$ (once the pre-processing below ensures $\Delta \le n^2$), and hence the probability of $x, y$ being separated is at most $O(\log n) \cdot d(x,y)/\Delta$, which proves the second property of the decomposition.
Finally, the loose ends: to enforce the minimum-distance condition, just think of the metric as a complete graph with edge-lengths $\ell_{xy} = d(x,y)$, contract all edges with $\ell_{xy} \le \Delta/n^2$, and recompute shortest-path distances to get the new metric $d'$. Running the decomposition Geom-Regions on this shrunk metric (rescaled so that the minimum distance is $1$, which makes the effective aspect ratio at most $n^2$), and then un-shrinking the edges, will ensure that each pair is separated with probability either $0$ (if it was contracted), or at most $O(\log n) \cdot d(x,y)/\Delta$. And finally, since the output had radius at most $\Delta/2$ according to $d'$, and any shortest path has at most $n$ nodes so its length can change by at most $n \cdot \Delta/n^2 = \Delta/n$ under un-shrinking, the new radius is at most $\Delta(1/2 + 1/n)$ — close enough!
Another advantage of this shrinking pre-processing: at scale $\Delta$, a pair is never separated when $d(x,y) \le \Delta/n^2$, and it is separated for sure when $d(x,y) > \Delta$. Using this observation in the calculation from the previous section — each pair now really contributes to only $O(\log n)$ of the $\delta$ scales — can change the $O(\log n \cdot \log \Delta)$ to just $O(\log^2 n)$. But to get the ultimate $O(\log n)$ guarantee, we’ll need a different decomposition procedure.
2.2. Algorithm II: The CKR Decomposition
Theorem 4 (The Better Decomposition) There exists an efficiently sampleable probability distribution over partitions with parts having radius at most $\Delta/2$ such that
$\Pr[x, y \text{ separated}] \;\le\; \frac{O(d(x,y))}{\Delta} \cdot \log \frac{|B(x, \Delta)|}{|B(x, \Delta/8)|},$
where $B(x, r) = \{z \in V : d(x, z) \le r\}$ denotes the ball of radius $r$ around $x$.
The procedure for the decomposition is a little less intuitive, but very easy to state:
CKR Decomposition:
- Choose a radius $R \in [\Delta/4, \Delta/2]$ uniformly at random.
- Choose a permutation $\pi$ of $V$ uniformly at random.
- Consider the vertices one by one, in the order given by $\pi$. When we consider $w$, we assign all the yet-unassigned vertices $v$ with $d(w, v) \le R$ to $w$’s partition.
For example, suppose the vertices are visited in the order given by $\pi$. The figure below illustrates the coverage as each successive vertex claims the yet-unassigned vertices in its ball.
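The three steps above are short enough to run directly (a Python sketch; the random planar point set and $\Delta = 12$ are illustrative choices):

```python
import math
import random

def ckr_partition(V, dist, delta, rng):
    """CKR: one global radius R ~ Uniform[delta/4, delta/2] and one
    random permutation; each vertex, in permutation order, claims
    the yet-unassigned vertices within distance R of it."""
    R = rng.uniform(delta / 4, delta / 2)
    order = list(V)
    rng.shuffle(order)
    assigned = {}                  # vertex -> the center that claimed it
    for w in order:
        for v in V:
            if v not in assigned and dist(w, v) <= R:
                assigned[v] = w
    return assigned, R

rng = random.Random(2)
pts = [(rng.random() * 30, rng.random() * 30) for _ in range(25)]
dist = lambda a, b: math.dist(pts[a], pts[b])
V = list(range(25))
assigned, R = ckr_partition(V, dist, 12.0, rng)
# every vertex is assigned, to a center at distance at most R <= delta/2
print(len(assigned) == 25 and all(dist(assigned[v], v) <= R for v in V))  # → True
```

Note that each vertex is assigned by the time its own turn comes (it is within distance 0 ≤ R of itself), so the output is always a full partition.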
This construction directly implies the low-radius property, restated in the following claim.
Lemma 5 (Low Radius) The output of the algorithm has the property that for each part $S$, there exists $w \in V$ such that for all $v \in S$, $d(w, v) \le R \le \Delta/2$.
The real work is in showing that each pair $(x, y)$ is separated with small probability. Before proving this, let us state two definitions useful for the proof. For the analysis only: suppose we re-number the vertices $w_1, w_2, \ldots, w_n$ in increasing order of their distance from the closer of $x, y$. Say that $w_j$ settles the pair $(x, y)$ if it is the first vertex in the order $\pi$ to assign at least one of $x, y$ to its part; say that $w_j$ cuts the pair if it settles it and assigns exactly one of $x, y$ to its part.
According to these definitions, each pair is settled at exactly one time instant in the procedure, and it may or may not be cut at that time. Of course, once the pair is settled (with or without being cut), it is never cut in the future.
Now to bound the separation probability. Consider $w_j$, and let $a = d(w_j, x)$ and $b = d(w_j, y)$. Assume $a \le b$ (the other case is identical). If $w_j$ cuts $(x, y)$ when the random values are $R$ and $\pi$, the following two properties must hold:
- $R \in [a, b)$, so that $w_j$’s ball grabs $x$ but not $y$; and
- $w_j$ comes before $w_1, \ldots, w_{j-1}$ in the permutation $\pi$.
Suppose not, and one of them came before $w_j$ in the permutation. Since all these vertices are closer to the pair than $w_j$ is, then for the current value of $R$, one of them would have settled the pair (either capturing one or both of the endpoints) at some previous time point, and hence $w_j$ would not settle—and hence not cut—the pair $(x, y)$.
With these two properties, and using the independence of $R$ and $\pi$, we establish
$\Pr[w_j \text{ cuts } (x,y)] \;\le\; \Pr[R \in [a,b)] \cdot \Pr[w_j \text{ first among } w_1, \ldots, w_j] \;\le\; \frac{b-a}{\Delta/4} \cdot \frac{1}{j} \;\le\; \frac{4\, d(x,y)}{\Delta} \cdot \frac{1}{j},$
and summing over $j$ gives $\Pr[x,y \text{ separated}] \le \frac{4\, d(x,y)}{\Delta} H_n = O(\log n) \cdot \frac{d(x,y)}{\Delta}$.
But we wanted to do better than that! No worries, the fix is easy, but clever. First, note that if $d(x,y) \ge \Delta/8$ then the probability of separating $x, y$ is trivially at most $1 \le 8\, d(x,y)/\Delta$. So suppose $d(x,y) < \Delta/8$. Now, for $w_j$ to cut $(x,y)$, it is not enough that $R \in [a, b)$ and $w_j$ comes before all $w_i$ for $i < j$. Since $R \le \Delta/2$ always, $w_j$ must also be at distance at most $\Delta/2$ from the closer of the pair (say $x$) to even reach one of the vertices, let alone separate them. And since $R \ge \Delta/4$ always, $w_j$ must be at distance at least $\Delta/4$ from the further one (say $y$) so that some setting of $R$ would have a chance to separate the two. So the distance of $w_j$ from $x$ must be at most $\Delta/2$, and at least $\Delta/4 - d(x,y) \ge \Delta/8$, and essentially the same bounds hold for its distance from $y$. If we restrict the harmonic sum in the final expression to just the vertices $w_j$ that satisfy these bounds, we get the bound
$\Pr[x,y \text{ separated}] \;\le\; \frac{4\, d(x,y)}{\Delta} \sum_{j = |B(x, \Delta/8)| + 1}^{|B(x, \Delta/2)|} \frac{1}{j} \;\le\; \frac{O(d(x,y))}{\Delta} \cdot \log \frac{|B(x, \Delta)|}{|B(x, \Delta/8)|},$
and hence the bound in Theorem 4.
Theorem 6 (FRT 2003) Using the decomposition procedure from Theorem 4 in the TreeEmbed algorithm, we get that for all $x, y \in V$:
$E_{T \sim \mathcal{D}}[d_T(x,y)] \;\le\; O(\log n) \cdot d(x,y).$
The proof for the TreeEmbed algorithm remains essentially unchanged, except for the final calculations:
$E[d_T(x,y)] \;\le\; \sum_{i \le \delta} 2^{i+2} \cdot \frac{O(d(x,y))}{2^{i-1}} \cdot \log \frac{|B(x, 2^{i-1})|}{|B(x, 2^{i-4})|} \;=\; O(d(x,y)) \cdot \sum_{i \le \delta} \left( \log |B(x, 2^{i-1})| - \log |B(x, 2^{i-4})| \right) \;=\; O(\log n) \cdot d(x,y),$
where the last equality follows from observing that we have a telescoping sum: each of the three interleaved chains of $\log |B(x, \cdot)|$ terms telescopes to at most $\log n$.
Citations: The $O(\log n \cdot \log \Delta)$ construction was due to Yair Bartal (1996); this substantially improved on the first non-trivial guarantee of $2^{O(\sqrt{\log n \log\log n})}$ due to Alon, Karp, Peleg and West (1992). The geometric-region low-diameter decomposition is also from Bartal. The $O(\log n)$ algorithm is by Fakcharoenphol, Rao, and Talwar (2003), based on the improved decomposition scheme due to Calinescu, Karloff and Rabani (2000).
3. Lower Bounds
Let us show two lower bounds: first, that no randomized low-diameter decomposition can achieve $\beta = o(\log n)$ for general metrics; and second, that no random tree embedding can achieve expected stretch $o(\log n)$ either.
3.1. Lower Bounds for Decompositions
First, given a graph $G$ with $m$ unit-length edges, if we apply a decomposition with parameter $\Delta$ to the graph metric $d_G$, we will cut each edge with probability at most $\beta/\Delta$. The expected number of cut edges will be at most $\beta m/\Delta$. So, for each $\Delta$, the probabilistic method says there exists a diameter-$\Delta$ partition that cuts at most $\beta m/\Delta$ edges.
Let $G$ be a graph with $n$ nodes and $m = 2n$ edges, where the girth of the graph (the length of the shortest simple cycle) is at least $g = c \log n$ (for some constant $c > 0$). Such graphs are known to exist; this can be shown by the probabilistic method.
Now, if we set $\Delta = g/3$ and consider any diameter-$\Delta$ partition: we claim no set in this partition can induce a cycle. Indeed, since every cycle is of length at least $g$, two far-apart points on any cycle would be at distance at least $g/2 > \Delta$ from each other (a shorter path between them would close a cycle of length less than $g$). So all sets induce forests, which means the number of internal edges is at most $n - 1$. This means at least $m - n + 1 = n + 1$ edges are cut.
Cool. For every diameter-$\Delta$ partition, at least $n$ edges are cut because of the large girth property. But there exists one that cuts at most $\beta m/\Delta = 2\beta n/\Delta$ edges, because we have a good decomposition algorithm. Putting the two facts together: $2\beta n/\Delta \ge n$, and hence $\beta \ge \Delta/2 = g/6 = \Omega(\log n)$.
3.2. Lower Bounds for Random Tree Embeddings
Suppose there is a distribution over trees that achieves expected stretch $\alpha$ for the large-girth graphs above. Let’s use this to obtain a low-diameter decomposition with cutting parameter $\beta = O(\alpha)$; since we just showed $\beta = \Omega(\log n)$, this will mean $\alpha = \Omega(\log n)$.
Sample a tree $T$ from the distribution, pick an arbitrary vertex $r$, and pick a random value $\eta$ uniformly from $[0, \Delta/2)$. Delete all edges of $T$ that contain points at tree-distance exactly $\eta + i \cdot \Delta/2$ from $r$ (for integers $i$). The remaining forest has components with radius at most $\Delta/2$, and hence diameter at most $\Delta$, in the tree. Since distances in the original graph are only smaller, the diameter of each part will only be less in the original graph.
Moreover, given the tree $T$, a pair $x, y$ will be separated with probability at most $d_T(x,y)/(\Delta/2)$. Taking expectations over $T$, the total probability of $x, y$ being separated is at most
$E_T\!\left[\frac{2\, d_T(x,y)}{\Delta}\right] \;\le\; \frac{2\alpha \cdot d(x,y)}{\Delta}.$
So we have a decomposition scheme with parameter $\beta = 2\alpha$. And combining this with the previous lower bound $\beta = \Omega(\log n)$ on any decomposition scheme, we get $\alpha = \Omega(\log n)$.
1. Metric Spaces
A metric space is a set $V$ of points, with a distance function $d : V \times V \to \mathbb{R}_{\ge 0}$ that satisfies $d(x,y) = 0$ if and only if $x = y$, symmetry (i.e., $d(x,y) = d(y,x)$), and the triangle inequality (i.e., $d(x,y) \le d(x,z) + d(z,y)$ for all $x, y, z$). Most of the computer science applications deal with finite metrics, and then $n$ denotes the number of points $|V|$.
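As a tiny sanity check (my own illustration, not from the notes), here is code that verifies these axioms for a distance matrix:

```python
from itertools import product

def is_metric(d):
    """Check the metric axioms for an n x n distance matrix d."""
    pts = range(len(d))
    return (all(d[x][x] == 0 for x in pts)
            and all(d[x][y] > 0 for x in pts for y in pts if x != y)
            and all(d[x][y] == d[y][x] for x, y in product(pts, pts))
            and all(d[x][y] <= d[x][z] + d[z][y]
                    for x, y, z in product(pts, pts, pts)))

# shortest-path distances of a 4-cycle with unit edges form a metric
cycle4 = [[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]
print(is_metric(cycle4))   # → True
bad = [[0, 5, 1], [5, 0, 1], [1, 1, 0]]  # 5 > 1 + 1 breaks the triangle inequality
print(is_metric(bad))      # → False
```

Shortest-path distances of any graph with positive edge lengths always satisfy these axioms, which is the source of most of the metrics below.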
There are many popular problems which are defined on metric spaces:
The best approximation algorithm for the (graphic) TSP problem is a $(3/2 - \epsilon_0)$-approximation due to Oveis Gharan, Saberi and Singh (2010). Their paper uses randomization to beat the $3/2$-approximation of Christofides (1976), and makes progress on this long-standing open problem. The best hardness result for this problem is something like $220/219$, due to Papadimitriou and Vempala.
1.1. Approximating Metrics by Trees: Attempt I
A special kind of metric space is a tree metric: here we are given a tree $T = (V, E)$ where each edge $e$ has a length $\ell_e \ge 0$. This defines a metric $(V, d_T)$, where the distance $d_T(x,y)$ is the length of the (unique) path between $x$ and $y$ in the tree, according to the edge lengths $\ell$. In general, given any graph $G$ with edge lengths, we get the shortest-path metric $(V, d_G)$.
Tree metrics are especially nice because we can use the graph theoretic idea that it is “generated” by a tree to understand the structure of the metric better, and hence give better algorithms for problems on tree metrics. For instance:
So if all metric spaces were well-approximable by trees—e.g., if there were some small factor $\alpha$ such that for every metric $(V, d)$ we could find a tree $T$ such that
$d(x,y) \;\le\; d_T(x,y) \;\le\; \alpha \cdot d(x,y) \qquad (1)$
for every $x, y \in V$—then we would have an $O(\alpha)$-approximation for TSP and $k$-median, and an $O(\alpha)$-competitive algorithm for $k$-server, on all metrics. Sadly, this is not the case: for the metric generated by the cycle graph $C_n$, the best factor we can get in (1) is $\alpha = \Omega(n)$. This is what we would get if we just approximated the cycle by a line.
So even for simple metrics like that generated by the cycle (on which we can solve these problems really easily), this approach hits a dead-end really fast. Pity.
1.2. Approximating Metrics by Trees: Attempt II
Here’s where randomization comes to our help; let’s illustrate the idea on a cycle. Suppose we delete a uniformly random edge of the cycle $C_n$: we get a tree $T$ (in fact, a line). Note that the distances in the line are at least those in the cycle.
How much more? For two vertices adjacent in the cycle, the edge between them still exists in the tree with probability $1 - 1/n$, in which case $d_T = 1$; else, with probability $1/n$, they lie at distance $n - 1$ from each other. So the expected distance between the endpoints of an edge of the cycle is
$E_T[d_T(x,y)] \;=\; \left(1 - \tfrac{1}{n}\right) \cdot 1 + \tfrac{1}{n} \cdot (n-1) \;\le\; 2.$
And indeed, the same bound $E_T[d_T(x,y)] \le 2\, d(x,y)$ also holds for any pair $x, y$ (check!),
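This claim is easy to verify exhaustively for small $n$ — a short Python check (with $n = 9$ as an arbitrary choice) that deleting a uniformly random edge stretches every pair of $C_n$ by a factor of at most $2$ in expectation:

```python
n = 9
cycle = lambda x, y: min((x - y) % n, (y - x) % n)   # cycle distance

def expected_line_dist(x, y):
    """Average, over the n choices of deleted edge, of the distance
    between x and y on the resulting line."""
    total = 0
    for e in range(n):  # delete the edge between vertices e and e+1
        # position of each vertex along the line starting after the cut
        pos = lambda v: (v - e - 1) % n
        total += abs(pos(x) - pos(y))
    return total / n

worst = max(expected_line_dist(x, y) / cycle(x, y)
            for x in range(n) for y in range(n) if x != y)
print(worst <= 2)  # → True: expected stretch at most 2 for every pair
```

For a pair at cycle-distance $k$, the exact expectation works out to $2k(n-k)/n \le 2k$, matching the adjacent-pair calculation above.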
But is this any good for us?
Suppose we wanted to solve $k$-median on the cycle, and let $F^*$ be the optimal solution. For each vertex $v$, let $F^*(v)$ be the closest facility in $F^*$ to $v$; hence the cost of the solution is:
$OPT \;=\; \sum_v d(v, F^*(v)).$
By the expected stretch guarantee, we get that
$E_T\Big[\sum_v d_T(v, F^*(v))\Big] \;\le\; 2 \sum_v d(v, F^*(v)) \;=\; 2\, OPT.$
I.e., the expected cost of this solution on the random tree is at most $2\, OPT$. And hence, if $OPT_T$ is the cost of the optimal solution on $T$, we get $E_T[OPT_T] \le 2\, OPT$.
Great—we know that the optimal solution on the random tree does not cost too much. And we know we can find the optimal solution on trees in poly-time.
Let’s say $F_T$ is the optimal solution for the tree $T$, where the closest facility in $F_T$ to $v$ is $F_T(v)$, giving $OPT_T = \sum_v d_T(v, F_T(v))$. How does this solution perform back on the cycle? Well, each distance in the cycle is at most that in the tree $T$, so the expected cost of the solution $F_T$ on the cycle will be
$E_T\Big[\sum_v d(v, F_T(v))\Big] \;\le\; E_T\Big[\sum_v d_T(v, F_T(v))\Big] \;=\; E_T[OPT_T] \;\le\; 2\, OPT.$
And we have a randomized $2$-approximation for $k$-median on the cycle!
1.3. Popping the Stack
To recap, here’s the algorithm: pick a random tree $T$ from some nice distribution $\mathcal{D}$. Find an optimal solution $F_T$ for the problem, using distances according to the tree $T$, and output this set as the solution for the original metric.
And what did we use to show this was a good solution? That we had a distribution $\mathcal{D}$ over trees such that
- $d_T(x,y) \ge d(x,y)$ for every tree $T$ in the support, and
- $E_{T \sim \mathcal{D}}[d_T(x,y)] \le \alpha \cdot d(x,y)$ for some small $\alpha$.
And last but not least, that we could actually solve (or well-approximate) the problem on tree metrics in polynomial time.
Note that TSP, $k$-median, $k$-server, and many other metric problems have cost functions that are linear in the distances, so as long as the metrics we care about can be “embedded into random trees” with small $\alpha$, we can translate algorithms on trees for these problems into (randomized) algorithms for general metrics! This approach gets used all the time, and is worth remembering. (BTW, note that this general approach does not work for non-linear objective functions, like $k$-center, or $k$-means.)
But can we get a small $\alpha$ in general? In the next section, we show that for any $n$-point metric with aspect ratio $\Delta$, we can get $\alpha = O(\log n \cdot \log \Delta)$; and we indicate how to improve this to $\alpha = O(\log n)$, which is the best possible!