Lecture #17: Dimension Reduction
Today we’ll talk about dimensionality reduction, and some related topics in data streaming.
1. Dimension Reduction
Suppose we are given a set of $n$ points $\{x_1, x_2, \ldots, x_n\}$ in $\mathbb{R}^D$. How small can we make $D$ and still maintain the Euclidean distances between the points? Clearly, we can always make $D = n-1$, since any set of $n$ points lies in an $(n-1)$-dimensional affine subspace. And this is (existentially) tight: e.g., the case when $x_2 - x_1, x_3 - x_1, \ldots, x_n - x_1$ are all orthogonal vectors.
But what if we were OK with the distances being approximately preserved? In HW#3, you saw that while there could only be $D$ orthogonal unit vectors in $\mathbb{R}^D$, there could be as many as $\exp(c\varepsilon^2 D)$ unit vectors (for some constant $c > 0$) which are $\varepsilon$-orthogonal, i.e., whose mutual inner products all lie in $[-\varepsilon, \varepsilon]$. Near-orthogonality allows us to pack exponentially more vectors!
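Put another way, note that for unit vectors $a, b$,
$$ \|a - b\|_2^2 = \langle a - b, a - b \rangle = \|a\|_2^2 + \|b\|_2^2 - 2\langle a, b \rangle = 2 - 2\langle a, b \rangle. $$
And hence the squared Euclidean distance between any pair of the points defined by these $\varepsilon$-orthogonal vectors falls in $2(1 \pm \varepsilon)$. So, if we wanted $n$ points exactly at unit (Euclidean) distance from each other, we would need $n-1$ dimensions. (Think of a triangle in $2$-dims.) But if we wanted to pack in $n$ points which were at distance $(1 \pm \varepsilon)$ from each other, we could pack them into $O(\varepsilon^{-2} \log n)$ dimensions.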
1.1. The Johnson Lindenstrauss lemma
The Johnson Lindenstrauss “flattening” lemma says that such a claim is true not just for equidistant points, but for any set of points in Euclidean space:
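Lemma 1 Let $\varepsilon \in (0, 1/2)$. Given any set of points $X = \{x_1, x_2, \ldots, x_n\}$ in $\mathbb{R}^D$, there exists a map $A: \mathbb{R}^D \to \mathbb{R}^k$ with $k = O(\varepsilon^{-2} \log n)$ such that
$$ 1 - \varepsilon \leq \frac{\|A(x_i) - A(x_j)\|_2^2}{\|x_i - x_j\|_2^2} \leq 1 + \varepsilon \qquad \text{for all } x_i \neq x_j \in X. $$
Note that the target dimension $k$ is independent of the original dimension $D$, and depends only on the number of points $n$ and the accuracy parameter $\varepsilon$.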
This lemma is essentially tight: it is easy to see that we need at least $\Omega(\log n)$ dimensions using a packing argument. Noga Alon showed a lower bound of $\Omega\left(\frac{\log n}{\varepsilon^2 \log(1/\varepsilon)}\right)$.
1.2. The construction
The JL lemma is pretty surprising, but the construction of the map is perhaps even more surprising: it is a super-simple random construction. Let $M$ be a $k \times D$ matrix, such that every entry of $M$ is filled with an i.i.d. draw from a standard normal $N(0,1)$ distribution (a.k.a. the “Gaussian” distribution). For $x \in \mathbb{R}^D$, define
$$ A(x) = \frac{1}{\sqrt{k}} M x. $$
That’s it. You hit the vector $x$ with a Gaussian matrix $M$, and scale it down by $\sqrt{k}$. That’s the map $A$. Note that it is a linear map: $A(x) + A(y) = A(x + y)$. So suppose we could show the following lemma:
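Lemma 2 Let $\varepsilon \in (0, 1/2)$. If $A$ is constructed as above with $k = c\,\varepsilon^{-2} \log \delta^{-1}$ for a suitable constant $c$, and $x \in \mathbb{R}^D$ is a unit vector, then
$$ \Pr\left[ \|A(x)\|_2^2 \in (1 \pm \varepsilon) \right] \geq 1 - \delta. $$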
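Then we’d get a proof of Lemma 1. Indeed, set $\delta = 1/n^2$, and hence $k = O(\varepsilon^{-2} \log n)$. Now for each $x_i, x_j \in X$ we get that the squared length of $x_i - x_j$ is maintained to within $1 \pm \varepsilon$ with probability at least $1 - 1/n^2$. (Apply Lemma 2 to the unit vector in the direction of $x_i - x_j$ and use the linearity of $A$.) By a union bound, all $\binom{n}{2}$ pairwise distances among the points of $X$ are maintained with probability at least $1 - \binom{n}{2} \frac{1}{n^2} \geq \frac12$. This proves Lemma 1.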
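The construction is short enough to try out numerically. Here is a minimal sketch in plain Python, assuming only the definition of the map as a Gaussian matrix scaled by $1/\sqrt{k}$; the names `gaussian_matrix`, `jl_map`, `max_distortion` and the sizes `n`, `D`, `k` are illustrative choices, not part of the lecture.

```python
import math
import random

def gaussian_matrix(k, D, rng):
    """k x D matrix whose entries are i.i.d. N(0,1) draws."""
    return [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(k)]

def jl_map(M, x):
    """A(x) = (1/sqrt(k)) * M x."""
    k = len(M)
    scale = 1.0 / math.sqrt(k)
    return [scale * sum(m * xj for m, xj in zip(row, x)) for row in M]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def max_distortion(points, images):
    """Largest |ratio - 1| over all pairs, where
    ratio = ||A(xi) - A(xj)||^2 / ||xi - xj||^2."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratio = sq_dist(images[i], images[j]) / sq_dist(points[i], points[j])
            worst = max(worst, abs(ratio - 1.0))
    return worst

rng = random.Random(0)
n, D, k = 15, 500, 300          # illustrative sizes only
points = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(n)]
M = gaussian_matrix(k, D, rng)
images = [jl_map(M, x) for x in points]
print("worst relative error in squared distance:", max_distortion(points, images))
```

The matrix is drawn without looking at `points`, so the same `M` can be reused for any other point set of similar size.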
A few comments about this construction:
- The above proof shows not only the existence of a good map, we also get that a random map as above works with constant probability! In other words, a Monte-Carlo randomized algorithm for dimension reduction. (Since we can efficiently check that the distances are preserved to within the prescribed bounds, we can convert this into a Las Vegas algorithm.)
- The algorithm (at least the Monte Carlo version) does not even look at the set of points $X$: it works for any set $X$ with high probability. Hence, we can pick this map $A$ before the points in $X$ arrive.
- Given a set $X$, one can get deterministic poly-time algorithms constructing a dimension reduction map $A$ for $X$: the first one was given in this paper of Lars Engebretsen, Piotr Indyk and Ryan O’Donnell; another construction is due to D. Sivakumar.
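The Monte Carlo to Las Vegas conversion mentioned in the first comment can be sketched as a resample-and-check loop. This is only an illustration under assumed parameters: the helper names, the cap `max_tries`, and the concrete values of `k` and `eps` are hypothetical choices, not something the lecture prescribes.

```python
import math
import random

def project(M, x):
    """Apply A(x) = M x / sqrt(k) for a k x D matrix M."""
    k = len(M)
    return [sum(m * xj for m, xj in zip(row, x)) / math.sqrt(k) for row in M]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def preserves_all(points, images, eps):
    """Check that every pairwise squared distance is kept within 1 +/- eps."""
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = sq_dist(points[i], points[j])
            d_img = sq_dist(images[i], images[j])
            if not (1 - eps) * d <= d_img <= (1 + eps) * d:
                return False
    return True

def las_vegas_jl(points, k, eps, rng, max_tries=100):
    """Resample the Gaussian matrix until the check passes.
    Returns (M, images, number_of_tries)."""
    D = len(points[0])
    for tries in range(1, max_tries + 1):
        M = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(k)]
        images = [project(M, x) for x in points]
        if preserves_all(points, images, eps):
            return M, images, tries
    raise RuntimeError("no good map found; k is probably too small for this eps")

rng = random.Random(1)
points = [[rng.uniform(-1.0, 1.0) for _ in range(300)] for _ in range(10)]
M, images, tries = las_vegas_jl(points, k=200, eps=0.5, rng=rng)
print("tries used:", tries)
```

Since each attempt succeeds with constant probability, the expected number of resamples is constant; the output is always correct, and only the running time is random.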
A SODA 2011 paper of T.S. Jayram and David Woodruff shows that this dependence of $k$ on $\varepsilon$ and $\delta$ in Lemma 2 is the best possible. Note that if we use this approach, via the union bound, to prove JL, then $k = \Omega(\varepsilon^{-2} \log n)$ is the best bound possible. (An earlier version of these notes incorrectly claimed that the Jayram-Woodruff paper also showed an unconditional lower bound for JL; thanks to Jelani for pointing out the mistake.)
1.3. The proof
Now, on to the proof of Lemma 2. Here’s the main idea. Imagine that the vector $x$ we’re considering is just the elementary unit vector $e_1 = (1, 0, \ldots, 0)$. Then $M e_1$ is just a vector with independent and identical Gaussian values, and we’re interested in its squared length: the sum of squares of these Gaussians. If these were bounded r.v.s, we’d be done, but they are not. However, their tails are very small, so things should work out.
But what’s a Gaussian $N(0,1)$? Well, it looks like this:
[Figure: sketch of the standard Gaussian density.]
Which is not too different from this (bounded) random variable, if you squint a bit:
[Figure: sketch of a bounded random variable.]
Its square has constant mean. So, if we take a sum of a bunch of such random variables (actually of their squares), it should behave pretty much like its mean (which is $k$), because of a Chernoff-like argument. And so the expected squared length is $k$, i.e., the length is close to $\sqrt{k}$, which explains the division by $\sqrt{k}$.
Now we just need to make all this precise, and remove the assumption that the vector $x$ was just $e_1$. That’s what the rest of the formal proof does: it has a few steps, but each of them is fairly elementary.
1.4. The proof, this time for real
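We’ll be using basic facts about Gaussians, so let’s just recall them. The probability density function for the Gaussian $N(\mu, \sigma^2)$ is
$$ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x - \mu)^2/(2\sigma^2)}. $$
We also use the following; the proof just needs some elbow grease.
Proposition 3 If $G_1 \sim N(\mu_1, \sigma_1^2)$ and $G_2 \sim N(\mu_2, \sigma_2^2)$ are independent, then for any $c \in \mathbb{R}$,
$$ c\,G_1 \sim N(c\mu_1, c^2\sigma_1^2) \qquad \text{and} \qquad G_1 + G_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2). $$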
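Recall that we want to argue about the squared length of $A(x) \in \mathbb{R}^k$. To start off, observe that each coordinate of the vector $Mx$ behaves like
$$ Y \sim \langle G_1, G_2, \ldots, G_D \rangle \cdot x = \sum_{j=1}^D x_j G_j, $$
where the $G_j$’s are i.i.d. $N(0,1)$ r.v.s. But then the proposition tells us that $Y \sim N(0, x_1^2 + x_2^2 + \ldots + x_D^2)$. And since $x$ is a unit length vector, this is simply $N(0,1)$. So, each of the $k$ coordinates of $Mx$ behaves just like an independent Gaussian!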
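What is the squared length of $A(x) = \frac{1}{\sqrt{k}} Mx$, then? It is
$$ Z := \sum_{i=1}^k \frac{1}{k}\, Y_i^2, $$
where each $Y_i \sim N(0,1)$, independent of the others. And since $E[Y_i^2] = \mathrm{Var}(Y_i) + E[Y_i]^2 = 1$, we get $E[Z] = 1$.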
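The claim that the squared length has mean one for every unit vector (not just an elementary one) is easy to check by simulation. A minimal sketch, assuming only the definitions above; the function name, the dimension 10, `k = 20` and the number of trials are arbitrary illustrative choices.

```python
import math
import random

def squared_length_after_map(x, k, rng):
    """Draw a fresh k x D Gaussian matrix M and return Z = ||M x||^2 / k."""
    total = 0.0
    for _ in range(k):
        # One coordinate of M x: an inner product of x with a fresh Gaussian row.
        y = sum(rng.gauss(0.0, 1.0) * xj for xj in x)
        total += y * y
    return total / k

rng = random.Random(2)
raw = [rng.uniform(-1.0, 1.0) for _ in range(10)]
norm = math.sqrt(sum(c * c for c in raw))
x = [c / norm for c in raw]          # an arbitrary unit vector

trials = 2000
samples = [squared_length_after_map(x, 20, rng) for _ in range(trials)]
mean_Z = sum(samples) / trials
print("empirical mean of Z:", mean_Z)
```

The individual samples still fluctuate quite a bit for small `k`; controlling that fluctuation is exactly what the tail bound is for.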
Now to show that $Z$ does not deviate too much from $1$. And $Z$ is the sum of a bunch of independent and identical random variables. If only the $Y_i^2$’s were all bounded, we could have used a Chernoff bound and be done. But these are not bounded, so this is finally where we’ll need to do a little work. (Note: we could take the easy way out, observe that the squares of Gaussians are chi-squared r.v.s, the sum of $k$ of them is chi-squared with $k$ degrees of freedom, and the internets conveniently has tail bounds for these things. But we digress.)
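So let’s start down the ye olde Chernoff path, for the upper tail, say:
$$ \Pr[ Z \geq 1 + \varepsilon ] \leq \Pr[ e^{tkZ} \geq e^{tk(1+\varepsilon)} ] \leq \frac{E[ e^{tkZ} ]}{e^{tk(1+\varepsilon)}} = \prod_{i=1}^k \frac{E[ e^{tY_i^2} ]}{e^{t(1+\varepsilon)}} $$
for every $t > 0$. And what is $E[e^{tY^2}]$ for $Y \sim N(0,1)$? Let’s calculate it, substituting $z = y\sqrt{1 - 2t}$:
$$ \frac1{\sqrt{2\pi}} \int_y e^{ty^2} e^{-y^2/2}\, dy = \frac1{\sqrt{2\pi}} \int_z e^{-z^2/2} \frac{dz}{\sqrt{1 - 2t}} = \frac{1}{\sqrt{1 - 2t}} $$
for $t < 1/2$. So our current bound on the upper tail is that for all $t \in (0, 1/2)$ we have
$$ \Pr[ Z \geq 1 + \varepsilon ] \leq \left( \frac{1}{e^{t(1+\varepsilon)} \sqrt{1 - 2t}} \right)^k. $$
Let’s just focus on part of this expression:
$$ \frac{1}{e^{t} \sqrt{1 - 2t}} = \exp\left( -t - \tfrac12 \log(1 - 2t) \right) = \exp\left( \frac{(2t)^2}{4} + \frac{(2t)^3}{6} + \cdots \right) \leq \exp\left( t^2 \left( 1 + 2t + (2t)^2 + \cdots \right) \right) = \exp\left( \frac{t^2}{1 - 2t} \right). $$
Plugging this back, we get
$$ \Pr[ Z \geq 1 + \varepsilon ] \leq \left( \frac{1}{e^{t(1+\varepsilon)} \sqrt{1 - 2t}} \right)^k \leq \exp\left( \frac{k t^2}{1 - 2t} - k t \varepsilon \right) \leq e^{-k\varepsilon^2/8}, $$
if we set $t = \varepsilon/4$ and use the fact that $1 - 2t \geq 1/2$ for $\varepsilon \leq 1/2$. (Note: this setting of $t$ also satisfies $t \in (0, 1/2)$, which we needed from our previous calculations.)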
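Almost done: let’s take stock of the situation. We observed that $\|A(x)\|_2^2$ was distributed like a (scaled) sum of squares of Gaussians, and using that we proved that
$$ \Pr\left[ \|A(x)\|_2^2 > 1 + \varepsilon \right] \leq \exp(-k\varepsilon^2/8) \leq \delta/2 $$
for $k = \frac{8}{\varepsilon^2} \ln \frac{2}{\delta}$. A similar calculation bounds the lower tail, and finishes the proof of Lemma 2.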
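To see the numbers behind the upper-tail calculation, the bound can be evaluated directly. This is a small sketch assuming the moment generating function $1/\sqrt{1-2t}$ of a squared standard Gaussian, the choice $t = \varepsilon/4$, and the target dimension $k = (8/\varepsilon^2)\ln(2/\delta)$; the function names and the sample values of `eps` and `delta` are illustrative.

```python
import math

def chernoff_upper_tail_bound(k, eps, t):
    """Evaluate (1 / (e^{t(1+eps)} * sqrt(1 - 2t)))^k, valid for 0 < t < 1/2."""
    if not 0.0 < t < 0.5:
        raise ValueError("need 0 < t < 1/2 for the Gaussian mgf to be finite")
    log_bound = k * (-t * (1.0 + eps) - 0.5 * math.log(1.0 - 2.0 * t))
    return math.exp(log_bound)

def target_dimension(eps, delta):
    """k = (8 / eps^2) * ln(2 / delta), rounded up to an integer."""
    return math.ceil(8.0 / eps ** 2 * math.log(2.0 / delta))

eps, delta = 0.25, 0.01
k = target_dimension(eps, delta)
t = eps / 4.0
exact_bound = chernoff_upper_tail_bound(k, eps, t)
simplified_bound = math.exp(-k * eps ** 2 / 8.0)
print(k, exact_bound, simplified_bound, delta / 2.0)
```

The unsimplified expression is noticeably smaller than $e^{-k\varepsilon^2/8}$, which reflects the slack introduced by the series estimate and by bounding $1 - 2t \geq 1/2$.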
Citations: The JL Lemma was first proved in this paper of Bill Johnson and Joram Lindenstrauss. There have been several proofs after theirs, usually trying to tighten their results, or simplify the algorithm/proof (see citations in some of the newer papers): the proof follows some combinations of the proofs in this STOC ’98 paper of Piotr Indyk and Rajeev Motwani, and this paper by Sanjoy Dasgupta and myself.
2. The data stream model
The JL map we considered was a linear map, and that has many advantages. One of them is that we can use it in a distributed context: if several players each have a vector $x_i \in \mathbb{R}^D$ and each knows the JL matrix $M$, then to compute $A(\sum_i x_i)$ each person can just compute $A(x_i)$, send their answers out, and then someone can sum up the answers to get $\sum_i A(x_i) = A(\sum_i x_i)$. Since these vectors $A(x_i)$ are smaller than $x_i$ (they lie in $\mathbb{R}^k$ instead of $\mathbb{R}^D$), this can result in significant savings in communication. (We need all players to know the matrix $M$, but if they have shared randomness they can generate this matrix themselves.)
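The distributed use of linearity can be sketched as follows. Shared randomness is modelled here by a common seed, which is an assumption of this example, as are the names `shared_matrix`, `SEED` and the toy sizes.

```python
import math
import random

def shared_matrix(seed, k, D):
    """Every player calls this with the same seed and so gets the same matrix M."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(k)]

def jl_map(M, x):
    """A(x) = M x / sqrt(k)."""
    k = len(M)
    return [sum(m * xj for m, xj in zip(row, x)) / math.sqrt(k) for row in M]

SEED, k, D = 42, 8, 50
data_rng = random.Random(7)
vectors = [[data_rng.randint(0, 5) for _ in range(D)] for _ in range(4)]  # one per player

# Each player sketches locally and sends only k numbers instead of D.
messages = [jl_map(shared_matrix(SEED, k, D), x) for x in vectors]
sum_of_sketches = [sum(coords) for coords in zip(*messages)]

# Reference: the sketch of the summed vector, computed centrally.
total = [sum(coords) for coords in zip(*vectors)]
sketch_of_sum = jl_map(shared_matrix(SEED, k, D), total)
print(max(abs(a - b) for a, b in zip(sum_of_sketches, sketch_of_sum)))
```

The two results agree up to floating-point rounding, which is all that linearity promises on real hardware.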
This same idea is useful in the context of data streaming: suppose you have a data stream of a large number of elements whizzing past you, each element $i$ drawn from the universe $[D]$. This stream defines a frequency vector $x \in \mathbb{Z}_{\geq 0}^D$, where $x_i$ is the number of times element $i$ is seen. People working on data streams want to calculate statistics of this vector $x$: e.g., how many non-zeroes does it have? What is the $\ell_1$ length of this? (Duh! it’s just the length of the data stream.) What is $\|x\|_2$? Etc.
The Space Crunch. All this can be trivially done if we use $O(D)$ space to actually store the vector $x$. Suppose we do not want to store the frequency vector explicitly, but are OK with approximate answers. We can use JL or similar schemes to approximately calculate $\|x\|_2^2$. Suppose $M$ is a random $k \times D$ Gaussian matrix, then by the guarantee of the JL lemma, the estimate
$$ \frac{1}{k} \|Mx\|_2^2 \in (1 \pm \varepsilon)\, \|x\|_2^2 $$
with probability $1 - \delta$, if $k = O(\varepsilon^{-2} \log \delta^{-1})$. (Note: this is the error for a single query, so we’re not guaranteeing the counts at all times are close, just at the time the query is made.)
And the algorithm is simple: maintain a vector $v \in \mathbb{R}^k$, initially zero. When the element $i$ comes by, add in the $i^{th}$ column of $M$ to $v$. Finally, answer with $\frac{1}{k} \|v\|_2^2$. (If you have to answer several queries, choose $k$ appropriately larger.)
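A direct rendering of this streaming procedure might look like the sketch below. It deliberately stores the whole Gaussian matrix, so it illustrates the update rule rather than any space saving; the class name, 0-indexed elements, and the test stream are assumptions of this example.

```python
import random

class GaussianF2Sketch:
    """Keeps v = M x for the frequency vector x and answers ||v||^2 / k."""

    def __init__(self, k, D, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Note: all k * D real entries of M are stored explicitly here.
        self.M = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(k)]
        self.v = [0.0] * k

    def update(self, i):
        """Element i (0-indexed) arrives: add column i of M to v."""
        for r in range(self.k):
            self.v[r] += self.M[r][i]

    def estimate(self):
        """Estimate of the squared l2 norm of the frequency vector."""
        return sum(c * c for c in self.v) / self.k

D, k = 50, 400
stream_rng = random.Random(3)
stream = [stream_rng.randrange(D) for _ in range(1000)]

sketch = GaussianF2Sketch(k, D)
freq = [0] * D                      # exact frequencies, kept only for comparison
for i in stream:
    sketch.update(i)
    freq[i] += 1

exact = sum(f * f for f in freq)
print(exact, sketch.estimate())
```

Each update touches all `k` coordinates of `v`, and a query costs one pass over `v`.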
Of course, you’ve realized I am cheating. In order to save space we used JL. But the JL matrix itself uses $kD$ entries, which is a lot of space, much more than the $D$ entries of the frequency vector $x$! Also, we now need to maintain a matrix of reals, whereas $x$ just has integers!
We can handle both issues. The former issue can directly be handled by using a pseudorandom generator that “fools” low-space computation—we will not talk about this solution in this lecture. Instead we’ll give a different (though weaker) solution which handles both issues: it will use less space, and will maintain only integer values (if the input has integers).
3. Using random signs instead of Gaussians
While Gaussians have all kinds of nice properties, they are real-valued distributions and hence require attention to precision. How about populating $M$ with draws from other, simpler distributions? How about setting each $M_{ij} \in_R \{-1, +1\}$, and letting $A = \frac{1}{\sqrt{k}} M$? (A random sign is also called a Rademacher random variable, btw, the name Bernoulli being already taken for a random bit in $\{0, 1\}$.)
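Now, we want to study the properties of
$$ Z := \|A(x)\|_2^2 = \frac{1}{k} \sum_{i=1}^k Y_i^2 \qquad (1) $$
where each $Y_i := \sum_{j=1}^D M_{ij} x_j$. Then
$$ E[Y_i^2] = E\Big[ \sum_j M_{ij}^2 x_j^2 + \sum_{j \neq l} M_{ij} M_{il} x_j x_l \Big] = \sum_j x_j^2 = \|x\|_2^2 $$
if the $M_{ij}$’s are pairwise independent, since $E[M_{ij}^2] = 1$ and $E[M_{ij} M_{il}] = E[M_{ij}]\, E[M_{il}] = 0$ by independence. Plugging this into (1) and recalling that $Z$ is the average of the $Y_i^2$’s, we get
$$ E[Z] = \|x\|_2^2. $$
Just what we like! To show that $Z$ is indeed close to its mean, we will use Chebyshev, and this requires us to compute the variance of $Z$.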
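If the rows of $M$ are independent, then $\mathrm{Var}(\sum_{i=1}^k Y_i^2)$ is the sum of the variances from each row. For a single row, write $Y = \sum_j M_j x_j$ as above (with $M_j$ the $j$-th entry of that row); its contribution is:
$$ \mathrm{Var}(Y^2) = E[Y^4] - E[Y^2]^2. $$
But we already know that $E[Y^2] = \|x\|_2^2$. For the other term,
$$ E[Y^4] = E\Big[ \Big( \sum_j M_j x_j \Big)^4 \Big] = \sum_j x_j^4 + 6 \sum_{j < l} x_j^2 x_l^2. $$
(The other terms disappear because of $4$-wise independence.) And plugging this into the definition of $\mathrm{Var}(Y^2)$, we get
$$ \mathrm{Var}(Y^2) = \sum_j x_j^4 + 6 \sum_{j < l} x_j^2 x_l^2 - \Big( \sum_j x_j^2 \Big)^2 = 4 \sum_{j < l} x_j^2 x_l^2 \leq 2\, \|x\|_2^4 = 2\, E[Y^2]^2. $$
Interesting, the variance is at most twice the squared mean. That’s good, since the variance of $Z$ (which was the final answer, obtained by taking the average of $k$ such variables) is $1/k$ as much, since averaging reduces the variance. So $\mathrm{Var}(Z) \leq \frac{2}{k} E[Z]^2$. And finally, we can set $k = \frac{2}{\varepsilon^2 \delta}$ and use Chebyshev to get
$$ \Pr\big[ |Z - E[Z]| \geq \varepsilon\, E[Z] \big] \leq \frac{\mathrm{Var}(Z)}{\varepsilon^2 E[Z]^2} \leq \frac{2}{k \varepsilon^2} = \delta. $$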
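The second and fourth moments of a random-sign combination can be verified exactly on a small example by enumerating all sign patterns. This sketch uses fully independent signs (which are in particular $4$-wise independent) and exact rational arithmetic; the vector `x` and the function name are arbitrary illustrative choices.

```python
from fractions import Fraction
from itertools import combinations, product

def exact_moments(x):
    """Return (E[Y^2], E[Y^4]) for Y = sum_j s_j x_j, where the signs s_j are
    independent and uniform on {-1, +1}, by enumerating all 2^D sign patterns."""
    total2 = Fraction(0)
    total4 = Fraction(0)
    for signs in product((-1, 1), repeat=len(x)):
        y = sum(s * xj for s, xj in zip(signs, x))
        total2 += y ** 2
        total4 += y ** 4
    count = 2 ** len(x)
    return total2 / count, total4 / count

x = [Fraction(1), Fraction(2), Fraction(-3), Fraction(1, 2)]
m2, m4 = exact_moments(x)

norm_sq = sum(xj ** 2 for xj in x)                       # ||x||^2
cross = sum(x[j] ** 2 * x[l] ** 2
            for j, l in combinations(range(len(x)), 2))  # sum over j < l
variance = m4 - m2 ** 2
print(m2, m4, variance)
```

Because the arithmetic is exact, the identities for the mean, the fourth moment, and the variance can be compared with `==` rather than with a tolerance.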
Great! So, we take a $k \times D$ matrix $M$ whose $k$ rows are independent, each row having $D$ values drawn from a $4$-wise independent sample space. We maintain a $k$-dimensional vector $v$, and whenever an element $i$ in $[D]$ comes by in the stream, we just add in the $i^{th}$ column of $M$ to $v$. And when we want the answer, we reply with $\frac{1}{k} \|v\|_2^2$; this will be within a $(1 \pm \varepsilon)$ factor of $\|x\|_2^2$ with probability at least $1 - \delta$.
Why $4$-wise independence? Well, the calculation of $\mathrm{Var}(Y^2)$ only used the fact that any four entries of each row behaved independently of each other. And it is possible to generate $D$ values from $\{-1, +1\}$ which are $4$-wise independent, using hash functions that require only $O(\log D)$ bits of space. (We’ll talk more about this later in the course.) So the total space usage is: $O(k \log D)$ bits to store the hash functions, $O(k \log (DT))$ bits to store the vector $v$ if the frequency of each element is at most $T$, and that’s it.
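One way to realize this scheme in code is sketched below. The lecture defers the construction of the hash functions, so the choice here is an assumption of this example: a random degree-3 polynomial over the prime field of size $2^{61}-1$ (which is $4$-wise independent on distinct inputs), with the sign read off the low bit of its value (which makes the signs only very nearly unbiased, since the field size is odd). The class names and parameters are hypothetical.

```python
import random

P = (1 << 61) - 1   # a Mersenne prime; assumed field size for the hash family

class FourWiseSign:
    """Sign hash from a random cubic polynomial mod P (an assumed construction).
    Stores only 4 coefficients, i.e. O(log P) bits per row."""

    def __init__(self, rng):
        self.coeffs = [rng.randrange(P) for _ in range(4)]

    def __call__(self, i):
        acc = 0
        for c in self.coeffs:           # Horner evaluation of the polynomial at i
            acc = (acc * i + c) % P
        return 1 if acc & 1 else -1     # low bit -> sign (bias about 1/P)

class SignF2Sketch:
    """k independent rows; v stays integer-valued throughout."""

    def __init__(self, k, seed=0):
        rng = random.Random(seed)
        self.signs = [FourWiseSign(rng) for _ in range(k)]
        self.v = [0] * k

    def update(self, i):
        """Element i arrives: add column i of the implicit sign matrix to v."""
        for r, sign in enumerate(self.signs):
            self.v[r] += sign(i)

    def estimate(self):
        return sum(c * c for c in self.v) / len(self.v)

D, k = 100, 200
stream_rng = random.Random(5)
stream = [stream_rng.randrange(D) for _ in range(1000)]
sketch = SignF2Sketch(k)
freq = [0] * D                          # exact frequencies, for comparison only
for i in stream:
    sketch.update(i)
    freq[i] += 1
exact = sum(f * f for f in freq)
print(exact, sketch.estimate())
```

The matrix is never materialized: column entries are recomputed from the stored coefficients on each update, which is where the space saving comes from.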
Citations: This scheme is due to the Gödel prize winning paper of Noga Alon, Yossi Matias, and Mario Szegedy. There has been a lot of interesting work on moment estimation: see, e.g., this STOC 2011 paper of Daniel Kane, Jelani Nelson, Ely Porat and David Woodruff on getting lower bounds for $\ell_p$-norms of the vector $x$, and the many references therein.
4. Subgaussian Behavior
In the previous section, we saw that if each row of the matrix $M$ was drawn from a $4$-wise independent sample space (and hence generating any column of $M$ could be done in $O(k \log D)$ space), setting $k = O(\varepsilon^{-2} \delta^{-1})$ would suffice to give answers within $(1 \pm \varepsilon)$ with probability at least $1 - \delta$. Note that the number of rows went from $O(\varepsilon^{-2} \log \delta^{-1})$ to $O(\varepsilon^{-2} \delta^{-1})$; this increase is typical of cases where we only use the second moment (and limited independence) instead of all the moments (complete independence).
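To get a feel for the gap between the two row counts, one can tabulate them for a few failure probabilities. The constants used below, $2/(\varepsilon^2\delta)$ for the Chebyshev-based bound and $(8/\varepsilon^2)\ln(2/\delta)$ for the Gaussian one, are assumptions of this sketch, as are the function names.

```python
import math

def rows_chebyshev(eps, delta):
    """Rows needed by the second-moment (Chebyshev) analysis: 2 / (eps^2 * delta)."""
    return math.ceil(2.0 / (eps ** 2 * delta))

def rows_chernoff(eps, delta):
    """Rows needed by the all-moments (Chernoff) analysis: (8 / eps^2) * ln(2 / delta)."""
    return math.ceil(8.0 / eps ** 2 * math.log(2.0 / delta))

eps = 0.1
for delta in (1e-1, 1e-2, 1e-3, 1e-6):
    print(delta, rows_chebyshev(eps, delta), rows_chernoff(eps, delta))
```

With these particular constants the Chebyshev count is actually the smaller one for a moderate failure probability such as $\delta = 0.1$; the logarithmic dependence only wins once $\delta$ gets small.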
So suppose we did have the luxury of full independence, could we match the JL bound using Rademacher matrices? Or does moving to the $\{-1, +1\}$ case already lose something in the performance? It turns out we can also prove Lemma 2 for a Rademacher matrix, losing only constants, and we’ll now prove this.
Let’s look over the proof in Section 1, and see what we need to do. We take an arbitrary unit vector $x$, and define $Y_i := \sum_j M_{ij} x_j$ for $i = 1, \ldots, k$, the $M_{ij}$’s being i.i.d. and uniform on $\{-1, +1\}$. If we could show that
- $E[Y_i^2] = 1$, and
- $E[e^{tY_i^2}] \leq \frac{1}{\sqrt{1 - Ct}}$ for some constant $C$ and all $t \in (0, 1/C)$,
then the rest of the proof of Section 1 does not use any other facts about Gaussians. And the first fact follows by the calculations from the previous section, so all we need to do is to bound the moment generating function for $Y_i^2$!
We can do this by explicit calculations, but instead let’s give a useful abstraction:
Definition 4 A random variable $V$ is said to be subgaussian with parameter $c$ if for all real $s$, we have $E[e^{sV}] \leq e^{cs^2}$.
(You can define subgaussian-ness alternatively as in these notes by Roman Vershynin, which also shows the two definitions are equivalent for symmetric distributions.) A simple calculation shows that for $G \sim N(0,1)$ we have $E[e^{sG}] = e^{s^2/2}$; good to know that the Gaussian is also subgaussian!
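The following lemma gives a slick way to bound the mgf for the square of a subgaussian, now that we’ve done the hard work for the Gaussians.
Lemma 5 If $V$ is subgaussian with parameter $c$, then for all $t \in (0, \frac{1}{4c})$,
$$ E[e^{tV^2}] \leq \frac{1}{\sqrt{1 - 4ct}}. $$
Proof: Well, suppose $G \sim N(0,1)$ is an independent Gaussian, then $E_G[e^{sG}] = e^{s^2/2}$ by the calculation we just did for Gaussians. (Note that we’ve just introduced a Gaussian into the mix, without any provocation! But it will all work out.) Let’s just rewrite that, with $s = \sqrt{2t}\, V$:
$$ E_V[e^{tV^2}] = E_V\big[ e^{(\sqrt{2t}\, V)^2/2} \big] = E_V\big[ E_G[ e^{\sqrt{2t}\, V G} ] \big] = E_G\big[ E_V[ e^{(\sqrt{2t}\, G) V} ] \big]. $$
Using the $c$-subgaussian behavior of $V$ we bound this by
$$ E_G\big[ e^{c (\sqrt{2t}\, G)^2} \big] = E_G\big[ e^{2ct\, G^2} \big]. $$
Finally, the calculation of the mgf of a squared Gaussian from Section 1.4 (with $2ct$ in place of $t$) gives this to be $\frac{1}{\sqrt{1 - 4ct}}$.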
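Good. Now if $Y_i$ were subgaussian, we’d be done. We know that $Y_i$ is a weighted sum of Rademacher variables. A Rademacher random variable $V$ is indeed $\frac12$-subgaussian:
$$ E[e^{sV}] = \frac{e^{s} + e^{-s}}{2} = \cosh s \leq e^{s^2/2}. $$
And if the $V_j$’s are independent and $c$-subgaussian, and $x$ is a unit vector, then $V = \sum_j x_j V_j$ has
$$ E[e^{sV}] = \prod_j E[e^{s x_j V_j}] \leq \prod_j e^{c s^2 x_j^2} = e^{cs^2}. $$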
To summarize: the $M_{ij}$’s are $\frac12$-subgaussian, so $Y_i$ is too. And hence $E[e^{tY_i^2}] \leq \frac{1}{\sqrt{1 - 2t}}$ for $\{-1, +1\}$-random variables as well. This, in turn, completes the proof that the Rademacher matrix also has the JL property! Note that the JL matrix $M$ now just requires us to pick $kD$ random bits (instead of $kD$ random Gaussians); also, there are fewer precision issues to worry about. One can consider other distributions to stick into the matrix $M$: all you need to show is that $Z$ has the right mean, and that the entries are subgaussian.
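Both ingredients of the Rademacher argument are easy to check numerically: the inequality $\cosh s \leq e^{s^2/2}$ on a grid, and the norm preservation of a random-sign matrix on a unit vector. This is a sketch with fully independent signs; the grid, the sizes and the function names are illustrative choices.

```python
import math
import random

def rademacher_mgf(s):
    """E[e^{sV}] for a uniform random sign V, i.e. cosh(s)."""
    return math.cosh(s)

def rademacher_matrix(k, D, rng):
    """k x D matrix of independent uniform +/-1 entries."""
    return [[rng.choice((-1, 1)) for _ in range(D)] for _ in range(k)]

def squared_length_after_map(M, x):
    """||M x||^2 / k, i.e. the squared length of A(x) = M x / sqrt(k)."""
    return sum(sum(m * xj for m, xj in zip(row, x)) ** 2 for row in M) / len(M)

# Largest value of cosh(s) - exp(s^2 / 2) on a grid; it should never be positive.
grid = [s / 10.0 for s in range(-50, 51)]
worst_gap = max(rademacher_mgf(s) - math.exp(s * s / 2.0) for s in grid)

rng = random.Random(11)
raw = [rng.uniform(-1.0, 1.0) for _ in range(100)]
norm = math.sqrt(sum(c * c for c in raw))
x = [c / norm for c in raw]              # an arbitrary unit vector
M = rademacher_matrix(300, 100, rng)
z = squared_length_after_map(M, x)
print(worst_gap, z)
```

The matrix here needs one random bit per entry, and all intermediate sums with integer inputs would stay integral.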
Citations: The scheme of using Rademacher matrices instead of Gaussian matrices for JL was first proposed in this paper by Dimitris Achlioptas. The idea of extending it to subgaussian distributions appears in this paper of Indyk and Naor, and this paper of Matousek. The paper of Klartag and Mendelson generalizes this even further.
BTW, one can define subgaussian distributions as ones that satisfy the moment generating function bound of Definition 4 only for positive arguments, or as variables whose upper tail probabilities decay at a Gaussian rate (the upper tail is subgaussian), and prove JL bounds (see, e.g., the paper of Matousek), but it does not matter for distributions symmetric about $0$ with bounded variance, since these definitions are then essentially the same.
Fast J-L: Do we really need to plug in non-zero values into every entry of the matrix $M$? What if most of $M$ is filled with zeroes? The first problem is that if $x$ is a very sparse vector, then $Mx$ might be zero with high probability. Achlioptas showed that having a random two-thirds of the entries of $M$ being zero still works fine. The paper of Nir Ailon and Bernard Chazelle showed that if you first hit $x$ with a suitable matrix which causes the resulting vector to be “well-spread-out” whp, then its norm would still be approximately preserved by a much sparser $M$. Moreover, this first matrix requires much less randomness, and furthermore, the computations can be done faster too! There has been much work on fast and sparse versions of JL: see, e.g., this SODA 11 paper of Ailon and Edo Liberty, and this arxiv preprint by Daniel Kane and Jelani Nelson. Jelani has some notes on the Fast JL Transform.
Compressed Sensing: Finally, the J-L lemma is closely related to compressed sensing: how to reconstruct a sparse signal using very few measurements. See these notes by Jiri Matousek, or these by Baraniuk and others for a proof of the beautiful connection. I will say more about this connection in a later post.
