
A Note on Distributions and Approximation

source link: https://rjlipton.wpcomstaging.com/2012/03/11/a-note-on-distributions-and-approximation/

March 11, 2012

A helpful analogy for the quantum computing debate?

Roman Smolensky was a complexity theorist—unfortunately “was” is the correct word here. He passed away in 1995, way too young, at the age of 35. His STOC 1987 paper “Algebraic methods in the theory of lower bounds for Boolean circuit complexity” is a classic, and one can only imagine what he might have done had he lived longer.

Today I, Ken, wish to explain a well known result of Smolensky’s about low-level complexity theory. Then I draw an analogy we hope may help in understanding mathematical issues underlying our current debate over scalable quantum computers.

Dick raises a principle of mathematics that is not well known, or not as well known as it should be: when it is hard to work with sets, replace them by distributions. Even though distributions are a more general notion, they have some nicer properties. Properties of distributions over qubits and their errors are at the heart of the debate, so we feel noting some properties of distributions on ordinary bits or vectors over a finite field, and raising some questions, may be a helpful stepping stone.

Smolensky considered bounded (or sub-logarithmic) depth circuits of unbounded fan-in {\mathsf{AND}} and {\mathsf{OR}} gates (with {\mathsf{NOT}} gates at inputs), or equivalently circuits of {\mathsf{NAND}} or {\mathsf{NOR}} gates, together with unbounded fan-in gates telling whether the number of incoming 1’s is a multiple of a prime {p}. He proved that they cannot compute the mod-{q} function unless {q} is a power of {p}, and indeed cannot even approximate it in a sense whose dependence on distributions we wish to bring out. Let us first turn to Smolensky’s work, in the particular case of parity, i.e., {q = 2} and {p} is an odd prime.

Approximating NANDs

As Smolensky’s paper acknowledges, the following lemma on approximating {\mathsf{NAND}} by low-degree polynomials was also proved by Alexander Razborov for {p = 2}, and by David Mix Barrington for all {p}:

Lemma 1 For any distribution {D} on {\{-1,1\}^m}, and {k > 0}, there exists a polynomial {h(x_1,\dots,x_m)} of degree at most {(p-1)k} over {\mathbb{Z}_p} such that

\displaystyle  \Pr_{y \leftarrow D}[h(y) \neq \mathsf{NAND}(y)] \leq \frac{1}{p^k}.

Proof: Consider polynomials of the form {g = 1 - 2\cdot h^{p-1}} where

\displaystyle  h = c_1\frac{y_1 + 1}{2} + c_2\frac{y_2 + 1}{2} + \cdots + c_m\frac{y_m + 1}{2}

with coefficients {c_1,\dots,c_m \in \mathbb{Z}_p}.

If {y = (y_1,\dots,y_m)} makes {\mathsf{NAND}(y)} false, then every {y_i = -1}, since {-1} stands for true. Hence every term {(y_i + 1)/2 = 0}, so {h(y) = 0}, and thus {g(y) = 1} which stands for false.

Otherwise, some {y_i = 1}, and thus {h} has an additive term that is simply {c_i} by itself. Whatever the values of the other terms, a uniformly random choice of the coefficients, including {c_i}, produces an {h} with {h(y) \neq 0} with probability exactly {(p-1)/p}. When {h(y) \neq 0}, Fermat's little theorem gives {h(y)^{p-1} = 1}, and so {g(y) = -1} which stands for true.

Furthermore, using {k}-many sets of {m}-many coefficients {c_{i,j}} to define {h_1,\dots,h_k} in the manner of {h}, we define

\displaystyle  g' = (-1) + 2\prod_{j=1}^k (1 - h_j^{p-1})

Then when {y = (-1,\dots,-1)}, all {h_j(y) = 0} so {g'(y) = 1} which stands for false, while for all other {y \in \{-1,1\}^m}, the chance of making some {h_j(y)} nonzero, so that the whole product is zeroed out and {g'(y) = -1}, is {\Delta = 1 - 1/p^k}.

Thus for all {y}, the probability over the {c_{i,j}} that the resulting {g'(y)} agrees with {\mathsf{NAND}(y)} on that {y} is {\Delta}. Picture the {y}'s as rows of a big table and the choices for all {c_{i,j}} as columns. Each row has density {\Delta} of agreements. It follows that regardless of the distribution {D} on the rows, some column has density at least {\Delta} of agreements when {y} is chosen according to {D}. \Box
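The per-row density claim in the proof can be checked exactly for toy parameters. The sketch below (my illustration, not from the post; the parameters {p = 3}, {m = 2}, {k = 2} are chosen just small enough to enumerate every coefficient choice) verifies that for each input {y}, the fraction of choices of the {c_{i,j}} on which {g'(y)} agrees with {\mathsf{NAND}(y)} is at least {\Delta = 1 - 1/p^k}:

```python
from itertools import product

# Toy parameters, chosen small enough for exact enumeration (illustrative only)
p, m, k = 3, 2, 2

def nand(y):
    # -1 stands for true: NAND is false exactly when every input is true
    return 1 if all(yi == -1 for yi in y) else -1

inv2 = pow(2, p - 2, p)  # inverse of 2 mod p, for the (y_i + 1)/2 terms

def g_prime(coeffs, y):
    # coeffs is a k-tuple of m-tuples of coefficients c_{i,j} in Z_p
    prod = 1
    for row in coeffs:
        h = sum(c * (yi + 1) * inv2 for c, yi in zip(row, y)) % p
        prod *= 1 - pow(h, p - 1, p)  # this factor is 0 whenever h(y) != 0
    return -1 + 2 * prod

# For every input y, count agreements over ALL p^(m*k) coefficient choices
all_coeffs = list(product(product(range(p), repeat=m), repeat=k))
for y in product([-1, 1], repeat=m):
    agree = sum(g_prime(c, y) == nand(y) for c in all_coeffs)
    assert agree / len(all_coeffs) >= 1 - p ** -k  # each row has density >= Delta
```

The only row where agreement is not perfect is the all-true input, where the all-zero coefficient choice makes every {h_j} vanish; this happens for exactly a {1/p^k} fraction of the columns.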

The reason for caring about arbitrary distributions on {y} is that the {y_i} will be outputs from {\mathsf{NAND}} gates higher in the circuit, and those values might be correlated. Lecture notes by Alexis Maciel and David Mix Barrington (which I am using in a course) in fact represent the {y_i} as functions {f_i(X)} of a single variable {X}. Here {X} is ultimately interpreted as the inputs {x_1,\dots,x_n} to the circuit {C}, but it could be arbitrary.

Lower Bound for Parity

The point of the lemma is that if the circuit {C} has small depth {d}, then composing the polynomials {g'} obtained for each {\mathsf{NAND}} gate {g} yields a polynomial {f_C} of degree at most {\delta = k^d(p-1)^d} that agrees with {C} on all but an {s/p^k} proportion of inputs {x_1,\dots,x_n}, where {s} is the number of gates in {C}. When {d = o(\log n)}, there is enough slack to choose {k} non-constant but growing slowly enough with {n} (for some fixed {b > 0} used below) to make:

\displaystyle  \begin{array}{rcl}    \delta &=& o(\sqrt{n}),\\  k &=& \log_2 s + o(n^{1/bd}),\text{ and easily}\\  s/p^k &=& o(1)   \end{array}

This allows {s} to be as big as {2^{n^{1/bd}}}, which becomes the circuit size lower bound obtained by Smolensky's argument.

Note that regardless of the distribution on the inputs, there always exists a setting of the {skm}-many coefficients {c_{g,i,j}} over all the gates {g} that achieves the bound {s/p^k} on the error probability over that distribution of inputs. In particular, this works for the uniform distribution over the inputs {(x_1,x_2,\dots,x_n)}, but does not require it. Thus {C} can be approximated by some polynomial {r} of low degree {\delta = o(\sqrt{n})}.

This sets the stage for Smolensky’s famous “trick” which we give in the case of parity. Over the domain {B = \{-1,1\}^n}, parity is just the product {x_1 x_2 \cdots x_n} of all the variables, and {x_i^2 = 1}. Thus parity is a polynomial {\kappa} that is “complete” in the following sense defined in his paper:

For every function {f: \mathbb{Z}_p^n \rightarrow \mathbb{Z}_p} there are polynomials {g,h} both of degree at most {n/2} such that

\displaystyle  (\forall x \in B)f(x) = \kappa(x)g(x) + h(x).
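A quick way to see why {\kappa} is complete in this sense: on {B} we have {x_i^2 = 1}, so multiplying any monomial {x_S} with {|S| > n/2} by {\kappa} collapses it to the complementary monomial of degree {n - |S| < n/2}; splitting {f} into its high- and low-degree monomials then gives the decomposition {f = \kappa g + h}. A minimal sketch of the collapsing identity (the index set {S} is a hypothetical example, not from the post):

```python
from itertools import product

n = 4
S = (0, 1, 2)   # a hypothetical monomial index set with |S| > n/2
comp = (3,)     # its complement in {0, ..., n-1}

for x in product([-1, 1], repeat=n):
    kappa = 1
    for xi in x:
        kappa *= xi                 # kappa(x) = x_1 x_2 ... x_n
    mono_S = 1
    for i in S:
        mono_S *= x[i]              # the high-degree monomial x_S
    mono_comp = 1
    for i in comp:
        mono_comp *= x[i]           # the complementary low-degree monomial
    # since each x_i^2 = 1 on B, kappa * x_S equals x_{complement of S}
    assert kappa * mono_S == mono_comp
```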

The genius of this is that if {r} agrees with {\kappa} on some subset {A \subseteq B} and has low degree {\delta}, then we can bound the size of {A}. Namely, on {A} we may substitute {r} for {\kappa} in the decomposition {f = \kappa g + h}, so every function {f} when restricted further to {A} can be written as {rg + h}, a polynomial of degree at most {\Delta = \delta + \frac{n}{2}}. There are {p^{|A|}} such functions, but only {p^{\binom{n}{i}}} ways to assign the coefficients of terms of degree {i}, for each {i \leq \Delta}. Taking logs base {p} we get

\displaystyle  |A| \leq \sum_{i=0}^{\Delta} \binom{n}{i} \leq \left(\frac{1}{2} + o(1)\right)2^n.

To understand the right-hand inequality intuitively, note that the standard deviation of {n} coin flips is {\Theta(\sqrt{n})}, so having {\Delta} exceed {\frac{n}{2}} by only {o(\sqrt{n})} makes the overall density of {A} only vanishingly above one-half. Thus {r} and {\kappa} can agree on barely over half the inputs in {\{-1,1\}^n}, making {\kappa}, that is parity, inapproximable by any small-depth, sub-exponential size circuit {C}. A tour de force.
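A small sanity check of this estimate (my illustration; the even toy size {n} is arbitrary): for even {n}, the sum of {\binom{n}{i}} up to {i = n/2} is exactly {\frac{1}{2}\left(2^n + \binom{n}{n/2}\right)}, and the central term {\binom{n}{n/2}/2^n \approx \sqrt{2/(\pi n)}} vanishes. Each degree beyond {n/2} adds at most the central term, so an {o(\sqrt{n})} excess in {\Delta} raises the density by only {o(\sqrt{n}) \cdot O(1/\sqrt{n}) = o(1)}.

```python
from math import comb, pi, sqrt

n = 1000  # even toy size, for illustration
total = 2 ** n
central = comb(n, n // 2)

# density of multilinear monomials of degree at most n/2
density = sum(comb(n, i) for i in range(n // 2 + 1)) / total

# by symmetry this is exactly 1/2 plus half the central binomial term
assert abs(density - (0.5 + central / total / 2)) < 1e-12

# the central term is ~ sqrt(2/(pi*n)), so it vanishes as n grows
assert abs(central / total - sqrt(2 / (pi * n))) < 1e-3
```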

Note that the agreement set {A} is implicitly talking about correlation with the uniform distribution on {B}. We would be interested in treatments that consider non-independent distributions on {B}, and note the presentation of Smolensky’s theorem by Emanuele Viola in this paper in that regard.

The Analogy

The analogy is that the proof technique—for the Lemma—does not require the distribution over the inputs {y} to be uniform. Indeed they can be joint functions of some “hidden variable” {X}, and be in that sense “entangled.” More concretely, they can be arbitrarily strongly correlated. The proof technique copes with this lack of entropy in {y} by creating its own entropy in the ability to fit the coefficients {c_{i,j}} when taking a low-degree polynomial that succeeds with high probability. Although the theorem is usually presented assuming uniform distribution over the inputs {x_1,\dots,x_n}, this is not actually needed for the Lemma.

The analogy with our quantum debate is made by asking, do fault-tolerance schemes work more like the Lemma, or do they really need independence? At the risk of straining what we are trying to say, we can draw out the analogy even further:

{\bullet} “Model of Noise”: A class of distributions {D} on the inputs to the circuit {C}, or on any {\mathsf{NAND}} gate of {C}.

{\bullet} “Correlated/Entangled Errors”: {D} is the image distribution under mappings {y_1 = f_1(X)}, {y_2 = f_2(X)}, {\dots}, {y_m = f_m(X)} of a distribution {E} on a set {X}, where {E} may have lower entropy/fewer degrees of freedom. Note that gates lower down in the circuit may have a greater degree of correlation resulting from the computation by gates above them.

{\bullet} “Reasonable Noise Model”: The distribution {D} is not too complex, such as being {\mathsf{P}}-samplable, or is natural in some other sense. For instance, {D} could be induced from some distribution {D'} on co-ordinates {(y_1,\dots,y_m,z_1,\dots,z_s)} by some operation on the {z_i} co-ordinates, such as projection or “tracing out.”

{\bullet} “Feasible States”: The coefficients {c_{i,j}} that achieve the desired approximations are easy to compute, for instance {\mathsf{P}}-uniform.

The main shortcoming of this analogy is that the complexity-class side has no notion of "noise" per se, nor of "errors" or "correction." But we offer it nevertheless because we see some similarity in the mathematical issues. We also would like to ask whether the feasible-construction issues have been addressed as such by complexity theorists.

We have a further reason for thinking along lines of distributions, a.k.a. measures, in a related context. A survey paper by Akshay Venkatesh of Stanford about work on the Littlewood conjecture has some interesting examples and declares straight out:

Measures are often easier to work with than sets.

In particular, on pages 12–13 he gives an example of a statement about sets that is “wishful thinking” for sets, but which he can “salvage” in terms of distributions. We intend to elaborate on this point soon.

Open Problems

What is the complexity of constructing the polynomials {g} that approximate a {\mathsf{NAND}} gate, or a circuit of {\mathsf{NAND}} gates, in terms of the complexity of the distribution {D} on their inputs? Under what conditions are the resulting polynomials computable jointly in uniform polynomial time? When are they succinct?

Why does Smolensky’s argument fail to give a super-polynomial size lower bound for constant-depth circuits with {\mathsf{MOD}_{15}} gates that compute parity? Can we give a better, deeper answer than, “because {\mathbb{Z}_{15}} is not a field”?

[fixed formula for g-prime]
