Gaussian Jensen’s inequality

Classical Jensen’s inequality

If B : \mathbb{R} \to \mathbb{R} is concave then for any probability measure d\mu on \Omega we have Jensen’s inequality

\begin{aligned} \int_{\Omega} B(f(x)) d\mu(x) \leq B\left(\int_{\Omega} f(x) d\mu(x) \right) \end{aligned}

holds for all d\mu measurable real valued functions f : \Omega \to \mathbb{R} provided that B(f(x)) is measurable.

Probabilist would write Jensen’s inequality as \mathbb{E} B(X) \leq B(\mathbb{E} X) where X is Borel measurable real valued random variable.

Jensen’s inequality with several variables has series of applications. Pick an arbitrary set K \subset \mathbb{R}^{n} and assume that B: \mathrm{conv}(K) \to \mathbb{R} is concave, where \mathrm{conv}(K) denotes convex hull of the set K. Then for any Borel measurable random variable X :\Omega \to  K we have

\begin{aligned} \mathbb{E} B (X) \leq B(\mathbb{E}X) \end{aligned}

Cauchy–Schwarz inequality: Let B(u,v) = \sqrt{uv} be a map \mathbb{R}^{2}_{+} \to \mathbb{R}. Notice that B is concave on \mathbb{R}^{2}_{+} = \{ (x,y)\, :\, x,y \geq 0\}. Therefore by Jensen’s inequality

\begin{aligned} \mathbb{E} \sqrt{XY} \leq \sqrt{\mathbb{E} X \mathbb{E} Y}\end{aligned}

holds for all nonnegative Borel measurable random variables X,Y : \Omega \to \mathbb{R}_{+}. In particular, if d\mu is a probability measure on \Omega and f,g :\Omega \to \mathbb{R}_{+} are measurable functions then

\begin{aligned} \int_{\Omega} \sqrt{fg} d\mu   \leq \left(\int f d\mu \right)^{1/2} \left( \int g d\mu \right)^{1/2}\end{aligned}

If d\mu is not a probability measure then the following trick using 1-homogeneity of B(u,v) = \sqrt{uv} still gives the inequality in this case. Indeed, consider a new probability measure

\begin{aligned}  d\nu = \frac{\sqrt{fg} \, d\mu}{\int \sqrt{fg} d\mu}\end{aligned}

And integrate the following inequality (in fact equality) with respect to d\nu

\begin{aligned} 1 \leq B(\frac{f}{\sqrt{fg}}, \frac{g}{\sqrt{fg}}). \end{aligned}

Applying Jensen we get

\begin{aligned} 1 \leq B \left(\int \frac{f}{\sqrt{fg}} d\nu, \int \frac{g}{\sqrt{fg}} d\nu  \right)  = \frac{1}{\int \sqrt{fg} d\mu} B\left(\int f d\mu, \int g d\mu \right)\end{aligned}

and we obtain the Cauchy–Schwarz inequality. This argument gives the following

Remark: If B is concave and 1-homogeneous then

\begin{aligned} \int  B(f_{1}, \ldots, f_{n}) d\mu \leq B\left(\int f_{1} d\mu, \ldots, \int f_{n} d\mu   \right) \end{aligned} for all (not necessarily probability) measures d\mu provided that B(f_{1}, \ldots, f_{n}) is measurable

Let us list other applications of Jensen’s inequality:

Holder’s inequality:

\begin{aligned}\int f_{1}^{1/p_{1}}\cdots f_{n}^{1/p_{n}} d\mu \leq  \left(\int f_{1} d\mu \right)^{1/p_{1}} \cdots \left(\int f_{n} d\mu\right)^{1/p_{n}} \end{aligned}

holds for any measure d\mu, all nonnegative measurable f_{1}, \ldots, f_{n}, and all powers p_{1}, \ldots, p_{n} \geq 1 such that \frac{1}{p_{1}}+\ldots+\frac{1}{p_{n}}=1.
Hint: use the fact that B(u_{1}, \ldots, u_{n}) = u_{1}^{1/p_{1}}\cdots u_{n}^{1/p_{n}} is concave on \mathbb{R}^{n}_{+}. Integrate the inequality
\begin{aligned} 1 \leq B\left(\frac{f_{1}}{f_{1}^{1/p_{1}}\cdots f_{n}^{1/p_{n}}}, \ldots , \frac{f_{n}}{f_{1}^{1/p_{1}}\cdots f_{n}^{1/p_{n}}} \right)\end{aligned}
with respect to the probability measure d\nu =\frac{f_{1}^{1/p_{1}}\cdots f_{n}^{1/p_{n}} d\mu}{\int f_{1}^{1/p_{1}}\cdots f_{n}^{1/p_{n}} d\mu} and apply Jensen’s inequality.

Minkowski inequality: Let p\geq 1. Then

\begin{aligned} \left(\int |f+g|^{p} d\mu\right)^{1/p} \leq  \left(\int |f|^{p} d\mu \right)^{1/p} + \left(\int |g|^{p} d\mu  \right)^{1/p}\end{aligned}

Hint: without loss of generality assume f,g \geq 0. Use the fact that B(u,v) = (u^{1/p}+v^{1/p})^{p} is concave. Integrate the inequality
\begin{aligned}1 \leq B\left(\frac{f^{p}}{(f+g)^{p}},\frac{g^{p}}{(f+g)^{p}} \right) \end{aligned} with respect to a probability measure d\nu = \frac{(f+g)^{p} d\mu}{\int (f+g)^{p} d\mu} and apply Jensen’s inequality.

Hanner’s inequality: for p \in [1,2] we have

\begin{aligned} \|f+g\|^{p}_{p}+\|f-g\|^{p}_{p} \geq  |\|f\|_{p}+\|g\|_{p}|^{p}+|\|f\|_{p}-\|g\|_{p}|^{p}\end{aligned}

and the inequality reverses if p\geq 2.
Hint: WLOG f,g \geq 0. Use the fact that B(u,v) = |u^{1/p}+v^{1/p}|^{p}+|u^{1/p}-v^{1/p}|^{p}, where u,v \geq 0, is convex if p \in [1,2], and it is concave if p\geq 2.

Remark: Let p\geq 2. Choose f, g \in L^{p}(d\mu) so that \|f\|_{p} = \|g\|_{p}=1. Then By Hanner’s inequality we have \left\| \frac{ f+g}{2} \right\|_{p}^{p} \leq 1 - \left\| \frac{f-g}{2} \right\|_{p}^{p}. In particular if \|f-g\|_{p} \geq \varepsilon>0 then

\left\| \frac{ f+g}{2} \right\|_{p} \leq \left( 1-\frac{\varepsilon^{p}}{2^{p}}\right)^{1/p}<1, i.e., L^{p} space is uniformly convex. In fact L^{p} space is uniformly convex for p \in (1,2] as well. Indeed, denote f=F+G and g=F-G in Hanner’s inequality with \|F\|_{p}=1 and \|G\|_{p}=1 for p\in (1,2]. Then Hanner’s inequality rewrites as

2^{p+1} \geq |\|F+G\|_{p}-\|F-G\|_{p}|^{p}+|\|F+G\|_{p}-\|F-G\|_{p}|^{p}.

Since the map (u,v) \mapsto |u-v|^{p}+|u+v|^{p} is even and convex, it is (strictly) increasing in each variable. In particular, if \|F-G\|_{p} \geq \varepsilon then the largest possible value for \|F+G\|_{p}, call it \delta_{p}, is the solution of the equation
2^{p+1} = |\delta_{p}-\varepsilon|^{p}+|\delta_{p}+\varepsilon|^{p}. Now it should be a calculus exercise to show that \frac{\delta_{p}}{2} <1. (for example, if \varepsilon=0 then \delta_{p}=2 is the solution. As soon as \varepsilon>0, then because of the fact that (u,v) \mapsto |u-v|^{p}+|u+v|^{p} is strictly increasing in each variable we have \delta_{p}<2). One can show that these upper bounds on \|f+g\|_{p} obtained by Hanner’s inequality are sharp.

Before we move to another application let me mention an (open) problem, which if true would serve as a natural extension of Hanner’s inequality to n functions.

Question 1. Let \varepsilon_{1}, \ldots, \varepsilon_{n} be symmetric \pm 1 Bernoulli random variables. Then for any f_{1}, \ldots, f_{n} \in L^{p} we have

\begin{aligned} \mathbb{E} \left\| \sum_{j=1}^{n}\varepsilon_{j} f_{j}\right\|_{p}^{p} \geq \mathbb{E} \left| \sum_{j=1}^{n} \varepsilon_{j} \|f_{j}\|_{p}\right|^{p} \end{aligned}

holds for p \in [1,2], and the reverse inequality if p \geq 2.

Remark: Question 1 boils down to showing that B(u_{1}, \ldots, u_{n}) = \mathbb{E} |\sum _{j=1}^{n} \varepsilon_{j} u_{j}^{1/p}|^{p} is convex if p \in [1,2] in the domain u_{1}, \ldots, u_{n} \geq 0, and it is concave if p\geq 2. This paper claims to have positively resolved this question but there is a gap in the proof, they prove convexity/concavity in each variable separately which is not sufficient. The concavity for p\geq 3 is due to Gideon Schechtman, so the only open case is to show convexity for p \in (1,2), and concavity for p \in (2,3). Also, notice that if we do not care about the constants then applying Khinchin’s inequality to the both sides of Hanner’s inequality with n functions we obtain

\begin{aligned} \|f_{1}^{2}+...+f_{n}^{2}\|_{p/2} \gtrsim \|f_{1}\|_{p}^{2}+\ldots+\|f_{n}\|_{p}^{2} \end{aligned}

holds for p \in [1,2] and reverse inequality for p\geq 2. Notice that the latter inequality simply follows from Minkowski’s inequality: \|f+g\|_{q} \leq \|f\|_{q}+\|g\|_{q} for q\geq 1, and reverse inequality for q \in (0,1) with f,g \geq 0. Reverse Minkowski can be proved by observing that the function B(u,v) = (u^{1/q}+v^{1/q})^{q} is convex for q \in (0,1) in the domain u,v \geq 0.

Pinsker’s inequality: Let P,Q be probability distribution functions such that dP is absolutely continuous with respect to dQ. Then

\begin{aligned}\int \frac{dP}{dQ}\ln \left(\frac{dP}{dQ}\right) dQ  \geq \frac{1}{2}\left(\int \left| \frac{dP}{dQ} - 1\right|dQ \right)^{2}\end{aligned}

Hint: show the pointwise inequality
\begin{aligned} x \ln x \geq (x-1)+\frac{(x-1)^{2}}{2+\frac{2}{3}(x-1)}, \quad x \geq 0 \end{aligned}
Substitute x = \frac{dP}{dQ} and integrate with respect to a probability measure dQ. We obtain

\begin{aligned}\int \frac{dP}{dQ}\ln \left(\frac{dP}{dQ}\right) dQ \geq \int \frac{(\frac{dP}{dQ}-1)^{2}}{2+\frac{2}{3}(\frac{dP}{dQ}-1)} dQ  \geq \frac{1}{2}\left(\int \left| \frac{dP}{dQ} - 1\right|dQ \right)^{2}\end{aligned}

where the last inequality follows by Jensen applied to a convex function (u,v) \mapsto \frac{u^{2}}{v} for v>0, u \in \mathbb{R}.

An abstract theorem: concave envelope

Concave/convex functions arise naturally when one tries to maximize/minimize a certain functionals over all “test functions”. For example, here is an abstract theorem which happens to be useful for these kind of applications. Consider the following problem

\begin{aligned}B(x) = \sup_{Y} \{ \mathbb{E}\, H(Y), \quad \mathbb{E}\,m(Y) = x\} \end{aligned}

where H : K \to \mathbb{R} is fixed, K \subset \mathbb{R}^{k} is a fixed subset, and m : K \to \mathbb{R}^{m} is a fixed map. The supremum is taken over all random variables Y given on any probability space with values in K (here we also assume that both H(Y) and m(Y) are measurable.

Theorem 1. Let B be defined as above
\begin{aligned}B(x) = \sup_{Y:\, \mathrm{Im}(Y) \in K } \{ \mathbb{E}\, H(Y), \quad \mathbb{E}\,m(Y) = x\} \end{aligned}
Then B coincides with the minimal concave function defined on the convex set \mathrm{conv}(m(K)) with the obstacle condition B(m(y)) \geq H(y).

This abstract theorem provides with more “unexpected” applications of Jensen’s inequality. Here is one such application: let p\geq 2. Suppose we would like to estimate \|f+g\|_{p} from above in terms of \|f\|_{p} and \|g\|_{p}. A trivial (and sharp) estimate is triangle inequality (Minkowski’s inequality) \|f+g\|_{p} \leq \|f\|_{p}+\|g\|_{p}. One drawback of this estimate is that if f, g have disjoint support then in fact we have equality \|f+g\|_{p}^{p} = \|f\|_{p}^{p}+\|g\|_{p}^{p}, whereas the triangle inequality gives \|f+g\|_{p}^{p}\leq (\|f\|_{p}+\|g\|_{p})^{p}. Obviously \|f\|_{p}^{p}+\|g\|_{p}^{p} \leq (\|f\|_{p}+\|g\|_{p})^{p} because a^{p}+b^{p}\leq (a+b)^{p} for all a,b \geq 0. We see that Minkowski’s inequality becomes too rough when applied to functions which “do not overlap”. It becomes even worse when we apply it to n functions, say we want to bound from above \|f_{1}+...+f_{n}\|_{p}. To measure the overlap between f,g one possibility is to introduce a parameter \|fg\|_{p/2}. One asks what is the best possible upper bound on \|f+g\|_{p} in terms of \|f\|_{p}, \|g\|_{p}, \|fg\|_{p/2}. The answer is the following

\begin{aligned} \|f+g\|_{p} \leq \left[ \left(\frac{1+\sqrt{1-\Gamma^{2}}}{2}\right)^{1/p}+ \left(\frac{1-\sqrt{1-\Gamma^{2}}}{2}\right)^{1/p} \right] (\|f\|_{p}^{p}+\|g\|_{p}^{p})^{1/p} \qquad (1)\end{aligned}

where \Gamma = \frac{2\|fg\|_{p/2}^{p/2}}{\|f\|_{p}^{p}+\|g\|_{p}^{p}} \in [0,1]. The equality holds if and only if both f,g are nonnegative (or non-npositive) such that |fg|^{p/2} = k (|f|^{p}+|g|^{p}) for some constant k \in [0,1/2]. If f,g have disjoint support fg=0, then \Gamma=0 and we recover \|f+g\|_{p}\leq (\|f\|_{p}^{p}+\|g\|_{p}^{p})^{1/p}. In general, the map

\begin{aligned} \Gamma \mapsto \left(\frac{1+\sqrt{1-\Gamma^{2}}}{2}\right)^{1/p}+ \left(\frac{1-\sqrt{1-\Gamma^{2}}}{2}\right)^{1/p} \end{aligned}

is increasing on [0,1], and one can show that this sharpening of triangle inequality, does indeed sharpen it:

\begin{aligned}  \left[ \left(\frac{1+\sqrt{1-\Gamma^{2}}}{2}\right)^{1/p}+ \left(\frac{1-\sqrt{1-\Gamma^{2}}}{2}\right)^{1/p} \right] (\|f\|_{p}^{p}+\|g\|_{p}^{p})^{1/p} \leq \|f\|_{p}+\|g\|_{p}\end{aligned}

by estimating

\begin{aligned}\Gamma = \frac{2\|fg\|_{p/2}^{p/2}}{\|f\|_{p}^{p}+\|g\|_{p}^{p}} \stackrel{\mathrm{Cauchy--Schwarz}}{\leq} \frac{2\sqrt{\|f\|_{p}^{p}\|g\|_{p}^{p}}}{\|f\|_{p}^{p}+\|g\|_{p}^{p}} :=\tilde{\Gamma}\end{aligned}, and noticing the identity

\begin{aligned} \left[ \left(\frac{1+\sqrt{1-\tilde{\Gamma}^{2}}}{2}\right)^{1/p}+ \left(\frac{1-\sqrt{1-\tilde{\Gamma}^{2}}}{2}\right)^{1/p} \right] (\|f\|_{p}^{p}+\|g\|_{p}^{p})^{1/p} = \|f\|_{p}+\|g\|_{p}.\end{aligned}

To prove (1) one observes that

\begin{aligned} B(x,y,z) = \left( \left( \frac{1+\sqrt{1-\left(\frac{2z}{x+y}\right)^{2}}}{2}\right)^{1/p} + \left( \frac{1-\sqrt{1-\left(\frac{2z}{x+y}\right)^{2}}}{2}\right)^{1/p}\right)^{p}(x+y) \end{aligned}

is concave on \{(x,y,z), \, x,y,z \geq 0, \, z \leq \sqrt{xy}\}. Without loss of generality assume f,g \geq 0, consider a new probability measure

\begin{aligned} d\nu = \frac{(f+g)^{p}}{\int (f+g)^{p} d\mu} \end{aligned}, integrate the identity

\begin{aligned} 1 \leq B\left(\frac{f^{p}}{(f+g)^{p}},  \frac{g^{p}}{(f+g)^{p}}, \frac{(fg)^{p/2}}{(f+g)^{p}} \right) \end{aligned}

with respect to d\nu and use Jensen’s inequality. The obtained final estimate will be (1).

Of course, one may ask how could one ever guess this expression for B(x,y,z)? And why you ever believed that there is such a function at all? The starting point was the abstract theorem. We wanted to find

\begin{aligned} B(x,y,z) = \sup_{X,Y\geq 0 }\{ \mathbb{E}|X+Y|^{p}, \mathbb{E} (X^{p}, Y^{p}, (XY)^{p/2})=(x,y,z)\},  \end{aligned}

where supremum is taken over all nonnegative random variables over any probability space. The abstract theorem tells us that B is minimal concave function in the domain \mathrm{conv} \{(u^{p}, v^{p}, (u,v)^{p}), u,v \geq 0 \} with the obstacle condition B(u^{p}, v^{p}, (uv)^{p/2}) \geq (u+v)^{p}. So eventually we are solving the problem from geometry, i.e., find the concave envelope. See my talk about how to solve such geometric problems in a systematic way.

I should mention that in this case we were lucky and the function B(x,y,z) turned out to have an explicit formula. Sometimes (in fact most of the times) such optimal functions are implicit. Here is an example of an implicit function: having Hanner’s inequality one may ask to find

\begin{aligned} M(x,y,z) = \sup_{ f,g \in L^{p}(d\mu)} \left\{\left\|\frac{f+g}{2}\right\|_{p}^{p}\; , \; \|f\|^{p}_{p}=x, \, \|g\|_{p}^{p}=y, \,  \|f-g\|_{p}^{p}=z \right\}\end{aligned}

where d\mu is an arbitrary realm valued measure such that its support contains at least two points. It immediately follows from Hanner’s inequality that

\begin{aligned} M(x,y,z) \leq  \left|\frac{x^{1/p}+y^{1/p}}{2}\right|^{p}+\left|\frac{x^{1/p}-y^{1/p}}{2} \right|^{p}-\frac{z}{2^{p}} , \quad p\geq 2\end{aligned}

Perhaps one may guess that we have actually equality here. Unfortunately ‘No’. This would be “too good to be true”. In fact there is a linear function one can put between M and Hanner, namely,

\begin{aligned} M(x,y,z) \stackrel{(C)}{\leq} \frac{x+y}{2}-\frac{z}{2^{p}} \stackrel{(*)}{\leq}  \left|\frac{x^{1/p}+y^{1/p}}{2}\right|^{p}+\left|\frac{x^{1/p}-y^{1/p}}{2} \right|^{p}-\frac{z}{2^{p}}\end{aligned}

Inequality (C) is known as Clarkson’s inequality. And the inequality (*) is simple, it follows from the fact that the right hand side, as a concave function, dominates its tangent plane at point (1,1,1).

So what is the value of this mysterious function M(x,y,z)? It was calculated in my PhD thesis, see Section 2.7. The function M is defined implicitly, it involves solutions of finite number of algebraic equations, I do not intend to reproduce its value here.

Gaussian Jensen’s inequality

Suppose we want to understand under what conditions on B we have

\begin{aligned} \mathbb{E} B(f(X), g(Y))\leq B(\mathbb{E}f(X), \mathbb{E} g(Y)) \end{aligned}

holds for all test functions, say real valued f,g, where X, Y are some random variables (not necessarily all possible random variables!). If X=Y, i.e., X and Y are one and the same random variables, then the best we can hope for is that B must be concave, this is necessary and sufficient condition. If X, Y are independent then B(u,v) must be separately concave, i.e., concave in each variable separately (but not necessarily jointly). This observation suggests that maybe if there is some fixed correlation between random variables X, Y then perhaps the right condition on B will be something between jointly concave and separately concave. The answer is yes, and in case of normal random vectors we have the following nice characterization

Theorem 2. Let X=(X_{1}, \ldots, X_{n})  \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) be a normal random vector. Then

\begin{aligned} \mathbb{E} B(\boldsymbol{f}(X)) \leq B(\mathbb{E} \boldsymbol{f}(X)) \end{aligned}

holds for all test functions \boldsymbol{f}(X) = (f_{1}(X_{1}), \ldots, f_{n}(X_{n})) if and only if

\begin{aligned} \mathrm{Hess}B(u)\bullet \Sigma \leq 0\end{aligned},

where \bullet denotes Hadamard product, and \leq 0 means the matrix is negative semidefinite.

Remark: We assume that B is at least C^{2} in a rectangular domain Q = I_{1} \times \ldots I_{n} where I_{j} are intervals, rays, or real line. In this case test functions f_{j} take values in I_{j}, j=1, \ldots, n. Also, negative semi-definiteness \mathrm{Hess}B(u)\bullet \Sigma \leq 0 is required to hold for all u \in Q.

A short proof is given in Section 2.2 here. To match the notations X \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) being normal random vector means that X = \boldsymbol{\mu} +AZ, where Z \sim \mathcal{N}(0, I_{N \times N}) and A is some n \times N matrix. Clearly \boldsymbol{\Sigma} = AA^{T}. In the paper one uses A^{T} instead of A.

Here I would like to discuss stochastic calculus approach. Essentially these will be the same proofs, perhaps stochastic calculus approach is more intuitive.

Proof of Theorem 2.

Consider stochastic process X_{t}, t \in [0,1], such that X_{0}=\mu and X_{1} \sim X, i.e., X_{1} has the same law as X. The simplest way to construct such a process is to consider the stochastic PDE

\begin{aligned} dX_{t} = AdB_{t}, \quad X_{0}=\mu \end{aligned},

where A is our n \times N matrix, and B_{t} = (B^{1}_{t}, \ldots, B^{N}_{t}) is the standard N dimensional brownian motion starting at zero. One may notice that X_{t} is also a martingale, this will play a key role in the proof. Next, we consider a new process

\begin{aligned} Z_{t} = B(X_{t}), t \in [0,1] \end{aligned}.

We have Z_{0}= B(\boldsymbol{\mu}) and Z_{1} = B(X_{1})\sim B(X). If we can see that the map

\begin{aligned} t \mapsto \mathbb{E} B(X_{t}) \end{aligned}

is monotone on [0,1], namely, it is non-increasing then we obtain

\begin{aligned} \mathbb{E} B(X) = \mathbb{E} B(X_{1}) \leq \mathbb{E} B(X_{0})=B(\boldsymbol{\mu}) \end{aligned}.

This will prove the Jensen’s inequality in the simple case when f_{j}(u)=u, i.e., the test functions f_{j} are identity maps.

To investigate the monotonicity of \mathbb{E} Z_{t}, as a rule of thumb we differentiate the process X_{t} in t. Using Ito calculus we have

\begin{aligned} dZ_{t} = dB(X_{t}) = \langle \nabla B(X_{t}), dX_{t}\rangle  + \frac{1}{2}\langle \mathrm{Hess} B(X_{t}) dX_{t}, dX_{t} \rangle \end{aligned}
\begin{aligned} = \langle A^{T} \nabla B(X_{t}), dB_{t}\rangle + \frac{1}{2}\langle  A^{T}\mathrm{Hess}\, B(X_{t})A dB_{t}, dB_{t} \rangle \end{aligned}
\begin{aligned} = \langle  A^{T} \nabla B(X_{t}), dB_{t}\rangle + \frac{1}{2}\mathrm{Tr}\left( A^{T}\mathrm{Hess}\, B(X_{t})A \right) dt \end{aligned}
\begin{aligned} = \langle A^{T} \nabla B(X_{t}), dB_{t}\rangle + \frac{1}{2}\mathrm{Tr}\left( AA^{T}\mathrm{Hess}\, B(X_{t})\right) dt \end{aligned},

where in the last inequality we used a fact from linear algebra that \mathrm{Tr}(UV)=\mathrm{Tr}(VU) for rectangular matrices U is p \times q, and V is q \times p. In the second to the last inequality we used the property that dB^{j}_{t} dB^{i}_{t} = \delta_{ij}dt.


\begin{aligned} Z_{t} = Z_{0}+\int_{0}^{1} \langle A^{T} \nabla B(X_{t}), dB_{t} \rangle  + \frac{1}{2}\int_{0}^{1}\mathrm{Tr}\left( AA^{T}\mathrm{Hess}\, B(X_{t})\right) dt \end{aligned}.

Taking the expectation of both sides and invoking the martingale property \mathbb{E} \int_{0}^{1} \langle A^{T} \nabla B(X_{t}), dB_{t} \rangle=0 we obtain

\begin{aligned}\mathbb{E} B(X) = B(\boldsymbol{\mu})+ \frac{1}{2}\int_{0}^{1}\mathbb{E} \mathrm{Tr}\left( AA^{T}\mathrm{Hess}\, B(X_{t})\right) dt \leq B(\boldsymbol{\mu}) \end{aligned}


\begin{aligned}  \mathrm{Tr}\left( AA^{T}\mathrm{Hess}\, B(X_{t})\right) = \langle AA^{T} \bullet \mathrm{Hess}\, B(X_{t}) \boldsymbol{1}, \boldsymbol{1}\rangle \leq  0 \end{aligned},

where \boldsymbol{1}=(1, \ldots, 1).

Great! Now consider a general case when f_{j} are not identity maps. One natural process to look at is

Z_{t} = B(f_{1}(X^{1}_{t}), \ldots, f_{n}(X^{n}_{t})).

If we repeat the same computations we encounter with several issues. First notice that the n dimensional process (f_{1}(X^{1}_{t}), \ldots, f_{n}(X^{n}_{t})) is not a martingale. Also \mathbb{E} Z_{0} = B(f_{1}(\mu_{1}), \ldots, f_{n}(\mu_{n})) which is not the right hand side in Jensen’s inequality.

The make the n dimensional process (f_{1}(X^{1}_{t}), \ldots, f_{n}(X^{n}_{t})) the martingale there is one obvious way to do so. Consider a vector function

\begin{aligned}F(x,t) = \mathbb{E}  \left( (f_{1}(X^{1}_{1}), \ldots, f_{n}(X^{n}_{1}))| X_{t}=x\right)\end{aligned}

Then the process Y_{t} = F(X_{t}, t) is the martingale:

\begin{aligned} Y_{0}= \mathbb{E} (f_{1}(X^{1}_{1}), \ldots, f_{n}(X^{n}_{1}))\end{aligned}
\begin{aligned} Y_{1}= (f_{1}(X^{1}_{1}), \ldots, f_{n}(X^{n}_{1})) \sim (f_{1}(X_{1}), \ldots, f_{n}(X_{n})) \end{aligned}.

Let F(x,t)=(F^{1}(x,t), \ldots, F^{n}(x,t)). Then Feynman–Kac tells us

\begin{aligned} \frac{d}{dt} F^{\ell} + \frac{1}{2}\sum_{i,j=1}^{n} (AA^{T})_{ij} \frac{\partial^{2} F^{\ell}}{\partial x_{i} \partial x_{j}}=0, \quad \ell=1, \ldots, n \end{aligned}

Therefore Ito calculus gives

\begin{aligned} d F^{\ell}(X_{t}, t) = F^{\ell}_{t} dt + \langle \nabla F^{\ell}, dX_{t}\rangle + \frac{1}{2} \langle \mathrm{Hess}\, F^{\ell}\,  dX_{t}, dX_{t} \rangle =  \end{aligned}
\begin{aligned}  F^{\ell}_{t} dt + \langle \nabla F^{\ell}, AdB_{t}\rangle + \frac{1}{2} \mathrm{Tr} (A^{T}\mathrm{Hess}\, F^{\ell} A)  dt  =\langle \nabla F^{\ell}, AdB_{t}\rangle \end{aligned}.

If we let dF(X_{t},t) = (dF^{1}(X_{t}, t), \ldots, dF^{\ell}(X_{t},t))^{T} to be a column vector, and J be the Jacobian matrix of F in variable x, then we can write

\begin{aligned} dF(X_{t}, t) = JA\, dB_{t} \end{aligned}.


\begin{aligned}dB(F(X_{t},t)) =\langle \nabla B, dF \rangle + \frac{1}{2} \langle \mathrm{Hess}\, B \, dF, dF\rangle   =\end{aligned}
\begin{aligned} \langle \nabla B, JA\, dB_{t} \rangle + \frac{1}{2} \langle \mathrm{Hess}\, B \, JA\, dB_{t}, JA\, dB_{t}\rangle = \end{aligned}
\begin{aligned} \langle \nabla B, JA\, dB_{t} \rangle + \frac{1}{2} \mathrm{Tr}\left( (JA)^{T} \mathrm{Hess}\, B \, JA \right) dt. \end{aligned}

Integrating in t and taking the expectation of both sides we obtain

\begin{aligned} \mathbb{E} B(\boldsymbol{f}(X)) = B(\mathbb{E} \boldsymbol{f}(X))+ \frac{1}{2} \mathbb{E} \int_{0}^{1}\mathrm{Tr}\left( (JA)^{T} \mathrm{Hess}\, B \, JA \right) dt.  \end{aligned}

Next, notice that F^{\ell}(x,t) depends only on the coordinate x_{\ell} for x = (x_{1}, \ldots, x_{n}). Indeed,

\begin{aligned} F^{\ell} (x,t) = \mathbb{E} (f_{\ell}(X^{\ell}_{1}) | X_{t}=x) = \mathbb{E} f_{\ell}\left(x_{\ell}+\left(\int_{t}^{1} AdB_{t} \right)_{\ell}\right) =\mathbb{E} f_{\ell}\left(x_{\ell}+\langle A_{\ell}, B_{1}-B_{t}\rangle \right) , \end{aligned}

where \left(\int_{t}^{1} AdB_{t} \right)_{\ell} denotes \ell‘th coordinate of the vector, and A_{\ell} denotes \ell'th row of the matrix A. Thus J is the diagonal matrix, say with vector v=(v_{1}, \ldots, v_{n}) on its diagonal. Thus, we have

\begin{aligned} \mathrm{Tr}\left( (JA)^{T} \mathrm{Hess}\, B \, JA \right) = \langle (AA^{T} \bullet \mathrm{Hess}\, B)\,   v, v \rangle \leq 0 \end{aligned}

This implies

\begin{aligned}\mathbb{E} B(\boldsymbol{f}(X)) \leq  B(\mathbb{E} \boldsymbol{f}(X)) \end{aligned}

finishing the proof of one implication in Theorem. To show that \mathbb{E} B(\boldsymbol{f}(X)) \leq B(\mathbb{E} \boldsymbol{f}(X)) implies the infinitesimal inequality AA^{T} \bullet \mathrm{Hess}\, B(x) \leq 0, is more or less easy and is left as an exercise. The hint is the following: by dilating and shifting the test functions show that the inequality \mathbb{E} B(\boldsymbol{f}(X)) \leq B(\mathbb{E} \boldsymbol{f}(X)) implies the inequality \mathbb{E} B(F(X_{t},t)) \leq B(F(X_{0},0)) for all t \in [0,1]. Now send t \to 0 and differentiate at zero. \square.

Remark: Let X = (X_{1}, \ldots, X_{n}) be a random vector with mean \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma}. To see the appearance of \boldsymbol{\Sigma} \bullet \mathrm{Hess}\, B \leq 0 “immediately” one can perhaps make the following argument rigorous: let \boldsymbol{a}=(a_{1}, \ldots, a_{n}) and \boldsymbol{u}=(u_{1}, \ldots, u_{n}). Take test function \boldsymbol{f}(X) = \boldsymbol{u}+\varepsilon \boldsymbol{a}\bullet (X-\boldsymbol{\mu}). Then \mathbb{E} \boldsymbol{f}(X) = \boldsymbol{u}. Take \varepsilon  \to 0. Taylor’s formula gives

\begin{aligned} \mathbb{E} B(\boldsymbol{u}+\varepsilon \boldsymbol{a}\bullet (X-\boldsymbol{\mu})) = B(\boldsymbol{u})+\varepsilon \mathbb{E} \langle \nabla B(\boldsymbol{u}), \boldsymbol{a}\bullet(X-\boldsymbol{\mu})\rangle \end{aligned}
\begin{aligned}+ \varepsilon^{2} \frac{1}{2} \mathbb{E} \langle \mathrm{Hess}\, B(\boldsymbol{u})(\boldsymbol{a}\bullet (X-\boldsymbol{\mu})), \boldsymbol{a}\bullet (X-\boldsymbol{\mu}) \rangle o(\varepsilon^{2})= \end{aligned}
\begin{aligned}B(\boldsymbol{u})+ \varepsilon^{2}\frac{1}{2}\langle \Sigma \bullet  \mathrm{Hess}\, B(\boldsymbol{u}) \boldsymbol{a}, \boldsymbol{a} \rangle  + o(\varepsilon^{2}) \leq B(\boldsymbol{u})\end{aligned}

Dividing both sides of the inequality by \varepsilon^{2} and taking \varepsilon \to 0 one recovers \Sigma \bullet \mathrm{Hess}\, B \leq 0.

Summary of the proof: what did we do?

Let us summarize how did we prove Gaussian Jensen’s inequality. For those who are not familiar with stochastic calculus we constructed a path between two measures \mu_{1} = \boldsymbol{f}(X) and the delta measure \mu_{0} = \delta_{\mathbb{E} \boldsymbol{f}(X)} so that the map t \mapsto \mu_{t}, t \in [0,1] is such that \int B(w) d\mu_{t}(w) is monotone. One natural way to construct such path is done through stochastic process Y_{t}. In general one can write down what are these measures \mu_{t}, they satisfy Fokker–Planck equation, as it can be seen from the stochastic PDE for Y_{t}. These kind of questions happen to arise quite often in completely different areas of mathematics. Mass transportation problem falls into this framework, where one tries to minimize the quantity \mathbb{E} B(X, Y) over all couplings (X,Y) so that X has fixed law d\mu, and Y has another fixed law d\nu. The point is we can take X, Y to be independent from each other, or put some dependence between each other.

Remark (very briefly): One may hope that Y can be expressed as as Y=T(X) for some map T :\mathbb{R}^{n} \to \mathbb{R}^{n} (assume for simplicity that our random variables take values in \mathbb{R}^{n}). This is possible most of the times but not always, for example, if X has density \delta_{0}, and Y has density (1/2)\delta_{-1}+(1/2)\delta_{1}, then this is not possible. But if X \sim d\mu =f(x)dx and Y \sim d\nu =g(y)dy, and such T exists and it is C^{1} diffeomorphism, then T must satisfy Monge–Ampere equation

\begin{aligned}g(T(x))\mathrm{det}(T'(x)) =f(x), \end{aligned}

where T'(x) is the Jacobi matrix of the map T :\mathbb{R}^{n} \to \mathbb{R}^{n}. We say T pushes forward measure \mu to \nu, i.e., T_{\#}\mu =\nu in the sense that \mu(T^{-1}(A))=\nu(A) for all measurable A \in \mathbb{R}^{n}. Such maps T can be many (in fact they are always many!). If one looks for solutions T among \nabla \varphi, where \varphi : \mathbb{R}^{n} \to \mathbb{R} is convex, then one expects that such \varphi is unique (up to additive constant), and such maps T=\nabla \varphi for a convex \varphi are very special, they are called Brenier’s map. They are special because if B(x,y)=\Phi(\|x-y\|) is strictly convex function of a distance \|x-y\| then T=\nabla \varphi is the minimizer of optimal transport problem (Monge’s problem)

\begin{aligned} \inf_{T \, :\,  T_{\#}\mu=\nu} \mathbb{E} B(X, T(X)) \quad \quad \mathrm{Monge's \, problem}\end{aligned}

In Kantorovic’s formulation

\begin{aligned} \inf_{(X,Y)\,  :\,  \mathrm{law}(X)=\mu, \, \mathrm{law}(Y)=\nu} \mathbb{E} B(X,Y) \end{aligned}

one usually hopes that these are the same problems, but as we noticed not always. However, as soon as Monge’s problem has a solution (minimizer exists), then yes, these problems are the same.

If T_{\#}\mu=\nu it is very typical to interpolation measures \mu and \nu like this

\begin{aligned}\mu_{t} := ((1-t) \mathrm{Id}+t T)_{\#}\mu \end{aligned}

This kind of interpolation turns out to be helpful when proving certain estimates (Brunn–Minkowski, Marton–Talagrand). It gives a different point of view on interpolation between two measures, which seems to be different from stochastic calculus.

Going back to Kantorovich’s formulation one can show

\begin{aligned} \inf_{(X,Y)\, :\, \mathrm{law}(X)=\mu, \, \mathrm{law}(Y)=\nu} \mathbb{E} B(X,Y) = \sup_{p,q\, :\, p(x)+q(y) \leq B(x,y)} \left\{ \int p(x)d\mu+\int q(y)d\nu \right\} \end{aligned}

under some mild additional assumptions on B, \mu, \nu.

Also, when measures live on the real line then

\begin{aligned} \inf_{(X,Y)\, :\, \mathrm{law}(X)=\mu, \, \mathrm{law}(Y)=\nu} \mathbb{E} \Phi(X-Y) =\int_{0}^{1}\Phi(F^{-1}(t)-G^{-1}(t))dt\end{aligned}

where F, G are cumulative distribution functions of X and Y respectively, and F^{-1}(t) = \inf\{\lambda \in \mathbb{R} \, :\, F(\lambda) >t\}. Furthermore, assuming \mu, \nu do not give point masses then T(x) = G^{-1}(F(x)) is the optimizer in Monge’s problem. In other words real line is well understood.

Another example that comes to my mind is Gaussian Correlation Conjecture (now theorem). One equivalent way to state the theorem is as follows. Pick arbitrary positive numbers s_{1}, \ldots, s_{n}. Let B(u_{1}, \ldots, u_{n}) = 1_{[-s_{1}, s_{1}]}(u_{1})\cdots 1_{[-s_{n}, s_{n}]}(u_{n}). Show that for any normal random vector X=(X_{1}, \ldots, X_{n}) we have
\begin{aligned} \mathbb{E}B(X_{1}, \ldots, X_{n}) \leq \mathbb{E}B(\tilde{X}_{1}, \ldots, \tilde{X}_{n})\end{aligned}, where \tilde{X}=(\tilde{X}_{1}, \ldots, \tilde{X}_{n}) is another normal random vector such that \tilde{X_{j}} \sim X_{j}, i.e., \tilde{X}_{j} and X_{j} have the same laws, and (\tilde{X_{1}}, \ldots, \tilde{X_{m}}) is independent from (\tilde{X_{m+1}}, \ldots, \tilde{X}_{n}). The inequality \mathbb{E} B(\boldsymbol{X})\leq \mathbb{E}(\tilde{\boldsymbol{X}}) is true for any m, 1\leq m <n. Royen found a path, call it X_{t}, such that X_{0} \sim X, and X_{1} \sim \tilde{X} and so that t \mapsto \mathbb{E} B(X_{t}) is monotone of t \in [0,1]. In fact, one can choose X_{t} to be normal random vector with covariance matrix
\begin{aligned}C(t) = \begin{pmatrix} C_{11} & tC_{12} \\ tC_{21} & C_{22} \end{pmatrix} \end{aligned}, where C(1) is the covariance matrix of X, and C_{11} is m\times m covariance matrix of (X_{1}, \ldots, X_{m}). One can show \frac{d}{dt} \mathbb{E} B(X_{t}) \leq 0 by a remarkable computation.

In general it is an interesting and open problem to understand the following: given two random variables X and Y, for simplicity say they are on the same probability space \Omega. They can be from some family of random variables, say gaussians, or from all random variables (without any restriction). How to find a path Z_{t}, Z_{0} \sim X, Z_{1} \sim Y, and what conditions should one put on B so that t \mapsto  \mathbb{E} B(Z_{t}), t \in [0,1] is monotone.

Applications of Gaussian Jensen’s inequality

Any theorem may easily be forgotten unless one knows about interesting applications. I will briefly list some of them here:


We talked about hypercontractivity on the hypercube in one of my previous blog post. Here I will speak about hypercontractivity in Gauss space.

For r \in [0,1] let T_{r}f(x) = \mathbb{E} f(rx+\sqrt{1-r^{2}}Y) where Y \in \mathcal{N}(0,1) be the Hermite operator. Let 1 \leq p<q \leq \infty. Then (\mathbb{E} |T_{r}f(X)|^{q})^{1/q} \leq (\mathbb{E} |f(X)|^{p})^{1/p}, X \in \mathcal{N}(0,1), for all test functions f if and only if r \leq \sqrt{\frac{p-1}{q-1}}.

Proof: the necessity follows by considering linear functions f(x)=1+\varepsilon x. Then T_{r}f(x) = 1+r\varepsilon x. Now substitute these functions into our inequality, take \varepsilon \to 0, and expand L^{p} norms using Taylor’s formula.

To prove sufficiency part, WLOG f \geq 0. By duality we can rewrite the inequality as

\begin{aligned} \mathbb{E} g(X)f(rX+\sqrt{1-r^{2}}Y) \leq (\mathbb{E} g(X)^{p})^{1/p} (\mathbb{E} f(Y)^{q'})^{1/q'}\end{aligned}

for standard independent Gaussians X, Y. Here \frac{1}{q}+\frac{1}{q'}=1.
To give a “Jensen’s inequality form”, we can further rewrite it as

\begin{aligned} \mathbb{E} B(u(X), v(rX+\sqrt{1-r^{2}}Y)) \leq B(\mathbb{E} u(X), \mathbb{E} v(rX+\sqrt{1-r^{2}}Y)) =B(\mathbb{E} u(X), \mathbb{E}v(Y))\end{aligned} for all nonnegative bounded functions u, v, where B(x,y) = x^{1/p}y^{1/q'}. Let us see what Gaussian Jensen’s inequality tells us for this case. The covariance matrix for the normal random vector (X,rX+\sqrt{1-r^{2}}Y) is

\boldsymbol{\Sigma}  = \begin{aligned} \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix} \end{aligned}


\begin{aligned}\boldsymbol{\Sigma} \bullet \mathrm{Hess}\, B(x,y) = \frac{u^{1/p}v^{1/q'}}{pq'}\begin{pmatrix}p'(\frac{1}{p}-1)u^{-2} & r (uv)^{-1}\\ r(uv)^{-1} & p (\frac{1}{q'}-1) \end{pmatrix}  \leq 0 \end{aligned}

if and only if q'p(\frac{1}{p}-1)(\frac{1}{q'}-1) \geq r^{2}. On the other hand q'p(\frac{1}{p}-1)(\frac{1}{q'}-1) = \frac{p-1}{q-1}. \square.

To put the argument in a different way: in general B(u,v)=u^{a}v^{b}, a,b \geq 0, is concave if and only if a+b\leq 1. But when we take Hadamard product of B with \boldsymbol{\Sigma} then the resulted matrix can be made semidefinite even in the range a+b> provided that r is sufficiently small.

In what follows I will briefly list some applications of Gaussian Jensen’s inequality, and for simplicity I will consider applications only in dimension 1. Higher dimensions easily follow from “higher dimensional version” of Gaussian Jensen’s inequality where one considers normal random vectors X=(X_{1}, \ldots, X_{n}), where each X_{j} is another normal random vector, and \boldsymbol{f}(X) = (f_{1}(X_{1}), \ldots, f_{n}(X_{n})). In this case the proof of “higher dimensional Gaussian Jensen” is absolutely the same as the one I presented here, only the linear algebra will be a little bit involved.

1. Brascamp–Lieb inequality: by playing with B(u_{1}, \ldots, u_{n})=u_{1}^{1/p_{1}}\ldots u_{n}^{1/p_{n}} and its reverse forms one obtains “Gaussian Brascamp–Lieb inequality” with constant 1, which is not the classical Brascamp–Lieb as it involves Lebesgue measures and a certain sharp constants. To get the classical Brascamp–Lieb there is certain “change of variables”, and an unpleasant passage to a limit to get rid off the Gaussian measures.

2. Ehrhard inequality: for all measurable sets A, B \subset \mathbb{R}, and all nonnegative \alpha, \beta, |\alpha-\beta|\leq 1, \alpha+\beta \geq 1 we have

\begin{aligned} \Phi^{-1}(|\lambda A+(1-\lambda)B|_{\gamma})  \geq\alpha \Phi^{-1}(|A|_{\gamma})+\beta \Phi^{-1}(|B|_{\gamma})\end{aligned}

where \gamma is the standard Gaussian measure, and \Phi(x) = \gamma((-\infty, x]) is the Gaussian cumulative distribution function.

An interesting case to keep in mind is when \alpha=\lambda \in [0,1] and \beta=1-\lambda. The Ehrhard inequality implies Gaussian isoperimetric inequality, the proof is the same as Brunn–Minkowski implies classical isoperimetric inequality. Sometimes one states the Ehrhard inequality in a more general functional form

\begin{aligned}\int_{\mathbb{R}}\sup_{\alpha x+\beta y=t} \Phi(\alpha \Phi^{-1}(f(x))+\beta \Phi^{-1}(g(y))) d\gamma(t) \leq \Phi(\alpha \Phi^{-1}( \int f d\gamma )+\beta \Phi^{-1}(\int g d\gamma ))\end{aligned}

for all Borel measurable f,g with values in [0,1], and by applying the functional form of Ehrhard to f=1_{A}, g=1_{B} one obtains the Ehrhard inequality for sets. However, the Ehrhard inequality for sets (in dimension n) implies the functional Ehrhard in dimensions n-1. The proof of the functional Ehrhard using Gaissian Jensen’s inequality is tricky, because we have “supremum”, i.e., integral of an L^{\infty} in the left hand side of the functional Ehrhard inequality. But this can be seen as a limit of L^{p} norms as p \to \infty, and then one can rewrite it further by duality as one pure integral. But this is not enough, one also changes gaussian measure in a sophisticated way when taking the limit of L^{p} norms. In these slides one can see how the main idea works. The complete proof in a more generality is given here.

3. Borell’s Gaussian noise stability: pick two standard Gaussian random variables X, Y with correlation \mathbb{E} XY=\rho. Then

\begin{aligned} \sup_{A, B \subset \mathbb{R}, \, |A|_{\gamma}=u, \, |B|_{\gamma}=v}\mathbb{P}(X \in A, \, Y \in B) = \mathbb{P}(X<\Phi^{-1}(|A|_{\gamma}), Y < \Phi^{-1}(|B|_{\gamma})) \end{aligned}

where |A|_{\gamma} is the standard gaussian measure of the set A, and \Phi(x) = \gamma((-\infty, x]) is the Gaussian cumulative distribution function. To prove this inequality one considers the following function B(u,v) = \mathbb{P}(X<\Phi^{-1}(u), Y < \Phi^{-1}(v)) and applies Gaussian Jensen’s inequality with test functions f=1_{A} and g=1_{B}.

In applications it is a different function B, sometimes defined implicitly, sometimes explicitly, that solves the problem. Such B‘s arise as pointwise minimal, “concave in the sense” \boldsymbol{\Sigma}\bullet \mathrm{Hess}\, B \leq 0 functions satisfying an obstacle condition B\geq H for some fixed H. How to find such “concave envelopes” remains unknown to me.

Analysis on {0,1}^N

The set \{0,1\}^{\mathbb{N}}, where \mathbb{N} is the set of natural numbers, consists of infinity strings of zeros and ones. It is equipped with mod(2) addition structure and it is called Cantor group, or dyadic group. Another notation is \prod_{1}^{\infty} \mathbb{Z}/2\mathbb{Z}. One studies functions f : \{0,1\}^{\mathbb{N}} \to \mathbb{R}.

One question one can ask is why historically we started with f : [0,1) \to \mathbb{R} and not with f : \{0,1\}^{\mathbb{N}} \to \mathbb{R}? Is not from the prospective of computers functions on \{0,1\}^{\mathbb{N}} more convenient to use? I understand that there is a binary correspondence (not necessarily unique) between [0,1) and \{0,1\}^{\mathbb{N}} but one may forget what kind of structures we had on \{0,1\}^{\mathbb{N}} after we moved to [0,1)? In egeneral we can encode everything via zeros and ones but we may lose symmetries and structures that we had on one side when we moved to the other side. So why nowadays we start teaching Fourier analysis on C([0,2\pi), dx) and not on \{0,1\}^{\mathbb{N}} equipped with a uniform measure (I will explain below what is meant by uniform measure on \{0,1\}^{\mathbb{N}})?

I do not have complete answers to these questions, perhaps Zygmund’s beautiful book on trigonometric series had a nontrivial influence on historical development of Fourier analysis on [0,2\pi).

In this post I will try briefly talk about Fourier analysis on \{0,1\}^{\mathbb{N}}. I saw a question on mathoverflow asking what harmonic analysis we have on \{0,1\}^{\mathbb{N}}. I wrote the answer there, and I thought I would share it here as well.

Harmonic analysis on \{0,1\}^{\mathbb{N}} can be different depending on how one enumerates orthonormal basis, so called Walsh functions.

One usually considers Haar measure dm on \{0,1\}^{\mathbb{N}} which is a probability measure in the following sense: if you have a function f :\{0,1\}^{\mathbb{N}} \to \mathbb{R} which depends only on the first n variables, say f(x_{1}, ...) =g(x_{1}, \ldots, x_{n}) then

\begin{aligned} \int_{\{0,1\}^{\mathbb{N}}} f(x)dm(x) = \frac{1}{2^{n}} \sum_{(x_{1}, \ldots, x_{n}) \in \{0,1\}^{n}}g(x_{1}, \ldots, x_{n}). \end{aligned}

Orthonormal system on L^{2}(\{0,1\}^{\mathbb{N}}, dm) is called Walsh system, and perhaps it was first time introduced by Walsh in 1923: this is a certain family of functions on \{0,1\}^{\mathbb{N}} taking values +1 or -1.

I have seen 4 different enumerations of this family of functions used in the literature (perhaps there are more). The first 3 of them (original Walsh, Paley, and Walsh–Kaczmarz) is well described in the book F. Schipp, WR. Wade, P. Simon, “Walsh series: an introduction to dyadic harmonic analysis”, see Section 1.4. The 4th one which I find kind of exceptional is not there.

1). Walsh’s original enumeration: Walsh defined his system recursively. Later, it turned out that they can be written as follows:

\begin{aligned} \varphi_{n}(x) = (-1)^{\sum_{k=0}^{\infty}(n_{k}+n_{k+1})x_{k+1}}, \quad n \in \mathbb{N}, \end{aligned}

where n=\sum_{k=0}^{\infty} n_{k} 2^{k} with n_{k} \in \{0,1\}, and x=(x_{1}, \ldots ) \in \{0,1\}^{\mathbb{N}}. In other words (n_{0}, n_{1}, \ldots) is the binary representation of the positive integer n. The system \{\varphi_{n}(x)\}_{n\geq 1} is the Walsh’s original system.
Partial sums S_{m}f = \sum_{k=0}^{m} c_{k} \varphi_{k}, where c_{k} = \int_{\{0,1\}^{\mathbb{N}}}f \varphi_{k} dm, converge to f in L^{p}, p>1. The latter is equivalent to the statement that

\begin{aligned}\sup_{n \geq 1}\| S_{n} f\|_{p} \lesssim \|f\|_{p}, \quad 1<p<\infty \end{aligned}.

Moreover S_{n} f converges to f for any f \in L^{p}(\{0,1\}^{\mathbb{N}}, dm) provided that p \in (1,\infty). This is a consequence of Carleson’s maximal inequality

\begin{aligned} \| \sup_{n \geq 1} |S_{n} f| \|_{p} \lesssim \|f\|_{p}, \quad 1<p<\infty. \end{aligned}

Luzin asked in the trigonometric case on the unit circle \mathbb{T}. Carleson proved for p=2; Hunt extended to 1<p<\infty. Billard proved for the original Walsh system for p=2; Sjölin extended for 1<p<\infty.
Fejer means are bounded, i.e.,

\begin{aligned} \sup_{n\geq 1}\left\| \frac{S_{1}+...+S_{n}}{n} f\right\|_{p} \lesssim \|f\|_{p}, \quad 1\leq p \leq \infty \end{aligned}

In the book F. Schipp, WR. Wade, P. Simon, there is a more “martingale proofs” of these results. Dyadic martingales naturally appear here. For example, \{ S_{2^{k}}f\}_{k \geq 0} is a martingale. In fact,
S_{2^{k}}f = \mathbb{E} (f | \mathcal{F}_{k}), where \mathcal{F}_{k} is \sigma-algebra, which is generated by atoms, where atoms are I_{k}(x) = \{ y \in \{0,1\}^{\mathbb{N}} : y_{1}=x_{1}, \ldots ,y_{k}=x_{k}\}. In probabilistic language you condition on the first k variables, and let the tail be the everything. Then, for example, the statement that

\begin{aligned} \|\sup_{k \geq 0}|S_{2^{k}} f|\|_{p} \lesssim \|f\|_{p} \quad 1<p<\infty \end{aligned}

is exactly Doob’s maximal inequality. The book F. Schipp, WR. Wade, P. Simon, “Walsh series: an introduction to dyadic harmonic analysis” provides more martingale techniques. The paper A guide to Calreson’s theorem by Ciprian Demeter has more time-frequency analysis introduction.

  1. Paley’s enumeration (Walsh–Paley functions). In 1932, Paley proposed a different enumeration of Walsh functions

    \begin{aligned} w_{n}(x) = (-1)^{\sum_{k=0}^{\infty} n_{k}x_{k+1}}, \quad n \in \mathbb{N}. \end{aligned}

    Paley’s enumeration perhaps is more close to the classical trigonometric system \sin(ns) and \cos(ns). We can “identify” [0,1) with \{0,1\}^{\mathbb{N}} as follows:

    \begin{aligned} s \in [0,1), \quad s = \sum_{k=1}^{\infty} s_{k} 2^{-k} \end{aligned}

    for some s_{k} \in \{0,1\}. This identification is not unique, for example, \frac{1}{2^{2}} can be written as (0,1,0,0,\ldots) and also as (0,0,1,1,1\ldots,). Among these two one choose the first one, i.e., when number of 1’s terminate starting from some place. Such bad points are not too many, they have zero measure so we do not bother with them. In particular \omega_{n}(s) is defined for s \in [0,1). Then

    \begin{aligned} w_{2^{k}}(s) = (-1)^{s_{k+1}} =\mathrm{sign}(\sin(2^{k+1} \pi s)) \approx \sin(2^{k+1} \pi s), \quad s \in (0,1]. \end{aligned}

    In general, \omega_{n}(s) is close to \sin(ns) or \cos(ns) depending n is even or odd.
    Frequencies \sin(2^{k+1} \pi s) are independent pretty much in the same way as \omega_{2^{k}}(s) = (-1)^{s_{k+1}} are independent: Khinchin’s inequality holds for both of them \| \sum c_{k} \omega_{2^{k}} \|_{2} \asymp \| \sum c_{k} \omega_{2^{k}} \|_{1}, and \| \sum c_{k} \sin(2^{k+1} \pi s) \|_{1} \asymp \| \sum c_{k} \sin(2^{k+1} \pi s)\|_{2}.
    One has the notion of “dyadic derivative”

    \begin{aligned} df(x) = 2^{0} \frac{f(x_{1}, x_{2}, \ldots) - f(1-x_{1}, x_{2}, \ldots)}{2}+2^{1}\frac{f(x_{1}, x_{2}, \ldots) - f(x_{1},1-x_{2}, \ldots)}{2}+2^{2} \ldots \end{aligned}

    provided that the sum converges. Perhaps one should call it “dyadic Laplacian” instead of the derivative. One can verify that Walsh–Paley functions are eigenfunctions: d\omega_{n}(x) = n \omega_{n}(x). One can study polynomials of degree m, say finite sums x \mapsto \sum_{k=0}^{m} c_{k}\omega_{k}(x), and ask questions similar to Markov–Brother’s and Bernstein estimates for such polynomials. One can ask regularity questions: if f is regular on (0,1] how regular it is on \{0,1\}^{\mathbb{N}} with respect to dyadic derivative and vice versa. What happens with partial sums if we have some regularity in terms of dyadic derivatives etc. The results about L^{p} convergence of partial sums, Carleson–Hunt theorem also hold for this enumeration.

  2. Sneider’s enumeration (Walsh-Kaczmarz system) was introduced by Sneider:

    \begin{aligned} k_{0}(x) =1, \quad k_{n}(x) = (-1)^{x_{A}+\sum_{k=0}^{A-1} n_{A-k-1}x_{k+1}}\quad \text{where} \quad 2^{A}\leq n < 2^{A+1} \end{aligned}

    I won’t say much about this system except that the same L^{p} convergence theorems hold in this case. Perhaps an interesting observation in all these 3 enumerations is that Dirichlet Kernels
    D_{n}(x) = \sum_{k=0}^{n-1} a_{n}(x) where a_{n} is either Walsh, Walsh–Paley, Walsh–Kaczmarz, for n=2^{k} are the same. In fact D_{2^{k}}(x) =\prod_{j=1}^{k}(1+(-1)^{x_j}).

Frequencies: when working with these 3 enumerations one usually thinks that a_{n}(x) is “high-frequency” element if n is large, and “low-frequency” element if n is small. For example \omega_{2^{k}}(x) = (-1)^{x_{k+1}} for large k are high frequency elements similar to \sin(2^{k+1}\pi s). All these 3 enumerations and concepts of frequencies tend to mimic the harmonic analysis on L^{p}(\mathbb{T}, d\theta), where d\theta is the uniform measure on the unit circle, i.e., trigonometric polynomials.

  1. Probabilistic enumeration (from independent towards dependent). In this enumeration the concept of “high/low frequencies” is changed in a very opposite way. In fact, the elements \{ \omega_{2^{k}}(x)\}_{k \geq 0} will be the lowest frequencies, say frequencies of degree 1. Elements \omega_{n}(x) where n has two bits in its binary representation will be frequencies of order 2. These are precisely Walsh functions (-1)^{x_{i}+x_{j}}, i<j. In a certain sense enumeration starts from “independent frequencies to dependent ones“, and probabilistic ideas, measure concentration phenomena, dimension independent phenomena, and the concept of “independence” happen to be useful. The goal is to represent functions f : \{0,1\}^{\mathbb{N}} \to \mathbb{R} as follows:

    \begin{aligned} f(x_{1},x_{2}, \ldots) = c_{0} + \underbrace{c_{1}(-1)^{x_{1}}+c_{2}(-1)^{x_{2}}+\ldots+c_{i}(-1)^{x_{i}}+\ldots}_{\mathrm{linear\; terms}} \end{aligned}
    \begin{aligned}+\underbrace{c_{12}(-1)^{x_{1}+x_{2}}+c_{13}(-1)^{x_{1}+x_{3}}+\ldots+c_{ij}(-1)^{x_{i}+x_{j}}+\ldots}_{\mathrm{second\, order\, terms}}+\mathrm{higher\, order \, terms} \end{aligned}

    Of course, if one takes f \in L^{2}(\{0,1\}^{\mathbb{N}}, dm), writes into Walsh–Paley system f(x) = \sum_{n\geq 0} a_{k} \omega_{k}(x), and wants to rearrange the terms to write in the form (1) then rearranged infinity sums may give different results unless \sum |a_{n}| <\infty which is very unlikely for a “typical” f \in L^{2}. However, if one considers functions f which depend only on the first n variables the rearranged sums will be the same. This is not a big restriction on the class of functions because when one wants to prove a statement for functions on \{0,1\}^{\mathbb{N}} one does not start with f \in L^{2} which depends on all its variables, one starts from a “finite dimensional problem” when f depends only on the first n variables, and one proves the corresponding statements for such functions, requires the constants involved in the statement to be independent of n, and then passes to a limit, i.e., to an infinite dimensional problem.
    Thus if we consider functions f which depend only on the first n variables then

    \begin{aligned} f(x_{1}, \ldots, x_{n}) = c_{0}+c_{1}(-1)^{x_{1}}+\ldots +c_{n}(-1)^{x_{n}}+\sum_{1\leq i<j \leq n} c_{ij}(-1)^{x_{i}+x_{j}} + \ldots  = \sum_{S \subset \{1, \ldots, n\}} a_{S}\, (-1)^{\sum_{j \in S} x_{j}} \end{aligned}

    The latter notation seems to be compact and convenient. Here a_{S} are Fourier coefficients, and (-1)^{\sum_{j \in S} x_{j}} are Walsh functions (enumerated in a strange way) indexed by the sets S. Since this enumeration works well provided that f depends only on finite number of variables, the questions about L^{p} convergence, Carleson–Hunt theorem etc do not make any sense. So the direction of harmonic analysis in which Luzin was interested ends here. Well, one can still consider partial sums

    \begin{aligned} S_{d} f (x) = \sum_{\substack{S \subset \{1, \ldots, n\}\\ |S|\leq d}} a_{S} \, (-1)^{\sum_{j \in S} x_{j}}, \end{aligned}

    where |S| denotes the cardinality of the set S, and ask similar questions

    \begin{aligned} \| S_{d} f\|_{p} \lesssim \|f\|_{p} \end{aligned}

    with uniform bound for all d, 1\leq d \leq n, for all n\geq 1, and all f depending on the first n variables. Such estimate holds if and only if p=2.
    Interesting connection arises with Gauss space. For example

    \begin{aligned} \frac{\ell!}{n^{\ell/2}} \sum_{\substack{S \subset\{1, \ldots, n\}\\ |S|=\ell}} (-1)^{\sum_{j \in S} x_{j}} \stackrel{d}{\to} H_{\ell}(\xi) \end{aligned}

    where H_{\ell} is degree \ell probabilists’ Hermite polynomial, \xi \in \mathcal{N}(0,1) is the standard normal Gaussian random variable, and the convergence is in the sense of distributions. By Taking tensor products of the above example one recovers Hermite polynomials on \mathbb{R}^{N} with arbitrary N. So, because of this limit, it is reasonable to think about (-1)^{\sum_{j \in S} x_{j}} as low frequency element if |S| is small, and high frequency element if |S| is large (which is very opposite compared to the original Walsh enumeration). In fact, the analysis with such enumeration (1) for functions f depending on n variables is similar to the one on L^{p}(\mathbb{T}^{n}, d\theta^{n}). Dscrete Laplacian is defined in a different way

    \begin{aligned} \Delta f(x) = \sum_{j=1}^{n} D_{j}f(x), \end{aligned}

    where D_{j}f(x_{1}, \ldots, x_{n}) = \frac{f(x_{1}, \ldots,x_{j}, \ldots, x_{n})-f(x_{1}, \ldots, 1-x_{j}, \ldots, x_{n})}{2}. One also considers discrete gradient

    \begin{aligned} Df(x) = (D_{1} f(x), \ldots, D_{n}f(x)). \end{aligned}

    One has integration by parts formula

    \begin{aligned} \int_{\{0,1\}^{\mathbb{N}}} g \Delta f dm=\int_{\{0,1\}^{\mathbb{N}}} Df \cdot Dg\, dm \end{aligned}

    for all functions f,g depending on the first n variables. Heat semigroup

    \begin{aligned} e^{-t\Delta} f = \sum_{S \subset \{1, \ldots, n\}}a_{S} e^{-t|S|} (-1)^{\sum_{j\in S}x_{j}} \end{aligned}

    satisfies heat equation \frac{\partial }{\partial t} e^{-t\Delta}f = -\Delta e^{-t\Delta}f. Discrete gradient

    \begin{aligned} |\nabla f|(x) = \sqrt{\sum_{j=1}^{n} |D_{j}f(x)|^{2}} \end{aligned}

    when applied to f(x_{1}, \ldots, x_{n}) = g\left(\frac{(-1)^{x_{1}}+\ldots+(-1)^{x_{n}}}{\sqrt{n}}\right) for some smooth bounded g :\mathbb{R} \to\mathbb{R}, converges to the classical derivative

    \begin{aligned} \int_{\{0,1\}^{\mathbb{N}}} M(|\nabla f|(x)) dm(x) \stackrel{n \to \infty}{\to} \int_{\mathbb{R}}M(|g'(s)|) \frac{e^{-s^{2}/2}}{\sqrt{2\pi}}ds \end{aligned}

    for any smooth functions M. The case M(t)=|t|^{p} is a typical application. Denoting (-1)^{x_{j}} = \varepsilon_{j}, then \ell degree polynomials with respect to this enumeration can be written as

    \begin{aligned} g(\varepsilon_{1}, \ldots, \varepsilon_{n}) =\sum_{\substack{S \subset \{1, \ldots, n\} \\ |S|\leq \ell}} a_{S} \prod_{j \in S} \varepsilon_{j} \end{aligned}

    and the latter expression can be extended in a natural way for all \varepsilon_{1}, \ldots, \varepsilon_{n} \in \mathbb{R} as a multilinear classical polynomial in n variable of degree d. Due to the identity \|g\|_{L^{\infty}([-1,1]^{n})} = \|g\|_{L^{\infty}({-1,1}^{n})} tools from approximation theory enter in an unexpected way applied to actual degree \ell multilinear polynomials. Discrete derivatives coincide with the actual derivatives for such multilinear polynomials. Boolean functions f :\{0,1\}^{\mathbb{N}} \to \{0,1\} have an extra property f^{2}=f and this gives more cancelations when proving statements for all real valued functions.

To summarize: Harmonic analysis on \{0,1\}^{\mathbb{N}} depends on how one encodes Walsh functions, i.e., frequencies in the Hilbert space L^{2}(\{0,1\}^{\mathbb{N}}, dm), who are the low and who are the high frequency functions.