Classical Jensen’s inequality
If is concave then for any probability measure
on
we have Jensen’s inequality
holds for all measurable real valued functions
provided that
is measurable.
Probabilist would write Jensen’s inequality as where
is Borel measurable real valued random variable.
Jensen’s inequality with several variables has series of applications. Pick an arbitrary set and assume that
is concave, where
denotes convex hull of the set
. Then for any Borel measurable random variable
we have
Cauchy–Schwarz inequality: Let be a map
. Notice that
is concave on
. Therefore by Jensen’s inequality
holds for all nonnegative Borel measurable random variables . In particular, if
is a probability measure on
and
are measurable functions then
If is not a probability measure then the following trick using
-homogeneity of
still gives the inequality in this case. Indeed, consider a new probability measure
And integrate the following inequality (in fact equality) with respect to
Applying Jensen we get
and we obtain the Cauchy–Schwarz inequality. This argument gives the following
Remark: If
is concave and 1-homogeneous then
for all (not necessarily probability) measures
provided that
is measurable
Let us list other applications of Jensen’s inequality:
Holder’s inequality:
holds for any measure , all nonnegative measurable
, and all powers
such that
.
Hint: use the fact that is concave on
. Integrate the inequality
with respect to the probability measure and apply Jensen’s inequality.
Minkowski inequality: Let . Then
Hint: without loss of generality assume . Use the fact that
is concave. Integrate the inequality
with respect to a probability measure
and apply Jensen’s inequality.
Hanner’s inequality: for we have
and the inequality reverses if .
Hint: WLOG . Use the fact that
, where
, is convex if
, and it is concave if
.
Remark: Let
. Choose
so that
. Then By Hanner’s inequality we have
. In particular if
then
, i.e.,
space is uniformly convex. In fact
space is uniformly convex for
as well. Indeed, denote
and
in Hanner’s inequality with
and
for
. Then Hanner’s inequality rewrites as
.
Since the map
is even and convex, it is (strictly) increasing in each variable. In particular, if
then the largest possible value for
, call it
, is the solution of the equation
. Now it should be a calculus exercise to show that
. (for example, if
then
is the solution. As soon as
, then because of the fact that
is strictly increasing in each variable we have
). One can show that these upper bounds on
obtained by Hanner’s inequality are sharp.
Before we move to another application let me mention an (open) problem, which if true would serve as a natural extension of Hanner’s inequality to functions.
Question 1. Let
be symmetric
Bernoulli random variables. Then for any
we have
holds for, and the reverse inequality if
.
Remark: Question 1 boils down to showing that
is convex if
in the domain
, and it is concave if
. This paper claims to have positively resolved this question but there is a gap in the proof, they prove convexity/concavity in each variable separately which is not sufficient. The concavity for
is due to Gideon Schechtman, so the only open case is to show convexity for
, and concavity for
. Also, notice that if we do not care about the constants then applying Khinchin’s inequality to the both sides of Hanner’s inequality with
functions we obtain
holds forand reverse inequality for
. Notice that the latter inequality simply follows from Minkowski’s inequality:
for
, and reverse inequality for
with
. Reverse Minkowski can be proved by observing that the function
is convex for
in the domain
.
Pinsker’s inequality: Let be probability distribution functions such that
is absolutely continuous with respect to
. Then
Hint: show the pointwise inequality
Substitute and integrate with respect to a probability measure
. We obtain
where the last inequality follows by Jensen applied to a convex function for
,
.
An abstract theorem: concave envelope
Concave/convex functions arise naturally when one tries to maximize/minimize a certain functionals over all “test functions”. For example, here is an abstract theorem which happens to be useful for these kind of applications. Consider the following problem
where is fixed,
is a fixed subset, and
is a fixed map. The supremum is taken over all random variables
given on any probability space with values in
(here we also assume that both
and
are measurable.
Theorem 1. Let
be defined as above
Thencoincides with the minimal concave function defined on the convex set
with the obstacle condition
.
This abstract theorem provides with more “unexpected” applications of Jensen’s inequality. Here is one such application: let . Suppose we would like to estimate
from above in terms of
and
. A trivial (and sharp) estimate is triangle inequality (Minkowski’s inequality)
. One drawback of this estimate is that if
have disjoint support then in fact we have equality
, whereas the triangle inequality gives
. Obviously
because
for all
. We see that Minkowski’s inequality becomes too rough when applied to functions which “do not overlap”. It becomes even worse when we apply it to n functions, say we want to bound from above
. To measure the overlap between
one possibility is to introduce a parameter
. One asks what is the best possible upper bound on
in terms of
. The answer is the following
where . The equality holds if and only if both
are nonnegative (or non-npositive) such that
for some constant
. If
have disjoint support
, then
and we recover
. In general, the map
is increasing on , and one can show that this sharpening of triangle inequality, does indeed sharpen it:
by estimating , and noticing the identity
To prove (1) one observes that
is concave on . Without loss of generality assume
, consider a new probability measure
, integrate the identity
with respect to and use Jensen’s inequality. The obtained final estimate will be (1).
Of course, one may ask how could one ever guess this expression for ? And why you ever believed that there is such a function at all? The starting point was the abstract theorem. We wanted to find
where supremum is taken over all nonnegative random variables over any probability space. The abstract theorem tells us that is minimal concave function in the domain
with the obstacle condition
. So eventually we are solving the problem from geometry, i.e., find the concave envelope. See my talk about how to solve such geometric problems in a systematic way.
I should mention that in this case we were lucky and the function turned out to have an explicit formula. Sometimes (in fact most of the times) such optimal functions are implicit. Here is an example of an implicit function: having Hanner’s inequality one may ask to find
where is an arbitrary realm valued measure such that its support contains at least two points. It immediately follows from Hanner’s inequality that
Perhaps one may guess that we have actually equality here. Unfortunately ‘No’. This would be “too good to be true”. In fact there is a linear function one can put between and Hanner, namely,
Inequality (C) is known as Clarkson’s inequality. And the inequality (*) is simple, it follows from the fact that the right hand side, as a concave function, dominates its tangent plane at point .
So what is the value of this mysterious function ? It was calculated in my PhD thesis, see Section 2.7. The function
is defined implicitly, it involves solutions of finite number of algebraic equations, I do not intend to reproduce its value here.
Gaussian Jensen’s inequality
Suppose we want to understand under what conditions on we have
holds for all test functions, say real valued , where
are some random variables (not necessarily all possible random variables!). If
, i.e.,
and
are one and the same random variables, then the best we can hope for is that
must be concave, this is necessary and sufficient condition. If
are independent then
must be separately concave, i.e., concave in each variable separately (but not necessarily jointly). This observation suggests that maybe if there is some fixed correlation between random variables
then perhaps the right condition on
will be something between jointly concave and separately concave. The answer is yes, and in case of normal random vectors we have the following nice characterization
Theorem 2. Let
be a normal random vector. Then
![]()
holds for all test functionsif and only if
,
wheredenotes Hadamard product, and
means the matrix is negative semidefinite.
Remark: We assume that
is at least
in a rectangular domain
where
are intervals, rays, or real line. In this case test functions
take values in
,
. Also, negative semi-definiteness
is required to hold for all
.
A short proof is given in Section 2.2 here. To match the notations being normal random vector means that
, where
and
is some
matrix. Clearly
. In the paper one uses
instead of
.
Here I would like to discuss stochastic calculus approach. Essentially these will be the same proofs, perhaps stochastic calculus approach is more intuitive.
Proof of Theorem 2.
Consider stochastic process ,
, such that
and
, i.e.,
has the same law as
. The simplest way to construct such a process is to consider the stochastic PDE
,
where is our
matrix, and
is the standard
dimensional brownian motion starting at zero. One may notice that
is also a martingale, this will play a key role in the proof. Next, we consider a new process
.
We have and
. If we can see that the map
is monotone on , namely, it is non-increasing then we obtain
.
This will prove the Jensen’s inequality in the simple case when , i.e., the test functions
are identity maps.
To investigate the monotonicity of , as a rule of thumb we differentiate the process
in
. Using Ito calculus we have
,
where in the last inequality we used a fact from linear algebra that for rectangular matrices
is
, and
is
. In the second to the last inequality we used the property that
.
Thus .
Taking the expectation of both sides and invoking the martingale property we obtain
because ,
where .
Great! Now consider a general case when are not identity maps. One natural process to look at is
.
If we repeat the same computations we encounter with several issues. First notice that the dimensional process
is not a martingale. Also
which is not the right hand side in Jensen’s inequality.
The make the dimensional process
the martingale there is one obvious way to do so. Consider a vector function
Then the process is the martingale:
.
Let . Then Feynman–Kac tells us
Therefore Ito calculus gives
.
If we let to be a column vector, and
be the Jacobian matrix of
in variable
, then we can write
.
Thus
Integrating in and taking the expectation of both sides we obtain
Next, notice that depends only on the coordinate
for
. Indeed,
where denotes
‘th coordinate of the vector, and
denotes
row of the matrix
. Thus
is the diagonal matrix, say with vector
on its diagonal. Thus, we have
This implies
finishing the proof of one implication in Theorem. To show that implies the infinitesimal inequality
, is more or less easy and is left as an exercise. The hint is the following: by dilating and shifting the test functions show that the inequality
implies the inequality
for all
. Now send
and differentiate at zero.
.
Remark: Let
be a random vector with mean
and covariance matrix
. To see the appearance of
“immediately” one can perhaps make the following argument rigorous: let
and
. Take test function
. Then
. Take
. Taylor’s formula gives
Dividing both sides of the inequality by
and taking
one recovers
.
Summary of the proof: what did we do?
Let us summarize how did we prove Gaussian Jensen’s inequality. For those who are not familiar with stochastic calculus we constructed a path between two measures and the delta measure
so that the map
,
is such that
is monotone. One natural way to construct such path is done through stochastic process
. In general one can write down what are these measures
, they satisfy Fokker–Planck equation, as it can be seen from the stochastic PDE for
. These kind of questions happen to arise quite often in completely different areas of mathematics. Mass transportation problem falls into this framework, where one tries to minimize the quantity
over all couplings
so that
has fixed law
, and
has another fixed law
. The point is we can take
to be independent from each other, or put some dependence between each other.
Remark (very briefly): One may hope that
can be expressed as as
for some map
(assume for simplicity that our random variables take values in
). This is possible most of the times but not always, for example, if
has density
, and
has density
, then this is not possible. But if
and
, and such
exists and it is
diffeomorphism, then
must satisfy Monge–Ampere equation
whereis the Jacobi matrix of the map
. We say
pushes forward measure
to
, i.e.,
in the sense that
for all measurable
. Such maps
can be many (in fact they are always many!). If one looks for solutions
among
, where
is convex, then one expects that such
is unique (up to additive constant), and such maps
for a convex
are very special, they are called Brenier’s map. They are special because if
is strictly convex function of a distance
then
is the minimizer of optimal transport problem (Monge’s problem)
In Kantorovic’s formulation
one usually hopes that these are the same problems, but as we noticed not always. However, as soon as Monge’s problem has a solution (minimizer exists), then yes, these problems are the same.
Ifit is very typical to interpolation measures
and
like this
This kind of interpolation turns out to be helpful when proving certain estimates (Brunn–Minkowski, Marton–Talagrand). It gives a different point of view on interpolation between two measures, which seems to be different from stochastic calculus.
Going back to Kantorovich’s formulation one can show
under some mild additional assumptions on.
Also, when measures live on the real line then
whereare cumulative distribution functions of
and
respectively, and
. Furthermore, assuming
do not give point masses then
is the optimizer in Monge’s problem. In other words real line is well understood.
Another example that comes to my mind is Gaussian Correlation Conjecture (now theorem). One equivalent way to state the theorem is as follows. Pick arbitrary positive numbers . Let
. Show that for any normal random vector
we have
, where
is another normal random vector such that
, i.e.,
and
have the same laws, and
is independent from
. The inequality
is true for any
. Royen found a path, call it
, such that
, and
and so that
is monotone of
. In fact, one can choose
to be normal random vector with covariance matrix
, where
is the covariance matrix of
, and
is
covariance matrix of
. One can show
by a remarkable computation.
In general it is an interesting and open problem to understand the following: given two random variables and
, for simplicity say they are on the same probability space
. They can be from some family of random variables, say gaussians, or from all random variables (without any restriction). How to find a path
,
,
, and what conditions should one put on
so that
,
is monotone.
Applications of Gaussian Jensen’s inequality
Any theorem may easily be forgotten unless one knows about interesting applications. I will briefly list some of them here:
Hypercontractivity:
We talked about hypercontractivity on the hypercube in one of my previous blog post. Here I will speak about hypercontractivity in Gauss space.
For let
where
be the Hermite operator. Let
. Then
,
, for all test functions
if and only if
.
Proof: the necessity follows by considering linear functions . Then
. Now substitute these functions into our inequality, take
, and expand
norms using Taylor’s formula.
To prove sufficiency part, WLOG . By duality we can rewrite the inequality as
for standard independent Gaussians . Here
.
To give a “Jensen’s inequality form”, we can further rewrite it as for all nonnegative bounded functions
, where
. Let us see what Gaussian Jensen’s inequality tells us for this case. The covariance matrix for the normal random vector
is
Therefore
if and only if . On the other hand
.
.
To put the argument in a different way: in general ,
, is concave if and only if
. But when we take Hadamard product of
with
then the resulted matrix can be made semidefinite even in the range
provided that
is sufficiently small.
In what follows I will briefly list some applications of Gaussian Jensen’s inequality, and for simplicity I will consider applications only in dimension . Higher dimensions easily follow from “higher dimensional version” of Gaussian Jensen’s inequality where one considers normal random vectors
, where each
is another normal random vector, and
. In this case the proof of “higher dimensional Gaussian Jensen” is absolutely the same as the one I presented here, only the linear algebra will be a little bit involved.
1. Brascamp–Lieb inequality: by playing with and its reverse forms one obtains “Gaussian Brascamp–Lieb inequality” with constant 1, which is not the classical Brascamp–Lieb as it involves Lebesgue measures and a certain sharp constants. To get the classical Brascamp–Lieb there is certain “change of variables”, and an unpleasant passage to a limit to get rid off the Gaussian measures.
2. Ehrhard inequality: for all measurable sets , and all nonnegative
we have
where is the standard Gaussian measure, and
is the Gaussian cumulative distribution function.
An interesting case to keep in mind is when and
. The Ehrhard inequality implies Gaussian isoperimetric inequality, the proof is the same as Brunn–Minkowski implies classical isoperimetric inequality. Sometimes one states the Ehrhard inequality in a more general functional form
for all Borel measurable with values in
, and by applying the functional form of Ehrhard to
one obtains the Ehrhard inequality for sets. However, the Ehrhard inequality for sets (in dimension n) implies the functional Ehrhard in dimensions n-1. The proof of the functional Ehrhard using Gaissian Jensen’s inequality is tricky, because we have “supremum”, i.e., integral of an
in the left hand side of the functional Ehrhard inequality. But this can be seen as a limit of
norms as
, and then one can rewrite it further by duality as one pure integral. But this is not enough, one also changes gaussian measure in a sophisticated way when taking the limit of
norms. In these slides one can see how the main idea works. The complete proof in a more generality is given here.
3. Borell’s Gaussian noise stability: pick two standard Gaussian random variables with correlation
. Then
where is the standard gaussian measure of the set
, and
is the Gaussian cumulative distribution function. To prove this inequality one considers the following function
and applies Gaussian Jensen’s inequality with test functions
and
.
In applications it is a different function , sometimes defined implicitly, sometimes explicitly, that solves the problem. Such
‘s arise as pointwise minimal, “concave in the sense”
functions satisfying an obstacle condition
for some fixed
. How to find such “concave envelopes” remains unknown to me.