Classical Jensen’s inequality
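If $f:\mathbb{R}\to\mathbb{R}$ is concave then for any probability measure $\mu$ on a space $S$, Jensen’s inequality
$$\int_{S} f(g)\,d\mu \le f\left(\int_{S} g\,d\mu\right)$$
holds for all measurable real valued functions $g$ provided that $f(g)$ is measurable.
A probabilist would write Jensen’s inequality as $\mathbb{E}f(X) \le f(\mathbb{E}X)$, where $X$ is a Borel measurable real valued random variable.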
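As a quick sanity check of the classical statement, here is a minimal numerical sketch. The helper name `jensen_gap` and the choice of a finitely supported random variable are my own assumptions for illustration; the sketch only evaluates the gap between the two sides of Jensen's inequality for the concave functions log and square root.

```python
import math
import random

def jensen_gap(f, values, weights):
    """Return f(E[X]) - E[f(X)] for a discrete random variable X.

    `values` are the atoms of X and `weights` are nonnegative masses,
    which are normalized here to a probability vector.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(p * x for p, x in zip(probs, values))
    mean_of_f = sum(p * f(x) for p, x in zip(probs, values))
    return f(mean) - mean_of_f

rng = random.Random(0)
values = [rng.uniform(0.1, 10.0) for _ in range(50)]
weights = [rng.uniform(0.0, 1.0) for _ in range(50)]

# For concave f the gap should be nonnegative.
gap_log = jensen_gap(math.log, values, weights)
gap_sqrt = jensen_gap(math.sqrt, values, weights)
# For an affine f the gap should vanish up to rounding.
gap_affine = jensen_gap(lambda x: 2.0 * x + 1.0, values, weights)
```

The affine case is included because it is exactly the borderline between concave and convex, where Jensen's inequality becomes an equality.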
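Jensen’s inequality with several variables has a series of applications. Pick an arbitrary set $\Omega \subset \mathbb{R}^n$ and assume that $B:\mathrm{conv}(\Omega)\to\mathbb{R}$ is concave, where $\mathrm{conv}(\Omega)$ denotes the convex hull of the set $\Omega$. Then for any Borel measurable random variable $X$ with values in $\Omega$ we have
$$\mathbb{E}B(X) \le B(\mathbb{E}X).$$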
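Cauchy–Schwarz inequality: Let $B$ be the map $(x,y)\mapsto\sqrt{xy}$. Notice that $B$ is concave on $[0,\infty)^2$. Therefore by Jensen’s inequality
$$\mathbb{E}\sqrt{XY} \le \sqrt{\mathbb{E}X\,\mathbb{E}Y}$$
holds for all nonnegative Borel measurable random variables $X, Y$. In particular, if $\mu$ is a probability measure on $S$ and $f, g$ are measurable functions then
$$\int |fg|\,d\mu \le \left(\int |f|^2 d\mu\right)^{1/2}\left(\int |g|^2 d\mu\right)^{1/2}.$$
If $\mu$ is not a probability measure then the following trick using the $1$-homogeneity of $B$ still gives the inequality in this case. Indeed, consider a new probability measure
$$d\nu = \frac{|g|^2\,d\mu}{\int |g|^2 d\mu}$$
and integrate the following inequality (in fact equality)
$$\frac{|f|}{|g|} \le B\left(\frac{|f|^2}{|g|^2}, 1\right)$$
with respect to $d\nu$. Applying Jensen we get
$$\frac{\int |fg|\,d\mu}{\int |g|^2 d\mu} = \int \frac{|f|}{|g|}\,d\nu \le B\left(\int \frac{|f|^2}{|g|^2}\,d\nu, 1\right) = \left(\frac{\int |f|^2 d\mu}{\int |g|^2 d\mu}\right)^{1/2}$$
and we obtain the Cauchy–Schwarz inequality. This argument gives the following
Remark: If $B$ is concave and 1-homogeneous then
$$\int B(f_1,\dots,f_n)\,d\mu \le B\left(\int f_1\,d\mu,\dots,\int f_n\,d\mu\right)$$
for all (not necessarily probability) measures $\mu$ provided that $B(f_1,\dots,f_n)$ is measurable.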
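The remark can be probed numerically on a finite measure space. The sketch below is only an illustration under my own assumptions: the space has 40 atoms with arbitrary positive masses (so the total mass is not 1), and the concave 1-homogeneous function is taken to be the square root of the product, which is the choice behind Cauchy–Schwarz. The names `B`, `integrate` and `homogeneity_defect` are hypothetical helpers.

```python
import math
import random

def B(x, y):
    """Concave, 1-homogeneous function on the nonnegative quadrant."""
    return math.sqrt(x * y)

def integrate(values, mass):
    """Integral of a function on a finite space against the weights `mass`."""
    return sum(v * m for v, m in zip(values, mass))

def homogeneity_defect(x, y, lam):
    """|B(lam*x, lam*y) - lam*B(x, y)|, which should be zero up to rounding."""
    return abs(B(lam * x, lam * y) - lam * B(x, y))

rng = random.Random(1)
n = 40
mass = [rng.uniform(0.0, 5.0) for _ in range(n)]  # not normalized to total mass 1
f = [rng.uniform(-3.0, 3.0) for _ in range(n)]
g = [rng.uniform(-3.0, 3.0) for _ in range(n)]

# Left side: integral of B(f^2, g^2) = |f g| against the measure.
lhs = integrate([B(a * a, b * b) for a, b in zip(f, g)], mass)
# Right side: B applied to the integrals of f^2 and g^2.
rhs = B(integrate([a * a for a in f], mass), integrate([b * b for b in g], mass))
```

Here `lhs <= rhs` is exactly the Cauchy–Schwarz inequality for a measure that is not a probability measure.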
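Let us list other applications of Jensen’s inequality:
Hölder’s inequality:
$$\int fg\,d\mu \le \left(\int f^{p}\,d\mu\right)^{1/p}\left(\int g^{q}\,d\mu\right)^{1/q}$$
holds for any measure $\mu$, all nonnegative measurable $f, g$, and all powers $p, q > 1$ such that $\frac{1}{p}+\frac{1}{q}=1$.
Hint: use the fact that $B(x,y)=x^{1/p}y^{1/q}$ is concave on $[0,\infty)^2$. Integrate the inequality
$$f g^{1-q} \le B\left(f^{p}g^{-q}, 1\right)$$
with respect to the probability measure $d\nu = g^{q}\,d\mu / \int g^{q}\,d\mu$ and apply Jensen’s inequality.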
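A small numerical sketch of Hölder's inequality on a finite measure space follows. The helper names `integral` and `holder_sides`, the random data and the particular exponents are assumptions made for illustration only.

```python
import random

def integral(values, mass):
    """Integral of a function on a finite space against the weights `mass`."""
    return sum(v * m for v, m in zip(values, mass))

def holder_sides(f, g, mass, p):
    """Return (integral of f*g, ||f||_p * ||g||_q) where 1/p + 1/q = 1, p > 1."""
    q = p / (p - 1.0)
    lhs = integral([a * b for a, b in zip(f, g)], mass)
    norm_f = integral([a ** p for a in f], mass) ** (1.0 / p)
    norm_g = integral([b ** q for b in g], mass) ** (1.0 / q)
    return lhs, norm_f * norm_g

rng = random.Random(2)
mass = [rng.uniform(0.0, 3.0) for _ in range(30)]  # arbitrary finite measure
f = [rng.uniform(0.0, 4.0) for _ in range(30)]     # nonnegative functions
g = [rng.uniform(0.0, 4.0) for _ in range(30)]

results = {p: holder_sides(f, g, mass, p) for p in (1.2, 2.0, 5.0)}
```

Each entry of `results` holds the two sides of the inequality for one exponent, with the left side expected to be the smaller one.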
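Minkowski inequality: Let $p \ge 1$. Then
$$\|f+g\|_{p} \le \|f\|_{p}+\|g\|_{p}.$$
Hint: without loss of generality assume $f, g \ge 0$. Use the fact that $B(x,y)=(x^{1/p}+y^{1/p})^{p}$ is concave on $[0,\infty)^2$. Integrate the inequality
$$\left(\frac{f}{g}+1\right)^{p} \le B\left(\frac{f^{p}}{g^{p}}, 1\right)$$
with respect to a probability measure $d\nu = g^{p}\,d\mu / \int g^{p}\,d\mu$ and apply Jensen’s inequality.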
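The sketch below checks two things numerically: the triangle inequality for the p-norm on a finite measure space, and midpoint concavity of the function that I assume the hint refers to, namely B(x, y) = (x^(1/p) + y^(1/p))^p. All names and data are hypothetical and chosen for illustration.

```python
import random

def lp_norm(values, mass, p):
    """The L^p norm of a function on a finite space with weights `mass`."""
    return sum(abs(v) ** p * m for v, m in zip(values, mass)) ** (1.0 / p)

def B(x, y, p):
    """Assumed concave function behind Minkowski's inequality (x, y >= 0)."""
    return (x ** (1.0 / p) + y ** (1.0 / p)) ** p

def midpoint_concavity_gap(u, v, p):
    """B at the midpoint of u and v minus the average of B(u) and B(v)."""
    mid = ((u[0] + v[0]) / 2.0, (u[1] + v[1]) / 2.0)
    return B(mid[0], mid[1], p) - 0.5 * (B(u[0], u[1], p) + B(v[0], v[1], p))

rng = random.Random(3)
mass = [rng.uniform(0.0, 2.0) for _ in range(25)]
f = [rng.uniform(-5.0, 5.0) for _ in range(25)]
g = [rng.uniform(-5.0, 5.0) for _ in range(25)]
f_plus_g = [a + b for a, b in zip(f, g)]

# Slack in the triangle inequality for several exponents p >= 1.
slack = {p: lp_norm(f, mass, p) + lp_norm(g, mass, p) - lp_norm(f_plus_g, mass, p)
         for p in (1.0, 1.5, 2.0, 4.0)}

# Midpoint concavity of B at random pairs of points in the quadrant.
pairs = [((rng.uniform(0, 9), rng.uniform(0, 9)), (rng.uniform(0, 9), rng.uniform(0, 9)))
         for _ in range(200)]
concavity_gaps = [midpoint_concavity_gap(u, v, 3.0) for u, v in pairs]
```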
Hanner’s inequality: for we have
and the inequality reverses if .
Hint: WLOG . Use the fact that , where , is convex if , and it is concave if .
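Both directions of Hanner's inequality can be observed numerically. In the sketch below, which is an illustration under my own choice of data and helper names, the quantity on the left is the sum of the p-th powers of the norms of f+g and f-g, and the quantity on the right is (||f|| + ||g||)^p + | ||f|| - ||g|| |^p. The left side is the smaller one for p ≥ 2 and the larger one for 1 ≤ p ≤ 2.

```python
import random

def lp_norm(values, mass, p):
    """The L^p norm of a function on a finite space with weights `mass`."""
    return sum(abs(v) ** p * m for v, m in zip(values, mass)) ** (1.0 / p)

def hanner_sides(f, g, mass, p):
    """Return the two sides of Hanner's inequality as a pair (lhs, rhs)."""
    plus = [a + b for a, b in zip(f, g)]
    minus = [a - b for a, b in zip(f, g)]
    lhs = lp_norm(plus, mass, p) ** p + lp_norm(minus, mass, p) ** p
    nf, ng = lp_norm(f, mass, p), lp_norm(g, mass, p)
    rhs = (nf + ng) ** p + abs(nf - ng) ** p
    return lhs, rhs

rng = random.Random(5)
mass = [rng.uniform(0.1, 2.0) for _ in range(20)]
f = [rng.uniform(-3.0, 3.0) for _ in range(20)]
g = [rng.uniform(-3.0, 3.0) for _ in range(20)]

lhs_large_p, rhs_large_p = hanner_sides(f, g, mass, 3.0)   # p >= 2
lhs_small_p, rhs_small_p = hanner_sides(f, g, mass, 1.5)   # 1 <= p <= 2
lhs_two, rhs_two = hanner_sides(f, g, mass, 2.0)           # parallelogram law
```

At p = 2 both sides coincide, which is the parallelogram law.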
Remark: Let . Choose so that . Then By Hanner’s inequality we have . In particular if then
, i.e., space is uniformly convex. In fact space is uniformly convex for as well. Indeed, denote and in Hanner’s inequality with and for . Then Hanner’s inequality rewrites as
Since the map is even and convex, it is (strictly) increasing in each variable. In particular, if then the largest possible value for , call it , is the solution of the equation
. Now it should be a calculus exercise to show that . (for example, if then is the solution. As soon as , then because of the fact that is strictly increasing in each variable we have ). One can show that these upper bounds on obtained by Hanner’s inequality are sharp.
Before we move to another application let me mention an (open) problem, which if true would serve as a natural extension of Hanner’s inequality to functions.
Question 1. Let be symmetric Bernoulli random variables. Then for any we have
holds for , and the reverse inequality if .
Remark: Question 1 boils down to showing that is convex if in the domain , and it is concave if . This paper claims to have positively resolved this question but there is a gap in the proof: they prove convexity/concavity in each variable separately, which is not sufficient. The concavity for is due to Gideon Schechtman, so the only open case is to show convexity for , and concavity for . Also, notice that if we do not care about the constants then applying Khinchin’s inequality to both sides of Hanner’s inequality with functions we obtain
holds for and reverse inequality for . Notice that the latter inequality simply follows from Minkowski’s inequality: for , and reverse inequality for with . Reverse Minkowski can be proved by observing that the function is convex for in the domain .
Pinsker’s inequality: Let be probability distribution functions such that is absolutely continuous with respect to . Then
Hint: show the pointwise inequality
Substitute and integrate with respect to a probability measure . We obtain
where the last inequality follows by Jensen applied to a convex function for , .
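For discrete distributions, Pinsker's inequality can be tested directly. The sketch below checks the form KL(P || Q) ≥ 2 TV(P, Q)^2 with natural logarithms, where TV is half of the l^1 distance; the normalization used in the text above may be written differently, so treat this particular form, as well as the helper names, as my assumption.

```python
import math
import random

def kl_divergence(p, q):
    """Relative entropy sum p_i * log(p_i / q_i), natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def total_variation(p, q):
    """Total variation distance, i.e. half of the l^1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def random_distribution(rng, n):
    """A random probability vector with strictly positive entries."""
    w = [rng.uniform(0.05, 1.0) for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(6)
pinsker_gaps = []
for _ in range(200):
    p = random_distribution(rng, 8)
    q = random_distribution(rng, 8)
    # Gap KL - 2 * TV^2, expected to be nonnegative.
    pinsker_gaps.append(kl_divergence(p, q) - 2.0 * total_variation(p, q) ** 2)
```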
An abstract theorem: concave envelope
Concave/convex functions arise naturally when one tries to maximize/minimize certain functionals over all “test functions”. For example, here is an abstract theorem which happens to be useful for these kinds of applications. Consider the following problem
where is fixed, is a fixed subset, and is a fixed map. The supremum is taken over all random variables given on any probability space with values in (here we also assume that both and are measurable).
Theorem 1. Let be defined as above
Then coincides with the minimal concave function defined on the convex set with the obstacle condition .
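In one dimension the minimal concave majorant can be computed on a grid with an upper convex hull scan. The sketch below is a hypothetical illustration, not the method of the text: I assume the extremal problem has the form "maximize the expectation of H(Y) over random variables Y with a prescribed mean", and I pick the obstacle H(y) = -(y^2 - 1)^2 on [-2, 2] myself. The function name `concave_envelope` is also mine.

```python
def concave_envelope(xs, hs):
    """Minimal concave majorant of the points (xs[i], hs[i]) on the grid xs.

    xs must be strictly increasing. Uses a monotone chain scan that keeps
    only the vertices of the upper hull, then interpolates linearly.
    """
    if len(xs) == 1:
        return list(hs)
    hull = []  # indices of upper hull vertices, left to right
    for i in range(len(xs)):
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            cross = ((xs[i1] - xs[i0]) * (hs[i] - hs[i0])
                     - (hs[i1] - hs[i0]) * (xs[i] - xs[i0]))
            if cross >= 0.0:   # middle vertex lies on or below the chord
                hull.pop()
            else:
                break
        hull.append(i)
    env = [0.0] * len(xs)
    for a, b in zip(hull, hull[1:]):
        for i in range(a, b + 1):
            t = (xs[i] - xs[a]) / (xs[b] - xs[a])
            env[i] = (1.0 - t) * hs[a] + t * hs[b]
    return env

# Hypothetical obstacle with two maxima at -1 and 1 and a dip at 0.
xs = [-2.0 + 4.0 * i / 400 for i in range(401)]
hs = [-(x * x - 1.0) ** 2 for x in xs]
env = concave_envelope(xs, hs)

# A symmetric coin flip Y = +1 or -1 has mean 0 and E H(Y) = 0, which is
# strictly larger than H(0) = -1 and matches the envelope at 0.
coin_flip_value = 0.5 * (-(1.0 - 1.0) ** 2) + 0.5 * (-(1.0 - 1.0) ** 2)
```

The coin flip at the end shows why the answer to the extremal problem is the envelope rather than H itself: splitting the mass between the two maxima beats the deterministic choice Y = 0.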
This abstract theorem provides more “unexpected” applications of Jensen’s inequality. Here is one such application: let . Suppose we would like to estimate from above in terms of and . A trivial (and sharp) estimate is the triangle inequality (Minkowski’s inequality) . One drawback of this estimate is that if have disjoint support then in fact we have equality , whereas the triangle inequality gives . Obviously because for all . We see that Minkowski’s inequality becomes too rough when applied to functions which “do not overlap”. It becomes even worse when we apply it to n functions, say we want to bound from above . To measure the overlap between one possibility is to introduce a parameter . One asks what is the best possible upper bound on in terms of . The answer is the following
where . The equality holds if and only if both are nonnegative (or nonpositive) such that for some constant . If have disjoint support , then and we recover . In general, the map
is increasing on , and one can show that this sharpening of the triangle inequality does indeed sharpen it:
, and noticing the identity
To prove (1) one observes that
is concave on . Without loss of generality assume , consider a new probability measure
, integrate the identity
with respect to and use Jensen’s inequality. The obtained final estimate will be (1).
Of course, one may ask how one could ever guess this expression for ? And why would one ever believe that such a function exists at all? The starting point was the abstract theorem. We wanted to find
where the supremum is taken over all nonnegative random variables over any probability space. The abstract theorem tells us that is the minimal concave function in the domain with the obstacle condition . So eventually we are solving a problem from geometry, i.e., finding the concave envelope. See my talk about how to solve such geometric problems in a systematic way.
I should mention that in this case we were lucky and the function turned out to have an explicit formula. Sometimes (in fact most of the time) such optimal functions are implicit. Here is an example of an implicit function: having Hanner’s inequality one may ask to find
where is an arbitrary real valued measure such that its support contains at least two points. It immediately follows from Hanner’s inequality that
Perhaps one may guess that we actually have equality here. Unfortunately, no. This would be “too good to be true”. In fact there is a linear function one can put between and Hanner, namely,
Inequality (C) is known as Clarkson’s inequality. And the inequality (*) is simple, it follows from the fact that the right hand side, as a concave function, dominates its tangent plane at point .
So what is the value of this mysterious function ? It was calculated in my PhD thesis, see Section 2.7. The function is defined implicitly, it involves solutions of a finite number of algebraic equations; I do not intend to reproduce its value here.
Gaussian Jensen’s inequality
Suppose we want to understand under what conditions on we have
holds for all test functions, say real valued , where are some random variables (not necessarily all possible random variables!). If , i.e., and are one and the same random variable, then the best we can hope for is that must be concave; this is a necessary and sufficient condition. If are independent then must be separately concave, i.e., concave in each variable separately (but not necessarily jointly). This observation suggests that maybe if there is some fixed correlation between random variables then perhaps the right condition on will be something between jointly concave and separately concave. The answer is yes, and in the case of normal random vectors we have the following nice characterization
Theorem 2. Let be a normal random vector. Then
holds for all test functions if and only if
where denotes Hadamard product, and means the matrix is negative semidefinite.
Remark: We assume that is at least in a rectangular domain where are intervals, rays, or real line. In this case test functions take values in , . Also, negative semi-definiteness is required to hold for all .
A short proof is given in Section 2.2 here. To match the notation: being a normal random vector means that , where and is some matrix. Clearly . In the paper one uses instead of .
Here I would like to discuss the stochastic calculus approach. Essentially these will be the same proofs, but perhaps the stochastic calculus approach is more intuitive.
Proof of Theorem 2.
Consider a stochastic process , , such that and , i.e., has the same law as . The simplest way to construct such a process is to consider the stochastic differential equation
where is our matrix, and is the standard dimensional Brownian motion starting at zero. One may notice that is also a martingale; this will play a key role in the proof. Next, we consider a new process
We have and . If we can see that the map
is monotone on , namely, it is non-increasing then we obtain
This will prove Jensen’s inequality in the simple case when , i.e., the test functions are identity maps.
To investigate the monotonicity of , as a rule of thumb we differentiate the process in . Using Ito calculus we have
where in the last inequality we used a fact from linear algebra that for rectangular matrices is , and is . In the second to the last inequality we used the property that .
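The linear algebra used in this step can be checked numerically. In the sketch below I assume, as my own reading of the computation, that the drift term produced by Ito's formula is the trace of AᵀHA, where H is a symmetric matrix standing in for the Hessian and A is a rectangular n×k matrix. The identities checked are Tr(AB) = Tr(BA) for rectangular matrices and Tr(AᵀHA) = Σ H_ij C_ij with C = AAᵀ, the latter being the sum of all entries of the Hadamard product of H and C. The helper names are hypothetical.

```python
import random

def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def trace(X):
    return sum(X[i][i] for i in range(len(X)))

def hadamard_total(H, C):
    """Sum of all entries of the Hadamard (entrywise) product of H and C."""
    return sum(H[i][j] * C[i][j] for i in range(len(H)) for j in range(len(H)))

rng = random.Random(4)
n, k = 3, 5
A = [[rng.uniform(-1.0, 1.0) for _ in range(k)] for _ in range(n)]   # n x k
S = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
H = [[0.5 * (S[i][j] + S[j][i]) for j in range(n)] for i in range(n)]  # symmetric

At = transpose(A)
C = matmul(A, At)                                  # covariance matrix A A^T
trace_form = trace(matmul(matmul(At, H), A))       # Tr(A^T H A)
hadamard_form = hadamard_total(H, C)               # <(H o C) 1, 1>
```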
Taking the expectation of both sides and invoking the martingale property we obtain
Great! Now consider a general case when are not identity maps. One natural process to look at is
If we repeat the same computations we encounter several issues. First notice that the dimensional process is not a martingale. Also which is not the right hand side in Jensen’s inequality.
To make the dimensional process a martingale there is one obvious way. Consider a vector function
Then the process is a martingale:
Let . Then Feynman–Kac tells us
Therefore Ito calculus gives
If we let be a column vector, and be the Jacobian matrix of in variable , then we can write
Integrating in and taking the expectation of both sides we obtain
Next, notice that depends only on the coordinate for . Indeed,
where denotes ‘th coordinate of the vector, and denotes row of the matrix . Thus is the diagonal matrix, say with vector on its diagonal. Thus, we have
finishing the proof of one implication in the theorem. To show that implies the infinitesimal inequality , is more or less easy and is left as an exercise. The hint is the following: by dilating and shifting the test functions show that the inequality implies the inequality for all . Now send and differentiate at zero.
Remark: Let be a random vector with mean and covariance matrix . To see the appearance of “immediately” one can perhaps make the following argument rigorous: let and . Take test function . Then . Take . Taylor’s formula gives
Dividing both sides of the inequality by and taking one recovers .
Summary of the proof: what did we do?
Let us summarize how we proved Gaussian Jensen’s inequality. For those who are not familiar with stochastic calculus: we constructed a path between two measures and the delta measure so that the map , is such that is monotone. One natural way to construct such a path is through the stochastic process . In general one can write down what these measures are: they satisfy the Fokker–Planck equation, as can be seen from the stochastic differential equation for . These kinds of questions arise quite often in completely different areas of mathematics. The mass transportation problem falls into this framework, where one tries to minimize the quantity over all couplings so that has fixed law , and has another fixed law . The point is that we can take to be independent of each other, or put some dependence between them.
Remark (very briefly): One may hope that can be expressed as for some map (assume for simplicity that our random variables take values in ). This is possible most of the time but not always, for example, if has density , and has density , then this is not possible. But if and , and such exists and it is a diffeomorphism, then must satisfy the Monge–Ampère equation
where is the Jacobi matrix of the map . We say pushes forward the measure to , i.e., in the sense that for all measurable . There can be many such maps (in fact there are always many!). If one looks for solutions among , where is convex, then one expects that such is unique (up to an additive constant), and such maps for a convex are very special; they are called Brenier maps. They are special because if is a strictly convex function of the distance then is the minimizer of the optimal transport problem (Monge’s problem)
In Kantorovich’s formulation
one usually hopes that these are the same problems, but as we noticed not always. However, as soon as Monge’s problem has a solution (minimizer exists), then yes, these problems are the same.
If it is very typical to interpolate between the measures and like this
This kind of interpolation turns out to be helpful when proving certain estimates (Brunn–Minkowski, Marton–Talagrand). It gives a different point of view on interpolation between two measures, which seems to be different from stochastic calculus.
Going back to Kantorovich’s formulation one can show
under some mild additional assumptions on .
Also, when measures live on the real line then
where are cumulative distribution functions of and respectively, and . Furthermore, assuming do not give point masses then is the optimizer in Monge’s problem. In other words the real line is well understood.
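On the real line the optimal coupling for a convex cost of the difference is the monotone one, which for two empirical measures with the same number of atoms simply pairs the sorted samples. The sketch below compares that pairing with a brute force search over all permutations. The cost |x - y|^p with p ≥ 1, the sample size and all function names are my own assumptions for illustration.

```python
import random
from itertools import permutations

def monotone_cost(xs, ys, p):
    """Average cost of the coupling that pairs sorted xs with sorted ys."""
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) ** p for x, y in pairs) / len(xs)

def brute_force_cost(xs, ys, p):
    """Minimal average cost over all one-to-one pairings of xs with ys."""
    best = float("inf")
    for perm in permutations(ys):
        cost = sum(abs(x - y) ** p for x, y in zip(xs, perm)) / len(xs)
        best = min(best, cost)
    return best

rng = random.Random(7)
xs = [rng.gauss(0.0, 1.0) for _ in range(6)]   # atoms of the first measure
ys = [rng.uniform(-2.0, 5.0) for _ in range(6)]  # atoms of the second measure

comparison = {p: (monotone_cost(xs, ys, p), brute_force_cost(xs, ys, p))
              for p in (1.0, 2.0, 3.0)}
```

Sorting plays the role of composing one quantile function with the other distribution function, so no search over couplings is needed in dimension one.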
Another example that comes to my mind is the Gaussian Correlation Conjecture (now a theorem). One equivalent way to state the theorem is as follows. Pick arbitrary positive numbers . Let . Show that for any normal random vector we have
, where is another normal random vector such that , i.e., and have the same laws, and is independent from . The inequality is true for any . Royen found a path, call it , such that , and and so that is monotone of . In fact, one can choose to be normal random vector with covariance matrix
, where is the covariance matrix of , and is the covariance matrix of . One can show by a remarkable computation.
In general it is an interesting open problem to understand the following: given two random variables and , for simplicity say they are on the same probability space . They can be from some family of random variables, say Gaussians, or from all random variables (without any restriction). How does one find a path , , , and what conditions should one put on so that , is monotone?
Applications of Gaussian Jensen’s inequality
Any theorem may easily be forgotten unless one knows about interesting applications. I will briefly list some of them here:
We talked about hypercontractivity on the hypercube in one of my previous blog posts. Here I will speak about hypercontractivity in Gauss space.
For let , where is the Hermite operator. Let . Then , , for all test functions if and only if .
Proof: the necessity follows by considering linear functions . Then . Now substitute these functions into our inequality, take , and expand norms using Taylor’s formula.
To prove sufficiency part, WLOG . By duality we can rewrite the inequality as
for standard independent Gaussians . Here .
To give a “Jensen’s inequality form”, we can further rewrite it as
for all nonnegative bounded functions , where . Let us see what Gaussian Jensen’s inequality tells us for this case. The covariance matrix for the normal random vector is
if and only if . On the other hand .
To put the argument in a different way: in general , , is concave if and only if . But when we take the Hadamard product of with then the resulting matrix can be made negative semidefinite even in the range provided that is sufficiently small.
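The effect of the Hadamard product can be seen on a concrete example. In the sketch below I assume the function in question is B(u, v) = u^(1/p) v^(1/q) and that the covariance matrix is [[1, rho], [rho, 1]]; both are my assumptions, as are the helper names. A direct computation of the determinant shows that the Hadamard product of the Hessian with this covariance matrix is negative semidefinite exactly when (p - 1)(q - 1) ≥ rho^2, and the code compares this closed form with the eigenvalues.

```python
import math

def hadamard_hessian(u, v, p, q, rho):
    """Hessian of u^(1/p) v^(1/q), entrywise multiplied by [[1, rho], [rho, 1]]."""
    a, b = 1.0 / p, 1.0 / q
    h11 = a * (a - 1.0) * u ** (a - 2.0) * v ** b
    h22 = b * (b - 1.0) * u ** a * v ** (b - 2.0)
    h12 = a * b * u ** (a - 1.0) * v ** (b - 1.0)
    return [[h11, rho * h12], [rho * h12, h22]]

def largest_eigenvalue(M):
    """Largest eigenvalue of a symmetric 2x2 matrix."""
    half_trace = 0.5 * (M[0][0] + M[1][1])
    half_diff = 0.5 * (M[0][0] - M[1][1])
    return half_trace + math.sqrt(half_diff ** 2 + M[0][1] ** 2)

def is_negative_semidefinite(p, q, rho, tol=1e-12):
    """Check the Hadamard condition on a grid of points (u, v) in (0, inf)^2."""
    grid = [0.5, 1.0, 1.5, 2.0]
    return all(largest_eigenvalue(hadamard_hessian(u, v, p, q, rho)) <= tol
               for u in grid for v in grid)

def closed_form_condition(p, q, rho):
    return (p - 1.0) * (q - 1.0) >= rho ** 2

# With p = q = 1.5 we have 1/p + 1/q > 1, so B is not jointly concave (rho = 1),
# yet the Hadamard condition holds once the correlation is small enough.
small_rho_ok = is_negative_semidefinite(1.5, 1.5, 0.4)
large_rho_ok = is_negative_semidefinite(1.5, 1.5, 0.6)
joint_concavity = is_negative_semidefinite(1.5, 1.5, 1.0)
```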
In what follows I will briefly list some applications of Gaussian Jensen’s inequality, and for simplicity I will consider applications only in dimension . Higher dimensions easily follow from the “higher dimensional version” of Gaussian Jensen’s inequality where one considers normal random vectors , where each is another normal random vector, and . In this case the proof of “higher dimensional Gaussian Jensen” is absolutely the same as the one I presented here, only the linear algebra is a little more involved.
1. Brascamp–Lieb inequality: by playing with and its reverse forms one obtains the “Gaussian Brascamp–Lieb inequality” with constant 1, which is not the classical Brascamp–Lieb as the latter involves Lebesgue measure and certain sharp constants. To get the classical Brascamp–Lieb there is a certain “change of variables”, and an unpleasant passage to a limit to get rid of the Gaussian measures.
2. Ehrhard inequality: for all measurable sets , and all nonnegative we have
where is the standard Gaussian measure, and is the Gaussian cumulative distribution function.
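In dimension one the Ehrhard inequality can be tested on intervals. The sketch below covers only a special case that I chose for illustration: the sets are bounded intervals and the weights form a convex combination (1 - t, t), so that the Minkowski combination of two intervals is again an interval with combined endpoints. The function names are hypothetical; the Gaussian distribution function and its inverse come from the standard library class `statistics.NormalDist`.

```python
from statistics import NormalDist

STD = NormalDist()  # standard Gaussian, mean 0 and standard deviation 1

def gauss_measure(interval):
    """Standard Gaussian measure of the interval [a, b]."""
    a, b = interval
    return STD.cdf(b) - STD.cdf(a)

def ehrhard_gap(A, B, t):
    """Phi^{-1}(gamma((1-t)A + tB)) - ((1-t) Phi^{-1}(gamma(A)) + t Phi^{-1}(gamma(B)))."""
    combo = ((1.0 - t) * A[0] + t * B[0], (1.0 - t) * A[1] + t * B[1])
    lhs = STD.inv_cdf(gauss_measure(combo))
    rhs = (1.0 - t) * STD.inv_cdf(gauss_measure(A)) + t * STD.inv_cdf(gauss_measure(B))
    return lhs - rhs

A = (-1.0, 0.5)
B = (0.2, 2.0)
# The gap should be nonnegative for every t in [0, 1].
gaps = [ehrhard_gap(A, B, i / 10.0) for i in range(11)]
```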
An interesting case to keep in mind is when and . The Ehrhard inequality implies the Gaussian isoperimetric inequality; the proof is the same as the proof that Brunn–Minkowski implies the classical isoperimetric inequality. Sometimes one states the Ehrhard inequality in a more general functional form
for all Borel measurable with values in , and by applying the functional form of Ehrhard to one obtains the Ehrhard inequality for sets. However, the Ehrhard inequality for sets (in dimension n) implies the functional Ehrhard in dimension n-1. The proof of the functional Ehrhard using Gaussian Jensen’s inequality is tricky, because we have a “supremum”, i.e., integral of an in the left hand side of the functional Ehrhard inequality. But this can be seen as a limit of norms as , and then one can rewrite it further by duality as one pure integral. But this is not enough, one also changes the Gaussian measure in a sophisticated way when taking the limit of norms. In these slides one can see how the main idea works. The complete proof in greater generality is given here.
3. Borell’s Gaussian noise stability: pick two standard Gaussian random variables with correlation . Then
where is the standard Gaussian measure of the set , and is the Gaussian cumulative distribution function. To prove this inequality one considers the following function and applies Gaussian Jensen’s inequality with test functions and .
In applications it is a different function , sometimes defined implicitly, sometimes explicitly, that solves the problem. Such ‘s arise as pointwise minimal, “concave in the sense” functions satisfying an obstacle condition for some fixed . How to find such “concave envelopes” remains unknown to me.