A Practical Introduction to Polynomial Optimization using the Moment-SOS Hierarchy

published by Luca Wellmeier on January 29, 2026

These notes are based mostly on the excellent survey article (Laurent, 2010). This text and the code used to perform the experiments are hosted at gh:lcwllmr/momsos. Read the instructions on GitHub if you would like to experiment with the code on your own. Please feel free to contact me with questions, suggestions, etc. at: luca dot wellmeier at uit dot no. These notes have been used for the following talks:

Contents:

  1. Non-negativity and sums of squares
  2. Automated proofs
  3. Optimization over semialgebraic sets
  4. Duality
  5. Moment matrices and optimizers

Non-negativity and sums of squares

Let $\R[x]$ be the polynomial ring in variables $x = (x_1, \dots, x_n)$ with real coefficients. We start with the following lead problem: decide whether a given polynomial $p \in \R[x]$ defines a globally non-negative function (in short, $p \geq 0$). While we expect this to be very hard to solve in general, we may consider some special cases of polynomials.

Example (Univariate case): Let $n = 1$ and assume that $p \in \R[x]$ is non-negative. Necessarily, $\deg(p) = 2d$ must be even, and the leading coefficient of $p$ must be positive (we might as well take $p$ monic). We also note that any real root of $p$ must appear with even multiplicity, as $p$ would otherwise flip sign there. Consider a factorization of $p$ into real roots and complex conjugate root pairs:

\begin{aligned} p(x) &= \prod_i (x - r_i)^{2 m_i} \prod_j (x - z_j)^{n_j} (x - \overline{z_j})^{n_j} \\ &= \underbrace{\left(\prod_i (x - r_i)^{m_i} \prod_j (x - z_j)^{n_j}\right)}_{g(x)} \underbrace{\left(\prod_i (x - r_i)^{m_i} \prod_j (x - \overline{z_j})^{n_j}\right)}_{\overline{g(x)}}. \end{aligned}

Thus, with $g(x) = u(x) + i v(x)$ for real polynomials $u$ and $v$, we obtain the expression $p(x) = g(x) \overline{g(x)} = u(x)^2 + v(x)^2$.

We have proven that every non-negative univariate polynomial is a sum of squares of other polynomials. The reverse, of course, holds trivially: being a sum of squares implies non-negativity. Is it possible that non-negativity can be characterized by the sum-of-squares (abbreviated, sos) criterion? Let's see another case.
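The construction from the example can be carried out numerically. Here is a minimal sketch (my own illustration, not one of the repo's experiments) that splits the non-negative polynomial $p(x) = x^4 + 1$ into $u^2 + v^2$ by grouping one root from each conjugate pair into $g = u + iv$:

```python
import numpy as np

# p(x) = x^4 + 1 is non-negative; its roots are the primitive 8th roots
# of unity, coming in two conjugate pairs.
p = np.poly1d([1.0, 0.0, 0.0, 0.0, 1.0])

# Pick one root from each conjugate pair (say, positive imaginary part)
# and form the monic complex polynomial g with exactly those roots.
half = [r for r in p.roots if r.imag > 0]
g = np.poly1d(half, r=True)

# Split g = u + i*v into real and imaginary parts.
u, v = np.poly1d(g.coeffs.real), np.poly1d(g.coeffs.imag)

# Check p = u^2 + v^2 by comparing coefficient vectors.
recon = np.polyadd(np.polymul(u.coeffs, u.coeffs),
                   np.polymul(v.coeffs, v.coeffs))
assert np.allclose(recon, p.coeffs)
```

For $x^4 + 1$ this recovers $u(x) = x^2 - 1$ and $v(x) = -\sqrt 2\, x$, i.e. $x^4 + 1 = (x^2 - 1)^2 + 2x^2$.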

Example (Quadratic case): We now let $n$ be arbitrary and pick $p \in \R[x]$ such that $p \geq 0$ and $\deg(p) = 2$. We may write

p(x) = x^T A x + b^T x + c,

where $A$ is a symmetric $(n \times n)$-matrix, $b$ an $n$-vector and $c$ a constant. First, observe that the non-negativity of $p$ implies that $A$ is positive semidefinite (or psd for brevity): indeed, if this were not the case, there would be a direction $u$ s.t. $u^T A u < 0$ and thus $p(tu) = t^2 (u^T A u) + O(t) \to -\infty$ as $t \to \infty$. Next, consider the homogenization $\tilde{p}$ of $p$ in vectorized form:

\tilde{p}(x, t) = x^T A x + b^T x t + c t^2 = \begin{bmatrix} x & t \end{bmatrix} \underbrace{\begin{bmatrix} A & \frac 12 b \\ \frac 12 b^T & c \end{bmatrix}}_{Q} \begin{bmatrix} x \\ t \end{bmatrix}.

Note that $\tilde p (x, 0) = x^T A x \geq 0$ since $A$ is psd, and $\tilde p (x, t) = t^2 p(x / t) \geq 0$ whenever $t \neq 0$ since $p \geq 0$, so that $\tilde p \geq 0$ globally on $\R^n \times \R$. It follows that its defining matrix $Q$ is psd itself and we may consider a Cholesky factorization $Q = U^T U$ to find

p(x) = \tilde p(x, 1) = \left( U \begin{bmatrix} x \\ 1 \end{bmatrix} \right)^T \left( U \begin{bmatrix} x \\ 1 \end{bmatrix} \right) = \left\| U \begin{bmatrix} x \\ 1 \end{bmatrix} \right\|^2.

This last expression for $p$ is again a sum of squares of affine functions.
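The argument is easy to check numerically. The sketch below (my own toy example; the polynomial $p(x_1,x_2) = (x_1-1)^2 + (x_1+x_2)^2$ is an assumption for illustration) builds $Q$ and verifies $p(x) = \|U[x;1]\|^2$. Since $Q$ may be psd but singular, I use an eigendecomposition square root instead of a strict Cholesky factorization:

```python
import numpy as np

# p(x1, x2) = (x1 - 1)^2 + (x1 + x2)^2 = 2*x1^2 + 2*x1*x2 + x2^2 - 2*x1 + 1
A = np.array([[2.0, 1.0], [1.0, 1.0]])
b = np.array([-2.0, 0.0])
c = 1.0

# Homogenization matrix Q = [[A, b/2], [b^T/2, c]]
Q = np.block([[A, b[:, None] / 2], [b[None, :] / 2, np.array([[c]])]])
assert np.linalg.eigvalsh(Q).min() >= -1e-12  # Q is psd

# Symmetric square root U with U^T U = Q (works even for singular psd Q).
w, V = np.linalg.eigh(Q)
U = (V * np.sqrt(np.clip(w, 0, None))) @ V.T

# Verify p(x) = || U [x; 1] ||^2 at a sample point.
x = np.array([0.7, -1.3])
lhs = x @ A @ x + b @ x + c
rhs = np.sum((U @ np.append(x, 1.0)) ** 2)
assert np.isclose(lhs, rhs)
```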

As in the univariate case, it turns out that non-negativity and being sos are equivalent for quadratic polynomials as well. This should provide enough motivation to introduce dedicated symbols.

Definition: The set of non-negative polynomials and the set of sos polynomials will be denoted respectively by $\NN_n = \set{p \in \R[x]}{p \geq 0}$ and $\SOS_n = \set{\sum_i q_i^2}{q_i \in \R[x]}$. Their respective truncations at degree $2d$ are denoted by $\NN_{n,2d}$ and $\SOS_{n,2d}$. The subscript $n$ may be dropped if the number of variables is clear from the context.

Both $\NN$ and $\SOS$ are convex cones: they are closed under addition and multiplication by non-negative scalars, and clearly $\SOS \subseteq \NN$. We have seen two families for which we even have equality - how far can we go with this? Not much further, unfortunately - a result which motivated Problem 17 in Hilbert's 1900 list of 23 problems.

Theorem (Hilbert, 1888): The equality $\NN_{n,2d} = \SOS_{n,2d}$ holds if and only if $(n,2d)$ is either of $(1,2d)$, $(n,2)$ or $(2,4)$.

In fact, the situation looks rather hopeless:

Theorem (Blekherman, 2006): Fix an even degree $d$ and let $H$ be the compact set of all $p \in \NN_{n,d}$ such that $\int_{\mathbb{S}^{n-1}} p \, d\sigma = 1$. Then, as $n \to \infty$,

\frac{\operatorname{vol}(\SOS_{n,d} \cap H)}{\operatorname{vol}(\NN_{n,d} \cap H)} \to 0.

The paper gives exact asymptotic rates (think $n^{-d}$; very fast), but the point is clear: the density of $\SOS$ in $\NN$ is vanishingly small in many variables or high degree. Nonetheless, sos-based methods have proven very practical for a wide variety of applications. The next section will justify this claim.

Automated proofs

Even though Hilbert proved that the equality $\NN = \SOS$ holds only in some special cases, the first explicit counterexample was given much later by Motzkin in 1967:

M(x,y) = x^4 y^2 + x^2 y^4 - 3 x^2 y^2 + 1.

When plotted as a function on $\R^2$, its surface looks like this:

Proving that it is globally non-negative is a simple application of the AM-GM inequality:

x^2 y^2 = \sqrt[3]{x^4 y^2 \cdot x^2 y^4 \cdot 1} \leq \frac{x^4 y^2 + x^2 y^4 + 1}{3}.

Rearranging yields $M \geq 0$. The next result is the key to algorithmically deciding whether a given polynomial is sos.
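As a quick numeric sanity check of the AM-GM argument (my own sketch, separate from the repo's experiments), one can evaluate $M$ on a grid and confirm that it stays non-negative and vanishes at $(\pm 1, \pm 1)$:

```python
import numpy as np

def motzkin(x, y):
    return x**4 * y**2 + x**2 * y**4 - 3 * x**2 * y**2 + 1

# Evaluate on a dense grid over [-2, 2]^2 (which contains all minimizers).
xs = np.linspace(-2, 2, 201)
X, Y = np.meshgrid(xs, xs)
vals = motzkin(X, Y)

assert vals.min() >= -1e-12          # non-negative everywhere on the grid
assert np.isclose(motzkin(1, 1), 0)  # the minimum value 0 is attained
assert np.isclose(motzkin(-1, 1), 0)
```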

Proposition (Parameterization of $\SOS$): Let $v_{n,d}(x) = (1, x_1, \dots, x_n, x_1^2, x_1 x_2, \dots, x_n^d)^T$ be the monomial basis of the truncated polynomial ring $\R[x]_d$, written as a column vector. A polynomial $p \in \R[x]_{2d}$ is sos if and only if there exists a psd matrix $Q$ such that

p(x) = v_{n,d}(x)^T Q v_{n,d}(x).

The exact choice of basis does not matter, of course. This matrix $Q$ will be referred to as the Gram matrix.

Proof: If there exists such a $Q$, we may proceed as in the quadratic polynomial case above: using a Cholesky factorization $Q = U^T U$ and plugging this in yields a sum-of-squares expression $p(x) = \| U v_{n,d}(x) \|^2$. Conversely, if $p = \sum_i q_i^2$ is sos, we may write each $q_i$ (which is necessarily of degree at most $d$) in terms of the basis: $q_i(x) = c_i^T v_{n,d}(x)$. Thus, $q_i(x)^2 = v_{n,d}(x)^T (c_i c_i^T) v_{n,d}(x)$ and by summing these up we get $p(x) = v_{n,d}(x)^T (\sum_i c_i c_i^T) v_{n,d}(x)$. Finally, we observe that $Q = \sum_i c_i c_i^T$ is psd as a sum of rank-1 psd matrices. $\square$
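Before handing the search for $Q$ to a solver, the proposition can be verified by hand on a tiny example. The polynomial $p(x) = x^4 + 2x^2 + 1 = (x^2 - 1)^2 + (2x)^2$ with basis $v = (1, x, x^2)$ is my own assumed example here, not taken from the repo:

```python
import numpy as np

# Gram matrix of p assembled from the proof: Q = c1 c1^T + c2 c2^T with
# c1 = (-1, 0, 1) for x^2 - 1 and c2 = (0, 2, 0) for 2x.
Q = np.array([[1.0, 0.0, -1.0],
              [0.0, 4.0, 0.0],
              [-1.0, 0.0, 1.0]])

# Q is psd ...
assert np.linalg.eigvalsh(Q).min() >= -1e-12

# ... and coefficient matching v^T Q v = p holds: the coefficient of x^k
# is the sum of Q[i, j] over monomial exponents i + j = k.
coeffs = [sum(Q[i, j] for i in range(3) for j in range(3) if i + j == k)
          for k in range(5)]
assert coeffs == [1.0, 0.0, 2.0, 0.0, 1.0]  # x^4 + 2 x^2 + 1
```

An sdp solver does exactly this in reverse: it searches the affine space of matrices satisfying the coefficient-matching constraints for a psd element.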

Checking whether the Motzkin polynomial $M(x,y)$ (or any polynomial, for that matter) is sos is therefore equivalent to checking whether the linear system

M(x,y) = v_{2,3}(x,y)^T Q v_{2,3}(x,y)

obtained via coefficient matching has a psd solution $Q$. In other words, it is equivalent to checking whether the intersection of an affine subspace with the convex cone of psd matrices is non-empty. A problem of this type can be solved in practice by semidefinite programming (short, sdp) solvers. The experiment motzkin_not_sos.py computes exactly this - see the first of the two tests.

```
(.venv) $ motzkin-not-sos

Starting test: Is M(x,y) a sum of squares?  ...  Computation finished.
  Solution status: infeasible
  Solve time: 0.00112448s
  Iterations: 9

Starting test: Is (x^2 + y^2) * M(x,y) a sum of squares?  ...  Computation finished.
  Solution status: optimal
  Solve time: 0.005551222s
  Iterations: 12
```

This tells us that the decomposition failed and that the Motzkin polynomial is not sos. Some implementations of the so-called interior point method will also return a "dual infeasibility certificate". To understand what that means exactly, we need to study the conic dual of $\SOS$. For now it is enough to think of such certificates as a concrete hyperplane in some closely related cone that separates the feasible region from the point of interest. More on this later.

So Motzkin is now verifiably not sos. However, according to the second test performed in the experiment, it turns out that multiplying Motzkin by some other sos polynomial results in something sos. This is not just a coincidence. Hilbert's result from the last section motivated a new question (which became Problem 17): Is every non-negative polynomial a sum of squares of rational functions? This was answered positively by (Artin, 1927). In our case, $M(x,y)$ is not sos, but it is a fraction of two sos polynomials: the denominator is $x^2 + y^2$ and the numerator was provided by the Gram matrix $Q$ computed by the sdp solver.

In fact, one does not even have to guess the denominator as we did here. The documentation of the Julia package SumOfSquares.jl demonstrates how to build a program that not only searches for the decomposition of the product but also for the denominator in one run, given some upper bound on the degree. Let's rephrase this: any polynomial inequality can - in principle - be automatically proven or disproven by means of semidefinite programming. If it is true, then the sdp solver will provide a proof in the shape of a Gram matrix, which is equivalent to an explicit sos decomposition, which, in turn, proves non-negativity. On the other hand, if the inequality is false, then the solver may return an infeasibility certificate as we have seen before.

As powerful as this sounds, there are two important caveats. For one, it is always important to be aware of possible accuracy issues, since interior point methods are implemented using numerical linear algebra routines. Moreover, the dimensions of such sdps explode very quickly if constructed naively. The monomial vector $v_{n,d}(x)$, for instance, will have length $\binom{n + d}{n}$, and an interior point solver itself will have rather high runtime costs, being a second-order method working on large matrices. The research communities around these methods have found various tricks to circumvent cost or trade it off with accuracy (e.g. by exploiting more structure of problem instances or using more specialized solvers).
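The blow-up is easy to quantify. A quick computation of $\binom{n+d}{n}$ for a few (illustrative, arbitrarily chosen) values of $n$ and $d$:

```python
from math import comb

# Length of the monomial vector v_{n,d}(x); the Gram matrix is a square
# psd variable of this size, so the sdp grows quadratically in it.
sizes = {(n, d): comb(n + d, n) for n, d in [(2, 3), (5, 5), (10, 5), (20, 4)]}

assert sizes[(2, 3)] == 10      # the Motzkin case: 10 monomials up to degree 3
assert sizes[(10, 5)] == 3003   # already a 3003 x 3003 Gram matrix
```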

Optimization over semialgebraic sets

In this section we will localize non-negativity to sets described by polynomial inequalities and formulate an effective procedure for solving optimization problems over them.

Let $g_1(x), \dots, g_m(x) \in \R[x]$ be arbitrary polynomials and let them define the basic semialgebraic set

K = \set{x \in \R^n}{g_1(x) \geq 0, \dots, g_m(x) \geq 0}.

Such sets $K$ may take a wide variety of shapes. A few basic compact examples: the closed unit ball ($1 - \|x\|^2 \geq 0$), the box $[0,1]^n$ ($x_1 \geq 0$, $1 - x_1 \geq 0$, ..., $x_n \geq 0$, $1 - x_n \geq 0$), or the unit sphere, described by the pair $1 - \|x\|^2 \geq 0$ and $\|x\|^2 - 1 \geq 0$.

Note that this last example illustrates how to achieve equality constraints as well: simply add the negative of the polynomial to the mix. We will assume throughout that $K$ is compact.

Next, consider a polynomial objective function $p \in \R[x]$ and consider the polynomial optimization problem (or pop for short)

\pmin = \inf_{x \in K} p(x).

Problems of this rather innocent-looking form encode a large variety of problems appearing in practice: for starters, note that this already includes the whole classes of linear and quadratically constrained programming (themselves classes of broad application) as well as that of non-linear programming in binary variables, which is of high interest to the field of combinatorial optimization. Researchers have applied the method we are going to discuss in domains like power systems design, dynamical systems analysis and quantum information theory, and to find sharp approximations to many famous NP-hard problems like MaxCut or StableSet.

While pops are hard, non-convex problems in general, we can reformulate them into a convex problem (albeit an infinite-dimensional one) by using the language of non-negativity. Define the new convex cone $\NN(K) = \set{p \in \R[x]}{p(x) \geq 0 \; \forall x \in K}$ (exercise: verify that this is really a convex cone). With it, reformulate

\pmin = \sup \set{\lambda \in \R}{p - \lambda \in \NN(K)}.

Indeed, let $x^* \in K$ be a minimizer in the first formulation (remember: $K$ was assumed to be compact) and $\lambda^*$ the maximizer of the second formulation. Then the polynomial $p - \lambda^*$ is non-negative on $K$, so $\lambda^* \leq p(x^*)$. If $p(x^*) - \lambda^*$ were strictly positive, we could increase $\lambda^*$, contradicting its optimality, and therefore $\pmin = p(x^*) = \lambda^*$.
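The equality can be illustrated numerically on a toy instance (my own example, checked on grids rather than solved exactly): $K = [-1, 2]$ and $p(x) = x^4 - 3x$.

```python
import numpy as np

# Grid over K = [-1, 2] and values of p(x) = x^4 - 3x on it.
xs = np.linspace(-1, 2, 10001)
p = xs**4 - 3 * xs
pmin = p.min()  # first formulation: direct minimization over K

# Second formulation: the largest lambda with p - lambda >= 0 on K,
# scanned over a grid of candidate lambda values.
lams = np.linspace(-5, 0, 5001)
feasible = lams[np.array([np.all(p - lam >= 0) for lam in lams])]

assert np.isclose(feasible.max(), pmin, atol=2e-3)
```

Both formulations agree up to the grid resolution; the point of the sos machinery is to replace the brute-force scan by a tractable certificate for $p - \lambda \geq 0$.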

How lucky for us that we already know very well how to relax non-negativity - let's go back to sums of squares.

Definition: The quadratic module associated to the polynomials $g_1, \dots, g_m$ is defined as

Q(g_1, \dots, g_m) = \set{\sigma_0 + \sum_{j = 1}^{m} \sigma_j g_j}{\sigma_0, \sigma_1, \dots, \sigma_m \in \SOS}.

For a given degree $2d$ we also define the truncated quadratic module

Q(g_1, \dots, g_m)_{2d} = \set{p \in Q(g_1, \dots, g_m)}{\deg(p) \leq 2d}.

The purpose is clear: any $p \in Q(g_1, \dots, g_m)$ is non-negative on $K$ since $g_i(x) \geq 0$ on $K$ for all $i = 1, \dots, m$. Thus, similar to how $\SOS$ relaxes $\NN$, the new convex cone $Q(g_1, \dots, g_m)$ relaxes $\NN(K)$.

Note that membership in the truncated quadratic module will only certify non-negativity on $K$ if all of the summands can appear. The truncation degree $2d$ must therefore be chosen greater than or equal to $\max(\deg(g_1), \dots, \deg(g_m))$.

Moreover, observe that we declared explicit dependence on the defining inequalities instead of $K$ itself. Different descriptions may result in different quadratic modules, but we can only write down an effective certificate if we know a description.

Now we are ready to define the sums-of-squares hierarchy:

\psos_d = \sup \set{\lambda \in \R}{p - \lambda \in Q(g_1, \dots, g_m)_{2d}}.

These define a sequence of convex, finite-dimensional optimization problems. As we increase the hierarchy level $d$, the search space grows and we find ourselves in a chain of inequalities

\psos_d \leq \psos_{d + 1} \leq \dots \leq \pmin.

The first immediate question is: will this converge? That is, will $\psos_d \to \pmin$ as $d \to \infty$? Lucky us again:

Theorem (Putinar's Positivstellensatz, 1993): Assume that $K$ is compact and the quadratic module $Q = Q(g_1, \dots, g_m)$ is Archimedean, i.e. $R - \|x\|^2 \in Q$ for some $R > 0$. If $p \in \NN(K)$, then $p + \varepsilon \in Q$ for all $\varepsilon > 0$.

The word Positivstellensatz was chosen in analogy to the German Nullstellensatz in algebraic geometry. A constructive and elementary proof of this result can be found in (Schweighofer, 2005).

Let us first have a look at the Archimedean condition, which is a technical assumption only and closely related to the discussion above: the quadratic module depends on the description of the set and not the set itself. All it says is that you need such a ball constraint in the module for guaranteed convergence. Practically, this is not a deal-breaker: if you already know (maybe approximately) the diameter and location of $K$, or if you have an a-priori bound on where the minimizers should be, it is enough to add this inequality without changing anything.

So how does this help us? Think of $\varepsilon$ as the sharpness of the bound. For any desired sharpness, Putinar tells us that we will be able to decompose a polynomial $p - \lambda$ into the form required for membership in the quadratic module as long as we are willing to wiggle $\lambda$ by $\varepsilon$. Therefore, as we go up in hierarchy level, i.e. in truncation degree, we will be able to tighten our bounds, i.e. to let $\varepsilon \to 0$, because we gradually enlarge the search space. This proves asymptotic convergence of the sos hierarchy.

Since we did not leave the framework of the last section, we are again able to use semidefinite programming to solve such optimization problems. Let's take the Motzkin polynomial $M(x,y)$ from the previous section again and try to minimize it on a non-centered ball. In other words, we attempt to solve the problem of maximizing a scalar $\lambda$ such that the linear coefficient matching system

M(x,y) - \lambda = \sigma_0 + \sigma_1 (r^2 - (x - x_0)^2 - (y - y_0)^2)

is satisfied and such that the two polynomials $\sigma_0$ and $\sigma_1$ are sos, i.e. can be described by two psd matrix variables. Before, we were dealing only with feasibility; now we also need to optimize a linear function at the same time. Also, note that the first hierarchy level that makes sense here is $d = 3$ - anything below and the coefficient matching system would be over-determined. This problem is implemented in the experiment motzkin_minimize.py; check out the code to see how to hand these data to a solver. Here is the result:

This converges to the minimum (or rather to numerical precision) quite fast. Is the Archimedean condition satisfied here?

Duality

As so often in mathematics, taking a closer look at the dual of a given problem/concept/space/... can be very fruitful. In this section we are going to examine exactly the same problem, i.e. the polynomial optimization problem

\pmin = \inf_{x \in K} p(x)

for the compact semialgebraic set $K = \{g_1 \geq 0, \dots, g_m \geq 0\}$, but we choose a different road to arrive at the hierarchies of lower bounds. This will lead us naturally to the dual theory of moments.

Consider the following reformulation: let us denote by $\operatorname{Prob}(K)$ the convex (!) set of all probability measures $\mu$ supported on $K$ (all mass lies inside $K$, or equivalently $\mu(\R^n \setminus K) = 0$). Then,

\pmin = \inf_{\mu \in \operatorname{Prob}(K)} \int p(x) \, d\mu(x).

At first glance, this might seem a bit strange, but let us quickly verify that it is in fact true: on one hand, the Dirac measure $\delta_x$ of any $x \in K$ is feasible and yields $\int p \, d\delta_x = p(x)$, so the infimum over measures is at most $\pmin$. On the other hand, $p \geq \pmin$ on $K$, so $\int p \, d\mu \geq \pmin \int d\mu = \pmin$ for any $\mu \in \operatorname{Prob}(K)$.

The language that we are going to use for these last two sections will be introduced now for yet another reformulation of this integral expression.

Definition: The moment cone $\MOM$ is the set of all linear functionals $L \colon \R[x] \to \R$ on the polynomial ring for which there is a measure $\mu$ such that

L(p) = \int p(x) \, d\mu(x).

Similarly, for a set $K \subseteq \R^n$ we define the localized moment cone $\MOM(K)$ as the set of all linear functionals on $\R[x]$ that are integration w.r.t. a measure supported on $K$.

Note that we did not require these measures to be probability measures. The reason is that $\MOM$ would fail to be a cone if we did (verify this for yourself!). That said, if $L$ is such a functional, then requiring it to be integration w.r.t. some probability measure is as simple as enforcing $L(1) = 1$.
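A concrete member of $\MOM(K)$ is easy to build by hand: integration against a discrete measure. The sketch below (my own toy example; the support points are arbitrary assumptions) shows the normalization $L(1) = 1$ and the positivity of $L$ on squares:

```python
import numpy as np

# mu = 0.5 * delta_a + 0.5 * delta_b, a discrete probability measure on R^2.
a, b = np.array([0.5, 0.0]), np.array([-0.5, 0.5])

def L(poly):
    """Integrate a polynomial function R^2 -> R against mu."""
    return 0.5 * poly(a) + 0.5 * poly(b)

q = lambda x: (x[0] + 2 * x[1]) ** 2  # a square, hence non-negative

assert L(lambda x: 1.0) == 1.0  # probability normalization L(1) = 1
assert L(q) >= 0                # positive functional on squares
```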

In this notation, we may write

\pmin = \inf\set{L(p)}{L \in \MOM(K),\ L(1) = 1}.

We now present the key theorem that establishes duality. Let us denote the dual of a vector space $V$ by $V^*$. If $C \subseteq V$ is any cone inside $V$, then we define the dual cone as $C^* = \set{L \in V^*}{L(c) \geq 0 \; \forall c \in C}$.

Theorem (Haviland, 1936): If $K \subseteq \R^n$ is a closed subset, then the moment cone is the dual of the cone of non-negative polynomials, i.e.

\MOM(K) = \left(\NN(K)\right)^* = \set{L \in \R[x]^*}{L(p) \geq 0 \; \forall p \in \NN(K)}.

Functionals that preserve positivity like this are sometimes called positive functionals. One direction is quite simple: if $L$ is integration w.r.t. some measure, then one verifies immediately that non-negative polynomials map to non-negative values. The other direction is a bit more involved; here is the outline: first, extend $L$ to a positive functional on the space of smooth functions that grow at most polynomially. There, apply the measure-theoretic variant of Riesz' representation theorem for the space of compactly supported functions (as presented, for instance, in (Folland, 1999)) to assert the existence of a measure. Lastly, show that the representation still holds for smooth polynomially growing functions - in particular for polynomials themselves. The detailed proof can be found in Section 4.6 of (Laurent, 2010). Note also that $K$ is merely assumed to be closed; for our purposes it would be enough to have the theorem in the compact case.

Now that we have established duality with the cone of non-negative polynomials, we also realize that testing membership is at least as hard as for $\NN$. We will therefore introduce a relaxation as the dual to the sums of squares cone and quadratic modules:

Definition: The pseudo-moment cone $\PMOM$ is defined as the conic dual to $\SOS$, i.e.

\PMOM = \SOS^* = \set{L \in \R[x]^*}{L(\sigma) \geq 0 \; \forall \sigma \in \SOS}.

Analogously, for polynomials $g_1, \dots, g_m \in \R[x]$ defining a basic semialgebraic set $K = \{g_1 \geq 0, \dots, g_m \geq 0\}$ we define the localized pseudo-moment cone as

\PMOM(g_1, \dots, g_m) = Q(g_1, \dots, g_m)^* = \set{L \in \R[x]^*}{L(q) \geq 0 \; \forall q \in Q(g_1, \dots, g_m)}.

The prefix "pseudo" is justified: while the sos cone $\SOS \subset \NN$ was an inner approximation, we are now in the situation of an outer approximation $\PMOM \supset \MOM$, i.e. not all pseudo-moment functionals come from a measure. Indeed, being non-negative on all sos polynomials is a weaker requirement than being non-negative on all non-negative polynomials. Moreover, as we will detail in the next section, membership is again tractable in the sense of semidefinite programming.

As before, we now have all the ingredients together to define the dual lower bound on $\pmin$:

\pmom = \inf\set{L(p)}{L \in \PMOM(g_1, \dots, g_m),\ L(1) = 1} \leq \pmin.

This is a lower bound simply because we enlarge the feasible set.

Warning: Both the inner approximation in the sos case and the outer approximation in the moment case led to lower bounds. Keep in mind that the former used a little trick which transformed the problem into maximization, while the moment side stayed a minimization problem. For this reason, the sos side is usually referred to in the literature as the dual and the moment side as the primal - this is somewhat opposite to the flow of this introduction to the Moment-SOS hierarchy. Have a look at one of the original expositions in [(Lasserre, 2001)](https://doi.org/10.1137/S1052623400366802) to verify this yourself.

We may also introduce the respective truncated (pseudo-)moment cones $\MOM_{2d}$, $\PMOM_{2d}$, $\MOM(K)_{2d}$ and $\PMOM(g_1, \dots, g_m)_{2d}$ in the expected way: instead of considering functionals on the whole polynomial ring, we only look at $\R[x]_{2d}$. This leads to the (completely analogous) hierarchy of finite-dimensional convex optimization problems

\pmom_d = \inf\set{L(p)}{L \in \PMOM(g_1, \dots, g_m)_{2d},\ L(1) = 1}

with the chain of inequalities $\pmom_d \leq \pmom_{d + 1} \leq \dots \leq \pmom \leq \pmin$. It should not come as a surprise now that the bounds $\psos_d$ and $\pmom_d$ are tightly linked:

Proposition (Weak duality): For $d$ large enough (for the optimization problems to be well-defined) we have $\psos_d \leq \pmom_d$ and $\psos \leq \pmom$.

Proof: Let $\lambda$ be a feasible solution for the optimization problem $\psos$: there exists some $q \in Q(g_1, \dots, g_m)$ such that $p - \lambda = q$. If $L$ is any feasible solution for the optimization problem $\pmom$ (i.e. $L$ is non-negative on the quadratic module generated by $g_1, \dots, g_m$ and $L(1) = 1$), then in particular

L(p) - \lambda = L(p) - \lambda L(1) = L(p - \lambda) = L(q) \geq 0.

Thus, $\lambda \leq L(p)$. Because the feasible solutions of both problems were arbitrary, the inequality carries over to the optima. $\square$

Under certain conditions, often encountered in practice, the two problems are also strongly dual (i.e. $\psos = \pmom$). One such sufficient condition is that the semialgebraic set $K \subseteq \R^n$ has non-empty interior (see Theorem 6.1 in (Laurent, 2010)).

The weak duality result has a strong consequence: recall that Putinar's Positivstellensatz established that $\pmin = \psos$ in the case of compact $K$ with an Archimedean description. Since $\pmin = \psos \leq \pmom \leq \pmin$, it also shows asymptotic convergence of the moment hierarchy in that case.

Moment matrices and optimizers

The purpose of this final section is to show how to pass the data of the moment relaxation at some degree truncation to a solver. Afterwards, we answer an important question: how can we possibly know at which degree $d$ we can stop going up in the hierarchy? Under such circumstances it will also be possible to use the solver outputs to extract concrete minimizers.

First, let us show that the levels of the moment hierarchy are again semidefinite programs. In the following proposition we may easily replace the infinite-dimensional localized pseudo-moment cone by a truncation, with the same result.

Proposition: Extend the description of the semialgebraic set by the constant polynomial $g_0 = 1$ (this does not change the set, as trivially $1 \geq 0$). To a linear functional $L \in \R[x]^*$ and for each $j = 0, 1, \dots, m$ we associate the bilinear form

B^L_j \colon \R[x] \times \R[x] \to \R, \quad (p_1, p_2) \mapsto L(g_j p_1 p_2).

Then, a functional $L \in \R[x]^*$ is a member of the localized pseudo-moment cone if and only if all associated bilinear forms $B^L_j$ are psd ($j = 0, 1, \dots, m$).

This is the analogue of the positive semidefinite description of the sos cone and of the truncated module.

Proof: If $L \in \PMOM(g_1, \dots, g_m)$, then it is non-negative on the quadratic module. In particular, it is non-negative on each polynomial of the form $g_j p^2$ (choose the zero polynomial for all other terms of a generic member of the quadratic module). But then $B^L_j(p, p) \geq 0$ for any polynomial $p \in \R[x]$, so $B^L_j$ is psd. Conversely, let $L \in \R[x]^*$ be a linear functional such that all bilinear forms $B^L_j$ are psd. Let $q = g_0 p_0^2 + g_1 p_1^2 + \dots + g_m p_m^2$ be an arbitrary element of the quadratic module. Thus,

L(q) = B^L_0(p_0, p_0) + B^L_1(p_1, p_1) + \dots + B^L_m(p_m, p_m) \geq 0,

so $L$ is a localized pseudo-moment functional. $\square$

Let us now pass to the truncated case. Here, of course, we would like to talk about concrete matrices instead of bilinear forms or functionals, so that we can pass them numerically to a solver. We will need to choose a basis. For simplicity, and because it is standard, we will again choose the monomial basis truncated at some degree $2d$. For the polynomial ring in $n$ variables $x = (x_1, \dots, x_n)$, the (unsorted) monomial basis in multi-index notation is

\set{x^\alpha}{\alpha \in \Z_{\geq 0}^n,\; |\alpha| \leq 2d}.

Note that any functional $L \in \left(\R[x]_{2d}\right)^*$ on the truncated polynomial ring evaluated at a polynomial $h$ can be expressed as

L(h) = L_y(h) = \sum_{|\alpha| \leq 2d} y_\alpha h_\alpha = y^T h,

where $h_\alpha$ is the corresponding coefficient of $h$ and $y$ is now a vector of size equal to that of the basis. In the last expression, we simply interpret $h$ as a vector in the monomial basis as a shorthand notation. Thus, we will from now on speak about (pseudo-)moment vectors $y$ instead of (pseudo-)moment functionals $L$.

In the literature, the following notation for the matrices of the bilinear forms from the last proposition is very common. Starting with $j = 0$, i.e. the trivial addition $g_0 = 1$, the moment matrix corresponding to the bilinear form $B^L_0$ associated to $L_y$ is defined as the matrix $M_d(y)$ with entries indexed by multi-indices $|\alpha|, |\beta| \leq d$:

M_d(y)_{\alpha, \beta} = L_y(x^\alpha x^\beta) = y_{\alpha + \beta}.

This might seem a bit strange at first glance; it's always good to see an example. Let's consider the case $n = 2$ and $d = 2$. Let $y$ be a moment vector with one entry for each monomial in two variables up to degree $2d = 4$. Then, the associated moment matrix is

M_2(y) = \left[\begin{array}{c:cc:ccc} y_{00} & y_{10} & y_{01} & y_{20} & y_{11} & y_{02} \\ \hdashline y_{10} & y_{20} & y_{11} & y_{30} & y_{21} & y_{12} \\ y_{01} & y_{11} & y_{02} & y_{21} & y_{12} & y_{03} \\ \hdashline y_{20} & y_{30} & y_{21} & y_{40} & y_{31} & y_{22} \\ y_{11} & y_{21} & y_{12} & y_{31} & y_{22} & y_{13} \\ y_{02} & y_{12} & y_{03} & y_{22} & y_{13} & y_{04} \end{array}\right].

Note that you can find $M_1(y)$ as the upper left $(3 \times 3)$-submatrix. It's a good exercise to write out a moment matrix like this for yourself (e.g. try the case $n = 3$ and $d = 1$).
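The indexing rule $M_d(y)_{\alpha,\beta} = y_{\alpha+\beta}$ is also easy to code up. The sketch below (my own illustration; the sample points are arbitrary assumptions) builds $M_2(y)$ for $n = 2$ from the moments of an actual measure, in which case the moment matrix $M_d(y) = \int v\,v^T d\mu$ must be psd:

```python
import numpy as np
from math import comb

n, d = 2, 2

# Multi-indices (a1, a2) with a1 + a2 <= d, in the graded order used in
# the example above: (0,0); (1,0), (0,1); (2,0), (1,1), (0,2).
basis = []
for total in range(d + 1):
    for a1 in range(total, -1, -1):
        basis.append((a1, total - a1))
assert len(basis) == comb(n + d, n)

# Moments of the uniform measure on four sample points.
points = [(1.0, 0.0), (-1.0, 0.5), (0.3, 0.3), (0.0, -1.0)]
def moment(alpha):
    return np.mean([x**alpha[0] * y**alpha[1] for x, y in points])

# M_d(y)[a, b] = y_{a + b}, entrywise addition of multi-indices.
M = np.array([[moment((a[0] + b[0], a[1] + b[1])) for b in basis]
              for a in basis])

assert np.allclose(M, M.T)                      # symmetric by construction
assert np.linalg.eigvalsh(M).min() >= -1e-12    # psd: y comes from a measure
```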

For $j > 0$, we need to shift and add moments according to the coefficients of $g_j$. For multi-indices $|\alpha|, |\beta| \leq d$ we define the localizing (moment) matrix

M_d(g_j y)_{\alpha, \beta} = L_y(g_j x^\alpha x^\beta) = \sum_{\gamma} \left( g_j \right)_\gamma y_{\alpha + \beta + \gamma}.

As an example, let us consider a centered ball constraint (of the kind required by Putinar's theorem) with $n = 2$ and $d = 1$, so $g_1(x) = 4 - \|x\|^2 = 4 - x_1^2 - x_2^2$. Under this setup the localizing matrix is

M_1(g_1 y) = \left[\begin{array}{c:cc} 4 y_{00} - y_{20} - y_{02} & 4 y_{10} - y_{30} - y_{12} & 4 y_{01} - y_{21} - y_{03} \\ \hdashline 4 y_{10} - y_{30} - y_{12} & 4 y_{20} - y_{40} - y_{22} & 4 y_{11} - y_{31} - y_{13} \\ 4 y_{01} - y_{21} - y_{03} & 4 y_{11} - y_{31} - y_{13} & 4 y_{02} - y_{22} - y_{04} \end{array}\right].

This is exactly the moment matrix $M_1(y)$ in which each entry has been combined from shifted moments according to the coefficients and exponents of the constraint.
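This matrix, too, can be assembled programmatically. The following sketch (my own toy check; the support points are arbitrary assumptions) builds $M_1(g_1 y)$ for $g_1(x) = 4 - x_1^2 - x_2^2$ from the moments of a measure supported inside the ball of radius 2, where it must come out psd:

```python
import numpy as np

# Moments of the uniform measure on three points inside the ball ||x|| < 2.
points = [(1.0, 0.5), (-0.8, 1.1), (0.0, -1.5)]
def y(a1, a2):
    return np.mean([u**a1 * v**a2 for u, v in points])

basis = [(0, 0), (1, 0), (0, 1)]  # monomials 1, x1, x2 for degree d = 1

def entry(a, b):
    # M_1(g1 y)[a, b] = 4*y_{a+b} - y_{a+b+(2,0)} - y_{a+b+(0,2)}
    s = (a[0] + b[0], a[1] + b[1])
    return 4 * y(*s) - y(s[0] + 2, s[1]) - y(s[0], s[1] + 2)

M = np.array([[entry(a, b) for b in basis] for a in basis])

assert np.allclose(M, M.T)
assert np.linalg.eigvalsh(M).min() >= -1e-12  # psd since g1 >= 0 on the support
```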

In this new matrix notation we can reformulate $\pmom$ into a version that can be passed to solvers. Let's make this a bit more explicit than previously, to figure out which degrees we really need (compare to the degree constraints on the individual terms of the quadratic module). Write

d0=degg02=0,d1=degg12,,dm=deggm2d_0 = \left\lceil \frac{\deg g_0}{2}\right\rceil = 0, d_1 = \left\lceil \frac{\deg g_1}{2} \right\rceil, \dots, d_m = \left\lceil \frac{\deg g_m}{2} \right\rceil

to denote the rounded-up half-degrees of all polynomials that define the semialgebraic set KK. The minimum degree at which a moment (or sos) relaxation makes sense will be denoted by dmin=max{degp/2,d0,d1,,dm}.d_{\text{min}} = \max\{ \lceil\deg p/2 \rceil, d_0, d_1, \dots, d_m \}. Then, for any hierarchy level ddmind \geq d_{\text{min}} and letting N=dimR[x]2dN = \dim \R[x]_{2d}, we arrive at our final formulation

pdmom=inf{yTp   ⁣:  yRN,y0=1,Mddj(gjy)0,j=0,,m},\pmom_d = \inf\set{y^T p}{y \in \R^N, y_0 = 1, M_{d - d_j}(g_j y) \succeq 0, j = 0, \dots, m },

where we used the shorthand A0A \succeq 0 for a matrix to be psd. Note that Mdd0(g0y)=Md(y)M_{d - d_0}(g_0 y) = M_d (y). This is now precise enough to be passed to a solver. Check out the first part of the code for the experiment motzkin_moment.py to see this in action.
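As a sanity check before calling a solver, one can verify that the moment vector of any Dirac measure δx\delta_x with xKx \in K is feasible for this problem: in that case Mddj(gjy)=gj(x)v(x)v(x)TM_{d - d_j}(g_j y) = g_j(x)\, v(x) v(x)^T, where v(x)v(x) is the vector of monomials up to the matching degree, and this rank-one matrix is psd precisely because gj(x)0g_j(x) \geq 0. A pure-Python sketch (helper names are invented for illustration, not taken from the repo):

```python
from itertools import product

def monomial_basis(n, d):
    idx = [a for a in product(range(d + 1), repeat=n) if sum(a) <= d]
    idx.sort(key=lambda a: (sum(a), a[::-1]))
    return idx

def dirac_moments(x, two_d):
    """Moments y_alpha = x^alpha of the Dirac measure at x, for |alpha| <= two_d."""
    y = {}
    for a in monomial_basis(len(x), two_d):
        v = 1.0
        for xi, ai in zip(x, a):
            v *= xi ** ai
        y[a] = v
    return y

def localizing(y, g, n, d):
    """Numerical localizing matrix M_d(g y); g = 1 gives the moment matrix."""
    basis = monomial_basis(n, d)
    return [[sum(c * y[tuple(p + q + r for p, q, r in zip(al, be, gam))]
                 for gam, c in g.items())
             for be in basis] for al in basis]

def is_psd(M, tol=1e-9):
    """LDL-style pivot test; adequate for small, well-scaled symmetric matrices."""
    A = [row[:] for row in M]
    for k in range(len(A)):
        if A[k][k] < -tol:
            return False
        if A[k][k] <= tol:  # psd with a zero pivot forces a zero column
            if any(abs(A[i][k]) > tol for i in range(k + 1, len(A))):
                return False
            continue
        for i in range(k + 1, len(A)):
            f = A[i][k] / A[k][k]
            for j in range(k, len(A)):
                A[i][j] -= f * A[k][j]
    return True

d = 3
x = (1.0, -1.0)                              # a global minimizer of the Motzkin polynomial
g1 = {(0, 0): 4, (2, 0): -1, (0, 2): -1}     # ball constraint: g1(x) = 2 >= 0 here
y = dirac_moments(x, 2 * d)
assert y[(0, 0)] == 1.0                      # y_0 = 1: a probability measure
assert is_psd(localizing(y, {(0, 0): 1}, 2, d))      # moment matrix M_d(y)
assert is_psd(localizing(y, g1, 2, d - 1))           # localizing matrix
```

The actual relaxation then treats yy as the unknown of a semidefinite program over exactly these matrices.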

We finish this introduction with a short discussion of the question: at which degree dd can I stop going up in the hierarchy? When have I reached the optimum? This is a very relevant question, as the dimension of R[x]2d\R[x]_{2d} (and with it the size of our moment matrices) grows rapidly, leading to numerical difficulties. See Sections 6.6 and 6.7 in (Laurent, 2010) for a detailed discussion.
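To make "rapidly" concrete: the number of monomials of degree at most kk in nn variables is (n+kn)\binom{n+k}{n}, so the pseudo-moment vector at level dd has length (n+2dn)\binom{n+2d}{n}:

```python
from math import comb

def dim_poly(n, k):
    """Number of monomials of degree <= k in n variables."""
    return comb(n + k, n)

# Length of the pseudo-moment vector (= dim R[x]_{2d}) for d = 1, ..., 5:
for n in (2, 3, 5):
    print(n, [dim_poly(n, 2 * d) for d in range(1, 6)])
# 2 [6, 15, 28, 45, 66]
# 3 [10, 35, 84, 165, 286]
# 5 [21, 126, 462, 1287, 3003]
```

Already at five variables the matrices become sizable after a few levels.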

Here is a first simple criterion. Denote by e1,,ene_1, \dots, e_n the multi-indices corresponding to the monomials x1,,xnx_1, \dots, x_n.

Proposition: Let yy be an optimal solution for the problem pdmom\pmom_d. Define x=(ye1,,yen)Rn\overline{x} = (y_{e_1}, \dots, y_{e_n}) \in \R^n. If xK\overline x \in K and p(x)=pdmomp(\overline x) = \pmom_d, then pmin=pdmom\pmin = \pmom_d and x\overline x is a minimizer of the original problem, i.e. p(x)=pminp(\overline x) = \pmin.

Here is some intuition for this result. If yy were given by integration w.r.t. a probability measure μ\mu, then one could think of μ\mu as being concentrated tightly around a true minimizer (close to a Dirac at that point), so that the moments yei=xidμ(x)y_{e_i} = \int x_i \, d\mu(x) are close to the coordinates of that minimizer.

Proof: Since xK\overline x \in K, we certainly have pminp(x)\pmin \leq p(\overline x). But by assumption we also have p(x)=pdmompminp(\overline x) = \pmom_d \leq \pmin since the hierarchy produces lower bounds. Combining the two inequalities forces equality throughout. \square

While this is generally a rare situation to be in, there are heuristic arguments that it has a good chance of holding in the case of a unique global minimizer of the original problem (cf. the sections of the survey mentioned above). We perform this test in the experiment motzkin_moment.py. As before, we attempt to minimize the Motzkin polynomial on a small ball around one of its minimizers at (1,1)(1, -1), where it attains the value 00. At each hierarchy level we log the lower bound found and check numerically whether the criteria of the last proposition are satisfied. Here is the result on my system:

$ motzkin-moment

hierarchy level d = 6
  -> solver status: optimal
  -> pmom_6 = -1.0611336720423026e-06
  -> xbar = [0.9300696568198683, -0.923372680505846]
  -> constraint value g1(xbar): 0.0507619989901501 (feasible)
  -> |p(xbar) - pmom_6| = 0.05421546272582045
hierarchy level d = 7
  -> solver status: optimal
  -> pmom_7 = 1.7405777708034975e-08
  -> xbar = [0.9193142154224452, -0.9144839411955861]
  -> constraint value g1(xbar): 0.053823192146335685 (feasible)
  -> |p(xbar) - pmom_7| = 0.0680625302440645
hierarchy level d = 8
  -> solver status: optimal
  -> pmom_8 = 2.7041787764581215e-07
  -> xbar = [0.9400211494212625, -0.9412264389430985]
  -> constraint value g1(xbar): 0.04705179399605586 (feasible)
  -> |p(xbar) - pmom_8| = 0.03677300078418422
hierarchy level d = 9
  -> solver status: optimal
  -> pmom_9 = 3.097746410496427e-07
  -> xbar = [0.9636166109044235, -0.9623476028476482]
  -> constraint value g1(xbar): 0.04274145401339868 (feasible)
  -> |p(xbar) - pmom_9| = 0.015074659455620143
hierarchy level d = 10
  -> solver status: optimal
  -> pmom_10 = 1.2811140037705115e-07
  -> xbar = [0.9690466266312561, -0.9692733473958901]
  -> constraint value g1(xbar): 0.04190223850315855 (feasible)
  -> |p(xbar) - pmom_10| = 0.010615989938524306
hierarchy level d = 11
  -> solver status: optimal
  -> pmom_11 = 3.0463441458294938e-06
  -> xbar = [0.8830502005680481, -0.8842745049144566]
  -> constraint value g1(xbar): 0.06706964579996799 (feasible)
  -> |p(xbar) - pmom_11| = 0.12301890909944158
hierarchy level d = 12
  -> solver status: optimal
  -> pmom_12 = 8.13755707262942e-07
  -> xbar = [0.956053920985409, -0.9587413539555535]
  -> constraint value g1(xbar): 0.043633533734177554 (feasible)
  -> |p(xbar) - pmom_12| = 0.019707880115166
hierarchy level d = 13
  -> solver status: optimal
  -> pmom_13 = 8.671382947245121e-07
  -> xbar = [0.97056788804639, -0.9704954375080612]
  -> constraint value g1(xbar): 0.04173676842189067 (feasible)
  -> |p(xbar) - pmom_13| = 0.009723313378852128

As it turns out, we do indeed approach the true (and unique!) minimizer but we do so very, very slowly. Note that although the lower bound already hits numerical precision, we cannot apply the result yet. Intuitively, increasing the hierarchy level brings the pseudo-moment vector closer to the true infinite moment vector of the optimal measure, but we need a lot more information to produce minimizers than to get tight lower bounds.
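The logged quantities are easy to reproduce by hand. For instance, plugging the candidate point xbar reported at level d=13d = 13 into the Motzkin polynomial recovers the reported gap (the numerical values below are copied verbatim from the log above):

```python
def motzkin(x1, x2):
    """The Motzkin polynomial x1^4 x2^2 + x1^2 x2^4 - 3 x1^2 x2^2 + 1."""
    return x1**4 * x2**2 + x1**2 * x2**4 - 3 * x1**2 * x2**2 + 1

assert motzkin(1.0, -1.0) == 0.0             # the true minimum

# values copied from the d = 13 block of the log
pmom_13 = 8.671382947245121e-07
xbar = (0.97056788804639, -0.9704954375080612)
gap = abs(motzkin(*xbar) - pmom_13)
assert abs(gap - 0.009723313378852128) < 1e-8
```

So even though the lower bound is already essentially exact, the candidate point still misses the true minimum value by about 10210^{-2}.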

There is a more general theorem which includes this strategy as a special case but has broader applicability. Loosely speaking, one is able to extract minimizers if the sequence of ranks of the moment matrices stabilizes along the hierarchy.

Theorem (Rank stabilization): Let dK=max{d0,,dm}d_K = \max\{ d_0, \dots, d_m \} be the maximum rounded up half-degree of the polynomials that define KK. For ddmind \geq d_{\text{min}} let yy be an optimal solution to pdmom\pmom_d. Assume that there is some ss with dminsdd_{\text{min}} \leq s \leq d for which

r=rankMs(y)=rankMsdK(y).r = \rank M_{s}(y) = \rank M_{s - d_K} (y).

In this case, the associated functional LyL_y is integration w.r.t. an rr-atomic probability measure

μ=i=1rλiδxi,\mu = \sum_{i=1}^r \lambda_i \delta_{x^i},

which is a convex combination of Dirac deltas. Moreover, the points xiKx^i \in K are global minimizers of the pop, i.e. p(xi)=pminp(x^i) = \pmin.

Without going into detail, it is possible to extract the minimizers from such a setting by means of numerical linear algebra.
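To see what the theorem is looking at, here is a small pure-Python sketch (illustrative only, not from the repo) that computes the ranks of the moment matrices of a genuine 2-atomic measure, namely μ=12δ(1,1)+12δ(1,1)\mu = \frac{1}{2}\delta_{(1,-1)} + \frac{1}{2}\delta_{(-1,1)} — both atoms are global minimizers of the Motzkin polynomial. The ranks stabilize at r=2r = 2 immediately:

```python
from itertools import product
from math import prod

def monomial_basis(n, d):
    idx = [a for a in product(range(d + 1), repeat=n) if sum(a) <= d]
    idx.sort(key=lambda a: (sum(a), a[::-1]))
    return idx

def moment_matrix(atoms, weights, d):
    """Numerical moment matrix of the measure sum_i weights[i] * delta_{atoms[i]}."""
    basis = monomial_basis(len(atoms[0]), d)
    def mom(a):
        return sum(w * prod(xi ** ai for xi, ai in zip(x, a))
                   for x, w in zip(atoms, weights))
    return [[mom(tuple(p + q for p, q in zip(al, be))) for be in basis]
            for al in basis]

def rank(M, tol=1e-8):
    """Rank via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    r = 0
    for col in range(len(A[0])):
        if r == len(A):
            break
        piv = max(range(r, len(A)), key=lambda i: abs(A[i][col]))
        if abs(A[piv][col]) < tol:
            continue
        A[r], A[piv] = A[piv], A[r]
        for i in range(r + 1, len(A)):
            f = A[i][col] / A[r][col]
            for j in range(col, len(A)):
                A[i][j] -= f * A[r][j]
        r += 1
    return r

atoms, weights = [(1.0, -1.0), (-1.0, 1.0)], [0.5, 0.5]
ranks = [rank(moment_matrix(atoms, weights, s)) for s in (1, 2, 3)]
assert ranks == [2, 2, 2]
```

With the ball constraint we have dK=1d_K = 1, so the rank condition rankMs(y)=rankMsdK(y)=2\rank M_s(y) = \rank M_{s - d_K}(y) = 2 already holds at s=2s = 2, and the theorem would let us recover the two atoms.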

These results highlight the advantages of considering both the sos and the moment side of the hierarchy: We've seen that the sos side gives us algebraic proofs for a given bound by producing an explicit element of the quadratic module. The moment side on the other hand is closer in nature to distributions on the set KK itself and as such might give us access to stopping criteria and concrete minimizers.