\( \min_{x} f(x)\) More...

Collaboration diagram for Optimisation methods:

Functions
template<class Gradient , class InvHessian , class ContainerType >
unsigned	dg::mat::newton (Gradient grad, InvHessian invhess, ContainerType &x0, double tol=1e-5, unsigned max_iter=1000)
	Newton iteration.

template<class Func , class Jacobian , class ContainerType0 , class ContainerType1 >
unsigned	dg::mat::levenberg_marquardt (Func fun, Jacobian jac, ContainerType0 &x0, const ContainerType1 &copyable, double tol=1e-8, unsigned max_iter=1000)
	The Levenberg Marquardt algorithm.

Detailed Description

\( \min_{x} f(x)\)

Function Documentation

◆ levenberg_marquardt()

template<class Func , class Jacobian , class ContainerType0 , class ContainerType1 >

unsigned dg::mat::levenberg_marquardt	(	Func	fun,
		Jacobian	jac,
		ContainerType0 &	x0,
		const ContainerType1 &	copyable,
		double	tol = 1e-8,
		unsigned	max_iter = 1000 )

The Levenberg Marquardt algorithm.

Template Parameters

Func	Callable with signature `void operator()(ContainerType0& x0, ContainerType1& rs)`
Jacobian	Callable with signature `void operator()(const ContainerType0& x0, std::vector<ContainerType1>& jacs)` (no need to resize the given jacs, its size is `x0.size()`)
ContainerType0	Typically a host container (Determines the 1st argument type of `Func` and `Jacobian`)
ContainerType1	Determine the target architecture to run the algorithm on (and the 2nd argument type of `Func` and `Jacobian`)

Parameters

fun	Compute the vector of residuals \( r_j\) given parameters `x`
jac	Compute the gradients of the residuals: each element i of the outer `std::vector` contains the gradient \( \partial \vec r/ \partial x_i\)
x0	The initial guess on input, solution on output. The size of `x0` determines the number of free parameters \(n \) to optimize.
copyable	Determine the size \(m\) and type of the second arguments of `fun` and `jac` The contents are irrelevant, just the size and type are important.
tol	Tolerance determines termination condition `if( \|\|p_k\|\| <= tol*(\|\|x_k\|\| + 1.))`
max_iter	Maximum number of iterations

Returns: Number of iterations

Trust region algorithms

A line search algorithm first chooses an appropriate descent direction and then tries to find the optimal step length. A trust region algorithm first chooses a (maximum) step length and then tries to find the direction that minimizes the function. Both algorithms (at least of Newton type) assume a Taylor expansion of the function

\[ f(x_k + p) \approx f(x_k) + g_k^T p + \frac{1}{2} p^T H_k p \equiv m_k(p) \]

A trust region algorithm is an algorithm that minimizes \(m_k(p)\) over \(p\) with the condition that \(||p|| \leq \Delta_k\) where \(\Delta_k\) is the trust region radius:

\[ \begin{align} p &= \text{arg min}_p m_k(p) \\ ||p || &\leq \Delta_k \end{align} \]

It turns out somewhat surprisingly that even though \(m_k(p)\) is a quadratic function a closed solution is somewhat non-trivial (at least the literature immediately goes on to only finding approximate solutions)

Let us first state an important result: The trust region minimization has no solution \(p\) with \(||p|| = \Delta_k\) if and only if \(H\) is positive definite and \(|| H_k^{-1} g_k|| < \Delta_k\). This result means we can focus on the case were the minimization has a solution \(||p|| = \Delta_k\).

Least squares problem

Assume our objective function (the minimum of which we seek) has the form

\[ f( x) = \frac{1}{2}\sum_{j=1}^m r_j(x)^2 \]

where \(r_j(x)\) (the residuals) are smooth functions and \( x\) are the parameters (of the fit function) and we assume \(m>n\) with \(n\) the number of parameters (or dimension of \(x\)). Typically, this appears if \( f\) is an error function and \( x\) are a set of \(n\) parameters that we try to optimize to reduce the residuals at various test points j. In the literature minimization of this form is treated under nonlinear least squares problems (because the \( r_j\) depend nonlinearly on the parameters \(x\), if the residuals depend linearly on the parameters then the problem is a linear least squares problem that can be solved straighforwardly).

See also: dg::least_squares

Let us define

\[ J = \frac{\partial r_j}{\partial x_i} \]

Then (with \(r(x) = (r_1(x), r_2(x) , ...)\))

\[ \begin{align} g(x) =& \nabla f(x) = J^T r(x)\\ H(x) =& J^T J + \sum_{j=1}^m r_j(x) \nabla\nabla r_j(x) \end{align} \]

Often, the second term in the Hessian \(H(x)\) can be neglected, which is the distinctive feature of least squares problems.

In fact, Newton's method where \(H\) is replaced with \(J^T J\) is called Gauss-Newton method and the search direction is the solution to the linear least squares problem (at each step \(k\))

\[ \text{min}_p || J p + r||^2 \]

Of course, in practice Gauss-Newton still needs to be combined with a line-search method.

Levenberg-Marquardt algorithm (for nonlinear least squares problems)

The LM algorith is a trust region algorithm where the Hessian is replaced by \(H\approx J^T J\) and the Jacobian is the \(m\times n\) derivatives of the \(r_i(x)\). Doing so, all Eigenvalues are \(\lambda_i \geq 0\).

The way to solve a constraint minimization problem (the trust region minimization above) is through Lagrange multipliers

\[ L(p, \lambda) = r^T J p + \frac{1}{2} p^T J^T J p + \lambda( ||p||_W^2 - \Delta^2)/2 \simeq ||Jp +r||^2/2 + \lambda( ||p||_W^2 - \Delta^2)/2 \]

The Euler Lagrange equations read

\[ \begin{align} J^T r + J^T J p + \lambda W p =& 0 \\ p^T W p =& \Delta^2 \end{align} \]

The proposed search direction is therefore

\[ p = -( J^T J + \lambda W)^{-1} J^T r \]

with \(\lambda > 0\). The case \(\lambda = 0\) is equivalent to a Gauss-Newton step. For \(\lambda\ll 1\) the search direction converges to that of Gauss-Newton while for \(\lambda\gg 1\) the \(p\) is the steepest descent direction. The basis for this algorithm is the result that there exists a \(\Delta\) related to \(\lambda\) such that \(p\) is the solution to the minimization

\[ \begin{align} p =& \text{arg min}_p ||J p + r||^2/2 \\ ||p||_W\leq &\Delta(\lambda) \end{align} \]

Note that there is an inverse relation between \(\Delta\) and \(\lambda\), i.e. if \(\lambda\gg 1\) then \(\Delta\ll 1\). From a (generalized) Eigenvalue decomposition of \(H\equiv J^T J = WE_H \Lambda_H E_H^{-1}\) (columns of \( E_H\) are Eigenvectors, \( E_H^T W E_H = 1\) and \( E_H^{-1} = E_H^T W\)) we get with \(\bar g = E_H^T J^T r \equiv E_H^T g\) (and \(H v_j = \lambda_j W v_j\))

\[ \begin{align} p =& -E_H(\Lambda + \lambda 1)^{-1} E_H^T J^T r \equiv E_H \bar p\\ p^T W p =& \sum_j \frac{\bar g_j^2}{(\lambda_j + \lambda)^2} \end{align} \]

Finally, note that in practice one needs to take care to make the algorithm scale invariant by using \(|| W p ||\leq \Delta\) in the constrained minimization.

Two ideas seem to exist going forward. The original paper by Marquardt suggests to use \(\lambda\) directly as the adaptive parameter (i.e. a substitute for the trust radius). Some notes suggest that in this case the dogleg method is preferable. Later, a paper by Moré (1977) "The Levenberg-Marquardt algorithm: Implementation and Theory" suggests to consider \(\Delta\) as given and iteratively finding \(\lambda\) such that

\[ ||p||_W = \Delta \]

(which is attributed to Hebden). They suggest to use QR decomposition and cleverly combine that with a root finding for \(\lambda\). In our own implementation we use lapack's Eigenvalue decomposition of \(J^T J\) and combine it with a 1d Newton root finding, to find \(\lambda\). The disadvantage to compute \(J^TJ\) is not too high compared to computing QR of \(J\) for low dimensions. Nocedal&Wright suggest to use the target function

\[ \phi(\lambda)= 1/\Delta - 1/||p(\lambda)||_W \]

for the root finding because \(\phi(\lambda)\simeq \lambda\) near the optimum, however one must take care because the target function is not differentiable at \(\lambda = -\lambda_i\). \(\lambda_0 > -\lambda_1\) where \(\lambda_1\) is the smallest Eigenvalue should be a good starting value.

For this to work with simple Newton iteration we need the derivative

\[ \partial_\lambda \phi = - \partial_\lambda ||p||_W = - ||p||_W^{-3} \sum_j \frac{\bar g_j^2}{(\lambda_j + \lambda)^3} < 0 \text{ for } \lambda \geq 0 \]

We here see that if \( \phi(0) <= 0\) we must immediately choose \( \lambda = 0\) since no \(\lambda > 0 \) exists such that \(\phi(\lambda) = 0\). In this case we have \(||p||_W <\Delta\). In the case \(\phi(0)>0\) we can use Newton iteration to find \( \lambda>0\) such that \( | ||p||_W - \Delta| \leq \sigma \Delta\). i.e.

\[ |\phi(\lambda) | \leq \frac{\sigma}{||p||_W} \]

where we choose \( \sigma = 10^{-4}\) as the stopping criterion

Adaptive choice of trust region radius

If we implement for given \(\Delta_k\) the above procedure we get a search vector \(p_k\). Similar to adaptive timestep algorithms we can define an error quantity comparing \(f(x+p_k)\) with \(m_k(p_k)\). In practice this is done using the ratio

\[ \begin{align} \rho_k = \frac{f(x_k) - f(x_k+p_k)}{m_k(0) - m_k(p_k)} \end{align} \]

with \( m_k(0) - m_k(p_k) = -0.5 p^T J^T J p - g_k^T p = -0.5 \bar p \Lambda \bar p + (J^T J p + \lambda W p)^T p = 0.5\bar p \Lambda \bar p +\lambda ||p||_W^2\). If \(\rho_k\approx 1\) it means that \(f(x)\) is very well approximated by \(m_k\) and we can extend the trust radius \(\Delta_k\) in the next step. If \(\rho_k \leq 0\) or \(\rho_k \ll 1\) it means that \(m_k\) was such a poor representation of \(f\) that we must reject the proposed step and reduce the trust region. We use Algorithm 4.1. of [Nocedal&Wright]

Choice of weights

The weights are chosen according to [Nocedal&Wright, More1967] as the decreasing sequence

\[ \begin{align} W^{k} =& \text{diag}( W_0^k, ..., W_{n-1}^k)\\ W_i^k = & \text{max}( || J_i||^2, W_i^{k-1}) \end{align} \]

See also: "Numerical optimization" by Nocedal & Wright (Springer 2006)

◆ newton()

template<class Gradient , class InvHessian , class ContainerType >

unsigned dg::mat::newton	(	Gradient	grad,
		InvHessian	invhess,
		ContainerType &	x0,
		double	tol = 1e-5,
		unsigned	max_iter = 1000 )

Newton iteration.

Template Parameters

Gradient	a callable with signature `void operator()(const ContainerType& x0, ContainerType& grad)`
InvHessian	a callable with signature `void operator()(const ContainerType& x0, const ContainerType& grad, ContainerType& p)`

Parameters

grad	The gradient of the target function
invhess	The inverse Hessian matrix
x0	initial guess on input, solution on output
tol	succes condition is \( \|\|\nabla f(x_k)\|\| < \epsilon\)
max_iter	Maximum number of allowed iterations

Returns: Number of iterations

A minimization algorithm based on the recursion

\[ x_{k+1} = x_k - H^{-1}(x_k) \vec g (x_k) \]

where \(H\) is the Hessian matrix and \(\vec g(x_k) = \nabla f |_{x_k}\) is the gradient of \(f(\vec x)\). The Newton iterate is essentially a root finding routine for the gradient of \(f(\vec x)\).

This already highlights the problem of using Newton for minimization:

it can also find maxima and (more importantly)
it finds saddle points.

In essence Newton only works if the Hessian is positive definite at all search points including the initial guess. Neither of these conditions is met often in practise, i.e. what we observe often is that * the Newton method is attracted by saddle points and moves in the wrong diretion

(Failed, not implemented) Line search and the Wolfe conditions

The Newton iterate produces a direction

\[ p_k = - H_k^{-1} \vec g_k \]

The idea of a line search algorithm is to introduce a step-length \(\alpha\) into the iterates:

\[ x_{k+1} = x_k + \alpha_k p_k \]

and to find \(\alpha\) through the condition

\[ \alpha_k = \text{arg min}_{\alpha} f(\vec x_k + \alpha \vec p_k) \equiv \text{arg min}_{\alpha} \phi_k(\alpha) \]

Typically, this minimization is not solved exactly. It is solved only to the degree that the Wolfe conditions are satisfied

\[ \begin{align} \phi(\alpha) &\leq \phi(0) + c_1 \alpha \phi'(0)\\ \phi'(\alpha) &\geq c_2 \phi'(0) \end{align} \]

with \(0<c_1<c_2<1\) and in practise \(c_1 = 10^{-4}\) and \(c_2 = 0.9\). A slightly stronger version are the strong Wolfe conditions

\[ \begin{align} \phi(\alpha) &\leq \phi(0) + c_1 \alpha \phi'(0)\\ |\phi'(\alpha)| &\leq |c_2 \phi'(0)| \end{align} \]

The problem with this is that

it is quite tedious to come up with an efficient line search that comes up with an efficient step length satisfying the Wolfe conditins
for example: negative Eigenvalues in \(H\) may lead to \(\vec p_k = -H^{-1}\vec g_k \) not a descent direction (but \(p_k\) being a descent direction is necessary for the existence of \(\alpha_k >0\) satisfying the Wolfe conditions)

See also: "Numerical optimization" by Nocedal & Wright (Springer 2006); https://arxiv.org/pdf/1406.2572; https://www.scientific.net/AMM.347-350.2586

Functions