|
Extension: Matrix functions
#include "dg/matrix/matrix.h"
|
\( \min_{x} f(x)\) More...
Functions | |
| template<class Gradient , class InvHessian , class ContainerType > | |
| unsigned | dg::mat::newton (Gradient grad, InvHessian invhess, ContainerType &x0, double tol=1e-5, unsigned max_iter=1000) |
| Newton iteration. | |
| template<class Func , class Jacobian , class ContainerType0 , class ContainerType1 > | |
| unsigned | dg::mat::levenberg_marquardt (Func fun, Jacobian jac, ContainerType0 &x0, const ContainerType1 ©able, double tol=1e-8, unsigned max_iter=1000) |
| The Levenberg Marquardt algorithm. | |
\( \min_{x} f(x)\)
| unsigned dg::mat::levenberg_marquardt | ( | Func | fun, |
| Jacobian | jac, | ||
| ContainerType0 & | x0, | ||
| const ContainerType1 & | copyable, | ||
| double | tol = 1e-8, | ||
| unsigned | max_iter = 1000 ) |
The Levenberg Marquardt algorithm.
| Func | Callable with signature void operator()(ContainerType0& x0, ContainerType1& rs) |
| Jacobian | Callable with signature void operator()(const
ContainerType0& x0, std::vector<ContainerType1>& jacs) (no need to resize the given jacs, its size is x0.size()) |
| ContainerType0 | Typically a host container (Determines the 1st argument type of Func and Jacobian) |
| ContainerType1 | Determine the target architecture to run the algorithm on (and the 2nd argument type of Func and Jacobian) |
| fun | Compute the vector of residuals \( r_j\) given parameters x |
| jac | Compute the gradients of the residuals: each element i of the outer std::vector contains the gradient \( \partial \vec r/ \partial x_i\) |
| x0 | The initial guess on input, solution on output. The size of x0 determines the number of free parameters \(n \) to optimize. |
| copyable | Determine the size \(m\) and type of the second arguments of fun and jac The contents are irrelevant, just the size and type are important. |
| tol | Tolerance determines termination condition if( ||p_k|| <= tol*(||x_k|| + 1.)) |
| max_iter | Maximum number of iterations |
A line search algorithm first chooses an appropriate descent direction and then tries to find the optimal step length. A trust region algorithm first chooses a (maximum) step length and then tries to find the direction that minimizes the function. Both algorithms (at least of Newton type) assume a Taylor expansion of the function
\[ f(x_k + p) \approx f(x_k) + g_k^T p + \frac{1}{2} p^T H_k p \equiv m_k(p) \]
A trust region algorithm is an algorithm that minimizes \(m_k(p)\) over \(p\) with the condition that \(||p|| \leq \Delta_k\) where \(\Delta_k\) is the trust region radius:
\[ \begin{align} p &= \text{arg min}_p m_k(p) \\ ||p || &\leq \Delta_k \end{align} \]
It turns out somewhat surprisingly that even though \(m_k(p)\) is a quadratic function a closed solution is somewhat non-trivial (at least the literature immediately goes on to only finding approximate solutions)
Let us first state an important result: The trust region minimization has no solution \(p\) with \(||p|| = \Delta_k\) if and only if \(H\) is positive definite and \(|| H_k^{-1} g_k|| < \Delta_k\). This result means we can focus on the case were the minimization has a solution \(||p|| = \Delta_k\).
Assume our objective function (the minimum of which we seek) has the form
\[ f( x) = \frac{1}{2}\sum_{j=1}^m r_j(x)^2 \]
where \(r_j(x)\) (the residuals) are smooth functions and \( x\) are the parameters (of the fit function) and we assume \(m>n\) with \(n\) the number of parameters (or dimension of \(x\)). Typically, this appears if \( f\) is an error function and \( x\) are a set of \(n\) parameters that we try to optimize to reduce the residuals at various test points j. In the literature minimization of this form is treated under nonlinear least squares problems (because the \( r_j\) depend nonlinearly on the parameters \(x\), if the residuals depend linearly on the parameters then the problem is a linear least squares problem that can be solved straighforwardly).
Let us define
\[ J = \frac{\partial r_j}{\partial x_i} \]
Then (with \(r(x) = (r_1(x), r_2(x) , ...)\))
\[ \begin{align} g(x) =& \nabla f(x) = J^T r(x)\\ H(x) =& J^T J + \sum_{j=1}^m r_j(x) \nabla\nabla r_j(x) \end{align} \]
Often, the second term in the Hessian \(H(x)\) can be neglected, which is the distinctive feature of least squares problems.
In fact, Newton's method where \(H\) is replaced with \(J^T J\) is called Gauss-Newton method and the search direction is the solution to the linear least squares problem (at each step \(k\))
\[ \text{min}_p || J p + r||^2 \]
Of course, in practice Gauss-Newton still needs to be combined with a line-search method.
The LM algorith is a trust region algorithm where the Hessian is replaced by \(H\approx J^T J\) and the Jacobian is the \(m\times n\) derivatives of the \(r_i(x)\). Doing so, all Eigenvalues are \(\lambda_i \geq 0\).
The way to solve a constraint minimization problem (the trust region minimization above) is through Lagrange multipliers
\[ L(p, \lambda) = r^T J p + \frac{1}{2} p^T J^T J p + \lambda( ||p||_W^2 - \Delta^2)/2 \simeq ||Jp +r||^2/2 + \lambda( ||p||_W^2 - \Delta^2)/2 \]
The Euler Lagrange equations read
\[ \begin{align} J^T r + J^T J p + \lambda W p =& 0 \\ p^T W p =& \Delta^2 \end{align} \]
The proposed search direction is therefore
\[ p = -( J^T J + \lambda W)^{-1} J^T r \]
with \(\lambda > 0\). The case \(\lambda = 0\) is equivalent to a Gauss-Newton step. For \(\lambda\ll 1\) the search direction converges to that of Gauss-Newton while for \(\lambda\gg 1\) the \(p\) is the steepest descent direction. The basis for this algorithm is the result that there exists a \(\Delta\) related to \(\lambda\) such that \(p\) is the solution to the minimization
\[ \begin{align} p =& \text{arg min}_p ||J p + r||^2/2 \\ ||p||_W\leq &\Delta(\lambda) \end{align} \]
Note that there is an inverse relation between \(\Delta\) and \(\lambda\), i.e. if \(\lambda\gg 1\) then \(\Delta\ll 1\). From a (generalized) Eigenvalue decomposition of \(H\equiv J^T J = WE_H \Lambda_H E_H^{-1}\) (columns of \( E_H\) are Eigenvectors, \( E_H^T W E_H = 1\) and \( E_H^{-1} = E_H^T W\)) we get with \(\bar g = E_H^T J^T r \equiv E_H^T g\) (and \(H v_j = \lambda_j W v_j\))
\[ \begin{align} p =& -E_H(\Lambda + \lambda 1)^{-1} E_H^T J^T r \equiv E_H \bar p\\ p^T W p =& \sum_j \frac{\bar g_j^2}{(\lambda_j + \lambda)^2} \end{align} \]
Finally, note that in practice one needs to take care to make the algorithm scale invariant by using \(|| W p ||\leq \Delta\) in the constrained minimization.
Two ideas seem to exist going forward. The original paper by Marquardt suggests to use \(\lambda\) directly as the adaptive parameter (i.e. a substitute for the trust radius). Some notes suggest that in this case the dogleg method is preferable. Later, a paper by Moré (1977) "The Levenberg-Marquardt algorithm: Implementation and Theory" suggests to consider \(\Delta\) as given and iteratively finding \(\lambda\) such that
\[ ||p||_W = \Delta \]
(which is attributed to Hebden). They suggest to use QR decomposition and cleverly combine that with a root finding for \(\lambda\). In our own implementation we use lapack's Eigenvalue decomposition of \(J^T J\) and combine it with a 1d Newton root finding, to find \(\lambda\). The disadvantage to compute \(J^TJ\) is not too high compared to computing QR of \(J\) for low dimensions. Nocedal&Wright suggest to use the target function
\[ \phi(\lambda)= 1/\Delta - 1/||p(\lambda)||_W \]
for the root finding because \(\phi(\lambda)\simeq \lambda\) near the optimum, however one must take care because the target function is not differentiable at \(\lambda = -\lambda_i\). \(\lambda_0 > -\lambda_1\) where \(\lambda_1\) is the smallest Eigenvalue should be a good starting value.
For this to work with simple Newton iteration we need the derivative
\[ \partial_\lambda \phi = - \partial_\lambda ||p||_W = - ||p||_W^{-3} \sum_j \frac{\bar g_j^2}{(\lambda_j + \lambda)^3} < 0 \text{ for } \lambda \geq 0 \]
We here see that if \( \phi(0) <= 0\) we must immediately choose \( \lambda = 0\) since no \(\lambda > 0 \) exists such that \(\phi(\lambda) = 0\). In this case we have \(||p||_W <\Delta\). In the case \(\phi(0)>0\) we can use Newton iteration to find \( \lambda>0\) such that \( | ||p||_W - \Delta| \leq \sigma \Delta\). i.e.
\[ |\phi(\lambda) | \leq \frac{\sigma}{||p||_W} \]
where we choose \( \sigma = 10^{-4}\) as the stopping criterion
If we implement for given \(\Delta_k\) the above procedure we get a search vector \(p_k\). Similar to adaptive timestep algorithms we can define an error quantity comparing \(f(x+p_k)\) with \(m_k(p_k)\). In practice this is done using the ratio
\[ \begin{align} \rho_k = \frac{f(x_k) - f(x_k+p_k)}{m_k(0) - m_k(p_k)} \end{align} \]
with \( m_k(0) - m_k(p_k) = -0.5 p^T J^T J p - g_k^T p = -0.5 \bar p \Lambda \bar p + (J^T J p + \lambda W p)^T p = 0.5\bar p \Lambda \bar p +\lambda ||p||_W^2\). If \(\rho_k\approx 1\) it means that \(f(x)\) is very well approximated by \(m_k\) and we can extend the trust radius \(\Delta_k\) in the next step. If \(\rho_k \leq 0\) or \(\rho_k \ll 1\) it means that \(m_k\) was such a poor representation of \(f\) that we must reject the proposed step and reduce the trust region. We use Algorithm 4.1. of [Nocedal&Wright]
The weights are chosen according to [Nocedal&Wright, More1967] as the decreasing sequence
\[ \begin{align} W^{k} =& \text{diag}( W_0^k, ..., W_{n-1}^k)\\ W_i^k = & \text{max}( || J_i||^2, W_i^{k-1}) \end{align} \]
| unsigned dg::mat::newton | ( | Gradient | grad, |
| InvHessian | invhess, | ||
| ContainerType & | x0, | ||
| double | tol = 1e-5, | ||
| unsigned | max_iter = 1000 ) |
Newton iteration.
| Gradient | a callable with signature void operator()(const ContainerType& x0, ContainerType& grad) |
| InvHessian | a callable with signature void operator()(const ContainerType& x0, const ContainerType& grad, ContainerType& p) |
| grad | The gradient of the target function |
| invhess | The inverse Hessian matrix |
| x0 | initial guess on input, solution on output |
| tol | succes condition is \( ||\nabla f(x_k)|| < \epsilon\) |
| max_iter | Maximum number of allowed iterations |
A minimization algorithm based on the recursion
\[ x_{k+1} = x_k - H^{-1}(x_k) \vec g (x_k) \]
where \(H\) is the Hessian matrix and \(\vec g(x_k) = \nabla f |_{x_k}\) is the gradient of \(f(\vec x)\). The Newton iterate is essentially a root finding routine for the gradient of \(f(\vec x)\).
This already highlights the problem of using Newton for minimization:
In essence Newton only works if the Hessian is positive definite at all search points including the initial guess. Neither of these conditions is met often in practise, i.e. what we observe often is that * the Newton method is attracted by saddle points and moves in the wrong diretion
The Newton iterate produces a direction
\[ p_k = - H_k^{-1} \vec g_k \]
The idea of a line search algorithm is to introduce a step-length \(\alpha\) into the iterates:
\[ x_{k+1} = x_k + \alpha_k p_k \]
and to find \(\alpha\) through the condition
\[ \alpha_k = \text{arg min}_{\alpha} f(\vec x_k + \alpha \vec p_k) \equiv \text{arg min}_{\alpha} \phi_k(\alpha) \]
Typically, this minimization is not solved exactly. It is solved only to the degree that the Wolfe conditions are satisfied
\[ \begin{align} \phi(\alpha) &\leq \phi(0) + c_1 \alpha \phi'(0)\\ \phi'(\alpha) &\geq c_2 \phi'(0) \end{align} \]
with \(0<c_1<c_2<1\) and in practise \(c_1 = 10^{-4}\) and \(c_2 = 0.9\). A slightly stronger version are the strong Wolfe conditions
\[ \begin{align} \phi(\alpha) &\leq \phi(0) + c_1 \alpha \phi'(0)\\ |\phi'(\alpha)| &\leq |c_2 \phi'(0)| \end{align} \]
The problem with this is that