Abstract
In this paper we propose new numerical algorithms in the setting of unconstrained optimization problems, and we prove rates of convergence in the iterates of the objective function.
Furthermore, our algorithms are based upon splitting and symplectic methods, and they preserve the energy properties of the inherent continuous dynamical system that contains a Hessian perturbation.
At the same time, we show that Nesterov's gradient method is equivalent to a Lie-Trotter splitting applied to a Hessian-driven damping system.
Finally, some numerical experiments are presented in order to validate the theoretical results.
Authors
Cristian Daniel Alecsa
Tiberiu Popoviciu Institute of Numerical Analysis Romanian Academy Cluj-Napoca, Romania
Romanian Institute of Science and Technology, Cluj-Napoca, Romania
Keywords
unconstrained optimization; rate of convergence; inertial algorithms; Nesterov; convex function; Hessian; dynamical system; splitting operator; vanishing damping; Lie-Trotter.
Paper coordinates
Cristian-Daniel Alecsa, The long time behavior and the rate of convergence of symplectic convex algorithms obtained via splitting discretizations of inertial damping systems, arXiv:2001.10831
About this paper
References
[1] C.D. Alecsa, S.C. Laszlo, T. Pinta, An extension of the second order dynamical system that models Nesterov's convex gradient method, 2019, arXiv:1908.02574
[2] C.D. Alecsa, I. Boros, T. Pinta, New optimization algorithms for neural network training using operator splitting techniques, 2019, arXiv:1904.12952
[3] F. Alvarez, H. Attouch, J. Bolte, P. Redont, A second-order gradient like dissipative dynamical system with Hessian-driven damping. Application to optimization and mechanics, J. Math. Pures Appl. 81 (2002), no. 8, pp. 747-779.
[4] H. Attouch, Z. Chbani, J. Fadili, H. Riahi, First-order optimization algorithms via inertial systems with Hessian driven damping, 2019, arXiv:1907.10536
[5] H. Attouch, A. Cabot, Asymptotic stabilization of inertial gradient dynamics with time-dependent viscosity, J. Differential Equations 263 (2017), pp. 5412-5458, doi:10.1016/j.jde.2017.06.024
[6] H. Attouch, A. Cabot, Convergence rates of inertial forward-backward algorithms, SIAM J. Optim. 28 (2018), no. 1, pp. 849-874, https://doi.org/10.1137/17M1114739
[7] H. Attouch, X. Goudou, P. Redont, The heavy ball with friction method. I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative system, Commun. Contemp. Math. 2 (2000), pp. 1-34.
[8] H. Attouch, J. Peypouquet, P. Redont, Fast convex minimization via inertial dynamics with Hessian driven damping, J. Differential Equations 261 (2016), pp. 5734-5783.
[9] N. Bansal, A. Gupta, Potential-function proofs for gradient methods, Theory of Computing (2019), pp. 1-32.
[10] A. Batkai, P. Csomos, B. Farkas, G. Nickel, Operator splitting for non-autonomous evolution equations, J. Functional Analysis 260 (2011), pp. 2163-2190.
[11] A. Beck, Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MATLAB, vol. 19, SIAM, 2014.
[12] S. Blanes, F. Casas, A. Murua, Splitting and composition methods in the numerical integration of differential equations, Bol. Soc. Esp. Mat. Apl. 45 (2008), pp. 89-145.
[13] R.I. Bot, R. Csetnek, S.C. Laszlo, A second order dynamical approach with variable damping to nonconvex smooth minimization, Applic. Analysis, 2018, doi:10.1080/00036811.2018.1495330
[14] A. Cabot, H. Engler, S. Gadat, On the long time behavior of second-order differential equations with asymptotically small dissipation, Trans. Amer. Math. Soc. 361 (2009), pp. 5983-6017.
[15] A. Cabot, H. Engler, S. Gadat, Second order differential equations with asymptotically small dissipation and piecewise flat potentials, Electron J. Differential Equations 17 (2009), pp. 33-38.
[16] C. Castera, J. Bolte, C. Fevotte, E. Pauwels, An inertial Newton algorithm for deep learning, 2019, arXiv:1905.12278.
[17] A. Chambolle, C. Dossal, On the convergence of the iterates of FISTA, HAL Id: hal-01060130, 2014, https://hal.inria.fr/hal-01060130v3
[18] L. Chen, H. Luo, First order optimization methods based on Hessian-driven Nesterov accelerated gradient flow, 2019, arXiv:1912.09276.
[19] A. Defazio, On the curved geometry of accelerated optimization, Adv. Neural Inf. Processing Syst. 33 (NIPS 2019), 2019.
[20] K. Feng, On difference schemes and symplectic geometry, Proceedings of the 5-th Intern. Symposium on differential geometry & differential equations, August 1984, Beijing (1985), pp. 42-58.
[21] G. Franca, D.P. Robinson, R. Vidal, Gradient flows and accelerated proximal splitting methods, 2019, arXiv:1908.00865.
[22] G. Franca, J. Sulam, D.P. Robinson, R. Vidal, Conformal symplectic and relativistic optimization, 2019, arXiv:1903.04100.
[23] E. Hairer, C. Lubich, G. Wanner, Geometric Numerical Integration. Structure-Preserving Algorithms for ODE’s, second edition, Springer-Verlag Berlin Heidelberg, 2006, doi:10.1007/3-540-30666-8.
[24] E. Hairer, G. Wanner, Euler methods, explicit, implicit and symplectic. In : Encyclopedia of Applied and Computational Mathematics, B. Engquist ed., Springer, Berlin Heidelberg, 2015, pp. 451-455.
[25] E. Hansen, F. Kramer, A. Ostermann, A second-order positivity preserving scheme for semilinear parabolic problems, Appl. Num. Math. 62 (2012), pp. 1428-1435.
[26] E. Hansen, A. Ostermann, Exponential splitting for unbounded operators, Math. of Comput. 78 (2009), no. 267, pp. 1485-1496.
[27] S.C. Laszlo, Convergence rates for an inertial algorithm of gradient type associated to a smooth nonconvex minimization, 2018, arXiv:1807.00387.
[28] G. Marchuk, Some applications of splitting-up methods to the solution of mathematical physics problems, Aplikace Matematiky 13 (1968), pp. 103-132.
[29] M. Muehlebach, M.I. Jordan, A dynamical systems perspective on Nesterov acceleration, 2019, arXiv:1905.07436.
[30] Y. Nesterov, A method for solving a convex programming problem with convergence rate O(1/k^2) (in Russian), Soviet Math. Dokl. 27 (1983), no. 2, pp. 543-547.
[31] Y. Nesterov, Introductory lectures on convex optimization : A basic course, vol. 87 of Applied Optimization, Kluwer Academic Publishers, Boston, MA, 2004.
[32] N.C. Nguyen, P. Fernandez, R.M. Freund, J. Peraire, Accelerated residual methods for iterative solution of systems of equations, SIAM J. Sci. Computing 40 (2018), pp. A3157-A3179.
[33] M. Laborde, A.M. Oberman, A Lyapunov analysis for accelerated gradient methods : from deterministic to stochastic case, 2019, arXiv:1908.07861.
[34] B.T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Comput. Math. Math. Phys. 4 (1964), pp. 1-17.
[35] J.M. Sanz-Serna, Symplectic methods. In : Encyclopedia of Applied and Computational Mathematics, B. Engquist ed., Springer, Berlin Heidelberg, 2015, pp. 1451-1458.
[36] B. Shi, S.S. Du, W.J. Su, M. I. Jordan, Acceleration via symplectic discretization of high-resolution differential equations, 2019, arXiv:1902.03694.
[37] B. Shi, S.S. Iyengar, A conservation law method based on optimization. In : Mathematical Theories of Machine Learning – Theory and Applications, Springer, Cham, 2019, pp. 63-85 https://doi.org/10.1007/978-3-030-17076-9_8.
[38] G. Strang, On the construction and comparison of difference schemes, SIAM J. Numer. Anal. 5 (1968), no. 3, pp. 506-517.
[39] W.J. Su, S. Boyd, E.J. Candes, A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights, J. Machine Learning Res. 17 (2016), pp. 1-43.
[40] T. Sun, P. Yin, D. Li, C. Huang, L. Guan, H. Jiang, Non-ergodic convergence analysis of heavy-ball algorithms, AAAI, 2019.
[41] H.F. Trotter, On the product of semi-groups of operators, Proc. Am. Math. Soc. 10 (1959), no. 4, pp. 545-551.
[42] P. Tseng, On accelerated proximal-gradient methods for convex-concave optimization, Manuscript, May 21 2018, mit.edu/~dimitrib/PTseng/papers/apgm.pdf.
[43] R. de Vogelaere, Methods of integration which preserve the contact transformation property of the Hamiltonian equations, Report no. 4, Dept. Math., Univ. of Notre Dame, Notre Dame, Ind., 1956.
Paper (preprint) in HTML form
The long time behavior and the rate of convergence of symplectic convex algorithms obtained via splitting discretizations of inertial damping systems.
Abstract. In this paper we propose new numerical algorithms in the setting of unconstrained optimization problems, and we study the rate of convergence in the iterates of the objective function. Furthermore, our algorithms are based upon splitting and symplectic methods, and they preserve the energy properties of the inherent continuous dynamical system that contains a Hessian perturbation. In particular, we show that Nesterov's gradient method is equivalent to a Lie-Trotter splitting applied to a Hessian-driven damping system. Finally, some numerical experiments are presented in order to validate the theoretical results.
Key Words. unconstrained optimization problems, rate of convergence, inertial algorithms, Nesterov convex gradient method, convex function, Hessian driven damping, dynamical system, Lie-Trotter splitting
1 Preliminaries
The theory of accelerated optimization algorithms originated in the work of Polyak (see [34]) and culminated with the convex gradient method developed by Nesterov in [30]. The theory of dynamical systems associated to Nesterov's accelerated method was first introduced in [39] by Su, Boyd and Candès, who pointed out that their derivation of the limiting differential equation was informal. Also in the case of convex objective functions, the first true model for Nesterov's discretization was considered very recently by Muehlebach and Jordan in [29] and in the generalization given in [1] by Alecsa, László and Pinţa, where a perturbation of the exact solution of the dynamical system was considered under the gradient. On the other hand, also recently, Defazio showed in [19] that the accelerated gradient method can be seen as a proximal point method on a Riemannian manifold, but the theoretical analysis was done under the assumption of strong convexity. The first contribution of the present paper is to give a rigorous alternative to these approaches and to show that Nesterov's method is in fact a Lie-Trotter discretization of a Hessian-driven damping dynamical system (see Theorem 2 and Subsection 2.1). Our second important contribution is to show that many well-known optimization algorithms are in fact combinations of splitting and symplectic methods, and to show further that the algorithms based upon Hessian-driven damping discretizations (similar to those in [4]) have a continuous dynamical system that is also constructed with respect to the aforementioned methods (see Subsection 2.2 and Theorem 4). On the other hand, we introduce some generalized optimization algorithms associated to Hessian-driven damping dynamical systems that have many advantages from a practical point of view (see LT-S-IGAHD and Section 4); this represents the third and last contribution of the present article.
We consider the following unconstrained optimization problem

(1)   $\min\limits_{x \in \mathbb{R}^n} f(x),$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a convex and Fréchet differentiable function whose gradient is $L$-Lipschitz, i.e. there exists $L > 0$ such that for each $x, y \in \mathbb{R}^n$

(2)   $\| \nabla f(x) - \nabla f(y) \| \leq L \| x - y \|.$
For simplicity, we consider that all of the second order equations appearing in this paper are coupled with initial conditions $x(t_0) = x_0$ and $\dot{x}(t_0) = v_0$, respectively (where $t_0 > 0$). This assumption is imposed only for the clarity of the paper, since (as in [4] and other similar papers) we are interested in the asymptotic behavior of a trajectory of the underlying dynamical system. On the other hand, this is also true for the algorithmic counterparts, where, given $x_0 \in \mathbb{R}^n$ and taking $x_1$ arbitrary, the accelerated inertial methods start from the pair $(x_0, x_1)$ in order to compute the subsequent iterates.
Now, we give a brief presentation regarding the discrete and the continuous analysis of optimization problems. In [34], Polyak has introduced the heavy ball with friction dynamical system, defined as
(HBFγ)   $\ddot{x}(t) + \gamma \dot{x}(t) + \nabla f(x(t)) = 0.$
Here $\gamma > 0$ is a fixed viscous damping parameter that represents the coefficient of friction of the dynamical system (HBFγ). Furthermore, the combination between the second derivative $\ddot{x}$ of the solution and the damping term $\gamma \dot{x}$ is used in order to accelerate gradient methods by means of inertial coefficients. For (HBFγ), the objective function does not decrease along the trajectories. On the other hand, the global energy (kinetic + potential) $\mathcal{E}(t) = \frac{1}{2} \| \dot{x}(t) \|^2 + f(x(t))$ is monotonically decreasing, being a continuous Lyapunov-type functional. Furthermore, we recall that Attouch et al. (in [7]) have studied the asymptotic properties of this dynamical system.
The natural discretization of (HBFγ) is Polyak’s inertial method (see [34]), i.e.
(PIMγ)   $x_{n+1} = x_n + \gamma \left( x_n - x_{n-1} \right) - \lambda \nabla f(x_n),$

where, in contrast with descent methods, it contains the additional inertial term $\gamma (x_n - x_{n-1})$. If we divide this difference between two consecutive iteration values by the stepsize $h$, then we obtain a basic approximation to the first order derivative of the solution of (HBFγ). As a side note, we mention that in [40] the authors made the first non-ergodic analysis of the rate of convergence for a variant of Polyak's algorithm (PIMγ) that contains a non-constant stepsize related to the inertial term, under the assumption that the objective function is coercive. Now, we shift our attention to the most popular inertial algorithm, developed by Y. Nesterov in [30]. The following update rule is often called Nesterov's second accelerated method, and it takes the following form:
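To make the iteration concrete, here is a minimal sketch of (PIMγ) on a toy quadratic; the stepsize and inertial coefficient below are illustrative choices, not values prescribed in the text.

```python
# Sketch of Polyak's inertial method (PIM_gamma) on the quadratic
# f(x) = ||x||^2; the stepsize lam and inertial coefficient gamma are
# illustrative choices, not values prescribed in the text.
import numpy as np

def polyak(grad, x0, lam=0.1, gamma=0.5, iters=200):
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        # inertial term gamma*(x_n - x_{n-1}) plus a gradient step
        x_next = x + gamma * (x - x_prev) - lam * grad(x)
        x_prev, x = x, x_next
    return x

x_star = polyak(lambda x: 2.0 * x, np.array([5.0, -3.0]))   # grad f(x) = 2x
```

For this strongly convex example the iterates spiral into the unique minimizer $x^* = 0$, illustrating the damped-oscillation behavior of the heavy ball method.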
(AGM2α)   $\begin{cases} y_n = x_n + \dfrac{n - 1}{n + \alpha - 1} \left( x_n - x_{n-1} \right), \\[4pt] x_{n+1} = y_n - s \nabla f(y_n), \end{cases}$

where, in practical applications, $\alpha \geq 3$ and the stepsize satisfies $s \leq 1/L$ (see the work of Chambolle and Dossal [17]). A different set of update rules is given by the so-called Nesterov's first accelerated method (briefly AGM1), which gives the best convergence rate in the objective function, and which is defined through a sequence $(t_n)_{n \geq 1}$ whose general term is defined recursively as $t_{n+1} = \dfrac{1 + \sqrt{1 + 4 t_n^2}}{2}$. The equivalence between (AGM1) and (AGM2α) for $\alpha = 3$ has been used in [9] regarding a Lyapunov analysis of Nesterov's algorithm. Also, in [42] some unified generalizations were made using Bregman distances. In contrast with Polyak's method (PIMγ), in [30] it was shown that Nesterov's algorithm has an improved rate of convergence, namely $f(x_n) - \min f = \mathcal{O}(1/n^2)$. Further, we point out that, in a nonconvex setting, a two-step inertial method similar to (AGM2α) has been recently studied in [27]. In [39], Su et al. introduced the continuous counterpart of Nesterov's method, namely the dynamical system of the form
(AVDα)   $\ddot{x}(t) + \dfrac{\alpha}{t} \dot{x}(t) + \nabla f(x(t)) = 0.$
The dynamical system with asymptotic vanishing damping was introduced in the framework of convex optimization problems. For $\alpha \geq 3$, it was shown that each trajectory of (AVDα) satisfies the rate of convergence

$f(x(t)) - \min\limits_{x \in \mathbb{R}^n} f(x) = \mathcal{O} \left( 1/t^2 \right).$
The relationship between Nesterov's algorithm (AGM2α) and the dynamical system (AVDα) is based upon vanishing stepsize arguments, and so Nesterov's optimization scheme cannot be considered a proper discretization of the associated differential equation (even the authors of [39] pointed out that their derivation of the dynamical system is informal). A general version of (AVDα) can be found in the seminal papers [14], [15] and in their extensions given in [5], where $\alpha/t$ is replaced by a general damping function $\gamma(t)$.
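For reference, a minimal sketch of the update rules (AGM2α) with the Chambolle-Dossal-type damping coefficient; the quadratic test function, the value $\alpha = 3$ and the stepsize $s = 1/L$ are illustrative choices.

```python
# Sketch of (AGM2_alpha): extrapolation with coefficient (n-1)/(n+alpha-1)
# followed by a gradient step of size s = 1/L, on f(x) = ||x||^2 (L = 2).
import numpy as np

def agm2(grad, x0, L, alpha=3.0, iters=300):
    s = 1.0 / L
    x_prev, x = x0.copy(), x0.copy()
    for n in range(1, iters + 1):
        y = x + (n - 1.0) / (n + alpha - 1.0) * (x - x_prev)   # inertial predictor
        x_prev, x = x, y - s * grad(y)                         # gradient step at y
    return x

sol = agm2(lambda x: 2.0 * x, np.array([4.0, 1.0]), L=2.0)     # grad f(x) = 2x
```

Note that the gradient is evaluated at the predictor $y_n$, not at $x_n$; this is precisely the feature that prevents (AGM2α) from being a plain forward Euler discretization.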
In [13], a perturbed variant of (AVDα) was analyzed in the setting of nonconvex minimization problems, using the Kurdyka-Lojasiewicz property imposed on a regularization of the objective function.
Very recently, in [29], it was shown that Nesterov's convex gradient method is the natural discretization of a modified second order dynamical system. Further, a generalized version of the system from [29] was studied in [1], in which the authors proved an $\mathcal{O}(1/t^2)$ rate of convergence for the objective function along the trajectories of the dynamical system that contains a perturbation under the gradient (see [1]):

(Extended-AVD)   $\ddot{x}(t) + \dfrac{\alpha}{t} \dot{x}(t) + \nabla f \left( x(t) + \beta(t) \dot{x}(t) \right) = 0.$
On the other hand, for this continuous version of Nesterov’s algorithm some theoretical results were derived related to the regularity of the solution of the associated initial value problem.
Now, we shift our attention to generalizations of the aforementioned dynamical systems, in which the derivative of the solution is perturbed with the value of the Hessian along the trajectories. In [4], Attouch et al. introduced the Hessian-driven damping dynamical system, i.e.

(DIN-AVD)   $\ddot{x}(t) + \dfrac{\alpha}{t} \dot{x}(t) + \beta \nabla^2 f(x(t)) \dot{x}(t) + \nabla f(x(t)) = 0,$

with $\alpha \geq 3$ and $\beta \geq 0$.
Furthermore, the inertial system with constant damping, which represents a generalization of the heavy ball with friction, was first studied in [3]; see also [8] for the case of the vanishing damping $\alpha/t$. The idea behind the Hessian dynamics is that the oscillations that are present both in the continuous and in the discrete case are neutralized in the presence of convex objective functions. Now, we present some particular cases that play a key role in the present paper. The Hessian system with vanishing damping was considered in [33], where M. Laborde and A.M. Oberman pointed out that Nesterov's discretization is in fact a forward Euler discretization for the Hessian dynamical system. In fact, they showed that the method is a forward Euler discretization with a predictor value under the gradient, but not a true discretization. So, in this paper, we claim that Nesterov's accelerated gradient method is indeed a proper discretization scheme of the Hessian system, namely a full Lie-Trotter splitting discretization composed of forward Euler methods. In [4], Attouch et al. proposed a Nesterov-like algorithm, but with two consecutive gradients that represent a discretization of the Hessian. This is linked with the difference of gradients $\nabla f(x_n) - \nabla f(x_{n-1})$, which acts like an additional dissipation term. We shall show a link between the true discretization of a modified (DIN-AVD)-type system, which is a Polyak-type method, and the algorithm introduced by Attouch et al. On the other hand, another particular case is the system with constant damping, for which, in [16], the forward Euler method was used to construct an optimizer for deep learning neural networks. At the same time, this discretization scheme is a forward Euler method applied to a first order dynamical system that is equivalent to the second order differential equation, taking into account the regularity of $f$, i.e.
(3)   $\begin{cases} \dot{x}(t) + \beta \nabla f(x(t)) + \left( \gamma - \dfrac{1}{\beta} \right) x(t) + \dfrac{1}{\beta} y(t) = 0, \\[4pt] \dot{y}(t) + \left( \gamma - \dfrac{1}{\beta} \right) x(t) + \dfrac{1}{\beta} y(t) = 0. \end{cases}$
We mention that the particular case of the vanishing damping Hessian system can also be written as a first order dynamical system. The following form is much simpler than the former first order system, and it will be used frequently in our section of main results:

(4)   $\begin{cases} \dot{x}(t) = \dfrac{2}{t} \left( v(t) - x(t) \right) - \beta \nabla f(x(t)), \\[4pt] \dot{v}(t) = -\dfrac{t}{2} \nabla f(x(t)). \end{cases}$

Finally, we end this section by pointing out that, for $\beta = 0$, the dynamical system (4) reduces to a system that is equivalent to the second order evolution equation (AVDα) with $\alpha = 3$. For other details, we refer to [33].
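To make the first-order form concrete, here is a minimal forward Euler integration of the system (4) in one dimension, with $\beta = 0$ (so that it reduces to the (AVDα) case with $\alpha = 3$); the objective, the grid and the initialization $v(t_0) = x(t_0)$ are illustrative assumptions, not choices from the paper.

```python
# Forward Euler integration of the first-order system (4) with beta = 0:
#   xdot = (2/t)(v - x),   vdot = -(t/2) * grad_f(x),
# for f(x) = x^2 / 2 in one dimension; grid and horizon are illustrative.
def integrate_nag_flow(grad, x0, t0=1.0, h=1e-3, steps=20000):
    x, v, t = x0, x0, t0          # v(t0) = x(t0) gives zero initial velocity
    for _ in range(steps):
        dx = (2.0 / t) * (v - x)
        dv = -(t / 2.0) * grad(x)
        x, v, t = x + h * dx, v + h * dv, t + h
    return x

x_T = integrate_nag_flow(lambda x: x, x0=3.0)   # grad f(x) = x
```

By the end of the time horizon the trajectory has decayed close to the minimizer at the origin, consistent with the $\mathcal{O}(1/t^2)$ decay of the objective values along (AVDα) trajectories.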
1.1 Some notes on splitting operator techniques
In this subsection we present the most used splitting techniques for a general autonomous evolution equation of the form
(5)   $\dot{y}(t) = F(y(t)), \qquad y(0) = y_0,$

where $F : \mathbb{R}^d \to \mathbb{R}^d$ is the vector field that generates the flow of the dynamical system. If the vector field from the right hand side can be decomposed into two vector fields, i.e. $F = F^{[1]} + F^{[2]}$, then we attach the associated sub-problems of (5), namely

(6)   $\dot{y}(t) = F^{[1]}(y(t)) \qquad \text{and} \qquad \dot{y}(t) = F^{[2]}(y(t)).$

Let $\Phi^{[1]}_t$ be the solution flow of the first sub-problem and $\Phi^{[2]}_t$ be the solution flow of the second one. Then, the Lie-Trotter splitting or sequential splitting (see [41]) after one stepsize $h > 0$ is:

(LTS1)   $\Phi_h = \Phi^{[2]}_h \circ \Phi^{[1]}_h.$
Another Lie-Trotter splitting operator is obtained by first applying the exact solution of the second sub-problem and feeding it as an initial condition into the first one, i.e.

(LTS2)   $\Phi_h = \Phi^{[1]}_h \circ \Phi^{[2]}_h.$

Here we have employed the usual notation $\Phi^{[j]}_h := \Phi^{[j]}(h, \cdot)$ for $j \in \{1, 2\}$. For a discrete time $t_n = nh$, (LTS1) becomes $y_{n+1} = \Phi^{[2]}_h \big( \Phi^{[1]}_h (y_n) \big)$ and (LTS2) is equivalent to $y_{n+1} = \Phi^{[1]}_h \big( \Phi^{[2]}_h (y_n) \big)$. A more advanced type of splitting technique is the so-called Strang or Marchuk splitting (see [28] and [38]). From a numerical point of view, it has an advantage over the Lie-Trotter splitting, since it is second-order accurate. Again, given a stepsize $h > 0$, the Strang splitting at $t_n$ is defined as
(SS1)   $\Phi_h = \Phi^{[1]}_{h/2} \circ \Phi^{[2]}_h \circ \Phi^{[1]}_{h/2}.$

Again, the Strang splitting operator can appear in a different symmetric form, namely

(SS2)   $\Phi_h = \Phi^{[2]}_{h/2} \circ \Phi^{[1]}_h \circ \Phi^{[2]}_{h/2}.$

Applying this recursively, we obtain the splitting operator formula for the discrete time $t_n = nh$. Quite interestingly, if, for example, the exact flows $\Phi^{[1]}$ and $\Phi^{[2]}$ are not available, then one can use consistent approximations of these solutions. In this manner, from now on, $\varphi^{[1]}_h$ and $\varphi^{[2]}_h$ represent numerical discretizations of the flows of the two sub-problems after one stepsize $h$, respectively. Then, for example, the Lie-Trotter splitting (LTS1) can be defined as

(7)   $y_{n+1} = \varphi^{[2]}_h \big( \varphi^{[1]}_h (y_n) \big).$
This is also valid for the Strang splitting operator formula (see [35]). On the other hand, for the case of multiple vector fields, when $F = \sum_{i=1}^{N} F^{[i]}$, we consider $N$ sub-problems, i.e.

(8)   $\dot{y}(t) = F^{[i]}(y(t)), \qquad i \in \{1, \ldots, N\}.$

In this case, the Lie-Trotter splitting becomes (see [23])

(9)   $\Phi_h = \Phi^{[N]}_h \circ \cdots \circ \Phi^{[2]}_h \circ \Phi^{[1]}_h.$

For example, since in the next sections we will use the case $N = 3$, we observe that formula (9) is in fact

(10)   $\Phi_h = \Phi^{[3]}_h \circ \big( \Phi^{[2]}_h \circ \Phi^{[1]}_h \big).$

This is equivalent to dividing the evolution equation into two sub-problems and using a numerical approximation to the solution of the second sub-problem. Furthermore, the first sub-problem will itself be divided into another two sub-problems, which can also be decomposed according to the Lie-Trotter splitting.
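The accuracy gap between the Lie-Trotter and Strang compositions can be observed on a toy linear field $F(y) = (A + B) y$ with non-commuting matrices, where the exact sub-flows are matrix exponentials. Everything below (the matrices, the truncated-series exponential) is illustrative and not part of the paper.

```python
# Compare the Lie-Trotter (LTS1) and Strang (SS1) compositions against the
# exact flow of ydot = (A + B) y for non-commuting 2x2 matrices.
import numpy as np

def expm(M, terms=30):
    # truncated exponential series; accurate for the small matrices used here
    E, P = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        P = P @ M / k
        E = E + P
    return E

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [-1.0, 0.0]])    # A and B do not commute

h, n = 0.01, 100                           # n steps of size h, up to T = 1
lie_trotter = expm(h * B) @ expm(h * A)                         # (LTS1)
strang = expm(0.5 * h * A) @ expm(h * B) @ expm(0.5 * h * A)    # (SS1)
exact = expm(n * h * (A + B))

err_lt = np.linalg.norm(np.linalg.matrix_power(lie_trotter, n) - exact)
err_ss = np.linalg.norm(np.linalg.matrix_power(strang, n) - exact)
```

As expected from the orders of the two methods, the Strang error is roughly two orders of magnitude below the Lie-Trotter error at this stepsize.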
The case of unbounded linear operators, when $F^{[j]} y = A_j y$, has been studied in [26]. Furthermore, in [25], Hansen et al. considered the splitting for the case of semilinear evolution equations, when one of the sub-problems is not linear, i.e. $\dot{y}(t) = A y(t) + G(y(t))$. For both these cases, the Lie-Trotter splitting and the Strang-Marchuk splitting can be defined with the above formulas, using the exact flows or some approximations to the exact solutions of the sub-problems.
Now, we turn our attention to the case of non-autonomous evolution equations of the form
(11)   $\dot{y}(t) = F(t, y(t)).$
There are several approaches for this kind of problem. In [10], Batkai et al. used an approach similar to that of Hansen, Kramer and Ostermann, considering the case when the non-autonomous vector field splits as $F(t, y) = F^{[1]}(t, y) + F^{[2]}(t, y)$. In their paper, they applied the Lie-Trotter splitting to the two non-autonomous sub-problems. For brevity, we recall the sequential splitting (or Lie-Trotter splitting) on the interval of discretization $[t_n, t_{n+1}]$:

(12)   $\begin{cases} \dot{u}(t) = F^{[2]}(t, u(t)), & t \in [t_n, t_{n+1}], \quad u(t_n) = y_n, \\ \dot{w}(t) = F^{[1]}(t, w(t)), & t \in [t_n, t_{n+1}], \quad w(t_n) = u(t_{n+1}), \end{cases}$

where $y_n$ represents the previous discretization value, which is taken as the initial condition of the first sub-flow. Also, from the above numerical scheme we get that $y_{n+1} = w(t_{n+1})$. On the other hand, composing these sub-flows over all discretization intervals, we obtain the full discretization of the Lie-Trotter splitting, given by the formula (LTS2). In the following sections we will employ this method, even though Batkai et al. have provided another formula, valid when each sub-problem can be solved exactly on every interval. The only difference is that in (12) we can evaluate the vector field exactly on the current discretization interval; since we consider the updated value much more appropriate for the discretization schemes, we will use the previous formula. Also, we provide a reference for some splitting techniques that can be applied to optimization problems: see the work of Blanes et al. [12]. We emphasize the fact that, recently, in [2] the authors have used these splitting techniques in order to devise some optimizers used in neural network training. That analysis is in contrast with the present article, since in [2] the optimization algorithms were developed in a heuristic manner, based upon algebraic manipulations. Nevertheless, this shows the applicability of splitting-type methods to machine learning. Finally, we point out that splitting techniques have also been used recently in [21], where the authors employed balanced and rebalanced splitting methods in connection with the equilibrium states of some ODEs associated to proximal algorithms. This differs from our analysis by the fact that we use only the Lie-Trotter discretization scheme.
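A minimal sketch of the sequential splitting on successive discretization intervals for a toy non-autonomous linear problem; freezing the time variable of the time-dependent sub-flow at the left endpoint $t_n$ is a simplification for illustration, not the exact construction of [10], and all data below is invented.

```python
# One-interval sequential splitting for the toy non-autonomous system
# ydot = A1(t) y + A2 y: propagate the time-dependent part with t frozen
# at the node t_n, then the autonomous part from the resulting state.
import numpy as np

def expm(M, terms=30):
    E, P = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        P = P @ M / k
        E = E + P
    return E

A1 = lambda t: t * np.array([[0.0, 1.0], [-1.0, 0.0]])   # time-dependent rotation
A2 = np.array([[-0.5, 0.0], [0.0, -0.5]])                # autonomous contraction

def sequential_step(y, t, h):
    y = expm(h * A1(t)) @ y      # first sub-flow, time frozen at t_n
    return expm(h * A2) @ y      # second sub-flow, started from its output

y, t, h = np.array([1.0, 0.0]), 0.0, 0.01
for _ in range(100):             # march from t = 0 to t = 1
    y = sequential_step(y, t, h)
    t += h
```

Since the frozen rotation preserves the Euclidean norm and the contraction scales it by $e^{-h/2}$ per step, the final norm is $e^{-1/2}$ regardless of the splitting error in the phase.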
1.2 A brief review of symplectic methods
We consider the setting of the autonomous evolution equation given by formula (5). Let $y = (x, p)$ and consider $H = H(x, p)$ to be the Hamiltonian. We write the first order differential equation as a Hamiltonian dynamical system, using $F$ from (5), in the following form (see [22]):

(HS)   $\begin{cases} \dot{x}(t) = \nabla_p H(x(t), p(t)), \\ \dot{p}(t) = -\nabla_x H(x(t), p(t)). \end{cases}$
For the pioneering works on difference schemes related to symplectic geometry and basic Hamiltonian systems, we refer to [20] and [43]. Furthermore, for brevity, we recall (see [23] and [24]) the symplectic Euler methods related to (HS):

(SE1)   $\begin{cases} p_{n+1} = p_n - h \nabla_x H(x_n, p_{n+1}), \\ x_{n+1} = x_n + h \nabla_p H(x_n, p_{n+1}), \end{cases}$
respectively
(SE2)   $\begin{cases} p_{n+1} = p_n - h \nabla_x H(x_{n+1}, p_n), \\ x_{n+1} = x_n + h \nabla_p H(x_{n+1}, p_n). \end{cases}$
Other symplectic-type algorithms can be constructed with combinations of these implicit and explicit updates. Also from the work of Hairer, Lubich and Wanner [23], we recall a more advanced type of implicit-explicit symplectic integrators, namely the Störmer-Verlet schemes:

(SV1)   $\begin{cases} p_{n+1/2} = p_n - \dfrac{h}{2} \nabla_x H(x_n, p_{n+1/2}), \\[4pt] x_{n+1} = x_n + \dfrac{h}{2} \left( \nabla_p H(x_n, p_{n+1/2}) + \nabla_p H(x_{n+1}, p_{n+1/2}) \right), \\[4pt] p_{n+1} = p_{n+1/2} - \dfrac{h}{2} \nabla_x H(x_{n+1}, p_{n+1/2}), \end{cases}$
and a similar symmetric method
(SV2)   $\begin{cases} x_{n+1/2} = x_n + \dfrac{h}{2} \nabla_p H(x_{n+1/2}, p_n), \\[4pt] p_{n+1} = p_n - \dfrac{h}{2} \left( \nabla_x H(x_{n+1/2}, p_n) + \nabla_x H(x_{n+1/2}, p_{n+1}) \right), \\[4pt] x_{n+1} = x_{n+1/2} + \dfrac{h}{2} \nabla_p H(x_{n+1/2}, p_{n+1}). \end{cases}$

For brevity, we point out that $p_{n+1/2}$ from (SV1) and $x_{n+1/2}$ from (SV2) represent just notation for the intermediate steps that are used in the construction of the algorithms. Interestingly, under general assumptions, for a symplectic integrator of convergence order $r$ one has that $\left| H(x_n, p_n) - H(x_0, p_0) \right| \leq C h^r$, where the constant $C$ does not depend on the iteration index $n$ over exponentially long time intervals, namely $nh \leq e^{h_0 / (2h)}$. This leads to a very good energy behavior of symplectic methods, and so these methods represent a good choice for the discretization of a conservative Hamiltonian system.
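The favorable energy behavior can be observed on the harmonic oscillator $H(x, p) = (x^2 + p^2)/2$, for which the symplectic Euler update is fully explicit since $H$ is separable; the stepsize and horizon below are illustrative.

```python
# Energy drift on H(x, p) = (x^2 + p^2) / 2: symplectic Euler nearly
# conserves H over long times, while plain explicit Euler inflates it.
def run(symplectic, h=0.01, steps=10000):
    x, p = 1.0, 0.0
    for _ in range(steps):
        if symplectic:
            p = p - h * x                  # momentum step at the old x ...
            x = x + h * p                  # ... position step at the NEW p
        else:
            x, p = x + h * p, p - h * x    # explicit Euler: both old values
    return 0.5 * (x * x + p * p)

drift_symplectic = abs(run(True) - 0.5)    # initial energy is 0.5
drift_explicit = abs(run(False) - 0.5)
```

For explicit Euler the energy is multiplied by exactly $1 + h^2$ at every step, so it grows without bound, while the symplectic drift stays bounded of order $h$.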
Also, it is known that if the Hamiltonian is separable, i.e. $H(x, p) = T(p) + U(x)$, then (HS) becomes

(13)   $\begin{cases} \dot{x}(t) = \nabla T(p(t)), \\ \dot{p}(t) = -\nabla U(x(t)). \end{cases}$
Furthermore, dividing (13) into the two sub-problems

(14)   $\begin{cases} \dot{x} = \nabla T(p), \\ \dot{p} = 0, \end{cases} \qquad \text{and} \qquad \begin{cases} \dot{x} = 0, \\ \dot{p} = -\nabla U(x), \end{cases}$

and applying the explicit forward Euler discretization to both sub-problems, the symplectic Euler methods (SE1) and (SE2) lead to the Lie-Trotter full discretizations (LTS1) and (LTS2). The same analysis can be carried out for the Störmer-Verlet methods (SV1) and (SV2), which are equivalent to the Strang-Marchuk discretization schemes.
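The claimed equivalence is a one-line computation: composing forward Euler steps of the two sub-problems (14), momentum update first, reproduces the symplectic Euler update exactly (the sub-steps are even exact here, since each sub-problem freezes one variable). A sketch with illustrative choices of $T$ and $U$:

```python
# Composing the two sub-steps of (14), momentum first, reproduces the
# symplectic Euler update for a separable Hamiltonian H(x, p) = T(p) + U(x).
def sub_momentum(x, p, h, dU):       # pdot = -U'(x), xdot = 0
    return x, p - h * dU(x)

def sub_position(x, p, h, dT):       # xdot = T'(p), pdot = 0
    return x + h * dT(p), p

def symplectic_euler(x, p, h, dU, dT):
    p_new = p - h * dU(x)            # momentum step ...
    return x + h * dT(p_new), p_new  # ... then position step at the new p

dU = lambda x: x                     # U(x) = x^2 / 2 (illustrative)
dT = lambda p: p                     # T(p) = p^2 / 2 (illustrative)
x0, p0, h = 0.7, -0.3, 0.05

xa, pa = sub_position(*sub_momentum(x0, p0, h, dU), h, dT)
xb, pb = symplectic_euler(x0, p0, h, dU, dT)
```

The two composed sub-steps and the symplectic Euler step produce identical states, which is exactly the identification of (SE1)-type schemes with Lie-Trotter compositions.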
Recently, in [22], França et al. devised some optimization schemes based on Hamiltonian systems perturbed by a dissipation term, of the form

(HSDγ)   $\begin{cases} \dot{x}(t) = \nabla_p H(x(t), p(t)), \\ \dot{p}(t) = -\nabla_x H(x(t), p(t)) - \gamma p(t), \end{cases}$

where the separable Hamiltonian is defined as $H(x, p) = \dfrac{1}{2m} \| p \|^2 + f(x)$, with $m > 0$ related to the inertial positive coefficient. Here, the first term is the kinetic energy and $f$ is the potential energy of the system. Furthermore, the heavy ball with friction system (HBFγ) is a particular case of (HSDγ) and can be written as

(15)   $\begin{cases} \dot{x}(t) = p(t), \\ \dot{p}(t) = -\gamma p(t) - \nabla f(x(t)), \end{cases}$

and the Hamiltonian of the problem is a Lyapunov function that satisfies $\dfrac{d}{dt} H(x(t), p(t)) = -\gamma \| p(t) \|^2 \leq 0$ along the trajectories. Furthermore, it was shown that Polyak's method (PIMγ) is a conformal symplectic Euler method for (15), but Nesterov's method is not. Also, Polyak's algorithm can be shown to be equivalent to a Lie-Trotter splitting (in connection with the decomposition of (HSDγ)), where for the conservative sub-problem one applies a symplectic Euler method and for the dissipative sub-problem the exact solution is used (for details see [22]). Finally, we recall that for the heavy ball system (HBFγ) without friction (when $\gamma = 0$), the symplectic Euler and the symplectic Störmer-Verlet algorithms can be considered as the natural discretizations of the underlying ODE (see [23] and [37]).
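A sketch of the conformal splitting idea described above for (15): the dissipative sub-problem $\dot{p} = -\gamma p$ is solved exactly by an exponential factor, followed by a symplectic Euler step on the conservative part. The objective and parameters are illustrative, and this is a simplified reading of the construction in [22], not its verbatim algorithm.

```python
# Conformal-splitting sketch for the dissipative system (15): exact flow
# of the friction sub-problem (p -> exp(-gamma*h)*p), then a symplectic
# Euler step on the conservative Hamiltonian part.
import math

def conformal_step(x, p, h, gamma, grad):
    p = math.exp(-gamma * h) * p    # exact flow of pdot = -gamma * p
    p = p - h * grad(x)             # symplectic Euler on the
    x = x + h * p                   # conservative part
    return x, p

grad = lambda x: x                  # f(x) = x^2 / 2, minimizer at 0
x, p = 2.0, 0.0
for _ in range(2000):
    x, p = conformal_step(x, p, h=0.05, gamma=1.0, grad=grad)
```

The iterates behave like a damped oscillator and contract to the minimizer, mirroring the monotone decay of the Hamiltonian along (15).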
1.3 The organization of the paper
The outline of the present paper is the following. In Subsection 2.1, we show that Nesterov's algorithm can be viewed as a Lie-Trotter or sequential splitting for a Hessian-driven damping system. Furthermore, we also present an alternative way of writing the momentum-based methods with asymptotic vanishing damping that resembles the continuous dynamical system. Also, we present some well-known results for the heavy ball system and for the extended dynamical system with asymptotic vanishing damping, in connection with the symplectic Euler integrator.
In Subsection 2.3 and Subsection 2.4, we consider some results regarding various algorithms linked to symplectic integrators for the extended AVD system. In Section 3, we prove the rate of convergence of the Hessian-based algorithms in the framework of convex, differentiable objective functions whose gradient satisfies the Lipschitz continuity property. Finally, in Section 4, we validate our theoretical results with some numerical experiments involving various choices of the parameters of our newly introduced algorithms, in the framework of classical convex test functions. For conciseness, we now outline the analysis that will be carried out in the sequel.
In Subsection 2.2, we show some connections between the continuous and the discrete versions of some dynamical systems. The dynamical system (Extended-AVD) can be seen as a generalization of (AVDα) with an extra perturbation under the gradient that is similar to the updates in Nesterov's algorithm iterations. On the other hand, the (AVDα) dynamical system can be generalized through the Hessian-driven damping evolution equation (DIN-AVD), by using a perturbation term consisting of the product of the Hessian matrix and the gradient of the underlying solution. For conciseness, we recall that the difference between Polyak's method (PIMγ) and (AGM2α) lies in the fact that the momentum method is a splitting method for a system similar to (AVDα) but with constant damping, while Nesterov's method is also a splitting-based discretization combined with a symplectic Euler method, but for the extended continuous system. Quite surprisingly, our novelty, presented in the second section, is the fact that Nesterov's method can also be seen as a discretization (by using a triple splitting technique), but for the Hessian-driven damping system (DIN-AVD) with suitable choices of the damping coefficients. At the same time, regarding the Hessian-based dynamical system, as in Nesterov's case, the IGAHD algorithm (see [4]) is also a splitting-based method, where the symplectic Euler method is applied to the potential forces. The key point of our analysis is that the symplectic methods play a special role in the discretizations of the continuous dynamical systems. We emphasize the fact that we have four types of symplectic methods, namely (SE1), (SE2), (SV1) and (SV2). By a trial and error approach, these methods, in combination with splitting discretizations, applied to the Hessian-driven damping system (DIN-AVD), lead to unfeasible algorithms, due to the fact that the continuous evolution equation is of rather complicated nature.
Also, using these techniques, one obtains algorithms containing at least three gradient evaluations, or a product of the form $\nabla^2 f(x) \nabla f(x)$, which is a computational burden (even after a forward-type discretization). So our aim, in Subsection 2.3 and Subsection 2.4, is to focus only on the (Extended-AVD) dynamical system. We have also observed that the alternative variant of the symplectic Euler method, in contrast with the one used in [29], leads to the same empirical convergence as Nesterov's algorithm, but with an additional gradient evaluation. On the other hand, the Störmer-Verlet algorithms lead to three gradient evaluations, which is unfeasible. Accordingly, we will analyze in an intuitive manner only the Störmer-Verlet integrator (SV2) and the symplectic Euler integrator (SE1), which induce two gradient evaluations; this is similar to the two-gradient scheme IGAHD. Moreover, we will introduce another dynamical system for which a Lie-Trotter discretization yields the perturbed Nesterov-type algorithm introduced in [32].
2 The discretizations of the dynamical systems
2.1 Nesterov’s accelerated method is a Lie-Trotter splitting method applied to a Hessian-based dynamical system
In [33], it was shown that the system with vanishing damping can also be written as a first order dynamical system, namely (4). For brevity, in our first result we show that the general vanishing damping system with Hessian perturbation can also be written as a first order system. This result will also be used in this section, when we prove that Nesterov's method is actually a Lie-Trotter splitting applied to three sub-problems of this dynamical system.
Proposition 1.
Let $\alpha > 1$ and $\beta \geq 0$, and let $f : \mathbb{R}^n \to \mathbb{R}$ be convex and twice continuously differentiable. Further, consider the second order evolution equation

(16)   $\ddot{x}(t) + \dfrac{\alpha}{t} \dot{x}(t) + \beta \nabla^2 f(x(t)) \dot{x}(t) + \left( 1 + \dfrac{\beta}{t} \right) \nabla f(x(t)) = 0.$

On the other hand, for $t \geq t_0 > 0$, consider the non-autonomous first order dynamical system:

(17)   $\begin{cases} \dot{x}(t) = \dfrac{\alpha - 1}{t} \left( v(t) - x(t) \right) - \beta \nabla f(x(t)), \\[4pt] \dot{v}(t) = -\dfrac{t}{\alpha - 1} \nabla f(x(t)). \end{cases}$

Then, the dynamical systems (16) and (17) are equivalent, in the sense that each solution of the second order dynamical system is a solution of the first order system and vice versa.
Proof .
Let be a solution for (17). We will show that also satisfies the Hessian driven damping system (16). Firstly, using the fact that
by differentiation we obtain that
Since , one obtains that
which is equivalent to (16).
On the other hand, let be a solution to the Hessian damping system (16). We will show that, if we define as in the first equation of (17), then the derivative of the function satisfies the second equation of (17). In this sense, by the definition of , we have that
By differentiation, as before, we obtain that
Using the fact that satisfies (16), it follows that
Finally, this is exactly equivalent to the second line of (17).
Now, the following theorem is the main result of this subsection. In [33], the authors showed the equivalence between (AGM2α) and (NaG) for the case when . Moreover, we emphasize that their method closely resembles the forward Euler method applied to the dynamical system (17). This intuitive idea was pointed out in the authors' work. On the other hand, since under the gradient we use a predictor and not the true value at the discretization time , this method cannot be interpreted as a proper discretization of the first order dynamical system. Following this idea, we will show that Nesterov's method (NaG) can be viewed as a discrete counterpart of (17) using the idea of Lie-Trotter splitting. In fact, we show that Nesterov's accelerated gradient method is a Lie-Trotter splitting applied to a decomposition consisting of three subproblems at time .
Theorem 2.
Nesterov's accelerated gradient method (AGM2α) is the numerical discretization given by a Lie-Trotter splitting applied to two non-autonomous subproblems and one autonomous subproblem at the discretization time . Furthermore, an explicit forward Euler method is applied to each continuous subproblem.
Proof.
(I.) The first step of the proof is inspired by the work of A.M. Oberman and M. Laborde [33]. We will present an alternative form of Nesterov's algorithm (AGM2α) with general vanishing damping coefficient , which, as we will observe, is very similar to the continuous counterpart (17). For this, let and consider Nesterov's discretization scheme (AGM2α). At the same time, consider the following alternative form of Nesterov's accelerated gradient method:
(NaG) |
Moreover, let and . Then, we show that the algorithms (AGM2α) and (NaG) are equivalent. Furthermore, is equivalent to and is equivalent to the choice . We begin by rewriting the discretization scheme (NaG) as follows
Multiplying the second equation by and adding it to the first one, we obtain that:
Now, using the fact that and that , it follows that
In this way, we obtain Nesterov’s algorithm (AGM2α), with the inertial coefficient . Finally, if we take the discretization time to be , where is the stepsize, then . On the other hand, the choice leads to . These two represent, in general, the classical choices for the inertial sequence .
(II.) Now, the second step is to consider splitting the first order dynamical system (17) in the following subproblems:
Using the notation from the first section, the splitting discrete solution is . We consider to be the numerical solution for the first subproblem at time . Then, applying a forward Euler discretization on and taking , one obtains that
where is the current splitting solution at time . Now, also using a forward Euler discretization, we find the numerical solution at time for the continuous subproblem . As before, we split the subproblem into two non-autonomous subproblems, the first one containing constant velocity and the second one preserving the value of . So, let
Then, the numerical Lie-Trotter splitting on at the discrete time is defined as , where is the forward Euler operator that approximates the flow of the subproblem . If we apply the forward Euler numerical scheme to both new subproblems, we obtain the following:
Then, it follows that the Lie-Trotter approximation to can be written as follows
where is the approximation of the solution of the continuous problem at time and since we are employing a sequential-type splitting, it is the numerical solution at time of . In this way, putting all together, we obtain that
By taking and using the first step of the proof, we observe that the conclusion follows easily.
Finally, we give some observations regarding the Lie-Trotter splitting that was employed in the proof presented before.
Remark 3.
The numerical solution at time can be written as , where the three operators are the approximation to the flows of the three subproblems considered above. Using the ideas that we have presented, we observe that all of the approximations are made at the discrete time . Consequently, in contrast to the Strang-Marchuk splitting, we have used the stepsize and not in the numerical solutions of the subproblems. Finally, we observe that the splitting technique has the advantage that, at a fixed discrete time , the same stepsize and the same value of are used in all of the discretized subproblems.
Finally, we point out that, in the present proof, we have used the following technique: we took the dynamical system (17) and applied the Lie-Trotter splitting with the choice . Then, we showed the equivalence with NaG. But recall that in NaG we chose , and this is valid since that choice does not correspond to a dynamical system, as it is a parameter of the discretization scheme.
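To make the above discussion concrete, here is a minimal numerical sketch of the classical Nesterov iteration with vanishing damping: an extrapolation (predictor) step followed by a gradient step evaluated at the predictor, which is precisely the structure the Lie-Trotter interpretation explains. The step size `s`, the damping parameter `alpha`, and the test function are illustrative assumptions, since the paper's exact symbols were stripped from this source.

```python
import numpy as np

def nesterov_agm(grad, x0, s=0.1, alpha=3.0, iters=1000):
    """Classical Nesterov iteration with vanishing damping alpha/k:
    an inertial extrapolation (predictor) step, then a gradient step
    at the predictor. The coefficients below are the standard textbook
    choices and are assumptions, not the paper's exact formulas."""
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(1, iters + 1):
        beta = (k - 1) / (k + alpha - 1)   # inertial coefficient
        y = x + beta * (x - x_prev)        # predictor (inertial) step
        x_prev, x = x, y - s * grad(y)     # gradient step at the predictor
    return x

# usage on f(x) = ||x||^2 / 2, whose gradient is x and minimizer is 0
x_star = nesterov_agm(lambda x: x, np.array([5.0, -3.0]))
```

For a step size no larger than the reciprocal of the gradient's Lipschitz constant, this scheme attains the well-known O(1/k²) decay of the objective values.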
2.2 The role of symplectic methods in optimization algorithms
In this subsection we give a brief review of the role of symplectic integrators in the setting of unconstrained optimization problems. Furthermore, we point out that many classical inertial algorithms based upon dynamical systems with vanishing or constant damping can be interpreted as a combination of a Lie-Trotter splitting and suitable symplectic methods. Firstly, let's consider the heavy ball with friction model (HBFγ), together with its natural discretization, namely Polyak's algorithm given by (PIMγ). In [22], it was pointed out that the momentum method can be interpreted as a splitting discretization between conservative and dissipative vector fields. Indeed, consider the subproblems associated to (HBFγ) as follows
where the Hamiltonian is the energy of the entire dynamical system. The first subproblem represents the dissipative system and the second one, i.e. is the conservative system. Then, if one applies the explicit forward Euler method on and the symplectic Euler method (SE2) on , it follows that the Lie-Trotter discretization scheme at time , i.e. becomes
which is an alternative form of writing the momentum method (PIMγ), using the idea that, from the symplectic discretization of the second subproblem, is the discretization of the velocity of the underlying system.
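As a hedged illustration of this splitting, the composition of a forward Euler step on the dissipative part with a symplectic Euler step on the conservative part can be coded directly; the damping constant, the step size, and the test function below are illustrative assumptions.

```python
import numpy as np

def heavy_ball_splitting(grad, x0, s=0.1, gamma=1.0, iters=300):
    """Lie-Trotter splitting for the heavy ball system
    x'' + gamma * x' + grad f(x) = 0: a forward Euler step on the
    dissipative flow v' = -gamma * v, then a symplectic Euler step on
    the conservative flow (x' = v, v' = -grad f(x)), momentum first.
    A sketch of the splitting described in the text; the (SE2)
    ordering (momentum update, then position) is assumed."""
    x = x0.copy()
    v = np.zeros_like(x0)
    for _ in range(iters):
        v = (1.0 - s * gamma) * v   # dissipative subproblem (forward Euler)
        v = v - s * grad(x)         # conservative subproblem (symplectic
        x = x + s * v               #  Euler: momentum first, then position)
    return x

# the composed step is an alternative form of Polyak's momentum method
x_star = heavy_ball_splitting(lambda x: x, np.array([4.0, 1.0]))
```

Expanding the composed step recovers the usual momentum recursion: the new position depends on the previous position, the momentum term, and one gradient evaluation per iteration.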
Now, we turn our attention to the dynamical system that models Nesterov’s accelerated gradient method (AGM2α). This dynamical system is the true continuous counterpart of Nesterov’s algorithm, since under the gradient it contains a continuous version of the inertial term , i.e. this inertial term is a discretization of a combination between the exact flow and its first order derivative. For brevity, we recall the idea behind [29], where the authors considered Nesterov’s method as a combination of a Lie-Trotter splitting and a symplectic Euler method. For this, we consider a more generalized vanishing damping system Extended-AVD.
It is obvious that taking , one obtains Nesterov's algorithm (AGM2α). Furthermore, (Extended-AVD) can be written as a first order evolution equation, using the idea of velocity, as we have pointed out in the case of the heavy ball system:
(18) |
By the fact that the above system has no inherent energetic structure, in [29] the term was added and subtracted, so that one can use, as in the case of the heavy ball with friction system, the dissipative and conservative parts of the system. Thus, the following subproblems were considered:
Here, the first subproblem represents the non-potential forces and the second subproblem stands for the potential forces. Moreover, as in the case of the heavy ball with friction system, applying a forward Euler method to the first subproblem and the symplectic Euler integrator (SE2) to the second one leads to Nesterov's method (AGM2α).
Now, we shift our focus to the Hessian-driven damping dynamical system . First of all, we will present a natural discretization of this dynamical system that resembles a Polyak-type counterpart to the IGAHD algorithm from [4]. Second, we show that a discretization resembling the IGAHD algorithm can be obtained, as in the case of Nesterov's method, by combining a Lie-Trotter splitting method with a symplectic Euler integrator, but for a Hessian-based system endowed with a perturbation under the gradient as in (Extended-AVD). We consider the . The natural discretization of this system is a Polyak-type method, where at time the old iteration appears under the gradient. We approximate by the difference of consecutive gradients, as in the work of Attouch et al. [4], and we consider the discretization of the continuous system as follows:
By some easy computations, it follows that
Define . Then, we obtain that
(Polyak-IGAHD) |
Now, the only result of this subsection consists in showing that an algorithm similar to IGAHD (see [4]), with a predictor under the gradient, can be obtained, as in the case of (Extended-AVD), by applying a sequential splitting method with a symplectic Euler integrator on the second subproblem. Furthermore, the only difference between our algorithm and IGAHD is that the latter one contains instead . The choice seems unnatural, since at the discrete time the gradient of the flow is approximated by and not by the current gradient value. Nevertheless, by using the same approximation technique, we can easily obtain a discretization scheme similar to the IGAHD algorithm.
Theorem 4.
Let and consider the following Hessian driven damping second order system:
Then, a Lie-Trotter splitting discretization, with a forward Euler method on the first subproblem and a symplectic integrator on the second one leads to an IGAHD-type algorithm.
Proof.
As in the case of (Extended-AVD), we consider two subproblems, one containing the so-called non-potential forces and the other one containing the potential forces:
where
and the second subproblem as
Taking and defining we obtain that
Furthermore, for the second subproblem , we apply the symplectic Euler integrator (SE2) and obtain that
Using the fact that for , the splitting solution is the discrete solution of , namely , we obtain from the first equation given above, that . By using this and the approximation involving the Hessian, i.e. at time by two consecutive gradients, it follows that
(IGAHD-type) |
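A sketch of an IGAHD-type iteration in the spirit of the derivation above: the Hessian correction is realized as a difference of two consecutive gradients inside the inertial step, followed by a gradient step at the corrected predictor. The coefficient pattern follows the published IGAHD template of Attouch et al. [4]; since the exact coefficients of the scheme derived here were stripped from this source, the values below are assumptions.

```python
import numpy as np

def igahd_type(grad, x0, s=0.1, alpha=3.0, beta=0.5, iters=1000):
    """IGAHD-style iteration: Nesterov extrapolation corrected by a
    discrete Hessian term beta*sqrt(s)*(grad f(x_k) - grad f(x_{k-1})),
    then a gradient step at the corrected predictor. A hedged sketch
    following the template of Attouch et al. [4]."""
    rs = np.sqrt(s)
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(1, iters + 1):
        a_k = (k - 1) / (k + alpha - 1)
        # inertial step with a gradient-difference (Hessian) correction
        y = x + a_k * (x - x_prev) - beta * rs * (grad(x) - grad(x_prev))
        x_prev, x = x, y - s * grad(y)
    return x

# usage on f(x) = ||x||^2 / 2 with minimizer 0
x_star = igahd_type(lambda x: x, np.array([3.0, -2.0]))
```

As written, the sketch re-evaluates the gradient at the previous iterate; in practice that value is cached from the preceding iteration, so the scheme needs two fresh gradient evaluations per step.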
We end this section by emphasizing the role of choosing the appropriate first order evolution equation for the Hessian driven damping system in the splitting methods.
Remark 5.
1.) In the second order dynamical system from Theorem 4, under the gradient at , we can approximate as . In this way, we obtain the classical IGAHD algorithm from [4].
2.) The idea behind Theorem 4 is the fact that we separate the perturbed gradient from the unperturbed gradient value. Moreover, the latter one is similar to the (Extended-AVD) since, under the gradient, the value contains a combination of both the exact flow and its first order derivative .
3.) In [36], Shi et al. have studied a symplectic Euler method applied to a full Hessian driven damping system similar to DIN-AVD. The key difference between our analysis and theirs is that our algorithms are also based upon a Lie-Trotter discretization in combination with symplectic methods.
2.3 Discussion on some splitting techniques for the extended AVD dynamical system
Until now, we have used the symplectic Euler method (SE2); in the present subsection, we consider a Lie-Trotter splitting discretization scheme for the dynamical system (Extended-AVD), combining a forward Euler method on the first subproblem with the symplectic Euler integrator (SE1) on the second one. In this way we consider the following two subproblems:
By using a forward Euler discretization on and setting , one obtains that
Also, applying the symplectic Euler method (SE1) on the subproblem , it follows that
Now, as before, using the discretization time and denoting , we introduce the auxiliary term . So, we get that
By multiplying the second equation by and subtracting them, we obtain the perturbed velocity , which differs from the one obtained in the case of (SE2) by an extra gradient, namely:
Finally, we obtain our first algorithm, i.e.
(LT-SE1) |
Now, we consider a similar approach, namely on subproblem we apply the forward Euler method, and on we consider the Stormer-Verlet algorithm (SV2). As before, for the first subproblem we have that
Furthermore, using (SV2) on the second sub-problem, it follows that
As before, using , where is the discretization time and represents the inertial momentum coefficient, it follows that
In a similar spirit with the determination of the algorithm (LT-SE1), we obtain our second algorithm, i.e.
(LT-SV2) |
We emphasize that our newly introduced algorithms (LT-SE1) and (LT-SV2) differ by a term at the gradient : in the case of the Stormer-Verlet integrator, the perturbed velocity also contains a term which, intuitively, damps large gradient values more effectively.
Now, we shift our attention to an algorithm developed by N.C. Nguyen et al. in [32], which was used as a residual method for solving systems of equations. Using a variant related to the constant stepsize , the residual method is the following:
(ARDM) |
where the acronym stands for accelerated residual descent method. Also, N.C. Nguyen et al. considered a restarting strategy that is beyond the scope of the present paper. Furthermore, their arguments for developing (ARDM) are based upon the stability theory of finite difference approximations, and also upon an approach similar to that of Su et al. (see [39]). Our next purpose is to introduce a new dynamical system such that the optimization algorithm (ARDM) can be obtained as a Lie-Trotter splitting combining the forward Euler and the symplectic Euler method (SE2), as in the case of the (Extended-AVD) dynamical system. For this, we consider the following second order evolution equation:
(DynSys-ARDM) |
As before, we divide this evolution equation into dissipative and conservative parts, i.e. into non-potential and potential forces:
and
Using a forward Euler explicit method on and taking , we obtain that
On the other hand, using the symplectic Euler method (SE2) on , we arrive at the following equalities:
Now, in contrast with the previous algorithm, at the discrete time , we consider the additional iteration . Using this notation, we get the following:
As in our splitting-based algorithms presented above, we multiply the second equation by and, subtracting them, we get the algorithm (ARDM). Finally, we end this subsection with a remark that will play a key role in our theoretical analysis.
Remark 6.
So far, we have considered three algorithms, namely (ARDM), (LT-SE1) and (LT-SV2). The difference between the first one and the symplectic methods applied to the (Extended-AVD) dynamical system is that the accelerated residual method contains a perturbation only in the additional inertial term . The sequential Euler and Stormer-Verlet algorithms for the extended dynamical system with asymptotically vanishing damping contain a perturbation at and also at the current iteration value . In our terms, if an element appears at and contains the gradient , we say that the algorithm has a bias correction term. This is the case for the (LT-SE1) and (LT-SV2) algorithms. On the other hand, we can say that (ARDM) is free of bias correction. This difference can be seen in the definition of the inertial momentum term, where for (ARDM), contains . So, we now consider a general form of inertial-type algorithms that contain two gradient evaluations and that encompasses (ARDM), (LT-SE1) and (LT-SV2):
(LT-Symplectic) |
where and . Also, we point out that is the underlying weight that appears in the Lie-Trotter discretization and is the bias correction term. Lastly, we observe that for (LT-SE1) we have and . Furthermore, for (LT-SV2), these values are halved. On the other hand, (ARDM) can be obtained by setting and , for each .
2.4 Fixed points, spurious solutions and exploding gradients: a heuristic interpretation
In this section we present some intuitive explanations regarding the general algorithm of type LT-Symplectic. We consider the asymptotic behavior, the critical points and the equilibrium states related to our optimization methods. For the phenomena that we observe, we shall also present some remedies.
First of all, we start with Nesterov's algorithm AGM2α. Suppose that there exists such that the sequence converges to . Then, let . We observe that and so . This is equivalent to the fact that . At the same time, define the operator as . From all of this it follows that if , then , i.e. . So, the behavior of the Nesterov scheme is characterized by the fact that if the principal sequence converges, then the limit is both a minimizer of the convex function and a fixed point of the associated operator .
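The fixed-point characterization just described can be checked directly: for a convex differentiable f, a point minimizes f exactly when it is a fixed point of the gradient-step operator T(x) = x - s·∇f(x). The quadratic below is an assumed toy example, not taken from the paper.

```python
import numpy as np

def gradient_step_operator(grad, s):
    """T(x) = x - s * grad f(x); minimizers of a convex differentiable f
    are exactly the fixed points of T (for any s > 0), since T(x) = x
    holds if and only if grad f(x) = 0."""
    return lambda x: x - s * grad(x)

# toy example: f(x) = ||x - c||^2 / 2 has minimizer c and gradient x - c
c = np.array([1.0, -2.0])
T = gradient_step_operator(lambda x: x - c, s=0.1)

print(np.allclose(T(c), c))                      # True: c is a fixed point
print(np.allclose(T(np.zeros(2)), np.zeros(2)))  # False: 0 is not
```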
Now, we turn our attention to the IGAHD-type algorithm. Keeping in mind our notation from LT-Symplectic, let . In this case . The fact that the limit of the sequence is will be crucial in our analysis. We shall use the fact that two consecutive iterates of the gradient represent a natural approximation of the Hessian, and this is related to , since . As before, we have considered that the limit of . Again, denoting by the limit of , we find that, as in the case of AGM2α, (here an important property relies on the fact that the coefficient of the difference between consecutive gradients is bounded). The same conclusion is thus valid for the IGAHD-type algorithm, regarding the fixed points of the operator and the minimizers of . So, if the major iteration of the algorithm converges, then the limit is also a minimizer of the convex objective function .
We have said that the property is crucial. We call this phenomenon annihilating the exploding gradient: if the limit were strictly positive, then the gradient at would not vanish, and we would be left with a residual that, as time grows, tends to the gradient of , which could be large if were not a minimizer. For instance, let , such that . As before, let and . It is straightforward to show that , and so satisfies the following equation:
(19) |
It is easy to see that a minimizer satisfies (19). On the other hand, if is not a minimizer, then we have so-called spurious roots. This means that not all solutions of (19) are minimizers of the objective function . As in the previous cases, by the fact that , let be defined as before. A simple analysis shows that the limit of satisfies relation (19). Also, we have that if and only if . This is the reason we also call these elements spurious fixed points: they are fixed points of , but they can be represented as a perturbation of the limit point .
Now, let's consider the ARDM algorithm. Using the same notation as before, we have that , which is strictly positive. Carrying out the same analysis, we reach the following equation between and , i.e.
(20) |
This comes from (19) by taking . Also, as in the previous discussion regarding the modification of in IGAHD-type, we reach the same conclusion regarding the spurious roots of ARDM.
The algorithm LT-SE1 satisfies the requirement that . Also, letting and , one can show that we obtain the equation
(21) |
So, in all three cases above, namely (19), (20) and (21), the limit point is not necessarily a minimizer of the convex function . We also want to point out that, for LT-SE1, we have that . In this case, let be defined as . By simple computations, we obtain that, when is the limit of and is the limit of , we have . So, is equivalent to the fact that . Nevertheless, notice that means that is a minimizer. It is thus worth pointing out that the limit point minimizes if and only if is a fixed point of on the Cartesian product , i.e. a fixed point in the classical sense (on the diagonal set).
Bearing all this in mind, we present a remedy for the algorithms belonging to the class LT-Symplectic, such that they preserve fixed points, do not exhibit gradient explosion and do not introduce spurious roots. For LT-Symplectic, suppose that , and that . As before, let be the limit of and be the limit of . Then, by the fact that converges to such that the gradient is annihilated, we obtain that . This leads to , so . Consequently, , and is equivalent to . So, with the above conditions on the coefficients, we have a natural extension of the asymptotic behavior of Nesterov's algorithm AGM2α, where the operator is extended to the Cartesian product by means of .
We end this section by considering another, more particular remedy: we explicitly modify LT-SE1 so that it has the properties presented above. In the case of LT-SE1, by our symplectic approach, we have the perturbed velocity . This differs from the symplectic approach to Nesterov's algorithm AGM2α by the fact that, in our proposed scheme, . Regarding the Extended-AVD dynamical system, the velocity satisfies , i.e. the velocity of the continuous system approaches as time grows. We want our discretization method to reflect this property as well. So, we want to have , with . We recall that in the construction of LT-SE1 we used the technique of adding and subtracting in order to split the potential and non-potential forces. We shall use this trick again, but for , where vanishes at . Furthermore, we consider the discretized version, i.e. .
As for LT-SE1, we consider a Lie-Trotter splitting discretization scheme for the dynamical system (Extended-AVD), combining a forward Euler method on the first subproblem with the symplectic Euler integrator (SE1) on the second one. Consider the following two subproblems:
By using a forward Euler discretization on , it follows that
Also, applying the symplectic Euler method (SE1) on the subproblem , it follows that
Using the discrete time and keeping in mind that and that , we get:
By multiplying the second equation by and subtracting them, we obtain the modified velocity, i.e.
In this way, we obtain the third symplectic-type method:
(LT-SE3) |
In this case, we have that and also . Finally, we observe that we are in the case of our previous discussion regarding the remedy concerning the exploding gradients and the spurious roots.
3 Proposed algorithms and their rate of convergence
Even though we have introduced new algorithms, namely LT-Symplectic, via an intuitive geometric approach by means of symplectic methods, we will focus our attention on a generalized Hessian driven damping system and not on the extended AVD system. The discussion of our choice is postponed to Remark 11. Hence, our aim is a theoretical and empirical convergence analysis of Nesterov's algorithm (AGM2α), the IGAHD algorithm from [4] and our approach via symplectic integrators, i.e. algorithms of the form LT-S-IGAHD.
3.1 Modified Hessian-driven damping continuous and discrete systems
In the previous sections, we have presented the key idea behind the splitting technique applied to gradient-type dynamical systems. We have considered the LT-Symplectic and LT-SE3 algorithms, which have a geometric interpretation related to symplectic integrators. We observe that all these optimization algorithms have two coefficients in front of the gradient value at , namely and , for each . On the other hand, there are Hessian-driven damping discretizations like IGAHD-type, which contain the difference of two consecutive gradients, i.e. along with , under the gradient of the inertial element . Since the effect of the Hessian is to dampen oscillations, it is more natural to consider a Hessian-driven damping algorithm. So, in the present subsection we shall modify some of the previous dynamical systems in order to make consistent modifications to our proposed algorithms. At the same time, the algorithms we propose avoid the influence of spurious roots and exploding gradients (explained in Subsection 2.4), preserve the Hessian structure and are derived from a symplectic perspective, like the other methods derived above. To this end, for we consider the following generalization of IGAHD-type, namely Lie-Trotter-Symplectic-IGAHD, i.e.
(LT-S-IGAHD) |
where and the coefficients and are consistent with the notations from LT-Symplectic. Furthermore, for , and , we obtain IGAHD-type as a particular case. On the other hand, taking into account Subsection 2.4, it is natural to consider such that . Also, for the case of IGAHD-type this property is easily verifiable, with as . Now, we present the main result of this subsection, in which we introduce a new dynamical system such that LT-S-IGAHD is a Lie-Trotter splitting discretization.
Theorem 7.
Let and consider the following Hessian driven damping second order system:
where and are given functions. Also, for a pair of the form we have used the notation
Then, a Lie-Trotter splitting discretization, with a forward Euler method on the first subproblem and a symplectic integrator on the second one leads to LT-S-IGAHD algorithm.
Proof.
As in the case of (IGAHD-type), we consider two subproblems, one containing the so-called non-potential forces and the other one containing the potential forces:
and the second subproblem as
Taking , approximating , and , and defining we obtain that
where we have the notation for the auxiliary iteration element , i.e. . Furthermore, for the second subproblem , we apply the symplectic Euler integrator (SE2) and obtain that
Using the fact that for , the splitting solution is the discrete solution of , namely , we obtain, from the definition of , that . Using the approximation technique for as in Theorem 4, it follows that our algorithm is identical to LT-S-IGAHD.
Now, we will briefly give a chain of observations that will give insights into the construction of LT-S-IGAHD type algorithms.
Remark 8.
The same principle as derived in Theorem 7 can be applied to DynSys-ARDM. It is easy to see that if we add a perturbation of the form , then we obtain an additional element in the discrete version of the exact solution, i.e. . Furthermore, we explain in detail the technical reason why the newly added iteration is significant in the process of creating new splitting-type algorithms. For example, we refer to the particular case of the dynamical system from Theorem 7 when . This is the system corresponding to the IGAHD dynamical system from Theorem 4, but with an additional gradient with coefficient . Even though we have two gradients on the right hand side, namely and , the first one is proportional to the one under the gradient of , i.e. . This basic observation (also valid for the term involving the Hessian, which likewise has as a coefficient of proportionality) leads to the crucial idea by which we can construct algorithms of the form LT-S-IGAHD. This is equivalent to the fact that if a term proportional to the gradient at the current value does not appear under the gradient of , then this term will not appear in the inertial term but will appear in , i.e. it is added to and .
Remark 9.
In Subsection 2.4 we pointed out that, in order for the algorithms to have natural behavior, in the sense that they possess neither exploding gradients nor spurious fixed points, a natural technique is to make the division into potential and non-potential forces and to add and subtract , as in the case of LT-SE3. This method was used for the LT-SE3 algorithm because we employed the SE1 symplectic integrator. In the case of LT-S-IGAHD (as in the case of Nesterov's method AGM2α), the symplectic numerical scheme SE2 does not require such a process.
Remark 10.
One might argue that our Hessian-type algorithms LT-S-IGAHD have a major drawback, namely that they require the computation of at , along with . This is not the case, since was already computed in order to find the value of the inertial term ; hence the algorithms require the same computational time.
Remark 11.
One might wonder why we have not analyzed LT-Symplectic type algorithms theoretically and why we have introduced an additional gradient value . The motivation is twofold. First, the product is equal to , so it is natural to consider the effects of the Hessian perturbation. Also, the Hessian-type systems lead to a much clearer interpretation of numerical methods in optimization (see Subsection 2.1). Second, the previous value under the gradient leads to an intuitive discrete Lyapunov function, as we will see in the main result of the next subsection. Without this term, as tedious computations show, the Lyapunov function cannot be chosen naturally.
Remark 12.
Recently, in [18], L. Chen and H. Luo have presented some Hessian-driven damping dynamical systems. They have incorporated the effects of a dynamically changing damping coefficient. Also, they have given alternative proofs concerning the continuous and discrete energy functions using the so-called strong Lyapunov property. On the other hand, their optimization methods are connected to the first order system associated to the second order evolution equation containing the Hessian. This idea is related to the technique of A.M. Oberman and M. Laborde [33], in which Nesterov-type accelerated gradient methods can be interpreted as explicit discretizations of Hessian systems, but with an extra gradient descent step. We point out that the algorithms developed in [18] stem from an interpretation different from the one we developed in Subsection 2.1. Our algorithms have a beautiful geometric interpretation related to the idea of symplectic integrators, which almost conserve the energy, and splitting methods, which separate the potential and non-potential forces. Also, they have shown that Nesterov's method is an explicit method for a Hessian-type system, but with an additional gradient step. This can be seen in contrast with the present work, where we actually show in a rigorous manner how AGM2α can be constructed through the lens of the Lie-Trotter splitting.
3.2 The rate of convergence of the optimization methods
In the present section, we consider some necessary tools for the convergence analysis of the Hessian-driven damping discretization LT-S-IGAHD. For this, we recall that an important assumption concerning the theoretical rate of convergence of optimization algorithms is that the objective function has a Lipschitz continuous gradient. An interesting consequence is the so-called descent lemma (see inequality 2.1.6 on page 56 of [31]), which is presented below.
Lemma 13 (Descent lemma).
Let be Fréchet differentiable, such that is -Lipschitz continuous. Then, one has that
(DL) |
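As a quick sanity check of the descent lemma, one can verify the quadratic upper bound f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖² numerically on a function whose Lipschitz constant is computable; the quadratic test function below is an assumed example.

```python
import numpy as np

# Check the descent lemma on f(x) = ||A x||^2 / 2, whose gradient
# A^T A x is L-Lipschitz with L = ||A^T A||_2 (the spectral norm).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
L = np.linalg.norm(A.T @ A, 2)

f = lambda x: 0.5 * np.linalg.norm(A @ x) ** 2
grad = lambda x: A.T @ (A @ x)

ok = True
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    quad_bound = f(x) + grad(x) @ (y - x) + 0.5 * L * (y - x) @ (y - x)
    ok = ok and f(y) <= quad_bound + 1e-12
print(ok)  # True: the quadratic upper bound holds for every sampled pair
```

For this quadratic f the gap between the two sides is exactly ½(y−x)ᵀAᵀA(y−x), which is bounded by ½L‖y−x‖² by the definition of the spectral norm.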
It is worth noticing that if the objective function is convex, then we actually have an equivalence between DL and the -Lipschitz continuity of . Furthermore, Lemma DL gives a quadratic upper bound on the underlying objective function. Now, we briefly recall from [4] an extended variant of Lemma 13, where a modified variant of DL also contains the norm of the difference between two values of (as a side note, see also inequality 2.17 on page 57 of [31]).
Lemma 14 (Extended descent lemma).
Let be convex and Fréchet differentiable, such that is - Lipschitz continuous. Consider a parameter , such that . Then one has that
(EDL) |
In what follows, we present a technical lemma that will play a crucial role in our main result regarding the asymptotic convergence of the Hessian-type algorithms. So, it is time to 'extend' the extended descent lemma (Lemma 14). This represents our first result of this section.
Lemma 15 (Three points extended lemma).
Let be a convex and Fréchet differentiable function, such that is - Lipschitz continuous. Let . For simplicity, consider the following notations: , , , and , where . Then, the following inequality is satisfied:
(E-EDL) | ||||
Proof.
Let . For simplicity, consider . Applying DL for and , it follows that
This leads to
So, it follows that
(22) |
Taking into account that the $L$-Lipschitz continuity of $\nabla f$ is equivalent to the $\frac{1}{L}$-strong convexity of the conjugate function $f^*$, and applying this property to the pair , we get that
Using the fact that , it simplifies to
Using Fenchel identity for , namely , we get
So, this is equivalent to
Again using Fenchel identity but applied on , namely , it follows that
(23) |
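For completeness, we record the two classical convex-analysis facts used in this part of the proof, stated generically for a convex, Fréchet differentiable $f$ with conjugate $f^*$:

```latex
% Fenchel identity at a point of differentiability:
f(u) + f^{*}\big(\nabla f(u)\big) = \langle \nabla f(u), u \rangle .
% Moreover, L-Lipschitz continuity of \nabla f is equivalent to the
% (1/L)-strong convexity of f^*, i.e. for every p, q and u \in \partial f^{*}(p):
f^{*}(q) \;\ge\; f^{*}(p) + \langle u,\, q - p \rangle
+ \frac{1}{2L}\, \| q - p \|^{2}.
```

Both facts are standard; see, e.g., the convex analysis background invoked in [4] and [31].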
Taking into account inequalities 22 and 23, we obtain
(24) | ||||
From the lemma's assumptions, we know that , and . From inequality 24 and from the condition on , one obtains that
Through some simple computations, it follows
Hence,
Using the fact , the last inequality leads to
Using the notations for , , , and , the inequality E-EDL follows easily.
Corollary 16.
Before embarking on the main result of the present paper, we consider some technical lemmas that will be used in what follows. These results involve quadratic inequalities with nonconstant coefficients. Even though they are elementary results, for completeness we give them along with their proofs.
Lemma 17.
Let be given functions. If there exists , such that the following inequalities are satisfied: and , then satisfies .
Proof.
First of all, we consider the following computations:
This is equivalent to the fact that
Now, . Taking into account that and dividing by , it follows that . Likewise, let . As before, we observe that these terms are positive and, multiplying by , we obtain that . So, wrapping things up, we need to show that , since this is equivalent to the quadratic inequality with nonconstant coefficients. Because , it follows that . Combining this with the fact that the left hand side of the inequality is positive, the conclusion follows.
Lemma 18.
Let be given functions. If there exists , such that the following inequalities are satisfied: , and , where: and , then satisfies .
Proof.
As before, we must show that , so we must analyze two cases:
-
1)
, that is equivalent to . Thus, we have that . But, it is easy to see that , i.e. . So, it follows that . Squaring the positive terms, it follows that .
-
2)
, i.e. , which is equivalent to . But, since , it follows that , so . Hence, using that and squaring the positive terms, we get the conclusion that .
Before stating our main result, we make some observations regarding the functions that appear in Lemma 17 and Lemma 18.
Remark 19.
Related to Lemma 17 and Lemma 18, we have the following:
-
•
For a given pair of functions , if the element exists, then the quadratic inequality takes place. But this does not mean that such a point exists for every pair of functions .
-
•
It is easy to observe that, in Lemma 17, the conditions and imply that . Furthermore, we do not require a positivity-type condition on the function .
-
•
In Lemma 18, we observe that and, if , then is trivially satisfied. As before, we do not need the positivity of .
In what follows we present the main result of this section. Given a convex objective function endowed with a Lipschitz continuous gradient, we show that the rate of convergence of the algorithms given by the general discretization LT-S-IGAHD is as , under some suitable conditions on the parameters. Given (under the assumption that ), the Lyapunov analysis that we employ is based upon the discrete energy defined as:
(Discrete-Energy) | ||||
where , with . We shall also use , so it follows that . Moreover, our proof technique is inspired by those given in [4] and [6].
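As a guide to the display above, we note that the discrete energies used in [4] and [6] follow a common Nesterov-type template. The sketch below is that generic template, with a generic scaling sequence $t_k$, stepsize $s$ and minimizer $x^\ast$; it is not the exact Discrete-Energy functional of this paper, whose coefficients are those of the display above.

```latex
\mathcal{E}_k \;=\; t_k^{2}\,\big( f(x_k) - f(x^{\ast}) \big)
\;+\; \frac{1}{2s}\,\big\| (x_{k-1} - x^{\ast}) + t_k\,( x_k - x_{k-1} ) \big\|^{2}.
```

The proof strategy then consists in showing that $(\mathcal{E}_k)_{k \ge k_0}$ is non-increasing, from which the rate $f(x_k) - f(x^{\ast}) \le \mathcal{E}_{k_0} / t_k^{2}$ follows for $k \ge k_0$.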
Theorem 20.
Let be a convex and Fréchet differentiable function whose gradient is - Lipschitz continuous. Also, suppose that . Let be generated by the algorithm LT-S-IGAHD for every and, for simplicity, denote as , which satisfies the condition . Suppose that the sequences and are convergent. Likewise, assume that the following assumptions are satisfied:
-
i)
there exists such that ;
-
ii)
.
For , the sequence defined in Discrete-Energy is non-increasing and the following convergence rate is satisfied:
Proof.
(Step I) The first step of the proof is to apply the so-called three points extended lemma, namely Lemma 15, and then construct the difference of the discrete energy Discrete-Energy.
We apply the inequality E-EDL with the pair , and . Then, we obtain that:
This means that
(25) | ||||
As before, we will apply E-EDL, but with the choices , , and . It follows that:
Using the fact that , so , we obtain
(26) | ||||
Using the definition of , we consider
(27) |
hence . Now, we multiply 25 by and then add it to 26. One obtains that
By grouping up the terms, it follows that
We rewrite the left hand side of the inequality from above as follows:
Then, the above inequality takes the following form:
(28) | ||||
Using the fact that , for every , hence for every satisfying 27, we multiply the inequality 28 by and it follows that:
Since and using the fact that and that (since ) for every and thus for each satisfying 27, it follows
Using the definition of given in Discrete-Energy, one can easily see that
(29) | ||||
In computing the last term involving and , we shall use the following well-known equality:
(30) |
Armed with the definition of given in Discrete-Energy, we have the following estimations:
But, we know that for each , consequently for every satisfying 27, we get that
From the definition of LT-S-IGAHD, we have that . Hence, it follows that
But, from LT-S-IGAHD, we have that . This leads to
(31) |
It is easy to remark that assumption (ii) is equivalent to for every , by taking into account that and so . Now, multiplying by , it follows that . Using this relationship, 27 and 31, for , we have that
(32) |
At the same time, by using 30 and 32, we get
This means that
(33) | ||||
Combining 29 with 33, we obtain
On the other hand, we make the following estimations:
From LT-S-IGAHD, we know that and this leads to
So, it follows that
Using the fact that , one obtains
Recalling that assumption (ii) is equivalent to for every , we find that
So, the difference in the discrete energy functional can be written as
(34) |
where
(35) | ||||
(Step II) The second step of our proof consists in applying Lemma 17 and Lemma 18 to two quadratic inequalities with nonconstant coefficients and in showing the conditions on the index under which they are satisfied.
For simplicity, let us denote and . Hence, we obtain that
Using Cauchy-Schwarz inequality, it follows that
(36) | ||||
The next step of the present proof is to apply Lemma 17 and to show that, for large enough, . Using Lemma 17, we observe that the coefficient of is positive, namely for . Now, the only thing left is to show that
(37) |
This means that
Expanding the terms, it follows that
Grouping by and implies that
Finally, one has that
(38) | ||||
We rewrite 38 as
(39) |
where the coefficients are
Since , in order to use Lemma 18 and taking into account Remark 19, we only need to show that for large enough , hence . Now, this means that 39 is equivalent to the fact that . Based upon 38, we compute as follows:
Considering the above computations for and the form of from assumption (ii), we obtain, after some basic evaluations, exactly assumption (i), which is satisfied for each .
(Step III) In the last step of the present proof we collect the results from above, apply Lemma 17 and Lemma 18, and show the asymptotic rate of convergence. At the same time, we determine the index such that, for , the rate of convergence in the objective function is obtained.
From now on, we consider only , where , and the index from the hypothesis of the theorem. Now, we know from the theorem's hypothesis that and are convergent. By using assumption (ii), it follows that is also convergent. This implies that and are also convergent. Since these two sequences are convergent, they are bounded. Hence, the sequence is also bounded. So, let such that . Then, we can take (in this sense is well defined).
Consider 39, with the nonconstant coefficients , and . We apply Lemma 18 with , , and . Since , we can find infinitely many points that obviously satisfy , since . Also, the condition is equivalent to . Because , assumption (i) is in fact , i.e. (this stands true for ). This means that indeed we have , for each .
After these computations, we can now consider the quadratic inequality with nonconstant coefficients , with given in 36 and where . We apply Lemma 17 with , , and . So, in this case we consider with . The condition is obviously satisfied for . Furthermore, the condition is equivalent to 37. Moreover, this is in fact 39, which we have shown to be valid for . We conclude that for every with .
By 34, we find that , so , where . By the fact that and using the definition Discrete-Energy, we have that . Finally, we obtain that , for , and the proof is complete.
Remark 21.
Take the case when the sequence is constant, with . Then, the value of given in Discrete-Energy is . This discrete value differs significantly from that given in [4], where , because we have used and not in LT-S-IGAHD (see Remark 5 and Subsection 2.2). Last of all, we point out that, even though Discrete-Energy is similar to that used in [4], there are subtle differences: our choice of the discrete functional is optimal, since in we got rid of . This means that all other types of discrete Lyapunov functionals that can be constructed contain , which cannot be eliminated, because they have a nonconstant term in front of them.
Remark 22.
We observe that assumption (ii) of Theorem 20 reduces to the fact that , namely for each . Furthermore, following the last step of the previous proof and considering Lemma 17 and Remark 19, the following sequence of equivalences holds true:
So, in fact we have a stronger bound for , namely , and this is valid for all .
Remark 23.
In Theorem 20 we have considered the strict inequality from assumption (ii) to be valid for each . But, if we look closer at the third step of the proof of the same theorem, we observe that we could have taken , where was defined through , and . Hence, in this way, we obtain a more relaxed restriction on the index .
On the other hand, we observe that the assumptions that the sequences and are bounded, hence also , can be linked to our reasoning from Subsection 2.4, where we supposed the convergence of the sequences in our heuristic analysis.
We end this section by giving some examples of algorithms of type LT-S-IGAHD that satisfy the assumptions of Theorem 20, and we also show that such an index indeed exists for each of them. In our first example, we consider a general class of Hessian driven optimization algorithms with asymptotic vanishing damping that includes three parameters , and , which play a crucial role in the numerical behavior, in the sense that, from a scientific computing point of view, we gain much larger freedom and flexibility in the tuning of the parameters.
Example 24.
We consider the choice of the coefficients as follows:
, and , where , and . Now, we validate our algorithm as follows:
With the choices of the sequences and , we obtain the following chain of equivalences:
so assumption (ii) of Theorem 20 is satisfied for each .
We must show assumption (i) of Theorem 20. For this, we have the following equivalences:
Multiplying by , we get that
(40) |
Let . So , which means that , with . We consider the following cases:
-
•
If , namely then we can take , so the inequality is satisfied.
-
•
If , i.e. , then we can take .
Since
and taking into account 40, then it is enough to show that:
Finally, if then we can consider
and if , then we can take
The second example is also based upon three coefficients , and . It is worth noticing that, for the particular case when , we obtain an IGAHD-type algorithm that differs from that given in [4] (see Remark 5 and Subsection 2.2).
Example 25.
We consider the following choice of the coefficients:
, and , where , and . Now, we validate our algorithm as follows:
With the choices of the sequences and , we obtain the following chain of equivalences:
We must show assumption (i) of Theorem 20. For this, we have the following equivalences:
By the fact that is positive and since , this term can’t be . So, the inequality from above is satisfied if and only if . Multiplying by , it follows that . Hence
(41) |
Since , there exists such that . We consider . From the Archimedean property, we know that for there exists such that ; thus the strict inequality implies that . For and , which are equivalent to and , we obtain that
Next, we can take and we obtain that 41 is satisfied for every . Finally, one can find the index in an explicit way. In order to do this, observe that 41 leads to a quadratic inequality, namely:
In this case, one can see that , where
Our last example consists in choosing to take negative values, while takes only positive values. Thus, is determined through assumption (ii) of Theorem 20.
Example 26.
We consider the choice of the coefficients as follows:
, and , where , and . Now, we validate our algorithm as follows:
With the choices of the sequences and , we obtain the following chain of equivalences:
so assumption (ii) of Theorem 20 is satisfied for each .
Now, regarding assumption (i), we consider the following equivalences:
Multiplying by , we get that
(42) |
Now, we analyze two cases:
1) Suppose that . Then, . We consider . This leads to the fact that . Thus, . At the same time, we have the following inequalities:
2) Suppose that . Then, . We consider . This leads to the fact that . Thus, . At the same time, we have the following inequalities:
Finally, we end this example by emphasizing that we can take if and if , and in this way we obtain an index that does not depend on the Lipschitz constant .
4 Numerical experiments involving convex functions
The role of the present section is to validate our theoretical results through some computational simulations, involving some convex test functions. Furthermore, we consider the particular cases of LT-S-IGAHD given in Example 24, Example 25 and Example 26 and study the role of the parameters for each of them. Also, our codes for the optimization algorithms are implemented in Matlab, and the analysis of the convexity of the chosen functions is based upon the fundamental theory that can be found in [11].
4.1 Algorithmic description and objective functions
We consider the algorithms of the form LT-S-IGAHD (with the stepsize ), starting from . We initialize the starting point for the functions . Furthermore, even though in many simulations is considered arbitrary, for consistency we choose , namely the first iteration at is given by a step of gradient descent. Throughout the present section we count the total number of iterations starting from , in order to encompass the initial value (even though our algorithms start from ). Furthermore, since for we need only the values , and , for brevity we set . Using tolerance values of low order , we employ the stopping criterion , or , in the case when the objective function admits a unique minimum point . For each of the functions that we consider, we compute the Lipschitz constant in order to present the behavior of the discretizations with respect to the parameters. Furthermore, we consider the stepsize , by the fact that in the optimal case, when , no interpretation can be given, since the algorithms stop just after a few iterations.
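To make the general structure concrete, the following Python sketch (the paper's experiments themselves are in Matlab) implements an IGAHD-type step in the spirit of [4]: an inertial step with vanishing damping, a Hessian-driven correction realized through gradient differences, and a closing gradient step; the first iterate is a plain gradient descent step, as above. The exact LT-S-IGAHD coefficients of this paper are not reproduced here, so the parameter names `alpha` and `beta` and their values are illustrative assumptions only.

```python
import numpy as np

def igahd_like(grad, x0, s, alpha=3.1, beta=0.5, iters=500):
    """An IGAHD-type inertial scheme in the spirit of [4]; two gradient
    evaluations per iteration, illustrative coefficients."""
    x_prev = x0.copy()
    x = x0 - s * grad(x0)                       # first iterate: gradient descent
    for k in range(2, iters + 1):
        g, g_prev = grad(x), grad(x_prev)
        y = (x + (1.0 - alpha / k) * (x - x_prev)   # vanishing damping
             - beta * np.sqrt(s) * (g - g_prev)     # Hessian-driven correction
             - (beta * np.sqrt(s) / k) * g_prev)
        x_prev, x = x, y - s * grad(y)              # second gradient evaluation
    return x

# Convex quadratic test problem f(x) = 0.5 x^T Q x, minimized at the origin;
# the Lipschitz constant of the gradient is the largest eigenvalue of Q.
Q = np.diag([1.0, 10.0])
x_final = igahd_like(lambda x: Q @ x, np.array([1.0, 1.0]), s=1.0 / (2 * 10.0))
print(np.linalg.norm(x_final))   # small residual norm
```

The stepsize is taken as half the inverse Lipschitz constant, consistent with the condition on in the text above.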
1) The case of a convex, but not strongly convex function (see [4]), namely , defined as
(F1) |
It is easy to observe that . Furthermore, the Hessian of is . Taking , we infer that , so the Hessian matrix . Using the second order characterization of convexity (see Theorem 7.12 from [11]), we find that the function is convex. Furthermore, computing the solution of the equation and using the fact from above that the objective function is convex, by Theorem 7.9 of [11] we obtain that the set of stationary points is given by . By the fact that we use the norm on the gradient, we aim at evaluating the subordinate matrix norm. Since this norm equals the largest singular value, i.e., for a symmetric matrix, the largest absolute value of its eigenvalues, we find that , so the Lipschitz constant of the gradient is also . On the other hand, since the set contains infinitely many elements, we use the stopping criterion , for a given tolerance sufficiently small. Moreover, it is also natural to adapt the Discrete-Energy function for the case when we do not actually know to which minimizer our algorithm converges. For this, we consider
(43) | |||
where is determined by the stopping condition, i.e. when we reach the tolerance factor , we consider the last iteration value by the fact that for small enough, is close to .
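The claim that the Lipschitz constant of the gradient can be read off from the Hessian can be checked numerically: for a symmetric matrix, the subordinate Euclidean norm equals the largest absolute value of its eigenvalues. The matrix below is illustrative only, not the Hessian of F1 itself:

```python
import numpy as np

# Illustrative symmetric (Hessian-like) matrix.
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])

spectral_norm = np.linalg.norm(H, 2)              # subordinate 2-norm
max_abs_eig = np.abs(np.linalg.eigvalsh(H)).max()  # largest |eigenvalue|

# For a symmetric matrix the two quantities coincide (here both equal 3,
# since the eigenvalues of H are 1 and 3); this is how a Lipschitz constant
# of the gradient is read off a constant (or uniformly bounded) Hessian.
assert np.isclose(spectral_norm, max_abs_eig)
assert np.isclose(max_abs_eig, 3.0)
```

For a nonconstant Hessian, one instead bounds this quantity uniformly over the domain, as done in the text for F1 and F2.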
2) The hybrid norm function , defined as
(F2) |
The gradient of the objective function has the form . On the other hand, the Hessian matrix associated to is given by .
Taking
we obtain that , so the Hessian satisfies . As in the case of F1, we use Theorem 7.12 from [11] and we obtain the convexity of F2. Furthermore, computing the solution of the equation and using the fact from above that the objective function is convex, we obtain that the set of stationary points is given by (this follows from Theorem 7.9 of [11], namely the necessity and sufficiency of stationarity). By the fact that we use the norm on the gradient, we aim at evaluating the subordinate matrix norm. It is easy to observe that . Since , by elementary algebra, we have that , so the tightest Lipschitz constant of the gradient is . At the same time, since the set contains just one global minimum point, we use the stopping criterion , where is the unique element of . Also, we consider Discrete-Energy for the Lyapunov functional.
[Figure: the set of minimum points for each of the two objective functions]
Following the proof of Theorem 20, we consider and given by , and as in the proof of the aforementioned theorem. Now we turn our attention to some choices of the parameters that appear in the three examples from the previous section, concerning the index . However, we specify that, even though , and depend on the Lipschitz constant of the gradient (this is also valid in the case of Nesterov's method AGM2α), the empirical convergence is in fact determined by the index , which contains some predetermined values of the inherent parameters.
4.2 Some discussions over the algorithms parameters
In this subsection we consider some situations regarding the parameters from Example 24, Example 25 and Example 26. For each of them, we analyze in an empirical manner the key role of the underlying parameters in the convergence of the optimization algorithms of type LT-S-IGAHD.
Likewise, for two parameters and , we use the heuristic notation "" for the situation when is much smaller than and "" for the case when is much larger than , respectively. Therefore, our analysis in this subsection is heuristic in nature.
The case involving Example 24) We aim at considering two subcases:
A.) When . This is, in fact, the practical case when the parameters have small values. Because, in general, is relatively small, this allows us to take . For this, we consider the following situations:
-
A1.)
When . Now, if ( has a big value), then we can take , so . On the other hand, when , then we can take , so , i.e. , but annihilates the Lipschitz constant in the index value .
-
A2.)
When . If , then , so we may take . This is in fact the same situation as before when can take a big value, because the difference under the square root, namely , can be close to , in the sense that for , , i.e. . On the other hand, for the case when and , we find that , hence , if we can take large enough such that .
B.) When . By the fact that in many situations , we can take or . For this, we consider the following situations:
-
B1.)
When . If , then . We can take , so , which can be eventually large. On the other hand, if and , we can take , so , i.e. .
-
B2.)
When . If , thus , then we are in a similar situation to that presented above, since we can take close to such that the difference in the square root in the definition of is close to . On the other hand, if and , then we can take , thus , which is equivalent to .
The case involving Example 25) For this example, we consider the only valid situation when :
Since the stepsize has small values in many practical applications, we must consider or . For small values of , we observe that . Now, we consider the following subcases:
C1.) When and is small or has moderate values, we have that
C2.) If we are in one of the cases when or , then, by the analysis made above, we obtain that , since or .
C3.) If and dominates the other parameters, then it is easy to see that , i.e. . Furthermore, in order to have , then we must take .
The case involving Example 26) This is similar to the first example involving the non-negative coefficients , and . For brevity, we consider the following cases for our heuristic interpretation:
D.) When . Since many times , we have that . Here, we consider the following subcases:
-
D1.)
When . If and , i.e. is small and has large values, we find that . On the other hand, if and , we obtain that .
-
D2.)
When . If , then and, as before, . On the other hand, if and , it follows that , i.e. the index is small.
E.) When , then . Likewise, in practical situations we are obliged to take . We consider the following subcases:
-
E1.)
When . In order to obtain the optimal case when , we can take , more precisely . It is trivial to see that since is large, then is also large.
-
E2.)
When . As in the previous case, we consider the optimal case when , so we can easily take . In this situation , so .
4.3 A computational perspective over the optimization algorithms
In the present subsection, we consider some numerical experiments over the algorithms of type LT-S-IGAHD, taking into account the theoretical rate of convergence in the convex objective function obtained in the proof of Theorem 20. At the same time, we precisely consider the cases and the subcases presented in Subsection 4.2.
In Figure 2(a), we have considered the convex objective function F2 along with the initial condition
Also, we have set and the tolerance . We present the behavior of the discretization schemes along the trajectories when the stepsize is taken as , in order to simulate the underlying dynamical systems. In Subfigure 2(a), we have considered the IGAHD algorithm with and (black color) and the case when and (blue color). Furthermore, we have also considered the LT-S-IGAHD from Example 25, by the fact that, when , the algorithm becomes an IGAHD-type method. We have chosen the following parameters: , , (red color), , , (saddle brown color, in RGB), , , (yellow color) and , , (rustic red color, in RGB). With these choices of coefficients, we observe that our algorithms have trajectories very close to the ones of IGAHD. This is validated by the theory, where, when is taken small, the only difference between the algorithms given in Example 25 and IGAHD is between and . On the other hand, in Subfigure 2(b), we have considered the algorithm from Example 26, with the following choices of the algorithm's coefficients: , , (red color), , , (blue color) and, finally, , , (black color). In this situation we observe the phenomenon of high amplification of the coefficients, namely, when the parameters have large values, the trajectories have a high acceleration toward the critical point. From this subfigure and also from our preliminary numerical simulations, we have observed that there are some optimal choices of the parameters that lead to a faster convergence, given by a low number of iterations for a given tolerance.
[Figure 2: trajectories of the considered algorithms for the objective function F2]
Now, we focus on the discrete velocity , where . This is considered in Figure 3. We have taken , and
For the algorithm LT-S-IGAHD from Example 25, we have set the values of the parameters as follows: , and . In Subfigure 3(a) we have plotted the first component of the velocity for , while in Subfigure 3(b) we have taken a smaller stepsize . We observe that, in the beginning, the velocity presents high oscillations and then it stabilizes toward the value , and this is related to the fact that the velocity of the underlying evolution equation has the same behavior. The same stands true for the second component of the velocity and for other choices of the algorithm's parameters. Moreover, we observe that the stabilization process acts much faster than the convergence of the numerical scheme, when the tolerance is of low value.
Lastly, we turn our attention to Table 1, Table 2, Table 3 and Table 4. Here, we have validated all of the cases from Subsection 4.2 involving both convex functions F1 and F2. First, note that the error represents the difference between two consecutive values of the objective function, which is less than the given tolerance; along with this, it satisfies an additional stopping criterion, namely , where . We observe that our heuristic analysis stands very close to the simulations that we have made. For this, observe the values of given for each of the cases. An important difference is in the case , where the value has an error, by the fact that the coefficient is not much larger than . Furthermore, another difference is for the case , when we have not taken , and this shows that the heuristic interpretation that we gave leads to instability in the coefficients, in the sense that the parameters must take extreme values (see also the situation ). This is linked to the fact that in Subsection 4.2 we have considered an asymptotic analysis in which some coefficients have very small values, while others have very large values. The cases , and justify the difference between the simulations and the heuristic analysis. On the other hand, for all of the other cases presented in the four tables, is very close to the actual theoretical value. Likewise, it is interesting to note that the empirical values and satisfy an empirical relation, namely, in the case of moderate or high value coefficients, can be much greater than , so, in this sense, in the majority of cases, the index has a large influence on the convergence of the algorithms of type LT-S-IGAHD. Nevertheless, there are some cases when has a larger influence on the rate of convergence, i.e., when the coefficients have extremely high numerical values, then increases (with respect to the values presented above in the other cases) and decreases.
Even though is larger than , the difference is lower than in the previous cases. Finally, we point out that, in our numerical simulations, the coefficient is strictly positive when . This remains valid for all of the cases and for all three examples that we gave, and it is consistent with the arguments from the third step of the proof of Theorem 20.
[Figure 3: the first component of the discrete velocity for two stepsizes]
Table 1: results for Example 24 and the function F1
1.22e-11 | 1e-10 | 1e-02 | 4.0 | 10.0 | 3.91 | -3.56 | 3.91 | |
3.31e-11 | 1e-10 | 1e-02 | 1e-02 | 10.0 | 3.99 | 0.43 | 3.99 | |
2.82e-11 | 1e-10 | 1e-02 | 10.0 | 4.0 | 3.80 | -1.00 | 3.80 | |
1.50e-12 | 1e-10 | 1e-05 | 3.0 | 1e-01 | 3.92 | 2.17 | 3.92 | |
0.0 | 1e-10 | 1.0 | 100.0 | 102.0 | 3.74 | 11.63 | 11.63 | |
0.0 | 1e-10 | 2.0 | 5e-01 | 6.0 | 3.48 | 422.45 | 422.45 | |
0.0 | 1e-10 | 1.5 | 2.0 | 1.75 | 3.58 | 240.29 | 240.29 | |
0.0 | 1e-10 | 1.5 | 225 | 1e-02 | 8.52 | 17.29 | 17.29 |
Table 2: results for Example 24 and the function F2
7.03e-11 | 1e-10 | 1e-02 | 4.0 | 10.0 | 3.38 | -3.91 | 3.38 | |
1.92e-12 | 1e-10 | 1e-02 | 1e-02 | 10.0 | 3.38 | 0.08 | 3.38 | |
3.47e-11 | 1e-10 | 1e-02 | 10.0 | 4.0 | 3.37 | -1.0 | 3.37 | |
7.53e-11 | 1e-10 | 1e-05 | 3.0 | 1e-01 | 3.38 | 2.17 | 3.38 | |
0.0 | 1e-10 | 1 | 100 | 102 | 3.50 | 4.04 | 4.04 | |
0.0 | 1e-10 | 2.0 | 5e-01 | 6.0 | 3.43 | 407.54 | 407.54 | |
0.0 | 1e-10 | 1.5 | 2.0 | 1.75 | 3.5 | 229.04 | 229.04 | |
0.0 | 1e-10 | 1.5 | 225 | 1e-02 | 4.46 | 15.50 | 15.50 |
Table 3: results for Example 26 and the function F1
3.37e-11 | 1e-10 | 0.0 | 0.25 | 3.5 | 3.18 | 0.83 | 3.18 | |
9.79e-11 | 1e-10 | 1e-3 | 1.25 | 5.5 | 3.18 | -0.17 | 3.18 | |
3.28e-11 | 1e-10 | 1e-3 | 5.5 | 1.25 | 3.19 | -0.17 | 3.19 | |
1.11e-11 | 1e-10 | 1e-3 | 3.5 | 0.25 | 3.19 | 0.83 | 3.19 | |
0.0 | 1e-10 | 2.0 | 21.0 | 24.0 | 5.82 | -0.97 | 5.82 | |
0.0 | 1e-10 | 2.0 | 24.0 | 21.0 | 6.09 | -0.97 | 6.09 |
Table 4: results for Example 26 and the function F2
4.21e-12 | 1e-10 | 0.0 | 0.25 | 3.5 | 3.23 | 0.76 | 3.23 | |
9.18e-11 | 1e-10 | 1e-03 | 1.25 | 5.5 | 3.23 | -0.24 | 3.23 | |
5.68e-11 | 1e-10 | 1e-03 | 5.5 | 1.25 | 3.23 | -0.24 | 3.23 | |
7.18e-11 | 1e-10 | 1e-03 | 3.5 | 0.25 | 3.23 | 0.76 | 3.23 | |
0.0 | 1e-10 | 2.0 | 21.0 | 24.0 | 4.14 | -0.97 | 4.14 | |
0.0 | 1e-10 | 2.0 | 24.0 | 21.0 | 4.24 | -0.97 | 4.24 |
Conclusions and Discussions
The novelty of the present paper consists in the following theoretical and practical aspects regarding Hessian driven damping dynamical systems and their numerical methods. We summarize the link between the discrete case and the continuous counterpart of the optimization algorithms by pointing out our major contributions, together with some discussions that arise naturally from our research:
-
1)
In Subsection 2.1, we have shown that Nesterov's algorithm AGM2α is in fact a natural discretization of a Hessian-type dynamical system. Even though other authors (see [18] and [33]) have pointed out a link between continuous Hessian dynamical systems and the Nesterov algorithm, we have introduced a rigorous perspective in which Nesterov's method can be interpreted as a Lie-Trotter discretization of the underlying second order evolution equation. This approach is totally new with respect to inertial algorithms, in the following sense: only recently (see [1] and [29]) was it shown that Nesterov's method is a natural discretization of the Extended-AVD dynamical system. Before that, no constructions were available such that the Nesterov algorithm could be interpreted as an explicit numerical scheme of a continuous dynamical system without an extra gradient step, so our method represents a new approach to optimization. Furthermore, the use of the splitting technique is somewhat new to optimization problems, so this article strengthens the relationship between distinct mathematical fields.
-
2)
We have presented a geometrical-type perspective on optimization methods in the setting of minimization problems that involve convex objective functions (see Subsection 2.3). Likewise, we gave intuitive explanations for the choice of our algorithms in Subsection 2.4. At the same time, in Subsection 3.2 we have established Lemma 15, which represents a generalization of Lemma 14 given in [4]. Also, our proof of Theorem 20 is based upon the proof from the same paper, namely [4], but in a much more general setting.
-
3)
We have introduced new algorithms that contain the discrete version of the Hessian system, namely LT-S-IGAHD. The particular case given in Example 25 resembles the newly introduced algorithm IGAHD from [4]. Also, in Example 24 and Example 26 we have given general algorithms that contain three arbitrary parameters (under some suitable conditions), which have numerical benefits in practical applications (see Subsections 4.2 and 4.3). On the other hand, we emphasize that our algorithms have been developed in an elegant manner, based upon the combination of splitting methods with symplectic algorithms, similarly to the simpler case of the Nesterov inertial algorithm (see [29]).
4) We have developed new optimization algorithms based upon two gradient evaluations at each iteration. For each of these numerical schemes we have presented the associated continuous version, so that the schemes are constructed through geometric principles such as splitting and symplectic methods. The associated continuous dynamical systems, DynSys-ARDM and the system from Theorem 7, are worth studying because their discretizations are obtained in a natural manner, as in the case of Nesterov's method (see [1] and [29]).
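The Lie-Trotter mechanism invoked in contribution 1) can be illustrated on a toy problem. The sketch below (not the paper's LT-S-IGAHD scheme, just a minimal illustration) splits the harmonic oscillator q' = p, p' = -q into its "drift" and "kick" sub-flows; composing the two exact sub-flows over a step h is precisely a Lie-Trotter splitting, and it reproduces the symplectic Euler method, which is first-order accurate and exhibits no secular energy drift:

```python
import numpy as np

# Lie-Trotter splitting of x' = (A + B) x for the harmonic oscillator:
# A generates the drift q' = p, B generates the kick p' = -q.
# One step is x_{n+1} = exp(hA) exp(hB) x_n, i.e. kick then drift,
# which coincides with the symplectic Euler method.

def lie_trotter_step(q, p, h):
    p_new = p - h * q          # kick: exact flow of B over time h
    q_new = q + h * p_new      # drift: exact flow of A over time h
    return q_new, p_new

def exact_flow(q, p, t):
    # exact solution: rotation in phase space
    c, s = np.cos(t), np.sin(t)
    return c * q + s * p, -s * q + c * p

h, n = 0.01, 1000              # integrate up to t = 10
q, p = 1.0, 0.0
for _ in range(n):
    q, p = lie_trotter_step(q, p, h)

q_ex, p_ex = exact_flow(1.0, 0.0, h * n)
err = np.hypot(q - q_ex, p - p_ex)
energy_drift = abs(0.5 * (q**2 + p**2) - 0.5)
print(f"trajectory error: {err:.3e}")
print(f"energy drift:     {energy_drift:.3e}")
```

The trajectory error is O(h), while the energy error stays bounded of size O(h) for all times; this bounded-energy behavior is the structure-preservation property that motivates combining splitting with symplectic discretizations in the text.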
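Contribution 4) refers to schemes using two gradient evaluations per iteration, where a difference of consecutive gradients stands in for the Hessian-driven damping term (since ∇f(x_k) - ∇f(x_{k-1}) ≈ ∇²f(x_k)(x_k - x_{k-1})). The following sketch shows this generic structure on a quadratic; the coefficients and the function name `hessian_damped_gd` are illustrative choices, not the paper's LT-S-IGAHD parameters:

```python
import numpy as np

# Schematic Nesterov-type iteration with a gradient-correction term
# mimicking Hessian-driven damping. Illustrative only: the damping
# coefficient alpha/(k + alpha) and beta are NOT the paper's parameters.

def hessian_damped_gd(grad, x0, step=0.1, alpha=3.0, beta=0.5, iters=500):
    x_prev = x0.copy()
    x = x0.copy()
    g_prev = grad(x_prev)
    for k in range(1, iters + 1):
        g = grad(x)                                    # first gradient evaluation
        # inertial extrapolation with vanishing damping ...
        y = x + (1.0 - alpha / (k + alpha)) * (x - x_prev)
        # ... corrected by the gradient difference (Hessian-driven damping)
        y = y - beta * step * (g - g_prev)
        x_prev, g_prev = x, g
        x = y - step * grad(y)                         # second gradient evaluation
    return x

# quadratic test problem f(x) = 0.5 * x^T A x, minimizer at the origin
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
x_star = hessian_damped_gd(grad, np.array([5.0, 5.0]))
print(np.linalg.norm(x_star))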
Acknowledgments
I am gratefully indebted to Szilard Csaba László for his suggestions concerning the organization of this article and for the helpful comments regarding the dynamical systems and the optimization algorithms of the present research paper.
References
- [1] C.D. Alecsa, S.C. László, T. Pinţa, An extension of the second order dynamical system that models Nesterov’s convex gradient method, 2019, arXiv:1908.02574.
- [2] C.D. Alecsa, I. Boros, T. Pinţa, New optimization algorithms for neural network training using operator splitting techniques, 2019, arXiv:1904.12952.
- [3] F. Álvarez, H. Attouch, J. Bolte, P. Redont, A second-order gradient like dissipative dynamical system with Hessian-driven damping. Application to optimization and mechanics, J. Math. Pures Appl. 81 (2002), no. 8, pp. 747-779.
- [4] H. Attouch, Z. Chbani, J. Fadili, H. Riahi, First-order optimization algorithms via inertial systems with Hessian driven damping, 2019, arXiv:1907.10536.
- [5] H. Attouch, A. Cabot, Asymptotic stabilization of inertial gradient dynamics with time-dependent viscosity, J. Differential Equations 263 (2017), pp. 5412-5458, doi:10.1016/j.jde.2017.06.024.
- [6] H. Attouch, A. Cabot, Convergence rates of inertial forward-backward algorithms, SIAM J. Optim. 28 (2018), no. 1, pp. 849-874, https://doi.org/10.1137/17M1114739.
- [7] H. Attouch, X. Goudou, P. Redont, The heavy ball with friction method. I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative system, Commun. Contemp. Math. 2 (2000), pp. 1-34.
- [8] H. Attouch, J. Peypouquet, P. Redont, Fast convex minimization via inertial dynamics with Hessian driven damping, J. Differential Equations 261 (2016), pp. 5734-5783.
- [9] N. Bansal, A. Gupta, Potential-function proofs for gradient methods, Theory of Computing (2019), pp. 1-32.
- [10] A. Bátkai, P. Csomós, B. Farkas, G. Nickel, Operator splitting for non-autonomous evolution equations, J. Functional Analysis 260 (2011), pp. 2163-2190.
- [11] A. Beck, Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MATLAB, vol. 19, SIAM, 2014.
- [12] S. Blanes, F. Casas, A. Murua, Splitting and composition methods in the numerical integration of differential equations, Bol. Soc. Esp. Mat. Apl. 45 (2008), pp. 89-145.
- [13] R.I. Bot, R. Csetnek, S.C. László, A second order dynamical approach with variable damping to nonconvex smooth minimization, Applic. Analysis, 2018, doi:10.1080/00036811.2018.1495330.
- [14] A. Cabot, H. Engler, S. Gadat, On the long time behavior of second-order differential equations with asymptotically small dissipation, Trans. Amer. Math. Soc. 361 (2009), pp. 5983-6017.
- [15] A. Cabot, H. Engler, S. Gadat, Second order differential equations with asymptotically small dissipation and piecewise flat potentials, Electron J. Differential Equations 17 (2009), pp. 33-38.
- [16] C. Castera, J. Bolte, C. Févotte, E. Pauwels, An inertial Newton algorithm for deep learning, 2019, arXiv:1905.12278.
- [17] A. Chambolle, C. Dossal, On the convergence of the iterates of FISTA, HAL Id : hal-01060130, 2014, https://hal.inria.fr/hal-01060130v3.
- [18] L. Chen, H. Luo, First order optimization methods based on Hessian-driven Nesterov accelerated gradient flow, 2019, arXiv:1912.09276.
- [19] A. Defazio, On the curved geometry of accelerated optimization, Adv. Neural Inf. Processing Syst. 33 (NIPS 2019), 2019.
- [20] K. Feng, On difference schemes and symplectic geometry, Proceedings of the 5-th Intern. Symposium on differential geometry differential equations, August 1984, Beijing (1985), pp. 42-58.
- [21] G. França, D.P. Robinson, R. Vidal, Gradient flows and accelerated proximal splitting methods, 2019, arXiv:1908.00865.
- [22] G. França, J. Sulam, D.P. Robinson, R. Vidal, Conformal symplectic and relativistic optimization, 2019, arXiv:1903.04100.
- [23] E. Hairer, C. Lubich, G. Wanner, Geometric Numerical Integration. Structure-Preserving Algorithms for ODE’s, second edition, Springer-Verlag Berlin Heidelberg, 2006, doi:10.1007/3-540-30666-8.
- [24] E. Hairer, G. Wanner, Euler methods, explicit, implicit and symplectic. In : Encyclopedia of Applied and Computational Mathematics, B. Engquist ed., Springer, Berlin Heidelberg, 2015, pp. 451-455.
- [25] E. Hansen, F. Kramer, A. Ostermann, A second-order positivity preserving scheme for semilinear parabolic problems, Appl. Num. Math. 62 (2012), pp. 1428-1435.
- [26] E. Hansen, A. Ostermann, Exponential splitting for unbounded operators, Math. of Comput. 78 (2009), no. 267, pp. 1485-1496.
- [27] S.C. László, Convergence rates for an inertial algorithm of gradient type associated to a smooth nonconvex minimization, 2018, arXiv:1807.00387.
- [28] G. Marchuk, Some applications of splitting-up methods to the solution of mathematical physics problems, Aplikace Matematiky 13 (1968), pp. 103-132.
- [29] M. Muehlebach, M.I. Jordan, A dynamical systems perspective on Nesterov acceleration, 2019, arXiv:1905.07436.
- [30] Y. Nesterov, A method for solving a convex programming problem with convergence rate O(1/k²) (in Russian), Soviet Math. Dokl. 27 (1983), no. 2, pp. 543-547.
- [31] Y. Nesterov, Introductory lectures on convex optimization : A basic course, vol. 87 of Applied Optimization, Kluwer Academic Publishers, Boston, MA, 2004.
- [32] N.C. Nguyen, P. Fernandez, R.M. Freund, J. Peraire, Accelerated residual methods for iterative solution of systems of equations, SIAM J. Sci. Computing 40 (2018), pp. A3157-A3179.
- [33] M. Laborde, A.M. Oberman, A Lyapunov analysis for accelerated gradient methods : from deterministic to stochastic case, 2019, arXiv:1908.07861.
- [34] B.T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Comput. Math. Math. Phys. 4 (1964), pp. 1-17.
- [35] J.M. Sanz-Serna, Symplectic methods. In : Encyclopedia of Applied and Computational Mathematics, B. Engquist ed., Springer, Berlin Heidelberg, 2015, pp. 1451-1458.
- [36] B. Shi, S.S. Du, W.J. Su, M. I. Jordan, Acceleration via symplectic discretization of high-resolution differential equations, 2019, arXiv:1902.03694.
- [37] B. Shi, S.S. Iyengar, A conservation law method based on optimization. In : Mathematical Theories of Machine Learning - Theory and Applications, Springer, Cham, 2019, pp. 63-85, https://doi.org/10.1007/978-3-030-17076-9_8.
- [38] G. Strang, On the construction and comparison of difference schemes, SIAM J. Numer. Anal. 5 (1968), no. 3, pp. 506-517.
- [39] W.J. Su, S. Boyd, E.J. Candès, A differential equation for modeling Nesterov's accelerated gradient method : Theory and insights, J. Machine Learning Res. 17 (2016), pp. 1-43.
- [40] T. Sun, P. Yin, D. Li, C. Huang, L. Guan, H. Jiang, Non-ergodic convergence analysis of heavy-ball algorithms, AAAI, 2019.
- [41] H.F. Trotter, On the product of semi-groups of operators, Proc. Am. Math. Soc. 10 (1959), no. 4, pp. 545-551.
- [42] P. Tseng, On accelerated proximal-gradient methods for convex-concave optimization, Manuscript, May 21 2018, mit.edu/~dimitrib/PTseng/papers/apgm.pdf.
- [43] R. de Vogelaere, Methods of integration which preserve the contact transformation property of the Hamiltonian equations, Report no. 4, Dept. Math., Univ. of Notre Dame, Notre Dame, Ind., 1956.