The long time behavior and the rate of convergence of symplectic convex algorithms obtained via splitting discretizations of inertial damping systems

Abstract

In this paper we propose new numerical algorithms in the setting of unconstrained optimization problems and we give proofs for the rate of convergence of the objective function along the iterates.

Furthermore, our algorithms are based upon splitting and symplectic methods and they preserve the energy properties of the inherent continuous dynamical system that contains a Hessian perturbation.

At the same time, we show that Nesterov's gradient method is equivalent to a Lie-Trotter splitting applied to a Hessian driven damping system.

Finally, some numerical experiments are presented in order to validate the theoretical results.

Authors

Cristian Daniel Alecsa
Tiberiu Popoviciu Institute of Numerical Analysis Romanian Academy Cluj-Napoca, Romania
Romanian Institute of Science and Technology, Cluj-Napoca, Romania

Keywords

unconstrained optimization; rate of convergence; inertial algorithms; Nesterov; convex function; Hessian; dynamical system; splitting operator; vanishing damping; Lie-Trotter.

Paper coordinates

Cristian-Daniel Alecsa, The long time behavior and the rate of convergence of symplectic convex algorithms obtained via splitting discretizations of inertial damping systems, arXiv:2001.10831


[1] C.D. Alecsa, S.C. Laszlo, T. Pinta, An extension of the second order dynamical system that models Nesterov’s convex gradient method, 2019,arxiv:1908.02574

[2] C.D. Alecsa, I. Boros, T. Pinta, New optimization algorithms for neural network training using operator splitting techniques, 2019, arxiv:1904.12952

[3] F. Alvarez, H. Attouch, J. Bolte, P. Redont, A second-order gradient like dissipative dynamical system with Hessian-driven damping. Application to optimization and mechanics, J. Math. Pures Appl. 81 (2002), no. 8, pp. 747-779.

[4] H. Attouch, Z. Chbani, J. Fadili, H. Riahi, First-order optimization algorithms via inertial systems with Hessian driven damping, 2019, arxiv:1907.10536

[5] H. Attouch, A. Cabot, Asymptotic stabilization of inertial gradient dynamics with time-dependent viscosity, J. Differential Equations 263 (2017), pp. 5412-5458, doi:10.1016/j.jde.2017.06.024

[6] H. Attouch, A. Cabot, Convergence rates of inertial forward-backward algorithms, SIAM J. Optim (28) 2018, no.1, pp. 849-874, https://doi.org/10.1137/17M1114739

[7] H. Attouch, X. Goudou, P. Redont, The heavy ball with friction method. I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative system, Commun. Contemp. Math. 2 (2000), pp. 1-34.

[8] H. Attouch, J. Peypouquet, P. Redont, Fast convex minimization via inertial dynamics with Hessian driven damping, J. Differential Equations 261 (2016), pp. 5734-5783.

[9] N. Bansal, A. Gupta, Potential-function proofs for gradient methods, Theory of Computing (2019), pp. 1-32.

[10] A. Batkai, P. Csomos, B. Farkas, G. Nickel, Operator splitting for non-autonomous evolution equations, J. Functional Analysis 260 (2011), pp. 2163-2190.

[11] A. Beck, Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MATLAB, vol. 19, SIAM, 2014.

[12] S. Blanes, F. Casas, A. Murua, Splitting and composition methods in the numerical integration of differential equations, Bol. Soc. Esp. Mat. Apl. 45 (2008), pp. 89-145.

[13] R.I. Bot, R. Csetnek, S.C. Laszlo, A second order dynamical approach with variable damping to nonconvex smooth minimization, Applic. Analysis, 2018, doi:10.1080/00036811.2018.1495330

[14] A. Cabot, H. Engler, S. Gadat, On the long time behavior of second-order differential equations with asymptotically small dissipation, Trans. Amer. Math. Soc. 361 (2009), pp. 5983-6017.

[15] A. Cabot, H. Engler, S. Gadat, Second order differential equations with asymptotically small dissipation and piecewise flat potentials, Electron J. Differential Equations 17 (2009), pp. 33-38.

[16] C. Castera, J. Bolte, C. Fevotte, E. Pauwels, An inertial Newton algorithm for deep learning, 2019, arXiv:1905.12278.

[17] A. Chambolle, C. Dossal, On the convergence of the iterates of FISTA, HAL Id: hal-01060130, 2014, https://hal.inria.fr/hal-01060130v3.

[18] L. Chen, H. Luo, First order optimization methods based on Hessian-driven Nesterov accelerated gradient flow, 2019, arxiv:1912.09276.

[19] A. Defazio, On the curved geometry of accelerated optimization, Adv. Neural Inf. Processing Syst. 33 (NIPS 2019), 2019.

[20] K. Feng, On difference schemes and symplectic geometry, Proceedings of the 5-th Intern. Symposium on differential geometry & differential equations, August 1984, Beijing (1985), pp. 42-58.

[21] G. Franca, D.P. Robinson, R. Vidal, Gradient flows and accelerated proximal splitting methods, 2019, arXiv:1908.00865.

[22] G. Franca, J. Sulam, D.P. Robinson, R. Vidal, Conformal symplectic and relativistic optimization, 2019, arXiv:1903.04100.

[23] E. Hairer, C. Lubich, G. Wanner, Geometric Numerical Integration. Structure-Preserving Algorithms for ODE’s, second edition, Springer-Verlag Berlin Heidelberg, 2006, doi:10.1007/3-540-30666-8.

[24] E. Hairer, G. Wanner, Euler methods, explicit, implicit and symplectic. In : Encyclopedia of Applied and Computational Mathematics, B. Engquist ed., Springer, Berlin Heidelberg, 2015, pp. 451-455.

[25] E. Hansen, F. Kramer, A. Ostermann, A second-order positivity preserving scheme for semilinear parabolic problems, Appl. Num. Math. 62 (2012), pp. 1428-1435.

[26] E. Hansen, A. Ostermann, Exponential splitting for unbounded operators, Math. of Comput. 78 (2009), no. 267, pp. 1485-1496.

[27] S.C. Laszlo, Convergence rates for an inertial algorithm of gradient type associated to a smooth nonconvex minimization, 2018, arXiv:1807.00387.

[28] G. Marchuk, Some applications of splitting-up methods to the solution of mathematical physics problems, Aplikace Matematiky 13 (1968), pp. 103-132.

[29] M. Muehlebach, M.I. Jordan, A dynamical systems perspective on Nesterov acceleration, 2019, arXiv:1905.07436.

[30] Y. Nesterov, A method for solving a convex programming problem with convergence rate O(1/k^2) (in Russian), Soviet Math. Dokl. 27 (1983), no. 2, pp. 543-547.

[31] Y. Nesterov, Introductory lectures on convex optimization : A basic course, vol. 87 of Applied Optimization, Kluwer Academic Publishers, Boston, MA, 2004.

[32] N.C. Nguyen, P. Fernandez, R.M. Freund, J. Peraire, Accelerated residual methods for iterative solution of systems of equations, SIAM J. Sci. Computing 40 (2018), pp. A3157-A3179.

[33] M. Laborde, A.M. Oberman, A Lyapunov analysis for accelerated gradient methods : from deterministic to stochastic case, 2019, arXiv:1908.07861.

[34] B.T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Comput. Math. Math. Phys. (4) 1964, pp. 1-17.

[35] J.M. Sanz-Serna, Symplectic methods. In : Encyclopedia of Applied and Computational Mathematics, B. Engquist ed., Springer, Berlin Heidelberg, 2015, pp. 1451-1458.

[36] B. Shi, S.S. Du, W.J. Su, M. I. Jordan, Acceleration via symplectic discretization of high-resolution differential equations, 2019, arXiv:1902.03694.

[37] B. Shi, S.S. Iyengar, A conservation law method based on optimization. In : Mathematical Theories of Machine Learning – Theory and Applications, Springer, Cham, 2019, pp. 63-85 https://doi.org/10.1007/978-3-030-17076-9_8.

[38] G. Strang, On the construction and comparison of difference schemes, SIAM J. Numer. Anal. 5 (1968), no. 3, pp. 506-517.

[39] W.J. Su, S. Boyd, E.J. Candes, A differential equation for modeling Nesterov’s accelerated gradient method : Theory and insights, J. Machine Learning Res. (17) 2016, pp. 1-43.

[40] T. Sun, P. Yin, D. Li, C. Huang, L. Guan, H. Jiang, Non-ergodic convergence analysis of heavy-ball algorithms, AAAI, 2019.

[41] H.F. Trotter, On the product of semi-groups of operators, Proc. Am. Math. Soc. 10 (1959), no. 4, pp. 545-551.

[42] P. Tseng, On accelerated proximal-gradient methods for convex-concave optimization, Manuscript, May 21 2018, mit.edu/~dimitrib/PTseng/papers/apgm.pdf.

[43] R. de Vogelaere, Methods of integration which preserve the contact transformation property of the Hamiltonian equations, Report no. 4, Dept. Math., Univ. of Notre Dame, Notre Dame, Ind., 1956.

Paper (preprint) in HTML form

The long time behavior and the rate of convergence of symplectic convex algorithms obtained via splitting discretizations of inertial damping systems.

Cristian Daniel Alecsa
Tiberiu Popoviciu Institute of Numerical Analysis
Romanian Academy
Cluj-Napoca, Romania
cristian.alecsa@ictp.acad.ro

Romanian Institute of Science and Technology
Cluj-Napoca, Romania
alecsa@rist.ro
Corresponding author. This work was supported by a grant of Ministry of Research and Innovation, CNCS - UEFISCDI, project number PN-III-P1-1.1-TE-2016-0266, within PNCDI III.

Abstract. In this paper we propose new numerical algorithms in the setting of unconstrained optimization problems and we study the rate of convergence of the objective function along the iterates. Furthermore, our algorithms are based upon splitting and symplectic methods and they preserve the energy properties of the inherent continuous dynamical system that contains a Hessian perturbation. In particular, we show that Nesterov's gradient method is equivalent to a Lie-Trotter splitting applied to a Hessian driven damping system. Finally, some numerical experiments are presented in order to validate the theoretical results.

Key Words. unconstrained optimization problems, rate of convergence, inertial algorithms, Nesterov convex gradient method, convex function, Hessian driven damping, dynamical system, Lie-Trotter splitting

1 Preliminaries

The theory of accelerated optimization algorithms originated in the work of Polyak (see [34]) and it peaked with the convex gradient method developed by Nesterov in [30]. The theory of dynamical systems associated to Nesterov's accelerated method was first introduced in [39] by Su, Boyd and Candès, where the authors pointed out that their derivation of the dynamical system was informal. Also for the case of convex objective functions, the first true model for Nesterov's discretization was considered very recently by Muehlebach and Jordan in [29] and in the generalization given in [1] by Alecsa, László and Pinţa, where a perturbation of the exact solution of the dynamical system was considered under the gradient. On the other hand, also recently, in [19] Defazio showed that the accelerated gradient method can be seen as a proximal point method in a Riemannian manifold, but the theoretical analysis was done under the assumption of strong convexity. The first contribution of the present paper is to give a rigorous alternative to these approaches and to show that Nesterov's method is in fact a Lie-Trotter discretization of a Hessian-driven damping dynamical system (see Theorem 2 and Subsection 2.1). Our second important contribution is to show that many well-known optimization algorithms are in fact a combination between splitting and symplectic methods, and we further show that the algorithms based upon Hessian-driven damping discretizations (similar to those in [4]) have a continuous dynamical system that is also constructed with respect to the aforementioned methods (see Subsection 2.2 and Theorem 4). On the other hand, we introduce some generalized optimization algorithms associated to Hessian-driven damping dynamical systems that have many advantages from a practical point of view (see LT-S-IGAHD and Section 4), and this represents the third and last contribution of the present article.

We consider the following unconstrained optimization problem

\min\limits_{x\in\mathbb{R}^{N}}f(x) (1)

where f:\mathbb{R}^{N}\to\mathbb{R} is a convex and Fréchet differentiable function, for which its gradient \nabla f:\mathbb{R}^{N}\to\mathbb{R}^{N} is L-Lipschitz, i.e. there exists L>0 such that for each x,y\in\mathbb{R}^{N}

\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|\ . (2)

For simplicity, we consider that all of the second order equations that appear in this paper are coupled with some initial conditions x(t_{0})=x_{0}\in\mathbb{R}^{N} and \dot{x}(t_{0})=v_{0}\in\mathbb{R}^{N}, respectively (where t_{0}>0). This assumption is imposed only for the clarity of the paper, since (as in [4] and other similar papers) we are interested in the asymptotic behavior of a trajectory of the underlying dynamical system. On the other hand, this is also true for the algorithmic counterparts, where, given x_{0}\in\mathbb{R}^{N} and taking x_{1}\in\mathbb{R}^{N} arbitrary, the accelerated inertial methods start from n=1 in order to compute x_{2}.
Now, we give a brief presentation regarding the discrete and the continuous analysis of optimization problems. In [34], Polyak has introduced the heavy ball with friction dynamical system, defined as

\ddot{x}(t)+\gamma\dot{x}(t)=-\nabla f(x(t))\ . (HBFγ)

Here \gamma>0 is a fixed viscous damping parameter that represents the coefficient of friction of the dynamical system (HBFγ). Furthermore, the combination between the second derivative of the solution and \gamma is used in order to accelerate the gradient methods using inertial coefficients. For (HBFγ), the objective function f does not decrease along the trajectories. On the other hand, the global energy (kinetic + potential) \mathcal{E}(t)=\frac{1}{2}\|\dot{x}(t)\|^{2}+f(x(t)) is monotonically decreasing, being a continuous Lyapunov-type functional. Furthermore, we recall that Attouch et al. (in [7]) have studied the asymptotic properties of this dynamical system.
The natural discretization of (HBFγ) is Polyak’s inertial method (see [34]), i.e.

\begin{cases}y_{n}=x_{n}+(1-h\gamma)(x_{n}-x_{n-1})\\ x_{n+1}=y_{n}-h^{2}\nabla f(x_{n})\end{cases} (PIMγ)

which, in contrast with descent methods, contains an additional inertial term x_{n}-x_{n-1}. If we divide this difference between two consecutive iteration values by the stepsize h, then we obtain a basic approximation to the first order derivative of the solution x(t) of (HBFγ). As a side note, we mention that in [40] the authors made the first non-ergodic analysis of the rate of convergence for a variant of Polyak's algorithm (PIMγ) that contains a non-constant stepsize related to the inertial term, under the assumption that the objective function is coercive. Now, we shift our attention to the most popular inertial algorithm, developed by Y. Nesterov in [30]. The following update rule is often called Nesterov's second accelerated method and it takes the following form :

\begin{cases}y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})\\ x_{n+1}=y_{n}-h^{2}\nabla f(y_{n})\end{cases} (AGM2α)

where, in practical applications, \alpha_{n}=\dfrac{n}{n+\alpha} (see the work of Chambolle and Dossal [17]). A different set of update rules is given by the so-called Nesterov's first accelerated method (briefly AGM1), which gives the best convergence rate in the objective function, and it is defined by a sequence (\lambda_{n})_{n\in\mathbb{N}} whose general term is defined recursively as \lambda_{n}=\left(1+\sqrt{1+4\lambda^{2}_{n-1}}\right)/2. The equivalence between (AGM1) and (AGM2α) for \alpha=2 has been used in [9] regarding a Lyapunov analysis of Nesterov's algorithm. Also, in [42] some unified generalizations were made using Bregman distances. In contrast with Polyak's method (PIMγ), in [30] it was shown that Nesterov's algorithm has an improved rate of convergence, namely O\left(1/n^{2}\right). Further, we point out that, in a nonconvex setting, a two-step inertial method similar to (AGM2α) has been recently studied in [27]. In [39], Su et al. introduced the continuous counterpart of Nesterov's dynamical system of the form

\ddot{x}(t)+\dfrac{\alpha}{t}\dot{x}(t)=-\nabla f(x(t)) (AVDα)

The dynamical system with asymptotic vanishing damping was introduced in the framework of convex optimization problems. For \alpha\geq 3, it was shown that each trajectory of (AVDα) satisfies the rate of convergence f(x(t))-\min f=O\left(1/t^{2}\right) as t\to\infty. The relationship between Nesterov's algorithm (AGM2α) and the dynamical system (AVDα) is based upon vanishing stepsize arguments, so Nesterov's optimization scheme cannot be considered a proper discretization of the associated differential equation (even the authors of [39] pointed out that their derivation of the dynamical system is informal). A general version of (AVDα) can be found in the seminal papers [14], [15] and in their extensions given in [5], where \dfrac{\alpha}{t} is replaced by a general function a=a(t).
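To fix ideas, the following minimal sketch runs the two discrete schemes recalled above, (PIMγ) and (AGM2α), side by side. The quadratic test function, the stepsize and the damping parameters are illustrative assumptions, not taken from the paper; the snippet is meant only to make the update rules concrete, not to reproduce the experiments of Section 4.

```python
import numpy as np

# Illustrative convex quadratic f(x) = 0.5 * x^T A x; A, h, gamma, alpha are assumed values.
A = np.diag([1.0, 10.0])
grad_f = lambda x: A @ x

h, gamma, alpha = 0.1, 2.0, 3.0
x_pim, x_pim_prev = np.array([3.0, -2.0]), np.array([3.0, -2.0])
x_nag, x_nag_prev = np.array([3.0, -2.0]), np.array([3.0, -2.0])

for n in range(1, 200):
    # (PIM_gamma): the gradient is evaluated at the current iterate x_n.
    y = x_pim + (1.0 - h * gamma) * (x_pim - x_pim_prev)
    x_pim_prev, x_pim = x_pim, y - h**2 * grad_f(x_pim)

    # (AGM2_alpha): the gradient is evaluated at the predictor y_n, with alpha_n = n/(n+alpha).
    y = x_nag + n / (n + alpha) * (x_nag - x_nag_prev)
    x_nag_prev, x_nag = x_nag, y - h**2 * grad_f(y)

print("PIM iterate :", x_pim)
print("AGM2 iterate:", x_nag)
```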
In [13], a perturbed variant of (AVDα) was analyzed in the setting of nonconvex minimization problems, using the Kurdyka-Lojasiewicz property imposed on a regularization of the objective function. Very recently, in [29], it was shown that Nesterov's convex gradient method is the natural discretization of a modified second order dynamical system. Further, a generalized version of the one from [29] was studied in [1], in which the authors proved an O\left(1/t^{2}\right) rate of convergence as t\to\infty for the trajectories of the dynamical system that contains a perturbation under the gradient (see [1]):

\ddot{x}(t)+\dfrac{\alpha}{t}\dot{x}(t)=-\nabla f\left(x\left(t\right)+\left(\gamma-\dfrac{\alpha\gamma^{2}}{t}\right)\dot{x}(t)\right) (Extended-AVD)

On the other hand, for this continuous version of Nesterov’s algorithm some theoretical results were derived related to the regularity of the solution of the associated initial value problem.
Now, we shift our attention to generalizations of the aforementioned dynamical systems, in which the derivative of the solution is perturbed with the value of the Hessian along the trajectories. In [4], Attouch et. al. introduced the Hessian-driven damping dynamical system, i.e.

\ddot{x}(t)+\dfrac{\alpha}{t}\dot{x}(t)+\beta(t)\nabla^{2}f(x(t))\dot{x}(t)=-b(t)\nabla f(x(t))\ . (DIN-AVD_{\tfrac{\alpha}{t},\beta(t),b(t)})

Furthermore, the inertial system with constant damping that represents a generalization of the heavy ball with friction was first studied in [3]. Also see [8] for the case of \left(\text{DIN-AVD}_{\tfrac{\alpha}{t},\beta,1}\right). The idea behind the Hessian dynamics is the fact that the oscillations that are present both in the continuous and in the discrete case are neutralized in the presence of convex objective functions. Now, we present some particular cases that play a key role in the present paper. The Hessian system \left(\text{DIN-AVD}_{\tfrac{3}{t},0,1}\right) was considered in [33], where M. Laborde and A.M. Oberman pointed out that Nesterov's discretization is in fact a forward Euler discretization for the Hessian dynamical system. In fact, they showed the method is a forward Euler discretization with a predictor value under the gradient, but not a true discretization. So, in this paper, we claim that Nesterov's accelerated gradient method is indeed a proper discretization scheme of \left(\text{DIN-AVD}_{\tfrac{\alpha}{t},\beta,1+\tfrac{\beta}{t}}\right), namely a full Lie-Trotter splitting discretization composed of forward Euler methods. Attouch et al. proposed a Nesterov-like algorithm, but with two consecutive gradients that represent a discretization of the Hessian. This is linked with the term \nabla^{2}f(x(t))\dot{x}(t), which acts like an additional dissipation term. We shall show a link between the true discretization of a modified DIN-AVD_{\tfrac{\alpha}{t},\beta(t),b(t)}-type system, which is a Polyak-type method, and the algorithm introduced by Attouch et al. On the other hand, another particular case is \left(\text{DIN-AVD}_{\alpha,\beta,1}\right): in [16], a forward Euler method based on this dynamical system was constructed as an optimizer for deep learning neural networks. At the same time, this discretization scheme is a forward Euler method applied to a first order dynamical system that is equivalent to the second order differential equation, taking into account the regularity of f, i.e.

\begin{cases}\dot{x}(t)+\beta\nabla f(x(t))+\left(\alpha-\dfrac{1}{\beta}\right)x(t)+\dfrac{1}{\beta}v(t)=0\\ \dot{v}(t)+\left(\alpha-\dfrac{1}{\beta}\right)x(t)+\dfrac{1}{\beta}v(t)=0\end{cases} (3)

We mention that the particular case of the vanishing damping Hessian system \left(\text{DIN-AVD}_{\tfrac{3}{t},\beta,1+\tfrac{\beta}{t}}\right) can also be written as a first order dynamical system. The following form is much simpler than the former first order system and will be used frequently in our section of main results.

\begin{cases}\dot{x}(t)=\dfrac{2}{t}\left(v\left(t\right)-x\left(t\right)\right)-\beta\nabla f(x(t))\\ \dot{v}(t)=-\dfrac{t}{2}\nabla f(x(t))\end{cases} (4)

Finally, we end this section by pointing out that, for \beta=0, the dynamical system (4) reduces to a system that is equivalent to the second order evolution equation (AVDα) with \alpha=3. For other details, we refer to [33].
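Since the first order system (4) will be used frequently later on, we include a minimal sketch of its forward Euler integration on an assumed quadratic objective (the matrix, the coefficient beta, the time step and t_0 below are illustrative values); it only serves to visualize the decrease of f along a discretized trajectory.

```python
import numpy as np

# Forward Euler integration of the first order system (4) (i.e. alpha = 3) on an
# assumed convex quadratic; beta, the time step dt and t_0 are illustrative values.
A = np.diag([1.0, 10.0])
f      = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

beta, dt, t = 0.5, 1e-3, 1.0
x = np.array([3.0, -2.0])
v = x.copy()

for n in range(20_000):
    g = grad_f(x)
    x, v = x + dt * (2.0 / t * (v - x) - beta * g), v - dt * (t / 2.0) * g
    t += dt

print(f(x))   # far below the initial value f(x(t_0))
```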

1.1 Some notes on splitting operator techniques

In this subsection we present the most used splitting techniques for a general autonomous evolution equation of the form

\dfrac{d}{dt}\varphi(t)=F(\varphi(t)) (5)

where \varphi:\mathbb{R}\to\mathbb{R}^{N}, \varphi=\varphi(t), is the flow of the dynamical system. If the vector field from the right hand side can be decomposed into two vector fields, i.e. F=F^{[1]}+F^{[2]}, then we attach the associated sub-problems of (5), namely

(P_{1}):\dfrac{d}{dt}\varphi(t)=F^{[1]}(\varphi(t))\quad\text{ and }\quad(P_{2}):\dfrac{d}{dt}\varphi(t)=F^{[2]}(\varphi(t)) (6)

Let \varphi^{[1]}=\varphi^{[1]}(t) be the solution of (P_{1}) and \varphi^{[2]}=\varphi^{[2]}(t) be the solution of (P_{2}). Then, the Lie-Trotter splitting or sequential splitting (see [41]) after one stepsize h is :

\psi_{h}=\varphi^{[1]}_{h}\circ\varphi^{[2]}_{h} (LTS1)

Another Lie-Trotter splitting operator is obtained by first applying the exact solution of (P_{1}) and then using it as an initial condition for (P_{2}), i.e.

\psi^{\ast}_{h}=\varphi^{[2]}_{h}\circ\varphi^{[1]}_{h} (LTS2)

Here we have employed the usual notation \varphi^{[i]}_{t}:=\varphi^{[i]}(t) for i\in\{1,2\}. For a discrete time t_{n}=nh, (LTS1) becomes \psi(t_{n})=(\varphi^{[1]}\circ\varphi^{[2]})(t_{n})=\left(\varphi^{[1]}_{h}\circ\varphi^{[2]}_{h}\right)^{n}(h) and (LTS2) is equivalent to \psi^{\ast}(t_{n})=\left(\varphi^{[2]}_{h}\circ\varphi^{[1]}_{h}\right)^{n}(h). A more advanced type of splitting technique is the so-called Strang or Marchuk splitting (see [28] and [38]). From a numerical point of view, it has an advantage over the Lie-Trotter splitting in that it is second-order accurate. Again, given a stepsize h, the Strang splitting at t=h is defined as

\psi_{h}=\varphi^{[1]}_{h/2}\circ\varphi^{[2]}_{h}\circ\varphi^{[1]}_{h/2} (SS1)

Again, the Strang splitting operator can appear in a different symmetric form, namely

\psi^{\ast}_{h}=\varphi^{[2]}_{h/2}\circ\varphi^{[1]}_{h}\circ\varphi^{[2]}_{h/2} (SS2)

Applying this recursively, we obtain the splitting operator formula for the discrete time t_{n}=nh. Quite interestingly, if, for example, the exact flows \varphi^{[1]}(t) and \varphi^{[2]}(t) are not available, then one can use consistent approximations to these solutions. In this manner, from now on, \tilde{\varphi}^{[1]}_{h} and \tilde{\varphi}^{[2]}_{h} represent numerical discretizations of the flows of the two sub-problems (P_{1}) and (P_{2}) at time t_{n}=nh, respectively. Then, for example, the Lie-Trotter splitting (LTS1) can be defined as

\psi_{h}=\tilde{\varphi}^{[1]}_{h}\circ\tilde{\varphi}^{[2]}_{h} (7)

This is also valid for the Strang splitting operator formula (see [35]). On the other hand, for the case of multiple vector fields, when F=F^{[1]}+\ldots+F^{[m]}, we consider m sub-problems, i.e.

(P_{1}):\dfrac{d}{dt}\varphi(t)=F^{[1]}(\varphi(t))\quad\ldots\ldots\quad(P_{m}):\dfrac{d}{dt}\varphi(t)=F^{[m]}(\varphi(t)) (8)

In this case, the Lie-Trotter splitting becomes (see [23])

\psi_{h}=\varphi^{[1]}_{h}\circ\ldots\circ\varphi^{[m]}_{h} (9)

For example, since in the next sections we will use the case m=3, we observe that formula (9) is in fact

\psi_{h}=\left(\varphi^{[1]}_{h}\circ\varphi^{[2]}_{h}\right)\circ\varphi^{[3]}_{h} (10)

This is equivalent to the fact that we divide the evolution equation into two sub-problems and we use a numerical approximation to the solution of the second sub-problem. Furthermore, the first sub-problem is itself divided into two further sub-problems that can also be decomposed according to the Lie-Trotter splitting; a minimal sketch of the basic two-field Lie-Trotter construction is given below.
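As an illustration of the composition formulas above, the following minimal sketch applies a Lie-Trotter splitting of type (LTS2), with a forward Euler step on each sub-problem, to a toy decomposition F = F^{[1]} + F^{[2]}; the harmonic-oscillator example and the stepsize are assumptions made only for this snippet.

```python
import numpy as np

# Toy splitting of d/dt(x, v) = (v, -x) into two vector fields:
# F1 freezes x and moves v, F2 freezes v and moves x.
def F1(x, v):      # (P1): x' = 0,  v' = -x
    return 0.0, -x

def F2(x, v):      # (P2): x' = v,  v' = 0
    return v, 0.0

def euler_step(F, x, v, h):
    dx, dv = F(x, v)
    return x + h * dx, v + h * dv

def lie_trotter_step(x, v, h):
    # psi*_h = phi_h^[2] o phi_h^[1]: solve (P1) first, feed the result into (P2).
    x, v = euler_step(F1, x, v, h)
    x, v = euler_step(F2, x, v, h)
    return x, v

x, v, h = 1.0, 0.0, 0.01
for n in range(int(2 * np.pi / h)):   # roughly one period of the oscillator
    x, v = lie_trotter_step(x, v, h)
print(x, v)   # close to the initial state (1, 0), up to the splitting error
```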
The case of an unbounded linear operator has been studied in [26], when F(\varphi(t))=A\varphi(t)+B\varphi(t). Furthermore, in [25], Hansen et al. considered the splitting for the case of semilinear evolution equations, when one of the sub-problems is not linear, i.e. F(\varphi(t))=A\varphi(t)+F^{[2]}(\varphi(t)). For both these cases, the Lie-Trotter splitting and the Strang-Marchuk splitting can be defined with the above formulas, using the exact flows or some approximations to the exact solutions of the sub-problems.
Now, we turn our attention to the case of non-autonomous evolution equations of the form

\dfrac{d}{dt}\varphi(t)=F(t,\varphi(t)) (11)

There are several approaches for this kind of problem. In [10], Batkai et al. have used an approach similar to that of Hansen, Kramer and Ostermann. They have considered the case when F=F^{[1]}+F^{[2]}. In their paper, they applied the Lie-Trotter splitting when F^{[1]}(\varphi(t))=A(t)\varphi(t) and F^{[2]}(\varphi(t))=B(t)\varphi(t). For brevity, we recall the sequential splitting (or Lie-Trotter splitting) on the interval of discretization (t_{n-1},t_{n}] :

\begin{cases}\dfrac{d}{dt}\varphi^{[1]}(t)=B(t)\varphi^{[1]}(t)\quad,\quad t\in(t_{n-1},t_{n}]\quad\text{ with }\varphi^{[1]}(t_{n-1})=\varphi^{LT}(t_{n-1})\\ \dfrac{d}{dt}\varphi^{[2]}(t)=A(t)\varphi^{[2]}(t)\quad,\quad t\in(t_{n-1},t_{n}]\quad\text{ with }\varphi^{[2]}(t_{n-1})=\varphi^{[1]}(t_{n})\end{cases} (12)

where \varphi^{LT}(t_{n-1}) represents the previous splitting approximation, obtained on the interval (t_{n-2},t_{n-1}]. Also, from the above numerical scheme we get that \varphi^{LT}(t_{n})=\varphi^{[2]}(t_{n}). On the other hand, taking t=t_{n-1}, we obtain the full discretization of the Lie-Trotter splitting, given by the formula (LTS2). In the following sections we will employ this method, even though Batkai et al. have provided another formula for the case when the evolution equation \dfrac{d}{dt}\varphi(t)=A(r)\varphi(t)+B(r)\varphi(t) can be solved for each r\in\mathbb{R}. The only difference is that in (12) we can evaluate B exactly at the current discretization time t_{n}. Since we consider the updated value of t_{n} much more appropriate for the discretization schemes, we will use the previous formula. Also, we provide a reference for some splitting techniques that can be applied to optimization problems; for this, see the work of Blanes et al. [12]. We emphasize the fact that, recently, in [2] the authors have used these splitting techniques in order to devise some optimizers used in neural network training. That analysis is in contrast with the present article by the fact that in [2] the optimization algorithms were developed in a heuristic manner, based upon algebraic manipulations. Nevertheless, this shows the applicability of splitting-type methods to machine learning. Finally, we point out that splitting techniques have also been used recently in [21], where the authors have employed balanced and rebalanced splitting methods in connection with the equilibrium states of some ODEs associated to proximal algorithms. This differs from our analysis by the fact that we use only the Lie-Trotter discretization scheme.

1.2 A brief review of symplectic methods

We consider the setting of the autonomous evolution equation given by formula (5). Let \varphi=(x,v), with x(t),v(t)\in\mathbb{R}^{N}, and consider H:\mathbb{R}^{N}\times\mathbb{R}^{N}\to\mathbb{R} to be the Hamiltonian. We write the first order differential equation as a Hamiltonian dynamical system, using F from (5) with F(\varphi(t))=(\nabla_{v}H(x(t),v(t)),-\nabla_{x}H(x(t),v(t)))^{T}, in the following form (see [22]):

\begin{pmatrix}\dfrac{d}{dt}x(t)\\ \\ \dfrac{d}{dt}v(t)\end{pmatrix}=\begin{pmatrix}\nabla_{v}H(x(t),v(t))\\ \\ -\nabla_{x}H(x(t),v(t))\end{pmatrix} (HS)

For the pioneering works on differential geometry and basic Hamiltonian systems, we refer to [20] and [43]. Furthermore, for brevity, we recall (see [23] and [24]) the symplectic Euler methods related to (HS)

\begin{cases}x_{n+1}=x_{n}+h\nabla_{v}H(x_{n+1},v_{n})\\ v_{n+1}=v_{n}-h\nabla_{x}H(x_{n+1},v_{n})\end{cases} (SE1)

respectively

\begin{cases}x_{n+1}=x_{n}+h\nabla_{v}H(x_{n},v_{n+1})\\ v_{n+1}=v_{n}-h\nabla_{x}H(x_{n},v_{n+1})\end{cases} (SE2)

Other symplectic-type algorithms can be constructed with combinations of the form (x_{n},v_{n+1}) and (x_{n+1},v_{n}). Also from the work of Hairer, Lubich and Wanner [23], we recall a more advanced type of implicit-explicit symplectic integrators, namely the Stormer-Verlet schemes :

\begin{cases}x_{n+1/2}=x_{n}+\dfrac{h}{2}\nabla_{v}H(x_{n+1/2},v_{n})\\ v_{n+1}=v_{n}-\dfrac{h}{2}\left(\nabla_{x}H(x_{n+1/2},v_{n})+\nabla_{x}H(x_{n+1/2},v_{n+1})\right)\\ x_{n+1}=x_{n+1/2}+\dfrac{h}{2}\nabla_{v}H(x_{n+1/2},v_{n+1})\end{cases} (SV1)

and a similar symmetric method

\begin{cases}v_{n+1/2}=v_{n}-\dfrac{h}{2}\nabla_{x}H(x_{n},v_{n+1/2})\\ x_{n+1}=x_{n}+\dfrac{h}{2}\left(\nabla_{v}H(x_{n},v_{n+1/2})+\nabla_{v}H(x_{n+1},v_{n+1/2})\right)\\ v_{n+1}=v_{n+1/2}-\dfrac{h}{2}\nabla_{x}H(x_{n+1},v_{n+1/2})\end{cases} (SV2)

For brevity, we point out that x_{n+1/2} from (SV1) and v_{n+1/2} from (SV2) are just notations for intermediate steps that are used in the construction of the algorithms. Interestingly, under general assumptions, for a symplectic integrator of convergence order \mu one has that H(x_{n},v_{n})=H(x_{0},v_{0})+O(h^{\mu}) over exponentially long time intervals, namely nh\leq e^{h_{0}/2h}, where the constant does not depend on the iteration n. This leads to a very good energy behavior of symplectic methods, so these methods represent a good choice for the discretization of a conservative Hamiltonian system.
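This long-time energy behavior is easy to observe numerically. The minimal sketch below applies the symplectic Euler scheme (SE2), which is fully explicit for a separable Hamiltonian, to an assumed quadratic potential and monitors the deviation of H along the iterates; the matrix and the stepsize are illustrative choices.

```python
import numpy as np

# Separable Hamiltonian H(x, v) = ||v||^2/2 + f(x) with an assumed quadratic f.
A = np.diag([1.0, 4.0])
f      = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
H      = lambda x, v: 0.5 * v @ v + f(x)

def se2_step(x, v, h):
    # (SE2) for a separable Hamiltonian: update v with the old x, then x with the new v.
    v = v - h * grad_f(x)
    x = x + h * v
    return x, v

x, v, h = np.array([1.0, 0.5]), np.zeros(2), 0.05
H0, drift = H(x, v), 0.0
for n in range(10_000):
    x, v = se2_step(x, v, h)
    drift = max(drift, abs(H(x, v) - H0))

print("max |H(x_n, v_n) - H(x_0, v_0)| =", drift)   # stays O(h); it does not grow with n
```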
Also, it is known that if the Hamiltonian is separable, i.e. H(x,v)=T(v)+U(x), then (HS) becomes

\begin{pmatrix}\dfrac{d}{dt}x(t)\\ \\ \dfrac{d}{dt}v(t)\end{pmatrix}=\begin{pmatrix}\nabla_{v}T(v(t))\\ \\ -\nabla_{x}U(x(t))\end{pmatrix} (13)

Furthermore, dividing (13) into two sub-problems

(P_{1}):\begin{cases}\dfrac{d}{dt}x(t)=0\\ \\ \dfrac{d}{dt}v(t)=-\nabla_{x}U(x(t))\end{cases}\text{ and }\quad(P_{2}):\begin{cases}\dfrac{d}{dt}x(t)=\nabla_{v}T(v(t))\\ \\ \dfrac{d}{dt}v(t)=0\end{cases} (14)

and applying the explicit forward Euler discretization to both sub-problems, the symplectic Euler methods (SE1) and (SE2) coincide with the Lie-Trotter full discretizations (LTS1) and (LTS2). The same analysis applies to the Stormer-Verlet methods (SV1) and (SV2), which are equivalent to the Strang-Marchuk discretization schemes.
Recently, in [22], França et al. devised some optimization schemes based on Hamiltonian systems perturbed by a dissipation term \gamma>0, of the form

\begin{pmatrix}\dfrac{d}{dt}x(t)\\ \\ \dfrac{d}{dt}v(t)\end{pmatrix}=\begin{pmatrix}\nabla_{v}H(x(t),v(t))\\ \\ -\nabla_{x}H(x(t),v(t))-\gamma v(t)\end{pmatrix} (HSDγ)

where the separable Hamiltonian H is defined as H(x,v)=\dfrac{\|v\|^{2}}{2m}+f(x), with m>0 related to the positive inertial coefficient \gamma. Here, the first term T(v)=\dfrac{\|v\|^{2}}{2m} is the kinetic energy and U(x)=f(x) is the potential energy of the system. Furthermore, the heavy ball with friction system (HBFγ) is a particular case of (HSDγ) and can be written as

\begin{pmatrix}\dfrac{d}{dt}x(t)\\ \\ \dfrac{d}{dt}v(t)\end{pmatrix}=\begin{pmatrix}v(t)\\ \\ -\nabla f(x(t))-\gamma v(t)\end{pmatrix} (15)

and the Hamiltonian of the problem is a Lyapunov function, satisfying \dfrac{d}{dt}H(x(t),v(t))=-\gamma\|v(t)\|^{2}\leq 0 along the trajectories. Furthermore, it was shown that Polyak's method (PIMγ) is a conformal symplectic Euler method for (15), but Nesterov's method is not. Also, Polyak's algorithm can be shown to be equivalent to a Lie-Trotter splitting (in connection with the decomposition of (HSDγ)), where for the conservative subproblem one applies a symplectic Euler method and for the dissipative subproblem an explicit forward Euler step is used (for details see [22]). Finally, we recall that for the heavy ball system (HBFγ) without friction (when \gamma=0), the symplectic Euler and the symplectic Stormer-Verlet algorithms can be considered as the natural discretizations of the underlying ODE (see [23] and [37]).

1.3 The organization of the paper

The outline of the present paper is the following. In Subsection 2.1, we show that Nesterov's algorithm can be viewed as a Lie-Trotter or sequential splitting for a Hessian driven damping system. Furthermore, we also present an alternative way of writing the momentum based methods with asymptotic vanishing damping that resembles the continuous dynamical system. Also, we present some well known results for the heavy ball system and for the extended dynamical system with asymptotic vanishing damping in connection with the symplectic Euler integrator. In Subsection 2.3 and Subsection 2.4, we consider some results regarding various algorithms linked to symplectic integrators for the extended AVD system. In Section 3, we prove the rate of convergence of the Hessian-based algorithms in the framework of convex, differentiable objective functions whose gradient satisfies the Lipschitz continuity property. Finally, in Section 4, we validate our theoretical results with some numerical experiments involving various choices of the parameters of our newly introduced algorithms on classical convex test functions. For conciseness, we now briefly outline the analysis that follows.

In Subsection 2.2, we show some connections between the continuous and the discrete versions of some dynamical systems. The dynamical system (Extended-AVD) can be seen as a generalization of (AVDα) with an extra perturbation under the gradient that is similar to the updates in Nesterov's algorithm iterations. On the other hand, the (AVDα) dynamical system can be generalized through the Hessian driven damping evolution equation \left(\text{DIN-AVD}_{\tfrac{\alpha}{t},\beta(t),b(t)}\right), by using a perturbation term consisting of the product between the Hessian matrix and the gradient of the underlying solution. For conciseness, we recall that the difference between Polyak's method (PIMγ) and (AGM2α) lies in the fact that the momentum method is a splitting method for a system similar to (AVDα) but with constant damping, while Nesterov's method is also a splitting-based discretization combined with a symplectic Euler method, but for the extended continuous system. Quite surprisingly, our novel contribution, given in the second section, is that Nesterov's method can also be seen as a discretization (by using a triple splitting technique), but for the Hessian driven damping system \left(\text{DIN-AVD}_{\tfrac{\alpha}{t},\beta(t),b(t)}\right) with \beta(t)=\beta and b(t)=1+\beta/t. At the same time, regarding the Hessian-based dynamical system, as in Nesterov's case, the IGAHD algorithm (see [4]) is also a splitting-based method, where on the potential forces we apply the symplectic Euler method. A key point in our analysis is that symplectic methods play a special role in the discretizations of the continuous dynamical systems. We emphasize the fact that we have four types of symplectic methods, namely (SE1), (SE2), (SV1) and (SV2). A trial and error approach shows that these methods, in combination with splitting discretizations applied to the Hessian driven damping system \left(\text{DIN-AVD}_{\tfrac{\alpha}{t},\beta(t),b(t)}\right), lead to unfeasible algorithms, due to the fact that the continuous evolution equation is of a rather complicated nature. Also, using these techniques, one obtains algorithms containing at least three gradient evaluations or a product of the form \nabla^{2}f(x(t))\cdot\nabla f(x(t)), which is a computational burden (even after a forward-type discretization). So, our aim in Subsection 2.3 and Subsection 2.4 is to focus only on the (Extended-AVD) dynamical system. We have also observed that the alternative variant of the symplectic Euler method, in contrast with the one used in [29], leads to the same empirical convergence as Nesterov's algorithm but with an additional gradient evaluation. On the other hand, the Stormer-Verlet algorithms lead to three gradient evaluations, which is unfeasible. In this way, we will analyze in an intuitive manner only the Stormer-Verlet integrator (SV2) and the symplectic Euler integrator (SE1), which induce two gradient evaluations. This is similar to the two-gradient scheme IGAHD. Moreover, we will introduce another dynamical system for which a Lie-Trotter discretization yields the perturbed Nesterov-type algorithm introduced in [32].

2 The discretizations of the dynamical systems

2.1 Nesterov’s accelerated method is a Lie-Trotter splitting method applied to a Hessian-based dynamical system

In [33], it was shown that the system \left(\text{DIN-AVD}_{\tfrac{3}{t},\beta,1+\tfrac{\beta}{t}}\right) with the vanishing damping 3/t can also be written as a first order dynamical system, namely (4). For brevity, in our first result, we will show that the general vanishing damping system with Hessian perturbation \left(\text{DIN-AVD}_{\tfrac{\alpha}{t},\beta,1+\tfrac{\beta}{t}}\right) can be written as a first order system. This result will also be used in this section when we prove that Nesterov's method is actually a Lie-Trotter splitting applied to three sub-problems of this dynamical system.

Proposition 1.

Let f\in C^{2}(\mathbb{R}^{N},\mathbb{R}) and let \beta\in\mathbb{R} and \alpha>1. Further, consider the \left(\text{DIN-AVD}_{\tfrac{\alpha}{t},\beta,1+\tfrac{\beta}{t}}\right) second order evolution equation, i.e.

\ddot{x}(t)+\dfrac{\alpha}{t}\dot{x}(t)+\beta\nabla^{2}f(x(t))\dot{x}(t)=-\left(1+\dfrac{\beta}{t}\right)\nabla f(x(t)) (16)

On the other hand, for t>0 consider the non-autonomous first order dynamical system :

\begin{cases}\dot{x}(t)=\dfrac{\alpha-1}{t}(v(t)-x(t))-\beta\nabla f(x(t))\\ \dot{v}(t)=-\dfrac{t}{\alpha-1}\nabla f(x(t))\end{cases} (17)

Then, the dynamical systems (16) and (17) are equivalent, in the sense that each solution of the second order dynamical system is a solution of the first order system and vice versa.

Proof .

Let (x(t),v(t))\in\mathbb{R}^{N}\times\mathbb{R}^{N} be a solution of (17). We will show that x=x(t) also satisfies the Hessian driven damping system (16). Firstly, using the fact that

v(t)=x(t)+\dfrac{t}{\alpha-1}\left(\dot{x}(t)+\beta\nabla f(x(t))\right),

by differentiation we obtain that

\dot{v}(t)=\dot{x}(t)+\dfrac{1}{\alpha-1}\left(\dot{x}(t)+\beta\nabla f(x(t))\right)+\dfrac{t}{\alpha-1}\left(\ddot{x}(t)+\beta\nabla^{2}f(x(t))\dot{x}(t)\right).

Since \dot{v}(t)=-\dfrac{t}{\alpha-1}\nabla f(x(t)), one obtains that

-\dfrac{t}{\alpha-1}\nabla f(x(t))=\dot{x}(t)+\dfrac{1}{\alpha-1}\dot{x}(t)+\dfrac{\beta}{\alpha-1}\nabla f(x(t))+\dfrac{t}{\alpha-1}\ddot{x}(t)+\dfrac{\beta t}{\alpha-1}\nabla^{2}f(x(t))\dot{x}(t)\ ,

which is equivalent to (16).
On the other hand, let x=x(t) be a solution of the Hessian damping system (16). We will show that, if we define v=v(t) as in the first equation of (17), then the derivative of the function v satisfies the second equation of (17). In this sense, by the definition of v=v(t), we have that

v(t)=x(t)+\dfrac{t}{\alpha-1}\left(\dot{x}(t)+\beta\nabla f(x(t))\right)\ .

By differentiation, as before, we obtain that

\dot{v}(t)=\dot{x}(t)+\dfrac{1}{\alpha-1}\left(\dot{x}(t)+\beta\nabla f(x(t))\right)+\dfrac{t}{\alpha-1}\left(\ddot{x}(t)+\beta\nabla^{2}f(x(t))\dot{x}(t)\right).

Using the fact that x=x(t) satisfies (16), it follows that

\dot{v}(t)=\dot{x}(t)+\dfrac{1}{\alpha-1}\dot{x}(t)+\dfrac{\beta}{\alpha-1}\nabla f(x(t))+\dfrac{t}{\alpha-1}\left(-\dfrac{\alpha}{t}\dot{x}(t)-\left(1+\dfrac{\beta}{t}\right)\nabla f(x(t))\right)\ .

Finally, this is exactly the second equation of (17). \blacksquare

Now, the following theorem is the main result of this subsection. In [33], the authors showed the equivalence between (AGM2α) and (NaG) for the case when \alpha=3. Moreover, we emphasize that their method closely resembles the forward Euler method applied to the dynamical system (17); this intuitive idea was pointed out in the authors' work. On the other hand, since under the gradient we use a predictor y_{n} and not the true value x_{n} at the discretization time t_{n}, this method cannot be interpreted as a proper discretization of the first order dynamical system. Following this idea, we will show that Nesterov's method (NaG) can be viewed as a discrete counterpart of (17) using the idea of Lie-Trotter splitting. In fact, we show that Nesterov's accelerated gradient method is a Lie-Trotter splitting applied to a decomposition consisting of three subproblems at time t_{n}=nh.

Theorem 2.

Nesterov's accelerated gradient method (AGM2α) is the numerical Lie-Trotter splitting discretization applied to two non-autonomous subproblems and one autonomous subproblem at the discretization time t_{n}=nh. Furthermore, an explicit forward Euler method is applied to each continuous subproblem.

Proof .

(I.) The first step of the proof is inspired by the work of A.M. Oberman and M. Laborde [33]. We will present an alternative form of Nesterov's algorithm (AGM2α) with general vanishing damping coefficient \alpha>1 which, as we will observe, is very similar to the continuous counterpart (17). For this, let \alpha>1 and consider Nesterov's discretization scheme (AGM2α). At the same time, consider the following alternative form of Nesterov's accelerated gradient method :

\begin{cases}x_{n+1}=x_{n}+\dfrac{(\alpha-1)h}{t_{n}}(v_{n}-x_{n})-\tilde{\beta}\nabla f(y_{n})\\ v_{n+1}=v_{n}-\dfrac{ht_{n}}{\alpha-1}\nabla f(y_{n})\end{cases} (NaG)

Moreover, let \tilde{\beta}=h^{2} and y_{n}=x_{n}+\dfrac{h(\alpha-1)}{t_{n}}(v_{n}-x_{n}). Then, we show that the algorithms (AGM2α) and (NaG) are equivalent. Furthermore, t_{n}=h(n+\alpha) is equivalent to \alpha_{n}=\dfrac{n}{n+\alpha} and t_{n}=hn is equivalent to the choice \alpha_{n}=\dfrac{n-\alpha}{n}. We begin by rewriting the discretization scheme (NaG) as follows

\begin{cases}x_{n+1}-x_{n}-\dfrac{(\alpha-1)h}{t_{n}}v_{n}+\dfrac{(\alpha-1)h}{t_{n}}x_{n}=-h^{2}\nabla f(y_{n})\\ \\ v_{n+1}-v_{n}=-\dfrac{ht_{n}}{\alpha-1}\nabla f(y_{n})\end{cases}

Multiplying the second equation by \dfrac{-h(\alpha-1)}{t_{n}} and adding it to the first one, we obtain that :

v_{n+1}=x_{n}+\dfrac{t_{n}}{h(\alpha-1)}(x_{n+1}-x_{n})\ .

Now, using the fact that y_{n}=x_{n}+\dfrac{h(\alpha-1)}{t_{n}}v_{n}-\dfrac{h(\alpha-1)}{t_{n}}x_{n} and that v_{n}=x_{n-1}+\dfrac{t_{n-1}}{h(\alpha-1)}(x_{n}-x_{n-1}), it follows that

y_{n}=x_{n}+(x_{n}-x_{n-1})\left(\dfrac{t_{n-1}}{t_{n}}-\dfrac{h(\alpha-1)}{t_{n}}\right)\ .

In this way, we obtain Nesterov's algorithm (AGM2α), with the inertial coefficient \alpha_{n}=\dfrac{1}{t_{n}}\left(t_{n-1}-h(\alpha-1)\right). Finally, if we take the discretization time to be t_{n}=nh, where h is the stepsize, then \alpha_{n}=\dfrac{n-\alpha}{n}. On the other hand, the choice t_{n}=h(n+\alpha) leads to \alpha_{n}=\dfrac{n}{n+\alpha}. These two represent, in general, the classical choices for the inertial sequence (\alpha_{n})_{n\in\mathbb{N}}.
(II.) Now, the second step is to split the first order dynamical system (17) into the following subproblems:

(P_{1}):\begin{cases}\dot{x}(t)=\dfrac{\alpha-1}{t}(v(t)-x(t))\\ \dot{v}(t)=-\dfrac{t}{\alpha-1}\nabla f(x(t))\end{cases}\quad(P_{2}):\begin{cases}\dot{x}(t)=-\beta\nabla f(x(t))\\ \dot{v}(t)=0\end{cases}

Using the notation from the first section, the splitting discrete solution is \psi(t_{n})=\tilde{\phi}^{[2]}(t_{n})\circ\tilde{\phi}^{[1]}(t_{n}). We consider (\tilde{x}_{n+1},\tilde{v}_{n+1})^{T} to be the numerical solution of the first subproblem (P_{1}) at time t_{n}=nh. Then, applying a forward Euler discretization on (P_{2}) and taking \beta=h, one obtains that

\begin{cases}x_{n+1}=\tilde{x}_{n+1}-h^{2}\nabla f(\tilde{x}_{n+1})\\ v_{n+1}=\tilde{v}_{n+1}\end{cases}

where (x_{n+1},v_{n+1})^{T} is the current splitting solution at time t_{n}=nh. Now, also using a forward Euler discretization, we find the numerical solution (\tilde{x}_{n+1},\tilde{v}_{n+1})^{T} at time t_{n}=nh for the continuous subproblem (P_{1}). As before, we split the subproblem (P_{1}) into two non-autonomous subproblems, the first one keeping the velocity v=v(t_{n}) constant and the second one preserving the value of x=x(t_{n}). So, let

(P_{1})_{1}:\begin{cases}\dot{x}(t)=\dfrac{\alpha-1}{t}(v(t)-x(t))\\ \dot{v}(t)=0\end{cases}\quad(P_{1})_{2}:\begin{cases}\dot{x}(t)=0\\ \dot{v}(t)=\dfrac{-t}{\alpha-1}\nabla f(x(t))\end{cases}

Then, the numerical Lie-Trotter splitting on (P_{1}) at the discrete time t_{n}=nh is defined as \tilde{\phi}^{[1]}(t_{n})=(\tilde{\phi}^{[12]}\circ\tilde{\phi}^{[11]})(t_{n}), where \tilde{\phi}^{[1i]} is the forward Euler operator that approximates the flow of the subproblem (P_{1})_{i}. If we apply the forward Euler numerical scheme to both of the new subproblems, we obtain the following :

\begin{cases}\tilde{x}_{n+1/2}=x_{n}+\dfrac{h(\alpha-1)}{t_{n}}(v_{n}-x_{n})\\ \tilde{v}_{n+1/2}=v_{n}\end{cases}\text{ and }\begin{cases}\tilde{x}_{n+1}=\tilde{x}_{n+1/2}\\ \tilde{v}_{n+1}=\tilde{v}_{n+1/2}-\dfrac{ht_{n}}{\alpha-1}\nabla f(\tilde{x}_{n+1/2})\end{cases}

Then, it follows that the Lie-Trotter approximation to (P_{1}) can be written as follows

\begin{cases}\tilde{x}_{n+1}=x_{n}+\dfrac{h(\alpha-1)}{t_{n}}(v_{n}-x_{n})\\ \tilde{v}_{n+1}=v_{n}-\dfrac{ht_{n}}{\alpha-1}\nabla f\left(x_{n}+\dfrac{h(\alpha-1)}{t_{n}}\left(v_{n}-x_{n}\right)\right)\end{cases}

where (x_{n},v_{n})^{T} is the approximation of the solution of the continuous problem at time t_{n-1}=(n-1)h and, since we are employing a sequential-type splitting, it is the numerical solution at time t_{n-1}=(n-1)h of (P_{2}). In this way, putting everything together, we obtain that

\begin{cases}x_{n+1}=x_{n}+\dfrac{h(\alpha-1)}{t_{n}}(v_{n}-x_{n})-h^{2}\nabla f\left(x_{n}+\dfrac{h(\alpha-1)}{t_{n}}\left(v_{n}-x_{n}\right)\right)\\ v_{n+1}=v_{n}-\dfrac{ht_{n}}{\alpha-1}\nabla f\left(x_{n}+\dfrac{h(\alpha-1)}{t_{n}}\left(v_{n}-x_{n}\right)\right)\end{cases}

By taking y_{n}=x_{n}+\dfrac{h(\alpha-1)}{t_{n}}(v_{n}-x_{n}) and using the first step of the proof, we observe that the conclusion follows easily. \blacksquare

Finally, we give some observations regarding the Lie-Trotter splitting that was employed in the proof presented before.

Remark 3.

The numerical solution at time t_{n}=nh can be written as \psi(t_{n})=\left(\tilde{\phi}^{[2]}\circ\left(\tilde{\phi}^{[12]}\circ\tilde{\phi}^{[11]}\right)\right)(t_{n}), where the three operators are the approximations of the flows of the three subproblems considered above. Using the ideas that we have presented, we observe that all of the approximations are made at the discrete time t_{n}=nh. Consequently, in contrast to the Strang-Marchuk splitting, we have used the stepsize h and not h/2 in the numerical solutions of the subproblems. Finally, we observe that the splitting technique has the advantage that, at a fixed discrete time t_{n}, the same stepsize and the same value of t_{n} are used in all of the discretized subproblems.
Lastly, we point out that, in the present proof, we have considered the following technique: we have taken the dynamical system (17) and then we have applied the Lie-Trotter splitting with the choice \beta=h. Then, we have shown the equivalence with (NaG). But we recall that in (NaG) we have chosen \tilde{\beta}=h^{2}, and this is valid since \tilde{\beta} is a parameter of the discretization scheme and does not correspond to a parameter of the dynamical system.
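The equivalence established in Theorem 2 can also be checked numerically. The following minimal sketch (on an assumed quadratic objective, with illustrative values of h and \alpha) runs (AGM2α) with \alpha_{n}=n/(n+\alpha) and the alternative form (NaG) with \tilde{\beta}=h^{2}, t_{n}=h(n+\alpha) and v_{1} recovered from x_{0}, x_{1}, and verifies that the two sequences of iterates coincide.

```python
import numpy as np

# Assumed test problem and parameters; only meant to illustrate Theorem 2 numerically.
A = np.diag([1.0, 10.0])
grad_f = lambda x: A @ x
h, alpha = 0.1, 3.0
x0, x1 = np.array([3.0, -2.0]), np.array([2.9, -1.9])

# (AGM2_alpha) with alpha_n = n / (n + alpha), started from x_0, x_1.
xp, xc = x0, x1
for n in range(1, 50):
    y = xc + n / (n + alpha) * (xc - xp)
    xp, xc = xc, y - h**2 * grad_f(y)

# (NaG) with t_n = h(n + alpha); v_1 follows from v_n = x_{n-1} + t_{n-1}/(h(alpha-1)) (x_n - x_{n-1}).
x, v = x1.copy(), x0 + alpha / (alpha - 1.0) * (x1 - x0)
for n in range(1, 50):
    t_n = h * (n + alpha)
    y = x + h * (alpha - 1.0) / t_n * (v - x)
    x, v = y - h**2 * grad_f(y), v - h * t_n / (alpha - 1.0) * grad_f(y)

print(np.allclose(xc, x))   # True: the two schemes generate the same iterates
```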

2.2 The role of symplectic methods in optimization algorithms

In this subsection we give a brief review concerning the role of symplectic integrators in the setting of unconstrained optimization problems. Furthermore, we point out that many classical inertial algorithms that are based upon dynamical systems with vanishing or with constant damping can be interpreted as a combination between the Lie-Trotter splitting and suitable choices of symplectic methods. First, let us consider the heavy ball with friction model (HBFγ) and its natural discretization, namely Polyak's algorithm given by (PIMγ). In [22], it was pointed out that the momentum method can be interpreted as a splitting discretization between conservative and dissipative vector fields. Indeed, consider the subproblems associated to (HBFγ) as follows

(P_{1}):\begin{cases}\dot{x}(t)=0\\ \dot{v}(t)=-\gamma v\end{cases}\text{ and }(P_{2}):\begin{cases}\dot{x}(t)=v\\ \dot{v}(t)=-\nabla f(x(t))\end{cases}

where the Hamiltonian H(x,v)=\dfrac{\|v\|^{2}}{2}+f(x) is the energy of the entire dynamical system. The first subproblem (P_{1}) represents the dissipative system and the second one, i.e. (P_{2}), is the conservative system. Then, if one applies the explicit forward Euler method on (P_{1}) and the symplectic Euler method (SE2) on (P_{2}), it follows that the Lie-Trotter discretization scheme at time t_{n}=nh, i.e. \psi(t_{n})=\tilde{\phi}^{[2]}(t_{n})\circ\tilde{\phi}^{[1]}(t_{n}), becomes

\begin{cases}y_{n}=x_{n}+h(1-h\gamma)v_{n}\\ x_{n+1}=y_{n}-h^{2}\nabla f(x_{n})\end{cases}

which is an alternative form of writing the momentum method (PIMγ), using the idea that, from the symplectic discretization of the second subproblem, v_{n+1}=\dfrac{x_{n+1}-x_{n}}{h} is the discretization of the velocity of the underlying system; a numerical check of this identity is sketched below.
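The following minimal sketch, on an assumed quadratic objective and with illustrative parameter values, composes a forward Euler step on the dissipative part with an (SE2) step on the conservative part and confirms that this reproduces the (PIMγ) iterates once the velocity is identified with v_{n}=(x_{n}-x_{n-1})/h.

```python
import numpy as np

# Assumed quadratic test function and parameters, used only for this consistency check.
A = np.diag([1.0, 10.0])
grad_f = lambda x: A @ x
h, gamma = 0.1, 2.0

x_prev, x = np.array([3.0, -2.0]), np.array([2.9, -1.9])
xs, vs = x.copy(), (x - x_prev) / h            # splitting state (x_n, v_n)

for n in range(50):
    # (PIM_gamma), written with the inertial term x_n - x_{n-1}.
    y = x + (1.0 - h * gamma) * (x - x_prev)
    x_prev, x = x, y - h**2 * grad_f(x)

    # Lie-Trotter step: forward Euler on v' = -gamma v, then (SE2) on the conservative part.
    vs = (1.0 - h * gamma) * vs
    vs = vs - h * grad_f(xs)                   # (SE2): v-update with the old x ...
    xs = xs + h * vs                           # ... then x-update with the new v

print(np.allclose(x, xs))                      # True: the same iterates
```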
Now, we turn our attention to the dynamical system that models Nesterov's accelerated gradient method (AGM2α). This dynamical system is the true continuous counterpart of Nesterov's algorithm, since under the gradient it contains a continuous version of the inertial term y_{n}, i.e. this inertial term is a discretization of a combination between the exact flow and its first order derivative. For brevity, we recall the idea behind [29], where the authors considered Nesterov's method as a combination of a Lie-Trotter splitting and a symplectic Euler method. For this, we consider the more general vanishing damping system (Extended-AVD).
It is obvious that, taking \gamma=h, one obtains Nesterov's algorithm (AGM2α). Furthermore, (Extended-AVD) can be written as a first order evolution equation, using the idea of velocity, as we have pointed out in the case of the heavy ball system :

\begin{cases}\dot{x}(t)=v(t)\\ \dot{v}(t)=-\dfrac{\alpha}{t}v(t)-\nabla f\left(x\left(t\right)+\left(\gamma-\dfrac{\alpha\gamma^{2}}{t}\right)v(t)\right)\end{cases} (18)

Since the above system has no inherent energetic structure, in [29] the term \nabla f(x(t)) was added and subtracted, so that one can use, as in the case of the heavy ball with friction system, the dissipative and conservative parts of the system. Thus, the following subproblems were considered :

(P_{1}):\begin{cases}\dot{x}(t)=0\\ \dot{v}(t)=-\dfrac{\alpha}{t}v+\nabla f(x(t))-\nabla f\left(x\left(t\right)+\left(\gamma-\dfrac{\alpha\gamma^{2}}{t}\right)v(t)\right)\end{cases}\text{ and }(P_{2}):\begin{cases}\dot{x}(t)=v\\ \dot{v}(t)=-\nabla f(x(t))\end{cases}

Here, the first subproblem (P_{1}) represents the non-potential forces and the second subproblem (P_{2}) stands for the potential forces. Moreover, as in the case of the heavy ball with friction system, applying a forward Euler method to the first subproblem and the symplectic Euler integrator (SE2) to the second one leads to Nesterov's method (AGM2α).
Now, we shift our focus to the Hessian-driven damping dynamical system (DIN-AVDαt,β,1+βt)\left(\text{DIN-AVD}_{\tfrac{\alpha}{t},\beta,1+\tfrac{\beta}{t}}\right). First of all, we will present a natural discretization of this dynamical system that resembles a Polyak-type counterpart to the IGAHD algorithm from [4]. Second of all, we show that a discretization resembling the IGAHD algorithm can be obtained, as in the case of Nesterov’s method, by a combination of a Lie-Trotter splitting method with a symplectic Euler integrator, but for a Hessian-based system endowed with a perturbation under the gradient as in (Extended-AVD). We consider the (DIN-AVDαt,β,1+βt)\left(\text{DIN-AVD}_{\tfrac{\alpha}{t},\beta,1+\tfrac{\beta}{t}}\right). The natural discretization this system is a Polyak-type method, where at time tn=nht_{n}=nh under the gradient appears the old iteration xnx_{n}. We approximate 2f(x(t))x˙(t)\nabla^{2}f(x(t))\dot{x}(t) by the difference of consecutive gradients as in the work of Attouch et. al. [4] and we consider the discretization of the continuous system, as follows :

\displaystyle\dfrac{x_{n+1}-2x_{n}+x_{n-1}}{h^{2}}+\dfrac{\alpha}{t_{n}}\dfrac{x_{n}-x_{n-1}}{h}+\dfrac{\beta}{h}\left[\nabla f(x_{n})-\nabla f(x_{n-1})\right]=-\nabla f(x_{n})-\dfrac{\beta}{t_{n}}\nabla f(x_{n})
\displaystyle\dfrac{x_{n+1}-2x_{n}+x_{n-1}}{h^{2}}+\dfrac{\alpha}{nh}\dfrac{x_{n}-x_{n-1}}{h}+\dfrac{\beta}{h}\left[\nabla f(x_{n})-\nabla f(x_{n-1})\right]=-\nabla f(x_{n})-\dfrac{\beta}{nh}\nabla f(x_{n})

By some easy computations, it follows that

x_{n+1}=x_{n}+\left(1-\dfrac{\alpha}{n}\right)(x_{n}-x_{n-1})-\beta h\left[\nabla f(x_{n})-\nabla f(x_{n-1})\right]-\dfrac{\beta h}{n}\nabla f(x_{n})-h^{2}\nabla f(x_{n})

Define \alpha_{n}:=1-\dfrac{\alpha}{n}. Then, we obtain that

\displaystyle\begin{cases}&y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})-\beta h\left[\nabla f(x_{n})-\nabla f(x_{n-1})\right]-\dfrac{\beta h}{n}\nabla f(x_{n})\\ &x_{n+1}=y_{n}-h^{2}\nabla f(x_{n})\end{cases} (Polyak-IGAHD)

Now, the only result of this subsection consists in showing that an algorithm similar to IGAHD (see [4]), with a predictor y_{n} under the gradient, can be obtained, as in the case of (Extended-AVD), by applying a sequential splitting method with a symplectic Euler integrator on the second subproblem. Furthermore, the only difference between our algorithm and IGAHD is that the latter contains \nabla f(x_{n-1}) instead of \nabla f(x_{n}). The choice \nabla f(x_{n-1}) seems unnatural, since at the discrete time t_{n}=nh the gradient of the flow is approximated by \nabla f(x_{n-1}) and not by the current gradient value. Nevertheless, by using the same approximation technique, we can easily obtain a discretization scheme similar to the IGAHD algorithm.
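For reference, the (Polyak-IGAHD) iteration above can be written as a short loop. The sketch below assumes nothing beyond the formula itself; the step size h, the parameters α, β and the absence of any test problem are placeholders chosen only for illustration.

```python
import numpy as np

def polyak_igahd(grad, x0, h=0.1, alpha=3.0, beta=1.0, n_iter=300):
    """(Polyak-IGAHD): the gradient step in the second line is taken at x_n, not at y_n."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    g_prev = grad(x_prev)
    for n in range(1, n_iter + 1):
        a_n = 1.0 - alpha / n
        g = grad(x)
        y = x + a_n * (x - x_prev) - beta * h * (g - g_prev) - beta * h / n * g
        x_prev, x, g_prev = x, y - h ** 2 * g, g
    return x
```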

Theorem 4.

Let f\in C^{2}(\mathbb{R}^{N},\mathbb{R}) and consider the following Hessian driven damping second order system:

\displaystyle\ddot{x}(t)+\dfrac{\alpha}{t}\dot{x}(t)+\beta\nabla^{2}f(x(t))\dot{x}(t)=-\dfrac{\beta}{t}\nabla f(x(t))-\nabla f\left(x(t)+\gamma\left(1-\dfrac{\alpha\gamma}{t}\right)\dot{x}(t)-\beta\gamma^{2}\nabla^{2}f(x(t))\dot{x}(t)-\dfrac{\beta\gamma^{2}}{t}\nabla f(x(t))\right)

Then, a Lie-Trotter splitting discretization, with a forward Euler method on the first subproblem and a symplectic integrator on the second one, leads to an IGAHD-type algorithm.

Proof .

As in the case of (Extended-AVD), we consider two subproblems, one containing the so-called non-potential forces and the other one containing the potential forces:

\displaystyle(P_{1}):\begin{cases}\dot{x}=0\\ \dot{v}=-\dfrac{\alpha}{t}v(t)-\beta\nabla^{2}f(x(t))v(t)-\dfrac{\beta}{t}\nabla f(x(t))-\nabla f\left(\mathcal{A}(t,x(t),v(t))\right)+\nabla f(x(t))\end{cases}

where

\mathcal{A}(t,x(t),v(t))=x(t)+\gamma\left(1-\dfrac{\alpha\gamma}{t}\right)v(t)-\beta\gamma^{2}\nabla^{2}f(x(t))v(t)-\dfrac{\beta\gamma^{2}}{t}\nabla f(x(t))

and the second subproblem as

\displaystyle(P_{2}):\begin{cases}\dot{x}=v\\ \dot{v}=-\nabla f(x(t))\end{cases}

Taking \gamma=h and defining \alpha_{n}:=1-\dfrac{\alpha h}{t_{n}}, we obtain that

\displaystyle\begin{cases}x_{n+1/2}=x_{n}\\ v_{n+1/2}=\alpha_{n}v_{n}-\beta h\nabla^{2}f(x_{n})v_{n}-\dfrac{\beta h}{t_{n}}\nabla f(x_{n})+h\nabla f(x_{n})-h\nabla f\left(x_{n}+h\alpha_{n}v_{n}-\beta h^{2}\nabla^{2}f(x_{n})v_{n}-\dfrac{\beta h^{2}}{t_{n}}\nabla f(x_{n})\right)\end{cases}

Furthermore, for the second subproblem (P_{2}), we apply the symplectic Euler integrator (SE2) and obtain that

\displaystyle\begin{cases}x_{n+1}=x_{n+1/2}+hv_{n+1}\\ v_{n+1}=v_{n+1/2}-h\nabla f(x_{n+1/2})\end{cases}

Using the fact that, for t_{n}=nh, the splitting solution is the discrete solution of (P_{2}), namely (x_{n+1},v_{n+1})^{T}\in\mathbb{R}^{N}\times\mathbb{R}^{N}, we obtain from the first equation given above that v_{n}=\dfrac{x_{n}-x_{n-1}}{h}. Using this, together with the approximation of the Hessian term \nabla^{2}f(x_{n})v_{n} at time t_{n}=nh by two consecutive gradients, it follows that

\displaystyle\begin{cases}&y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})-\beta h\left[\nabla f(x_{n})-\nabla f(x_{n-1})\right]-\dfrac{\beta h}{n}\nabla f(x_{n})\\ &x_{n+1}=y_{n}-h^{2}\nabla f(y_{n})\end{cases} (IGAHD-type)

\blacksquare

We end this section by emphasizing the role of choosing the appropriate first order evolution equation for the Hessian driven damping system in the splitting methods.

Remark 5.

1.) In the second order dynamical system from Theorem 4, under the gradient at \mathcal{A}(t,x(t),\dot{x}(t)), we can approximate \dfrac{\beta\gamma^{2}}{t}\nabla f(x(t)) by \dfrac{\beta h^{2}}{t_{n}}\nabla f(x_{n-1}). In this way, we obtain the classical IGAHD algorithm from [4].
2.) The idea behind Theorem 4 is that we separate the perturbed gradient -\dfrac{\beta}{t}\nabla f(x(t)) from the unperturbed gradient value. Moreover, the latter is similar to (Extended-AVD) since, under the gradient, the value contains a combination of both the exact flow x=x(t) and its first order derivative \dot{x}=\dot{x}(t).
3.) In [36], Shi et al. have studied a symplectic Euler method applied to a full Hessian driven damping system similar to DIN-AVD_{\tfrac{\alpha}{t},\beta(t),b(t)}. The key difference between our analysis and theirs is that our algorithms are also based upon a Lie-Trotter discretization in combination with symplectic methods.

2.3 Discussion on some splitting techniques for the extended AVD dynamical system

Until now we have used the symplectic Euler method (SE2); in the present subsection we consider, for the dynamical system (Extended-AVD), a Lie-Trotter splitting discretization scheme that combines a forward Euler method on the first subproblem with the symplectic Euler integrator (SE1) on the second one. In this way, we consider the following two subproblems:

\displaystyle(P_{1}):\begin{cases}\dot{x}(t)=0\\ \dot{v}(t)=-\dfrac{\alpha}{t}v+\nabla f(x(t))-\nabla f\left(x\left(t\right)+\left(\gamma-\dfrac{\alpha\gamma^{2}}{t}\right)v(t)\right)\end{cases}\text{ and }(P_{2}):\begin{cases}\dot{x}(t)=v\\ \dot{v}(t)=-\nabla f(x(t))\end{cases}

By using a forward Euler discretization on (P_{1}) and setting \gamma=h, one obtains that

\displaystyle\begin{cases}&x_{n+1/2}=x_{n}\\ &v_{n+1/2}=\left(1-\dfrac{\alpha h}{t_{n}}\right)v_{n}-h\nabla f\left(x_{n}+h\left(1-\dfrac{\alpha h}{t_{n}}\right)v_{n}\right)+h\nabla f(x_{n})\end{cases}

Also, applying the symplectic Euler method (SE1) on the subproblem (P_{2}), it follows that

\displaystyle\begin{cases}&x_{n+1}=x_{n+1/2}+hv_{n+1/2}\\ &v_{n+1}=v_{n+1/2}-h\nabla f(x_{n+1})\end{cases}

Now, as before, using the discretization time t_{n}=nh and denoting \alpha_{n}:=1-\dfrac{\alpha}{n}, we introduce the auxiliary term y_{n}=x_{n}+h\alpha_{n}v_{n}. So, we get that

\displaystyle\begin{cases}&x_{n+1}=x_{n}+h\alpha_{n}v_{n}-h^{2}\nabla f(y_{n})+h^{2}\nabla f(x_{n})\\ &v_{n+1}=\alpha_{n}v_{n}-h\nabla f(y_{n})+h\nabla f(x_{n})-h\nabla f(x_{n+1})\end{cases}

By multiplying the second equation by h and subtracting the two equations, we obtain the perturbed velocity, which differs from the one obtained in the case of (SE2) by an extra gradient term, namely:

v_{n}=\dfrac{x_{n}-x_{n-1}}{h}-h\nabla f(x_{n})\,.

Finally, we obtain our first algorithm, i.e.

\displaystyle\begin{cases}&y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})-h^{2}\alpha_{n}\nabla f(x_{n})\\ &x_{n+1}=y_{n}-h^{2}\nabla f(y_{n})+h^{2}\nabla f(x_{n})\end{cases} (LT-SE1)

Now, we consider a similar approach, namely on subproblem (P_{1}) we apply the forward Euler method, and on (P_{2}) we consider the Stormer-Verlet algorithm (SV2). As before, for the first subproblem we have that

\displaystyle\begin{cases}&x_{n+1/2}=x_{n}\\ &v_{n+1/2}=\left(1-\dfrac{\alpha h}{t_{n}}\right)v_{n}-h\nabla f\left(x_{n}+h\left(1-\dfrac{\alpha h}{t_{n}}\right)v_{n}\right)+h\nabla f(x_{n})\end{cases}

Furthermore, using (SV2) on the second subproblem, it follows that

\displaystyle\begin{cases}&\tilde{v}_{n+1/2}=v_{n+1/2}-\dfrac{h}{2}\nabla f(x_{n+1/2})\\ &x_{n+1}=x_{n+1/2}+h\tilde{v}_{n+1/2}\\ &v_{n+1}=\tilde{v}_{n+1/2}-\dfrac{h}{2}\nabla f(x_{n+1})\end{cases}

As before, using y_{n}=x_{n}+h\alpha_{n}v_{n}, where t_{n} is the discretization time and \alpha_{n} represents the inertial momentum coefficient, it follows that

\displaystyle\begin{cases}&x_{n+1}=x_{n}+h\alpha_{n}v_{n}-h^{2}\nabla f(y_{n})+h^{2}\nabla f(x_{n})-\dfrac{h^{2}}{2}\nabla f(x_{n})\\ &v_{n+1}=\alpha_{n}v_{n}-h\nabla f(y_{n})+h\nabla f(x_{n})-\dfrac{h}{2}\nabla f(x_{n})-\dfrac{h}{2}\nabla f(x_{n+1})\end{cases}

In a similar spirit with the determination of the algorithm (LT-SE1), we obtain our second algorithm, i.e.

\displaystyle\begin{cases}&y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})-\dfrac{h^{2}}{2}\alpha_{n}\nabla f(x_{n})\\ &x_{n+1}=y_{n}-h^{2}\nabla f(y_{n})+\dfrac{h^{2}}{2}\nabla f(x_{n})\end{cases} (LT-SV2)

We emphasize that our newly introduced algorithms (LT-SE1) and (LT-SV2) differ only by a factor of \dfrac{1}{2} in the coefficients of the gradient \nabla f(x_{n}); in the case of the Stormer-Verlet integrator, the perturbed velocity also contains an \dfrac{h}{2} term which, intuitively, damps large gradient values more effectively.
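Both schemes can be tested with the same driver, since they differ only by the factor 1/2 discussed above. The following sketch (with an arbitrary quadratic and arbitrary h, α chosen only for illustration) implements the common pattern; c = 1 gives (LT-SE1) and c = 1/2 gives (LT-SV2).

```python
import numpy as np

def lt_scheme(grad, x0, h=0.1, alpha=3.0, c=1.0, n_iter=300):
    """c = 1.0 -> (LT-SE1); c = 0.5 -> (LT-SV2)."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for n in range(1, n_iter + 1):
        a_n = 1.0 - alpha / n
        g = grad(x)
        y = x + a_n * (x - x_prev) - c * h ** 2 * a_n * g      # perturbed inertial term
        x_prev, x = x, y - h ** 2 * grad(y) + c * h ** 2 * g   # bias correction c*h^2*grad(x_n)
    return x

A = np.diag([1.0, 10.0])
print(lt_scheme(lambda x: A @ x, [1.0, 1.0], c=1.0))   # LT-SE1
print(lt_scheme(lambda x: A @ x, [1.0, 1.0], c=0.5))   # LT-SV2
```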
Now, we shift our attention to an algorithm developed by N.C. Nguyen et al. in [32], which was used as a residual method for solving systems of equations. Using a variant with constant stepsize h, the residual method is the following:

\displaystyle\begin{cases}&y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})-h^{2}(1+\alpha_{n})\nabla f(x_{n})\\ &x_{n+1}=y_{n}-h^{2}\nabla f(y_{n})\end{cases} (ARDM)

where the acronym stands for accelerated residual descent method. Also, N.C. Nguyen et al. have considered a restarting strategy, which is out of the scope of the present paper. Furthermore, their arguments in developing (ARDM) are based upon the stability theory of finite difference approximations and upon an approach similar to that of Su et al. (see [39]). Our next purpose is to introduce a new dynamical system such that the optimization algorithm (ARDM) can be considered as a Lie-Trotter splitting combining the forward Euler method and the symplectic Euler method (SE2), as in the case of the (Extended-AVD) dynamical system. For this, we consider the following second order evolution equation:

\displaystyle\ddot{x}(t)+\dfrac{\alpha}{t}\dot{x}(t)=-\nabla f\left(x\left(t\right)+\left(\gamma-\dfrac{\alpha\gamma^{2}}{t}\right)\dot{x}(t)-\gamma^{2}\left(2-\dfrac{\alpha\gamma}{t}\right)\nabla f(x(t))\right)-\left(2-\dfrac{\alpha\gamma}{t}\right)\nabla f(x(t)) (DynSys-ARDM)

As before, we divide this evolution equation into dissipative and conservative parts, i.e. into non-potential and potential forces:

\displaystyle(P_{1}):\begin{cases}\dot{x}(t)=0\\ \dot{v}(t)=-\dfrac{\alpha}{t}v(t)+\nabla f(x(t))-\nabla f\left(x\left(t\right)+\left(\gamma-\dfrac{\alpha\gamma^{2}}{t}\right)v(t)-\gamma^{2}\left(2-\dfrac{\alpha\gamma}{t}\right)\nabla f(x(t))\right)-\left(2-\dfrac{\alpha\gamma}{t}\right)\nabla f(x(t))\end{cases}

and

\displaystyle(P_{2}):\begin{cases}\dot{x}(t)=v\\ \dot{v}(t)=-\nabla f(x(t))\end{cases}

Using a forward Euler explicit method on (P_{1}) and taking \gamma=h, we obtain that

\displaystyle\begin{cases}&x_{n+1/2}=x_{n}\\ &v_{n+1/2}=\left(1-\dfrac{\alpha h}{t_{n}}\right)v_{n}-h\left(2-\dfrac{\alpha h}{t_{n}}\right)\nabla f(x_{n})-h\nabla f\left(x_{n}+h\left(1-\dfrac{\alpha h}{t_{n}}\right)v_{n}-h^{2}\left(2-\dfrac{\alpha h}{t_{n}}\right)\nabla f(x_{n})\right)+h\nabla f(x_{n})\end{cases}

On the other hand, using the symplectic Euler method (SE2) on (P_{2}), we arrive at the following equalities:

\displaystyle\begin{cases}&x_{n+1}=x_{n+1/2}+hv_{n+1}\\ &v_{n+1}=v_{n+1/2}-h\nabla f(x_{n+1/2})\end{cases}

Now, in contrast with the previous algorithm, at the discrete time t_{n}=nh, we consider the additional iteration y_{n}=x_{n}+h\alpha_{n}v_{n}-h^{2}(1+\alpha_{n})\nabla f(x_{n}). Using this notation, we get the following:

\displaystyle\begin{cases}&x_{n+1}=x_{n}+h\alpha_{n}v_{n}-h^{2}\nabla f(y_{n})-h^{2}(1+\alpha_{n})\nabla f(x_{n})\\ &v_{n+1}=\alpha_{n}v_{n}-h\nabla f(y_{n})-h(1+\alpha_{n})\nabla f(x_{n})\end{cases}

As in our splitting-based algorithms presented above, we multiply the second equation by h and, by subtracting the two equations, we obtain the algorithm (ARDM). Finally, we end this subsection with a remark that will play a key role in our theoretical analysis.

Remark 6.

So far, we have considered three algorithms, namely (ARDM), (LT-SE1) and (LT-SV2). The difference between the first one and the symplectic methods applied to the (Extended-AVD) dynamical system is that the accelerated residual method contains a perturbation h^{2}(1+\alpha_{n}) only in the additional inertial term y_{n}. The sequential Euler and Stormer-Verlet algorithms for the extended dynamical system with asymptotically vanishing damping contain a perturbation at y_{n} and also at the current iteration value x_{n+1}. In our terminology, if a term containing the gradient \nabla f(x_{n}) appears in x_{n+1}, we say that the algorithm has a bias correction term. This is the case for the (LT-SE1) and (LT-SV2) algorithms. On the other hand, we can say that (ARDM) is free of bias correction. This difference can also be seen in the definition of the inertial term, where for (ARDM), y_{n} contains h^{2}(1+\alpha_{n})\nabla f(x_{n}). So, we now consider a general form of inertial-type algorithms that contain two gradient evaluations and that encompasses (ARDM), (LT-SE1) and (LT-SV2):

\displaystyle\begin{cases}&y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})-\omega_{n}\nabla f(x_{n})\\ &x_{n+1}=y_{n}-h^{2}\nabla f(y_{n})+\gamma_{n}\nabla f(x_{n})\end{cases} (LT-Symplectic)

where \alpha_{n}=1-\dfrac{\alpha}{n} and t_{n}=nh. Also, we point out that \omega_{n} is the underlying weight that appears in the Lie-Trotter discretization and \gamma_{n} is the bias correction term. Last, we observe that for (LT-SE1) we have \omega_{n}=h^{2}\alpha_{n} and \gamma_{n}=h^{2}. Furthermore, for (LT-SV2), these values are halved. On the other hand, (ARDM) can be obtained by setting \omega_{n}=h^{2}(1+\alpha_{n}) and \gamma_{n}=0, for each n\in\mathbb{N}.
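A single routine therefore covers all three methods once ω_n and γ_n are supplied as sequences. The sketch below is a minimal implementation of (LT-Symplectic) in this spirit; the commented parameter choices reproduce (LT-SE1), (LT-SV2) and (ARDM), while the test data is arbitrary.

```python
import numpy as np

def lt_symplectic(grad, x0, h, alpha, omega, gamma, n_iter=300):
    """(LT-Symplectic): omega(n) is the Lie-Trotter weight, gamma(n) the bias correction."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for n in range(1, n_iter + 1):
        a_n = 1.0 - alpha / n
        g = grad(x)
        y = x + a_n * (x - x_prev) - omega(n) * g
        x_prev, x = x, y - h ** 2 * grad(y) + gamma(n) * g
    return x

h, alpha = 0.1, 3.0
# (LT-SE1): omega_n = h^2 * a_n,       gamma_n = h^2
# (LT-SV2): omega_n = h^2 * a_n / 2,   gamma_n = h^2 / 2
# (ARDM):   omega_n = h^2 * (1 + a_n), gamma_n = 0
A = np.diag([1.0, 10.0])
x_ardm = lt_symplectic(lambda x: A @ x, [1.0, 1.0], h, alpha,
                       omega=lambda n: h ** 2 * (2.0 - alpha / n),
                       gamma=lambda n: 0.0)
print(x_ardm)
```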

2.4 Fixed points, spurious solutions and exploding gradients: Heuristic interpretation

In this section, we present some intuitive explanations regarding the general algorithm of type LT-Symplectic. We consider the asymptotic behavior, the critical points and the equilibrium states related to our optimization methods. For the phenomena that we observe, we shall also present some remedies.
First of all, we start with the Nesterov algorithm AGM2α. Suppose that there exists x^{\ast}\in\mathbb{R}^{N} such that the sequence (x_{n})_{n\in\mathbb{N}} converges to x^{\ast}, and let \lim y_{n}=y^{\ast}. We observe that y^{\ast}=x^{\ast} and so \nabla f(x^{\ast})=0, which is equivalent to x^{\ast}\in\operatorname*{argmin}f. At the same time, define the operator T_{h}:\mathbb{R}^{N}\to\mathbb{R}^{N} as T_{h}(y)=y-h^{2}\nabla f(y). From all of this it follows that if \lim\limits_{n\to\infty}x_{n}=x^{\ast}, then x^{\ast}\in\operatorname*{argmin}f, i.e. x^{\ast}=T_{h}(x^{\ast}). So, the behavior of the Nesterov scheme is characterized by the fact that if the principal sequence converges, then the limit is both a minimizer of the convex function f and a fixed point of the associated operator T_{h}.
Now, we turn our attention to the IGAHD-type algorithm. Keeping in mind our notation from LT-Symplectic, let \omega_{n}=\dfrac{\beta h}{n}. In this case \lim\limits_{n\to\infty}\omega_{n}=0. The fact that the limit of the sequence (\omega_{n})_{n\in\mathbb{N}} is 0 will be crucial in our analysis. We shall use the fact that two consecutive gradient iterates represent a natural approximation of the Hessian term, and that \lim\limits_{n\to\infty}\left[\nabla f(x_{n})-\nabla f(x_{n-1})\right]=0, since \lim\limits_{n\to\infty}x_{n}=x^{\ast}. As before, we have denoted by x^{\ast} the limit of (x_{n})_{n\in\mathbb{N}}. Again, denoting by y^{\ast} the limit of (y_{n})_{n\in\mathbb{N}}, we find that, as in the case of AGM2α, x^{\ast}\in\operatorname*{argmin}f (here an important role is played by the fact that the coefficient \beta h of the difference between consecutive gradients is bounded). So, the same conclusion also holds for the IGAHD-type algorithm, regarding the fixed points of the operator T_{h} and the minimizers of f: if the main iteration of the algorithm converges, then the limit is also a minimizer of the convex objective function f.
We have said that the property \lim\limits_{n\to\infty}\omega_{n}=0 is crucial. We call this phenomenon annihilating the exploding gradient: if the limit were strictly positive, then the gradient at x_{n} would not vanish, and we would be left with a residual that, as time grows, tends to the gradient at x^{\ast}, which could be large if x^{\ast} were not a minimizer. For instance, let \omega=\lim\limits_{n\to\infty}\omega_{n}, with \omega>0. As before, let \lim\limits_{n\to\infty}x_{n}=x^{\ast} and \lim\limits_{n\to\infty}y_{n}=y^{\ast}. It is straightforward to show that y^{\ast}=x^{\ast}-\omega\nabla f(x^{\ast}), and so x^{\ast} verifies the following equation:

\displaystyle\nabla f(x^{\ast})=-\dfrac{h^{2}}{\omega}\nabla f(x^{\ast}-\omega\nabla f(x^{\ast})) (19)

It is easy to see that a minimizer satisfies 19. On the other hand, if x^{\ast} is not a minimizer, then we obtain so-called spurious roots. This means that not all solutions of 19 are minimizers of the objective function f. As in the previous cases, since x_{n+1}=y_{n}-h^{2}\nabla f(y_{n}), let T_{h} be defined as before. A simple analysis shows that the limit x^{\ast} of (x_{n})_{n\in\mathbb{N}} satisfies relation 19. Also, y^{\ast}=T_{h}(y^{\ast}) if and only if y^{\ast}=x^{\ast}-\omega\nabla f(x^{\ast})\in\operatorname*{argmin}f. This is the reason we also call these elements spurious fixed points: they are fixed points of T_{h}, but they can be represented as a perturbation of the limit point x^{\ast}.
Now, let us consider the ARDM algorithm. Using the same notation as before, we have \omega=2h^{2}, which is strictly positive. Carrying out the same analysis, we reach the following equation between x^{\ast} and y^{\ast}, i.e.

\displaystyle\nabla f(x^{\ast})=-\dfrac{1}{2}\nabla f(x^{\ast}-2h^{2}\nabla f(x^{\ast})) (20)

This comes from 19 by taking \omega=2h^{2}. Also, as in the previous discussion regarding the modification of \omega in IGAHD-type, we obtain the same conclusion regarding the spurious roots of ARDM.
The algorithm LT-SE1 also falls in this situation, with \omega=h^{2}>0. Letting \lim\limits_{n\to\infty}x_{n}=x^{\ast} and \lim\limits_{n\to\infty}y_{n}=y^{\ast}, one can show that we obtain the equation

\displaystyle\nabla f(x^{\ast})=-\nabla f(x^{\ast}-h^{2}\nabla f(x^{\ast})) (21)

So, in all three cases above, namely 19, 20 and 21, the limit point x^{\ast} is not necessarily a minimizer of the convex function f. We also want to point out that, for LT-SE1, we have x_{n+1}=y_{n}-h^{2}\nabla f(y_{n})+h^{2}\nabla f(x_{n}). In this case, let T_{h}:\mathbb{R}^{N}\times\mathbb{R}^{N}\to\mathbb{R}^{N} be defined as T_{h}(y,x)=y-h^{2}\nabla f(y)+h^{2}\nabla f(x). By simple computations, when x^{\ast} is the limit of (x_{n})_{n\in\mathbb{N}} and y^{\ast} is the limit of (y_{n})_{n\in\mathbb{N}}, we obtain T_{h}(y^{\ast},x^{\ast})=T_{h}(x^{\ast}-h^{2}\nabla f(x^{\ast}),x^{\ast})=x^{\ast}-h^{2}\nabla f(y^{\ast}). So, x^{\ast}=T_{h}(y^{\ast},x^{\ast}) is equivalent to y^{\ast}=x^{\ast}-h^{2}\nabla f(x^{\ast})\in\operatorname*{argmin}f. Nevertheless, let us notice that x^{\ast}=T_{h}(x^{\ast},x^{\ast}) means that x^{\ast} is a minimizer. So, it is worth pointing out that the limit point x^{\ast} minimizes f if and only if it is a fixed point of T_{h} on the cartesian product \mathbb{R}^{N}\times\mathbb{R}^{N}, i.e. a fixed point in the classical sense (on the diagonal set).
Bearing all this in mind, we present a remedy for the algorithms belonging to the class LT-Symplectic, such that they preserve fixed points, do not exhibit gradient explosion and do not introduce spurious roots. For LT-Symplectic, suppose that \lim\limits_{n\to\infty}\omega_{n}=0, \lim\limits_{n\to\infty}\gamma_{n}=\gamma\geq 0 and \gamma\neq h^{2}. As before, let x^{\ast} be the limit of (x_{n})_{n\in\mathbb{N}} and y^{\ast} be the limit of (y_{n})_{n\in\mathbb{N}}. Since \omega_{n} converges to 0, the gradient \nabla f(x^{\ast}) is annihilated in the inertial term and we obtain y^{\ast}=x^{\ast}. This leads to (\gamma-h^{2})\nabla f(x^{\ast})=0, so x^{\ast}\in\operatorname*{argmin}f. In this case, T_{h}(y,x)=y-h^{2}\nabla f(y)+\gamma\nabla f(x) and T_{h}(y^{\ast},x^{\ast})=x^{\ast} is equivalent to x^{\ast}\in\operatorname*{argmin}f. So, with the above conditions on the coefficients, we have a natural extension of the asymptotic behavior of the Nesterov algorithm AGM2α, where the operator T_{h}(y) is extended to the cartesian product by means of T_{h}(y,x).
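The heuristic above can also be probed numerically, by monitoring \|\nabla f(x_{n})\| along the iterates, once with a vanishing \omega_{n} and once with a constant \omega_{n}=\omega>0. The sketch below only sets up such an experiment on an arbitrary quadratic test problem; it does not, by itself, prove any of the claims of this subsection.

```python
import numpy as np

def residual_gradient(grad, x0, h, alpha, omega, gamma, n_iter=2000):
    """Run (LT-Symplectic) and return the gradient norm at the last iterate."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for n in range(1, n_iter + 1):
        a_n = 1.0 - alpha / n
        g = grad(x)
        y = x + a_n * (x - x_prev) - omega(n) * g
        x_prev, x = x, y - h ** 2 * grad(y) + gamma(n) * g
    return np.linalg.norm(grad(x))

A = np.diag([1.0, 50.0])
grad = lambda x: A @ x
h, alpha = 0.1, 3.0
print(residual_gradient(grad, [1.0, 1.0], h, alpha,
                        omega=lambda n: h ** 2 / n, gamma=lambda n: h ** 2))   # omega_n -> 0
print(residual_gradient(grad, [1.0, 1.0], h, alpha,
                        omega=lambda n: h ** 2, gamma=lambda n: h ** 2))       # constant omega
```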
We end this section by considering another, more specific remedy, namely we modify LT-SE1 explicitly so that it has the properties presented above. In the case of LT-SE1, by our symplectic approach, we have the perturbed velocity v_{n}=\dfrac{x_{n}-x_{n-1}}{h}-h\nabla f(x_{n}). This differs from the symplectic approach to Nesterov’s algorithm AGM2α in that, in our proposed scheme, \lim\limits_{n\to\infty}v_{n}=-h\nabla f(x^{\ast}). Regarding the Extended-AVD dynamical system, the velocity v satisfies \lim\limits_{t\to\infty}v(t)=0, i.e. the velocity of the continuous system approaches 0 as time grows. We want our discretization method to reflect this property as well. So, we want v_{n}=\dfrac{x_{n}-x_{n-1}}{h}-h\theta_{n-1}\nabla f(x_{n}), with \lim\limits_{n\to\infty}\theta_{n}=0. We recall that in the construction of LT-SE1 we have used the technique of adding and subtracting \nabla f(x(t)) in order to split the potential and non-potential forces. We use the same trick, but for \theta(t)\nabla f(x(t)), where \theta vanishes at \infty. Furthermore, we consider the discretized version, i.e. \theta(t_{n})\approx\theta_{n}.
As for LT-SE1, we consider a Lie-Trotter splitting discretization scheme for the dynamical system (Extended-AVD), combining a forward Euler method on the first subproblem with the symplectic Euler integrator (SE1) on the second one. Consider the following two subproblems:

\displaystyle(P_{1}):\begin{cases}\dot{x}(t)=0\\ \dot{v}(t)=-\dfrac{\alpha}{t}v+\theta(t)\nabla f(x(t))-\nabla f\left(x\left(t\right)+\left(\gamma-\dfrac{\alpha\gamma^{2}}{t}\right)v(t)\right)\end{cases}\text{ and }(P_{2}):\begin{cases}\dot{x}(t)=v\\ \dot{v}(t)=-\theta(t)\nabla f(x(t))\end{cases}

By using a forward Euler discretization on (P_{1}), it follows that

\displaystyle\begin{cases}&x_{n+1/2}=x_{n}\\ &v_{n+1/2}=\left(1-\dfrac{\alpha h}{t_{n}}\right)v_{n}-h\nabla f\left(x_{n}+h\left(1-\dfrac{\alpha h}{t_{n}}\right)v_{n}\right)+h\theta_{n}\nabla f(x_{n})\end{cases}

Also, applying the symplectic Euler method (SE1) on the subproblem (P_{2}), it follows that

\displaystyle\begin{cases}&x_{n+1}=x_{n+1/2}+hv_{n+1/2}\\ &v_{n+1}=v_{n+1/2}-h\theta_{n}\nabla f(x_{n+1})\end{cases}

Using the discrete time t_{n}=nh and keeping in mind that \alpha_{n}:=1-\dfrac{\alpha}{n} and that y_{n}=x_{n}+h\alpha_{n}v_{n}, we get:

\displaystyle\begin{cases}&x_{n+1}=x_{n}+h\alpha_{n}v_{n}-h^{2}\nabla f(y_{n})+h^{2}\theta_{n}\nabla f(x_{n})\\ &v_{n+1}=\alpha_{n}v_{n}-h\nabla f(y_{n})+h\theta_{n}\nabla f(x_{n})-h\theta_{n}\nabla f(x_{n+1})\end{cases}

By multiplying the second equation by h and subtracting the two equations, we obtain the modified velocity, i.e.

v_{n}=\dfrac{x_{n}-x_{n-1}}{h}-h\theta_{n-1}\nabla f(x_{n})\,.

In this way, we obtain the third symplectic-type method:

\displaystyle\begin{cases}&y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})-h^{2}\alpha_{n}\theta_{n-1}\nabla f(x_{n})\\ &x_{n+1}=y_{n}-h^{2}\nabla f(y_{n})+h^{2}\theta_{n}\nabla f(x_{n})\end{cases} (LT-SE3)

In this case, we have \lim\limits_{n\to\infty}\omega_{n}=\lim\limits_{n\to\infty}h^{2}\alpha_{n}\theta_{n-1}=0 and also \lim\limits_{n\to\infty}\gamma_{n}=\lim\limits_{n\to\infty}h^{2}\theta_{n}=0. Finally, we observe that we are in the case of our previous discussion regarding the remedy concerning the exploding gradients and the spurious roots.
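A minimal implementation of (LT-SE3) only needs a vanishing sequence θ_n; in the sketch below, θ_n = 1/(n+1) is an arbitrary choice satisfying θ_n → 0, made purely for illustration.

```python
import numpy as np

def lt_se3(grad, x0, h=0.1, alpha=3.0, theta=lambda n: 1.0 / (n + 1), n_iter=300):
    """(LT-SE3): the residual gradient term is damped by theta(n) -> 0."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for n in range(1, n_iter + 1):
        a_n = 1.0 - alpha / n
        g = grad(x)
        y = x + a_n * (x - x_prev) - h ** 2 * a_n * theta(n - 1) * g
        x_prev, x = x, y - h ** 2 * grad(y) + h ** 2 * theta(n) * g
    return x
```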

3 Proposed algorithms and their rate of convergence

Even though we have introduced new algorithms, namely LT-Symplectic, via an intuitive geometric approach by means of symplectic methods, we will focus our attention on a generalized Hessian driven damping system and not on the extended AVD system. The discussion of our choice is postponed to Remark 11. Hence, our aim is to carry out a theoretical and an empirical convergence analysis of Nesterov’s algorithm (AGM2α), the IGAHD algorithm from [4] and our approach via symplectic integrators, i.e. algorithms of the form LT-S-IGAHD.

3.1 Modified Hessian-driven damping continuous and discrete systems

In the previous sections, we have presented the key idea behind the splitting technique applied to gradient-type dynamical systems. We have considered the LT-Symplectic and LT-SE3 algorithms, which have a geometric interpretation related to symplectic integrators. We observe that all these optimization algorithms have two coefficients in front of the gradient value at x_{n}, namely \gamma_{n} and \omega_{n}, for each n\in\mathbb{N}. On the other hand, there are the Hessian-driven damping discretizations like IGAHD-type, which contain the difference of two consecutive gradients, i.e. \nabla f(x_{n})-\nabla f(x_{n-1}), along with \nabla f(x_{n}), under the gradient of the inertial element y_{n}. Since the effect of the Hessian is to dampen the oscillations, it is more natural to consider a Hessian-driven damping algorithm. So, in the present subsection we modify some of the previous dynamical systems in order to make consistent modifications to our proposed algorithms. At the same time, the algorithms that we propose avoid the influence of the spurious roots and exploding gradients (explained in Subsection 2.4), preserve the Hessian structure and are derived from a symplectic perspective, as the other methods derived above. In order to do this, for n\geq 1 we consider the following generalization of IGAHD-type, namely Lie-Trotter-Symplectic-IGAHD, i.e.

\displaystyle\begin{cases}&y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})-\lambda_{n}\left[\nabla f(x_{n})-\nabla f(x_{n-1})\right]-\omega_{n}\nabla f(x_{n})\\ &x_{n+1}=y_{n}-h^{2}\nabla f(y_{n})+\gamma_{n}\nabla f(x_{n})\end{cases} (LT-S-IGAHD)

where \lambda_{n} and the coefficients \omega_{n} and \gamma_{n} are consistent with the notations from LT-Symplectic. Furthermore, for \gamma_{n}=0, \lambda_{n}=\beta h and \omega_{n}=\dfrac{\beta h}{n}, we obtain IGAHD-type as a particular case. On the other hand, taking into account Subsection 2.4, it is natural to consider \omega_{n} such that \lim\limits_{n\to\infty}\omega_{n}=0. Also, in the case of IGAHD-type this property is easily verified, with \omega_{n}=O\left(\dfrac{1}{n}\right) as n\to\infty. Now, we present our result of this subsection, in which we introduce a new dynamical system such that LT-S-IGAHD is a Lie-Trotter splitting discretization of it.
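In implementation terms, (LT-S-IGAHD) is the loop sketched below, where λ_n, ω_n and γ_n are supplied as sequences; the commented choice γ_n = 0, λ_n = βh, ω_n = βh/n recovers the IGAHD-type scheme. This is only a sketch of the update rule, with no parameter tuning implied.

```python
import numpy as np

def lt_s_igahd(grad, x0, h, alpha, lam, omega, gamma, n_iter=300):
    """(LT-S-IGAHD) with coefficient sequences lam(n), omega(n), gamma(n)."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    g_prev = grad(x_prev)
    for n in range(1, n_iter + 1):
        a_n = 1.0 - alpha / n
        g = grad(x)
        y = (x + a_n * (x - x_prev)
             - lam(n) * (g - g_prev)     # Hessian correction via consecutive gradients
             - omega(n) * g)
        x_prev, x, g_prev = x, y - h ** 2 * grad(y) + gamma(n) * g, g
    return x

# IGAHD-type is recovered with gamma_n = 0, lambda_n = beta*h, omega_n = beta*h/n.
```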

Theorem 7.

Let f\in C^{2}(\mathbb{R}^{N},\mathbb{R}) and consider the following Hessian driven damping second order system:

\displaystyle\ddot{x}(t)+\dfrac{\alpha}{t}\dot{x}(t)+\dfrac{\lambda(t)}{\gamma}\nabla^{2}f(x(t))\dot{x}(t)=-\dfrac{1}{\gamma^{2}}\left(\omega(t)-\tilde{\gamma}(t)\right)\nabla f(x(t))-\nabla f\left(\mathcal{A}(t,x(t),\dot{x}(t))\right)

where \tilde{\gamma}=\tilde{\gamma}(t) and \omega=\omega(t) are given functions. Also, for a triple of the form (t,x(t),y(t)) we have used the notation

\mathcal{A}(t,x(t),y(t)):=x(t)+\gamma\left(1-\dfrac{\alpha\gamma}{t}\right)y(t)-\gamma\lambda(t)\nabla^{2}f(x(t))y(t)-\omega(t)\nabla f(x(t))

Then, a Lie-Trotter splitting discretization, with a forward Euler method on the first subproblem and a symplectic integrator on the second one, leads to the LT-S-IGAHD algorithm.

Proof .

As in the case of (IGAHD-type), we consider two subproblems, one containing the so-called non-potential forces and the other one containing the potential forces:

\displaystyle(P_{1}):\begin{cases}\dot{x}(t)=0\\ \dot{v}(t)=-\dfrac{\alpha}{t}v(t)-\dfrac{\lambda(t)}{\gamma}\nabla^{2}f(x(t))v(t)-\dfrac{\omega(t)}{\gamma^{2}}\nabla f(x(t))+\dfrac{\tilde{\gamma}(t)}{\gamma^{2}}\nabla f(x(t))-\nabla f\left(\mathcal{A}(t,x(t),v(t))\right)+\nabla f(x(t))\end{cases}

and the second subproblem as

\displaystyle(P_{2}):\begin{cases}\dot{x}(t)=v(t)\\ \dot{v}(t)=-\nabla f(x(t))\end{cases}

Taking \gamma=h, approximating \tilde{\gamma}(t_{n})\approx\gamma_{n}, \lambda(t_{n})\approx\lambda_{n} and \omega(t_{n})\approx\omega_{n}, and defining \alpha_{n}:=1-\dfrac{\alpha h}{t_{n}}, we obtain that

\displaystyle\begin{cases}x_{n+1/2}=x_{n}\\ v_{n+1/2}=\alpha_{n}v_{n}-\lambda_{n}\nabla^{2}f(x_{n})v_{n}-\dfrac{\omega_{n}}{h}\nabla f(x_{n})+\dfrac{\gamma_{n}}{h}\nabla f(x_{n})+h\nabla f(x_{n})-h\nabla f\left(y_{n}\right)\end{cases}

where we have used the notation y_{n}:=\mathcal{A}(t_{n},x_{n},v_{n}) for the auxiliary iteration element, i.e. y_{n}=x_{n}+h\alpha_{n}v_{n}-h\lambda_{n}\nabla^{2}f(x_{n})v_{n}-\omega_{n}\nabla f(x_{n}). Furthermore, for the second subproblem (P_{2}), we apply the symplectic Euler integrator (SE2) and obtain that

\displaystyle\begin{cases}x_{n+1}=x_{n+1/2}+hv_{n+1}\\ v_{n+1}=v_{n+1/2}-h\nabla f(x_{n+1/2})\end{cases}

Using the fact that, for t_{n}=nh, the splitting solution is the discrete solution of (P_{2}), namely (x_{n+1},v_{n+1})^{T}\in\mathbb{R}^{N}\times\mathbb{R}^{N}, we obtain, from the definition of x_{n+1}, that v_{n}=\dfrac{x_{n}-x_{n-1}}{h}. Using the approximation technique for \nabla^{2}f(x_{n})v_{n} as in Theorem 4, it follows that our algorithm is identical to LT-S-IGAHD. \blacksquare

Now, we briefly give a chain of observations that provide insight into the construction of LT-S-IGAHD type algorithms.

Remark 8.

The same principle as in Theorem 7 can be applied to DynSys-ARDM. It is easy to see that if we add a perturbation of the form \dfrac{\tilde{\gamma}(t)}{\gamma^{2}}\nabla f(x(t)), then we obtain an additional element \gamma_{n}\nabla f(x_{n}) in the discrete version of the exact solution, i.e. in x_{n+1}. Furthermore, we explain in detail the technical reason why the newly added term is significant in the process of creating new splitting-type algorithms. For example, we refer to the particular case of the dynamical system from Theorem 7 when \omega(t)=\dfrac{\beta\gamma^{2}}{t}. This is the system corresponding to the IGAHD dynamical system from Theorem 4, but with an additional gradient with coefficient \dfrac{\tilde{\gamma}(t)}{\gamma^{2}}. Even though we have two gradients on the right hand side, namely \dfrac{\beta}{t}\nabla f(x(t)) and \dfrac{\tilde{\gamma}(t)}{\gamma^{2}}\nabla f(x(t)), the first one is proportional to the one under the gradient of \mathcal{A}(t,x(t),\dot{x}(t)), i.e. \dfrac{\beta\gamma^{2}}{t}\nabla f(x(t)). This basic observation (also valid for the term involving the Hessian, which also has \gamma^{2} as a coefficient of proportionality) leads to the crucial idea by which we can construct algorithms of the form LT-S-IGAHD. Equivalently, if a term proportional to the gradient at the current value does not appear under the gradient of \mathcal{A}(t,x(t),\dot{x}(t)), then this term will not appear in the inertial term y_{n} but will appear in x_{n+1}, i.e. it is added to y_{n} and \nabla f(y_{n}).

Remark 9.

In Subsection 2.4 we have pointed out that, in order for the algorithms to have the natural behavior, in the sense that they possess neither exploding gradients nor spurious fixed points, a natural technique is to make the division into potential and non-potential forces and to add and subtract \theta(t)\nabla f(x(t)), as in the case of LT-SE3. This was necessary for the LT-SE3 algorithm because we employed the SE1 symplectic integrator. In the case of LT-S-IGAHD (as in the case of Nesterov’s method AGM2α), the symplectic numerical scheme SE2 does not require such a process.

Remark 10.

One might argue that our Hessian-type algorithms LT-S-IGAHD have a major drawback, namely that they require the computation of \nabla f(x_{n}) in the update of x_{n+1}, along with \nabla f(y_{n}). This is not the case, since \nabla f(x_{n}) was already computed in order to find the value of the inertial term y_{n}; hence, the algorithms require the same computational effort.

Remark 11.

One might wonder why we did not analyze theoretically the LT-Symplectic type algorithms and why we have introduced an additional gradient value \nabla f(x_{n-1}). The motivation is two-fold. First of all, the product \nabla^{2}f(x(t))\dot{x}(t) equals \dfrac{d}{dt}\left(\nabla f(x(t))\right), so it is natural to consider the effects of the Hessian perturbation. Also, the Hessian-type systems lead to a much clearer interpretation of numerical methods in optimization (for this, see Subsection 2.1). Second of all, the previous value under the gradient, \nabla f(x_{n-1}), leads to an intuitive discrete Lyapunov function, as we will see in the main result of the next subsection. Without this term, as tedious computations show, the Lyapunov function cannot be chosen naturally.

Remark 12.

Recently, in [18], L. Chen and H. Luo have presented some Hessian-driven damping dynamical systems. They have incorporated the effects of a dynamically changing damping coefficient. Also, they have considered alternative proofs concerning the continuous and the discrete energy functions, using the so-called strong Lyapunov property. On the other hand, their optimization methods are connected to the first order system associated to the second order evolution equation containing the Hessian. This idea is related to the technique of A.M. Oberman and M. Laborde [33], in which Nesterov-type accelerated gradient methods are interpreted as explicit discretizations of Hessian systems, but with an extra gradient descent step. We point out that the algorithms developed in [18] come from an interpretation different from the one developed by us in Subsection 2.1. Our algorithms have a beautiful geometric interpretation related to the idea of symplectic integrators, which almost conserve the energy, and splitting methods, which separate the potential and non-potential forces. Also, they have shown that Nesterov’s method is an explicit method for a Hessian-type system but with an additional gradient step. This can be seen in contrast with the present work, where we show in a rigorous manner how AGM2α can be constructed through the lens of the Lie-Trotter splitting.

3.2 The rate of convergence of the optimization methods

In the present section, we consider some necessary tools for the convergence analysis of the Hessian-driven damping discretization LT-S-IGAHD. For this, we recall that an important assumption concerning the theoretical rate of convergence of optimization algorithms is that the objective function has a Lipschitz continuous gradient. An interesting consequence is the so-called descent lemma (see inequality 2.1.6 from page 56 of [31]), which is presented below.

Lemma 13 (Descent lemma).

Let f:\mathbb{R}^{N}\to\mathbb{R} be Fréchet differentiable, such that \nabla f:\mathbb{R}^{N}\to\mathbb{R}^{N} is L-Lipschitz continuous. Then, one has that

\displaystyle f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\dfrac{L}{2}\|y-x\|^{2}\quad,\quad\forall x,y\in\mathbb{R}^{N}\quad. (DL)

It is worth noticing that if the objective function is convex, then we actually have an equivalence between DL and the L-Lipschitz continuity of \nabla f. Furthermore, Lemma 13 provides a quadratic upper bound on the underlying objective function. Now, we briefly recall from [4] an extended variant of Lemma 13, in which a modified version of DL also contains the norm of the difference between two values of \nabla f (as a side note, see also inequality 2.17 from page 57 of [31]).
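A quick numerical sanity check of DL is easy to set up: for a quadratic f(x)=\tfrac{1}{2}x^{T}Ax the gradient is Lipschitz with constant L equal to the largest eigenvalue of A, and the inequality can be tested at random points. The snippet below is such an illustration only (the matrix and the sampled points are arbitrary); it plays no role in the proofs.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 4.0, 9.0])          # f(x) = 0.5 x^T A x, grad f is L-Lipschitz with L = 9
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = 9.0

for _ in range(5):
    x, y = rng.normal(size=3), rng.normal(size=3)
    rhs = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2
    assert f(y) <= rhs + 1e-12        # the descent lemma at the sampled pair (x, y)
print("descent lemma verified on the sampled points")
```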

Lemma 14 (Extended descent lemma).

Let f:\mathbb{R}^{N}\to\mathbb{R} be convex and Fréchet differentiable, such that \nabla f:\mathbb{R}^{N}\to\mathbb{R}^{N} is L-Lipschitz continuous. Consider a parameter s such that 0<s\leq\dfrac{1}{L}. Then one has that

\displaystyle f(y-s\nabla f(y))\leq f(x)+\langle\nabla f(y),y-x\rangle-\dfrac{s}{2}\|\nabla f(y)\|^{2}-\dfrac{s}{2}\|\nabla f(x)-\nabla f(y)\|^{2}\quad,\quad\forall x,y\in\mathbb{R}^{N}\quad. (EDL)

In what follows, we present a technical lemma that will play a crucial role in our main result regarding the asymptotic convergence of the Hessian-type algorithms. So, it is time to ‘extend’ the extended descent lemma (Lemma 14). This represents our first result of this section.

Lemma 15 (Three points extended lemma).

Let f:\mathbb{R}^{N}\to\mathbb{R} be a convex and Fréchet differentiable function, such that \nabla f:\mathbb{R}^{N}\to\mathbb{R}^{N} is L-Lipschitz continuous. Let s\in\left(0,\dfrac{1}{L}\right]. For simplicity, consider the following notations: \mathcal{A}_{1}(s):=s, \mathcal{A}_{2}(s):=-\mathcal{A}_{1}(s), \mathcal{A}_{3}(s;\gamma,L):=-\gamma(1-Ls), \mathcal{A}_{4}(s):=\mathcal{A}_{1}(s)-\dfrac{s}{2} and \mathcal{A}_{5}(s;\gamma):=-\gamma^{2}/(2s), where \gamma\geq 0. Then, \forall x,y,z\in\mathbb{R}^{N}, f satisfies the following inequality:

\displaystyle f(y-s\nabla f(y)+\gamma\nabla f(z))\leq f(x)+\langle\nabla f(y),y-x\rangle-\mathcal{A}_{1}(s)\|\nabla f(y)\|^{2} (E-EDL)
-\left[\mathcal{A}_{2}(s)\langle\nabla f(y),\nabla f(x)\rangle+\mathcal{A}_{3}(s;\gamma,L)\langle\nabla f(y),\nabla f(z)\rangle\right]
-\left[\mathcal{A}_{4}(s)\|\nabla f(x)\|^{2}+\mathcal{A}_{5}(s;\gamma)\|\nabla f(z)\|^{2}\right]\quad.
Proof .

Let x,y,z\in\mathbb{R}^{N}. For simplicity, consider y^{+}:=y-s\nabla f(y)+\gamma\nabla f(z). Applying DL for y^{+} and y, it follows that

\displaystyle f(y^{+})\leq f(y)+\langle\nabla f(y),y^{+}-y\rangle+\dfrac{L}{2}\|y^{+}-y\|^{2}\quad.

This leads to

\displaystyle f(y^{+})\leq f(y)+\langle\nabla f(y),-s\nabla f(y)+\gamma\nabla f(z)\rangle+\dfrac{L}{2}\|-s\nabla f(y)+\gamma\nabla f(z)\|^{2}
\displaystyle f(y^{+})\leq f(y)-s\|\nabla f(y)\|^{2}+\gamma\langle\nabla f(y),\nabla f(z)\rangle+\dfrac{L}{2}\langle-s\nabla f(y)+\gamma\nabla f(z),-s\nabla f(y)+\gamma\nabla f(z)\rangle
\displaystyle f(y^{+})\leq f(y)-s\|\nabla f(y)\|^{2}+\gamma\langle\nabla f(y),\nabla f(z)\rangle+\dfrac{L}{2}s^{2}\|\nabla f(y)\|^{2}+\dfrac{L}{2}\gamma^{2}\|\nabla f(z)\|^{2}-Ls\gamma\langle\nabla f(y),\nabla f(z)\rangle

So, it follows that

\displaystyle f(y^{+})\leq f(y)-\dfrac{s}{2}\left(2-Ls\right)\|\nabla f(y)\|^{2}+\dfrac{L}{2}\gamma^{2}\|\nabla f(z)\|^{2}+\gamma(1-Ls)\langle\nabla f(y),\nabla f(z)\rangle (22)

Taking into account that \nabla f being L-Lipschitz continuous is equivalent to the conjugate function f^{\ast} being \dfrac{1}{L}-strongly convex, and applying this property for the pair (\nabla f(y),\nabla f(x)), we get that

\displaystyle f^{\ast}(\nabla f(y))\geq f^{\ast}(\nabla f(x))+\langle\nabla f^{\ast}(\nabla f(x)),\nabla f(y)-\nabla f(x)\rangle+\dfrac{1}{2L}\|\nabla f(y)-\nabla f(x)\|^{2}

Using the fact that (\nabla f)^{-1}=\nabla f^{\ast}, this simplifies to

\displaystyle f^{\ast}(\nabla f(y))\geq f^{\ast}(\nabla f(x))+\langle x,\nabla f(y)-\nabla f(x)\rangle+\dfrac{1}{2L}\|\nabla f(y)-\nabla f(x)\|^{2}

Using the Fenchel identity for y, namely f(y)=\langle\nabla f(y),y\rangle-f^{\ast}(\nabla f(y)), we get

\displaystyle\langle\nabla f(y),y\rangle-f(y)\geq f^{\ast}(\nabla f(x))+\langle x,\nabla f(y)-\nabla f(x)\rangle+\dfrac{1}{2L}\|\nabla f(y)-\nabla f(x)\|^{2}

So, this is equivalent to

\displaystyle f(y)\leq\langle\nabla f(y),y\rangle-f^{\ast}(\nabla f(x))-\langle x,\nabla f(y)-\nabla f(x)\rangle-\dfrac{1}{2L}\|\nabla f(y)-\nabla f(x)\|^{2}
\displaystyle f(y)\leq\left[\langle x,\nabla f(x)\rangle-f^{\ast}(\nabla f(x))\right]+\langle\nabla f(y),y-x\rangle-\dfrac{1}{2L}\|\nabla f(y)-\nabla f(x)\|^{2}

Again using the Fenchel identity, but applied at x, namely f(x)=\langle x,\nabla f(x)\rangle-f^{\ast}(\nabla f(x)), it follows that

\displaystyle f(y)\leq f(x)+\langle\nabla f(y),y-x\rangle-\dfrac{1}{2L}\|\nabla f(y)-\nabla f(x)\|^{2}\quad. (23)

Taking into account inequalities 22 and 23, we obtain

\displaystyle f(y^{+})\leq f(x)+\langle\nabla f(y),y-x\rangle-\dfrac{1}{2L}\|\nabla f(y)-\nabla f(x)\|^{2}-\dfrac{s}{2}\left(2-Ls\right)\|\nabla f(y)\|^{2} (24)
+\dfrac{L}{2}\gamma^{2}\|\nabla f(z)\|^{2}+\gamma(1-Ls)\langle\nabla f(z),\nabla f(y)\rangle\quad.

From the theorem’s assumptions, we know that \dfrac{s}{2}\leq s-\dfrac{Ls^{2}}{2}, \dfrac{1}{L}\geq s and -\dfrac{1}{2L}\leq-\dfrac{s}{2}. From inequality 24 and from the condition on L, one obtains that

\displaystyle f(y^{+})\leq f(x)+\langle\nabla f(y),y-x\rangle-\dfrac{s}{2}\|\nabla f(y)-\nabla f(x)\|^{2}-\dfrac{s}{2}\|\nabla f(y)\|^{2}
+\dfrac{L\gamma^{2}}{2}\|\nabla f(z)\|^{2}+\gamma(1-Ls)\langle\nabla f(z),\nabla f(y)\rangle

Through some simple computations, it follows

\displaystyle f(y^{+})\leq f(x)+\langle\nabla f(y),y-x\rangle-\dfrac{s}{2}\|\nabla f(y)\|^{2}-\dfrac{s}{2}\|\nabla f(x)\|^{2}
+s\langle\nabla f(y),\nabla f(x)\rangle-\dfrac{s}{2}\|\nabla f(y)\|^{2}+\dfrac{L\gamma^{2}}{2}\|\nabla f(z)\|^{2}+\gamma(1-Ls)\langle\nabla f(z),\nabla f(y)\rangle

Hence,

\displaystyle f(y^{+})\leq f(x)+\langle\nabla f(y),y-x\rangle-\left[\dfrac{s}{2}+\dfrac{s}{2}\right]\|\nabla f(y)\|^{2}-\left[\dfrac{s}{2}\|\nabla f(x)\|^{2}-\dfrac{L\gamma^{2}}{2}\|\nabla f(z)\|^{2}\right]
-\left[-s\langle\nabla f(y),\nabla f(x)\rangle-\gamma(1-Ls)\langle\nabla f(y),\nabla f(z)\rangle\right]

Using the fact that \dfrac{L}{2}\leq\dfrac{1}{2s}, the last inequality leads to

\displaystyle f(y^{+})\leq f(x)+\langle\nabla f(y),y-x\rangle-s\|\nabla f(y)\|^{2}-\left[\dfrac{s}{2}\|\nabla f(x)\|^{2}-\dfrac{\gamma^{2}}{2s}\|\nabla f(z)\|^{2}\right]
-\left[-s\langle\nabla f(y),\nabla f(x)\rangle-\gamma(1-Ls)\langle\nabla f(y),\nabla f(z)\rangle\right]

Using the notations for \mathcal{A}_{1}(s), \mathcal{A}_{2}(s), \mathcal{A}_{3}(s;\gamma,L), \mathcal{A}_{4}(s) and \mathcal{A}_{5}(s;\gamma), the inequality E-EDL follows easily. \blacksquare

Corollary 16.

Consider x,y,z\in\mathbb{R}^{N}. By the fact that the stepsize s lies in \left(0,\dfrac{1}{L}\right], it follows that the upper bound E-EDL for f(y^{+}) reduces to EDL, with \gamma=0.

Before embarking on our main result of the present paper, we consider some technical lemmas that will be used in what follows. These results involve quadratic inequalities with nonconstant coefficients. Even though they are elementary, for completeness we give them along with their proofs.

Lemma 17.

Let a,b,c:\mathbb{R}\to\mathbb{R} be given functions. If there exists x\in\mathbb{R} such that the following inequalities are satisfied: a(x)>0 and b^{2}(x)-4a(x)c(x)\leq 0, then x satisfies a(x)x^{2}+b(x)x+c(x)\geq 0.

Proof .

First of all, we consider the following computations:

\displaystyle a(x)\left(x+\dfrac{b(x)}{2a(x)}\right)^{2}-\dfrac{b^{2}(x)-4a(x)c(x)}{4a(x)}=a(x)\left[x^{2}+\dfrac{b^{2}(x)}{4a^{2}(x)}+x\dfrac{b(x)}{a(x)}\right]-\dfrac{b^{2}(x)}{4a(x)}+c(x)
=a(x)x^{2}+\dfrac{b^{2}(x)}{4a^{2}(x)}a(x)+x\dfrac{b(x)}{a(x)}a(x)-\dfrac{b^{2}(x)}{4a(x)}+c(x)\,.

This is equivalent to the fact that

a(x)x^{2}+b(x)x+c(x)=a(x)\left(x+\dfrac{b(x)}{2a(x)}\right)^{2}-\dfrac{b^{2}(x)-4a(x)c(x)}{4a(x)}\,.

Now, since a(x)>0, the first term a(x)\left(x+\dfrac{b(x)}{2a(x)}\right)^{2} is nonnegative. On the other hand, the hypotheses b^{2}(x)-4a(x)c(x)\leq 0 and a(x)>0 imply that \dfrac{b^{2}(x)-4a(x)c(x)}{4a(x)}\leq 0, so the second term -\dfrac{b^{2}(x)-4a(x)c(x)}{4a(x)} is also nonnegative. Adding the two terms and using the identity above, it follows that a(x)x^{2}+b(x)x+c(x)\geq 0, which is the desired conclusion. \blacksquare

Lemma 18.

Let a,b,c:\mathbb{R}\to\mathbb{R} be given functions. If there exists x\in\mathbb{R} such that the following conditions are satisfied: a(x)>0, b^{2}(x)-4a(x)c(x)\geq 0 and x\in I_{1}\cup I_{2}, where I_{1}:=\left(-\infty,\dfrac{-b(x)-\sqrt{b^{2}(x)-4a(x)c(x)}}{2a(x)}\right] and I_{2}:=\left[\dfrac{-b(x)+\sqrt{b^{2}(x)-4a(x)c(x)}}{2a(x)},+\infty\right), then x satisfies a(x)x^{2}+b(x)x+c(x)\geq 0.

Proof .

As before, we must show that \left(x+\dfrac{b(x)}{2a(x)}\right)^{2}\geq\dfrac{b^{2}(x)-4a(x)c(x)}{4a^{2}(x)}, so we must analyze two cases:

  • 1)

    x\in I_{1}, that is, x\leq-\dfrac{b(x)}{2a(x)}-\dfrac{\sqrt{b^{2}(x)-4a(x)c(x)}}{2a(x)}. Thus, we have \dfrac{\sqrt{b^{2}(x)-4a(x)c(x)}}{2a(x)}\leq-x-\dfrac{b(x)}{2a(x)}=-\left(x+\dfrac{b(x)}{2a(x)}\right). But it is easy to see that x\leq-\dfrac{b(x)}{2a(x)}, i.e. -\left(x+\dfrac{b(x)}{2a(x)}\right)\geq 0. So, it follows that \dfrac{\sqrt{b^{2}(x)-4a(x)c(x)}}{2a(x)}\leq-\left(x+\dfrac{b(x)}{2a(x)}\right)=\Big{|}x+\dfrac{b(x)}{2a(x)}\Big{|}. Squaring the positive terms, it follows that \dfrac{b^{2}(x)-4a(x)c(x)}{4a^{2}(x)}\leq\left(x+\dfrac{b(x)}{2a(x)}\right)^{2}.

  • 2)

    x\in I_{2}, i.e. x\geq-\dfrac{b(x)}{2a(x)}+\dfrac{\sqrt{b^{2}(x)-4a(x)c(x)}}{2a(x)}, which is equivalent to \dfrac{\sqrt{b^{2}(x)-4a(x)c(x)}}{2a(x)}\leq x+\dfrac{b(x)}{2a(x)}. But, since x+\dfrac{b(x)}{2a(x)}\geq\dfrac{\sqrt{b^{2}(x)-4a(x)c(x)}}{2a(x)}\geq 0, it follows that x+\dfrac{b(x)}{2a(x)}=\Big{|}x+\dfrac{b(x)}{2a(x)}\Big{|}. Hence, using that \dfrac{\sqrt{b^{2}(x)-4a(x)c(x)}}{2a(x)}\leq\Big{|}x+\dfrac{b(x)}{2a(x)}\Big{|} and squaring the positive terms, we get that \dfrac{b^{2}(x)-4a(x)c(x)}{4a^{2}(x)}\leq\left(x+\dfrac{b(x)}{2a(x)}\right)^{2}.

\blacksquare

Before our main result, we consider some observations regarding the functions that appear in Lemma 17 and Lemma 18.

Remark 19.

Related to Lemma 17 and Lemma 18, we have the following:

  • For a given triple of functions (a(x),b(x),c(x)), if the element x\in\mathbb{R} exists, then the quadratic inequality holds. But this does not mean that such a point exists for every triple of functions (a(x),b(x),c(x)).

  • It is easy to observe that, in Lemma 17, the conditions a(x)>0 and b^{2}(x)-4a(x)c(x)\leq 0 imply that c(x)\geq 0. Furthermore, we do not need a positivity-type condition on the function b=b(x).

  • In Lemma 18, we observe that, since a(x)>0, if c(x)\leq 0, then the condition 4a(x)c(x)\leq b^{2}(x) is trivially satisfied. As before, we do not need the positivity of b=b(x).

In what follows we present the main result of this section. Given a convex objective function with Lipschitz continuous gradient, we show that the rate of convergence of the algorithms given by the general discretization LT-S-IGAHD is O\left(\dfrac{1}{n^{2}}\right) as n\to\infty, under some suitable conditions on the parameters. Given x^{\ast}\in\operatorname*{argmin}f (under the assumption that \operatorname*{argmin}f\neq\emptyset), the Lyapunov analysis that we employ is based upon the discrete energy defined as:

\displaystyle\mathcal{E}_{n}:=t_{n}^{2}\left(f(x_{n})-f(x^{\ast})\right)+\dfrac{1}{2s}\|z_{n}\|^{2} (Discrete-Energy)
z_{n}:=(x_{n-1}-x^{\ast})+t_{n}(x_{n}-x_{n-1})+\lambda_{n}t_{n+1}\nabla f(x_{n-1})\,,

where t_{n+1}=\dfrac{n}{\alpha-1}, with \alpha\geq 3. We shall also use \alpha_{n}=\dfrac{n-\alpha}{n}, so it follows that t_{n}-1=\alpha_{n}t_{n+1}. Moreover, our proof technique is inspired by the ones given in [4] and [6].
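Before stating the theorem, one can probe the Lyapunov property empirically: run LT-S-IGAHD with the IGAHD-type coefficients \gamma_{n}=0, \lambda_{n}=\beta h, \omega_{n}=\dfrac{\beta h}{n} (which satisfy assumption ii) below, and assumption i) for \beta small enough) and track \mathcal{E}_{n} along the iterates. The sketch below does this for an arbitrary quadratic with s=h^{2}\leq\dfrac{1}{L}; it is only an illustration of the energy Discrete-Energy, not part of the proof.

```python
import numpy as np

A = np.diag([1.0, 10.0])                     # f(x) = 0.5 x^T A x, so L = 10
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x_star = np.zeros(2)

h, alpha, beta = 0.2, 3.0, 0.05
s = h ** 2                                   # s = 0.04 <= 1/L = 0.1
lam = lambda n: beta * h
omega = lambda n: beta * h / n

x_prev = np.array([1.0, 1.0]); x = x_prev.copy(); g_prev = grad(x_prev)
energy = []
for n in range(1, 400):
    t_n, t_np1 = (n - 1) / (alpha - 1), n / (alpha - 1)
    z = (x_prev - x_star) + t_n * (x - x_prev) + lam(n) * t_np1 * g_prev
    energy.append(t_n ** 2 * (f(x) - f(x_star)) + z @ z / (2 * s))
    a_n = 1.0 - alpha / n
    g = grad(x)
    y = x + a_n * (x - x_prev) - lam(n) * (g - g_prev) - omega(n) * g
    x_prev, x, g_prev = x, y - s * grad(y), g          # gamma_n = 0
print("energy non-increasing after the first iterations:",
      bool(np.all(np.diff(energy[2:]) <= 1e-10)))
```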

Theorem 20.

Let f:\mathbb{R}^{N}\to\mathbb{R} be a convex and Fréchet differentiable function whose gradient \nabla f:\mathbb{R}^{N}\to\mathbb{R}^{N} is L-Lipschitz continuous. Also, suppose that \operatorname*{argmin}f\neq\emptyset. Let (x_{n})_{n\in\mathbb{N}} be generated by the algorithm LT-S-IGAHD for every n\geq 1 and, for simplicity, denote h^{2} by s, which satisfies the condition s\in\left(0,\dfrac{1}{L}\right]. Suppose that the sequences (\gamma_{n})_{n\in\mathbb{N}} and (\lambda_{n})_{n\in\mathbb{N}} are convergent. Likewise, assume that the following assumptions are satisfied:

  • i)

    there exists N^{\prime}\in\mathbb{N} such that \left[s(1-\gamma_{n}L)-\dfrac{n+1}{n}\lambda_{n+1}\right]^{2}<s^{2}-\gamma_{n}^{2}\,,\ \forall n>\max\{\alpha-1,N^{\prime}\};

  • ii)

    \gamma_{n}=(\lambda_{n}+\omega_{n})-\dfrac{n+1}{n}\lambda_{n+1}\,,\ \forall n\geq\alpha-1.

Then, for x^{\ast}\in\operatorname*{argmin}f, the sequence (\mathcal{E}_{n})_{n\in\mathbb{N}} defined in Discrete-Energy is non-increasing and the following convergence rate is satisfied:

f(x_{n})-f(x^{\ast})=O\left(\dfrac{1}{n^{2}}\right)\text{ as }n\to\infty
Proof .

( Step I ) The first step of the proof is to apply the so-called Three points extended lemma, namely Lemma 15, and then construct the difference \mathcal{E}_{n+1}-\mathcal{E}_{n} of the discrete energy Discrete-Energy.
We apply the inequality E-EDL with the choices x=z=x_{n}, y=y_{n} and \gamma=\gamma_{n}. Then, we obtain that:

\displaystyle f(y_{n}-s\nabla f(y_{n})+\gamma_{n}\nabla f(x_{n}))\leq f(x_{n})+\langle\nabla f(y_{n}),y_{n}-x_{n}\rangle-\mathcal{A}_{1}(s)\|\nabla f(y_{n})\|^{2}
-\left[\mathcal{A}_{2}(s)\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle+\mathcal{A}_{3}(s;\gamma_{n},L)\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle\right]
-\left[\mathcal{A}_{4}(s)\|\nabla f(x_{n})\|^{2}+\mathcal{A}_{5}(s;\gamma_{n})\|\nabla f(x_{n})\|^{2}\right]\,.

This means that

f(xn+1)\displaystyle f(x_{n+1}) f(xn)+f(yn),ynxn𝒜1(s)f(yn)2\displaystyle\leq f(x_{n})+\langle\nabla f(y_{n}),y_{n}-x_{n}\rangle-\mathcal{A}_{1}(s)\|\nabla f(y_{n})\|^{2} (25)
[𝒜2(s)+𝒜3(s;γn,L)]f(yn),f(xn)\displaystyle-\left[\mathcal{A}_{2}(s)+\mathcal{A}_{3}(s;\gamma_{n},L)\right]\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle
[𝒜4(s)+𝒜5(s;γn)]f(xn)2.\displaystyle-\left[\mathcal{A}_{4}(s)+\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}\,.

As before, we apply E-EDL, but with the choices x=x^{\ast}, y=y_{n}, z=x_{n} and \gamma=\gamma_{n}. It follows that:

f(ynsf(yn)+γnf(xn))\displaystyle f(y_{n}-s\nabla f(y_{n})+\gamma_{n}\nabla f(x_{n})) f(x)+f(yn),ynx𝒜1(s)f(yn)2\displaystyle\leq f(x^{\ast})+\langle\nabla f(y_{n}),y_{n}-x^{\ast}\rangle-\mathcal{A}_{1}(s)\|\nabla f(y_{n})\|^{2}
[𝒜2(s)f(yn),f(x)+𝒜3(s;γn,L)f(yn),f(xn)]\displaystyle-\left[\mathcal{A}_{2}(s)\langle\nabla f(y_{n}),\nabla f(x^{\ast})\rangle+\mathcal{A}_{3}(s;\gamma_{n},L)\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle\right]
[𝒜4(s)f(x)2+𝒜5(s;γn)f(xn)2].\displaystyle-\left[\mathcal{A}_{4}(s)\|\nabla f(x^{\ast})\|^{2}+\mathcal{A}_{5}(s;\gamma_{n})\|\nabla f(x_{n})\|^{2}\right]\,.

Using the fact that x^{\ast}\in\operatorname*{argmin}f, so \nabla f(x^{\ast})=0, we obtain

f(xn+1)\displaystyle f(x_{n+1}) f(x)+f(yn),ynx𝒜1(s)f(yn)2\displaystyle\leq f(x^{\ast})+\langle\nabla f(y_{n}),y_{n}-x^{\ast}\rangle-\mathcal{A}_{1}(s)\|\nabla f(y_{n})\|^{2} (26)
𝒜3(s;γn,L)f(yn),f(xn)𝒜5(s;γn)f(xn)2.\displaystyle-\mathcal{A}_{3}(s;\gamma_{n},L)\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle-\mathcal{A}_{5}(s;\gamma_{n})\|\nabla f(x_{n})\|^{2}\,.

Using the definition of t_{n+1}, we consider

nα1.\displaystyle n\geq\alpha-1\,. (27)

hence t_{n+1}-1\geq 0. Now, we multiply 25 by (t_{n+1}-1) and then add it to 26. One obtains that

(tn+11)f(xn+1)+f(xn+1)\displaystyle(t_{n+1}-1)f(x_{n+1})+f(x_{n+1}) (tn+11)f(xn)+f(x)+(tn+11)f(yn),ynxn+f(yn),ynx\displaystyle\leq(t_{n+1}-1)f(x_{n})+f(x^{\ast})+(t_{n+1}-1)\langle\nabla f(y_{n}),y_{n}-x_{n}\rangle+\langle\nabla f(y_{n}),y_{n}-x^{\ast}\rangle
(tn+11)[𝒜2(s)+𝒜3(s;γn,L)]f(yn),f(xn)𝒜1(s)(tn+11)f(yn)2\displaystyle-(t_{n+1}-1)\left[\mathcal{A}_{2}(s)+\mathcal{A}_{3}(s;\gamma_{n},L)\right]\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle-\mathcal{A}_{1}(s)(t_{n+1}-1)\|\nabla f(y_{n})\|^{2}
(tn+11)[𝒜4(s)+𝒜5(s;γn)]f(xn)2𝒜1(s)f(yn)2\displaystyle-(t_{n+1}-1)\left[\mathcal{A}_{4}(s)+\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}-\mathcal{A}_{1}(s)\|\nabla f(y_{n})\|^{2}
𝒜3(s;γn,L)f(yn),f(xn)𝒜5(s;γn)f(xn)2.\displaystyle-\mathcal{A}_{3}(s;\gamma_{n},L)\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle-\mathcal{A}_{5}(s;\gamma_{n})\|\nabla f(x_{n})\|^{2}\,.

By grouping up the terms, it follows that

tn+1f(xn+1)(tn+11)f(xn)f(x)\displaystyle t_{n+1}f(x_{n+1})-(t_{n+1}-1)f(x_{n})-f(x^{\ast}) (tn+11)f(yn),ynxn+f(yn),ynx\displaystyle\leq(t_{n+1}-1)\langle\nabla f(y_{n}),y_{n}-x_{n}\rangle+\langle\nabla f(y_{n}),y_{n}-x^{\ast}\rangle
[(tn+11)𝒜2(s)+tn+1𝒜3(s;γn,L)]f(yn),f(xn)\displaystyle-\left[(t_{n+1}-1)\mathcal{A}_{2}(s)+t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)\right]\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle
[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)]f(xn)2𝒜1(s)tn+1f(yn)2.\displaystyle-\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}-\mathcal{A}_{1}(s)t_{n+1}\|\nabla f(y_{n})\|^{2}\,.

We rewrite the left hand side of the inequality from above as follows:

tn+1f(xn+1)(tn+11)f(xn)f(x)\displaystyle t_{n+1}f(x_{n+1})-(t_{n+1}-1)f(x_{n})-f(x^{\ast}) =tn+1f(xn+1)(tn+11)f(xn)(tn+1tn+1+1)f(x)\displaystyle=t_{n+1}f(x_{n+1})-(t_{n+1}-1)f(x_{n})-(t_{n+1}-t_{n+1}+1)f(x^{\ast})
=tn+1f(xn+1)(tn+11)f(xn)tn+1f(x)(1tn+1)f(x)\displaystyle=t_{n+1}f(x_{n+1})-(t_{n+1}-1)f(x_{n})-t_{n+1}f(x^{\ast})-(1-t_{n+1})f(x^{\ast})
=tn+1[f(xn+1)f(x)](tn+11)[f(xn)f(x)]\displaystyle=t_{n+1}\left[f(x_{n+1})-f(x^{\ast})\right]-(t_{n+1}-1)\left[f(x_{n})-f(x^{\ast})\right]

Then, the above inequality takes the following form:

tn+1[f(xn+1)f(x)]\displaystyle t_{n+1}\left[f(x_{n+1})-f(x^{\ast})\right] (tn+11)[f(xn)f(x)]+f(yn),(tn+11)(ynxn)+(ynx)\displaystyle\leq(t_{n+1}-1)\left[f(x_{n})-f(x^{\ast})\right]+\langle\nabla f(y_{n}),(t_{n+1}-1)(y_{n}-x_{n})+(y_{n}-x^{\ast})\rangle (28)
[(tn+11)𝒜2(s)+tn+1𝒜3(s;γn,L)]f(yn),f(xn)\displaystyle-\left[(t_{n+1}-1)\mathcal{A}_{2}(s)+t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)\right]\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle
[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)]f(xn)2𝒜1(s)tn+1f(yn)2.\displaystyle-\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}-\mathcal{A}_{1}(s)t_{n+1}\|\nabla f(y_{n})\|^{2}\,.

Using the fact that t_{n+1}\geq 0 for every n\geq 0, hence for every n satisfying 27, we multiply the inequality 28 by t_{n+1} and it follows that:

tn+12[f(xn+1)f(x)]\displaystyle t_{n+1}^{2}\left[f(x_{n+1})-f(x^{\ast})\right] (tn+12tn+1)[f(xn)f(x)]+tn+1f(yn),(tn+11)(ynxn)+(ynx)\displaystyle\leq(t_{n+1}^{2}-t_{n+1})\left[f(x_{n})-f(x^{\ast})\right]+t_{n+1}\langle\nabla f(y_{n}),(t_{n+1}-1)(y_{n}-x_{n})+(y_{n}-x^{\ast})\rangle
tn+1[(tn+11)𝒜2(s)+tn+1𝒜3(s;γn,L)]f(yn),f(xn)\displaystyle-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{2}(s)+t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)\right]\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle
tn+1[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)]f(xn)2𝒜1(s)tn+12f(yn)2.\displaystyle-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}-\mathcal{A}_{1}(s)t_{n+1}^{2}\|\nabla f(y_{n})\|^{2}\,.

Since (t_{n+1}^{2}-t_{n+1})\left[f(x_{n})-f(x^{\ast})\right]=(t_{n+1}^{2}-t_{n+1}-t_{n}^{2})\left[f(x_{n})-f(x^{\ast})\right]+t_{n}^{2}\left[f(x_{n})-f(x^{\ast})\right], and using the fact that t_{n+1}^{2}-t_{n+1}-t_{n}^{2}\leq 0 and that f(x_{n})-f(x^{\ast})\geq 0 (since x^{\ast}\in\operatorname*{argmin}f) for every n\geq 0, and thus for each n satisfying 27, it follows that

tn+12[f(xn+1)f(x)]\displaystyle t_{n+1}^{2}\left[f(x_{n+1})-f(x^{\ast})\right] tn2[f(xn)f(x)]+tn+1f(yn),(tn+11)(ynxn)+(ynx)\displaystyle\leq t_{n}^{2}\left[f(x_{n})-f(x^{\ast})\right]+t_{n+1}\langle\nabla f(y_{n}),(t_{n+1}-1)(y_{n}-x_{n})+(y_{n}-x^{\ast})\rangle
tn+1[(tn+11)𝒜2(s)+tn+1𝒜3(s;γn,L)]f(yn),f(xn)\displaystyle-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{2}(s)+t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)\right]\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle
tn+1[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)]f(xn)2𝒜1(s)tn+12f(yn)2.\displaystyle-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}-\mathcal{A}_{1}(s)t_{n+1}^{2}\|\nabla f(y_{n})\|^{2}\,.

Using the definition of \mathcal{E}_{n} given in Discrete-Energy, one can easily see that

n+1n\displaystyle\mathcal{E}_{n+1}-\mathcal{E}_{n} tn+1f(yn),(tn+11)(ynxn)+(ynx)\displaystyle\leq t_{n+1}\langle\nabla f(y_{n}),(t_{n+1}-1)(y_{n}-x_{n})+(y_{n}-x^{\ast})\rangle (29)
tn+1[(tn+11)𝒜2(s)+tn+1𝒜3(s;γn,L)]f(yn),f(xn)\displaystyle-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{2}(s)+t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)\right]\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle
tn+1[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)]f(xn)2𝒜1(s)tn+12f(yn)2\displaystyle-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}-\mathcal{A}_{1}(s)t_{n+1}^{2}\|\nabla f(y_{n})\|^{2}
+12szn+1212szn2.\displaystyle+\dfrac{1}{2s}\|z_{n+1}\|^{2}-\dfrac{1}{2s}\|z_{n}\|^{2}\,.

In computing the last term involving z_{n} and z_{n+1}, we shall use the following well-known equality:

12zn+1212zn2=zn+1zn,zn+112zn+1zn2.\displaystyle\dfrac{1}{2}\|z_{n+1}\|^{2}-\dfrac{1}{2}\|z_{n}\|^{2}=\langle z_{n+1}-z_{n},z_{n+1}\rangle-\dfrac{1}{2}\|z_{n+1}-z_{n}\|^{2}\,. (30)

Armed with the definition of z_{n} given in Discrete-Energy, we have the following estimates:

zn+1zn\displaystyle z_{n+1}-z_{n} =xnx+tn+1(xn+1xn)+λn+1tn+2f(xn)xn1+xtn(xnxn1)λntn+1f(xn1)\displaystyle=x_{n}-x^{\ast}+t_{n+1}(x_{n+1}-x_{n})+\lambda_{n+1}t_{n+2}\nabla f(x_{n})-x_{n-1}+x^{\ast}-t_{n}(x_{n}-x_{n-1})-\lambda_{n}t_{n+1}\nabla f(x_{n-1})
=(xnxn1)+tn+1(xn+1xn)tn(xnxn1)+[λn+1tn+2f(xn)λntn+1f(xn1)]\displaystyle=(x_{n}-x_{n-1})+t_{n+1}(x_{n+1}-x_{n})-t_{n}(x_{n}-x_{n-1})+\left[\lambda_{n+1}t_{n+2}\nabla f(x_{n})-\lambda_{n}t_{n+1}\nabla f(x_{n-1})\right]
=tn+1(xn+1xn)(tn1)(xnxn1)+[λn+1tn+2f(xn)λntn+1f(xn1)].\displaystyle=t_{n+1}(x_{n+1}-x_{n})-(t_{n}-1)(x_{n}-x_{n-1})+\left[\lambda_{n+1}t_{n+2}\nabla f(x_{n})-\lambda_{n}t_{n+1}\nabla f(x_{n-1})\right]\,.

But we know that t_{n}-1=\alpha_{n}t_{n+1} for each n\geq 0, and consequently for every n satisfying 27, so we get that

zn+1zn\displaystyle z_{n+1}-z_{n} =tn+1(xn+1xn)αntn+1(xnxn1)+[λn+1tn+2f(xn)λntn+1f(xn1)]\displaystyle=t_{n+1}(x_{n+1}-x_{n})-\alpha_{n}t_{n+1}(x_{n}-x_{n-1})+\left[\lambda_{n+1}t_{n+2}\nabla f(x_{n})-\lambda_{n}t_{n+1}\nabla f(x_{n-1})\right]
=tn+1[xn+1(xn+αn(xnxn1))]+[λn+1tn+2f(xn)λntn+1f(xn1)].\displaystyle=t_{n+1}\left[x_{n+1}-(x_{n}+\alpha_{n}(x_{n}-x_{n-1}))\right]+\left[\lambda_{n+1}t_{n+2}\nabla f(x_{n})-\lambda_{n}t_{n+1}\nabla f(x_{n-1})\right]\,.

From the definition of LT-S-IGAHD, we have that x_{n}+\alpha_{n}(x_{n}-x_{n-1})=y_{n}+\lambda_{n}\left[\nabla f(x_{n})-\nabla f(x_{n-1})\right]+\omega_{n}\nabla f(x_{n})\,. Hence, it follows that

zn+1zn\displaystyle z_{n+1}-z_{n} =tn+1[xn+1ynλn(f(xn)f(xn1))ωnf(xn)]+[λn+1tn+2f(xn)λntn+1f(xn1)]\displaystyle=t_{n+1}\left[x_{n+1}-y_{n}-\lambda_{n}\left(\nabla f(x_{n})-\nabla f(x_{n-1})\right)-\omega_{n}\nabla f(x_{n})\right]+\left[\lambda_{n+1}t_{n+2}\nabla f(x_{n})-\lambda_{n}t_{n+1}\nabla f(x_{n-1})\right]
zn+1zn\displaystyle z_{n+1}-z_{n} =tn+1(xn+1yn)+[λntn+1ωntn+1+λn+1tn+2]f(xn).\displaystyle=t_{n+1}(x_{n+1}-y_{n})+\left[-\lambda_{n}t_{n+1}-\omega_{n}t_{n+1}+\lambda_{n+1}t_{n+2}\right]\nabla f(x_{n})\,.

But, from LT-S-IGAHD, we have that x_{n+1}-y_{n}=-s\nabla f(y_{n})+\gamma_{n}\nabla f(x_{n}). This leads to

zn+1zn=stn+1f(yn)+[γntn+1+λn+1tn+2λntn+1ωntn+1]f(xn).\displaystyle z_{n+1}-z_{n}=-st_{n+1}\nabla f(y_{n})+\left[\gamma_{n}t_{n+1}+\lambda_{n+1}t_{n+2}-\lambda_{n}t_{n+1}-\omega_{n}t_{n+1}\right]\nabla f(x_{n})\,. (31)

It is easy to see that assumption (ii) is equivalent to \gamma_{n}=(\lambda_{n}+\omega_{n})-\dfrac{t_{n+2}}{t_{n+1}}\lambda_{n+1} for every n\geq\alpha-1, by taking into account that t_{n+1}=\dfrac{n}{\alpha-1} and t_{n+2}=\dfrac{n+1}{\alpha-1}. Now, multiplying by t_{n+1}, it follows that \gamma_{n}t_{n+1}=(\lambda_{n}+\omega_{n})t_{n+1}-\lambda_{n+1}t_{n+2}. Using this relationship, 27 and 31, for n\geq\alpha-1, we have that

zn+1zn=stn+1f(yn)\displaystyle z_{n+1}-z_{n}=-st_{n+1}\nabla f(y_{n}) (32)

At the same time, by using 30 and 32, we get

12szn+1212szn2=tn+1f(yn),(xnx)+tn+1(xn+1xn)+λn+1tn+2f(xn)s2tn+12f(yn)2\displaystyle\dfrac{1}{2s}\|z_{n+1}\|^{2}-\dfrac{1}{2s}\|z_{n}\|^{2}=-\langle t_{n+1}\nabla f(y_{n}),(x_{n}-x^{\ast})+t_{n+1}(x_{n+1}-x_{n})+\lambda_{n+1}t_{n+2}\nabla f(x_{n})\rangle-\dfrac{s}{2}t_{n+1}^{2}\|\nabla f(y_{n})\|^{2}

This means that

12szn+1212szn2\displaystyle\dfrac{1}{2s}\|z_{n+1}\|^{2}-\dfrac{1}{2s}\|z_{n}\|^{2} =λn+1tn+1tn+2f(yn),f(xn)s2tn+12f(yn)2\displaystyle=-\lambda_{n+1}t_{n+1}t_{n+2}\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle-\dfrac{s}{2}t_{n+1}^{2}\|\nabla f(y_{n})\|^{2} (33)
tn+1f(yn),(xnx)+tn+1(xn+1xn).\displaystyle-t_{n+1}\langle\nabla f(y_{n}),(x_{n}-x^{\ast})+t_{n+1}(x_{n+1}-x_{n})\rangle\,.

Combining 29 with 33, we obtain

n+1n\displaystyle\mathcal{E}_{n+1}-\mathcal{E}_{n} tn+1f(yn),(tn+11)(ynxn)+(ynx)\displaystyle\leq t_{n+1}\langle\nabla f(y_{n}),(t_{n+1}-1)(y_{n}-x_{n})+(y_{n}-x^{\ast})\rangle
tn+1[(tn+11)𝒜2(s)+tn+1𝒜3(s;γn,L)]f(yn),f(xn)\displaystyle-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{2}(s)+t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)\right]\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle
tn+1[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)]f(xn)2𝒜1(s)tn+12f(yn)2\displaystyle-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}-\mathcal{A}_{1}(s)t_{n+1}^{2}\|\nabla f(y_{n})\|^{2}
λn+1tn+1tn+2f(yn),f(xn)s2tn+12f(yn)2tn+1f(yn),(xnx)+tn+1(xn+1xn).\displaystyle-\lambda_{n+1}t_{n+1}t_{n+2}\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle-\dfrac{s}{2}t_{n+1}^{2}\|\nabla f(y_{n})\|^{2}-t_{n+1}\langle\nabla f(y_{n}),(x_{n}-x^{\ast})+t_{n+1}(x_{n+1}-x_{n})\rangle\,.

On the other hand, we make the following estimations:

f(yn),(tn+11)(ynxn)+(ynx)f(yn),(xnx)+tn+1(xn+1xn)=\displaystyle\langle\nabla f(y_{n}),(t_{n+1}-1)(y_{n}-x_{n})+(y_{n}-x^{\ast})\rangle-\langle\nabla f(y_{n}),(x_{n}-x^{\ast})+t_{n+1}(x_{n+1}-x_{n})\rangle=
f(yn),(tn+11)(ynxn)+(ynx)(xnx)tn+1(xn+1xn)=\displaystyle\langle\nabla f(y_{n}),(t_{n+1}-1)(y_{n}-x_{n})+(y_{n}-x^{\ast})-(x_{n}-x^{\ast})-t_{n+1}(x_{n+1}-x_{n})\rangle=
f(yn),tn+1(ynxn+1).\displaystyle\langle\nabla f(y_{n}),t_{n+1}(y_{n}-x_{n+1})\rangle\,.

From LT-S-IGAHD, we know that y_{n}-x_{n+1}=s\nabla f(y_{n})-\gamma_{n}\nabla f(x_{n}), and this leads to

f(yn),(tn+11)(ynxn)+(ynx)f(yn),(xnx)+tn+1(xn+1xn)=\displaystyle\langle\nabla f(y_{n}),(t_{n+1}-1)(y_{n}-x_{n})+(y_{n}-x^{\ast})\rangle-\langle\nabla f(y_{n}),(x_{n}-x^{\ast})+t_{n+1}(x_{n+1}-x_{n})\rangle=
f(yn),stn+1f(yn)γntn+1f(xn)=stn+1f(yn)2γntn+1f(yn),f(xn).\displaystyle\langle\nabla f(y_{n}),st_{n+1}\nabla f(y_{n})-\gamma_{n}t_{n+1}\nabla f(x_{n})\rangle=st_{n+1}\|\nabla f(y_{n})\|^{2}-\gamma_{n}t_{n+1}\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle\,.

So, it follows that

n+1n\displaystyle\mathcal{E}_{n+1}-\mathcal{E}_{n} 𝒜1(s)tn+12f(yn)2+stn+12f(yn)2s2tn+12f(yn)2\displaystyle\leq-\mathcal{A}_{1}(s)t_{n+1}^{2}\|\nabla f(y_{n})\|^{2}+st_{n+1}^{2}\|\nabla f(y_{n})\|^{2}-\dfrac{s}{2}t_{n+1}^{2}\|\nabla f(y_{n})\|^{2}
tn+1[(tn+11)𝒜2(s)+tn+1𝒜3(s;γn,L)]f(yn),f(xn)\displaystyle-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{2}(s)+t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)\right]\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle
tn+1[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)]f(xn)2\displaystyle-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}
\displaystyle-\lambda_{n+1}t_{n+1}t_{n+2}\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle-\gamma_{n}t_{n+1}^{2}\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle\,.

Using the fact that \mathcal{A}_{1}(s)=s, one obtains

n+1n\displaystyle\mathcal{E}_{n+1}-\mathcal{E}_{n} tn+1[(tn+11)𝒜2(s)+tn+1𝒜3(s;γn,L)+λn+1tn+2+γntn+1]f(yn),f(xn)\displaystyle\leq-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{2}(s)+t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)+\lambda_{n+1}t_{n+2}+\gamma_{n}t_{n+1}\right]\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle
tn+1[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)]f(xn)2s2tn+12f(yn)2.\displaystyle-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}-\dfrac{s}{2}t_{n+1}^{2}\|\nabla f(y_{n})\|^{2}\,.

Recalling that, as we have already pointed out, assumption (ii) is equivalent to \lambda_{n+1}t_{n+2}+\gamma_{n}t_{n+1}=(\lambda_{n}+\omega_{n})t_{n+1} for every n\geq\alpha-1, we find that

n+1n\displaystyle\mathcal{E}_{n+1}-\mathcal{E}_{n} tn+1[(tn+11)𝒜2(s)+tn+1𝒜3(s;γn,L)+(λn+ωn)tn+1]f(yn),f(xn)\displaystyle\leq-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{2}(s)+t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)+(\lambda_{n}+\omega_{n})t_{n+1}\right]\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle
tn+1[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)]f(xn)2s2tn+12f(yn)2.\displaystyle-t_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}-\dfrac{s}{2}t_{n+1}^{2}\|\nabla f(y_{n})\|^{2}\,.

So, the difference in the discrete energy functional can be written as

n+1ntn+1F~n,\displaystyle\mathcal{E}_{n+1}-\mathcal{E}_{n}\leq-t_{n+1}\tilde{F}_{n}\,, (34)

where

F~n\displaystyle\tilde{F}_{n} :=s2tn+1f(yn)2+[(tn+11)𝒜2(s)+tn+1𝒜3(s;γn,L)+(λn+ωn)tn+1]f(yn),f(xn)\displaystyle:=\dfrac{s}{2}t_{n+1}\|\nabla f(y_{n})\|^{2}+\left[(t_{n+1}-1)\mathcal{A}_{2}(s)+t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)+(\lambda_{n}+\omega_{n})t_{n+1}\right]\langle\nabla f(y_{n}),\nabla f(x_{n})\rangle (35)
+[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)]f(xn)2.\displaystyle+\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}\,.

( Step II ) The second step of our proof consists in applying Lemma 17 and Lemma 18 to two quadratic inequalities with nonconstant coefficients, and in determining the conditions on the index n under which they are satisfied.
For simplicity, let us denote Y:=\nabla f(y_{n}) and X:=\nabla f(x_{n}). Hence, we obtain that

F~n=s2tn+1Y2\displaystyle\tilde{F}_{n}=\dfrac{s}{2}t_{n+1}\|Y\|^{2} +[(tn+11)𝒜2(s)+tn+1𝒜3(s;γn,L)+(λn+ωn)tn+1]Y,X\displaystyle+\left[(t_{n+1}-1)\mathcal{A}_{2}(s)+t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)+(\lambda_{n}+\omega_{n})t_{n+1}\right]\langle Y,X\rangle
+[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)]X2.\displaystyle+\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|X\|^{2}\,.

Using the Cauchy-Schwarz inequality, it follows that

F~ns2tn+1Y2\displaystyle\tilde{F}_{n}\geq\dfrac{s}{2}t_{n+1}\|Y\|^{2} [(tn+11)𝒜2(s)+tn+1𝒜3(s;γn,L)+(λn+ωn)tn+1]YX\displaystyle-\left[(t_{n+1}-1)\mathcal{A}_{2}(s)+t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)+(\lambda_{n}+\omega_{n})t_{n+1}\right]\|Y\|\|X\| (36)
+[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)]X2:=Fn.\displaystyle+\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|X\|^{2}:=F_{n}\,.

The next step of the present proof is to apply Lemma 17 and to show that, for n large enough, F_{n}\geq 0. In view of Lemma 17, we observe that the coefficient of \|Y\|^{2} is positive, namely \dfrac{s}{2}t_{n+1}>0 for n\geq\alpha-1. It remains to show that

[(1tn+1)𝒜2(s)tn+1𝒜3(s;γn,L)(λn+ωn)tn+1]22stn+1[(tn+11)𝒜4(s)+tn+1𝒜5(s;γn)].\displaystyle\left[(1-t_{n+1})\mathcal{A}_{2}(s)-t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)-(\lambda_{n}+\omega_{n})t_{n+1}\right]^{2}\leq 2st_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\,. (37)

This means that

(1tn+1)2𝒜22(s)+tn+12𝒜32(s;γn,L)2tn+1(1tn+1)𝒜2(s)𝒜3(s;γn,L)+(λn+ωn)2tn+12\displaystyle(1-t_{n+1})^{2}\mathcal{A}_{2}^{2}(s)+t_{n+1}^{2}\mathcal{A}_{3}^{2}(s;\gamma_{n},L)-2t_{n+1}(1-t_{n+1})\mathcal{A}_{2}(s)\mathcal{A}_{3}(s;\gamma_{n},L)+(\lambda_{n}+\omega_{n})^{2}t_{n+1}^{2}
\displaystyle-2(\lambda_{n}+\omega_{n})t_{n+1}\left[(1-t_{n+1})\mathcal{A}_{2}(s)-t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)\right]\leq 2st_{n+1}\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\,.

Expanding the terms, it follows that

tn+12𝒜22(s)+𝒜22(s)2tn+1𝒜22(s)+tn+12𝒜32(s;γn,L)2tn+1𝒜2(s)𝒜3(s;γn,L)+2tn+12𝒜2(s)𝒜3(s;γn,L)\displaystyle t_{n+1}^{2}\mathcal{A}_{2}^{2}(s)+\mathcal{A}_{2}^{2}(s)-2t_{n+1}\mathcal{A}_{2}^{2}(s)+t_{n+1}^{2}\mathcal{A}_{3}^{2}(s;\gamma_{n},L)-2t_{n+1}\mathcal{A}_{2}(s)\mathcal{A}_{3}(s;\gamma_{n},L)+2t_{n+1}^{2}\mathcal{A}_{2}(s)\mathcal{A}_{3}(s;\gamma_{n},L)
+(λn+ωn)2tn+122𝒜2(s)(λn+ωn)tn+1+2𝒜2(s)(λn+ωn)tn+12+2(λn+ωn)tn+12𝒜3(s;γn,L)\displaystyle+(\lambda_{n}+\omega_{n})^{2}t_{n+1}^{2}-2\mathcal{A}_{2}(s)(\lambda_{n}+\omega_{n})t_{n+1}+2\mathcal{A}_{2}(s)(\lambda_{n}+\omega_{n})t_{n+1}^{2}+2(\lambda_{n}+\omega_{n})t_{n+1}^{2}\mathcal{A}_{3}(s;\gamma_{n},L)
2s𝒜4(s)tn+122s𝒜4(s)tn+1+2stn+12𝒜5(s;γn).\displaystyle\leq 2s\mathcal{A}_{4}(s)t_{n+1}^{2}-2s\mathcal{A}_{4}(s)t_{n+1}+2st_{n+1}^{2}\mathcal{A}_{5}(s;\gamma_{n})\,.

Grouping by t_{n+1} and t_{n+1}^{2} implies that

tn+12[𝒜22(s)+𝒜32(s;γn,L)+2𝒜2(s)𝒜3(s;γn,L)+(λn+ωn)2+2𝒜2(s)(λn+ωn)+2(λn+ωn)𝒜3(s;γn,L)\displaystyle t_{n+1}^{2}[\mathcal{A}_{2}^{2}(s)+\mathcal{A}_{3}^{2}(s;\gamma_{n},L)+2\mathcal{A}_{2}(s)\mathcal{A}_{3}(s;\gamma_{n},L)+(\lambda_{n}+\omega_{n})^{2}+2\mathcal{A}_{2}(s)(\lambda_{n}+\omega_{n})+2(\lambda_{n}+\omega_{n})\mathcal{A}_{3}(s;\gamma_{n},L)
2s𝒜4(s)2s𝒜5(s;γn)]+tn+1[2𝒜22(s)2𝒜2(s)𝒜3(s;γn,L)2𝒜2(s)(λn+ωn)+2s𝒜4(s)]+𝒜22(s)0.\displaystyle-2s\mathcal{A}_{4}(s)-2s\mathcal{A}_{5}(s;\gamma_{n})]+t_{n+1}[-2\mathcal{A}_{2}^{2}(s)-2\mathcal{A}_{2}(s)\mathcal{A}_{3}(s;\gamma_{n},L)-2\mathcal{A}_{2}(s)(\lambda_{n}+\omega_{n})+2s\mathcal{A}_{4}(s)]+\mathcal{A}_{2}^{2}(s)\leq 0\,.

Finally, one has that

tn+12[(𝒜2(s)+𝒜3(s;γn,L))22s(𝒜4(s)+𝒜5(s;γn))+(λn+ωn)2+2(λn+ωn)(𝒜2(s)+𝒜3(s;γn,L))]+\displaystyle t_{n+1}^{2}\left[(\mathcal{A}_{2}(s)+\mathcal{A}_{3}(s;\gamma_{n},L))^{2}-2s(\mathcal{A}_{4}(s)+\mathcal{A}_{5}(s;\gamma_{n}))+(\lambda_{n}+\omega_{n})^{2}+2(\lambda_{n}+\omega_{n})(\mathcal{A}_{2}(s)+\mathcal{A}_{3}(s;\gamma_{n},L))\right]+ (38)
tn+1[2𝒜22(s)2𝒜2(s)𝒜3(s;γn,L)2(λn+ωn)𝒜2(s)+2s𝒜4(s)]+𝒜22(s)0.\displaystyle t_{n+1}\left[-2\mathcal{A}_{2}^{2}(s)-2\mathcal{A}_{2}(s)\mathcal{A}_{3}(s;\gamma_{n},L)-2(\lambda_{n}+\omega_{n})\mathcal{A}_{2}(s)+2s\mathcal{A}_{4}(s)\right]+\mathcal{A}_{2}^{2}(s)\leq 0\,.

We rewrite 38 as

Gntn+12Hntn+1In0,\displaystyle G_{n}t_{n+1}^{2}-H_{n}t_{n+1}-I_{n}\geq 0\,, (39)

where the coefficients are

{Gn:=(𝒜2(s)+𝒜3(s;γn,L))2+2s(𝒜4(s)+𝒜5(s;γn))(λn+ωn)22(λn+ωn)(𝒜2(s)+𝒜3(s;γn,L))Hn:=2𝒜22(s)2𝒜2(s)𝒜3(s;γn,L)2(λn+ωn)𝒜2(s)+2s𝒜4(s)In:=𝒜22(s)\displaystyle\begin{cases}G_{n}&:=-(\mathcal{A}_{2}(s)+\mathcal{A}_{3}(s;\gamma_{n},L))^{2}+2s(\mathcal{A}_{4}(s)+\mathcal{A}_{5}(s;\gamma_{n}))-(\lambda_{n}+\omega_{n})^{2}-2(\lambda_{n}+\omega_{n})(\mathcal{A}_{2}(s)+\mathcal{A}_{3}(s;\gamma_{n},L))\\ H_{n}&:=-2\mathcal{A}_{2}^{2}(s)-2\mathcal{A}_{2}(s)\mathcal{A}_{3}(s;\gamma_{n},L)-2(\lambda_{n}+\omega_{n})\mathcal{A}_{2}(s)+2s\mathcal{A}_{4}(s)\\ I_{n}&:=\mathcal{A}_{2}^{2}(s)\end{cases}

Since I_{n}=s^{2}>0, in order to use Lemma 18, and taking into account Remark 19, we only need to show that G_{n}>0 for n large enough, from which H_{n}^{2}+4G_{n}I_{n}\geq 0 follows. Note that 39 is equivalent to the fact that (-G_{n})t_{n+1}^{2}+H_{n}t_{n+1}+I_{n}\leq 0. Based upon 38, we compute -G_{n} as follows:

Gn\displaystyle-G_{n} =[s+γn(1Ls)]2s2+γn2+(λn+ωn)2+2(λn+ωn)(sγn(1Ls))\displaystyle=\left[s+\gamma_{n}(1-Ls)\right]^{2}-s^{2}+\gamma_{n}^{2}+(\lambda_{n}+\omega_{n})^{2}+2(\lambda_{n}+\omega_{n})(-s-\gamma_{n}(1-Ls))
=[s+γn(1Ls)]2+(γn2s2)+(λn+ωn)22(λn+ωn)(s+γn(1Ls))\displaystyle=\left[s+\gamma_{n}(1-Ls)\right]^{2}+(\gamma_{n}^{2}-s^{2})+(\lambda_{n}+\omega_{n})^{2}-2(\lambda_{n}+\omega_{n})(s+\gamma_{n}(1-Ls))
=[s+γn(1Ls)]2+(γn2s2)+(λn+ωn)2(λn+ωn)(s+γn(1Ls))(λn+ωn)(s+γn(1Ls))\displaystyle=\left[s+\gamma_{n}(1-Ls)\right]^{2}+(\gamma_{n}^{2}-s^{2})+(\lambda_{n}+\omega_{n})^{2}-(\lambda_{n}+\omega_{n})(s+\gamma_{n}(1-Ls))-(\lambda_{n}+\omega_{n})(s+\gamma_{n}(1-Ls))
=[s+γn(1Ls)][s+γn(1Ls)(λn+ωn)]+(γn2s2)+(λn+ωn)[(λn+ωn)(s+γn(1Ls))]\displaystyle=\left[s+\gamma_{n}(1-Ls)\right]\cdot\left[s+\gamma_{n}(1-Ls)-(\lambda_{n}+\omega_{n})\right]+(\gamma_{n}^{2}-s^{2})+(\lambda_{n}+\omega_{n})\left[(\lambda_{n}+\omega_{n})-(s+\gamma_{n}(1-Ls))\right]
=(γn2s2)+[s+γn(1Ls)(λn+ωn)][s+γn(1Ls)](λn+ωn)[s+γn(1Ls)(λn+ωn)]\displaystyle=(\gamma_{n}^{2}-s^{2})+\left[s+\gamma_{n}(1-Ls)-(\lambda_{n}+\omega_{n})\right]\cdot\left[s+\gamma_{n}(1-Ls)\right]-(\lambda_{n}+\omega_{n})\left[s+\gamma_{n}(1-Ls)-(\lambda_{n}+\omega_{n})\right]
=(γn2s2)+[(s+γn(1Ls))(λn+ωn)][(s+γn(1Ls))(λn+ωn)]\displaystyle=(\gamma_{n}^{2}-s^{2})+\left[(s+\gamma_{n}(1-Ls))-(\lambda_{n}+\omega_{n})\right]\cdot\left[(s+\gamma_{n}(1-Ls))-(\lambda_{n}+\omega_{n})\right]
=(γn2s2)+[(s+γn(1Ls))(λn+ωn)]2.\displaystyle=(\gamma_{n}^{2}-s^{2})+\left[(s+\gamma_{n}(1-Ls))-(\lambda_{n}+\omega_{n})\right]^{2}\,.

Considering the above computations for G_{n} and the form of \lambda_{n}+\omega_{n} from assumption (ii), after some basic manipulations we obtain exactly assumption (i), which is satisfied for each n>\max\{\alpha-1,N^{\prime}\}.
( Step III ) In the last step of the present proof we collect the results from above, apply Lemma 17 and Lemma 18, and establish the O\left(\dfrac{1}{n^{2}}\right) asymptotic rate of convergence. At the same time, we determine the index N\in\mathbb{N} such that, for n\geq N, the rate of convergence in the objective function is obtained.
From now on, we consider only indices n>\max\{N_{1},N_{2},N^{\prime}\}, where N_{1}:=\alpha-1, N_{2}:=(\alpha-1)\dfrac{H_{n}+\sqrt{H_{n}^{2}+4G_{n}I_{n}}}{2G_{n}} and N^{\prime} is the index from the hypothesis of the theorem. We know from the theorem's hypothesis that (\gamma_{n})_{n\in\mathbb{N}} and (\lambda_{n})_{n\in\mathbb{N}} are convergent. By assumption (ii), it follows that (\omega_{n})_{n\in\mathbb{N}} is also convergent. This implies that (H_{n})_{n\in\mathbb{N}} and (G_{n})_{n\in\mathbb{N}} are convergent as well, hence bounded. Consequently, the sequence \left((\alpha-1)\dfrac{H_{n}+\sqrt{H_{n}^{2}+4G_{n}I_{n}}}{2G_{n}}\right)_{n\in\mathbb{N}} is also bounded, so let M>0 be such that (\alpha-1)\dfrac{H_{n}+\sqrt{H_{n}^{2}+4G_{n}I_{n}}}{2G_{n}}<M for all n. Then any n\geq M satisfies n>N_{2} (in this sense the condition n>N_{2} is well defined).
Consider 39, with the nonconstant coefficients G_{n}, H_{n} and I_{n}. We apply Lemma 18 with x=t_{n+1}, a(x)=a(t_{n+1})=G_{n}, b(x)=b(t_{n+1})=-H_{n} and c(x)=c(t_{n+1})=-I_{n}. Since n>N_{2}, we find infinitely many points x=t_{n+1}\geq\dfrac{H_{n}+\sqrt{H_{n}^{2}+4G_{n}I_{n}}}{2G_{n}}, which obviously satisfy c(t_{n+1})=-s^{2}<0, since s>0. Also, the condition b^{2}(t_{n+1})-4a(t_{n+1})c(t_{n+1})\geq 0 is equivalent to H_{n}^{2}+4G_{n}I_{n}\geq 0. Because n>\max\{N_{1},N_{2},N^{\prime}\}, assumption (i) gives G_{n}>0, i.e. a(t_{n+1})>0 (this holds for n>\max\{\alpha-1,N^{\prime}\}). This means that indeed we have H_{n}^{2}+4G_{n}I_{n}\geq 0, for each n>\max\{N_{1},N_{2},N^{\prime}\}.
After these computations, we can now consider the quadratic inequality with nonconstant coefficients F_{n}\geq 0, with F_{n} given in 36 and n>\max\{N_{1},N_{2},N^{\prime}\}. We apply Lemma 17 with x=\|\nabla f(y_{n})\|, a(x)=a(\|\nabla f(y_{n})\|)=\dfrac{s}{2}t_{n+1}, b(x)=b(\|\nabla f(y_{n})\|)=\left[(1-t_{n+1})\mathcal{A}_{2}(s)-t_{n+1}\mathcal{A}_{3}(s;\gamma_{n},L)-(\lambda_{n}+\omega_{n})t_{n+1}\right]\|\nabla f(x_{n})\| and c(x)=c(\|\nabla f(y_{n})\|)=\left[(t_{n+1}-1)\mathcal{A}_{4}(s)+t_{n+1}\mathcal{A}_{5}(s;\gamma_{n})\right]\|\nabla f(x_{n})\|^{2}, where the coefficients are evaluated at t_{n+1} with n>\max\{N_{1},N_{2},N^{\prime}\}. The condition a(\|\nabla f(y_{n})\|)>0 is obviously satisfied for such n. Furthermore, the condition b^{2}(x)-4a(x)c(x)\leq 0 is equivalent to 37, which is in fact 39, already shown to be valid for n>\max\{N_{1},N_{2},N^{\prime}\}. We conclude that F_{n}\geq 0 for every \nabla f(y_{n}) with n>\max\{N_{1},N_{2},N^{\prime}\}.
By 34, we find that \mathcal{E}_{n+1}-\mathcal{E}_{n}\leq 0, so \mathcal{E}_{n}\leq\mathcal{E}_{N}, where N:=\max\{N_{1},N_{2},N^{\prime}\}. Since t_{n}=\dfrac{n-1}{\alpha-1} and using the definition Discrete-Energy, we have that t_{n}^{2}\left[f(x_{n})-f(x^{\ast})\right]\leq\mathcal{E}_{N}. Finally, we obtain that f(x_{n})-f(x^{\ast})\leq\dfrac{\mathcal{E}_{N}(\alpha-1)^{2}}{(n-1)^{2}} for n>N, and the proof is complete. \blacksquare

Remark 21.

Consider the case when the sequence (\lambda_{n})_{n\in\mathbb{N}} is constant, with \lambda_{n}=\beta\sqrt{s}. Then the value of z_{n} given in Discrete-Energy is z_{n}=(x_{n-1}-x^{\ast})+t_{n}(x_{n}-x_{n-1})+\beta\sqrt{s}t_{n+1}\nabla f(x_{n-1}). This differs from the choice made in [4], where z_{n}=(x_{n-1}-x^{\ast})+t_{n}\left[(x_{n}-x_{n-1})+\beta\sqrt{s}\nabla f(x_{n-1})\right], because we have used \nabla f(x_{n}) and not \nabla f(x_{n-1}) in LT-S-IGAHD (see Remark 5 and Subsection 2.2). Finally, we point out that even though Discrete-Energy is similar to the functional used in [4], there is a subtle difference: our choice of the discrete functional is optimal in the sense that in z_{n+1} we got rid of \nabla f(x_{n-1}). Every other discrete Lyapunov functional that can be constructed contains \nabla f(x_{n-1}), which cannot be eliminated since it carries a nonconstant coefficient in front of it.

Remark 22.

We observe that assumption (i) of Theorem 20 requires that \gamma^{2}_{n}<s^{2}, namely \gamma_{n}\in(-s,s), for each n>\max\{\alpha-1,N^{\prime}\}. Furthermore, following the last step of the previous proof and considering Lemma 17 and Remark 19, the following chain of equivalences holds true:

γn2<s2(1α1n)\displaystyle\gamma_{n}^{2}<s^{2}\left(1-\dfrac{\alpha-1}{n}\right) sγn2s>s(α1)n\displaystyle\Longleftrightarrow s-\dfrac{\gamma_{n}^{2}}{s}>\dfrac{s(\alpha-1)}{n}
nα1(s2γn22s)>s2tn+1[𝒜4(s)+𝒜5(s;γn)]>𝒜4(s),\displaystyle\Longleftrightarrow\dfrac{n}{\alpha-1}\left(\dfrac{s}{2}-\dfrac{\gamma_{n}^{2}}{2s}\right)>\dfrac{s}{2}\Longleftrightarrow t_{n+1}\left[\mathcal{A}_{4}(s)+\mathcal{A}_{5}(s;\gamma_{n})\right]>\mathcal{A}_{4}(s)\,,

So, in fact we have a stronger bound on \gamma_{n}, namely -s\sqrt{1-\dfrac{\alpha-1}{n}}<\gamma_{n}<s\sqrt{1-\dfrac{\alpha-1}{n}}, and this is valid for all n>\max\{\alpha-1,N^{\prime}\}.

Remark 23.

In Theorem 20 we have required the strict inequality from assumption (i) to be valid for each n>\max\{\alpha-1,N^{\prime}\}. But, if we look closer at the third step of the proof of the same theorem, we observe that it would have been enough to require it for n>\max\{\alpha-1,N_{2},N^{\prime}\}, where N_{2} was defined through H_{n}, G_{n} and I_{n}. In this way, we obtain a more relaxed restriction on the index n.
On the other hand, we observe that the assumption that the sequences (\gamma_{n})_{n\in\mathbb{N}} and (\lambda_{n})_{n\in\mathbb{N}} are convergent (hence bounded, and so is (\omega_{n})_{n\in\mathbb{N}}) can be linked to our reasoning from Subsection 2.4, where we supposed the convergence of these sequences in our heuristic analysis.

We end this section by giving some examples of algorithms of type LT-S-IGAHD that satisfy the assumptions of Theorem 20, and we also show that an index N^{\prime} indeed exists for each of them. In our first example, we consider a general class of Hessian driven optimization algorithms with asymptotic vanishing damping that involves three parameters a, \mu and b. These parameters play a crucial role in the numerical behavior, in the sense that, from a scientific computing point of view, they offer much more freedom and flexibility in the tuning of the method.

Example 24.

We consider the choice of the coefficients as follows:
\gamma_{n}=s\sqrt{\dfrac{\alpha-1}{n+a}}, \lambda_{n+1}=s\dfrac{n}{n+1}+\mu\dfrac{n}{(n+1)(n+b)} and \omega_{n}=s\sqrt{\dfrac{\alpha-1}{n+a}}+\dfrac{s}{n}+\mu\left[\dfrac{1}{n+b}-\dfrac{n-1}{n(n+b-1)}\right], where a\geq 0, b\geq 0 and \mu\geq 0. Now, we validate our algorithm as follows:
With these choices of the sequences (\gamma_{n})_{n\in\mathbb{N}} and (\lambda_{n})_{n\in\mathbb{N}}, we obtain the following chain of equalities:

ωn\displaystyle\omega_{n} =sα1n+a+sn+μ[1n+bn1n(n+b1)]\displaystyle=s\sqrt{\dfrac{\alpha-1}{n+a}}+\dfrac{s}{n}+\mu\left[\dfrac{1}{n+b}-\dfrac{n-1}{n(n+b-1)}\right]
ωn\displaystyle\omega_{n} =sα1n+as(n1n1)+μ[1n+bn1n(n+b1)]\displaystyle=s\sqrt{\dfrac{\alpha-1}{n+a}}-s\left(\dfrac{n-1}{n}-1\right)+\mu\left[\dfrac{1}{n+b}-\dfrac{n-1}{n(n+b-1)}\right]
ωn\displaystyle\omega_{n} =sα1n+asn1nμn1n(n+b1)+s+μn+b\displaystyle=s\sqrt{\dfrac{\alpha-1}{n+a}}-s\dfrac{n-1}{n}-\mu\dfrac{n-1}{n(n+b-1)}+s+\dfrac{\mu}{n+b}
ωn\displaystyle\omega_{n} =γnλn+n+1nλn+1\displaystyle=\gamma_{n}-\lambda_{n}+\dfrac{n+1}{n}\lambda_{n+1}

so assumption (ii) of Theorem 20 is satisfied for each n\geq 1.
We must show assumption (i) of Theorem 20. For this, we have the following equivalences:

[s(1γnL)n+1nλn+1]2<s2γn2\displaystyle\left[s(1-\gamma_{n}L)-\dfrac{n+1}{n}\lambda_{n+1}\right]^{2}<s^{2}-\gamma_{n}^{2} [sγnL+μn+b]2<s2γn2\displaystyle\Longleftrightarrow\left[s\gamma_{n}L+\dfrac{\mu}{n+b}\right]^{2}<s^{2}-\gamma_{n}^{2}
s2γn2L2+2sμLγn1n+b+μ2(n+b)2<s2γn2\displaystyle\Longleftrightarrow s^{2}\gamma_{n}^{2}L^{2}+2s\mu L\gamma_{n}\dfrac{1}{n+b}+\dfrac{\mu^{2}}{(n+b)^{2}}<s^{2}-\gamma_{n}^{2}
s4α1n+aL2+2s2μLα1(n+b)n+a+μ2(n+b)2<s2s2α1n+a\displaystyle\Longleftrightarrow s^{4}\dfrac{\alpha-1}{n+a}L^{2}+2s^{2}\mu L\dfrac{\sqrt{\alpha-1}}{(n+b)\sqrt{n+a}}+\dfrac{\mu^{2}}{(n+b)^{2}}<s^{2}-s^{2}\dfrac{\alpha-1}{n+a}

Multiplying by 1/s^{2}, we get that

s2α1n+aL2+2μLα1(n+b)n+a+(μs)21(n+b)2+α1n+a<1.\displaystyle s^{2}\dfrac{\alpha-1}{n+a}L^{2}+2\mu L\dfrac{\sqrt{\alpha-1}}{(n+b)\sqrt{n+a}}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{(n+b)^{2}}+\dfrac{\alpha-1}{n+a}<1\,. (40)

We require that \sqrt{n+a}<n+b. Squaring, this is equivalent to n+a-n^{2}-2bn-b^{2}<0, that is, n^{2}+(2b-1)n+(b^{2}-a)>0, a quadratic in n with discriminant \Delta=(2b-1)^{2}-4(b^{2}-a)=4(a-b)+1. We consider the following cases:

  • If \Delta\geq 0, namely b-a\leq\dfrac{1}{4}, then we can take n>\dfrac{1-2b+\sqrt{4(a-b)+1}}{2}, so the inequality is satisfied.

  • If \Delta<0, i.e. b-a>\dfrac{1}{4}, then we can take n\geq 1.

Since

s2α1n+aL2+2μLα1(n+b)n+a+(μs)21(n+b)2+α1n+a<s2α1n+aL2+2μLα1n+a+(μs)21n+a+α1n+a,s^{2}\dfrac{\alpha-1}{n+a}L^{2}+2\mu L\dfrac{\sqrt{\alpha-1}}{(n+b)\sqrt{n+a}}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{(n+b)^{2}}+\dfrac{\alpha-1}{n+a}<s^{2}\dfrac{\alpha-1}{n+a}L^{2}+2\mu L\dfrac{\sqrt{\alpha-1}}{n+a}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{n+a}+\dfrac{\alpha-1}{n+a}\,,

and taking into account 40, then it is enough to show that:

s2α1n+aL2+2μLα1n+a+(μs)21n+a+α1n+a<1\displaystyle s^{2}\dfrac{\alpha-1}{n+a}L^{2}+2\mu L\dfrac{\sqrt{\alpha-1}}{n+a}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{n+a}+\dfrac{\alpha-1}{n+a}<1 1n+a[(α1)(s2L2+1)+2μLα1+(μs)2]<1\displaystyle\Longleftrightarrow\dfrac{1}{n+a}\left[(\alpha-1)(s^{2}L^{2}+1)+2\mu L\sqrt{\alpha-1}+\left(\dfrac{\mu}{s}\right)^{2}\right]<1
n>(α1)(s2L2+1)+2μLα1+(μs)2a.\displaystyle\Longleftrightarrow n>(\alpha-1)(s^{2}L^{2}+1)+2\mu L\sqrt{\alpha-1}+\left(\dfrac{\mu}{s}\right)^{2}-a\,.

Finally, if b-a>\dfrac{1}{4}, then we can consider

N^{\prime}:=(\alpha-1)(s^{2}L^{2}+1)+2\mu L\sqrt{\alpha-1}+\left(\dfrac{\mu}{s}\right)^{2}-a\,,

and if b-a\leq\dfrac{1}{4}, then we can take

N^{\prime}:=\max\Big\{\dfrac{1-2b+\sqrt{4(a-b)+1}}{2},\,(\alpha-1)(s^{2}L^{2}+1)+2\mu L\sqrt{\alpha-1}+\left(\dfrac{\mu}{s}\right)^{2}-a\Big\}.
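For concreteness, the coefficients and the index N^{\prime} of Example 24 can be evaluated numerically. The following Python sketch (the helper names are ours and the values in the comment are merely illustrative; the authors' experiments are implemented in Matlab) assumes s\leq 1/L:

```python
import math

def example24_coefficients(n, a, b, mu, s, alpha):
    # Coefficient choices of Example 24: gamma_n, lambda_{n+1}, omega_n.
    gamma_n = s * math.sqrt((alpha - 1) / (n + a))
    lam_next = s * n / (n + 1) + mu * n / ((n + 1) * (n + b))
    omega_n = (s * math.sqrt((alpha - 1) / (n + a)) + s / n
               + mu * (1 / (n + b) - (n - 1) / (n * (n + b - 1))))
    return gamma_n, lam_next, omega_n

def example24_N_prime(a, b, mu, s, L, alpha):
    # Index N' beyond which assumption (i) of Theorem 20 is guaranteed (Example 24).
    base = (alpha - 1) * (s**2 * L**2 + 1) + 2 * mu * L * math.sqrt(alpha - 1) + (mu / s)**2 - a
    if b - a > 0.25:
        return base
    root = (1 - 2 * b + math.sqrt(4 * (a - b) + 1)) / 2
    return max(root, base)

# Illustrative values only: a = 2, b = 1, mu = 0.1, s = 0.01, L = 4, alpha = 3
# print(example24_N_prime(2.0, 1.0, 0.1, 0.01, 4.0, 3.0))
```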

The second example is also based upon three coefficients \beta, b and \mu. It is worth noticing that, for the particular case when \mu=0, we obtain an IGAHD-type algorithm that differs from the one given in [4] (see Remark 5 and Subsection 2.2).

Example 25.

We consider the following choice of the coefficients:
\gamma_{n}=0, \lambda_{n+1}=\beta\sqrt{s}+\dfrac{\mu}{n+b} and \omega_{n}=\dfrac{\beta\sqrt{s}}{n}+\mu\left[\dfrac{n+1}{n(n+b)}-\dfrac{1}{n+b-1}\right], where 0<\beta<2\sqrt{s}, b>0 and \mu\geq 0. Now, we validate our algorithm as follows:
With these choices of the sequences (\gamma_{n})_{n\in\mathbb{N}} and (\lambda_{n})_{n\in\mathbb{N}}, we obtain the following chain of equalities:

ωn\displaystyle\omega_{n} =βsn+μ[n+1n(n+b)1n+b1]\displaystyle=\dfrac{\beta\sqrt{s}}{n}+\mu\left[\dfrac{n+1}{n(n+b)}-\dfrac{1}{n+b-1}\right]
ωn\displaystyle\omega_{n} =n+1nβs+μn+1n(n+b)βsμn+b1\displaystyle=\dfrac{n+1}{n}\beta\sqrt{s}+\mu\dfrac{n+1}{n(n+b)}-\beta\sqrt{s}-\dfrac{\mu}{n+b-1}
ωn\displaystyle\omega_{n} =n+1nλn+1λn\displaystyle=\dfrac{n+1}{n}\lambda_{n+1}-\lambda_{n}
ωn\displaystyle\omega_{n} =γnλn+n+1nλn+1\displaystyle=\gamma_{n}-\lambda_{n}+\dfrac{n+1}{n}\lambda_{n+1}

so assumption (ii) of Theorem 20 is satisfied for each n\geq 1. Next, we must show assumption (i) of Theorem 20. For this, we have the following equivalences:

[s(1γnL)n+1nλn+1]2<s2γn2\displaystyle\left[s(1-\gamma_{n}L)-\dfrac{n+1}{n}\lambda_{n+1}\right]^{2}<s^{2}-\gamma_{n}^{2} [sn+1nλn+1]2<s2\displaystyle\Longleftrightarrow\left[s-\dfrac{n+1}{n}\lambda_{n+1}\right]^{2}<s^{2}
[sn+1nβsμn+1n(n+b)]2<s2\displaystyle\Longleftrightarrow\left[s-\dfrac{n+1}{n}\beta\sqrt{s}-\mu\dfrac{n+1}{n(n+b)}\right]^{2}<s^{2}
[n+1nβs+μn+1n(n+b)][2sn+1nβsμn+1n(n+b)]<0\displaystyle\Longleftrightarrow-\left[\dfrac{n+1}{n}\beta\sqrt{s}+\mu\dfrac{n+1}{n(n+b)}\right]\left[2s-\dfrac{n+1}{n}\beta\sqrt{s}-\mu\dfrac{n+1}{n(n+b)}\right]<0

Since \dfrac{n+1}{n}\beta\sqrt{s}+\mu\dfrac{n+1}{n(n+b)} is positive (because \beta>0), this term cannot vanish. So, the inequality from above is satisfied if and only if \dfrac{n+1}{n}\beta\sqrt{s}+\mu\dfrac{n+1}{n(n+b)}<2s. Multiplying by \dfrac{n}{\sqrt{s}(n+1)}, it follows that

β+μs(n+b)<2snn+1.\displaystyle\beta+\dfrac{\mu}{\sqrt{s}(n+b)}<2\sqrt{s}\dfrac{n}{n+1}\,. (41)

Since \beta<2\sqrt{s}, there exists \tau>0 such that 2\sqrt{s}-\beta>\tau. We consider \epsilon=\dfrac{\tau}{\dfrac{\mu}{\sqrt{s}}+2\sqrt{s}}>0. From the Archimedean property, for this \epsilon>0 there exists N\in\mathbb{N} such that \epsilon>\dfrac{1}{N}, thus the strict inequality 2\sqrt{s}-\beta>\left(\dfrac{\mu}{\sqrt{s}}+2\sqrt{s}\right)\epsilon implies that 2\sqrt{s}-\beta>\left(\dfrac{\mu}{\sqrt{s}}+2\sqrt{s}\right)\dfrac{1}{N}. For n+1\geq N and n+b\geq N, which are equivalent to \dfrac{1}{N}\geq\dfrac{1}{n+1} and \dfrac{1}{N}\geq\dfrac{1}{n+b}, we obtain that

(μs+2s)1Nμs1n+b+2s1n+12sβ>μs1n+b+2s1n+1.\left(\dfrac{\mu}{\sqrt{s}}+2\sqrt{s}\right)\dfrac{1}{N}\geq\dfrac{\mu}{\sqrt{s}}\dfrac{1}{n+b}+2\sqrt{s}\dfrac{1}{n+1}\Rightarrow 2\sqrt{s}-\beta>\dfrac{\mu}{\sqrt{s}}\dfrac{1}{n+b}+2\sqrt{s}\dfrac{1}{n+1}.

Thus, we can take N^{\prime}:=\max\{N-1,N-b\} and we obtain that 41 is satisfied for every n\geq N^{\prime}. Finally, one can determine the index N^{\prime} explicitly. In order to do this, observe that 41 leads to a quadratic inequality, namely:

(2sβ)n2+[2sbβ(b+1)μs]n(βb+μs)>0.(2\sqrt{s}-\beta)n^{2}+\left[2\sqrt{s}b-\beta(b+1)-\dfrac{\mu}{\sqrt{s}}\right]n-\left(\beta b+\dfrac{\mu}{\sqrt{s}}\right)>0.

In this case, one can see that N^{\prime}=\dfrac{\left[\beta(b+1)+\dfrac{\mu}{\sqrt{s}}-2\sqrt{s}b\right]+\sqrt{\Delta}}{2(2\sqrt{s}-\beta)}, where \Delta:=\left(2b\sqrt{s}-\beta(b+1)-\dfrac{\mu}{\sqrt{s}}\right)^{2}+4(2\sqrt{s}-\beta)\left(\beta b+\dfrac{\mu}{\sqrt{s}}\right)\geq 0\,.
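As a quick check, the explicit index N^{\prime} of Example 25 is the positive root of the quadratic above; a short Python sketch (our own helper, assuming 0<\beta<2\sqrt{s}) reads:

```python
import math

def example25_N_prime(beta, b, mu, s):
    # Positive root of (2*sqrt(s) - beta)*n^2 + (2*sqrt(s)*b - beta*(b+1) - mu/sqrt(s))*n
    #                  - (beta*b + mu/sqrt(s)) > 0, see Example 25.
    rs = math.sqrt(s)
    A = 2 * rs - beta                    # positive, since beta < 2*sqrt(s)
    B = 2 * rs * b - beta * (b + 1) - mu / rs
    C = -(beta * b + mu / rs)
    delta = B**2 - 4 * A * C             # this is the Delta of Example 25, always nonnegative
    return (-B + math.sqrt(delta)) / (2 * A)

# Illustrative values only: beta = 0.1, b = 1, mu = 0.25, s = 0.001
# print(example25_N_prime(0.1, 1.0, 0.25, 0.001))
```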

Our last example consists in choosing \gamma_{n} to take negative values, while \lambda_{n} takes only positive values. Thus, \omega_{n} is determined through assumption (ii) of Theorem 20.

Example 26.

We consider the choice of the coefficients as follows:
\gamma_{n}=-\dfrac{s}{n+a}, \lambda_{n+1}=s\dfrac{n}{n+1}+\mu\dfrac{n}{(n+1)(n+b)} and \omega_{n}=-\dfrac{s}{n+a}+\dfrac{s}{n}+\mu\left[\dfrac{1}{n+b}-\dfrac{n-1}{n(n+b-1)}\right], where a\geq 0, b\geq 0 and \mu\geq 0. Now, we validate our algorithm as follows:
With these choices of the sequences (\gamma_{n})_{n\in\mathbb{N}} and (\lambda_{n})_{n\in\mathbb{N}}, we obtain the following chain of equalities:

ωn\displaystyle\omega_{n} =sn+a+sn+μ[1n+bn1n(n+b1)]\displaystyle=-\dfrac{s}{n+a}+\dfrac{s}{n}+\mu\left[\dfrac{1}{n+b}-\dfrac{n-1}{n(n+b-1)}\right]
ωn\displaystyle\omega_{n} =sn+as(n1n1)+μ[1n+bn1n(n+b1)]\displaystyle=-\dfrac{s}{n+a}-s\left(\dfrac{n-1}{n}-1\right)+\mu\left[\dfrac{1}{n+b}-\dfrac{n-1}{n(n+b-1)}\right]
ωn\displaystyle\omega_{n} =sn+asn1nμn1n(n+b1)+s+μn+b\displaystyle=-\dfrac{s}{n+a}-s\dfrac{n-1}{n}-\mu\dfrac{n-1}{n(n+b-1)}+s+\dfrac{\mu}{n+b}
ωn\displaystyle\omega_{n} =γnλn+n+1nλn+1\displaystyle=\gamma_{n}-\lambda_{n}+\dfrac{n+1}{n}\lambda_{n+1}

so assumption (ii) of Theorem 20 is satisfied for each n\geq 1.
Now, regarding assumption (i), we consider the following equivalences:

[s(1γnL)n+1nλn+1]2<s2γn2\displaystyle\left[s(1-\gamma_{n}L)-\dfrac{n+1}{n}\lambda_{n+1}\right]^{2}<s^{2}-\gamma_{n}^{2} [sγnL+μn+b]2<s2γn2\displaystyle\Longleftrightarrow\left[s\gamma_{n}L+\dfrac{\mu}{n+b}\right]^{2}<s^{2}-\gamma_{n}^{2}
s2γn2L2+2sμLγn1n+b+μ2(n+b)2<s2γn2\displaystyle\Longleftrightarrow s^{2}\gamma_{n}^{2}L^{2}+2s\mu L\gamma_{n}\dfrac{1}{n+b}+\dfrac{\mu^{2}}{(n+b)^{2}}<s^{2}-\gamma_{n}^{2}
s4L21(n+a)22s2μL1(n+a)(n+b)+μ2(n+b)2<s2s21(n+a)2\displaystyle\Longleftrightarrow s^{4}L^{2}\dfrac{1}{(n+a)^{2}}-2s^{2}\mu L\dfrac{1}{(n+a)(n+b)}+\dfrac{\mu^{2}}{(n+b)^{2}}<s^{2}-s^{2}\dfrac{1}{(n+a)^{2}}

Multiplying by 1/s^{2}, we get that

(sL)2+1(n+a)22μL(n+a)(n+b)+(μs)21(n+b)2<1.\displaystyle\dfrac{(sL)^{2}+1}{(n+a)^{2}}-\dfrac{2\mu L}{(n+a)(n+b)}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{(n+b)^{2}}<1\,. (42)

Now, we analyze two cases:
1) Suppose that a\leq b. Then \dfrac{1}{n+b}\leq\dfrac{1}{n+a}. We consider n>N^{\prime}:=\sqrt{(sL)^{2}+1+\left(\dfrac{\mu}{s}\right)^{2}}-a. This leads to (n+a)^{2}>(sL)^{2}+1+\left(\dfrac{\mu}{s}\right)^{2}, thus \dfrac{(sL)^{2}+1}{(n+a)^{2}}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{(n+a)^{2}}<1. At the same time, we have the following inequalities:

(sL)2+1(n+a)22μL(n+a)(n+b)+(μs)21(n+b)2\displaystyle\dfrac{(sL)^{2}+1}{(n+a)^{2}}-\dfrac{2\mu L}{(n+a)(n+b)}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{(n+b)^{2}} (sL)2+1(n+a)2+(μs)21(n+b)2(sL)2+1(n+a)2+(μs)21(n+a)2<1.\displaystyle\leq\dfrac{(sL)^{2}+1}{(n+a)^{2}}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{(n+b)^{2}}\leq\dfrac{(sL)^{2}+1}{(n+a)^{2}}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{(n+a)^{2}}<1\,.

2) Suppose that b\leq a. Then \dfrac{1}{n+a}\leq\dfrac{1}{n+b}. We consider n>N^{\prime}:=\sqrt{(sL)^{2}+1+\left(\dfrac{\mu}{s}\right)^{2}}-b. This leads to (n+b)^{2}>(sL)^{2}+1+\left(\dfrac{\mu}{s}\right)^{2}, thus \dfrac{(sL)^{2}+1}{(n+b)^{2}}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{(n+b)^{2}}<1. At the same time, we have the following inequalities:

(sL)2+1(n+a)22μL(n+a)(n+b)+(μs)21(n+b)2\displaystyle\dfrac{(sL)^{2}+1}{(n+a)^{2}}-\dfrac{2\mu L}{(n+a)(n+b)}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{(n+b)^{2}} (sL)2+1(n+a)2+(μs)21(n+b)2(sL)2+1(n+b)2+(μs)21(n+b)2<1.\displaystyle\leq\dfrac{(sL)^{2}+1}{(n+a)^{2}}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{(n+b)^{2}}\leq\dfrac{(sL)^{2}+1}{(n+b)^{2}}+\left(\dfrac{\mu}{s}\right)^{2}\dfrac{1}{(n+b)^{2}}<1\,.

Finally, we end this example by emphasizing that, since s\leq\dfrac{1}{L} implies (sL)^{2}+1\leq 2<3, we can take N^{\prime}=\sqrt{3+\left(\dfrac{\mu}{s}\right)^{2}}-a if a\leq b and N^{\prime}=\sqrt{3+\left(\dfrac{\mu}{s}\right)^{2}}-b if b\leq a, and in this way we obtain an index that does not depend on the Lipschitz constant L.
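Similarly, one can check inequality 42 numerically for all n above the L-independent index of Example 26; a minimal Python sketch (our own helpers, with illustrative parameter values in the comments) is:

```python
import math

def example26_assumption_i(n, a, b, mu, s, L):
    # Left-hand side of inequality (42); assumption (i) holds when this value is < 1.
    lhs = ((s * L)**2 + 1) / (n + a)**2 \
          - 2 * mu * L / ((n + a) * (n + b)) \
          + (mu / s)**2 / (n + b)**2
    return lhs < 1

def example26_N_prime(a, b, mu, s):
    # L-independent index of Example 26, valid whenever s <= 1/L.
    return math.sqrt(3 + (mu / s)**2) - (a if a <= b else b)

# Illustrative check with a = 2, b = 0.1, mu = 0.1, s = 0.001, L = 4:
# Np = example26_N_prime(2, 0.1, 0.1, 0.001)
# assert all(example26_assumption_i(n, 2, 0.1, 0.1, 0.001, 4)
#            for n in range(int(Np) + 1, 1000))
```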

4 Numerical experiments involving convex functions

The role of the present section is to validate our theoretical results through computational simulations involving some convex test functions. Furthermore, we consider the particular cases of LT-S-IGAHD given in Example 24, Example 25 and Example 26 and we study the role of the parameters for each of them. Our codes for the optimization algorithms are implemented in Matlab, and the analysis of the convexity of the chosen functions is based upon fundamental results that can be found in [11].
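For the reader's convenience, we also include a minimal Python sketch (not the Matlab code used for the experiments) of one possible realization of the LT-S-IGAHD recursion, reconstructed from the update relations used in the proof of Theorem 20, namely y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})-\lambda_{n}\left[\nabla f(x_{n})-\nabla f(x_{n-1})\right]-\omega_{n}\nabla f(x_{n}) and x_{n+1}=y_{n}-s\nabla f(y_{n})+\gamma_{n}\nabla f(x_{n}); the coefficient functions and the stopping rule are placeholders that must be supplied by the user.

```python
import numpy as np

def lt_s_igahd(f, grad_f, x0, s, alpha, gamma, lam, omega, n_max=10000, tol=1e-10):
    """Sketch of an LT-S-IGAHD-type iteration.

    gamma, lam, omega : callables n -> gamma_n, lambda_n, omega_n
    s                 : stepsize (s = h^2, with s <= 1/L)
    """
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev - s * grad_f(x_prev)            # x_1 = x_0 - s * grad f(x_0)
    for n in range(1, n_max):
        alpha_n = (n - alpha) / n              # asymptotic vanishing damping coefficient
        g, g_prev = grad_f(x), grad_f(x_prev)
        y = x + alpha_n * (x - x_prev) - lam(n) * (g - g_prev) - omega(n) * g
        x_next = y - s * grad_f(y) + gamma(n) * g
        if abs(f(x_next) - f(x)) <= tol:       # stopping rule |f(x_n) - f(x_{n-1})| <= eps
            return x_next, n + 1
        x_prev, x = x, x_next
    return x, n_max
```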

4.1 Algorithmic description and objective functions

We consider the algorithms of the form LT-S-IGAHD (with the stepsize s=h^{2}), starting from n\geq 1. We initialize the starting point x_{0}\in\mathbb{R}^{2} for functions f:\mathbb{R}^{2}\to\mathbb{R}. Furthermore, even though in many simulations x_{1}\in\mathbb{R}^{2} is considered arbitrary, for consistency we choose x_{1}=x_{0}-s\nabla f(x_{0}), namely the first iteration at n=1 is given by a step of gradient descent. Throughout the present section we count the total number of iterations starting from n=0 in order to include the initial value x_{0} (even though our algorithms start from n=1). Furthermore, since for x_{2} we only need the values x_{0}, x_{1} and y_{1}, for brevity we set y_{0}=x_{0}. Using a tolerance of low order \epsilon=10^{-10}, we employ the stopping criterion \|f(x_{n})-f(x_{n-1})\|_{2}\leq\epsilon, or \|f(x_{n})-f(x^{\ast})\|_{2}\leq\epsilon in the case when the objective function admits a unique minimum point x^{\ast}\in\mathbb{R}^{2}. For each of the functions that we consider we compute the Lipschitz constant in order to present the behavior of the discretizations with respect to the parameters. Furthermore, we consider the stepsize s<\dfrac{1}{L}, since in the limit case s=\dfrac{1}{L} no interpretation can be given, the algorithms stopping after just a few iterations. We consider the following convex functions:
1) A convex, but not strongly convex function (see [4]), namely f:\mathbb{R}^{2}\to\mathbb{R}, defined as

f(x1,x2)=(x1+x2)2,(x1,x2)2,\displaystyle f(x_{1},x_{2})=(x_{1}+x_{2})^{2}\,,\,\forall(x_{1},x_{2})\in\mathbb{R}^{2}\,, (F1)

It is easy to observe that \nabla f(x_{1},x_{2})=\begin{pmatrix}2(x_{1}+x_{2})\\ 2(x_{1}+x_{2})\end{pmatrix}. Furthermore, the Hessian of f is \nabla^{2}f(x_{1},x_{2})=\begin{pmatrix}2&2\\ 2&2\end{pmatrix}. Taking z=\begin{pmatrix}z_{1}\\ z_{2}\end{pmatrix}, we infer that z^{T}\nabla^{2}f(x_{1},x_{2})z=2(z_{1}+z_{2})^{2}\geq 0, so the Hessian matrix satisfies \nabla^{2}f\succeq 0. Using the second order characterization of convexity (see Theorem 7.12 from [11]), we find that the function is convex. Furthermore, solving the equation \nabla f(x_{1},x_{2})=\begin{pmatrix}0\\ 0\end{pmatrix} and using the convexity of the objective function together with Theorem 7.9 of [11], we obtain that the set of minimizers is given by \operatorname*{argmin}f=\{(x_{1},x_{2})\in\mathbb{R}^{2}\,/\,x_{1}=-x_{2}\}. Since we use the l_{2} norm on the gradient, we evaluate the subordinate matrix norm. Because the Hessian is symmetric, \|\nabla^{2}f(x_{1},x_{2})\|_{2} is its largest absolute eigenvalue, hence \|\nabla^{2}f(x_{1},x_{2})\|_{2}=4, so the Lipschitz constant of the gradient \nabla f is L=4. On the other hand, since the \operatorname*{argmin} set contains infinitely many elements, we use the stopping criterion \|f(x_{n})-f(x_{n-1})\|_{2}\leq\epsilon, for a given tolerance \epsilon>0 sufficiently small. Moreover, it is also natural to adapt the Discrete-Energy function to the case when we do not actually know to which minimizer our algorithm converges. For this, we consider

n:=tn2(f(xn)f(xM))+12sz~n2\displaystyle\mathcal{E}_{n}:=t_{n}^{2}\left(f(x_{n})-f(x_{M})\right)+\dfrac{1}{2s}\|\tilde{z}_{n}\|^{2} (43)
z~n:=(xn1xM)+tn(xnxn1)+λntn+1f(xn1),\displaystyle\tilde{z}_{n}:=(x_{n-1}-x_{M})+t_{n}(x_{n}-x_{n-1})+\lambda_{n}t_{n+1}\nabla f(x_{n-1})\,,

where M is determined by the stopping condition, i.e. when we reach the tolerance \epsilon we take the last iterate x_{M}:=x_{n}, since for \epsilon small enough x_{M} is close to some x^{\ast}\in\operatorname*{argmin}f.
2) The hybrid norm function f:2f:\mathbb{R}^{2}\to\mathbb{R}, defined as

f(x1,x2)=1+x12+1+x22,(x1,x2)2.\displaystyle f(x_{1},x_{2})=\sqrt{1+x_{1}^{2}}+\sqrt{1+x_{2}^{2}}\,\,,\,\forall(x_{1},x_{2})\in\mathbb{R}^{2}\,. (F2)

The gradient of the objective function has the form \nabla f(x_{1},x_{2})=\begin{pmatrix}\dfrac{x_{1}}{\sqrt{1+x_{1}^{2}}}\\ \dfrac{x_{2}}{\sqrt{1+x_{2}^{2}}}\end{pmatrix}. On the other hand, the Hessian matrix associated to f is given by \nabla^{2}f(x_{1},x_{2})=\begin{pmatrix}\dfrac{1}{(1+x_{1}^{2})^{3/2}}&0\\ 0&\dfrac{1}{(1+x_{2}^{2})^{3/2}}\end{pmatrix}. Taking z=\begin{pmatrix}z_{1}\\ z_{2}\end{pmatrix}, we obtain that z^{T}\nabla^{2}f(x_{1},x_{2})z=z_{1}^{2}\dfrac{1}{(1+x_{1}^{2})^{3/2}}+z_{2}^{2}\dfrac{1}{(1+x_{2}^{2})^{3/2}}\geq 0, so the Hessian satisfies \nabla^{2}f\succeq 0. As in the case of F1, we use Theorem 7.12 from [11] and obtain the convexity of F2. Furthermore, solving the equation \nabla f(x_{1},x_{2})=\begin{pmatrix}0\\ 0\end{pmatrix} and using the convexity of the objective function, we obtain that the set of minimizers is \operatorname*{argmin}f=\{(0,0)\} (this follows from Theorem 7.9 of [11], namely the necessity and sufficiency of stationarity). Since we use the l_{2} norm on the gradient, we evaluate the subordinate matrix norm. It is easy to observe that \|\nabla^{2}f(x_{1},x_{2})\|_{1}=\|\nabla^{2}f(x_{1},x_{2})\|_{\infty}=\max\Big\{\dfrac{1}{(1+x_{1}^{2})^{3/2}},\dfrac{1}{(1+x_{2}^{2})^{3/2}}\Big\}\leq 1. Since \nabla^{2}f(x_{1},x_{2})\in\mathbb{R}^{2\times 2}, by elementary computations we have that \|\nabla^{2}f(x_{1},x_{2})\|_{2}\leq\sqrt{2}\|\nabla^{2}f(x_{1},x_{2})\|_{\infty}, so L=\sqrt{2} is a valid Lipschitz constant of the gradient. At the same time, since the \operatorname*{argmin} set contains just one global minimum point, we use the stopping criterion \|f(x_{n})-f(x^{\ast})\|_{2}\leq\epsilon, where x^{\ast} is the unique element of \operatorname*{argmin}f. Also, we consider Discrete-Energy for the Lyapunov functional.
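For illustration, the two test functions and their gradients can be coded as follows and plugged into the iteration sketch above (Python, with assumed illustrative parameter values; the names are ours):

```python
import numpy as np

def f1(x):
    # F1: f(x1, x2) = (x1 + x2)^2, convex but not strongly convex, L = 4
    return (x[0] + x[1])**2

def grad_f1(x):
    g = 2.0 * (x[0] + x[1])
    return np.array([g, g])

def f2(x):
    # F2: hybrid norm f(x1, x2) = sqrt(1 + x1^2) + sqrt(1 + x2^2), L = sqrt(2) is a valid constant
    return np.sqrt(1 + x[0]**2) + np.sqrt(1 + x[1]**2)

def grad_f2(x):
    return np.array([x[0] / np.sqrt(1 + x[0]**2), x[1] / np.sqrt(1 + x[1]**2)])

# Example run with the coefficients of Example 25 (gamma_n = 0), illustrative values only:
# s, alpha, beta, b, mu = 0.001, 3.0, 0.1, 1.0, 0.1
# lam   = lambda n: beta * np.sqrt(s) + mu / (n + b - 1)          # lambda_n
# omega = lambda n: beta * np.sqrt(s) / n + mu * ((n + 1) / (n * (n + b)) - 1 / (n + b - 1))
# x_min, iters = lt_s_igahd(f2, grad_f2, np.array([1.0, -2.0]), s, alpha,
#                           gamma=lambda n: 0.0, lam=lam, omega=omega)
```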

Figure 1: The convex objective functions F1 and F2. (a) The representation of F1 along with the set of minimum points \operatorname*{argmin}f; (b) the representation of F2 along with the set of minimum points \operatorname*{argmin}f.

Following the proof of Theorem 20, we consider N_{1}=\alpha-1 and N_{2} given by G_{n}, H_{n} and I_{n} as in the proof of the aforementioned theorem. Now we turn our attention to some choices of the parameters that appear in the three examples from the previous section, concerning the index N^{\prime}. However, we emphasize that even though G_{n}, H_{n} and I_{n} depend on the Lipschitz constant of the gradient \nabla f (this is also valid for the case of Nesterov's method AGM2\alpha), the empirical convergence is in fact determined by the index N^{\prime}, which contains some predetermined values of the underlying parameters.

4.2 Some discussions over the algorithms parameters

In this subsection we consider some situations concerning the parameters appearing in Example 24, Example 25 and Example 26. For each of them, we analyze empirically the key role of the underlying parameters in the convergence of the optimization algorithms of type LT-S-IGAHD.
Likewise, for two parameters A and B, we use the heuristic notation "A\ll B" for the situation when A is much smaller than B and "A\gg B" for the case when A is much larger than B, respectively. With these notations, we carry out a heuristic analysis of our algorithms.

The case involving Example 24) We aim at considering two subcases:
A.) When \dfrac{\mu}{s}\ll 1. This is, in fact, the practical case when the parameters have small values. Because, in general, s is relatively small, this requires us to take \mu\ll 1. For this, we consider the following situations:

  • A1.)

    When a+\dfrac{1}{4}<b. Now, if a\gg 1 (a has a large value), then we can take a\gg\left(\dfrac{\mu}{s}\right)^{2}, so N^{\prime}\ll 0. On the other hand, when a\ll 1, then we can take a\approx\left(\dfrac{\mu}{s}\right)^{2}, so N^{\prime}>0, i.e. N^{\prime}=O\left(\mu L\right), but \mu\ll 1 annihilates the Lipschitz constant L in the value of the index N^{\prime}.

  • A2.)

    When b\leq a+\dfrac{1}{4}. If b\gg 1, then a\gg 1, so we may take a\gg\left(\dfrac{\mu}{s}\right)^{2}. This is in fact the same situation as before, when a can take a large value, because the difference under the square root, namely a-b, can be close to 0, in the sense that for b\approx a we have N^{\prime}\approx\dfrac{1}{2}-b, i.e. N^{\prime}\approx 0. On the other hand, for the case when b\ll 1 and a\gg 1, we find that N^{\prime}\approx\dfrac{\sqrt{\Delta}}{2}, hence N^{\prime}=O(\sqrt{a}), if we can take a large enough such that a\gg\left(\dfrac{\mu}{s}\right)^{2}.

B.) When \dfrac{\mu}{s}\gg 1. Since in many situations s\ll 1, we can take \mu>1 or \mu\gg 1. For this, we consider the following situations:

  • B1.)

    When a+\dfrac{1}{4}<b. If a\gg 1, then b\gg 1. We can take a\approx\left(\dfrac{\mu}{s}\right)^{2}, so N^{\prime}=O(\mu L), which can eventually be large. On the other hand, if b\gg 1 and a\ll 1, we can take a\ll\left(\dfrac{\mu}{s}\right)^{2}, so N^{\prime}=O\left(\left(\dfrac{\mu}{s}\right)^{2}\right), i.e. N^{\prime}\gg 1.

  • B2.)

    When b\leq a+\dfrac{1}{4}. If a\gg 1 and b\gg 1, then we are in a similar situation as presented above, since we can take a close to b such that the difference a-b in the square root of the definition of N^{\prime} is close to 0. On the other hand, if a\gg 1 and b\ll 1, then we can take a\approx\left(\dfrac{\mu}{s}\right)^{2}, thus N^{\prime}=O\left(\sqrt{a}\right), which is equivalent to N^{\prime}\gg 1.

The case involving Example 25) For this example, we consider the only valid situation, when 2\sqrt{s}>\beta:
Since the stepsize s has small values in many practical applications, we must consider \beta<1 or \beta\approx 0. For small values of \beta, we observe that N^{\prime}=O\left(\dfrac{\dfrac{\mu}{\sqrt{s}}-2\sqrt{s}b+\sqrt{\Delta}}{2(2\sqrt{s}-\beta)}\right). Now, we consider the following subcases:
C1.) When \dfrac{\mu}{\sqrt{s}}\gg 1 and b is small or has moderate values, we have that

N=O(μs+(2sβ)μs+(μs)22(2sβ))1.N^{\prime}=O\left(\dfrac{\dfrac{\mu}{\sqrt{s}}+\sqrt{(2\sqrt{s}-\beta)\dfrac{\mu}{\sqrt{s}}+\left(\dfrac{\mu}{\sqrt{s}}\right)^{2}}}{2(2\sqrt{s}-\beta)}\right)\gg 1.

C2.) If we are in one of the cases \dfrac{\mu}{\sqrt{s}}<1 or \dfrac{\mu}{\sqrt{s}}\ll 1, then, by the analysis made above, we obtain that N^{\prime}\approx 0, since \mu<\sqrt{s} or \mu\ll 1.
C3.) If b\gg 0 and dominates the other parameters, then it is easy to see that N^{\prime}=O\left((2\sqrt{s}-\beta)\beta\sqrt{b}\right), i.e. N^{\prime}\gg 1. Furthermore, in order to have N^{\prime}\approx 0, we must take \beta\ll b.


The case involving Example 26) This is similar to the first example involving the non-negative coefficients a, b and \mu. For brevity, we consider the following cases for our heuristic interpretation:
D.) When \dfrac{\mu}{s}\ll 1. Since in many situations s\ll 1, we have that \mu\ll 1. Here, we consider the following subcases:

  • D1.)

    When a\leq b. If a\ll 1 and b\gg 1, i.e. a is small and b has large values, we find that N^{\prime}\approx\sqrt{3+\left(\dfrac{\mu}{s}\right)^{2}}\approx\sqrt{3}, so the index N^{\prime} is small. On the other hand, if a\gg 1 and b\gg 1, we obtain that N^{\prime}<0, so no restriction on n is imposed.

  • D2.)

    When $b\leq a$. If $a\gg 1$, then $b\gg 1$ and, as before, $N^{\prime}\approx 0$. On the other hand, if $a\gg 1$ and $b\ll 1$, it follows that $N^{\prime}\approx\sqrt{3+\left(\dfrac{\mu}{s}\right)^{2}}$, i.e. the index $N^{\prime}$ is small.

E.) When $\dfrac{\mu}{s}\gg 1$, then $\sqrt{(sL)^{2}+1+\left(\dfrac{\mu}{s}\right)^{2}}\gg 0$. Likewise, in practical situations we are obliged to take $\mu\gg 1$. We consider the following subcases:

  • E1.)

    When $a\leq b$. In order to obtain the optimal case when $N^{\prime}\leq 0$, we can take $a\gg 0$, more precisely $a\geq\sqrt{3+\left(\dfrac{\mu}{s}\right)^{2}}\geq\sqrt{(sL)^{2}+1+\left(\dfrac{\mu}{s}\right)^{2}}$. It is trivial to see that since $a$ is large, then $b$ is also large.

  • E2.)

    When $b\leq a$. As in the previous case, we consider the optimal case when $N^{\prime}\leq 0$, so we can easily take $b\geq\sqrt{3+\left(\dfrac{\mu}{s}\right)^{2}}$. In this situation $b\gg 1$, so $a\gg 1$.

4.3 A computational perspective over the optimization algorithms

In the present subsection, we present some numerical experiments on the algorithms of type LT-S-IGAHD, taking into account the theoretical rate of convergence in the convex objective function obtained in the proof of Theorem 20. At the same time, we consider precisely the cases and the subcases presented in Subsection 4.2.
In Figure 2(a), we have considered the convex objective function F2 along with the initial condition $x_{0}=\begin{pmatrix}1\\ -2\end{pmatrix}$. Also, we have set $\alpha=3$ and the tolerance $\epsilon=10^{-10}$. We show the behavior of the discretization schemes along the trajectories when the stepsize is taken as $s=0.001$, in order to simulate the underlying dynamical systems. In Subfigure 2(a), we have considered the IGAHD algorithm with $\beta=10^{-2}$ and $b=1$ (black color) and the case when $\beta=10^{-1}$ and $b=1$ (blue color). Furthermore, we have also considered the LT-S-IGAHD algorithm from Example 25, since when $\mu\approx 0$ the algorithm becomes of IGAHD type. We have chosen the following parameters: $\mu=10^{-1}$, $\beta=10^{-2}$, $b=1$ (red color), $\mu=10^{-1}$, $\beta=10^{-1}$, $b=1$ (saddle brown color, $[139,69,19]$ in RGB), $\mu=0.25$, $\beta=10^{-2}$, $b=1$ (yellow color) and $\mu=0.25$, $\beta=10^{-1}$, $b=1$ (rustic red color, $[75,0,13]$ in RGB). With these choices of coefficients, we observe that our algorithms have trajectories very close to those of IGAHD. This is validated by the theory, since when $\mu$ is taken small, the only difference between the algorithm given in Example 25 and IGAHD is between $\nabla f(x_{n})$ and $\nabla f(x_{n-1})$. On the other hand, in Subfigure 2(b), we have considered the algorithm from Example 26, with the following choices of the algorithm's coefficients: $a=2$, $\mu=10^{-15}$, $b=0.1$ (red color), $a=2$, $\mu=10^{-1}$, $b=0.1$ (blue color) and, finally, $a=25$, $\mu=2$, $b=3$ (black color). In this situation we observe the phenomenon of high amplification of the coefficients, namely when the parameters have large values, the trajectories have a high acceleration toward the critical point. From this subfigure, and also from our preliminary numerical simulations, we have observed that there are some optimal choices of the parameters that lead to faster convergence, i.e. to a low number of iterations for a given tolerance.
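To make the experimental setup reproducible, a minimal Python harness is sketched below. The convex test function F2 and the LT-S-IGAHD update of Example 25 are defined earlier in the paper and are not reproduced here, so both are replaced by clearly labeled stand-ins (a simple convex quadratic and a Nesterov-type inertial step); only the configuration taken from the text, namely $x_{0}=(1,-2)^{T}$, $\alpha=3$, $\epsilon=10^{-10}$ and $s=0.001$, should be read as the actual experiment.

```python
import numpy as np

# Stand-ins: the convex test function F2 and the LT-S-IGAHD update of Example 25
# are defined earlier in the paper; a simple convex quadratic and a Nesterov-type
# inertial step (alpha = 3) are used here only to illustrate the harness.
def F(x):
    return 0.5 * x[0] ** 2 + 2.0 * x[1] ** 2

def grad_F(x):
    return np.array([x[0], 4.0 * x[1]])

def standin_step(x_curr, x_prev, n, s, alpha):
    # Inertial extrapolation followed by a gradient step; the actual LT-S-IGAHD
    # update also involves the coefficients mu, beta (or a, b) and a
    # gradient-difference correction, which are omitted in this stand-in.
    y = x_curr + (n / (n + alpha)) * (x_curr - x_prev)
    return y - s * grad_F(y)

def collect_trajectory(x0, s=0.001, alpha=3, eps=1e-10, max_iter=200_000):
    xs = [x0.copy(), x0.copy()]                 # convention x_{-1} = x_0
    for n in range(1, max_iter):
        xs.append(standin_step(xs[-1], xs[-2], n, s, alpha))
        if abs(F(xs[-1]) - F(xs[-2])) < eps:    # tolerance on consecutive F-values
            break
    return np.array(xs)                         # rows are phase-plane points

traj = collect_trajectory(np.array([1.0, -2.0]))
```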

(a) Zoomed trajectories of IGAHD (from [4]) and LT-S-IGAHD type algorithms from Example 25
(b) The convergence of the trajectories for LT-S-IGAHD of Example 26
Figure 2: The phase plane trajectories for LT-S-IGAHD-type algorithms for the function F2

Now, we focus on the discrete velocity $v_{n}=(x_{n}-x_{n-1})/\sqrt{s}$, where $\sqrt{s}=h>0$. This is considered in Figure 3. We have taken $\epsilon=10^{-10}$, $\alpha=3$ and $x_{0}=\begin{pmatrix}1\\ -2\end{pmatrix}$. For the algorithm LT-S-IGAHD from Example 25, we have set the values of the parameters as follows: $\mu=0.5$, $\beta=10^{-5}$ and $b=2$. In Subfigure 3(a) we have plotted the first component of the velocity in $\mathbb{R}^{2}$ for $s=0.25$, while in Subfigure 3(b) we have taken a smaller stepsize, $s=0.025$. We observe that, in the beginning, the velocity presents high oscillations and then it stabilizes toward the value $0$; this is related to the fact that the velocity of the underlying evolution equation has the same behavior. The same holds true for the second component of the velocity and for other choices of the algorithm's parameters. Moreover, we observe that the stabilization process acts much faster than the convergence of the numerical scheme when the tolerance is small.
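The discrete velocity plotted in Figure 3 can be recovered directly from the stored iterates. The following sketch assumes that an iterate array traj (for instance, the one produced by the harness sketched after the discussion of Figure 2) is available; Figure 3 itself uses the LT-S-IGAHD iterates of Example 25 with $s=0.25$ and $s=0.025$.

```python
import numpy as np
import matplotlib.pyplot as plt

def discrete_velocity(xs, s):
    """v_n = (x_n - x_{n-1}) / sqrt(s) for an iterate array xs of shape (num_iters, dim)."""
    return np.diff(xs, axis=0) / np.sqrt(s)

# 'traj' is assumed to come from a run such as the harness sketched earlier.
v = discrete_velocity(traj, s=0.001)
plt.plot(v[:, 0])                 # first component of the velocity, as in Subfigures 3(a)-(b)
plt.xlabel("iteration n")
plt.ylabel("first component of v_n")
plt.show()
```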
Next, we turn our attention to Table 1, Table 2, Table 3 and Table 4. Here, we have validated all of the cases from Subsection 4.2 involving both convex functions F1 and F2, respectively. First, note that the error represents the difference between two consecutive values of the objective function, which must be less than the given tolerance $\epsilon$; along with this, the iterates must satisfy an additional stopping criterion, namely $n>N$, where $N=\max\{N_{1},N_{2},N^{\prime}\}$. We observe that our heuristic analysis stands very close to the simulations that we have made. For this, observe the values of $N^{\prime}$ given for each of the cases. An important difference occurs in the case $A_{2}$, where the value has an error because the coefficient $a$ is not much larger than $1$. Furthermore, another difference appears in the case $B_{2}$, where we have not taken $a-b\approx 0$; this shows that the heuristic interpretation we gave leads to instability in the coefficients, in the sense that they must take extreme values (see also the situation $B_{1}$). This is linked to the fact that in Subsection 4.2 we have considered an asymptotic analysis in which some coefficients have very small values while others have very large values. The cases $A_{2}$, $B_{1}$ and $B_{2}$ justify the difference between the simulations and the heuristic analysis. On the other hand, for all of the other cases presented in the four tables, $N^{\prime}$ is very close to the actual theoretical value. Likewise, it is interesting to note that the empirical values $N_{2}$ and $N^{\prime}$ obey an empirical relation: in the case of moderate or large coefficients, $N^{\prime}$ can be much greater than $N_{2}$, so in this sense, in the majority of cases, the index $N^{\prime}$ has a large influence over the convergence of the algorithms of type LT-S-IGAHD. Nevertheless, there are some cases when $N_{2}$ has a larger influence over the rate of convergence, i.e. when the coefficients take extremely high numerical values, then $N_{2}$ increases (with respect to the values presented above in the other cases) and $N^{\prime}$ decreases. Even though $N^{\prime}$ is still larger than $N_{2}$, the difference is lower than in the previous cases. Finally, we point out that, in our numerical simulations, the coefficient $G_{n}$ is strictly positive when $n>\max\{N_{1},N^{\prime}\}$. This remains valid for all of the cases and for all of the three examples that we gave, and it is consistent with the arguments from the third step of the proof of Theorem 20.
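For clarity, the combined stopping rule described above (two consecutive objective values closer than the tolerance $\epsilon$ and, additionally, $n>N$ with $N=\max\{N_{1},N_{2},N^{\prime}\}$) can be written as a small predicate. The indices $N_{1}$, $N_{2}$ and $N^{\prime}$ are assumed to be computed beforehand from the coefficients, as reported in Tables 1-4; the value used for $N_{1}$ below is only a placeholder, since $N_{1}$ is not listed in the tables.

```python
def should_stop(f_curr, f_prev, n, eps, n1, n2, n_prime):
    """Stopping rule used for Tables 1-4: two consecutive objective values differ
    by less than the tolerance eps AND the iteration index n exceeds
    N = max{N1, N2, N'}."""
    return abs(f_curr - f_prev) < eps and n > max(n1, n2, n_prime)

# Example with the indices of the first row of Table 1 (N2 = 3.91, N' = -3.56),
# so N = 3.91 and the rule can only fire from n = 4 onward; n1 = 0.0 is a
# placeholder, since N1 is not reported in the tables.
print(should_stop(f_curr=1.00e-11, f_prev=2.22e-11, n=4, eps=1e-10,
                  n1=0.0, n2=3.91, n_prime=-3.56))
```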

(a) The case when the stepsize has the value $s=0.25$
(b) The case when the stepsize has the value $s=0.025$
Figure 3: The first component of the discrete velocity of algorithms LT-S-IGAHD from Example 25 for the function F2
Table 1: Analysis of the cases $A_{1}$, $A_{2}$, $B_{1}$ and $B_{2}$ of Subsection 4.2, for the Example 24 and for the function F1
Cases     error      $\varepsilon$   $\mu$    $a$      $b$      $N_{2}$   $N^{\prime}$   $N$
$A_{1}$   1.22e-11   1e-10           1e-02    4.0      10.0     3.91      -3.56          3.91
$A_{1}$   3.31e-11   1e-10           1e-02    1e-02    10.0     3.99      0.43           3.99
$A_{2}$   2.82e-11   1e-10           1e-02    10.0     4.0      3.80      -1.00          3.80
$A_{2}$   1.50e-12   1e-10           1e-05    3.0      1e-01    3.92      2.17           3.92
$B_{1}$   0.0        1e-10           1.0      100.0    102.0    3.74      11.63          11.63
$B_{1}$   0.0        1e-10           2.0      5e-01    6.0      3.48      422.45         422.45
$B_{2}$   0.0        1e-10           1.5      2.0      1.75     3.58      240.29         240.29
$B_{2}$   0.0        1e-10           1.5      225      1e-02    8.52      17.29          17.29
Table 2: Analysis of the cases $A_{1}$, $A_{2}$, $B_{1}$ and $B_{2}$ of Subsection 4.2, for the Example 24 and for the function F2
Cases     error      $\varepsilon$   $\mu$    $a$      $b$      $N_{2}$   $N^{\prime}$   $N$
$A_{1}$   7.03e-11   1e-10           1e-02    4.0      10.0     3.38      -3.91          3.38
$A_{1}$   1.92e-12   1e-10           1e-02    1e-02    10.0     3.38      0.08           3.38
$A_{2}$   3.47e-11   1e-10           1e-02    10.0     4.0      3.37      -1.0           3.37
$A_{2}$   7.53e-11   1e-10           1e-05    3.0      1e-01    3.38      2.17           3.38
$B_{1}$   0.0        1e-10           1        100      102      3.50      4.04           4.04
$B_{1}$   0.0        1e-10           2.0      5e-01    6.0      3.43      407.54         407.54
$B_{2}$   0.0        1e-10           1.5      2.0      1.75     3.5       229.04         229.04
$B_{2}$   0.0        1e-10           1.5      225      1e-02    4.46      15.50          15.50
Table 3: Analysis of the cases $D_{1}$, $D_{2}$, $E_{1}$ and $E_{2}$ of Subsection 4.2, for the Example 26 and for the function F1
Cases     error      $\varepsilon$   $\mu$    $a$      $b$      $N_{2}$   $N^{\prime}$   $N$
$D_{1}$   3.37e-11   1e-10           0.0      0.25     3.5      3.18      0.83           3.18
$D_{1}$   9.79e-11   1e-10           1e-3     1.25     5.5      3.18      -0.17          3.18
$D_{2}$   3.28e-11   1e-10           1e-3     5.5      1.25     3.19      -0.17          3.19
$D_{2}$   1.11e-11   1e-10           1e-3     3.5      0.25     3.19      0.83           3.19
$E_{1}$   0.0        1e-10           2.0      21.0     24.0     5.82      -0.97          5.82
$E_{2}$   0.0        1e-10           2.0      24.0     21.0     6.09      -0.97          6.09
Table 4: Analysis of the cases $D_{1}$, $D_{2}$, $E_{1}$ and $E_{2}$ of Subsection 4.2, for the Example 26 and for the function F2
Cases     error      $\varepsilon$   $\mu$    $a$      $b$      $N_{2}$   $N^{\prime}$   $N$
$D_{1}$   4.21e-12   1e-10           0.0      0.25     3.5      3.23      0.76           3.23
$D_{1}$   9.18e-11   1e-10           1e-03    1.25     5.5      3.23      -0.24          3.23
$D_{2}$   5.68e-11   1e-10           1e-03    5.5      1.25     3.23      -0.24          3.23
$D_{2}$   7.18e-11   1e-10           1e-03    3.5      0.25     3.23      0.76           3.23
$E_{1}$   0.0        1e-10           2.0      21.0     24.0     4.14      -0.97          4.14
$E_{2}$   0.0        1e-10           2.0      24.0     21.0     4.24      -0.97          4.24

Conclusions and Discussions

The novelty of the present paper consists in the following theoretical and practical aspects regarding Hessian driven damping dynamical systems and their associated numerical methods. We summarize the link between the discrete case and the continuous counterpart of the optimization algorithms by pointing out our major contributions, together with some discussions that arise naturally from our research:

  • 1)

    In Subsection 2.1, we have shown that Nesterov's algorithm AGM2α is in fact a natural discretization of a Hessian-type dynamical system. Even though other authors (see [18] and [33]) have pointed out a link between continuous Hessian dynamical systems and the Nesterov algorithm, we have introduced a rigorous perspective under which Nesterov's method can be interpreted as a Lie-Trotter discretization of the underlying second-order evolution equation. This approach is new with respect to inertial algorithms, in the following sense: only recently (see [1] and [29]) was it shown that Nesterov's method is a natural discretization of the Extended-AVD dynamical system. Before that, no other constructions were available under which the Nesterov algorithm could be interpreted as an explicit numerical scheme of a continuous dynamical system without an extra-gradient step, so our method represents a genuinely new approach to optimization. Furthermore, the use of the splitting technique is relatively new to optimization problems, so this article strengthens the relationship between distinct mathematical fields.

  • 2)

    We have presented a geometrical-type perspective on optimization methods in the setting of minimization problems that involve convex objective functions (see Subsection 2.3). Likewise, we gave intuitive explanations for the choice of our algorithms in Subsection 2.4. At the same time, in Subsection 3.2 we have established Lemma 15, which generalizes Lemma 14 given in [4]. Also, our proof of Theorem 20 is based upon the proof from the same paper, namely [4], but in a much more general setting.

  • 3)

    We have introduced new algorithms that contain the discrete version of the Hessian system, namely LT-S-IGAHD. The particular case given in Example 25 resembles the newly introduced algorithm IGAHD from [4]. Also, in Example 24 and Example 26 we have given general algorithms that contain three arbitrary parameters (under some suitable conditions) and that bring numerical benefits in practical applications (see Subsections 4.2 and 4.3). On the other hand, we emphasize that our algorithms have been developed in an elegant manner, based upon the combination of splitting methods with symplectic algorithms, similarly to the simpler case of Nesterov's inertial algorithm (see [29]).

  • 4)

    We have developed some new optimization algorithms based upon two gradient evaluations at each iteration. For each of these numerical schemes we have presented the associated continuous versions, such that the schemes are constructed through various geometric principles, like splitting and symplectic methods. The associated continuous dynamical systems DynSys-ARDM and the system from Theorem 7 are worth studying, since their discretizations are obtained in a natural manner, as in the case of Nesterov's method (see [1] and [29]).

Acknowledgments

I am gratefully indebted to Szilard Csaba László for his suggestions concerning the organization of this article and for the helpful comments regarding the dynamical systems and the optimization algorithms of the present research paper.

References

  • [1] C.D. Alecsa, S.C. László, T. Pinţa, An extension of the second order dynamical system that models Nesterov’s convex gradient method, 2019, arXiv:1908.02574.
  • [2] C.D. Alecsa, I. Boros, T. Pinţa, New optimization algorithms for neural network training using operator splitting techniques, 2019, arXiv:1904.12952.
  • [3] F. Álvarez, H. Attouch, J. Bolte, P. Redont, A second-order gradient like dissipative dynamical system with Hessian-driven damping. Application to optimization and mechanics, J. Math. Pures Appl. 81 (2002), no. 8, pp. 747-779.
  • [4] H. Attouch, Z. Chbani, J. Fadili, H. Riahi, First-order optimization algorithms via inertial systems with Hessian driven damping, 2019, arXiv:1907.10536.
  • [5] H. Attouch, A. Cabot, Asymptotic stabilization of inertial gradient dynamics with time-dependent viscosity, J. Differential Equations 263 (2017), pp. 5412-5458, doi:10.1016/j.jde.2017.06.024.
  • [6] H. Attouch, A. Cabot, Convergence rates of inertial forward-backward algorithms, SIAM J. Optim (28) 2018, no.1, pp. 849-874, https://doi.org/10.1137/17M1114739.
  • [7] H. Attouch, X. Goudou, P. Redont, The heavy ball with friction method. I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative system, Commun. Contemp. Math. 2 (2000), pp. 1-34.
  • [8] H. Attouch, J. Peypouquet, P. Redont, Fast convex minimization via inertial dynamics with Hessian driven damping, J. Differential Equations 261 (2016), pp. 5734-5783.
  • [9] N. Bansal, A. Gupta, Potential-function proofs for gradient methods, Theory of Computing (2019), pp. 1-32.
  • [10] A. Bátkai, P. Csomós, B. Farkas, G. Nickel, Operator splitting for non-autonomous evolution equations, J. Functional Analysis 260 (2011), pp. 2163-2190.
  • [11] A. Beck, Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MATLAB, vol. 19, SIAM, 2014.
  • [12] S. Blanes, F. Casas, A. Murua, Splitting and composition methods in the numerical integration of differential equations, Bol. Soc. Esp. Mat. Apl. 45 (2008), pp. 89-145.
  • [13] R.I. Bot, R. Csetnek, S.C. László, A second order dynamical approach with variable damping to nonconvex smooth minimization, Applic. Analysis, 2018, doi:10.1080/00036811.2018.1495330.
  • [14] A. Cabot, H. Engler, S. Gadat, On the long time behavior of second-order differential equations with asymptotically small dissipation, Trans. Amer. Math. Soc. 361 (2009), pp. 5983-6017.
  • [15] A. Cabot, H. Engler, S. Gadat, Second order differential equations with asymptotically small dissipation and piecewise flat potentials, Electron J. Differential Equations 17 (2009), pp. 33-38.
  • [16] C. Castera, J. Bolte, C. Févotte, E. Pauwels, An inertial Newton algorithm for deep learning, 2019, arXiv:1905.12278.
  • [17] A. Chambolle, C. Dossal, On the convergence of the iterates of FISTA, HAL Id : hal-01060130, 2014, https://hal.inria.fr/hal-01060130v3.
  • [18] L. Chen, H. Luo, First order optimization methods based on Hessian-driven Nesterov accelerated gradient flow, 2019, arXiv:1912.09276.
  • [19] A. Defazio, On the curved geometry of accelerated optimization, Adv. Neural Inf. Processing Syst. 33 (NIPS 2019), 2019.
  • [20] K. Feng, On difference schemes and symplectic geometry, Proceedings of the 5-th Intern. Symposium on differential geometry &\& differential equations, August 1984, Beijing (1985), pp. 42-58.
  • [21] G. França, D.P. Robinson, R. Vidal, Gradient flows and accelerated proximal splitting methods, 2019, arXiv:1908.00865.
  • [22] G. França, J. Sulam, D.P. Robinson, R. Vidal, Conformal symplectic and relativistic optimization, 2019, arXiv:1903.04100.
  • [23] E. Hairer, C. Lubich, G. Wanner, Geometric Numerical Integration. Structure-Preserving Algorithms for ODE’s, second edition, Springer-Verlag Berlin Heidelberg, 2006, doi:10.1007/3-540-30666-8.
  • [24] E. Hairer, G. Wanner, Euler methods, explicit, implicit and symplectic. In : Encyclopedia of Applied and Computational Mathematics, B. Engquist ed., Springer, Berlin Heidelberg, 2015, pp. 451-455.
  • [25] E. Hansen, F. Kramer, A. Ostermann, A second-order positivity preserving scheme for semilinear parabolic problems, Appl. Num. Math. 62 (2012), pp. 1428-1435.
  • [26] E. Hansen, A. Ostermann, Exponential splitting for unbounded operators, Math. of Comput. 78 (2009), no. 267, pp. 1485-1496.
  • [27] S.C. László, Convergence rates for an inertial algorithm of gradient type associated to a smooth nonconvex minimization, 2018, arXiv:1807.00387.
  • [28] G. Marchuk, Some applications of splitting-up methods to the solution of mathematical physics problems, Aplikace Matematiky 13 (1968), pp. 103-132.
  • [29] M. Muehlebach, M.I. Jordan, A dynamical systems perspective on Nesterov acceleration, 2019, arXiv:1905.07436.
  • [30] Y. Nesterov, A method for solving a convex programming problem with convergence rate O(1/k2)O(1/k^{2}) (in Russian), Soviet Math. Dokl. 27 (1983), no. 2, pp. 543-547.
  • [31] Y. Nesterov, Introductory lectures on convex optimization : A basic course, vol. 87 of Applied Optimization, Kluwer Academic Publishers, Boston, MA, 2004.
  • [32] N.C. Nguyen, P. Fernandez, R.M. Freund, J. Peraire, Accelerated residual methods for iterative solution of systems of equations, SIAM J. Sci. Computing 40 (2018), pp. A3157-A3179.
  • [33] M. Laborde, A.M. Oberman, A Lyapunov analysis for accelerated gradient methods : from deterministic to stochastic case, 2019, arXiv:1908.07861.
  • [34] B.T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Comput. Math. Math. Phys. (4) 1964, pp. 1-17.
  • [35] J.M. Sanz-Serna, Symplectic methods. In : Encyclopedia of Applied and Computational Mathematics, B. Engquist ed., Springer, Berlin Heidelberg, 2015, pp. 1451-1458.
  • [36] B. Shi, S.S. Du, W.J. Su, M. I. Jordan, Acceleration via symplectic discretization of high-resolution differential equations, 2019, arXiv:1902.03694.
  • [37] B. Shi, S.S. Iyengar, A conservation law method based on optimization. In : Mathematical Theories of Machine Learning - Theory and Applications, Springer, Cham, 2019, pp. 63-85 https://doi.org/10.1007/978-3-030-17076-9_8.
  • [38] G. Strang, On the construction and comparison of difference schemes, SIAM J. Numer. Anal. 5 (1968), no. 3, pp. 506-517.
  • [39] W.J. Su, S. Boyd, E.J. Candès, A differential equation for modeling Nesterov’s accelerated gradient method : Theory and insights, J. Machine Learning Res. (17) 2016, pp. 1-43.
  • [40] T. Sun, P. Yin, D. Li, C. Huang, L. Guan, H. Jiang, Non-ergodic convergence analysis of heavy-ball algorithms, AAAI, 2019.
  • [41] H.F. Trotter, On the product of semi-groups of operators, Proc. Am. Math. Soc. 10 (1959), no. 4, pp. 545-551.
  • [42] P. Tseng, On accelerated proximal-gradient methods for convex-concave optimization, Manuscript, May 21 2018, mit.edu/~dimitrib/PTseng/papers/apgm.pdf.
  • [43] R. de Vogelaere, Methods of integration which preserve the contact transformation property of the Hamiltonian equations, Report no. 4, Dept. Math., Univ. of Notre Dame, Notre Dame, Ind., 1956.