An extension of the second order dynamical system that models Nesterov’s convex gradient method

Abstract

In this paper we deal with a general second order continuous dynamical system associated to a convex minimization problem with a Fréchet differentiable objective function.

We show that inertial algorithms, such as Nesterov’s algorithm, can be obtained via the natural explicit discretization from our dynamical system.

Our dynamical system can be viewed as a perturbed version of the heavy ball method with vanishing damping, however the perturbation is made in the argument of the gradient of the objective function.

This perturbation seems to have a smoothing effect for the energy error and eliminates the oscillations obtained for this error in the case of the heavy ball method with vanishing damping, as some numerical experiments show.

We prove that the value of the objective function in a generated trajectory converges in order \(\mathcal{O}(1/t^2)\) to the global minimum of the objective function.

Moreover, we obtain that a trajectory generated by the dynamical system converges to a minimum point of the objective function.

Authors

Cristian Daniel Alecsa
Department of Mathematics, Babes-Bolyai University, Cluj-Napoca, Romania
Tiberiu Popoviciu Institute of Numerical Analysis, Romanian Academy, Cluj-Napoca, Romania

Szilard Csaba Laszlo
Technical University of Cluj-Napoca, Department of Mathematics, Cluj-Napoca, Romania

Titus Pinta
Mathematical Institute, University of Oxford, Oxford, England

Keywords

Unconstrained optimization problems; convex optimization; heavy ball method; continuous second order dynamical system; convergence rate; inertial algorithm; Lyapunov energy functions.

Paper coordinates

C.-D. Alecsa, S.C. Laszlo, T. Pinta, An extension of the second order dynamical system that models Nesterov’s convex gradient method, Appl. Math. Optim., 84 (2021), pp. 1687–1716. doi: 10.1007/s00245-020-09692-1


About this paper

Publisher Name

Springer

Print ISSN

0095-4616

Online ISSN

1432-0606


Paper (preprint) in HTML form

An extension of the second order dynamical system that models Nesterov’s convex gradient method. (This work was supported by a grant of the Ministry of Research and Innovation, CNCS - UEFISCDI, project number PN-III-P1-1.1-TE-2016-0266.)

Cristian Daniel Alecsa (Tiberiu Popoviciu Institute of Numerical Analysis, Romanian Academy, Cluj-Napoca, Romania and Department of Mathematics, Babes-Bolyai University, Cluj-Napoca, Romania; e-mail: cristian.alecsa@math.ubbcluj.ro), Szilárd Csaba László (Technical University of Cluj-Napoca, Department of Mathematics, Str. Memorandumului nr. 28, 400114 Cluj-Napoca, Romania; e-mail: laszlosziszi@yahoo.com), Titus Pinţa (Department of Mathematics, Babes-Bolyai University, Cluj-Napoca, Romania; e-mail: titus.pinta@gmail.com)

Abstract. In this paper we deal with a general second order continuous dynamical system associated to a convex minimization problem with a Fréchet differentiable objective function. We show that inertial algorithms, such as Nesterov’s algorithm, can be obtained via the natural explicit discretization from our dynamical system. Our dynamical system can be viewed as a perturbed version of the heavy ball method with vanishing damping, however the perturbation is made in the argument of the gradient of the objective function. This perturbation seems to have a smoothing effect for the energy error and eliminates the oscillations obtained for this error in the case of the heavy ball method with vanishing damping, as some numerical experiments show. We prove that the value of the objective function in a generated trajectory converges in order $\mathcal{O}(1/t^{2})$ to the global minimum of the objective function. Moreover, we obtain that a trajectory generated by the dynamical system converges to a minimum point of the objective function.

Key Words. convex optimization, heavy ball method, continuous second order dynamical system, convergence rate, inertial algorithm

AMS subject classification. 34G20, 47J25, 90C25, 90C30, 65K10

1 Introduction

Since Su, Boyd and Candès in [33] showed that Nesterov’s accelerated convex gradient method has as exact limit the second order differential equation that governs the heavy ball system with vanishing damping, that is,

\ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\nabla g(x(t))=0,\quad x(t_{0})=u_{0},\ \dot{x}(t_{0})=v_{0},\quad t_{0}>0,\ u_{0},v_{0}\in\mathbb{R}^{m}, (1)

with $\alpha=3$, the latter system has been intensively studied in the literature in connection to the minimization problem $\inf_{x\in\mathbb{R}^{m}}g(x)$. Here $g:\mathbb{R}^{m}\longrightarrow\mathbb{R}$ is a convex Fréchet differentiable function with Lipschitz continuous gradient.

In [33] the authors proved that

g(x(t))-\min g=\mathcal{O}\left(\frac{1}{t^{2}}\right)

for every $\alpha\geq 3$; however, they did not show the convergence of a generated trajectory to a minimum of the objective function $g$.

In [10], Attouch, Chbani, Peypouquet and Redont considered the case $\alpha>3$ in (1), and showed that the generated trajectory $x(t)$ converges to a minimizer of $g$ as $t\longrightarrow+\infty$. Actually, in [10] the authors considered the perturbed version of the heavy ball system with vanishing damping, that is,

\ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\nabla g(x(t))=h(t),\quad x(t_{0})=u_{0},\ \dot{x}(t_{0})=v_{0},\quad t_{0}>0,\ u_{0},v_{0}\in\mathbb{R}^{m}, (2)

where $h:[t_{0},+\infty)\longrightarrow\mathbb{R}^{m}$ is a small perturbation term that satisfies $\int_{t_{0}}^{+\infty}t\|h(t)\|dt<+\infty$. Besides the convergence of a generated trajectory $x(t)$ to a minimizer of $g$, they showed that also in this case the convergence rate of the objective function along the trajectory, that is $g(x(t))-\min g$, is of order $\mathcal{O}\left(\frac{1}{t^{2}}\right)$.

Another perturbed version of (1) was studied by Attouch, Peypouquet and Redont in [12]. They assumed that the objective $g$ is twice continuously differentiable and the perturbation of their system is made at the damping term. More precisely, they studied the dynamical system with Hessian driven damping

\ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\beta\nabla^{2}g(x(t))\dot{x}(t)+\nabla g(x(t))=0,\quad x(t_{0})=u_{0},\ \dot{x}(t_{0})=v_{0},\quad t_{0}>0,\ u_{0},v_{0}\in\mathbb{R}^{m}, (3)

where $\alpha>0$ and $\beta>0$. In the case $\alpha>3,\ \beta>0$ they showed the convergence of a generated trajectory to a minimizer of $g$. Moreover, they obtained that in this case the convergence rate of the objective function along the trajectory, that is $g(x(t))-\min g$, is of order $o\left(\frac{1}{t^{2}}\right)$.

Further, Attouch, Chbani and Riahi in [8] studied the subcritical case $\alpha\leq 3$ and proved that in this case the convergence rate of the objective function $g$ along the trajectory generated by (1), i.e. $g(x(t))-\min g$, is of order $\mathcal{O}\left(\frac{1}{t^{\frac{2}{3}\alpha}}\right)$.

Another approach is due to Aujol, Dossal and Rondepierre [2], who assumed that, beside convexity, the objective $g$ in (1) satisfies some geometrical conditions, such as the Łojasiewicz property. The importance of their results obtained in [2] is underlined by the fact that applying the classical Nesterov scheme on a convex objective function without studying its geometrical properties may lead to sub-optimal algorithms.

It is worth mentioning the work of Aujol and Dossal [1], who did not assume the convexity of $g$, but the convexity of the function $(g(x(t))-g(x^{*}))^{\beta}$, where $\beta$ is strongly related to the damping parameter $\alpha$ and $x^{*}$ is a global minimizer of $g$. Under these assumptions, they obtained some general convergence rates and also the convergence of the generated trajectories of (1). In case $\beta=1$ their results reduce to the results obtained in [10, 12, 8].

However, the convergence of the trajectories generated by the continuous heavy ball system with vanishing damping (1), in the general case when $g$ is nonconvex, is still an open question. Some important steps in this direction have been made in [20] (see also [18]), where convergence of the trajectories of a perturbed system has been obtained in a nonconvex setting. More precisely, in [20] the following system is considered:

\ddot{x}(t)+\left(\gamma+\frac{\alpha}{t}\right)\dot{x}(t)+\nabla g(x(t))=0,\quad x(t_{0})=u_{0},\ \dot{x}(t_{0})=v_{0}, (4)

where $t_{0}>0,\ u_{0},v_{0}\in\mathbb{R}^{m},\ \gamma>0,\ \alpha\in\mathbb{R}$. Note that here $\alpha$ can take nonpositive values. For $\alpha=0$ we recover the dynamical system studied in [16]. According to [20], the trajectory generated by the dynamical system (4) converges to a critical point of $g$ if a regularization of $g$ satisfies the Kurdyka-Łojasiewicz property.

Further results concerning the heavy ball method and its extensions can be found in [6, 7, 11, 14, 22, 23].

1.1 An extension of the heavy ball method and the Nesterov type algorithms obtained via explicit discretization

What one can notice concerning the heavy ball system and its variants is that, despite the result of Su et al. [33], these systems never yield Nesterov's algorithm through the natural implicit/explicit discretization. This is due to the fact that the gradient of $g$ is evaluated at $x(t)$, which via discretization becomes $\nabla g(x_{n})$ (or $\nabla g(x_{n+1})$), and never of the form $\nabla g(y_{n}),\ y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})$, as Nesterov's gradient method requires. Another observation is that, using the same approach as in [33], one can show (see [26]) that (1) (and also (4)) models, besides Nesterov's algorithm, other algorithms too. In this paper we overcome the deficiency emphasized above by introducing a dynamical system that via explicit discretization leads to inertial algorithms of gradient type. To this end, let us consider the optimization problem

(P)\quad \inf_{x\in\mathbb{R}^{m}}g(x) (5)

where $g:\mathbb{R}^{m}\longrightarrow\mathbb{R}$ is a convex, Fréchet differentiable function with $L_{g}$-Lipschitz continuous gradient, i.e. there exists $L_{g}\geq 0$ such that $\|\nabla g(x)-\nabla g(y)\|\leq L_{g}\|x-y\|$ for all $x,y\in\mathbb{R}^{m}$.

We associate to (5) the following second order dynamical system:

\left\{\begin{array}{ll}\ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\nabla g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)=0\\ x(t_{0})=u_{0},\ \dot{x}(t_{0})=v_{0},\end{array}\right. (6)

where $u_{0},v_{0}\in\mathbb{R}^{m}$, $t_{0}\geq 0$ and $\alpha>0,\ \beta\in\mathbb{R},\ \gamma\geq 0$.

Remark 1

The connection of (6) with the heavy ball system with vanishing damping (1) is obvious: the latter can be obtained from (6) for $\gamma=\beta=0$. The study of the dynamical system (6) in connection to the optimization problem (5) is motivated by the following facts:

1. The dynamical system (6) leads via explicit discretization to inertial algorithms. In particular Nesterov’s algorithm can be obtained via this natural discretization.

2. A generated trajectory and the objective function value along this trajectory in general have a better convergence behaviour than a trajectory generated by the heavy ball system with vanishing damping (1), as some numerical examples show.

3. The same numerical experiments reveal that the perturbation term $\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)$ in the argument of the gradient of the objective function $g$ has a smoothing effect and annihilates the oscillations obtained in case of the dynamical system (1) for the errors $g(x(t))-\min g$ and $\|x(t)-x^{*}\|$, where $x^{*}$ is a minimizer of $g$.

4. A trajectory $x(t)$ generated by the dynamical system (6) ensures the convergence rate of order $\mathcal{O}\left(\frac{1}{t^{2}}\right)$ for the decay $g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)-\min g$, provided it holds that $\alpha>3,\ \gamma>0,\ \beta\in\mathbb{R}$ or $\alpha\geq 3,\ \gamma=0,\ \beta\geq 0$.

5. A trajectory $x(t)$ generated by the dynamical system (6) ensures the same convergence rate of order $\mathcal{O}\left(\frac{1}{t^{2}}\right)$ for the decay $g(x(t))-\min g$ as the heavy ball method with vanishing damping, for the cases $\alpha>3,\ \gamma>0,\ \beta\in\mathbb{R}$ and $\alpha>3,\ \gamma=0,\ \beta\geq 0$.

6. The convergence of a generated trajectory $x(t)$ to a minimizer of the objective function $g$ can be obtained in case $\alpha>3,\ \gamma>0$ and $\beta\in\mathbb{R}$, and also in the case $\alpha>3,\ \gamma=0$ and $\beta\geq 0$.

Remark 2

Nevertheless, in case $\gamma=0$ and $\beta<0$ the dynamical system (6) can generate periodic solutions, hence the convergence of a generated trajectory to a minimizer of the objective is hopeless. To illustrate this fact, for $\beta<0,\ \alpha>0,\ \gamma=0$ consider the strongly convex objective function $g:\mathbb{R}\longrightarrow\mathbb{R}$, $g(x)=\frac{\alpha}{-2\beta}x^{2}$. Then, taking into account that $\gamma=0$, the dynamical system (6) becomes

\left\{\begin{array}{ll}\ddot{x}(t)+\frac{\alpha}{-\beta}x(t)=0,\\ x(0)=0,\ \dot{x}(0)=\sqrt{\frac{\alpha}{-\beta}}.\end{array}\right. (7)

Now, the periodic function $x(t)=\sin\left(\sqrt{\frac{\alpha}{-\beta}}\,t\right)$ is a solution of (7); consequently, the limit $\lim_{t\longrightarrow+\infty}x(t)$ does not exist.
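This can be checked directly: with $\omega=\sqrt{\frac{\alpha}{-\beta}}$ (well defined since $\beta<0$),

\dot{x}(t)=\omega\cos(\omega t),\qquad \ddot{x}(t)=-\omega^{2}\sin(\omega t)=-\frac{\alpha}{-\beta}\,x(t),

so $\ddot{x}(t)+\frac{\alpha}{-\beta}x(t)=0$, while $x(0)=0$ and $\dot{x}(0)=\omega=\sqrt{\frac{\alpha}{-\beta}}$, as required by (7).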

We emphasize that, despite the novelty of the dynamical system (6), its formulation is natural, since explicit discretization leads to inertial gradient methods; in particular, the famous Polyak and Nesterov numerical schemes can be obtained from (6). For other inertial algorithms of gradient type we refer to [17, 19, 26].

Indeed, explicit discretization of (6), with constant stepsize $h$, $t_{n}=nh$, $x_{n}=x(t_{n})$, leads to

\frac{x_{n+1}-2x_{n}+x_{n-1}}{h^{2}}+\frac{\alpha}{nh^{2}}(x_{n}-x_{n-1})+\nabla g\left(x_{n}+\left(\frac{\gamma}{h}+\frac{\beta}{nh^{2}}\right)(x_{n}-x_{n-1})\right)=0.

Equivalently, the latter equation can be written as

x_{n+1}=x_{n}+\left(1-\frac{\alpha}{n}\right)(x_{n}-x_{n-1})-h^{2}\nabla g\left(x_{n}+\left(\frac{\gamma}{h}+\frac{\beta}{nh^{2}}\right)(x_{n}-x_{n-1})\right). (8)

Now, setting $h^{2}=s$ and denoting the constants $\frac{\gamma}{h}$ and $\frac{\beta}{h^{2}}$ still by $\gamma$ and $\beta$, we get the following general inertial algorithm:

Let $x_{0},x_{-1}\in\mathbb{R}^{m}$ and for all $n\in\mathbb{N}$ consider the sequences

\left\{\begin{array}{lll}y_{n}=x_{n}+\left(1-\frac{\alpha}{n}\right)(x_{n}-x_{n-1})\\ z_{n}=x_{n}+\left(\gamma+\frac{\beta}{n}\right)(x_{n}-x_{n-1})\\ x_{n+1}=y_{n}-s\nabla g(z_{n}).\end{array}\right. (9)

However, from a practical point of view it is more convenient to work with the following equivalent formulation: Let $x_{0},x_{-1}\in\mathbb{R}^{m}$ and for all $n\in\mathbb{N}$ consider the sequences

\left\{\begin{array}{lll}y_{n}=x_{n}+\frac{n}{n+\alpha}(x_{n}-x_{n-1})\\ z_{n}=x_{n}+\frac{\gamma n+\beta}{n+\alpha}(x_{n}-x_{n-1})\\ x_{n+1}=y_{n}-s\nabla g(z_{n}),\end{array}\right. (10)

where $\alpha>0,\ \beta\in\mathbb{R}$ and $\gamma\geq 0$.
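For concreteness, the following is a minimal Python sketch of iteration (10); the gradient oracle grad_g, the step size s and the iteration budget are illustrative assumptions, not prescribed by the scheme itself.

    import numpy as np

    def inertial_gradient(grad_g, x0, s, alpha, beta=0.0, gamma=1.0, n_iter=1000):
        # Minimal sketch of iteration (10). grad_g is a gradient oracle for g
        # (assumed L_g-Lipschitz); s is a step size, e.g. s = 1/L_g.
        x_prev = np.asarray(x0, dtype=float)   # plays the role of x_{-1}
        x = x_prev.copy()                      # x_0
        for n in range(1, n_iter + 1):
            d = x - x_prev                     # momentum direction x_n - x_{n-1}
            y = x + n / (n + alpha) * d                    # extrapolation for the update
            z = x + (gamma * n + beta) / (n + alpha) * d   # extrapolation for the gradient
            x_prev, x = x, y - s * grad_g(z)
        return x

With gamma = 1 and beta = 0 the points $y_{n}$ and $z_{n}$ coincide and the loop reduces to Nesterov's scheme, in accordance with Remark 3 below.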

Remark 3

Notice that for $\gamma=\beta=0$ we obtain a variant of Polyak's algorithm [31], and for $\gamma=1$ and $\beta=0$ we obtain Nesterov's algorithm [29]. An interesting fact is that Algorithm (10) allows different inertial steps, and this approach seems to be new in the literature.

Remark 4

Independently of us, very recently, a system similar to (6) was studied by Muehlebach and Jordan in [27], where it is shown that Nesterov's accelerated gradient method can be obtained from their system via a semi-implicit Euler discretization scheme. They considered a constant damping instead of $\frac{\alpha}{t}$ and also took $\beta=0$. Further, they also treated the ODE

\ddot{x}(t)+\frac{3}{t+2}\dot{x}(t)+s\nabla g\left(x(t)+\frac{t-1}{t+2}\dot{x}(t)\right)=0

which for $s=1$ is obviously equivalent to the particular case of the governing ODE from (6) obtained for $\alpha=3,\ \gamma=1,\ \beta=-3$. However, the freedom of controlling the parameters $\beta$ and $\gamma$ in (6) is essential, as the next numerical experiments show.
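To see this equivalence, evaluate the coefficients of (6) with these parameters at the shifted time $t+2$:

\frac{\alpha}{t+2}=\frac{3}{t+2},\qquad \gamma+\frac{\beta}{t+2}=1-\frac{3}{t+2}=\frac{t-1}{t+2},

which are exactly the coefficients of the ODE above.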

1.2 Some numerical experiments

In this section we consider two numerical experiments for the trajectories generated by the dynamical system (6) for a strongly convex and for a convex but not strongly convex objective function.

Everywhere in the following numerical experiments we consider the continuous time dynamical system (6), solved numerically with a Runge-Kutta 4-5 (ode45) adaptive method in MATLAB. We solved the dynamical system with ode45 on the interval $[1,100]$, and the plots in Fig. 1 and Fig. 2 show the energy error $|g(x(t))-g(x^{*})|$ on the left and the iterate error $\|x(t)-x^{*}\|$ on the right.

We show the evolution of the two errors with respect to different values of $\alpha$, $\beta$ and $\gamma$, including the case that yields the heavy ball with friction. One can observe that the best choice is not $\gamma=\beta=0$, which is the case of the heavy ball system with vanishing damping (1).

1. Consider the function $g:\mathbb{R}^{2}\longrightarrow\mathbb{R}$,

g(x,y)=2x^{2}+5y^{2}-4x+10y+7.

Then $g$ is a strongly convex function, $\nabla g(x,y)=(4x-4,10y+10)$; further, $x^{*}=(1,-1)$ is the unique minimizer of $g$ and $g^{*}=g(1,-1)=0$.
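These continuous trajectories can be reproduced, for instance, with SciPy's adaptive Runge-Kutta solver, the Python counterpart of MATLAB's ode45; the following minimal sketch, with one of the parameter choices from the table below, integrates (6) for this objective after rewriting it as a first order system (cf. (11) in Section 2).

    import numpy as np
    from scipy.integrate import solve_ivp

    def grad_g(u):
        # gradient of g(x, y) = 2x^2 + 5y^2 - 4x + 10y + 7
        return np.array([4.0 * u[0] - 4.0, 10.0 * u[1] + 10.0])

    def rhs(t, X, alpha, beta, gamma):
        # first order reformulation of (6): X = (x, v) with v = dx/dt
        x, v = X[:2], X[2:]
        return np.concatenate([v, -alpha / t * v - grad_g(x + (gamma + beta / t) * v)])

    u0 = v0 = np.array([-5.0, 30.0])
    # parameters alpha = 3.1, beta = 1, gamma = 1 (one row of the table below)
    sol = solve_ivp(rhs, (1.0, 100.0), np.concatenate([u0, v0]),
                    args=(3.1, 1.0, 1.0), rtol=1e-8, atol=1e-10, dense_output=True)

    x_star = np.array([1.0, -1.0])   # unique minimizer, g(x*) = 0
    energy_error = np.abs(2*sol.y[0]**2 + 5*sol.y[1]**2 - 4*sol.y[0] + 10*sol.y[1] + 7)
    iterate_error = np.linalg.norm(sol.y[:2] - x_star[:, None], axis=0)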

We compare the convergence behaviour of the generated trajectories of (6) by taking into account the following instances.

α     β     γ     Color
3     0     0     blue
3.1   1     0     red
3.1   0     0.5   yellow
3.1   1     1     purple
3.1   -1    1     green

The results are depicted in Figure 1 for the starting points $u_{0}=v_{0}=(-5,30)$ and $u_{0}=v_{0}=(5,-30)$, respectively.

Figure 1: Error analysis with different parameters in dynamical system (6) for a strongly convex objective function; (a) $u_{0}=v_{0}=(-5,30)$, (b) $u_{0}=v_{0}=(5,-30)$.

2. In the next experiment we consider the convex, but not strongly convex, function $g:\mathbb{R}^{2}\longrightarrow\mathbb{R}$,

g(x,y)=x^{4}+5y^{2}-4x-10y+8.

Then $\nabla g(x,y)=(4x^{3}-4,10y-10)$; further, $x^{*}=(1,1)$ is the unique minimizer of $g$ and $g^{*}=g(1,1)=0$.

We compare the convergence behaviour of the generated trajectories of (6) by taking into account the following instances.

α     β     γ     Color
3.1   0     0     blue
3.1   2     0     red
3.1   0     1     yellow
3.1   0.5   0.5   purple
3.1   -0.5  1     green

The results are depicted in Figure 2 for the starting points $u_{0}=v_{0}=(-1,5)$ and $u_{0}=v_{0}=(2,-2)$, respectively.

Figure 2: Error analysis with different parameters in dynamical system (6) for a convex, but not strongly convex, objective function; (a) $u_{0}=v_{0}=(-1,5)$, (b) $u_{0}=v_{0}=(2,-2)$.
Remark 5

Observe that in all cases the trajectories generated by the dynamical system (6) have a better behaviour than the trajectories generated by the heavy ball system with vanishing damping (1). As we have emphasized before, it seems that the perturbation $\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)$ in the argument of the gradient of the objective function has a smoothing effect. The choice of the parameters $\gamma$ and $\beta$ will be validated by the theoretical results from Section 3, where we show that in case $\alpha>3,\ \gamma>0,\ \beta\in\mathbb{R}$, and also in the case $\alpha>3,\ \gamma=0,\ \beta\geq 0$, the energy error $g(x(t))-\min g$ is of order $\mathcal{O}\left(\frac{1}{t^{2}}\right)$, just as in the case of the heavy ball method. Further, for the values $\alpha>3,\ \gamma>0,\ \beta\in\mathbb{R}$ and for $\alpha>3,\ \gamma=0,\ \beta\geq 0$ we are able to show that a generated trajectory $x(t)$ converges to a minimum of the objective function $g$.

1.3 The organization of the paper

The outline of the paper is the following. In the next section we show the existence and uniqueness of the trajectories generated by the system (6). Further, we show that the third order derivative exists for almost every $t\geq t_{0}$ and we give some estimates of the third order derivative in terms of the velocity and the acceleration. We do not assume the convexity of the objective function $g$ in these results. However, it seems that the assumption that $g$ has a Lipschitz continuous gradient is essential in obtaining existence and uniqueness of the generated trajectories of the dynamical system (6). In Section 3 we deal with the convergence analysis of the generated trajectories. We introduce a general energy functional, which will play the role of a Lyapunov function associated to the dynamical system (6). We show convergence of the generated trajectories and also a rate of order $\mathcal{O}(1/t^{2})$ for the decay $g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)-\min g$. Further, we show that the error $g(x(t))-\min g$ also has a rate of order $\mathcal{O}(1/t^{2})$. Finally, we show the convergence of the generated trajectories to a minimum of the objective function $g$. In Section 4 we conclude our paper and present some possible directions for future research.

2 Existence and uniqueness

2.1 On strong global solutions

The first step toward our existence and uniqueness result obtained in the present section concerns the definition of a strong global solution of the dynamical system (6).

Definition 1

We call the function $x:[t_{0},+\infty)\to\mathbb{R}^{m}$ a strong global solution of the dynamical system (6) if it satisfies the following properties:

(i) $x,\dot{x}:[t_{0},+\infty)\to\mathbb{R}^{m}$ are locally absolutely continuous;
(ii) $\ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\nabla g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)=0$ for almost every $t\geq t_{0}$;
(iii) $x(t_{0})=u_{0}$ and $\dot{x}(t_{0})=v_{0}$.

For brevity, we recall that a mapping $x:[t_{0},+\infty)\to\mathbb{R}^{m}$ is called locally absolutely continuous if it is absolutely continuous on every compact interval $[t_{0},T]$, where $T>t_{0}$. Further, we have the following equivalent characterizations for an absolutely continuous function $x:[t_{0},T]\longrightarrow\mathbb{R}^{m}$ (see, for instance, [13, 5]):

  • (a) there exists an integrable function $y:[t_{0},T]\to\mathbb{R}^{m}$ such that

    x(t)=x(t_{0})+\int_{t_{0}}^{t}y(s)\,ds,\ \forall t\in[t_{0},T];

  • (b) $x$ is a continuous function and its distributional derivative is Lebesgue integrable on the interval $[t_{0},T]$;

  • (c) for every $\varepsilon>0$ there exists $\eta>0$ such that, for every finite family of intervals $I_{k}=(a_{k},b_{k})$ from $[t_{0},T]$, the following implication is valid:

    \left[I_{k}\cap I_{j}=\emptyset\text{ and }\sum_{k}|b_{k}-a_{k}|<\eta\right]\Rightarrow\left[\sum_{k}\|x(b_{k})-x(a_{k})\|<\varepsilon\right].
Remark 6

Let $x:[t_{0},+\infty)\to\mathbb{R}^{m}$ be a locally absolutely continuous function. Then $x$ is differentiable almost everywhere and its derivative coincides with its distributional derivative almost everywhere. On the other hand, we have the equality $\dot{x}(t)=y(t)$ for almost every $t\in[t_{0},+\infty)$, where $y$ is the function from the integral representation in (a) above.

The first result of the present section concerns the existence and uniqueness of the trajectories generated by the dynamical system (6). We prove existence and uniqueness of a strong global solution of (6) by making use of the Cauchy-Lipschitz-Picard theorem for absolutely continuous trajectories (see for example [25, Proposition 6.2.1], [32, Theorem 54]). The key argument is that one can rewrite (6) as a particular first order dynamical system in a suitably chosen product space (see also [6, 9, 20, 21]).

Theorem 7

Let $(u_{0},v_{0})\in\mathbb{R}^{m}\times\mathbb{R}^{m}$. Then, the dynamical system (6) admits a unique strong global solution.

Proof.

By making use of the notation $X(t)=(x(t),\dot{x}(t))$, the system (6) can be rewritten as a first order dynamical system:

\left\{\begin{array}{ll}\dot{X}(t)=F(t,X(t))\\ X(t_{0})=(u_{0},v_{0}),\end{array}\right. (11)

where $F:[t_{0},+\infty)\times\mathbb{R}^{m}\times\mathbb{R}^{m}\longrightarrow\mathbb{R}^{m}\times\mathbb{R}^{m}$, $F(t,u,v)=\left(v,-\frac{\alpha}{t}v-\nabla g\left(u+\left(\gamma+\frac{\beta}{t}\right)v\right)\right)$.

The existence and uniqueness of a strong global solution of (6) follows according to the Cauchy-Lipschitz-Picard Theorem applied to the first order dynamical system (11). In order to prove the existence and uniqueness of the trajectories generated by (11) we show the following:

(I) For every $t\in[t_{0},+\infty)$ the mapping $F(t,\cdot,\cdot)$ is $L(t)$-Lipschitz continuous and $L(\cdot)\in L^{1}_{loc}([t_{0},+\infty))$.

(II) For all $u,v\in\mathbb{R}^{m}$ one has $F(\cdot,u,v)\in L^{1}_{loc}([t_{0},+\infty),\mathbb{R}^{m}\times\mathbb{R}^{m})$.

Let us prove (I). Let $t\in[t_{0},+\infty)$ be fixed and consider the pairs $(u,v)$ and $(\bar{u},\bar{v})$ from $\mathbb{R}^{m}\times\mathbb{R}^{m}$. Using the Lipschitz continuity of $\nabla g$ and the obvious inequality $\|A+B\|^{2}\leq 2\|A\|^{2}+2\|B\|^{2}$ for all $A,B\in\mathbb{R}^{m}$, we make the following estimates:

\begin{aligned}\|F(t,u,v)-F(t,\overline{u},\overline{v})\|&=\sqrt{\|v-\overline{v}\|^{2}+\left\|\frac{\alpha}{t}(\overline{v}-v)+\nabla g\left(\overline{u}+\left(\gamma+\frac{\beta}{t}\right)\overline{v}\right)-\nabla g\left(u+\left(\gamma+\frac{\beta}{t}\right)v\right)\right\|^{2}}\\
&\leq\sqrt{\left(1+2\left(\frac{\alpha}{t}\right)^{2}\right)\|v-\overline{v}\|^{2}+2L_{g}^{2}\left\|(u-\overline{u})+\left(\gamma+\frac{\beta}{t}\right)(v-\overline{v})\right\|^{2}}\\
&\leq\sqrt{\left(1+2\left(\frac{\alpha}{t}\right)^{2}+4L_{g}^{2}\left(\gamma+\frac{\beta}{t}\right)^{2}\right)\|v-\overline{v}\|^{2}+4L_{g}^{2}\|u-\overline{u}\|^{2}}\\
&\leq\sqrt{1+4L_{g}^{2}+2\left(\frac{\alpha}{t}\right)^{2}+4L_{g}^{2}\left(\gamma+\frac{\beta}{t}\right)^{2}}\,\|(u,v)-(\overline{u},\overline{v})\|.\end{aligned}

By employing the notation $L(t)=\sqrt{1+4L_{g}^{2}+2\left(\frac{\alpha}{t}\right)^{2}+4L_{g}^{2}\left(\gamma+\frac{\beta}{t}\right)^{2}}$, we have that

\|F(t,u,v)-F(t,\bar{u},\bar{v})\|\leq L(t)\cdot\|(u,v)-(\bar{u},\bar{v})\|.

Obviously, the function $t\longmapsto L(t)$ is continuous on $[t_{0},+\infty)$, hence $L(\cdot)$ is integrable on $[t_{0},T]$ for all $t_{0}<T<+\infty$.

For proving (II), consider a fixed pair $(u,v)\in\mathbb{R}^{m}\times\mathbb{R}^{m}$ and let $T>t_{0}$. We have the following estimates:

\begin{aligned}\int_{t_{0}}^{T}\|F(t,u,v)\|dt&=\int_{t_{0}}^{T}\sqrt{\|v\|^{2}+\left\|\frac{\alpha}{t}v+\nabla g\left(u+\left(\gamma+\frac{\beta}{t}\right)v\right)\right\|^{2}}\,dt\\
&\leq\int_{t_{0}}^{T}\sqrt{\left(1+2\left(\frac{\alpha}{t}\right)^{2}\right)\|v\|^{2}+4\left\|\nabla g\left(u+\left(\gamma+\frac{\beta}{t}\right)v\right)-\nabla g(u)\right\|^{2}+4\|\nabla g(u)\|^{2}}\,dt\\
&\leq\int_{t_{0}}^{T}\sqrt{\left(1+2\left(\frac{\alpha}{t}\right)^{2}+4L_{g}^{2}\left(\gamma+\frac{\beta}{t}\right)^{2}\right)\|v\|^{2}+4\|\nabla g(u)\|^{2}}\,dt\\
&\leq\sqrt{\|v\|^{2}+\|\nabla g(u)\|^{2}}\int_{t_{0}}^{T}\sqrt{5+2\left(\frac{\alpha}{t}\right)^{2}+4L_{g}^{2}\left(\gamma+\frac{\beta}{t}\right)^{2}}\,dt\end{aligned}

and the conclusion follows by the continuity of the function $t\longmapsto\sqrt{5+2\left(\frac{\alpha}{t}\right)^{2}+4L_{g}^{2}\left(\gamma+\frac{\beta}{t}\right)^{2}}$ on $[t_{0},T]$.

The Cauchy-Lipschitz-Picard theorem guarantees existence and uniqueness of the trajectory of the first order dynamical system (11) and thus of the second order dynamical system (6). \blacksquare

Remark 8

Note that we did not use the convexity assumption imposed on $g$ in the proof of Theorem 7. However, we emphasize that according to the proof of Theorem 7, the assumption that $g$ has a Lipschitz continuous gradient is essential in order to obtain existence and uniqueness of the trajectories generated by the dynamical system (6).

2.2 On the third order derivatives

In this section we show that the third order derivative of a strong global solution of the dynamical system (6) exists almost everywhere on $[t_{0},+\infty)$. Further, we give an upper bound estimate for the third order derivative in terms of the velocity and the acceleration of a strong global solution of (6). For simplicity, in the proofs of this subsection we employ the $1$-norm on $\mathbb{R}^{m}\times\mathbb{R}^{m}$, defined as $\|(x_{1},x_{2})\|_{1}=\|x_{1}\|+\|x_{2}\|$ for all $x_{1},x_{2}\in\mathbb{R}^{m}$. Obviously one has

\frac{1}{\sqrt{2}}\|(x_{1},x_{2})\|_{1}\leq\|(x_{1},x_{2})\|=\sqrt{\|x_{1}\|^{2}+\|x_{2}\|^{2}}\leq\|(x_{1},x_{2})\|_{1},\ \mbox{ for all }x_{1},x_{2}\in\mathbb{R}^{m}.
Proposition 9

For the starting points $(u_{0},v_{0})\in\mathbb{R}^{m}\times\mathbb{R}^{m}$ let $x$ be the unique strong global solution of the dynamical system (6). Then $\ddot{x}$ is locally absolutely continuous on $[t_{0},+\infty)$; consequently, the third order derivative $x^{(3)}$ exists almost everywhere on $[t_{0},+\infty)$.

Proof.

We show that $\dot{X}(t)=(\dot{x}(t),\ddot{x}(t))$ is locally absolutely continuous, hence $\ddot{x}$ is also locally absolutely continuous. This implies by Remark 6 that $x^{(3)}$ exists almost everywhere on $[t_{0},+\infty)$.

Let $T>t_{0}$ and $s,t\in[t_{0},T]$. We consider the following chain of inequalities:

\begin{aligned}\|\dot{X}(s)-\dot{X}(t)\|_{1}&=\|F(s,X(s))-F(t,X(t))\|_{1}\\
&=\left\|\left(\dot{x}(s)-\dot{x}(t),\ \frac{\alpha}{t}\dot{x}(t)-\frac{\alpha}{s}\dot{x}(s)+\nabla g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)-\nabla g\left(x(s)+\left(\gamma+\frac{\beta}{s}\right)\dot{x}(s)\right)\right)\right\|_{1}\\
&\leq\|\dot{x}(s)-\dot{x}(t)\|+\left\|\frac{\alpha}{s}\dot{x}(s)-\frac{\alpha}{t}\dot{x}(t)\right\|+\left\|\nabla g\left(x(s)+\left(\gamma+\frac{\beta}{s}\right)\dot{x}(s)\right)-\nabla g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\right\|\end{aligned}

and by using the $L_{g}$-Lipschitz continuity of $\nabla g$ we obtain

\begin{aligned}\|\dot{X}(s)-\dot{X}(t)\|_{1}&\leq\|\dot{x}(s)-\dot{x}(t)\|+\left\|\frac{\alpha}{s}\dot{x}(s)-\frac{\alpha}{t}\dot{x}(t)\right\|+L_{g}\|x(s)-x(t)\|+L_{g}\left\|\left(\gamma+\frac{\beta}{s}\right)\dot{x}(s)-\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right\|\\
&\leq\|\dot{x}(s)-\dot{x}(t)\|+\frac{\alpha}{s}\|\dot{x}(s)-\dot{x}(t)\|+\left|\frac{\alpha}{s}-\frac{\alpha}{t}\right|\|\dot{x}(t)\|+L_{g}\|x(s)-x(t)\|\\
&\quad+L_{g}\left(\gamma+\frac{\beta}{s}\right)\|\dot{x}(s)-\dot{x}(t)\|+L_{g}\left|\frac{\beta}{s}-\frac{\beta}{t}\right|\|\dot{x}(t)\|\\
&=\left(1+\frac{\alpha}{s}+L_{g}\left(\gamma+\frac{\beta}{s}\right)\right)\|\dot{x}(s)-\dot{x}(t)\|+L_{g}\|x(s)-x(t)\|+\left(\alpha+L_{g}|\beta|\right)\left|\frac{1}{s}-\frac{1}{t}\right|\|\dot{x}(t)\|.\end{aligned}

Further, let us introduce the following additional notations:

L_{1}:=\max_{s\in[t_{0},T]}\left(1+\frac{\alpha}{s}+L_{g}\left(\gamma+\frac{\beta}{s}\right)\right)=1+\frac{\alpha}{t_{0}}+L_{g}\left(\gamma+\frac{\beta}{t_{0}}\right)\ \mbox{ and }\ L_{2}:=\left(\alpha+L_{g}|\beta|\right)\max_{t\in[t_{0},T]}\|\dot{x}(t)\|.

Then, one has

\|\dot{X}(s)-\dot{X}(t)\|\leq\|\dot{X}(s)-\dot{X}(t)\|_{1}\leq L_{1}\|\dot{x}(s)-\dot{x}(t)\|+L_{g}\|x(s)-x(t)\|+L_{2}\left|\frac{1}{s}-\frac{1}{t}\right|.

By the fact that $x$ is the strong global solution of the dynamical system (6), it follows that $x$ and $\dot{x}$ are absolutely continuous on the interval $[t_{0},T]$. Moreover, the function $t\longmapsto\frac{1}{t}$ belongs to $C^{1}([t_{0},T],\mathbb{R})$, hence it is also absolutely continuous on the interval $[t_{0},T]$. Let $\varepsilon>0$. Then there exists $\eta>0$ such that for every finite family of intervals $I_{k}=(a_{k},b_{k})\subseteq[t_{0},T]$ satisfying $I_{k}\cap I_{j}=\emptyset$ and $\sum_{k}|b_{k}-a_{k}|<\eta$, we have that

\sum_{k}\|\dot{x}(b_{k})-\dot{x}(a_{k})\|<\frac{\varepsilon}{3L_{1}},\quad\sum_{k}\|x(b_{k})-x(a_{k})\|<\frac{\varepsilon}{3L_{g}}\ \mbox{ and }\ \sum_{k}\left|\frac{1}{b_{k}}-\frac{1}{a_{k}}\right|<\frac{\varepsilon}{3L_{2}}.

Summing all up, we obtain

\sum_{k}\|\dot{X}(b_{k})-\dot{X}(a_{k})\|\leq L_{1}\sum_{k}\|\dot{x}(b_{k})-\dot{x}(a_{k})\|+L_{g}\sum_{k}\|x(b_{k})-x(a_{k})\|+L_{2}\sum_{k}\left|\frac{1}{b_{k}}-\frac{1}{a_{k}}\right|<\varepsilon,

consequently $\dot{X}$ is absolutely continuous on $[t_{0},T]$, and the conclusion follows. $\blacksquare$

Concerning an upper bound estimate of the third order derivative $x^{(3)}$, the following result holds.

Lemma 10

For the initial values $(u_{0},v_{0})\in\mathbb{R}^{m}\times\mathbb{R}^{m}$ consider $x$ the unique strong global solution of the second order dynamical system (6). Then, there exists $K>0$ such that for almost every $t\in[t_{0},+\infty)$ we have:

\|x^{(3)}(t)\|\leq K(\|\dot{x}(t)\|+\|\ddot{x}(t)\|). (12)
Proof.

For $h>0$ we consider the following inequalities:

\begin{aligned}\|\dot{X}(t+h)-\dot{X}(t)\|_{1}&=\|F(t+h,X(t+h))-F(t,X(t))\|_{1}\\
&\leq\|\dot{x}(t+h)-\dot{x}(t)\|+\alpha\left\|\frac{1}{t+h}\dot{x}(t+h)-\frac{1}{t}\dot{x}(t)\right\|\\
&\quad+\left\|\nabla g\left(x(t+h)+\left(\gamma+\frac{\beta}{t+h}\right)\dot{x}(t+h)\right)-\nabla g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\right\|\\
&\leq\|\dot{x}(t+h)-\dot{x}(t)\|+\alpha\left\|\frac{1}{t+h}\dot{x}(t+h)-\frac{1}{t}\dot{x}(t)\right\|\\
&\quad+L_{g}\|x(t+h)-x(t)\|+L_{g}\left\|\left(\gamma+\frac{\beta}{t+h}\right)\dot{x}(t+h)-\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right\|.\end{aligned}

Now, dividing by $h>0$ and letting $h\longrightarrow 0$, it follows that

\|\ddot{X}(t)\|_{1}\leq\|\ddot{x}(t)\|+\alpha\left\|\left(\frac{1}{t}\dot{x}(t)\right)'\right\|+L_{g}\|\dot{x}(t)\|+L_{g}\left\|\left(\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)'\right\|.

Consequently,

\|x^{(3)}(t)\|\leq\alpha\left\|-\frac{1}{t^{2}}\dot{x}(t)+\frac{1}{t}\ddot{x}(t)\right\|+L_{g}\|\dot{x}(t)\|+L_{g}\left\|-\frac{\beta}{t^{2}}\dot{x}(t)+\left(\gamma+\frac{\beta}{t}\right)\ddot{x}(t)\right\|.

Finally,

\|x^{(3)}(t)\|\leq\left(L_{g}+\frac{\alpha+L_{g}|\beta|}{t^{2}}\right)\|\dot{x}(t)\|+\left(\frac{\alpha}{t}+L_{g}\left|\gamma+\frac{\beta}{t}\right|\right)\|\ddot{x}(t)\|\leq K(\|\dot{x}(t)\|+\|\ddot{x}(t)\|),

where $K=\max\left\{\max_{t\geq t_{0}}\left(L_{g}+\frac{\alpha+L_{g}|\beta|}{t^{2}}\right),\ \max_{t\geq t_{0}}\left(\frac{\alpha}{t}+L_{g}\left|\gamma+\frac{\beta}{t}\right|\right)\right\}$. $\blacksquare$

3 Convergence analysis

3.1 On a general energy functional associated to the dynamical system (6)

In order to obtain convergence rates for the function values along the trajectories generated by the dynamical system (6), we need to introduce an appropriate energy functional which will play the role of a Lyapunov function. The form of such an energy functional associated to the heavy ball system with vanishing damping and its extensions is well known in the literature, see for instance [10], [12], [1] and [2]. However, the novelty of the dynamical system (6), compared with the extended/perturbed variants of the heavy ball system studied in the above mentioned papers, consists in the fact that in system (6) the perturbation is carried out in the argument of the gradient of the objective function. This seems to be a new approach in the literature, therefore the previously mentioned energy functionals are not suitable for a valuable convergence analysis of the dynamical system (6). Hence, let us denote $\alpha(t)=\frac{\alpha}{t}$ and $\beta(t)=\gamma+\frac{\beta}{t}$, and assume that $\operatorname*{argmin}g\neq\emptyset$. Further, let $g^{*}=\min g=g(x^{*}),\ x^{*}\in\operatorname*{argmin}g$. In connection to the dynamical system (6), we introduce the general energy functional

\mathcal{E}(t)=a(t)(g(x(t)+\beta(t)\dot{x}(t))-g^{*})+\frac{1}{2}\|b(t)(x(t)-x^{*})+c(t)\dot{x}(t)\|^{2}+\frac{d(t)}{2}\|x(t)-x^{*}\|^{2}, (13)

which can be seen as an extension of the energy function studied in [10] in connection to the heavy ball system with vanishing damping.
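For orientation, in the unperturbed case $\gamma=\beta=0$ one has $\beta(t)\equiv 0$, and (13) reduces to

\mathcal{E}(t)=a(t)(g(x(t))-g^{*})+\frac{1}{2}\|b(t)(x(t)-x^{*})+c(t)\dot{x}(t)\|^{2}+\frac{d(t)}{2}\|x(t)-x^{*}\|^{2},

i.e. to an energy of the type used for the heavy ball system with vanishing damping.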

Our purpose is to define the non-negative functions $a(t),b(t),c(t),d(t)$ such that $\dot{\mathcal{E}}(t)\leq 0$, that is, the function $\mathcal{E}(t)$ is non-increasing for $t\geq t_{1}$, for some $t_{1}\geq t_{0}$. Indeed, if $\mathcal{E}(t)$ is non-increasing for $t\geq t_{1}$, then

a(t)(g(x(t)+\beta(t)\dot{x}(t))-g^{*})\leq\mathcal{E}(t)\leq\mathcal{E}(t_{1}),

in other words

g(x(t)+\beta(t)\dot{x}(t))-g^{*}\leq\frac{\mathcal{E}(t_{1})}{a(t)},\ \mbox{ for all }t\geq t_{1}.

In what follows we derive the conditions which must be imposed on the positive functions $a(t),b(t),c(t),d(t)$ in order to obtain $\dot{\mathcal{E}}(t)\leq 0$ for every $t\geq t_{1}$. We have

\begin{aligned}\dot{\mathcal{E}}(t)&=a'(t)(g(x(t)+\beta(t)\dot{x}(t))-g^{*})+a(t)\langle\nabla g(x(t)+\beta(t)\dot{x}(t)),\beta(t)\ddot{x}(t)+(\beta'(t)+1)\dot{x}(t)\rangle\\
&\quad+\langle b'(t)(x(t)-x^{*})+(b(t)+c'(t))\dot{x}(t)+c(t)\ddot{x}(t),\ b(t)(x(t)-x^{*})+c(t)\dot{x}(t)\rangle\\
&\quad+\frac{d'(t)}{2}\|x(t)-x^{*}\|^{2}+d(t)\langle\dot{x}(t),x(t)-x^{*}\rangle.\end{aligned}

Now, from (6) we get

\ddot{x}(t)=-\alpha(t)\dot{x}(t)-\nabla g(x(t)+\beta(t)\dot{x}(t)),

hence

\begin{aligned}\dot{\mathcal{E}}(t)&=a'(t)(g(x(t)+\beta(t)\dot{x}(t))-g^{*})\\
&\quad+a(t)\left\langle\nabla g(x(t)+\beta(t)\dot{x}(t)),\ -\beta(t)\nabla g(x(t)+\beta(t)\dot{x}(t))+\left(-\beta(t)\alpha(t)+\beta'(t)+1\right)\dot{x}(t)\right\rangle\\
&\quad+\left\langle b'(t)(x(t)-x^{*})+\left(b(t)+c'(t)-c(t)\alpha(t)\right)\dot{x}(t)-c(t)\nabla g(x(t)+\beta(t)\dot{x}(t)),\ b(t)(x(t)-x^{*})+c(t)\dot{x}(t)\right\rangle\\
&\quad+\frac{d'(t)}{2}\|x(t)-x^{*}\|^{2}+d(t)\langle\dot{x}(t),x(t)-x^{*}\rangle.\end{aligned}

Consequently,

\begin{aligned}\dot{\mathcal{E}}(t)&=a'(t)(g(x(t)+\beta(t)\dot{x}(t))-g^{*})-a(t)\beta(t)\|\nabla g(x(t)+\beta(t)\dot{x}(t))\|^{2}\\
&\quad+\left(-a(t)\alpha(t)\beta(t)+a(t)\beta'(t)+a(t)-c^{2}(t)\right)\langle\nabla g(x(t)+\beta(t)\dot{x}(t)),\dot{x}(t)\rangle\\
&\quad+\left(b'(t)b(t)+\frac{d'(t)}{2}\right)\|x(t)-x^{*}\|^{2}+\left(b^{2}(t)+b(t)c'(t)+b'(t)c(t)-b(t)c(t)\alpha(t)+d(t)\right)\langle\dot{x}(t),x(t)-x^{*}\rangle\\
&\quad+c(t)\left(b(t)+c'(t)-c(t)\alpha(t)\right)\|\dot{x}(t)\|^{2}-b(t)c(t)\langle\nabla g(x(t)+\beta(t)\dot{x}(t)),x(t)-x^{*}\rangle.\end{aligned} (14)

But

\langle\nabla g(x(t)+\beta(t)\dot{x}(t)),x(t)-x^{*}\rangle=\langle\nabla g(x(t)+\beta(t)\dot{x}(t)),x(t)+\beta(t)\dot{x}(t)-x^{*}\rangle-\langle\nabla g(x(t)+\beta(t)\dot{x}(t)),\beta(t)\dot{x}(t)\rangle

and by the convexity of $g$ we have

\langle\nabla g(x(t)+\beta(t)\dot{x}(t)),x(t)+\beta(t)\dot{x}(t)-x^{*}\rangle\geq g(x(t)+\beta(t)\dot{x}(t))-g^{*},

hence

\begin{aligned}-b(t)c(t)\langle\nabla g(x(t)+\beta(t)\dot{x}(t)),x(t)-x^{*}\rangle\leq&-b(t)c(t)(g(x(t)+\beta(t)\dot{x}(t))-g^{*})\\
&+b(t)c(t)\beta(t)\langle\nabla g(x(t)+\beta(t)\dot{x}(t)),\dot{x}(t)\rangle.\end{aligned}

Therefore, (14) becomes

\begin{aligned}\dot{\mathcal{E}}(t)\leq&\ (a'(t)-b(t)c(t))(g(x(t)+\beta(t)\dot{x}(t))-g^{*})-a(t)\beta(t)\|\nabla g(x(t)+\beta(t)\dot{x}(t))\|^{2}\\
&+\left(-a(t)\alpha(t)\beta(t)+a(t)\beta'(t)+a(t)-c^{2}(t)+b(t)c(t)\beta(t)\right)\langle\nabla g(x(t)+\beta(t)\dot{x}(t)),\dot{x}(t)\rangle\\
&+\left(b'(t)b(t)+\frac{d'(t)}{2}\right)\|x(t)-x^{*}\|^{2}+\left(b^{2}(t)+b(t)c'(t)+b'(t)c(t)-b(t)c(t)\alpha(t)+d(t)\right)\langle\dot{x}(t),x(t)-x^{*}\rangle\\
&+c(t)\left(b(t)+c'(t)-c(t)\alpha(t)\right)\|\dot{x}(t)\|^{2}.\end{aligned} (15)

In order to have $\dot{\mathcal{E}}(t)\leq 0$ for all $t\geq t_{1}$, $t_{1}\geq t_{0}$, one must assume that for all $t\geq t_{1}$ the following inequalities hold:

a'(t)-b(t)c(t)\leq 0, (16)
-a(t)\beta(t)\leq 0, (17)
-a(t)\alpha(t)\beta(t)+a(t)\beta'(t)+a(t)-c^{2}(t)+b(t)c(t)\beta(t)=0, (18)
b'(t)b(t)+\frac{d'(t)}{2}\leq 0, (19)
b'(t)c(t)+b(t)\left(b(t)+c'(t)-c(t)\alpha(t)\right)+d(t)=0, (20)
c(t)\left(b(t)+c'(t)-c(t)\alpha(t)\right)\leq 0. (21)
Remark 11

Observe that (17) implies that $\beta(t)\geq 0$ for all $t\geq t_{1}$, $t_{1}\geq t_{0}$, and this shows that in the dynamical system (6) one must have $\beta\geq 0$ whenever $\gamma=0$. Further, (19) is satisfied whenever $b(t)$ and $d(t)$ are constant functions. It is obvious that there exists $t_{1}$ such that for all $t\geq t_{1}$ we have $-\alpha(t)\beta(t)+\beta'(t)+1=1-\frac{\alpha\gamma}{t}-\frac{\alpha\beta+\beta}{t^{2}}>0$, hence from (18) we get

a(t)=\frac{c^{2}(t)-b(t)c(t)\beta(t)}{-\alpha(t)\beta(t)+\beta'(t)+1}.

Since (20) and (21) do not depend on $\beta(t)$, it seems natural to choose $c(t)$, $b(t)$ and $d(t)$ the same as in the case of the heavy ball system with vanishing damping (see [10]), that is, $c(t)=t$, $b(t)=b\in(0,\alpha-1]$ and $d(t)=b(\alpha-1-b)$ for all $t\geq t_{0}$, provided $\alpha>1$. Now, an easy computation shows that in this case

a(t)=t^{2}+\gamma(\alpha-b)t+\beta+(\beta+\alpha\gamma^{2})(\alpha-b)+\frac{(\alpha\beta\gamma+(\gamma\beta+2\alpha\beta\gamma+\alpha^{2}\gamma^{3})(\alpha-b))t+(\alpha\beta+\beta)(\beta+(\beta+\alpha\gamma^{2})(\alpha-b))}{t^{2}-\alpha\gamma t-(\alpha\beta+\beta)},

hence (16) is satisfied whenever $b>2$, which implies that $\alpha>3$.

However, if $\gamma=0$ then

a(t)=t^{2}+\beta+\beta(\alpha-b)+\frac{(\alpha\beta+\beta)(\beta+\beta(\alpha-b))}{t^{2}-(\alpha\beta+\beta)},

hence (16) holds also for $b=2$ and $\alpha=3$.
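With these choices the remaining conditions are immediate to verify: since $b'(t)=0$, $c'(t)=1$ and $c(t)\alpha(t)=\alpha$, condition (20) reads $b(b+1-\alpha)+d=0$, i.e. $d=b(\alpha-1-b)$, while (21) reads $t(b+1-\alpha)\leq 0$, which holds since $b\leq\alpha-1$.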

3.2 Error estimates for the values

In this section we obtain a convergence rate of order $\mathcal{O}(1/t^{2})$, as $t\longrightarrow+\infty$, for the difference $g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)-g^{*}$, where $g^{*}=g(x^{*})=\min g,\ x^{*}\in\operatorname*{argmin}g\neq\emptyset$. From here we are able to show that $g(x(t))-g^{*}$ also has a convergence rate of order $\mathcal{O}(1/t^{2})$, as $t\longrightarrow+\infty$. However, just as in the case of the heavy ball system with vanishing damping, in order to obtain these rates it is necessary to assume $\alpha\geq 3$ in our system (6). We have the following result.

Theorem 12

Let $x$ be the unique strong global solution of (6) and assume that $\operatorname*{argmin}g\neq\emptyset$. Assume further that $\alpha>3$, $\gamma>0$ and $\beta\in\mathbb{R}$.

Then, there exist $K\geq 0$ and $t_{1}\geq t_{0}$ such that

g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)-\min g\leq\frac{K}{t^{2}},\ \mbox{ for all }t\geq t_{1}. (22)

Further,

\int_{t_{0}}^{+\infty}t\left(g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)-\min g\right)dt<+\infty, (23)

and

\int_{t_{0}}^{+\infty}t^{2}\left\|\nabla g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\right\|^{2}dt<+\infty. (24)
Proof.

Let $\min g=g^{*}=g(x^{*}),\ x^{*}\in\operatorname*{argmin}g$. Consider the energy functional (13) with

b(t)=\alpha-1>2,\quad c(t)=t,\quad d(t)=0.

According to Remark 11 we have

a(t)=t^{2}+\gamma t+2\beta+\alpha\gamma^{2}+\frac{(3\alpha\beta\gamma+\alpha^{2}\gamma^{3}+\beta\gamma)t+\beta(\alpha+1)(2\beta+\alpha\gamma^{2})}{t^{2}-\alpha\gamma t-\beta(\alpha+1)}

and the conditions (18)-(21) are satisfied with equality for every tt0t\geq t_{0}.

Obviously, if $t$ is big enough one has $a(t)>0$, and since $\gamma>0$ it holds that $\gamma+\frac{\beta}{t}>0$ for $t$ big enough, even if $\beta<0$. Hence, there exists $t'\geq t_{0}$ such that (17) is satisfied for every $t\geq t'$.

Now, $a'(t)-b(t)c(t)=(3-\alpha)t+\gamma+\mathcal{O}\left(\frac{1}{t^{2}}\right)$, and by taking into account that $\alpha>3$ we obtain that there exists $t''\geq t_{0}$ such that (16) holds for every $t\geq t''$.

Let $t'''=\max(t',t'')$. We conclude that $\dot{\mathcal{E}}(t)\leq 0$ for all $t\geq t'''$, hence $\mathcal{E}(t)$ is nonincreasing, i.e.

a(t)\left(g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)-\min g\right)\leq\mathcal{E}(t)\leq\mathcal{E}(t'''),\ \mbox{ for all }t\geq t'''.

But obviously there exists $t_{1}\geq t'''$ such that $a(t)\geq t^{2}$ for all $t\geq t_{1}$, and by denoting $K=\mathcal{E}(t''')$ we obtain

g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)-\min g\leq\frac{K}{t^{2}},\ \mbox{ for all }t\geq t_{1}.

Next we prove (23). Since $a'(t)-b(t)c(t)=(3-\alpha)t+\gamma+\mathcal{O}\left(\frac{1}{t^{2}}\right)$, there exists $t_{2}\geq t_{1}$ such that

a'(t)-b(t)c(t)\leq\frac{(3-\alpha)t}{2},\ \mbox{ for all }t\geq t_{2}.

Hence,

\dot{\mathcal{E}}(t)\leq\frac{(3-\alpha)t}{2}\left(g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)-\min g\right),\ \mbox{ for all }t\geq t_{2}

and by integrating from $t_{2}$ to $T>t_{2}$ we get

\int_{t_{2}}^{T}t\left(g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)-\min g\right)dt\leq\frac{2}{3-\alpha}(\mathcal{E}(T)-\mathcal{E}(t_{2})).

Now, by letting $T\longrightarrow+\infty$ and taking into account that $\mathcal{E}$ is nonincreasing, the conclusion follows.

For proving (24), observe that there exists $t_{3}\geq t_{1}$ such that $-a(t)\left(\gamma+\frac{\beta}{t}\right)\leq-\frac{\gamma}{2}t^{2}$ for all $t\geq t_{3}$. Consequently,

\dot{\mathcal{E}}(t)\leq-\frac{\gamma}{2}t^{2}\left\|\nabla g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\right\|^{2},\ \mbox{ for all }t\geq t_{3}.

Integrating from $t_{3}$ to $T>t_{3}$ and letting $T\longrightarrow+\infty$, we obtain the desired conclusion. $\blacksquare$
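As a quick numerical illustration of (22) (a sanity check, not part of the proof), one can reuse the solve_ivp sketch from Section 1.2, where $g^{*}=0$, and monitor the scaled decay:

    # Sanity check of (22) on the strongly convex example from Section 1.2,
    # reusing sol (solved with dense_output=True) from the earlier sketch.
    t = np.linspace(2.0, 100.0, 500)
    X = sol.sol(t)                           # dense output; rows: x (first two), v = x'
    alpha, beta, gamma = 3.1, 1.0, 1.0
    w = X[:2] + (gamma + beta / t) * X[2:]   # x(t) + (gamma + beta/t) * x'(t)
    g_w = 2*w[0]**2 + 5*w[1]**2 - 4*w[0] + 10*w[1] + 7
    print(np.max(t**2 * g_w))                # stays bounded, consistent with O(1/t^2)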

Remark 13

Note that according to Remark 11 the condition (17) holds even if γ=0\gamma=0 and β0.\beta\geq 0. Consequently, even in the case α>3,γ=0,β0\alpha>3,\,\gamma=0,\,\beta\geq 0 the conclusions (22) and (23) in Theorem 12 hold. However, if β=0\beta=0 then we cannot obtain (24). Nevertheless, if β>0\beta>0 then (24) becomes:

\int_{t_{0}}^{+\infty}t\left\|{\nabla}g\left(x(t)+\frac{\beta}{t}\dot{x}(t)\right)\right\|^{2}dt<+\infty.
Remark 14

Concerning the case \(\alpha=3\), note that (17) is satisfied if one assumes that \(\gamma=0\) but \(\beta\geq 0\). Moreover, in this case (16) is also satisfied, since one has \(a^{\prime}(t)-b(t)c(t)=-\frac{4\beta^{2}(\alpha+1)}{(t^{2}-\beta(\alpha+1))^{2}}\leq 0\). Hence, also in the case \(\alpha=3\), \(\gamma=0\), \(\beta\geq 0\), conclusion (22) of Theorem 12 holds.

Moreover, if we assume \(\beta>0\) then (24) becomes:

\int_{t_{0}}^{+\infty}t\left\|{\nabla}g\left(x(t)+\frac{\beta}{t}\dot{x}(t)\right)\right\|^{2}dt<+\infty.

Next we show that the error \(g(x(t))-\min g\) is also of order \(\mathcal{O}(1/t^{2})\). For obtaining this result we need the Descent Lemma, see [28].

Lemma 15

Let \(g:\mathbb{R}^{m}\longrightarrow\mathbb{R}\) be a Fréchet differentiable function with \(L_{g}\)-Lipschitz continuous gradient. Then one has

g(y)\leq g(x)+\langle{\nabla}g(x),y-x\rangle+\frac{L_{g}}{2}\|y-x\|^{2},\,\forall x,y\in\mathbb{R}^{m}.
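For intuition (an illustration of ours, not part of the original statement), the bound is attained for quadratics: if \(g(x)=\frac{L}{2}\|x\|^{2}\), then \({\nabla}g(x)=Lx\) is Lipschitz continuous with \(L_{g}=L\), and expanding \(\|y\|^{2}=\|x\|^{2}+2\langle x,y-x\rangle+\|y-x\|^{2}\) shows that the lemma holds with equality:

g(y)=g(x)+\langle{\nabla}g(x),y-x\rangle+\frac{L_{g}}{2}\|y-x\|^{2},\,\forall x,y\in\mathbb{R}^{m}.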
Theorem 16

Let \(x\) be the unique strong global solution of (6) and assume that \(\operatorname*{argmin}g\neq\emptyset\). If \(\alpha>3\), \(\gamma>0\) and \(\beta\in\mathbb{R}\), then \(x\) is bounded and there exist \(K\geq 0\) and \(t_{1}\geq t_{0}\) such that

\|\dot{x}(t)\|\leq\frac{K}{t},\mbox{ for all }t\geq t_{1}. (25)

Further,

\int_{t_{0}}^{+\infty}t\|\dot{x}(t)\|^{2}dt<+\infty (26)

and there exist \(K_{1}\geq 0\) and \(t_{2}\geq t_{0}\) such that

g\left(x(t)\right)-\min g\leq\frac{K_{1}}{t^{2}},\mbox{ for all }t\geq t_{2}. (27)
Proof.

For \(x^{*}\in\operatorname*{argmin}g\) let \(g^{*}=g(x^{*})=\min g\) and consider the energy function (13) with \(b(t)=b\), where \(2<b<\alpha-1\), \(c(t)=t\), \(d(t)=b(\alpha-1-b)>0\) and

a(t)=\frac{c^{2}(t)-b(t)c(t)\left(\gamma+\frac{\beta}{t}\right)}{1-\frac{\beta}{t^{2}}-\frac{\alpha}{t}\left(\gamma+\frac{\beta}{t}\right)}=\frac{(t^{2}-b\gamma t-b\beta)t^{2}}{t^{2}-\alpha\gamma t-\beta(\alpha+1)}=\left(1+\frac{(\alpha-b)\gamma t+\beta(\alpha+1-b)}{t^{2}-\alpha\gamma t-\beta(\alpha+1)}\right)t^{2}.

According to Remark 11 there exists \(t_{1}\geq t_{0}\) such that the conditions (16)-(21) are satisfied.

From the definition of \(\mathcal{E}\) one has

\|b(t)(x(t)-x^{*})+c(t)\dot{x}(t)\|\leq\sqrt{2\mathcal{E}(t)},

and

\|x(t)-x^{*}\|\leq\sqrt{\frac{2}{d(t)}\mathcal{E}(t)}

that is

\|b(x(t)-x^{*})+t\dot{x}(t)\|\leq\sqrt{2\mathcal{E}(t)},

and

\|x(t)-x^{*}\|\leq\sqrt{\frac{2}{b(\alpha-1-b)}\mathcal{E}(t)}.

By using the fact that \(\mathcal{E}\) is nonincreasing on an interval \([t_{1},+\infty)\), the latter inequality assures that \(x\) is bounded.

Now, by using the inequality \(\|X-Y\|\geq\|X\|-\|Y\|\) we get

\|b(x(t)-x^{*})+t\dot{x}(t)\|\geq t\|\dot{x}(t)\|-\|b(x(t)-x^{*})\|,

hence for all \(t\geq t_{1}\) one has

t\|\dot{x}(t)\|\leq\sqrt{2\mathcal{E}(t)}+\|b(x(t)-x^{*})\|\leq\left(1+\sqrt{\frac{b}{\alpha-1-b}}\right)\sqrt{2\mathcal{E}(t)}\leq K,

where \(K=\left(1+\sqrt{\frac{b}{\alpha-1-b}}\right)\sqrt{2\mathcal{E}(t_{1})}\).

Further, (21) becomes \((b+1-\alpha)t<0\), hence for all \(t\geq t_{1}\) one has

\dot{\mathcal{E}}(t)\leq(b+1-\alpha)t\|\dot{x}(t)\|^{2}.

By integrating from \(t_{1}\) to \(T>t_{1}\) one gets

\int_{t_{1}}^{T}t\|\dot{x}(t)\|^{2}dt\leq\frac{1}{\alpha-1-b}(\mathcal{E}(t_{1})-\mathcal{E}(T))\leq\frac{1}{\alpha-1-b}\mathcal{E}(t_{1}).

By letting \(T\longrightarrow+\infty\) we obtain

\int_{t_{0}}^{+\infty}t\|\dot{x}(t)\|^{2}dt<+\infty.

Now, by using Lemma 15 with \(y=x(t)\) and \(x=x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\) we obtain

g(x(t))-g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\leq (28)
\left\langle{\nabla}g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right),-\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right\rangle+\frac{L_{g}}{2}\left\|\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right\|^{2}\leq
\frac{|\gamma t+\beta|}{t}\left\|{\nabla}g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\right\|\left\|\dot{x}(t)\right\|+\frac{L_{g}}{2}\left(\gamma+\frac{\beta}{t}\right)^{2}\|\dot{x}(t)\|^{2}.

Now, according to (24), there exist \(t^{\prime}\geq t_{0}\) and \(K^{\prime}>0\) such that

\left\|{\nabla}g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\right\|\leq\frac{K^{\prime}}{t},\mbox{ for all }t\geq t^{\prime}.

Further, (25) assures that there exist \(t_{1}\geq t_{0}\) and \(K>0\) such that

\|\dot{x}(t)\|\leq\frac{K}{t},\mbox{ for all }t\geq t_{1}.

Obviously,

\left|\gamma+\frac{\beta}{t}\right|\leq\gamma+\frac{|\beta|}{t_{0}},\mbox{ for all }t\geq t_{0},

hence (28) assures that there exist \(K_{1}^{\prime}>0\) and \(t_{2}^{\prime}\geq t_{0}\) such that

g(x(t))-g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\leq\frac{K_{1}^{\prime}}{t^{2}},\mbox{ for all }t\geq t_{2}^{\prime}. (29)

Now, by adding (22) and (29), we get that there exist \(K_{1}>0\) and \(t_{2}\geq t_{0}\) such that

\left(g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)-\min g\right)+\left(g(x(t))-g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\right)\leq\frac{K_{1}}{t^{2}},\mbox{ for all }t\geq t_{2}

that is

g(x(t))-\min g\leq\frac{K_{1}}{t^{2}},\mbox{ for all }t\geq t_{2}.

\blacksquare
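As a sanity check of the rate (27) (our own illustration, not from the paper: the quadratic objective, the parameter values and the time horizon below are arbitrary choices), one can integrate the system (6) numerically and monitor \(t^{2}(g(x(t))-\min g)\), which should remain bounded:

```python
# Minimal sketch (assumed setup): integrate system (6),
#   x''(t) + (alpha/t) x'(t) + grad g(x(t) + (gamma + beta/t) x'(t)) = 0,
# for a strongly convex quadratic g and monitor t^2 * (g(x(t)) - min g),
# which (27) predicts stays bounded. All concrete values are illustrative.
import numpy as np
from scipy.integrate import solve_ivp

A = np.diag([1.0, 10.0])                      # g(x) = 0.5 <Ax, x>, min g = 0
g = lambda x: 0.5 * x @ (A @ x)
grad_g = lambda x: A @ x

alpha, gamma, beta, t0 = 4.0, 0.5, 1.0, 1.0   # alpha > 3, gamma > 0

def rhs(t, y):
    # state y = (x, x'); returns (x', x'')
    x, v = y[:2], y[2:]
    acc = -(alpha / t) * v - grad_g(x + (gamma + beta / t) * v)
    return np.concatenate([v, acc])

y0 = np.concatenate([np.array([5.0, -3.0]), np.zeros(2)])  # x(t0), x'(t0)
sol = solve_ivp(rhs, (t0, 200.0), y0, dense_output=True, rtol=1e-9, atol=1e-12)

for t in (10.0, 50.0, 100.0, 200.0):
    x = sol.sol(t)[:2]
    print(f"t = {t:6.1f},  t^2 * (g(x(t)) - min g) = {t**2 * g(x):.4e}")
```

With these choices the printed quantity should settle to a bounded value rather than grow, in line with Theorem 16.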

Remark 17

Note that if we assume that \(\beta\geq 0\) then one can allow \(\gamma=0\) in the hypotheses of Theorem 16, since in this case condition (17) is satisfied and the conclusions of Theorem 16 hold.

Remark 18

Note that (24), which holds whenever \(\alpha>3\) and \(\gamma>0\), assures that

t\left\|{\nabla}g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\right\|\in L^{2}([t_{0},+\infty),\mathbb{R}).

Further, (26) assures that

\sqrt{t}\|\dot{x}(t)\|\in L^{2}([t_{0},+\infty),\mathbb{R}). (30)

Consequently, the system (6) leads to

\|t\ddot{x}(t)\|=\left\|\alpha\dot{x}(t)+t{\nabla}g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\right\|\leq\alpha\|\dot{x}(t)\|+t\left\|{\nabla}g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\right\|

hence,

t\|\ddot{x}(t)\|\in L^{2}([t_{0},+\infty),\mathbb{R}). (31)

Now, from (12) we have

\sqrt{t}\|x^{(3)}(t)\|\leq K(\sqrt{t}\|\dot{x}(t)\|+\sqrt{t}\|\ddot{x}(t)\|)

for some \(K>0\) and for almost every \(t\geq t_{0}\), which combined with (30) and (31) gives

\sqrt{t}\|x^{(3)}(t)\|\in L^{2}([t_{0},+\infty),\mathbb{R}). (32)
Remark 19

Notice that (30), (31) and (32) assure in particular that

\lim_{t\longrightarrow+\infty}\|\dot{x}(t)\|=\lim_{t\longrightarrow+\infty}\|\ddot{x}(t)\|=\lim_{t\longrightarrow+\infty}\|{x}^{(3)}(t)\|=0. (33)
Remark 20

In the case \(\gamma=0\) and \(\beta\geq 0\), according to Remark 13 one has

\sqrt{t}\left\|{\nabla}g\left(x(t)+\frac{\beta}{t}\dot{x}(t)\right)\right\|\in L^{2}([t_{0},+\infty),\mathbb{R}).

Hence, as in Remark 18 we derive that

\sqrt{t}\|\ddot{x}(t)\|,\,\sqrt{t}\|x^{(3)}(t)\|\in L^{2}([t_{0},+\infty),\mathbb{R})

and consequently (33) holds.

3.3 The convergence of the generated trajectories

In this subsection we show convergence of the generated trajectories to a minimum point of the objective \(g\). The main tool that we use in order to attain this goal is the following continuous version of the Opial Lemma (see [34, 30, 10]).

Lemma 21

Let \(H\) be a separable Hilbert space and \(S\subseteq H\), with \(S\neq\emptyset\). Further, consider a given map \(x:[t_{0},+\infty)\to H\). Suppose that the following conditions are satisfied:

(i)\quad\forall z\in S,\,\exists\lim\limits_{t\longrightarrow+\infty}\|x(t)-z\|;
(ii)\quad\text{every weak sequential limit point of }x(t)\text{ belongs to the set }S.

Then, \(x(t)\) converges weakly to a point of \(S\) as \(t\to+\infty\).

Remark 22

In the setting of the proof concerning the convergence of the generated trajectories, we consider the set \(S\) to be \(\operatorname*{argmin}g\). Moreover, the Hilbert space \(H\) is \(\mathbb{R}^{m}\), which implies that we actually deduce strong convergence of a strong global solution of the dynamical system (6) to a minimum of the objective function \(g\).

Consider now the set of limit points of the trajectory \(x\), that is,

\omega(x)=\{\overline{x}\in\mathbb{R}^{m}:\exists(t_{n})_{n\in\mathbb{N}}\subseteq\mathbb{R},\,t_{n}\longrightarrow+\infty,\,n\longrightarrow+\infty\text{ such that }x(t_{n})\longrightarrow\overline{x},\,n\longrightarrow+\infty\}.

We show that \(\omega(x)\subseteq\operatorname*{argmin}g\). We emphasize that since \(g\) is convex one has

\operatorname*{argmin}g=\operatorname*{crit}g:=\{\overline{x}\in\mathbb{R}^{m}:{\nabla}g(\overline{x})=0\}.

We have the following result.

Lemma 23

Let \(x\) be the unique strong global solution of (6) and assume that \(\operatorname*{argmin}g\neq\emptyset\). If \(\alpha>3\), \(\gamma>0\) and \(\beta\in\mathbb{R}\), then the following assertions hold.

  • (i)

    \(\omega(x)=\omega\left(x(\cdot)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(\cdot)\right)\);

  • (ii)

    \(\omega(x)\subseteq\operatorname*{argmin}g\).

Proof.

Indeed, (33) assures that \(\lim_{t\longrightarrow+\infty}\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)=0\), which immediately proves (i).

For proving (ii), consider \(\overline{x}\in\omega(x)\). Then, there exists \((t_{n})_{n\in\mathbb{N}}\subseteq\mathbb{R}\), \(t_{n}\longrightarrow+\infty\) as \(n\longrightarrow+\infty\), such that \(\lim_{n\longrightarrow+\infty}x(t_{n})=\overline{x}\). Now, since \({\nabla}g\) is continuous and \(\lim_{n\longrightarrow+\infty}\left(x(t_{n})+\left(\gamma+\frac{\beta}{t_{n}}\right)\dot{x}(t_{n})\right)=\overline{x}\), one has

\lim_{n\longrightarrow+\infty}{\nabla}g\left(x(t_{n})+\left(\gamma+\frac{\beta}{t_{n}}\right)\dot{x}(t_{n})\right)={\nabla}g(\overline{x}).

Further, according to (33)

\lim_{n\longrightarrow+\infty}\left(\ddot{x}(t_{n})+\frac{\alpha}{t_{n}}\dot{x}(t_{n})\right)=0.

Now, the system (6) gives

0=\lim_{n\longrightarrow+\infty}\left(\ddot{x}(t_{n})+\frac{\alpha}{t_{n}}\dot{x}(t_{n})+{\nabla}g\left(x(t_{n})+\left(\gamma+\frac{\beta}{t_{n}}\right)\dot{x}(t_{n})\right)\right)={\nabla}g(\overline{x}),

that is, \(\overline{x}\in\operatorname*{argmin}g\), and this proves (ii). \blacksquare

Remark 24

Obviously, according to Remark 20 the conclusion of Lemma 23 remains valid also in the case \(\alpha>3\), \(\gamma=0\) and \(\beta\geq 0\). Note that (ii) in the conclusion of Lemma 23 is actually condition (ii) from Lemma 21. For proving (i) from Lemma 21, that is, that the limit \(\lim_{t\longrightarrow+\infty}\|x(t)-x^{*}\|\) exists for every \(x^{*}\in\operatorname*{argmin}g\), we need the following result from [10].

Lemma 25

(Lemma A.4 in [10]) Let \(t_{0}>0\), and let \(w:[t_{0},+\infty)\longrightarrow\mathbb{R}\) be a continuously differentiable function which is bounded from below. Assume that

t\ddot{w}(t)+\alpha\dot{w}(t)\leq G(t)

for some \(\alpha>1\), almost every \(t>t_{0}\), and some nonnegative function \(G\in L^{1}(t_{0},+\infty)\). Then the positive part \([\dot{w}]_{+}\) of \(\dot{w}\) belongs to \(L^{1}(t_{0},+\infty)\) and the limit \(\lim_{t\longrightarrow+\infty}w(t)\) exists.

Now we can prove the following.

Lemma 26

Let \(x\) be the unique strong global solution of (6) and assume that \(\operatorname*{argmin}g\neq\emptyset\). If \(\alpha>3\), \(\gamma>0\) and \(\beta\in\mathbb{R}\), then for every \(x^{*}\in\operatorname*{argmin}g\) there exists the limit

\lim_{t\longrightarrow+\infty}\|x(t)-x^{*}\|.
Proof.

Let \(x^{*}\in\operatorname*{argmin}g\) and define the function \(h_{x^{*}}:[t_{0},+\infty)\longrightarrow\mathbb{R}\), \(h_{x^{*}}(t)=\frac{1}{2}\|x(t)-x^{*}\|^{2}\). Differentiating \(h_{x^{*}}\) via the chain rule, we obtain

\dot{h}_{x^{*}}(t)=\langle\dot{x}(t),x(t)-x^{*}\rangle

and that

\ddot{h}_{x^{*}}(t)=\langle\ddot{x}(t),x(t)-x^{*}\rangle+\|\dot{x}(t)\|^{2}.

Let us denote \(g^{*}=g(x^{*})=\min g\). Using the dynamical system (6), one has

\ddot{h}_{x^{*}}(t)+\frac{\alpha}{t}\dot{h}_{x^{*}}(t)=-\left\langle\nabla g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right),x(t)-x^{*}\right\rangle+\|\dot{x}(t)\|^{2}.

This means that

\ddot{h}_{x^{*}}(t)+\frac{\alpha}{t}\dot{h}_{x^{*}}(t)=-\left\langle\nabla g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right),x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)-x^{*}\right\rangle+\|\dot{x}(t)\|^{2} (34)
+\left\langle\nabla g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right),\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right\rangle.

By the convexity of the mapping \(g\) and taking into account that \(g^{*}=g(x^{*})=\min g\), we obtain

-\left\langle\nabla g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right),x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)-x^{*}\right\rangle\leq g^{*}-g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\leq 0,

hence by using (6) the inequality (34) becomes

\ddot{h}_{x^{*}}(t)+\frac{\alpha}{t}\dot{h}_{x^{*}}(t)\leq g^{*}-g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)+\left(1-\frac{\alpha}{t}\left(\gamma+\frac{\beta}{t}\right)\right)\|\dot{x}(t)\|^{2}-\left(\gamma+\frac{\beta}{t}\right)\langle\dot{x}(t),\ddot{x}(t)\rangle.

Consequently, one has

t\ddot{h}_{x^{*}}(t)+\alpha\dot{h}_{x^{*}}(t)\leq\left(t-\alpha\left(\gamma+\frac{\beta}{t}\right)\right)\|\dot{x}(t)\|^{2}-\left(\gamma t+\beta\right)\langle\dot{x}(t),\ddot{x}(t)\rangle. (35)

Now, according to (30) one has

t\|\dot{x}(t)\|^{2}\in L^{1}([t_{0},+\infty),\mathbb{R}).

Consequently,

\left(t-\alpha\left(\gamma+\frac{\beta}{t}\right)\right)\|\dot{x}(t)\|^{2}\in L^{1}([t_{0},+\infty),\mathbb{R}). (36)

Further, by the Cauchy-Schwarz and Young inequalities,

-\left(\gamma t+\beta\right)\langle\dot{x}(t),\ddot{x}(t)\rangle\leq\frac{1}{2}\|\dot{x}(t)\|^{2}+\frac{1}{2}\left(\gamma t+\beta\right)^{2}\|\ddot{x}(t)\|^{2}

and (31) assures that

t^{2}\|\ddot{x}(t)\|^{2}\in L^{1}([t_{0},+\infty),\mathbb{R}),

hence

-\left(\gamma t+\beta\right)\langle\dot{x}(t),\ddot{x}(t)\rangle\in L^{1}([t_{0},+\infty),\mathbb{R}). (37)

According to (36) and (37) we get that the function

G(t)=\left(t-\alpha\left(\gamma+\frac{\beta}{t}\right)\right)\|\dot{x}(t)\|^{2}+\frac{1}{2}\|\dot{x}(t)\|^{2}+\frac{1}{2}\left(\gamma t+\beta\right)^{2}\|\ddot{x}(t)\|^{2}\in L^{1}([t_{0},+\infty),\mathbb{R}).

Moreover, it is obvious that there exists \(t_{1}\geq t_{0}\) such that \(G\) is nonnegative for every \(t\geq t_{1}\). From (35) one has

t\ddot{h}_{x^{*}}(t)+\alpha\dot{h}_{x^{*}}(t)\leq G(t),\mbox{ for every }t\geq t_{0}, (38)

hence Lemma 25 leads to the existence of the limit

\lim_{t\longrightarrow+\infty}h_{x^{*}}(t)

and consequently the limit

\lim_{t\longrightarrow+\infty}\|x(t)-x^{*}\|

also exists. \blacksquare

Now we present the main result of this subsection regarding the convergence of the solution of the dynamical system (6) as \(t\to+\infty\).

Theorem 27

Let \(x\) be the unique strong global solution of the dynamical system (6) and assume that \(\operatorname*{argmin}g\neq\emptyset\). If \(\alpha>3\), \(\gamma>0\) and \(\beta\in\mathbb{R}\), then \(x(t)\) converges to a point in \(\operatorname*{argmin}g\) as \(t\longrightarrow+\infty\).

Proof.

Taking into account that conclusion (ii) in Lemma 23 and the conclusion of Lemma 26 are exactly the conditions in the hypothesis of Lemma 21 with \(S=\operatorname*{argmin}g\), the conclusion of the theorem is straightforward. \blacksquare

Remark 28

As expected, Theorem 27 remains true if in its hypotheses we assume only that \(\alpha>3\), \(\gamma=0\) and \(\beta\geq 0\). Indeed, note that under these assumptions the conclusion of Lemma 26 holds, since \(G\) from its proof becomes

G(t)=\left(t-\frac{\alpha\beta}{t}\right)\|\dot{x}(t)\|^{2}+\frac{1}{2}\|\dot{x}(t)\|^{2}+\frac{1}{2}\beta^{2}\|\ddot{x}(t)\|^{2}

and according to Remark 20 one has \(G\in L^{1}([t_{0},+\infty),\mathbb{R})\). Moreover, it is obvious that there exists \(t_{1}\geq t_{0}\) such that \(G\) is nonnegative for every \(t\geq t_{1}\).

This fact, combined with Remark 24, leads to the desired conclusion.

4 Conclusions

In this paper we study a second order dynamical system which can be viewed as an extension of the heavy ball system with vanishing damping. This dynamical system is actually a perturbed version of the heavy ball system with vanishing damping, but the perturbation is made in the argument of the gradient of the objective function. Numerical experiments show that this perturbation brings a smoothing effect in the behaviour of the energy error \(g(x(t))-\min g\) and also in the behaviour of the absolute error \(\|x(t)-x^{*}\|\), where \(x(t)\) is a trajectory generated by our dynamical system and \(x^{*}\) is a minimum of the objective \(g\). Another novelty of our system is that, via explicit discretization, it leads to inertial algorithms. A related future research direction is the convergence analysis of algorithm (10), since this algorithm contains as a particular case the famous Nesterov algorithm. Moreover, since Algorithm (10) may allow different inertial steps, its area of applicability can be considerably wider.

We have shown existence and uniqueness of the trajectories generated by our dynamical system even in the case where the objective function \(g\) is non-convex. Further, we treated the cases when the energy error \(g(x(t))-\min g\) is of order \(\mathcal{O}\left(\frac{1}{t^{2}}\right)\), and we obtained the convergence of a generated trajectory to a minimum of the objective function \(g\). Another related research direction is the convergence analysis of the generated trajectories in the case when the objective function \(g\) is possibly non-convex. This would be a novelty in the literature even for the case \(\alpha>3\), \(\gamma=\beta=0\), that is, for the heavy ball system with vanishing damping.

We underline that the dynamical system (6) can easily be extended to proximal-gradient dynamical systems (see [18, 21] and the references therein). Indeed, let \(f:\mathbb{R}^{m}\longrightarrow\overline{\mathbb{R}}\) be a proper, convex and lower semicontinuous function and let \(g:\mathbb{R}^{m}\longrightarrow\mathbb{R}\) be a possibly non-convex smooth function with \(L_{g}\)-Lipschitz continuous gradient.

We recall that the proximal point operator of the convex function \(\lambda f\) is defined as

\operatorname*{prox}\nolimits_{\lambda f}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{m},\quad\operatorname*{prox}\nolimits_{\lambda f}(x)=\operatorname*{argmin}_{y\in\mathbb{R}^{m}}\left\{f(y)+\frac{1}{2\lambda}\|y-x\|^{2}\right\}.
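For instance (a standard closed form, recalled here only for illustration), for \(f=\|\cdot\|_{1}\) the proximal map reduces to componentwise soft-thresholding, \(\operatorname*{prox}\nolimits_{\lambda\|\cdot\|_{1}}(x)_{i}=\operatorname{sign}(x_{i})\max(|x_{i}|-\lambda,0)\). A minimal sketch:

```python
# Soft-thresholding: the closed-form proximal map of lambda * ||.||_1.
# A standard example, not specific to this paper; the loop below checks
# the argmin definition of the prox numerically on nearby points.
import numpy as np

def prox_l1(x, lam):
    """prox_{lam * ||.||_1}(x), applied componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([3.0, -0.2, 0.7, -1.5])
lam = 0.5
p = prox_l1(x, lam)
print(p)  # [ 2.5  0.   0.2 -1. ]

# Defining property: p minimizes y -> ||y||_1 + ||y - x||^2 / (2*lam)
obj = lambda y: np.sum(np.abs(y)) + np.sum((y - x) ** 2) / (2 * lam)
for _ in range(1000):
    y = p + 0.1 * np.random.randn(*p.shape)
    assert obj(y) >= obj(p) - 1e-12
```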

Consider the optimization problem

\inf_{x\in\mathbb{R}^{m}}f(x)+g(x).

One can associate to this optimization problem the following second order proximal-gradient dynamical system:

\left\{\begin{array}{l}\ddot{x}(t)+\left(\gamma+\frac{\alpha+\beta}{t}\right)\dot{x}(t)+x(t)=\operatorname*{prox}\nolimits_{\lambda f}\left(\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)-\lambda{\nabla}g\left(x(t)+\left(\gamma+\frac{\beta}{t}\right)\dot{x}(t)\right)\right),\\ x(t_{0})=u_{0},\,\dot{x}(t_{0})=v_{0},\end{array}\right. (39)

where \(\alpha>0\), \(\gamma>0\), \(\lambda>0\), \(\beta\in\mathbb{R}\) and \(t_{0}>0\).

Obviously, when \(f\equiv 0\) and \(\lambda=1\), (39) becomes the dynamical system (6) studied in the present paper.

The discretization of the dynamical system (39) leads to proximal-gradient inertial algorithms, similar to the modified FISTA algorithm studied by Chambolle and Dossal in [24] (see also [15]).

Indeed, the explicit discretization of (39) with respect to the time variable \(t\), with step size \(h_{n}\) and initial points \(x_{0}:=u_{0}\), \(x_{1}:=v_{0}\), yields the iterative scheme

\frac{x_{n+1}-2x_{n}+x_{n-1}}{h_{n}^{2}}+\left(\gamma+\frac{\alpha+\beta}{nh_{n}}\right)\frac{x_{n}-x_{n-1}}{h_{n}}+x_{n+1}=
\operatorname*{prox}\nolimits_{\lambda f}\left(x_{n}+\left(\gamma+\frac{\beta}{nh_{n}}\right)\frac{x_{n}-x_{n-1}}{h_{n}}-\lambda\nabla g\left(x_{n}+\left(\gamma+\frac{\beta}{nh_{n}}\right)\frac{x_{n}-x_{n-1}}{h_{n}}\right)\right),\ \forall n\geq 1.

The latter can be expressed as

(1+h_{n}^{2})x_{n+1}=x_{n}+\left(1-\gamma h_{n}-\frac{\alpha+\beta}{n}\right)(x_{n}-x_{n-1})+
h_{n}^{2}\operatorname*{prox}\nolimits_{\lambda f}\left(x_{n}+\left(\gamma+\frac{\beta}{nh_{n}}\right)\frac{x_{n}-x_{n-1}}{h_{n}}-\lambda\nabla g\left(x_{n}+\left(\gamma+\frac{\beta}{nh_{n}}\right)\frac{x_{n}-x_{n-1}}{h_{n}}\right)\right).

Consequently, the dynamical system (39) leads to the following algorithm: for \(x_{0},\,x_{1}\in\mathbb{R}^{m}\) consider

\left\{\begin{array}{l}y_{n}=x_{n}+\left(1-\gamma h_{n}-\frac{\alpha+\beta}{n}\right)(x_{n}-x_{n-1}),\\ z_{n}=x_{n}+\left(\frac{\gamma}{h_{n}}+\frac{\beta}{nh_{n}^{2}}\right)(x_{n}-x_{n-1}),\\ x_{n+1}=\frac{1}{1+h_{n}^{2}}y_{n}+\frac{h_{n}^{2}}{1+h_{n}^{2}}\operatorname*{prox}\nolimits_{\lambda f}(z_{n}-\lambda{\nabla}g(z_{n})),\end{array}\right. (40)

where \(\alpha\geq 0\), \(\gamma\geq 0\) and \(\beta\in\mathbb{R}\).

Now, the simplest case is obtained by taking the constant step size \(h_{n}\equiv 1\) in (40). By denoting \(\alpha+\beta\) again by \(\alpha\in\mathbb{R}\), Algorithm (40) becomes: for \(x_{0},x_{1}\in\mathbb{R}^{m}\) and for every \(n\geq 1\) consider

\left\{\begin{array}{l}y_{n}=x_{n}+\frac{(1-\gamma)n-\alpha}{n}(x_{n}-x_{n-1}),\\ z_{n}=x_{n}+\frac{\gamma n+\beta}{n}(x_{n}-x_{n-1}),\\ x_{n+1}=\frac{1}{2}y_{n}+\frac{1}{2}\operatorname*{prox}\nolimits_{\lambda f}(z_{n}-\lambda{\nabla}g(z_{n})),\end{array}\right. (41)

where \(\alpha\geq\beta\), \(\gamma\geq 0\) and \(\beta\in\mathbb{R}\). The convergence of the sequences generated by Algorithm (41) to a critical point of the objective function \(f+g\) would open the way to the study of FISTA type algorithms with nonidentical inertial terms.
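As an illustration only (the data, the choice \(f=\|\cdot\|_{1}\), the step \(\lambda\), the parameter values, and the iteration count are our own assumptions; the paper leaves the convergence analysis of (41) as future work), a direct transcription of Algorithm (41) reads:

```python
# Sketch of Algorithm (41) (constant step size h_n = 1) on the model
# problem min_x f(x) + g(x) with f = ||.||_1 and g(x) = 0.5*||Bx - c||^2.
# All concrete data and parameter values below are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((20, 5))
c = rng.standard_normal(20)

grad_g = lambda x: B.T @ (B @ x - c)
lam = 0.1 / np.linalg.norm(B.T @ B, 2)   # a step well below 1/L_g

def prox_l1(w, t):
    # prox_{t * ||.||_1}(w): componentwise soft-thresholding
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

alpha, beta, gamma = 4.0, 1.0, 0.0       # alpha >= beta, gamma >= 0

x_prev = np.zeros(5)                     # x_0
x_curr = np.zeros(5)                     # x_1
for n in range(1, 500):
    d = x_curr - x_prev
    y = x_curr + (((1 - gamma) * n - alpha) / n) * d   # y_n
    z = x_curr + ((gamma * n + beta) / n) * d          # z_n
    x_next = 0.5 * y + 0.5 * prox_l1(z - lam * grad_g(z), lam)
    x_prev, x_curr = x_curr, x_next

print(x_curr)   # approximate minimizer of f + g
```

Note the nonidentical inertial terms: the extrapolation coefficients in \(y_{n}\) and \(z_{n}\) differ whenever \(\gamma\neq 0\) or \(\alpha\neq\beta\).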

References

  • [1] J.F. Aujol, Ch. Dossal, Optimal rate of convergence of an ODE associated to the Fast Gradient Descent schemes for \(b>0\), HAL preprint, https://hal.inria.fr/hal-01547251v2/document
  • [2] J.F. Aujol, Ch. Dossal, A. Rondepierre, Optimal convergence rates for Nesterov acceleration, arxiv.org/abs/1805.05719
  • [3] F. Alvarez, H. Attouch, An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping, Set-Valued Analysis, 9, 3-11, 2001
  • [4] V. Apidopoulos, J.F. Aujol, Ch. Dossal, Convergence rate of inertial Forward–Backward algorithm beyond Nesterov’s rule, Math. Program. (2018). https://doi.org/10.1007/s10107-018-1350-9
  • [5] B. Abbas, H. Attouch, B.F. Svaiter, Newton-like dynamics and forward-backward methods for structured monotone inclusions in Hilbert spaces, Journal of Optimization Theory and its Applications 161(2), 331-360, 2014
  • [6] F. Alvarez, H. Attouch, J. Bolte, P. Redont, A second-order gradient-like dissipative dynamical system with Hessian-driven damping. Application to optimization and mechanics, Journal de Mathématiques Pures et Appliquées (9) 81(8), 747-779, 2002
  • [7] H. Attouch, Z. Chbani, Fast inertial dynamics and FISTA algorithms in convex optimization. Perturbation aspects, arXiv preprint arXiv:1507.01367, 2015
  • [8] H. Attouch, Z. Chbani, H. Riahi, Rate of convergence of the Nesterov accelerated gradient method in the subcritical case \(\alpha\leq 3\), ESAIM: COCV (2017). https://doi.org/10.1051/cocv/2017083
  • [9] H. Attouch, Z. Chbani, H. Riahi, Fast convex optimization via time scaling of damped inertial gradient dynamics, https://hal.archives-ouvertes.fr/hal-02138954
  • [10] H. Attouch, Z. Chbani, J. Peypouquet, P. Redont, Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity, Math. Program. 168(1-2) Ser. B, 123-175, 2018
  • [11] H. Attouch, X. Goudou, P. Redont, The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system, Communications in Contemporary Mathematics 2, 1-34, 2000
  • [12] H. Attouch, J. Peypouquet, P. Redont, Fast convex optimization via inertial dynamics with Hessian driven damping, J. Differential Equations, 261(10), 5734-5783, 2016
  • [13] H. Attouch, B.F. Svaiter, A continuous dynamical Newton-like approach to solving monotone inclusions, SIAM Journal on Control and Optimization 49(2), 574-598, 2011
  • [14] M. Balti, R. May, Asymptotic for the perturbed heavy ball system with vanishing damping term, arXiv preprint arXiv:1609.00135, 2016
  • [15] A. Beck, M. Teboulle, A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems, SIAM Journal on Imaging Sciences, 2(1), 183-202, 2009
  • [16] P. Bégout, J. Bolte, M.A. Jendoubi, On damped second-order gradient systems, Journal of Differential Equations, (259), 3115-3143, 2015
  • [17] R.I. Boţ, E.R. Csetnek, S.C. László, An inertial forward-backward algorithm for minimizing the sum of two non-convex functions, Euro Journal on Computational Optimization, 4(1), 3-25, 2016
  • [18] R.I. Boţ, E.R. Csetnek, S.C. László, Approaching nonsmooth nonconvex minimization through second-order proximal-gradient dynamical systems, Journal of Evolution Equations, 18(3), 1291-1318, 2018
  • [19] C. Alecsa, S.C. László, A. Viorel, A gradient type algorithm with backward inertial steps associated to a nonconvex minimization problem, https://www.researchgate.net/publication/331832475
  • [20] R.I. Boţ, E.R. Csetnek, S.C. László, A second order dynamical approach with variable damping to nonconvex smooth minimization, Applicable Analysis (2018). https://doi.org/10.1080/00036811.2018.1495330
  • [21] R.I. Boţ, E.R. Csetnek, S.C. László, A primal-dual dynamical approach to structured convex minimization problems, https://arxiv.org/abs/1905.08290
  • [22] A. Cabot, H. Engler, S. Gadat, On the long time behavior of second order differential equations with asymptotically small dissipation, Trans. Amer. Math. Soc. 361, 5983-6017, 2009
  • [23] A. Cabot, H. Engler, S. Gadat, Second order differential equations with asymptotically small dissipation and piecewise flat potentials, Electronic Journal of Differential Equations 17, 33-38, 2009
  • [24] A. Chambolle, Ch. Dossal, On the convergence of the iterates of the "fast iterative shrinkage/thresholding algorithm", J. Optim. Theory Appl., 166(3), 968-982, 2015
  • [25] A. Haraux, Systèmes Dynamiques Dissipatifs et Applications, Recherches en Mathématiques Appliquées 17, Masson, Paris, 1991
  • [26] S.C. László, Convergence rates for an inertial algorithm of gradient type associated to a smooth nonconvex minimization, https://arxiv.org/pdf/1811.09616.pdf
  • [27] M. Muehlebach, M. I. Jordan, A Dynamical Systems Perspective on Nesterov Acceleration, arXiv:1905.07436v1
  • [28] Y. Nesterov, Introductory lectures on convex optimization: a basic course, Kluwer Academic Publishers, Dordrecht, 2004
  • [29] Y.E. Nesterov, A method for solving the convex programming problem with convergence rate \(O(1/k^{2})\), (Russian) Dokl. Akad. Nauk SSSR 269(3), 543-547, 1983
  • [30] Z. Opial, Weak convergence of the sequence of successive approximations for nonexpansive mappings, Bull. Amer. Math. Soc., 73, 591-597, 1967
  • [31] B. T. Polyak, Some methods of speeding up the convergence of iteration methods, U.S.S.R. Comput. Math. Math. Phys., 4(5), 1-17, 1964
  • [32] E.D. Sontag, Mathematical control theory. Deterministic finite-dimensional systems, Second edition, Texts in Applied Mathematics 6, Springer-Verlag, New York, 1998
  • [33] W. Su, S. Boyd, E.J. Candes, A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights, Journal of Machine Learning Research, 17, 1-43, 2016
  • [34] R.E. Bruck, Asymptotic convergence of nonlinear contraction semigroups in Hilbert spaces, J. Funct. Anal., 18 , 15-26, 1975
