A gradient-type algorithm with backward inertial steps associated to a nonconvex minimization problem

Abstract

We investigate an algorithm of gradient type with a backward inertial step in connection with the minimization of a nonconvex differentiable function. We show that the generated sequences converge to a critical point of the objective function, if a regularization of the objective function satisfies the Kurdyka-Łojasiewicz property. Further, we provide convergence rates for the generated sequences and the objective function values formulated in terms of the Łojasiewicz exponent. Finally, some numerical experiments are presented in order to compare our numerical scheme with some algorithms well known in the literature.

Authors

Cristian Daniel Alecsa
Tiberiu Popoviciu Institute of Numerical Analysis, Romanian Academy
Department of Mathematics, Babes-Bolyai University, Cluj-Napoca, Romania

Szilárd Csaba László
Technical University of Cluj-Napoca, Romania

Adrian Viorel
Technical University of Cluj-Napoca, Romania

Keywords

Optimization; Unconstrained optimization problems; Inertial algorithm; Nonconvex optimization; Kurdyka-Łojasiewicz inequality; Convergence rate; Discrete Lyapunov energy functions.

Paper coordinates

Cristian Daniel Alecsa, Szilárd Csaba László, Adrian Viorel, A gradient-type algorithm with backward inertial steps associated to a nonconvex minimization problem, Numer. Algor., 84 (2020), pp. 485–512. doi: 10.1007/s11075-019-00765-z

About this paper

References

[1] H. Attouch, J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Mathematical Programming 116(1-2) Series B, 5-16, 2009

[2] H. Attouch, J. Bolte, P. Redont, A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality, Mathematics of Operations Research 35(2), 438-457, 2010

[3] H. Attouch, J. Bolte, B.F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods, Mathematical Programming 137(1-2) Series A, 91-129, 2013

[4] H. Attouch, Z. Chbani, J. Peypouquet, P. Redont, Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity, Math. Program. 168(1-2) Ser. B, 123-175, 2018

[5] P. Bégout, J. Bolte, M.A. Jendoubi, On damped second-order gradient systems, Journal of Differential Equations (259), 3115-3143, 2015

[6] J. Bolte, S. Sabach, M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Mathematical Programming Series A (146)(1-2), 459-494, 2014

[7] J. Bolte, A. Daniilidis, A. Lewis, The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM Journal on Optimization 17(4), 1205-1223, 2006

[8] J. Bolte, A. Daniilidis, A. Lewis, M. Shiota, Clarke subgradients of stratifiable functions, SIAM Journal on Optimization 18(2), 556-572, 2007

[9] J. Bolte, A. Daniilidis, O. Ley, L. Mazet, Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity, Transactions of the American Mathematical Society 362(6), 3319-3363, 2010

[10] R.I. Boţ, E.R. Csetnek, S.C. László, Approaching nonsmooth nonconvex minimization through second-order proximal-gradient dynamical systems, Journal of Evolution Equations 18(3), 1291-1318, 2018

[11] R.I. Boţ, E.R. Csetnek, S.C. László, An inertial forward-backward algorithm for minimizing the sum of two non-convex functions, EURO Journal on Computational Optimization 4(1), 3-25, 2016

[12] R.I. Boţ, E.R. Csetnek, S.C. László, A second order dynamical approach with variable damping to nonconvex smooth minimization, Applicable Analysis (2018), https://doi.org/10.1080/00036811.2018.1495330

[13] P. Frankel, G. Garrigos, J. Peypouquet, Splitting methods with variable metric for Kurdyka-Łojasiewicz functions and general convergence rates, Journal of Optimization Theory and Applications 165(3), 874-900, 2015

[14] K. Kurdyka, On gradients of functions definable in o-minimal structures, Annales de l'institut Fourier (Grenoble) 48(3), 769-783, 1998

[15] S.C. László, Convergence rates for an inertial algorithm of gradient type associated to a smooth nonconvex minimization, https://arxiv.org/abs/1807.0038721

[16] G. Li, T.K. Pong, Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods, Foundations of Computational Mathematics, 1-34, 2018

[17] S. Łojasiewicz, Une propriété topologique des sous-ensembles analytiques réels, Les Équations aux Dérivées Partielles, Éditions du Centre National de la Recherche Scientifique, Paris, 87-89, 1963

[18] Y.E. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k^2), (Russian) Dokl. Akad. Nauk SSSR 269(3), 543-547, 1983

[19] Y. Nesterov, Introductory lectures on convex optimization: a basic course, Kluwer Academic Publishers, Dordrecht, 2004

[20] B.T. Polyak, Some methods of speeding up the convergence of iteration methods, U.S.S.R. Comput. Math. Math. Phys. 4(5), 1-17, 1964

[21] R.T. Rockafellar, R.J.-B. Wets, Variational Analysis, Fundamental Principles of Mathematical Sciences 317, Springer-Verlag, Berlin, 1998

[22] W. Su, S. Boyd, E.J. Candès, A differential equation for modeling Nesterov's accelerated gradient method: theory and insights, Journal of Machine Learning Research 17, 1-43, 2016

Paper (preprint) in HTML form

A gradient-type algorithm with backward inertial steps associated to a nonconvex minimization problem

Cristian Daniel Alecsa 1,2 · Szilárd Csaba László 3 · Adrian Viorel 3
Abstract

We investigate an algorithm of gradient type with a backward inertial step in connection with the minimization of a nonconvex differentiable function. We show that the generated sequences converge to a critical point of the objective function, if a regularization of the objective function satisfies the Kurdyka-Łojasiewicz property. Further, we provide convergence rates for the generated sequences and the objective function values formulated in terms of the Łojasiewicz exponent. Finally, some numerical experiments are presented in order to compare our numerical scheme with some algorithms well known in the literature.

Keywords Inertial algorithm \cdot Nonconvex optimization \cdot Kurdyka-Łojasiewicz inequality \cdot Convergence rate

Mathematics Subject Classification (2010) 90C26 · 90C30 · 65K10

1 Introduction and Preliminaries

Gradient-type algorithms have a long history, going back at least to Cauchy (1847), and also a wealth of applications. Solving linear systems, Cauchy’s original motivation, is maybe the most obvious application, but many of today’s hot topics in machine learning or image processing also deal with optimization problems from an algorithmic perspective and rely on gradient-type algorithms.

The original gradient descent algorithm

x_{n+1}=x_{n}-s\nabla g\left(x_{n}\right),

which is precisely an explicit Euler method applied to the gradient flow

\dot{x}(t)=-\nabla g(x(t)),

does not achieve very good convergence rates and much research has been dedicated to accelerating convergence.

Based on the analogy to mechanical systems, e.g., to the movement, with friction, of a heavy ball in a potential well defined by the smooth convex objective function $g$ with $L_g$-Lipschitz continuous gradient, Polyak [26] was able to provide the seminal idea for achieving acceleration, namely the addition of inertial (momentum) terms to the gradient algorithm. His two-step iterative method, the so-called heavy ball method, takes the following form: for the initial values $x_{0}=x_{-1}\in\mathbb{R}^{m}$ and $n\in\mathbb{N}$ let:

\left\{\begin{array}{l}y_{n}=x_{n}+\alpha_{n}\left(x_{n}-x_{n-1}\right)\\ x_{n+1}=y_{n}-\beta_{n}\nabla g\left(x_{n}\right)\end{array}\right.

where $\alpha_{n}\in[0,1)$ and $\beta_{n}>0$ is a step-size parameter. Recently, in [29], the convergence rate of order $\mathcal{O}\left(\frac{1}{n}\right)$ has been obtained for the heavy ball algorithm, provided $g$ is coercive, the inertial parameter $\left(\alpha_{n}\right)_{n\in\mathbb{N}}$ is a nonincreasing sequence, and the step-size parameter satisfies $\beta_{n}=\frac{2\left(1-\alpha_{n}\right)c}{L_{g}}$ for some fixed $c\in(0,1)$ (see also [18] for an ergodic rate). More precisely, in [29], it is shown that under the previous assumptions one has:

g\left(x_{n}\right)-\min g=\mathcal{O}\left(\frac{1}{n}\right).
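For readers who want to experiment, here is a minimal sketch of the heavy ball iteration recalled above; the test function, its gradient and the concrete parameter values are illustrative choices of ours, not the paper's experiments:

```python
import numpy as np

def heavy_ball(grad_g, x0, L_g, alpha=0.5, c=0.9, n_iter=200):
    """Polyak's heavy ball method with a constant inertial parameter alpha
    and step size beta = 2(1 - alpha)c / L_g, as in the rate result above."""
    beta = 2.0 * (1.0 - alpha) * c / L_g
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_iter):
        y = x + alpha * (x - x_prev)           # inertial (momentum) step
        x_prev, x = x, y - beta * grad_g(x)    # gradient step taken at x_n
    return x

# Illustrative run on g(x) = 0.5*||x||^2, whose gradient is x (so L_g = 1).
x_min = heavy_ball(grad_g=lambda x: x, x0=np.array([3.0, -2.0]), L_g=1.0)
```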

It is worthwhile mentioning that the forward-backward algorithm studied in [12] in a full nonconvex setting reduces to Polyak's heavy ball method if the nonsmooth term vanishes; hence, it can be viewed as an extension of the heavy ball method to the case when the objective function $g$ is possibly nonconvex, but still has Lipschitz continuous gradient with Lipschitz constant $L_g$. Indeed, Algorithm 1 from [12], in case $f\equiv 0$ and $F=\frac{1}{2}\|\cdot\|^{2}$, has the form: for the initial values $x_{0}=x_{-1}\in\mathbb{R}^{m}$ and $n\in\mathbb{N}$, let:

x_{n+1}=x_{n}+\alpha_{n}\left(x_{n}-x_{n-1}\right)-\beta_{n}\nabla g\left(x_{n}\right), (2)

where $0<\underline{\beta}\leq\beta_{n}\leq\bar{\beta}<+\infty$ and $\alpha_{n}\in[0,\alpha]$, $\alpha>0$, for all $n\geq 1$. In this particular case, convergence of the generated sequence $\left(x_{n}\right)_{n\in\mathbb{N}}$ to a critical point of the objective function $g$ can be shown under the assumption that a regularization of the objective function satisfies the Kurdyka-Łojasiewicz property, provided $\underline{\beta}$, $\bar{\beta}$ and $\alpha>0$ satisfy:

1>\bar{\beta}L_{g}+2\alpha\frac{\bar{\beta}}{\underline{\beta}}. (3)

Note that (3) implies $\alpha<\frac{1}{2}$; hence, $\alpha_{n}\in\left[0,\frac{1}{2}\right)$ for all $n\geq 1$. If $\underline{\beta}$ and $\alpha$ are positive numbers such that $1>\underline{\beta}L_{g}+2\alpha$, then by choosing $\bar{\beta}\in\left[\underline{\beta},\frac{\underline{\beta}}{\underline{\beta}L_{g}+2\alpha}\right)$, relation (3) is satisfied.
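For a concrete illustration of this parameter choice (the numbers are ours, not from the paper), take $L_{g}=1$, $\alpha=0.2$ and $\underline{\beta}=0.5$; then $\underline{\beta}L_{g}+2\alpha=0.9<1$ and

\bar{\beta}L_{g}+2\alpha\frac{\bar{\beta}}{\underline{\beta}}=\frac{\bar{\beta}}{\underline{\beta}}\left(\underline{\beta}L_{g}+2\alpha\right)=0.9\,\frac{\bar{\beta}}{\underline{\beta}}<1\quad\text{ for every }\bar{\beta}\in\left[0.5,\tfrac{0.5}{0.9}\right),

so relation (3) indeed holds for any such $\bar{\beta}$.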

Probably the most acclaimed inertial algorithm is Nesterov's accelerated gradient method, which in its particular form: for the initial values $x_{0}=x_{-1}\in\mathbb{R}^{m}$ and $n\in\mathbb{N}$ let:

\left\{\begin{array}{l}y_{n}=x_{n}+\frac{n}{n+3}\left(x_{n}-x_{n-1}\right),\\ x_{n+1}=y_{n}-s\nabla g\left(y_{n}\right),\end{array}\right. (4)

for a convex $g$ with $L_g$-Lipschitz continuous gradient and step size $s\leq\frac{1}{L_{g}}$, exhibits an improved convergence rate of $\mathcal{O}\left(1/n^{2}\right)$ (see [15, 23]), and which, as highlighted by Su, Boyd, and Candès [28] (see also [5]), can be seen as the discrete counterpart of the second-order differential equation:

\ddot{x}(t)+\frac{3}{t}\dot{x}(t)+\nabla g(x(t))=0.

In this respect, it may be useful to recall that until recently little was known about the efficiency of Nesterov's accelerated gradient method outside a convex setting. However, in [20], a Nesterov-like method, differing from the original only by a multiplicative coefficient, has been studied and convergence rates have been provided for the very general case when a regularization of the objective function $g$ has the KL property. More precisely, in [20], the following algorithm was considered.

For the initial values $x_{0}=x_{-1}\in\mathbb{R}^{m}$ and $n\in\mathbb{N}$ let:

\left\{\begin{array}{l}y_{n}=x_{n}+\frac{\beta n}{n+\alpha}\left(x_{n}-x_{n-1}\right),\\ x_{n+1}=y_{n}-s\nabla g\left(y_{n}\right),\end{array}\right. (5)

where $\alpha>0$, $\beta\in(0,1)$ and $0<s<\frac{2(1-\beta)}{L_{g}}$. Unfortunately, for technical reasons, one cannot allow $\beta=1$; therefore, one does not have full equivalence between Algorithm (5) and Algorithm (4). However, what is lost at the inertial parameter is gained at the step size, since for $0<\beta\leq\frac{1}{2}$ one may take $s\in\left[\frac{1}{L_{g}},\frac{2}{L_{g}}\right)$. Convergence of the sequences generated by Algorithm (5) was obtained under the assumption that the regularization of the objective function $g$, namely $H(x,y)=g(x)+\frac{1}{2}\|y-x\|^{2}$, is a KL function.

In this paper, we deal with the optimization problem:

\inf_{x\in\mathbb{R}^{m}}g(x), (P)

where $g:\mathbb{R}^{m}\longrightarrow\mathbb{R}$ is a Fréchet differentiable function with $L_{g}$-Lipschitz continuous gradient, i.e., there exists $L_{g}\geq 0$ such that $\|\nabla g(x)-\nabla g(y)\|\leq L_{g}\|x-y\|$ for all $x,y\in\mathbb{R}^{m}$, and we associate to (P) the following inertial algorithm of gradient type.

Consider the starting points $x_{0}=x_{-1}\in\mathbb{R}^{m}$, and for every $n\in\mathbb{N}$ let:

\left\{\begin{array}{l}y_{n}=x_{n}+\alpha_{n}\left(x_{n}-x_{n-1}\right),\\ x_{n+1}=y_{n}-\beta_{n}\nabla g\left(y_{n}\right),\end{array}\right. (6)

where we assume that

\lim_{n\rightarrow+\infty}\alpha_{n}=\alpha\in\left(\frac{-10+\sqrt{68}}{8},0\right),\quad\lim_{n\rightarrow+\infty}\beta_{n}=\beta\quad\text{ and }\quad 0<\beta<\frac{4\alpha^{2}+10\alpha+2}{L_{g}(2\alpha+1)^{2}}.

Remark 1 Observe that the inertial parameter $\alpha_{n}$ becomes negative after a number of iterations, and this can be viewed as taking a backward inertial step in our algorithm. Of course, this also shows that after a number of iterations $y_{n}$ is a convex combination of $x_{n-1}$ and $x_{n}$ (see [16] for similar constructions), that is:

y_{n}=\left(1-\left(-\alpha_{n}\right)\right)x_{n}+\left(-\alpha_{n}\right)x_{n-1},\quad-\alpha_{n}\in(0,1).

Another novelty of Algorithm (6) is that it allows a variable step size. Moreover, it can easily be verified that whenever $\alpha>-\frac{1}{6}$ one may take $\beta>\frac{1}{L_{g}}$.
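The following minimal sketch of scheme (6) assumes, for illustration only, constant parameters $\alpha_{n}\equiv\alpha$ and $\beta_{n}\equiv\beta$ chosen inside the admissible range above; the test function, its gradient and the concrete values are placeholders of ours, not the test problems from the paper's numerical section:

```python
import numpy as np

def backward_inertial_gradient(grad_g, x0, L_g, alpha=-0.1, n_iter=500):
    """Sketch of scheme (6): y_n = x_n + alpha*(x_n - x_{n-1}) with alpha < 0
    (the backward inertial step), followed by x_{n+1} = y_n - beta*grad g(y_n),
    where beta lies strictly below (4a^2 + 10a + 2) / (L_g (2a + 1)^2)."""
    assert (-10 + np.sqrt(68)) / 8 < alpha < 0, "alpha outside the admissible range"
    beta_max = (4 * alpha**2 + 10 * alpha + 2) / (L_g * (2 * alpha + 1) ** 2)
    beta = 0.9 * beta_max                      # any fixed value in (0, beta_max)
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_iter):
        y = x + alpha * (x - x_prev)           # backward inertial step
        x_prev, x = x, y - beta * grad_g(y)    # gradient step taken at y_n
    return x

# Illustrative run on g(x) = 0.5*||x||^2 (gradient x, L_g = 1); note that for
# alpha = -0.1 > -1/6 the admissible step size indeed exceeds 1/L_g.
x_min = backward_inertial_gradient(grad_g=lambda x: x, x0=np.array([2.0, 1.0]), L_g=1.0)
```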

We emphasize that the analysis of the proposed algorithm (6) is intimately related to the properties of the following regularization of the objective function $g$ (see also [11-13, 20, 21]), that is:

H:\mathbb{R}^{m}\times\mathbb{R}^{m}\longrightarrow\mathbb{R},\quad H(x,y)=g(x)+\frac{1}{2}\|y-x\|^{2}. (7)

In the remainder of this section, we introduce the necessary apparatus of notions and results that we will need in our forthcoming analysis.

For a differentiable function $f:\mathbb{R}^{m}\longrightarrow\mathbb{R}$, we denote by $\operatorname{crit}(f)=\left\{x\in\mathbb{R}^{m}:\nabla f(x)=0\right\}$ the set of critical points of $f$. The following so-called descent lemma (see [24]) will play an essential role in our forthcoming analysis.

Lemma 2 Let $f:\mathbb{R}^{m}\longrightarrow\mathbb{R}$ be Fréchet differentiable with $L$-Lipschitz continuous gradient. Then:

f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L}{2}\|y-x\|^{2},\quad\forall x,y\in\mathbb{R}^{m}.

Furthermore, the set of cluster points of a given sequence $\left(x_{n}\right)_{n\in\mathbb{N}}$ will be denoted by $\omega\left(\left(x_{n}\right)_{n\in\mathbb{N}}\right)$. At the same time, the distance function to a set $A\subseteq\mathbb{R}^{m}$ is defined as

\operatorname{dist}(x,A)=\inf_{y\in A}\|x-y\|\quad\text{ for all }x\in\mathbb{R}^{m}.

Our convergence result relies on the concept of a KL function. For $\eta\in(0,+\infty]$, we denote by $\Theta_{\eta}$ the class of concave and continuous functions $\varphi:[0,\eta)\rightarrow[0,+\infty)$ such that $\varphi(0)=0$, $\varphi$ is continuously differentiable on $(0,\eta)$, continuous at $0$, and $\varphi^{\prime}(s)>0$ for all $s\in(0,\eta)$.

Definition 1 (Kurdyka-Łojasiewicz property) Let $f:\mathbb{R}^{m}\rightarrow\mathbb{R}$ be a differentiable function. We say that $f$ satisfies the Kurdyka-Łojasiewicz (KL) property at $\bar{x}\in\mathbb{R}^{m}$ if there exist $\eta\in(0,+\infty]$, a neighborhood $U$ of $\bar{x}$ and a function $\varphi\in\Theta_{\eta}$ such that for all $x$ in the intersection:

U\cap\left\{x\in\mathbb{R}^{m}:f(\bar{x})<f(x)<f(\bar{x})+\eta\right\}

the following inequality holds:

\varphi^{\prime}(f(x)-f(\bar{x}))\,\|\nabla f(x)\|\geq 1.

If $f$ satisfies the KL property at each point in $\mathbb{R}^{m}$, then $f$ is called a KL function.
The function $\varphi$ is called a desingularizing function (see for instance [6]). The origins of this notion go back to the pioneering work of Łojasiewicz [22], where it is proved that for a real-analytic function $f:\mathbb{R}^{m}\rightarrow\mathbb{R}$ and a critical point $\bar{x}\in\mathbb{R}^{m}$ (that is, $\nabla f(\bar{x})=0$), there exists $\theta\in[1/2,1)$ such that the function $|f-f(\bar{x})|^{\theta}\|\nabla f\|^{-1}$ is bounded around $\bar{x}$. This corresponds to the situation when $\varphi(s)=C(1-\theta)^{-1}s^{1-\theta}$, where $C>0$ is a given constant, and leads to the following definition.

Definition 2 (for which we refer to [2, 8, 22]) A differentiable function $f:\mathbb{R}^{m}\longrightarrow\mathbb{R}$ has the Łojasiewicz property with exponent $\theta\in[0,1)$ at $\bar{x}\in\operatorname{crit}(f)$ if there exist $K,\varepsilon>0$ such that:

|f(x)-f(\bar{x})|^{\theta}\leq K\|\nabla f(x)\|, (8)

for every $x\in\mathbb{R}^{m}$ with $\|x-\bar{x}\|<\varepsilon$.
In the above definition, for $\theta=0$, we adopt the convention $0^{0}=0$, so that if $|f(x)-f(\bar{x})|^{0}=0$, then $f(x)=f(\bar{x})$ (see [2]).
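As a simple illustration (our example, not taken from the paper), a strongly convex quadratic satisfies (8) with exponent $\theta=\frac{1}{2}$: for $f(x)=\frac{\mu}{2}\|x\|^{2}$ with $\mu>0$ one has $\bar{x}=0$, $\nabla f(x)=\mu x$, and

|f(x)-f(\bar{x})|^{1/2}=\sqrt{\tfrac{\mu}{2}}\,\|x\|=\frac{1}{\sqrt{2\mu}}\,\|\nabla f(x)\|,

so (8) holds at $\bar{x}=0$ with $\theta=\frac{1}{2}$, $K=\frac{1}{\sqrt{2\mu}}$ and any $\varepsilon>0$.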

The result of Łojasiewicz allows the interpretation of the KL property as a reparametrization of the function values in order to avoid flatness around the critical points. Kurdyka [19] extended this property to differentiable functions definable in an o-minimal structure. Further extensions to the nonsmooth setting can be found in [3, 8-10].

To the class of KL functions belong semi-algebraic, real sub-analytic, semiconvex, uniformly convex, and convex functions satisfying a growth condition. We refer the reader to [2-4, 7-10] and the references therein for more details regarding all the classes mentioned above and for illustrating examples.

Finally, an important role in our convergence analysis will be played by the following uniformized KL property given in [7, Lemma 6].

Lemma 3 Let $\Omega\subseteq\mathbb{R}^{m}$ be a compact set and let $f:\mathbb{R}^{m}\rightarrow\mathbb{R}$ be a differentiable function. Assume that $f$ is constant on $\Omega$ and that $f$ satisfies the KL property at each point of $\Omega$. Then, there exist $\varepsilon,\eta>0$ and $\varphi\in\Theta_{\eta}$ such that for all $\bar{x}\in\Omega$ and for all $x$ in the intersection:

\left\{x\in\mathbb{R}^{m}:\operatorname{dist}(x,\Omega)<\varepsilon\right\}\cap\left\{x\in\mathbb{R}^{m}:f(\bar{x})<f(x)<f(\bar{x})+\eta\right\} (9)

the following inequality holds:

\varphi^{\prime}(f(x)-f(\bar{x}))\|\nabla f(x)\|\geq 1. (10)

The outline of the paper is the following. In Section 2, we give a sufficient condition that ensures the decrease property of the regularization $H$ along the iterates, and which also ensures that the iterates gap belongs to $l^{2}$. Further, using these results, we show that the set of cluster points of the iterates is included in the set of critical points of the objective function. Finally, by means of the KL property of $H$, we obtain that the iterates gap belongs to $l^{1}$; this implies the convergence of the iterates (see also [4, 7, 12, 20]). In Section 3, we obtain several convergence rates both for the sequences $\left(x_{n}\right)_{n\in\mathbb{N}}$, $\left(y_{n}\right)_{n\in\mathbb{N}}$ generated by the numerical scheme (6), and for the function values $g\left(x_{n}\right)$, $g\left(y_{n}\right)$, in terms of the Łojasiewicz exponent of $g$ and $H$, respectively (see [14, 17, 20] and also [1] for convergence rates under geometrical conditions). Finally, in Section 4, we present some numerical experiments which show that our algorithm, in many cases, has better properties than some well-known algorithms from the literature.

2 Convergence results

We start to investigate the convergence of the proposed algorithm by showing that $H$ is decreasing along certain sequences built upon the iterates generated by (6).

Theorem 4 Let $\left(x_{n}\right)_{n\in\mathbb{N}}$ and $\left(y_{n}\right)_{n\in\mathbb{N}}$ be the sequences generated by the numerical scheme (6) and for all $n\in\mathbb{N}$, $n\geq 1$, consider the sequences:

\begin{gathered}A_{n}=\frac{\left(2-\beta_{n}L_{g}\right)\left(1+\alpha_{n+1}\right)^{2}-2\alpha_{n+1}\left(1+\alpha_{n+1}\right)}{2\beta_{n}},\\ C_{n}=\frac{\left(2-\beta_{n}L_{g}\right)\alpha_{n}\left(1+\alpha_{n+1}\right)-\alpha_{n}\alpha_{n+1}}{2\beta_{n}}\end{gathered}

and

\delta_{n}=A_{n-1}+C_{n-1}.

Then, there exists $N\in\mathbb{N}$ such that:
(i) The sequence $\left(g\left(y_{n}\right)+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}\right)_{n\geq N}$ is decreasing and $\delta_{n}>0$ for all $n\geq N$.

Assume further that $g$ is bounded from below. Then,
(ii) The sequence $\left(g\left(y_{n}\right)+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}\right)_{n\geq N}$ is convergent;
(iii) $\sum_{n\geq 1}\left\|x_{n}-x_{n-1}\right\|^{2}<+\infty$.

Proof By applying the descent Lemma 2 to gg, we have:

g(yn+1)g(yn)+g(yn),yn+1yn+Lg2yn+1yn2.g\left(y_{n+1}\right)\leq g\left(y_{n}\right)+\left\langle\nabla g\left(y_{n}\right),y_{n+1}-y_{n}\right\rangle+\frac{L_{g}}{2}\left\|y_{n+1}-y_{n}\right\|^{2}.

However, rewriting the second equation in (6) as $\nabla g\left(y_{n}\right)=\frac{1}{\beta_{n}}\left(y_{n}-x_{n+1}\right)$ and taking the inner product with $y_{n+1}-y_{n}$, we obtain:

g(yn),yn+1yn=1βnynxn+1,yn+1yn,\left\langle\nabla g\left(y_{n}\right),y_{n+1}-y_{n}\right\rangle=\frac{1}{\beta_{n}}\left\langle y_{n}-x_{n+1},y_{n+1}-y_{n}\right\rangle,

the descent inequality becomes:

g(yn+1)Lg2yn+1yn2g(yn)+1βnynxn+1,yn+1yn\displaystyle g\left(y_{n+1}\right)-\frac{L_{g}}{2}\left\|y_{n+1}-y_{n}\right\|^{2}\leq g\left(y_{n}\right)+\frac{1}{\beta_{n}}\left\langle y_{n}-x_{n+1},y_{n+1}-y_{n}\right\rangle (11)
=g(yn)+1βn(yn+1yn2+yn+1xn+1,yn+1yn)\displaystyle\quad=g\left(y_{n}\right)+\frac{1}{\beta_{n}}\left(-\left\|y_{n+1}-y_{n}\right\|^{2}+\left\langle y_{n+1}-x_{n+1},y_{n+1}-y_{n}\right\rangle\right)

Further,

yn+1yn=(1+αn+1)(xn+1xn)αn(xnxn1)y_{n+1}-y_{n}=\left(1+\alpha_{n+1}\right)\left(x_{n+1}-x_{n}\right)-\alpha_{n}\left(x_{n}-x_{n-1}\right)

and

yn+1xn+1=αn+1(xn+1xn),y_{n+1}-x_{n+1}=\alpha_{n+1}\left(x_{n+1}-x_{n}\right),

hence:

g(yn+1)+(1βnLg2)yn+1yn2g(yn)+αn+1βnxn+1xn,yn+1yn.g\left(y_{n+1}\right)+\left(\frac{1}{\beta_{n}}-\frac{L_{g}}{2}\right)\left\|y_{n+1}-y_{n}\right\|^{2}\leq g\left(y_{n}\right)+\frac{\alpha_{n+1}}{\beta_{n}}\left\langle x_{n+1}-x_{n},y_{n+1}-y_{n}\right\rangle. (12)

Thus, we have:

yn+1yn2=(1+αn+1)(xn+1xn)αn(xnxn1)2=(1+αn+1)2xn+1xn2+αn2xnxn122αn(1+αn+1)xn+1xn,xnxn1,\begin{gathered}\left\|y_{n+1}-y_{n}\right\|^{2}=\left\|\left(1+\alpha_{n+1}\right)\left(x_{n+1}-x_{n}\right)-\alpha_{n}\left(x_{n}-x_{n-1}\right)\right\|^{2}\\ =\left(1+\alpha_{n+1}\right)^{2}\left\|x_{n+1}-x_{n}\right\|^{2}+\alpha_{n}^{2}\left\|x_{n}-x_{n-1}\right\|^{2}-2\alpha_{n}\left(1+\alpha_{n+1}\right)\left\langle x_{n+1}-x_{n},x_{n}-x_{n-1}\right\rangle,\end{gathered}

and

xn+1xn,yn+1yn=xn+1xn,(1+αn+1)(xn+1xn)αn(xnxn1)=(1+αn+1)xn+1xn2αnxn+1xn,xnxn1\begin{gathered}\left\langle x_{n+1}-x_{n},y_{n+1}-y_{n}\right\rangle=\left\langle x_{n+1}-x_{n},\left(1+\alpha_{n+1}\right)\left(x_{n+1}-x_{n}\right)-\alpha_{n}\left(x_{n}-x_{n-1}\right)\right\rangle\\ =\left(1+\alpha_{n+1}\right)\left\|x_{n+1}-x_{n}\right\|^{2}-\alpha_{n}\left\langle x_{n+1}-x_{n},x_{n}-x_{n-1}\right\rangle\end{gathered}

Replacing the above equalities in (12) gives:

g(yn+1)+(2βnLg)(1+αn+1)22αn+1(1+αn+1)2βnxn+1xn2g(yn)(2βnLg)αn22βnxnxn12+(2βnLg)αn(1+αn+1)αnαn+1βnxn+1xn,xnxn1.\begin{gathered}g\left(y_{n+1}\right)+\frac{\left(2-\beta_{n}L_{g}\right)\left(1+\alpha_{n+1}\right)^{2}-2\alpha_{n+1}\left(1+\alpha_{n+1}\right)}{2\beta_{n}}\left\|x_{n+1}-x_{n}\right\|^{2}\leq\\ g\left(y_{n}\right)-\frac{\left(2-\beta_{n}L_{g}\right)\alpha_{n}^{2}}{2\beta_{n}}\left\|x_{n}-x_{n-1}\right\|^{2}+\\ \frac{\left(2-\beta_{n}L_{g}\right)\alpha_{n}\left(1+\alpha_{n+1}\right)-\alpha_{n}\alpha_{n+1}}{\beta_{n}}\left\langle x_{n+1}-x_{n},x_{n}-x_{n-1}\right\rangle.\end{gathered}

The above inequality motivates the introduction of the following notations:

Bn=(2βnLg)αn22βnB_{n}=\frac{\left(2-\beta_{n}L_{g}\right)\alpha_{n}^{2}}{2\beta_{n}}

and

Δn=An1+Bn+Cn1+Cn\Delta_{n}=A_{n-1}+B_{n}+C_{n-1}+C_{n} (13)

for all n,n1n\in\mathbb{N},n\geq 1.

In terms of these notations, after using the equality:

2xn+1xn,xnxn1=xn+1xn12xn+1xn2xnxn12,2\left\langle x_{n+1}-x_{n},x_{n}-x_{n-1}\right\rangle=\left\|x_{n+1}-x_{n-1}\right\|^{2}-\left\|x_{n+1}-x_{n}\right\|^{2}-\left\|x_{n}-x_{n-1}\right\|^{2},

we can write:

Cnxn+1xn12+g(yn+1)+(An+Cn)xn+1xn2g(yn)+(CnBn)xnxn12.-C_{n}\left\|x_{n+1}-x_{n-1}\right\|^{2}+g\left(y_{n+1}\right)+\left(A_{n}+C_{n}\right)\left\|x_{n+1}-x_{n}\right\|^{2}\leq g\left(y_{n}\right)+\left(-C_{n}-B_{n}\right)\left\|x_{n}-x_{n-1}\right\|^{2}. (14)

Consequently, we have:

\begin{array}{r}-C_{n}\left\|x_{n+1}-x_{n-1}\right\|^{2}+\Delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}\leq\left(g\left(y_{n}\right)+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}\right)\\ -\left(g\left(y_{n+1}\right)+\delta_{n+1}\left\|x_{n+1}-x_{n}\right\|^{2}\right).\end{array} (15)

Now, since $\alpha_{n}\longrightarrow\alpha$, $\beta_{n}\longrightarrow\beta$ as $n\longrightarrow+\infty$, and $\alpha\in\left(\frac{-10+\sqrt{68}}{8},0\right)$, $0<\beta<\frac{4\alpha^{2}+10\alpha+2}{L_{g}(2\alpha+1)^{2}}$, we have:

\begin{aligned}\lim_{n\longrightarrow+\infty}A_{n}&=\frac{\left(2-\beta L_{g}\right)(\alpha+1)^{2}+2\alpha-2\alpha^{2}}{2\beta},\\ \lim_{n\longrightarrow+\infty}B_{n}&=\frac{\left(2-\beta L_{g}\right)\alpha^{2}}{2\beta},\\ \lim_{n\longrightarrow+\infty}C_{n}&=\frac{\left(2-\beta L_{g}\right)\alpha(1+\alpha)-\alpha^{2}}{2\beta}<0,\\ \lim_{n\longrightarrow+\infty}\Delta_{n}&=\frac{\left(2-\beta L_{g}\right)(2\alpha+1)^{2}+2\alpha-4\alpha^{2}}{2\beta}>0,\\ \lim_{n\longrightarrow+\infty}\delta_{n}&=\frac{\left(2-\beta L_{g}\right)\left(2\alpha^{2}+3\alpha+1\right)+2\alpha-3\alpha^{2}}{2\beta}>0.\end{aligned}
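To see where the upper bound $0<\beta<\frac{4\alpha^{2}+10\alpha+2}{L_{g}(2\alpha+1)^{2}}$ comes from, note that, by a direct computation on the displayed limit, the positivity of $\lim_{n\longrightarrow+\infty}\Delta_{n}$ is equivalent to

\left(2-\beta L_{g}\right)(2\alpha+1)^{2}+2\alpha-4\alpha^{2}>0\iff\beta L_{g}(2\alpha+1)^{2}<2(2\alpha+1)^{2}+2\alpha-4\alpha^{2}=4\alpha^{2}+10\alpha+2,

and the requirement $4\alpha^{2}+10\alpha+2>0$ is precisely what restricts $\alpha$ to the interval $\left(\frac{-10+\sqrt{68}}{8},0\right)$.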

Hence, there exist $N\in\mathbb{N}$ and $C>0$, $D>0$ such that for all $n\geq N$ one has:

C_{n}\leq-C,\quad\Delta_{n}\geq D\quad\text{ and }\quad\delta_{n}>0,

which, in view of (15), shows (i); that is, the sequence $g\left(y_{n}\right)+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}$ is decreasing for $n\geq N$.

By using (15) again, we obtain:

0<\displaystyle 0< Cxn+1xn12+Dxnxn12(g(yn)+δnxnxn12)\displaystyle C\left\|x_{n+1}-x_{n-1}\right\|^{2}+D\left\|x_{n}-x_{n-1}\right\|^{2}\leq\left(g\left(y_{n}\right)+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}\right)
(g(yn+1)+δn+1xn+1xn2)\displaystyle-\left(g\left(y_{n+1}\right)+\delta_{n+1}\left\|x_{n+1}-x_{n}\right\|^{2}\right)

for all nNn\geq N, or, more convenient, that:

0<Dxnxn12(g(yn)+δnxnxn12)(g(yn+1)+δn+1xn+1xn2),0<D\left\|x_{n}-x_{n-1}\right\|^{2}\leq\left(g\left(y_{n}\right)+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}\right)-\left(g\left(y_{n+1}\right)+\delta_{n+1}\left\|x_{n+1}-x_{n}\right\|^{2}\right), (16)

for all nNn\geq N. Let r>Nr>N. Summing up the latter relations gives:

Dn=Nrxnxn12(g(yN)+δNxNxN12)(g(yr+1)+δr+1xr+1xr2)D\sum_{n=N}^{r}\left\|x_{n}-x_{n-1}\right\|^{2}\leq\left(g\left(y_{N}\right)+\delta_{N}\left\|x_{N}-x_{N-1}\right\|^{2}\right)-\left(g\left(y_{r+1}\right)+\delta_{r+1}\left\|x_{r+1}-x_{r}\right\|^{2}\right)

which leads to:

g(yr+1)+Dn=Nrxnxn12g(yN)+δNxNxN12.g\left(y_{r+1}\right)+D\sum_{n=N}^{r}\left\|x_{n}-x_{n-1}\right\|^{2}\leq g\left(y_{N}\right)+\delta_{N}\left\|x_{N}-x_{N-1}\right\|^{2}. (17)

Now, taking into account that gg is bounded from below, after letting r+r\longrightarrow+\infty we obtain:

n=Nxnxn12<+\sum_{n=N}^{\infty}\left\|x_{n}-x_{n-1}\right\|^{2}<+\infty

which proves (iii).
This also shows that:

limn+xnxn12=0,\lim_{n\longrightarrow+\infty}\left\|x_{n}-x_{n-1}\right\|^{2}=0,

hence

limn+δnxnxn12=0.\lim_{n\rightarrow+\infty}\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}=0.

But then, using again the fact that gg is bounded from below, we have that the sequence g(yn)+δnxnxn12g\left(y_{n}\right)+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2} is bounded from below and also decreasing (see (i)) for nNn\geq N; hence, there exists:

limn+g(yn)+δnxnxn12.\lim_{n\rightarrow+\infty}g\left(y_{n}\right)+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}\in\mathbb{R}.

Remark 5 By introducing the sequence:

un=2δn(xnxn1)+yn,n1,u_{n}=\sqrt{2\delta_{n}}\cdot\left(x_{n}-x_{n-1}\right)+y_{n},\quad n\geq 1, (18)

one can easily observe that the statements of Theorem 4 can be expressed in terms of the regularization of the objective function, since $H\left(y_{n},u_{n}\right)=g\left(y_{n}\right)+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}$ for all $n\in\mathbb{N}$, $n\geq 1$.
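Indeed, the identity is a one-line computation from (7) and (18):

H\left(y_{n},u_{n}\right)=g\left(y_{n}\right)+\frac{1}{2}\left\|u_{n}-y_{n}\right\|^{2}=g\left(y_{n}\right)+\frac{1}{2}\left\|\sqrt{2\delta_{n}}\left(x_{n}-x_{n-1}\right)\right\|^{2}=g\left(y_{n}\right)+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}.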

An interesting fact is that for the sequence $\left(H\left(y_{n},u_{n}\right)\right)_{n\geq N}$ to be decreasing one does not need the boundedness from below of the objective function $g$, but only its regularity, as can be seen in the proof of Theorem 4. The energy decay is thus a structural property of the algorithm (6), and only the existence of the limit requires that the objective function be bounded from below.

Remark 6 Observe that conclusion (iii) of Theorem 4 ensures that the sequence $\left(x_{n}-x_{n-1}\right)_{n\in\mathbb{N}}$ belongs to $l^{2}$, and in particular that:

limn+(xnxn1)=0.\lim_{n\longrightarrow+\infty}\left(x_{n}-x_{n-1}\right)=0. (19)

Lemma 7 In the framework of the optimization problem (P), assume that the objective function $g$ is bounded from below, consider the sequences $\left(x_{n}\right)_{n\in\mathbb{N}}$ and $\left(y_{n}\right)_{n\in\mathbb{N}}$ generated by the numerical algorithm (6), and let $\left(u_{n}\right)_{n\in\mathbb{N}}$ be defined by (18). Then, the following statements are valid:
(i) The sets of cluster points of $\left(x_{n}\right)_{n\in\mathbb{N}}$, $\left(y_{n}\right)_{n\in\mathbb{N}}$ and $\left(u_{n}\right)_{n\in\mathbb{N}}$ coincide and are contained in the set of critical points of $g$, i.e.:

\omega\left(\left(x_{n}\right)_{n\in\mathbb{N}}\right)=\omega\left(\left(y_{n}\right)_{n\in\mathbb{N}}\right)=\omega\left(\left(u_{n}\right)_{n\in\mathbb{N}}\right)\subseteq\operatorname{crit}(g);

(ii) \omega\left(\left(y_{n},u_{n}\right)_{n\in\mathbb{N}}\right)\subseteq\operatorname{crit}(H)=\left\{(x,x)\in\mathbb{R}^{m}\times\mathbb{R}^{m}:x\in\operatorname{crit}(g)\right\}.

Proof (i) We start by proving ω((xn)n)ω((un)n)\omega\left(\left(x_{n}\right)_{n\in\mathbb{N}}\right)\subseteq\omega\left(\left(u_{n}\right)_{n\in\mathbb{N}}\right) and ω((xn)n)ω((yn)n)\omega\left(\left(x_{n}\right)_{n\in\mathbb{N}}\right)\subseteq\omega\left(\left(y_{n}\right)_{n\in\mathbb{N}}\right). Bearing in mind that limn(xnxn1)=0\lim_{n\longrightarrow\infty}\left(x_{n}-x_{n-1}\right)=0 and that the sequences (δn)n,(αn)n\left(\delta_{n}\right)_{n\in\mathbb{N}},\left(\alpha_{n}\right)_{n\in\mathbb{N}}, and (βn)n\left(\beta_{n}\right)_{n\in\mathbb{N}} are convergent, the conclusion is quite straightforward. Indeed, if x¯ω((xn)n)\bar{x}\in\omega\left(\left(x_{n}\right)_{n\in\mathbb{N}}\right) and (xnk)k\left(x_{n_{k}}\right)_{k\in\mathbb{N}} is a subsequence such that limkxnk=x¯\lim_{k\longrightarrow\infty}x_{n_{k}}=\bar{x}, then:

limkynk=limkxnk+limkαnklimk(xnkxnk1)\lim_{k\longrightarrow\infty}y_{n_{k}}=\lim_{k\longrightarrow\infty}x_{n_{k}}+\lim_{k\longrightarrow\infty}\alpha_{n_{k}}\cdot\lim_{k\longrightarrow\infty}\left(x_{n_{k}}-x_{n_{k}-1}\right)

and

\lim_{k\longrightarrow\infty}u_{n_{k}}=\lim_{k\longrightarrow\infty}\sqrt{2\delta_{n_{k}}}\cdot\lim_{k\longrightarrow\infty}\left(x_{n_{k}}-x_{n_{k}-1}\right)+\lim_{k\longrightarrow\infty}y_{n_{k}}

imply that the sequences (xnk)k,(ynk)k\left(x_{n_{k}}\right)_{k\in\mathbb{N}},\left(y_{n_{k}}\right)_{k\in\mathbb{N}} and (unk)k\left(u_{n_{k}}\right)_{k\in\mathbb{N}} all converge to the same element x¯m\bar{x}\in\mathbb{R}^{m}. The reverse inclusions follow in a very similar manner from the definitions of unu_{n} and yny_{n}.
In order to prove that ω((xn)n)crit(g)\omega\left(\left(x_{n}\right)_{n\in\mathbb{N}}\right)\subseteq\operatorname{crit}(g), we use the fact that g\nabla g is a continuous operator. So, passing to the limit in g(ynk)=1βnk(ynkxnk+1)\nabla g\left(y_{n_{k}}\right)=\frac{1}{\beta_{n_{k}}}\cdot\left(y_{n_{k}}-x_{n_{k}+1}\right) and taking into account that limk+βnk=β>0\lim_{k\rightarrow+\infty}\beta_{n_{k}}=\beta>0, we have:

g(x¯)\displaystyle\nabla g(\bar{x}) =limkg(ynk)\displaystyle=\lim_{k\longrightarrow\infty}\nabla g\left(y_{n_{k}}\right)
=1limkβnklimk(ynkxnk+1)\displaystyle=\frac{1}{\lim_{k\longrightarrow\infty}\beta_{n_{k}}}\cdot\lim_{k\longrightarrow\infty}\left(y_{n_{k}}-x_{n_{k}+1}\right)

and finally, as ynkxnk+1=(xnkxnk+1)+αnk(xnkxnk1)y_{n_{k}}-x_{n_{k}+1}=\left(x_{n_{k}}-x_{n_{k}+1}\right)+\alpha_{n_{k}}\cdot\left(x_{n_{k}}-x_{n_{k}-1}\right), we obtain:

g(x¯)=0.\nabla g(\bar{x})=0.

For proving the statement (ii), we rely on a direct computation yielding:

H(x,y)=(g(x)+(xy),(yx)),\nabla H(x,y)=(\nabla g(x)+(x-y),(y-x)), (20)

which, in turn, gives

crit(H)={(x¯,x¯)m×m:x¯crit(g)}\operatorname{crit}(H)=\left\{(\bar{x},\bar{x})\in\mathbb{R}^{m}\times\mathbb{R}^{m}:\bar{x}\in\operatorname{crit}(g)\right\}

and allows us to apply (i) to obtain the desired conclusion.
Some direct consequences of Theorem 4 (ii) and Lemma 7 are the following.
Fact 8 In the setting of Lemma 7, let $(\bar{x},\bar{x})\in\omega\left(\left(y_{n},u_{n}\right)_{n\in\mathbb{N}}\right)$. It follows that $\bar{x}\in\operatorname{crit}(g)$ and:

limnH(yn,un)=H(x¯,x¯).\lim_{n\longrightarrow\infty}H\left(y_{n},u_{n}\right)=H(\bar{x},\bar{x}).

Consequently,

HH is finite and constant on the set ω((yn,un)n)\omega\left(\left(y_{n},u_{n}\right)_{n\in\mathbb{N}}\right).

The arguments behind the proofs of the following two facts are the same as those in Lemma 13 from [20].

Fact 9 If the assumptions of Lemma 7 hold true and the sequence $\left(x_{n}\right)_{n\in\mathbb{N}}$ is bounded, then the following conclusions hold:

(i) ω((yn,un)n)\omega\left(\left(y_{n},u_{n}\right)_{n\in\mathbb{N}}\right) is nonempty and compact,
(ii) limn+dist((yn,un),ω((yn,un)n))=0\lim_{n\longrightarrow+\infty}\operatorname{dist}\left(\left(y_{n},u_{n}\right),\omega\left(\left(y_{n},u_{n}\right)_{n\in\mathbb{N}}\right)\right)=0.

Remark 10 We emphasize that if gg is coercive, that is limx+g(x)=+\lim_{\|x\|\rightarrow+\infty}g(x)=+\infty, then gg is bounded from below and (xn)n,(yn)n\left(x_{n}\right)_{n\in\mathbb{N}},\left(y_{n}\right)_{n\in\mathbb{N}}, the sequences generated by (6), are bounded.

Indeed, notice that $g$ is bounded from below, being a continuous and coercive function (see [27]). Note that, according to Theorem 4, the sum $D\sum_{n=N}^{r}\left\|x_{n}-x_{n-1}\right\|^{2}$ is convergent as $r\longrightarrow+\infty$, hence bounded. Consequently, from (17) it follows that, for every $r>N$ ($N$ being defined in the hypothesis of Theorem 4), $y_{r}$ is contained in a lower level set of $g$, which is bounded. Since $\left(y_{n}\right)_{n\in\mathbb{N}}$ is bounded, taking into account (19), it follows that $\left(x_{n}\right)_{n\in\mathbb{N}}$ is also bounded.

Now, based on the conclusions of Lemma 7, we present a result which will be crucial later on. For our next result, 1\|\cdot\|_{1} will denote the 1 -norm and 2\|\cdot\|_{2} will represent the 2-norm on the linear space m×m\mathbb{R}^{m}\times\mathbb{R}^{m}.

Lemma 11 Let H,(xn)n,(yn)nH,\left(x_{n}\right)_{n\in\mathbb{N}},\left(y_{n}\right)_{n\in\mathbb{N}}, and (un)n\left(u_{n}\right)_{n\in\mathbb{N}} be as in all the previous results, with the mapping gg bounded from below. Then, the following gradient inequalities hold true:

H(yn,un)2H(yn,un)11βnxn+1xn+[(αnβn)+22δn]xnxn1\left\|\nabla H\left(y_{n},u_{n}\right)\right\|_{2}\leq\left\|\nabla H\left(y_{n},u_{n}\right)\right\|_{1}\leq\frac{1}{\beta_{n}}\cdot\left\|x_{n+1}-x_{n}\right\|+\left[\left(\frac{\alpha_{n}}{\beta_{n}}\right)+2\sqrt{2\delta_{n}}\right]\cdot\left\|x_{n}-x_{n-1}\right\| (21)

and

H(yn,un)222βn2xn+1xn2+2[(αnβn2δn)2+δn]xnxn12.\left\|\nabla H\left(y_{n},u_{n}\right)\right\|_{2}^{2}\leq\frac{2}{\beta_{n}^{2}}\cdot\left\|x_{n+1}-x_{n}\right\|^{2}+2\left[\left(\frac{\alpha_{n}}{\beta_{n}}-\sqrt{2\delta_{n}}\right)^{2}+\delta_{n}\right]\cdot\left\|x_{n}-x_{n-1}\right\|^{2}. (22)

Proof First of all note that from our numerical scheme (6) we have g(yn)=1βn((xnxn+1)+αn(xnxn1))\nabla g\left(y_{n}\right)=\frac{1}{\beta_{n}}\left(\left(x_{n}-x_{n+1}\right)+\alpha_{n}\left(x_{n}-x_{n-1}\right)\right). In terms of the 1\|\cdot\|_{1} on m×m\mathbb{R}^{m}\times\mathbb{R}^{m}, we have:

H(yn,un)1\displaystyle\left\|\nabla H\left(y_{n},u_{n}\right)\right\|_{1} =(g(yn)+(ynun),(unyn))1\displaystyle=\left\|\left(\nabla g\left(y_{n}\right)+\left(y_{n}-u_{n}\right),\left(u_{n}-y_{n}\right)\right)\right\|_{1}
=g(yn)+(ynun)+unyn\displaystyle=\left\|\nabla g\left(y_{n}\right)+\left(y_{n}-u_{n}\right)\right\|+\left\|u_{n}-y_{n}\right\|
1βnxn+1xn+αnβnxnxn1+22δnxnxn1,\displaystyle\leq\frac{1}{\beta_{n}}\left\|x_{n+1}-x_{n}\right\|+\frac{\alpha_{n}}{\beta_{n}}\left\|x_{n}-x_{n-1}\right\|+2\sqrt{2\delta_{n}}\left\|x_{n}-x_{n-1}\right\|,

which proves the desired inequality.

Now, with respect to the Euclidean norm, similar arguments yield:

H(yn,un)22\displaystyle\left\|\nabla H\left(y_{n},u_{n}\right)\right\|_{2}^{2} =g(yn)+(ynun)2+unyn2\displaystyle=\left\|\nabla g\left(y_{n}\right)+\left(y_{n}-u_{n}\right)\right\|^{2}+\left\|u_{n}-y_{n}\right\|^{2}
=g(yn)2δn(xnxn1)2+(2δn)2xnxn12\displaystyle=\left\|\nabla g\left(y_{n}\right)-\sqrt{2\delta_{n}}\left(x_{n}-x_{n-1}\right)\right\|^{2}+\left(\sqrt{2\delta_{n}}\right)^{2}\cdot\left\|x_{n}-x_{n-1}\right\|^{2}
2βn2xn+1xn2+2(αnβn2δn)(xnxn1)2+2δnxnxn12,\displaystyle\leq\frac{2}{\beta_{n}^{2}}\left\|x_{n+1}-x_{n}\right\|^{2}+2\left\|\left(\frac{\alpha_{n}}{\beta_{n}}-\sqrt{2\delta_{n}}\right)\cdot\left(x_{n}-x_{n-1}\right)\right\|^{2}+2\delta_{n}\cdot\left\|x_{n}-x_{n-1}\right\|^{2},

completing the proof.

Our main result concerning the convergence of the sequence (xn)n\left(x_{n}\right)_{n\in\mathbb{N}} generated by the algorithm (6) to a critical point of the objective function gg is the following.

Theorem 12 Consider the sequences $\left(x_{n}\right)_{n\in\mathbb{N}}$, $\left(y_{n}\right)_{n\in\mathbb{N}}$ generated by the algorithm (6), and let the objective function $g$ be bounded from below. If the sequence $\left(x_{n}\right)_{n\in\mathbb{N}}$ is bounded and $H$ is a KL function, then:

n=1xnxn1<+\sum_{n=1}^{\infty}\left\|x_{n}-x_{n-1}\right\|<+\infty (23)

and there exists an element x¯crit(g)\bar{x}\in\operatorname{crit}(g) such that limn+xn=x¯\lim_{n\longrightarrow+\infty}x_{n}=\bar{x}.

Proof Consider (x¯,x¯)(\bar{x},\bar{x}) from the set ω((yn,un)n)\omega\left(\left(y_{n},u_{n}\right)_{n\in\mathbb{N}}\right) under the assumptions of Lemma 7. It follows that x¯crit(g)\bar{x}\in\operatorname{crit}(g). Also, using Fact 8, we get that limnH(yn,un)=H(x¯,x¯)\lim_{n\longrightarrow\infty}H\left(y_{n},u_{n}\right)=H(\bar{x},\bar{x}). Furthermore, we consider two cases:
I. By using NN from Theorem 4, assume that there exists n¯N\bar{n}\geq N, with n¯\bar{n}\in\mathbb{N}, such that H(yn¯,un¯)=H(x¯,x¯)H\left(y_{\bar{n}},u_{\bar{n}}\right)=H(\bar{x},\bar{x}). Then, since (H(yn,un))nN\left(H\left(y_{n},u_{n}\right)\right)_{n\geq N} is a decreasing sequence, it follows that:

H(yn,un)=H(x¯,x¯), for every nn¯,H\left(y_{n},u_{n}\right)=H(\bar{x},\bar{x}),\text{ for every }n\geq\bar{n},

Now, using (16), we get that for every nn¯n\geq\bar{n} we have the following inequality:

0Dxnxn12H(yn,un)H(yn+1,un+1)=H(x¯,x¯)H(x¯,x¯)=0.0\leq D\left\|x_{n}-x_{n-1}\right\|^{2}\leq H\left(y_{n},u_{n}\right)-H\left(y_{n+1},u_{n+1}\right)=H(\bar{x},\bar{x})-H(\bar{x},\bar{x})=0.

So, the sequence (xn)nn¯\left(x_{n}\right)_{n\geq\bar{n}} is constant and the conclusion holds true.
II. Now, we deal with the case when H(yn,un)>H(x¯,x¯)H\left(y_{n},u_{n}\right)>H(\bar{x},\bar{x}), for every nNn\geq N.

So, consider the set $\Omega:=\omega\left(\left(y_{n},u_{n}\right)_{n\in\mathbb{N}}\right)$. From Fact 9, we have that the set $\Omega$ is nonempty and compact. Also, Fact 8 assures that the mapping $H$ is constant on $\Omega$. From the hypotheses of the theorem, we have that $H$ is a KL function. So, according to Lemma 3, there exist $\varepsilon>0$, $\eta>0$ and a function $\varphi\in\Theta_{\eta}$ such that for all points $(z,w)$ from the set:
\left\{(z,w)\in\mathbb{R}^{m}\times\mathbb{R}^{m}:\operatorname{dist}((z,w),\Omega)<\varepsilon\right\}\cap\left\{(z,w)\in\mathbb{R}^{m}\times\mathbb{R}^{m}:H(\bar{x},\bar{x})<H(z,w)<\eta+H(\bar{x},\bar{x})\right\}
one has that:

φ(H(z,w)H(x¯,x¯))H(z,w)1.\varphi^{\prime}(H(z,w)-H(\bar{x},\bar{x}))\cdot\|\nabla H(z,w)\|\geq 1.

On the other hand, using Fact 9 , we obtain that limn+dist((yn,un),Ω)=0\lim_{n\longrightarrow+\infty}\operatorname{dist}\left(\left(y_{n},u_{n}\right),\Omega\right)=0. This means that there exists an index n1n_{1}\in\mathbb{N}, for which it is valid:

dist((yn,un),Ω)<ε, for all nn1.\operatorname{dist}\left(\left(y_{n},u_{n}\right),\Omega\right)<\varepsilon,\text{ for all }n\geq n_{1}.

Let us introduce the notation:

rn:=H(yn,un)H(x¯,x¯)r_{n}:=H\left(y_{n},u_{n}\right)-H(\bar{x},\bar{x})

Because

limn+rn=0\lim_{n\longrightarrow+\infty}r_{n}=0

and since

rn>0, for all nNr_{n}>0,\text{ for all }n\geq N

then there exists another index n2Nn_{2}\geq N, such that:

0<rn<η, for every nn20<r_{n}<\eta,\text{ for every }n\geq n_{2}

Taking n¯:=max(n1,n2)\bar{n}:=\max\left(n_{1},n_{2}\right) we get that for each nn¯n\geq\bar{n} it follows:

φ(rn)H(yn,un)1\varphi^{\prime}\left(r_{n}\right)\cdot\left\|\nabla H\left(y_{n},u_{n}\right)\right\|\geq 1

Since the function φ\varphi is concave, we have:

φ(rn)φ(rn+1)φ(rn)(rnrn+1)\varphi\left(r_{n}\right)-\varphi\left(r_{n+1}\right)\geq\varphi^{\prime}\left(r_{n}\right)\cdot\left(r_{n}-r_{n+1}\right)

Thus, the following relation takes place for each nn¯n\geq\bar{n} :

φ(rn)φ(rn+1)rnrn+1H(yn,un)\varphi\left(r_{n}\right)-\varphi\left(r_{n+1}\right)\geq\frac{r_{n}-r_{n+1}}{\left\|\nabla H\left(y_{n},u_{n}\right)\right\|}

On one hand, combining the inequality (16) and (21), it follows that for every nn¯n\geq\bar{n}

φ(rn)φ(rn+1)Dxnxn121βnxnxn+1+[αnβn+22δn]xnxn1\varphi\left(r_{n}\right)-\varphi\left(r_{n+1}\right)\geq\frac{D\left\|x_{n}-x_{n-1}\right\|^{2}}{\frac{1}{\beta_{n}}\left\|x_{n}-x_{n+1}\right\|+\left[\frac{\alpha_{n}}{\beta_{n}}+2\sqrt{2\delta_{n}}\right]\cdot\left\|x_{n}-x_{n-1}\right\|} (24)

On the other hand, we know that the sequences (αn)n,(βn)n\left(\alpha_{n}\right)_{n\in\mathbb{N}},\left(\beta_{n}\right)_{n\in\mathbb{N}} and (δn)n\left(\delta_{n}\right)_{n\in\mathbb{N}} are convergent, and limn+βn=β>0\lim_{n\rightarrow+\infty}\beta_{n}=\beta>0, hence (1βn)n\left(\frac{1}{\beta_{n}}\right)_{n\in\mathbb{N}} and (αnβn+22δn)n\left(\frac{\alpha_{n}}{\beta_{n}}+2\sqrt{2\delta_{n}}\right)_{n\in\mathbb{N}} are bounded. This shows that there exists N¯,N¯n¯\bar{N}\in\mathbb{N},\bar{N}\geq\bar{n} and there exists M>0M>0, such that:

supnN¯{1βn,αnβn+22δn}M\sup_{n\geq\bar{N}}\left\{\frac{1}{\beta_{n}},\frac{\alpha_{n}}{\beta_{n}}+2\sqrt{2\delta_{n}}\right\}\leq M

Thus, the inequality (24) becomes:

φ(rn)φ(rn+1)Dxnxn12M(xnxn+1+xnxn1),\varphi\left(r_{n}\right)-\varphi\left(r_{n+1}\right)\geq\frac{D\left\|x_{n}-x_{n-1}\right\|^{2}}{M\left(\left\|x_{n}-x_{n+1}\right\|+\left\|x_{n}-x_{n-1}\right\|\right)}, (25)

for every nN¯n\geq\bar{N}. This implies that for each nN¯n\geq\bar{N}, the following inequality holds:

xnxn1MD(φ(rn)φ(rn+1))(xnxn+1+xnxn1)\left\|x_{n}-x_{n-1}\right\|\leq\sqrt{\frac{M}{D}\cdot\left(\varphi\left(r_{n}\right)-\varphi\left(r_{n+1}\right)\right)\cdot\left(\left\|x_{n}-x_{n+1}\right\|+\left\|x_{n}-x_{n-1}\right\|\right)}

From the well-known arithmetical-geometrical inequality, it follows that:

\begin{aligned}&\sqrt{\frac{M}{D}\cdot\left(\varphi\left(r_{n}\right)-\varphi\left(r_{n+1}\right)\right)\cdot\left(\left\|x_{n}-x_{n+1}\right\|+\left\|x_{n}-x_{n-1}\right\|\right)}\\ &\qquad\leq\frac{\left\|x_{n+1}-x_{n}\right\|+\left\|x_{n}-x_{n-1}\right\|}{3}+\frac{3M}{4D}\cdot\left(\varphi\left(r_{n}\right)-\varphi\left(r_{n+1}\right)\right).\end{aligned}
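For completeness, the precise instance of the arithmetic-geometric mean inequality used here is $\sqrt{ab}\leq\theta a+\frac{b}{4\theta}$, valid for any $\theta>0$ since $\left(\sqrt{\theta a}-\sqrt{b/(4\theta)}\right)^{2}\geq 0$, applied with $\theta=\frac{1}{3}$, $a=\left\|x_{n+1}-x_{n}\right\|+\left\|x_{n}-x_{n-1}\right\|$ and $b=\frac{M}{D}\left(\varphi\left(r_{n}\right)-\varphi\left(r_{n+1}\right)\right)$.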

Therefore, we obtain:

xnxn1xn+1xn+xnxn13+3M4D(φ(rn)φ(rn+1)).\left\|x_{n}-x_{n-1}\right\|\leq\frac{\left\|x_{n+1}-x_{n}\right\|+\left\|x_{n}-x_{n-1}\right\|}{3}+\frac{3M}{4D}\cdot\left(\varphi\left(r_{n}\right)-\varphi\left(r_{n+1}\right)\right).

Consequently, we have:

2xnxn1xnxn+19M4D(φ(rn)φ(rn+1)),2\left\|x_{n}-x_{n-1}\right\|-\left\|x_{n}-x_{n+1}\right\|\leq\frac{9M}{4D}\cdot\left(\varphi\left(r_{n}\right)-\varphi\left(r_{n+1}\right)\right), (26)

for every nn\in\mathbb{N}, with nN¯n\geq\bar{N}. Now, by summing up the latter inequality from N¯\bar{N} to PN¯P\geq\bar{N}, we get that:

n=N¯Pxnxn1xP+1xPxN¯xN¯1+9M4D(φ(rN¯)φ(rP+1)).\sum_{n=\bar{N}}^{P}\left\|x_{n}-x_{n-1}\right\|\leq\left\|x_{P+1}-x_{P}\right\|-\left\|x_{\bar{N}}-x_{\bar{N}-1}\right\|+\frac{9M}{4D}\cdot\left(\varphi\left(r_{\bar{N}}\right)-\varphi\left(r_{P+1}\right)\right).

Now, it is time to use the fact that φ(0)=0\varphi(0)=0. In this setting, by letting P+P\longrightarrow+\infty and by using (19) we obtain:

n=N¯xnxn1xN¯xN¯1+9M4Dφ(rN¯)<+\sum_{n=\bar{N}}^{\infty}\left\|x_{n}-x_{n-1}\right\|\leq-\left\|x_{\bar{N}}-x_{\bar{N}-1}\right\|+\frac{9M}{4D}\varphi\left(r_{\bar{N}}\right)<+\infty

It implies that:

n=1xnxn1<+\sum_{n=1}^{\infty}\left\|x_{n}-x_{n-1}\right\|<+\infty

so the first part of the proof is done.
At the same time, the sequence (Sn)n\left(S_{n}\right)_{n\in\mathbb{N}}, defined by:

Sn=i=1nxixi1S_{n}=\sum_{i=1}^{n}\left\|x_{i}-x_{i-1}\right\|

is Cauchy. Thus, for every ε>0\varepsilon>0, there exists a positive integer number NεN_{\varepsilon}, such that for each nNεn\geq N_{\varepsilon} and for all pp\in\mathbb{N}, one has:

Sn+pSnεS_{n+p}-S_{n}\leq\varepsilon

Furthermore,

Sn+pSn=i=n+1n+pxixi1i=n+1n+p(xixi1)=xn+pxn.S_{n+p}-S_{n}=\sum_{i=n+1}^{n+p}\left\|x_{i}-x_{i-1}\right\|\geq\left\|\sum_{i=n+1}^{n+p}\left(x_{i}-x_{i-1}\right)\right\|=\left\|x_{n+p}-x_{n}\right\|.

So, the sequence (xn)n\left(x_{n}\right)_{n\in\mathbb{N}} is Cauchy hence is convergent, i.e., there exists xmx\in\mathbb{R}^{m}, such that:

limn+xn=x.\lim_{n\longrightarrow+\infty}x_{n}=x.

Thus, by using (i) of Lemma 7, it follows that:

{x}=ω((xn)n)crit(g),\{x\}=\omega\left(\left(x_{n}\right)_{n\in\mathbb{N}}\right)\subseteq\operatorname{crit}(g),

which leads to the conclusion of the second part of the present theorem.
Remark 13 Since the class of semi-algebraic functions is closed under addition (see for example [7]), and (x,y)12xy2(x,y)\mapsto\frac{1}{2}\|x-y\|^{2} is semi-algebraic, the conclusion of the previous theorem holds if the condition that HH is a KL function is replaced by the assumption that gg is semi-algebraic.

Note that, according to Remark 10, the conclusion of Theorem 12 remains valid if, in its hypotheses, the conditions that $g$ is bounded from below and that $\left(x_{n}\right)_{n\in\mathbb{N}}$ is bounded are replaced by the condition that $g$ is coercive.

Finally, observe that under the assumptions of Theorem 12, we have limn+yn=x\lim_{n\rightarrow+\infty}y_{n}=x and

limn+g(xn)=limn+g(yn)=g(x).\lim_{n\longrightarrow+\infty}g\left(x_{n}\right)=\lim_{n\longrightarrow+\infty}g\left(y_{n}\right)=g(x).

3 Convergence rates

In the following theorem, we provide convergence rates for the sequences generated by (6), as well as for the function values, in terms of the Łojasiewicz exponent of $H$ (see also [2, 8, 14, 20]). More precisely, we obtain finite, linear and sublinear convergence rates, depending on whether the Łojasiewicz exponent $\theta$ of $H$ is $0$, belongs to $\left(0,\frac{1}{2}\right]$, or belongs to $\left(\frac{1}{2},1\right)$, respectively. Note that the forthcoming results remain valid if one replaces in their hypotheses the conditions that $g$ is bounded from below and $\left(x_{n}\right)_{n\in\mathbb{N}}$ is bounded by the condition that $g$ is coercive.

The following lemma was established in [14] and will be crucial in obtaining our convergence rates.

Lemma 14 ([14], Lemma 15) Let $\left(e_{n}\right)_{n\geq\bar{n}}$, $\bar{n}\in\mathbb{N}$, be a monotonically decreasing positive sequence converging to $0$. Assume further that there exist natural numbers $l_{0}\geq 1$ and $n_{0}\geq\bar{n}+l_{0}$ such that for every $n\geq n_{0}$ one has:

enl0enC0en2θe_{n-l_{0}}-e_{n}\geq C_{0}e_{n}^{2\theta} (27)

where $C_{0}>0$ is some constant and $\theta\in[0,1)$. Then, the following statements are true:
(i) If $\theta=0$, then $\left(e_{n}\right)_{n\geq\bar{n}}$ converges in finite time;
(ii) If $\theta\in\left(0,\frac{1}{2}\right]$, then there exist $C_{1}>0$ and $Q\in[0,1)$ such that for every $n\geq n_{0}$

e_{n}\leq C_{1}Q^{n};

(iii) If $\theta\in\left[\frac{1}{2},1\right)$, then there exists $C_{2}>0$ such that for every $n\geq n_{0}+l_{0}$

e_{n}\leq C_{2}\left(n-l_{0}+1\right)^{-\frac{1}{2\theta-1}}.

In the proof of the following theorem, we use Lemma 14 (see also [2]).

Theorem 15 In the setting of problem (P), consider the sequences $\left(x_{n}\right)_{n\in\mathbb{N}}$, $\left(y_{n}\right)_{n\in\mathbb{N}}$ generated by Algorithm (6). Assume that $g$ is bounded from below and that $\left(x_{n}\right)_{n\in\mathbb{N}}$ is bounded. Suppose that

H:\mathbb{R}^{m}\times\mathbb{R}^{m}\longrightarrow\mathbb{R},\quad H(x,y)=g(x)+\frac{1}{2}\|x-y\|^{2}

fulfills the Łojasiewicz property with Łojasiewicz constant $K$ and Łojasiewicz exponent $\theta\in[0,1)$, and let $\lim_{n\rightarrow+\infty}x_{n}=\bar{x}$. Then, the following statements hold true:

If $\theta=0$, then the sequences
(a) $\left(g\left(y_{n}\right)\right)_{n\in\mathbb{N}}$, $\left(g\left(x_{n}\right)\right)_{n\in\mathbb{N}}$, $\left(y_{n}\right)_{n\in\mathbb{N}}$ and $\left(x_{n}\right)_{n\in\mathbb{N}}$ converge in a finite number of steps;
If $\theta\in\left(0,\frac{1}{2}\right]$, then there exist $Q\in(0,1)$, $a_{1},a_{2},a_{3},a_{4}>0$ and $\bar{k}\in\mathbb{N}$ such that:
(a1) $g\left(y_{n}\right)-g(\bar{x})\leq a_{1}Q^{n}$ for every $n\geq\bar{k}$,
(a2) $g\left(x_{n}\right)-g(\bar{x})\leq a_{2}Q^{n}$ for every $n\geq\bar{k}$,
(a3) $\left\|x_{n}-\bar{x}\right\|\leq a_{3}Q^{\frac{n}{2}}$ for every $n\geq\bar{k}$,
(a4) $\left\|y_{n}-\bar{x}\right\|\leq a_{4}Q^{\frac{n}{2}}$ for all $n\geq\bar{k}$;
If $\theta\in\left(\frac{1}{2},1\right)$, then there exist $b_{1},b_{2},b_{3},b_{4}>0$ and $\bar{k}\in\mathbb{N}$ such that:
(b1) $g\left(y_{n}\right)-g(\bar{x})\leq b_{1}n^{-\frac{1}{2\theta-1}}$ for all $n\geq\bar{k}$,
(b2) $g\left(x_{n}\right)-g(\bar{x})\leq b_{2}n^{-\frac{1}{2\theta-1}}$ for all $n\geq\bar{k}$,
(b3) $\left\|x_{n}-\bar{x}\right\|\leq b_{3}n^{\frac{\theta-1}{2\theta-1}}$ for all $n\geq\bar{k}$,
(b4) $\left\|y_{n}-\bar{x}\right\|\leq b_{4}n^{\frac{\theta-1}{2\theta-1}}$ for all $n\geq\bar{k}$.

Proof We start by employing the ideas from the proof of Theorem 12, namely if there exists n¯\bar{n}\in\mathbb{N}, with n¯N\bar{n}\geq N, for which one has that:

H(yn¯,un¯)=H(x¯,x¯)H\left(y_{\bar{n}},u_{\bar{n}}\right)=H(\bar{x},\bar{x})

then it follows that the sequence (xn)nn¯\left(x_{n}\right)_{n\geq\bar{n}} is constant. This leads to the fact that the sequence (yn)nn¯\left(y_{n}\right)_{n\geq\bar{n}} is also constant. Furthermore,

H(yn,un)=H(x¯,x¯) for all nn¯.H\left(y_{n},u_{n}\right)=H(\bar{x},\bar{x})\quad\text{ for all }\quad n\geq\bar{n}.

That is, if the regularized energy is constant after a certain number of iterations, the conclusions follow in a straightforward way.
Now, we can easily assume that:

H(yn,un)>H(x¯,x¯) for all nN.H\left(y_{n},u_{n}\right)>H(\bar{x},\bar{x})\quad\text{ for all }\quad n\geq N.

In order to simplify notations, we will use

Hn:=H(yn,un),H¯:=H(x¯,x¯) and Hn:=H(yn,un).H_{n}:=H\left(y_{n},u_{n}\right),\quad\bar{H}:=H(\bar{x},\bar{x})\quad\text{ and }\quad\nabla H_{n}:=\nabla H\left(y_{n},u_{n}\right).

Our analysis aims to apply Lemma 14 for

rn:=HnH¯=H(yn,un)H(x¯,x¯)>0r_{n}:=H_{n}-\bar{H}=H\left(y_{n},u_{n}\right)-H(\bar{x},\bar{x})>0

based on three previously established fundamental relations:
(a) The energy decay relation (16), for every nNn\geq N

HnHn+1Dxnxn12H_{n}-H_{n+1}\geq D\left\|x_{n}-x_{n-1}\right\|^{2}

(b) The energy-gradient estimate (22), for every nn\in\mathbb{N}

Hn22βn2xn+1xn2+Snxnxn12\left\|\nabla H_{n}\right\|^{2}\leq\frac{2}{\beta_{n}^{2}}\left\|x_{n+1}-x_{n}\right\|^{2}+S_{n}\left\|x_{n}-x_{n-1}\right\|^{2}

where

Sn=2[(αnβn2δn)2+δn],S_{n}=2\cdot\left[\left(\frac{\alpha_{n}}{\beta_{n}}-\sqrt{2\delta_{n}}\right)^{2}+\delta_{n}\right],

(c) The Łojasiewicz inequality (8), for every nN¯1n\geq\bar{N}_{1}

(HnH¯)2θK2Hn2,\left(H_{n}-\bar{H}\right)^{2\theta}\leq K^{2}\left\|\nabla H_{n}\right\|^{2},

where N¯1N\bar{N}_{1}\geq N is defined by the Łojasiewicz property of HH at ( x¯,x¯\bar{x},\bar{x} ) such that for ε>0\varepsilon>0 one has

(yn,un)(x¯,x¯)<ε\left\|\left(y_{n},u_{n}\right)-(\bar{x},\bar{x})\right\|<\varepsilon

for all nN¯1n\geq\bar{N}_{1}.
By combining these three inequalities, one reaches

(HnH¯)2θ2K2βn2D(Hn+1Hn+2)+K2SnD(HnHn+1)\left(H_{n}-\bar{H}\right)^{2\theta}\leq\frac{2K^{2}}{\beta_{n}^{2}D}\left(H_{n+1}-H_{n+2}\right)+\frac{K^{2}S_{n}}{D}\left(H_{n}-H_{n+1}\right)

and we are led to a nonlinear second-order difference inequality

rn2θ2K2βn2D(rn+1rn+2)+K2SnD(rnrn+1),r_{n}^{2\theta}\leq\frac{2K^{2}}{\beta_{n}^{2}D}\left(r_{n+1}-r_{n+2}\right)+\frac{K^{2}S_{n}}{D}\left(r_{n}-r_{n+1}\right), (28)

for every nN¯1n\geq\bar{N}_{1}.
Using the fact that the positive sequence (rn)nN\left(r_{n}\right)_{n\geq N} is decreasing we have that rn+22θrn2θr_{n+2}^{2\theta}\leq r_{n}^{2\theta} for all nNn\geq N. Further, since the sequences (2K2βn2D)n\left(\frac{2K^{2}}{\beta_{n}^{2}D}\right)_{n\in\mathbb{N}} and (K2SnD)n\left(\frac{K^{2}S_{n}}{D}\right)_{n\in\mathbb{N}} converge and have positive limit, there exists C>0C>0 such that for all nN¯1n\geq\bar{N}_{1} one has

2K2βn2D(rn+1rn+2)+K2SnD(rnrn+1)C(rnrn+2).\frac{2K^{2}}{\beta_{n}^{2}D}\left(r_{n+1}-r_{n+2}\right)+\frac{K^{2}S_{n}}{D}\left(r_{n}-r_{n+1}\right)\leq C\left(r_{n}-r_{n+2}\right).

In the view of these observations, (28) becomes

C0rn+22θrnrn+2,C_{0}r_{n+2}^{2\theta}\leq r_{n}-r_{n+2}, (29)

for every $n\geq\bar{N}_{1}$, where $C_{0}=\frac{1}{C}$.
Now we can apply Lemma 14 by observing that (29) is nothing else than (27) in Lemma 14, with $e_{n}=r_{n+2}$, $\bar{n}=N-2$, $l_{0}=2$ and $n_{0}=\bar{N}_{1}-2$. Hence, taking into account that $r_{n}>0$ for all $n\geq N$ (so that in the conclusion of Lemma 14 (ii) one has $Q\neq 0$), we have:
(p0) If $\theta=0$, then $\left(r_{n}\right)_{n\geq N}$ converges in finite time;
(p1) If $\theta\in\left(0,\frac{1}{2}\right]$, then there exist $C_{1}>0$ and $Q\in(0,1)$ such that for every $n\geq\bar{N}_{1}$

r_{n}\leq C_{1}Q^{n};

(p2) If $\theta\in\left[\frac{1}{2},1\right)$, then there exists $C_{2}>0$ such that for every $n\geq\bar{N}_{1}+2$

r_{n}\leq C_{2}(n-3)^{-\frac{1}{2\theta-1}}.

We treat first the case θ=0\theta=0. Then, according to (p0), rn=g(yn)g(x¯)+δnxnxn12r_{n}=g\left(y_{n}\right)-g(\bar{x})+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2} converges in finite time. But then rnrn+1=0r_{n}-r_{n+1}=0 for nn big enough, hence (16) implies that xn=xn1x_{n}=x_{n-1} for nn big enough, consequently yn=xny_{n}=x_{n} for nn big enough, thus (xn)n,(yn)n\left(x_{n}\right)_{n\in\mathbb{N}},\left(y_{n}\right)_{n\in\mathbb{N}} converge in finite time. The above results show immediately that (g(xn))n,(g(yn))n\left(g\left(x_{n}\right)\right)_{n\in\mathbb{N}},\left(g\left(y_{n}\right)\right)_{n\in\mathbb{N}} also converge in finite time.

Assume now that θ(0,12]\theta\in\left(0,\frac{1}{2}\right].
(a1). According to (p1) there exist C1>0C_{1}>0 and Q(0,1)Q\in(0,1) such that for every nN¯1n\geq\bar{N}_{1} one has:

rn=g(yn)g(x¯)+δnxnxn12C1Qnr_{n}=g\left(y_{n}\right)-g(\bar{x})+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}\leq C_{1}Q^{n} (30)

Hence,

g(yn)g(x¯)a1Qn,g\left(y_{n}\right)-g(\bar{x})\leq a_{1}Q^{n}, (31)

for all nN¯1n\geq\bar{N}_{1}, where a1=C1a_{1}=C_{1}.
(a2). In order to give an upper bound for the difference g(xn)g(x¯)g\left(x_{n}\right)-g(\bar{x}), we consider the following chain of inequalities based on Lemma 2:

g(xn)g(yn)\displaystyle g\left(x_{n}\right)-g\left(y_{n}\right) g(yn),xnyn+Lg2xnyn2\displaystyle\leq\left\langle\nabla g\left(y_{n}\right),x_{n}-y_{n}\right\rangle+\frac{L_{g}}{2}\left\|x_{n}-y_{n}\right\|^{2}
=1βn(ynxn+1),αn(xnxn1)+Lg2xnyn2\displaystyle=\left\langle\frac{1}{\beta_{n}}\left(y_{n}-x_{n+1}\right),-\alpha_{n}\left(x_{n}-x_{n-1}\right)\right\rangle+\frac{L_{g}}{2}\left\|x_{n}-y_{n}\right\|^{2}
=1βnxn+1xn,αn(xnxn1)αn22βnLg2βnxnxn12.\displaystyle=\frac{1}{\beta_{n}}\left\langle x_{n+1}-x_{n},\alpha_{n}\left(x_{n}-x_{n-1}\right)\right\rangle-\alpha_{n}^{2}\frac{2-\beta_{n}L_{g}}{2\beta_{n}}\left\|x_{n}-x_{n-1}\right\|^{2}.

Here, using the inequality:

xn+1xn,αn(xnxn1)12[12βnLgxn+1xn2+(2βnLg)αn2xnxn12]\left\langle x_{n+1}-x_{n},\alpha_{n}\left(x_{n}-x_{n-1}\right)\right\rangle\leq\frac{1}{2}\left[\frac{1}{2-\beta_{n}L_{g}}\left\|x_{n+1}-x_{n}\right\|^{2}+\left(2-\beta_{n}L_{g}\right)\alpha_{n}^{2}\left\|x_{n}-x_{n-1}\right\|^{2}\right]

leads, after some simplifications (the terms containing αn2xnxn12\alpha_{n}^{2}\left\|x_{n}-x_{n-1}\right\|^{2} cancel exactly), to:

g(xn)g(yn)12βn(2βnLg)xn+1xn2, for all ng\left(x_{n}\right)-g\left(y_{n}\right)\leq\frac{1}{2\beta_{n}\left(2-\beta_{n}L_{g}\right)}\left\|x_{n+1}-x_{n}\right\|^{2},\text{ for all }n\in\mathbb{N}

By combining the inequality (16) with the fact that the sequence (g(yn)+δnxnxn12)nN\left(g\left(y_{n}\right)+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}\right)_{n\geq N} is decreasing and converges to g(x¯)g(\bar{x}), one obtains:

g(xn)g(yn)12Dβn(2βnLg)rn+1, for all nNg\left(x_{n}\right)-g\left(y_{n}\right)\leq\frac{1}{2D\beta_{n}\left(2-\beta_{n}L_{g}\right)}r_{n+1},\quad\text{ for all }n\geq N (32)

From (30), we have rn+1C1Qn+1C1Qnr_{n+1}\leq C_{1}Q^{n+1}\leq C_{1}Q^{n} for all nN¯1n\geq\bar{N}_{1}, hence:

g(xn)g(yn)12Dβn(2βnLg)C1Qn, for all nN¯1g\left(x_{n}\right)-g\left(y_{n}\right)\leq\frac{1}{2D\beta_{n}\left(2-\beta_{n}L_{g}\right)}C_{1}Q^{n},\text{ for all }n\geq\bar{N}_{1}

This means that for every nN¯1n\geq\bar{N}_{1} one has:

g(xn)g(x¯)=(g(xn)g(yn))+(g(yn)g(x¯))C1[12Dβn(2βnLg)+1]Qng\left(x_{n}\right)-g(\bar{x})=\left(g\left(x_{n}\right)-g\left(y_{n}\right)\right)+\left(g\left(y_{n}\right)-g(\bar{x})\right)\leq C_{1}\left[\frac{1}{2D\beta_{n}\left(2-\beta_{n}L_{g}\right)}+1\right]Q^{n}

Since the sequence (βn)n\left(\beta_{n}\right)_{n\in\mathbb{N}} is convergent to β>0\beta>0, we can choose:

a2=C1supnN¯112Dβn(2βnLg)+C1a_{2}=C_{1}\sup_{n\geq\bar{N}_{1}}\frac{1}{2D\beta_{n}\left(2-\beta_{n}L_{g}\right)}+C_{1}

and we have

g(xn)g(x¯)a2Qn, for every nN¯1.g\left(x_{n}\right)-g(\bar{x})\leq a_{2}Q^{n},\text{ for every }n\geq\bar{N}_{1}. (33)

(a3). We continue the proof by establishing an estimate for xnx¯\left\|x_{n}-\bar{x}\right\|. By the triangle inequality and by summing up (26) from nN¯N¯1n\geq\bar{N}\geq\bar{N}_{1} to P>nP>n (where N¯\bar{N} was defined in the proof of Theorem 12), one has:

xPxn1\displaystyle\left\|x_{P}-x_{n-1}\right\| k=nPxkxk1\displaystyle\leq\sum_{k=n}^{P}\left\|x_{k}-x_{k-1}\right\|
xnxn1+xP+1xP+9M4D[φ(HnH¯)φ(HP+1H¯)],\displaystyle\leq-\left\|x_{n}-x_{n-1}\right\|+\left\|x_{P+1}-x_{P}\right\|+\frac{9M}{4D}\left[\varphi\left(H_{n}-\bar{H}\right)-\varphi\left(H_{P+1}-\bar{H}\right)\right],

so, letting PP\longrightarrow\infty gives:

xn1x¯9M4Dφ(HnH¯).\left\|x_{n-1}-\bar{x}\right\|\leq\frac{9M}{4D}\varphi\left(H_{n}-\bar{H}\right).

Recall that the desingularizing function is φ(t)=K1θt1θ\varphi(t)=\frac{K}{1-\theta}t^{1-\theta}; hence,

xn1x¯M1rn1θ,\left\|x_{n-1}-\bar{x}\right\|\leq M_{1}r_{n}^{1-\theta}, (34)

for every nN¯n\geq\bar{N}, where M1=9MK4D(1θ)M_{1}=\frac{9MK}{4D(1-\theta)}.
Further, since rnr_{n} converges to 0 one has 0<rn<10<r_{n}<1 for nn big enough, hence rn1θrnr_{n}^{1-\theta}\leq\sqrt{r_{n}} holds for θ(0,1/2]\theta\in(0,1/2], if nn is big enough. By using (30), we conclude that there exists N¯2N¯\bar{N}_{2}\geq\bar{N} such that:

xnx¯M1rn+1M1rna3Qn2, for every nN¯2,\left\|x_{n}-\bar{x}\right\|\leq M_{1}\sqrt{r_{n+1}}\leq M_{1}\sqrt{r_{n}}\leq a_{3}Q^{\frac{n}{2}},\text{ for every }n\geq\bar{N}_{2}, (35)

where a3:=C1M1a_{3}:=\sqrt{C_{1}}M_{1}.
(a4). Finally, we conclude this part of the proof by deducing an upper bound for ynx¯\left\|y_{n}-\bar{x}\right\|. The following inequalities hold true:

ynx¯\displaystyle\left\|y_{n}-\bar{x}\right\| =xn+αn(xnxn1)x¯|1+αn|xnx¯+|αn|xn1x¯\displaystyle=\left\|x_{n}+\alpha_{n}\left(x_{n}-x_{n-1}\right)-\bar{x}\right\|\leq\left|1+\alpha_{n}\right|\cdot\left\|x_{n}-\bar{x}\right\|+\left|\alpha_{n}\right|\cdot\left\|x_{n-1}-\bar{x}\right\|
(1+|αn|)a3Qn2+|αn|a3Q12Qn2(1+|αn|+Q12|αn|)a3Qn2,\displaystyle\leq\left(1+\left|\alpha_{n}\right|\right)a_{3}Q^{\frac{n}{2}}+\left|\alpha_{n}\right|a_{3}Q^{-\frac{1}{2}}Q^{\frac{n}{2}}\leq\left(1+\left|\alpha_{n}\right|+Q^{-\frac{1}{2}}\left|\alpha_{n}\right|\right)a_{3}Q^{\frac{n}{2}},

for all nN¯2+1n\geq\bar{N}_{2}+1. Let a4=supnN¯2+1(1+|αn|+Q12|αn|)a3>0a_{4}=\sup_{n\geq\bar{N}_{2}+1}\left(1+\left|\alpha_{n}\right|+Q^{-\frac{1}{2}}\left|\alpha_{n}\right|\right)a_{3}>0. Then,

ynx¯a4Qn2, for all nN¯2+1\left\|y_{n}-\bar{x}\right\|\leq a_{4}Q^{\frac{n}{2}},\text{ for all }n\geq\bar{N}_{2}+1 (36)

Now, if we take k¯=max{N¯1,N¯2+1}\bar{k}=\max\left\{\bar{N}_{1},\bar{N}_{2}+1\right\} then (31), (33), (35), and (36) lead to the conclusions (a1)(a4)\left(\mathrm{a}_{1}\right)-\left(\mathrm{a}_{4}\right).

Finally, assume that θ(12,1)\theta\in\left(\frac{1}{2},1\right).
(b1). According to (p2) there exists C2>0C_{2}>0 such that for every nN¯1+2n\geq\bar{N}_{1}+2 one has

rn=g(yn)g(x¯)+δnxnxn12C2(n3)12θ1.r_{n}=g\left(y_{n}\right)-g(\bar{x})+\delta_{n}\left\|x_{n}-x_{n-1}\right\|^{2}\leq C_{2}(n-3)^{-\frac{1}{2\theta-1}}. (37)

Consequently,

g(yn)g(x¯)C2(n3)12θ1=C2(nn3)12θ1n12θ1g\left(y_{n}\right)-g(\bar{x})\leq C_{2}(n-3)^{-\frac{1}{2\theta-1}}=C_{2}\left(\frac{n}{n-3}\right)^{\frac{1}{2\theta-1}}n^{-\frac{1}{2\theta-1}}

for every nN¯1+2n\geq\bar{N}_{1}+2. Hence, we have

g(yn)g(x¯)b1n12θ1,g\left(y_{n}\right)-g(\bar{x})\leq b_{1}n^{-\frac{1}{2\theta-1}}, (38)

for every nN¯1+2n\geq\bar{N}_{1}+2, where b1=C2supnN¯1+2(nn3)12θ1b_{1}=C_{2}\sup_{n\geq\bar{N}_{1}+2}\left(\frac{n}{n-3}\right)^{\frac{1}{2\theta-1}}.
The other claims now follow quite easily.
(b2). Indeed, note that (32) holds for every nN¯1n\geq\bar{N}_{1}, hence:

g(xn)g(yn)12Dβn(2βnLg)rn+112Dβn(2βnLg)b1(n+1)12θ1.g\left(x_{n}\right)-g\left(y_{n}\right)\leq\frac{1}{2D\beta_{n}\left(2-\beta_{n}L_{g}\right)}r_{n+1}\leq\frac{1}{2D\beta_{n}\left(2-\beta_{n}L_{g}\right)}b_{1}(n+1)^{\frac{-1}{2\theta-1}}.

Therefore, one obtains:

g(xn)g(x¯)=(g(xn)g(yn))+(g(yn)g(x¯))(12Dβn(2βnLg)b1+b1)n12θ1,g\left(x_{n}\right)-g(\bar{x})=\left(g\left(x_{n}\right)-g\left(y_{n}\right)\right)+\left(g\left(y_{n}\right)-g(\bar{x})\right)\leq\left(\frac{1}{2D\beta_{n}\left(2-\beta_{n}L_{g}\right)}b_{1}+b_{1}\right)n^{\frac{-1}{2\theta-1}},

for every nN¯1+2n\geq\bar{N}_{1}+2. Let b2=supnN¯1+2(12Dβn(2βnLg)b1+b1)b_{2}=\sup_{n\geq\bar{N}_{1}+2}\left(\frac{1}{2D\beta_{n}\left(2-\beta_{n}L_{g}\right)}b_{1}+b_{1}\right). Then

g(xn)g(x¯)b2n12θ1,g\left(x_{n}\right)-g(\bar{x})\leq b_{2}n^{\frac{-1}{2\theta-1}}, (39)

for every nN¯1+2n\geq\bar{N}_{1}+2.
(b3). To prove (b3), we use (34) again, and we have that for all nN¯N¯1+2n\geq\bar{N}\geq\bar{N}_{1}+2 it holds

xnx¯M1rn+11θM1rn1θM1(b1n12θ1)1θ\left\|x_{n}-\bar{x}\right\|\leq M_{1}r_{n+1}^{1-\theta}\leq M_{1}r_{n}^{1-\theta}\leq M_{1}\left(b_{1}n^{\frac{-1}{2\theta-1}}\right)^{1-\theta}

Let b3=M1b11θb_{3}=M_{1}b_{1}^{1-\theta}. Then,

xnx¯b3nθ12θ1,\left\|x_{n}-\bar{x}\right\|\leq b_{3}n^{\frac{\theta-1}{2\theta-1}}, (40)

for all nN¯n\geq\bar{N}.
(b4). The final estimate is a straightforward consequence of the definition of yny_{n} and the above estimates. Indeed, for all nN¯+1n\geq\bar{N}+1 one has:

ynx¯=xn+αn(xnxn1)x¯|1+αn|xnx¯+|αn|xn1x¯(1+|αn|)b3nθ12θ1+|αn|b3(n1)θ12θ1(1+2|αn|)b3(n1)θ12θ1.\begin{gathered}\left\|y_{n}-\bar{x}\right\|=\left\|x_{n}+\alpha_{n}\left(x_{n}-x_{n-1}\right)-\bar{x}\right\|\leq\left|1+\alpha_{n}\right|\cdot\left\|x_{n}-\bar{x}\right\|+\left|\alpha_{n}\right|\cdot\left\|x_{n-1}-\bar{x}\right\|\\ \leq\left(1+\left|\alpha_{n}\right|\right)b_{3}n^{\frac{\theta-1}{2\theta-1}}+\left|\alpha_{n}\right|b_{3}(n-1)^{\frac{\theta-1}{2\theta-1}}\leq\left(1+2\left|\alpha_{n}\right|\right)b_{3}(n-1)^{\frac{\theta-1}{2\theta-1}}.\end{gathered}

Let b4=supnN¯+1(1+2|αn|)b3(nn1)1θ2θ1>0b_{4}=\sup_{n\geq\bar{N}+1}\left(1+2\left|\alpha_{n}\right|\right)b_{3}\left(\frac{n}{n-1}\right)^{\frac{1-\theta}{2\theta-1}}>0. Then,

ynx¯b4nθ12θ1, for all nN¯+1.\left\|y_{n}-\bar{x}\right\|\leq b_{4}n^{\frac{\theta-1}{2\theta-1}},\text{ for all }n\geq\bar{N}+1. (41)

Now, if we take k¯=N¯+1\bar{k}=\bar{N}+1 then (38), (39), (40), and (41) lead to the conclusions (b1)(b4)\left(b_{1}\right)-\left(b_{4}\right).

Remark 16 According to [21], HH has the Łojasiewicz property with Łojasiewicz exponent θ[12,1)\theta\in\left[\frac{1}{2},1\right), whenever gg has the Łojasiewicz property with Łojasiewicz exponent θ[12,1)\theta\in\left[\frac{1}{2},1\right). Therefore, one obtains the same convergence rates if in the hypotheses of Theorem 15 one assumes that gg has the Łojasiewicz property with Łojasiewicz exponent θ[12,1)\theta\in\left[\frac{1}{2},1\right).

4 Numerical experiments

The aim of this section is to support the analytic results of the previous sections by numerical experiments and to highlight some interesting features of the generic algorithm (6).

4.1 Comparing Algorithm (6) with some algorithms from the literature by using different step sizes

In our first experiment, let us consider the convex function:

g:2,g(x,y)=0.02x2+0.005y2.g:\mathbb{R}^{2}\longrightarrow\mathbb{R},g(x,y)=0.02x^{2}+0.005y^{2}.

Based on the boundedness of its Hessian, we infer that the Lipschitz constant of its gradient is Lg=0.2L_{g}=0.2. Obviously, gg is strongly convex and its global minimum is attained at (0,0)(0,0).

In order to give a better perspective on the advantages and disadvantages of Algorithm (6) for different choices of step sizes and inertial coefficients, in our first numerical experiment we compare the following:
(a) The proposed algorithm (6) with inertial parameter αn=0.01nn+3\alpha_{n}=-0.01\cdot\frac{n}{n+3}, so that α=0.01(10+688,0)\alpha=-0.01\in\left(\frac{-10+\sqrt{68}}{8},0\right), and constant step size βn=β=9(0,4α2+10α+2Lg(2α+1)2);\beta_{n}=\beta=9\in\left(0,\frac{4\alpha^{2}+10\alpha+2}{L_{g}(2\alpha+1)^{2}}\right);
(b) The proposed algorithm (6) with inertial parameter αn=0.01nn+3\alpha_{n}=-0.01\cdot\frac{n}{n+3} and increasing step size βn=9n+1n+2\beta_{n}=9\cdot\frac{n+1}{n+2};
(c) The proposed algorithm (6) with inertial parameter αn=0.01nn+3\alpha_{n}=-0.01\cdot\frac{n}{n+3} and decreasing step size βn=9n+3n+2\beta_{n}=9\cdot\frac{n+3}{n+2};
(d) Polyak’s algorithm (1) with inertial parameter αn=0.6nn+2\alpha_{n}=0.6\cdot\frac{n}{n+2} and constant step size βn=2Lg=10\beta_{n}=\frac{2}{L_{g}}=10;
(e) Polyak’s algorithm (1) with decreasing inertial parameter αn=0.7n+2n+1.5\alpha_{n}=0.7\cdot\frac{n+2}{n+1.5} and increasing step size βn=2(1αn)0.9Lg=90.3n+0.1n+1.5\beta_{n}=\frac{2\left(1-\alpha_{n}\right)\cdot 0.9}{L_{g}}=9\cdot\frac{0.3n+0.1}{n+1.5};
(f) Nesterov algorithm (4) with maximal admissible step size s=1Lg=5s=\frac{1}{L_{g}}=5;
(g) The Nesterov-like algorithm (5) with inertial parameter 0.6nn+30.6\cdot\frac{n}{n+3}, and step size s=2(10.6)Lg=4s=\frac{2(1-0.6)}{L_{g}}=4.
The choices of inertial coefficients and step sizes are motivated by theoretical results in [18,29] and [20]. We consider the starting points x0=x1=(3,1)x_{0}=x_{-1}=(3,1) and run the simulations until the error |g(xn+1)g(xn)|\left|g\left(x_{n+1}\right)-g\left(x_{n}\right)\right| attains the value 101510^{-15}. These results are shown in Fig. 1, where the horizontal axis measures the number of iterations and the vertical axis shows the error |g(xn+1)g(xn)|\left|g\left(x_{n+1}\right)-g\left(x_{n}\right)\right|. The experiment depicted in Fig. 1 shows that Algorithm (6) has the best behavior when we choose a decreasing step size (red square in Fig. 1). This instance outperforms those obtained with the same Algorithm (6) but with constant step size (red star in Fig. 1), and even more so for increasing step sizes (red circle in Fig. 1). Further, it should be noted that Algorithm (6), in all its instances, outperforms Algorithm (5) (green line in Fig. 1), Algorithm (1) with a constant step size (yellow line in Fig. 1) or variable step size (black line in Fig. 1), and also Nesterov's Algorithm (4) (blue line in Fig. 1).
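For the interested reader, a minimal Python sketch of this experiment is given below. It assumes that the generic scheme (6) takes the form y_n = x_n + α_n(x_n − x_{n−1}), x_{n+1} = y_n − β_n ∇g(y_n), as can be read off from the relations used in the convergence analysis above; the helper name run_scheme and the code structure are illustrative only, not the implementation used for the reported figures.

```python
import numpy as np

def run_scheme(grad_g, g, x0, alpha, beta, tol=1e-15, max_iter=100000):
    """Sketch of the inertial scheme: y_n = x_n + alpha(n)*(x_n - x_{n-1}),
    x_{n+1} = y_n - beta(n)*grad_g(y_n); stops when |g(x_{n+1}) - g(x_n)| <= tol."""
    x_prev = np.array(x0, dtype=float)   # plays the role of x_{-1}
    x = np.array(x0, dtype=float)        # x_0
    for n in range(max_iter):
        y = x + alpha(n) * (x - x_prev)
        x_next = y - beta(n) * grad_g(y)
        if abs(g(x_next) - g(x)) <= tol:
            return x_next, n + 1
        x_prev, x = x, x_next
    return x, max_iter

# Convex quadratic from the first experiment and its gradient
g = lambda z: 0.02 * z[0]**2 + 0.005 * z[1]**2
grad_g = lambda z: np.array([0.04 * z[0], 0.01 * z[1]])

# Instance (c): alpha_n = -0.01*n/(n+3), decreasing step size beta_n = 9*(n+3)/(n+2)
x_star, iters = run_scheme(grad_g, g, x0=(3.0, 1.0),
                           alpha=lambda n: -0.01 * n / (n + 3),
                           beta=lambda n: 9 * (n + 3) / (n + 2))
print(iters, x_star)
```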

4.2 The analysis of the behavior of Algorithm (6) via different inertial values and step sizes

In the second set of numerical experiments, we analyze the behavior of our algorithm with different inertial values and different step sizes in a nonconvex setting. Our experiments suggest that, in order to obtain faster convergence, one should use in Algorithm (6) a decreasing step size βn\beta_{n} and a sequence of inertial parameters whose limit is as close to 0 as possible (see Fig. 2).

First, consider the Rastrigin function (see [25]):

g:2,g(x,y)=20+x210cos(2πx)+y210cos(2πy)g:\mathbb{R}^{2}\longrightarrow\mathbb{R},g(x,y)=20+x^{2}-10\cos(2\pi x)+y^{2}-10\cos(2\pi y)

which is nonconvex. For the initial values x0=x1=(0.9,0.9)x_{0}=x_{-1}=(0.9,0.9), we run Algorithm (6) with the constant step size βn=β=0.001\beta_{n}=\beta=0.001 (yellow circle in Fig. 2a), then with the decreasing step size βn=0.001n+4n+3\beta_{n}=0.001\cdot\frac{n+4}{n+3} (green arrow in Fig. 2a), and then with the increasing step size βn=0.001n+2n+3\beta_{n}=0.001\cdot\frac{n+2}{n+3} (red star in Fig. 2a). In all three runs, the inertial parameter is taken to be αn=0.1nn+3\alpha_{n}=-0.1\cdot\frac{n}{n+3}, with simulations running until the error |g(xn+1)g(xn)|\left|g\left(x_{n+1}\right)-g\left(x_{n}\right)\right| attains 101510^{-15}. The results are shown in Fig. 2a, where the horizontal axis measures the number of iterations and the vertical axis shows the error in terms of iterates.
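A hedged sketch of how this experiment could be set up, reusing the illustrative run_scheme helper from the previous listing (shown here with the constant step size instance), reads:

```python
import numpy as np

# Rastrigin function (see [25]) and its gradient
rastrigin = lambda z: 20 + z[0]**2 - 10*np.cos(2*np.pi*z[0]) + z[1]**2 - 10*np.cos(2*np.pi*z[1])
grad_rastrigin = lambda z: np.array([2*z[0] + 20*np.pi*np.sin(2*np.pi*z[0]),
                                     2*z[1] + 20*np.pi*np.sin(2*np.pi*z[1])])

# alpha_n = -0.1*n/(n+3), constant step size beta_n = 0.001, starting point (0.9, 0.9)
x_star, iters = run_scheme(grad_rastrigin, rastrigin, x0=(0.9, 0.9),
                           alpha=lambda n: -0.1 * n / (n + 3),
                           beta=lambda n: 0.001)
```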

Next, consider the convex quadratic function g:2,g(x,y)=0.02x2+0.005y2g:\mathbb{R}^{2}\longrightarrow\mathbb{R},g(x,y)=0.02x^{2}+0.005y^{2} together with the initial values x0=x1=(3,1)x_{0}=x_{-1}=(3,1) and the inertial parameter αn=0.01nn+1\alpha_{n}=-0.01\cdot\frac{n}{n+1}. Three instances of our algorithm are compared: constant step size βn=β=8\beta_{n}=\beta=8 (yellow circle in Fig. 2b), decreasing step size βn=8n+4n+3\beta_{n}=8\cdot\frac{n+4}{n+3} (green arrow in Fig. 2b), and increasing step size βn=8n+2n+3\beta_{n}=8\cdot\frac{n+2}{n+3} (red star in Fig. 2b). The results are shown in Fig. 2b, where the horizontal axis measures the number of iterations and the vertical axis shows the error in terms of iterates.

[Fig. 2: Comparing different step sizes and inertial coefficients]

We also consider different initial values for Algorithm (6), namely x0=x1=x_{0}=x_{-1}= (0.9, 0.9), together with a fixed step size βn=β=0.001\beta_{n}=\beta=0.001 and the inertial parameters αn=0.2nn+3\alpha_{n}=-0.2\cdot\frac{n}{n+3} (red circle in Fig. 2c), αn=0.1nn+3\alpha_{n}=-0.1\cdot\frac{n}{n+3} (yellow star in Fig. 2c), and αn=0.001nn+3\alpha_{n}=-0.001\cdot\frac{n}{n+3} (green arrow in Fig. 2c). The result, when the objective function gg is the Rastrigin function, is shown in Fig. 2c.

Finally, we consider the same inertial values as before for Algorithm (6), but we take the convex objective function g:2,g(x,y)=0.02x2+0.005y2g:\mathbb{R}^{2}\longrightarrow\mathbb{R},g(x,y)=0.02x^{2}+0.005y^{2} and the fixed step size βn=β=8\beta_{n}=\beta=8, see Fig. 2d.

4.3 Comparing Algorithm (6) with known algorithms by using some test functions for optimization

Since Algorithm (6) is new in the literature, it is worthwhile to compare it with known algorithms using some so-called test functions for optimization (see [25]). In these experiments, we run the algorithms until the error |g(xn+1)g(xn)|\left|g\left(x_{n+1}\right)-g\left(x_{n}\right)\right| attains the value 101510^{-15}. These results are shown in Fig. 3a-d, where the horizontal axis measures the number of iterations and the vertical axis shows the error |g(xn+1)g(xn)|\left|g\left(x_{n+1}\right)-g\left(x_{n}\right)\right|.

First, consider Beale's function:
g:2,g(x,y)=(1.5x+xy)2+(2.25x+xy2)2+(2.625x+xy3)2g:\mathbb{R}^{2}\longrightarrow\mathbb{R},g(x,y)=(1.5-x+xy)^{2}+\left(2.25-x+xy^{2}\right)^{2}+\left(2.625-x+xy^{3}\right)^{2}.
We compare Algorithm (6) with inertial parameter αn=0.01nn+3\alpha_{n}=-0.01\cdot\frac{n}{n+3} (red star in Fig. 3a), Algorithm (2) with inertial parameter αn=0.01nn+3\alpha_{n}=0.01\cdot\frac{n}{n+3} (green square in Fig. 3a), and Algorithm (4) (blue circle in Fig. 3a), by taking the same step size βn=s=0.01\beta_{n}=s=0.01 and the initial value x0=x1=(0.1,0.5)x_{0}=x_{-1}=(0.1,0.5). While Algorithm (6) and Algorithm (2) show a similar behavior for the Beale function, both outperform Algorithm (4) (see Fig. 3a).

[Fig. 3: Minimizing test functions for optimization by using different algorithms]

Consider next the Rosenbrock Function:

g:2,g(x,y)=100(x2y)2+(x1)2g:\mathbb{R}^{2}\longrightarrow\mathbb{R},g(x,y)=100\left(x^{2}-y\right)^{2}+(x-1)^{2}

We compare Algorithm (6) with inertial parameter αn=0.01nn+3\alpha_{n}=-0.01\cdot\frac{n}{n+3} (red star in Fig. 3b), Algorithm (2) with inertial parameter αn=0.4nn+3\alpha_{n}=0.4\cdot\frac{n}{n+3} (green square in Fig. 3b), and Algorithm (4) (blue circle in Fig. 3b), by taking the same step size βn=s=0.001\beta_{n}=s=0.001 and the initial value x0=x1=(2,2)x_{0}=x_{-1}=(2,2) (see Fig. 3b). Note that for the Rosenbrock function, too, Algorithm (6) and Algorithm (2) have a similar behavior; however, in contrast with the oscillations in the error terms |g(xn+1)g(xn)|\left|g\left(x_{n+1}\right)-g\left(x_{n}\right)\right| of Algorithm (2), Algorithm (6) shows an almost linear decreasing trend.
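Explicit gradients for these two test functions, which could be plugged into the same illustrative iteration loop as before, are sketched below (standard calculus; the function names are ours):

```python
import numpy as np

def grad_beale(z):
    # Beale: g(x, y) = (1.5 - x + x*y)^2 + (2.25 - x + x*y^2)^2 + (2.625 - x + x*y^3)^2
    x, y = z
    t1, t2, t3 = 1.5 - x + x*y, 2.25 - x + x*y**2, 2.625 - x + x*y**3
    dx = 2*t1*(y - 1) + 2*t2*(y**2 - 1) + 2*t3*(y**3 - 1)
    dy = 2*t1*x + 4*t2*x*y + 6*t3*x*y**2
    return np.array([dx, dy])

def grad_rosenbrock(z):
    # Rosenbrock: g(x, y) = 100*(x^2 - y)^2 + (x - 1)^2
    x, y = z
    return np.array([400*x*(x**2 - y) + 2*(x - 1), -200*(x**2 - y)])
```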

We are also interested in the quadratic function of the form:

g:2,g(x,y)=3803.84138.08x232.92y+128.08x2+203.64y2+182.25xy.g:\mathbb{R}^{2}\longrightarrow\mathbb{R},g(x,y)=-3803.84-138.08x-232.92y+128.08x^{2}+203.64y^{2}+182.25xy.

We compare Algorithm (6) with inertial parameter αn=0.2nn+3\alpha_{n}=-0.2\cdot\frac{n}{n+3} (red star in Fig. 3c), Algorithm (2) with inertial parameter αn=0.49nn+3\alpha_{n}=0.49\cdot\frac{n}{n+3} (green square in Fig. 3c), and Algorithm (4) (blue circle in Fig. 3c), by taking the same step size βn=s=0.0025\beta_{n}=s=0.0025 and the initial value x0=x1=(2,2)x_{0}=x_{-1}=(2,2). As Fig. 3c shows, in this case Algorithm (6) clearly outperforms Algorithm (2) and Algorithm (4).

Finally, for the logistic regression with l2l_{2}-regularization, we consider the cost function:

g:m,g(w)=1ki=1kln(1+eyiwTxi)+12w2g:\mathbb{R}^{m}\longrightarrow\mathbb{R},g(w)=\frac{1}{k}\sum_{i=1}^{k}\ln\left(1+e^{-y_{i}w^{T}x^{i}}\right)+\frac{1}{2}\|w\|^{2}

with k=200,m=50k=200,m=50, where y1,,yk{1,+1}y_{1},\ldots,y_{k}\in\{-1,+1\} and x1,,xkmx^{1},\ldots,x^{k}\in\mathbb{R}^{m} are generated by a random normal distribution.

We compare Algorithm (6) with inertial parameter αn=0.1nn+3\alpha_{n}=-0.1\cdot\frac{n}{n+3} (red star in Fig. 3d), Algorithm (2) with inertial parameter αn=0.36nn+3\alpha_{n}=0.36\cdot\frac{n}{n+3} (green square in Fig. 3d), and Algorithm (4) (blue circle in Fig. 3d), by taking the same step size βn=s=0.5\beta_{n}=s=0.5 and the initial value x0=x1=(1,,1)Tx_{0}=x_{-1}=(1,\ldots,1)^{T}. Here too, Algorithm (6) outperforms Algorithm (2) and Algorithm (4) (see Fig. 3d).
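For this experiment the gradient of the cost is ∇g(w) = −(1/k) Σ_{i=1}^{k} y_i x^i / (1 + e^{y_i w^T x^i}) + w. A brief Python sketch with synthetic data and the illustrative run_scheme helper (the variable names are ours) reads:

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 200, 50
X = rng.standard_normal((k, m))        # rows are the samples x^1, ..., x^k
y = rng.choice([-1.0, 1.0], size=k)    # labels y_1, ..., y_k

def g(w):
    # l2-regularized logistic loss
    return np.mean(np.log1p(np.exp(-y * (X @ w)))) + 0.5 * np.dot(w, w)

def grad_g(w):
    # -(1/k) * sum_i y_i * x^i / (1 + exp(y_i * w^T x^i)) + w
    return -(X.T @ (y / (1.0 + np.exp(y * (X @ w))))) / k + w

w_star, iters = run_scheme(grad_g, g, x0=np.ones(m),
                           alpha=lambda n: -0.1 * n / (n + 3),
                           beta=lambda n: 0.5)
```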

4.4 Switching between positive and negative inertial parameter values

Finally, a set of numerical experiments is related to the minimization of the nonconvex, coercive function:

g:,g(x)=ln(1+(x21)2)g:\mathbb{R}\longrightarrow\mathbb{R},g(x)=\ln\left(1+\left(x^{2}-1\right)^{2}\right)

[Fig. 4: Minimizing the nonconvex function g(x)=\ln\left(1+\left(x^{2}-1\right)^{2}\right) by using different inertial values and different starting points]

Observe that this function has two global minima at x=1x=-1 and x=1x=1 and a local maximum at x=0x=0.

The experiments that we present in what follows emphasize the importance of the fact that the inertial parameter αn\alpha_{n} in Algorithm (6), though having a strictly negative limit, may have a finite number of positive terms.

Indeed, by taking the same starting points x0=x1=2x_{0}=x_{-1}=2 and constant step size βn=0.1\beta_{n}=0.1, according to Fig. 4a, Algorithm (6), with inertial parameter αn=0.1nn+3\alpha_{n}=-0.1\frac{n}{n+3} (red star in Fig. 4a), seems to converge faster than Algorithm (2) with inertial parameter αn=0.1nn+3\alpha_{n}=0.1\frac{n}{n+3} (black circle in Fig. 4a), after a certain number of iterates. Here, we ran the algorithms until the absolute value of the gradient of the objective function at the iterates, |g(xn)|\left|\nabla g\left(x_{n}\right)\right|, attained the value 101510^{-15}. These results are shown in Fig. 4a, where the horizontal axis measures the number of iterations and the vertical axis shows the energy error |g(xn+1)g(x¯)|\left|g\left(x_{n+1}\right)-g(\bar{x})\right|, where x¯\bar{x} in this case is the appropriate minimizer, namely 1. However, these algorithms show a similar behavior concerning the scaled error h2n2|g(xn+1)g(x¯)|h^{2}n^{2}\left|g\left(x_{n+1}\right)-g(\bar{x})\right|, where nn is the number of iterations and hh is the step size (see Fig. 4b).

Now, for the initial values x0=x1=0.00001x_{0}=x_{-1}=0.00001 (which are very close to the local maximum of the objective function), Algorithm (2) (black circle, Fig. 4c, d) clearly outperforms Algorithm (6) (red star, Fig. 4c, d), both for the energy error |g(xn+1)g(x¯)|\left|g\left(x_{n+1}\right)-g(\bar{x})\right| (Fig. 4c) and for the scaled error h2n2|g(xn+1)g(x¯)|h^{2}n^{2}\left|g\left(x_{n+1}\right)-g(\bar{x})\right| (Fig. 4d).

Nevertheless, the very general structure of the generic Algorithm (6) allows for much flexibility, as only the limit of the sequence ( αn\alpha_{n} ) is prescribed. So, one can profit by taking the inertial parameter αn=0.1n+5n+3\alpha_{n}=\frac{-0.1n+5}{n+3} in Algorithm (6). Then, αn\alpha_{n} is positive for the first 50 iterates, and this helps Algorithm (6) to outperform Algorithm (2) with inertial parameter αn=0.1nn+3\alpha_{n}=0.1\frac{n}{n+3}, even for the initial values x0=x1=x_{0}=x_{-1}= 0.00001 (see Fig. 4e for the energy error |g(xn+1)g(x¯)|\left|g\left(x_{n+1}\right)-g(\bar{x})\right| and Fig. 4f for the scaled error h2n2|g(xn+1)g(x¯)|h^{2}n^{2}\left|g\left(x_{n+1}\right)-g(\bar{x})\right|; the graphics corresponding to Algorithm (6) are depicted in red and those corresponding to Algorithm (2) in black).
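To illustrate this switching behavior, a sketch of the last experiment (again relying on the illustrative run_scheme helper, whose stopping rule is based on function value differences rather than on |∇g(x_n)| as in the text) is the following; note that α_n = (−0.1n+5)/(n+3) is positive exactly for n < 50 and tends to −0.1.

```python
import numpy as np

g = lambda x: np.log(1 + (x**2 - 1)**2)                      # nonconvex 1D objective
grad_g = lambda x: 4 * x * (x**2 - 1) / (1 + (x**2 - 1)**2)  # its derivative

# Inertial parameter positive for the first 50 iterates, with limit -0.1
alpha = lambda n: (-0.1 * n + 5) / (n + 3)

x_star, iters = run_scheme(grad_g, g, x0=0.00001,
                           alpha=alpha, beta=lambda n: 0.1)
```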

References

1. Aujol, J.-F., Dossal, Ch., Rondepierre, A.: Optimal convergence rates for Nesterov acceleration. arXiv:1805.05719
2. Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116(1-2), 5-16 (2009)
3. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35(2), 438-457 (2010)
4. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137(1-2), 91-129 (2013)
5. Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Program. 168(1-2), 123-175 (2018)
6. Bégout, P., Bolte, J., Jendoubi, M.A.: On damped second-order gradient systems. J. Differ. Equ. 259, 3115-3143 (2015)
7. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. Series A 146(1-2), 459-494 (2014)
8. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205-1223 (2006)
9. Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556-572 (2007)
10. Bolte, J., Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Trans. Am. Math. Soc. 362(6), 3319-3363 (2010)
11. Boţ, R.I., Csetnek, E.R., László, S.C.: Approaching nonsmooth nonconvex minimization through second-order proximal-gradient dynamical systems. J. Evol. Equ. 18(3), 1291-1318 (2018)
12. Boţ, R.I., Csetnek, E.R., László, S.C.: An inertial forward-backward algorithm for minimizing the sum of two non-convex functions. Euro J. Comput. Optim. 4(1), 3-25 (2016)
13. Boţ, R.I., Csetnek, E.R., László, S.C.: A second order dynamical approach with variable damping to nonconvex smooth minimization. Applicable Analysis. https://doi.org/10.1080/00036811.2018.1495330 (2018)
14. Boţ, R.I., Nguyen, D.-K.: The proximal alternating direction method of multipliers in the nonconvex setting: convergence analysis and rates. arXiv:1801.01994
15. Chambolle, A., Dossal, Ch.: On the convergence of the iterates of the fast iterative shrinkage/thresholding algorithm. J. Optim. Theory Appl. 166(3), 968-982 (2015)
16. Combettes, P.L., Glaudin, L.E.: Quasinonexpansive iterations on the affine hull of orbits: from Mann's mean value algorithm to inertial methods. SIAM J. Optim. 27(4), 2356-2380 (2017)
17. Frankel, P., Garrigos, G., Peypouquet, J.: Splitting methods with variable metric for Kurdyka-Łojasiewicz functions and general convergence rates. J. Optim. Theory Appl. 165(3), 874-900 (2015)
18. Ghadimi, E., Feyzmahdavian, H.R., Johansson, M.: Global convergence of the heavy-ball method for convex optimization. In: 2015 IEEE European Control Conference (ECC), pp. 310-315 (2015)
19. Kurdyka, K.: On gradients of functions definable in o-minimal structures. Annales de l'institut Fourier (Grenoble) 48(3), 769-783 (1998)
20. László, S.C.: Convergence rates for an inertial algorithm of gradient type associated to a smooth nonconvex minimization. arXiv:1811.09616
21. Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 2018, 1-34 (2018)
22. Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. In: Les Équations aux Dérivées Partielles, Éditions du Centre National de la Recherche Scientifique, Paris, pp. 87-89 (1963)
23. Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate O(1/k^2) (Russian). Dokl. Akad. Nauk SSSR 269(3), 543-547 (1983)
24. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Dordrecht (2004)
25. Polheim, H.: Examples of objective functions. Documentation for Genetic and Evolutionary Algorithms for use with MATLAB: GEATbx version 3.7, http://www.geatbx.com
26. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. U.S.S.R. Comput. Math. Math. Phys. 4(5), 1-17 (1964)
27. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis, Fundamental Principles of Mathematical Sciences, vol. 317. Springer, Berlin (1998)
28. Su, W., Boyd, S., Candes, E.J.: A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17, 1-43 (2016)
29. Sun, T., Yin, P., Li, D., Huang, C., Guan, L., Jiang, H.: Non-ergodic convergence analysis of heavy-ball algorithms. arXiv:1811.01777
