An extension of the second order dynamical system that models Nesterov’s convex gradient method

Abstract

In this paper we deal with a general second order continuous dynamical system associated with a convex minimization problem whose objective function is Fréchet differentiable.

We show that inertial algorithms, such as Nesterov’s algorithm, can be obtained via a natural explicit discretization of our dynamical system.
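As a schematic illustration of this correspondence (a sketch following Su, Boyd and Candès, with illustrative constants; the precise scheme is the one derived in the paper), consider the ODE \(\ddot{x}(t)+\frac{3}{t}\dot{x}(t)+\nabla g(x(t))=0\) that models Nesterov’s method. Taking a step size \(h=\sqrt{s}\), setting \(t_k=kh\) and using the finite differences \(\ddot{x}(t_k)\approx (x_{k+1}-2x_k+x_{k-1})/s\) and \(\dot{x}(t_k)\approx (x_k-x_{k-1})/\sqrt{s}\) gives, after rearranging,
\[
x_{k+1}=x_k+\frac{k-3}{k}\,(x_k-x_{k-1})-s\,\nabla g(x_k),
\]
which already has the inertial structure of Nesterov’s scheme \(y_k=x_k+\frac{k-1}{k+2}(x_k-x_{k-1})\), \(x_{k+1}=y_k-s\,\nabla g(y_k)\); evaluating the gradient at the extrapolated point instead of at \(x_k\) corresponds to the perturbed argument discussed below.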

Our dynamical system can be viewed as a perturbed version of the heavy ball method with vanishing damping; however, the perturbation is made in the argument of the gradient of the objective function.
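In schematic form (the damping and perturbation coefficients below are illustrative placeholders; the exact choices are those specified in the paper), the heavy ball system with vanishing damping reads
\[
\ddot{x}(t)+\frac{\alpha}{t}\,\dot{x}(t)+\nabla g\bigl(x(t)\bigr)=0,
\]
while in the system studied here the gradient is evaluated at a point shifted along the velocity,
\[
\ddot{x}(t)+\frac{\alpha}{t}\,\dot{x}(t)+\nabla g\bigl(x(t)+\beta(t)\,\dot{x}(t)\bigr)=0,
\]
for a suitable coefficient \(\beta(t)\).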

This perturbation seems to have a smoothing effect on the energy error and to eliminate the oscillations of this error observed for the heavy ball method with vanishing damping, as some numerical experiments show.
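The following minimal script is a sketch of such an experiment, not a reproduction of the authors’ numerical tests: it integrates both systems with a plain explicit Euler scheme on a small quadratic objective (the objective, step size and the coefficients alpha and beta are illustrative placeholders) and records the energy error \(g(x(t))-g^*\).

```python
# Minimal sketch (not the authors' experiments): integrate both ODEs with a simple
# explicit Euler scheme on a quadratic objective and compare the energy error g(x(t)) - g*.
# The objective, step size, damping alpha and shift beta are illustrative placeholders.
import numpy as np

A = np.diag([1.0, 10.0])           # quadratic objective g(x) = 0.5 x^T A x, minimum g* = 0
grad = lambda x: A @ x
g = lambda x: 0.5 * x @ A @ x

def integrate(perturbed, alpha=3.0, beta=0.5, h=1e-3, T=20.0):
    """Explicit Euler for x'' + (alpha/t) x' + grad g(x + beta*x') = 0 (perturbed=False: heavy ball)."""
    x, v = np.array([1.0, 1.0]), np.zeros(2)
    errors, t = [], h
    while t < T:
        arg = x + beta * v if perturbed else x    # point where the gradient is evaluated
        a = -(alpha / t) * v - grad(arg)          # acceleration given by the ODE
        x, v, t = x + h * v, v + h * a, t + h
        errors.append(g(x))                       # energy error, since g* = 0 here
    return np.array(errors)

err_hb = integrate(perturbed=False)    # heavy ball with vanishing damping
err_pert = integrate(perturbed=True)   # gradient evaluated at the shifted argument
print("final energy errors:", err_hb[-1], err_pert[-1])
```

Plotting the two error curves over time illustrates the qualitative behaviour described above: the heavy ball error oscillates, while the perturbed-argument version decays more smoothly.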

We prove that the value of the objective function along a generated trajectory converges to the global minimum of the objective function with rate \(\mathcal{O}(1/t^2)\).
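For context (this follows the energy functional of Su, Boyd and Candès for the unperturbed system; the Lyapunov function used in the paper may contain additional terms accounting for the perturbed gradient argument), rates of this type are typically obtained by showing that an energy such as
\[
\mathcal{E}(t)=t^{2}\bigl(g(x(t))-g^{*}\bigr)+2\,\bigl\|x(t)+\tfrac{t}{2}\dot{x}(t)-x^{*}\bigr\|^{2}
\]
is nonincreasing along trajectories, where \(x^{*}\) is a minimizer and \(g^{*}=g(x^{*})\); this immediately yields \(g(x(t))-g^{*}\le \mathcal{E}(t_0)/t^{2}\) for \(t\ge t_0\).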

Moreover, we show that a trajectory generated by the dynamical system converges to a minimizer of the objective function.

Authors

Cristian Daniel Alecsa
Department of Mathematics, Babes-Bolyai University, Cluj-Napoca, Romania
Tiberiu Popoviciu Institute of Numerical Analysis, Romanian Academy, Cluj-Napoca, Romania

Szilard Csaba Laszlo
Technical University of Cluj-Napoca, Department of Mathematics, Cluj-Napoca, Romania

Titus Pinta
Mathematical Institute, University of Oxford, Oxford, England

Keywords

Unconstrained optimization problems; convex optimization; heavy ball method; continuous second order dynamical system; convergence rate; inertial algorithm; Lyapunov energy functions.

Paper coordinates

C.-D. Alecsa, S.C. Laszlo, T. Pinta, An extension of the second order dynamical system that models Nesterov’s convex gradient method, Appl. Math. Optim., 84 (2021), pp. 1687–1716. doi: 10.1007/s00245-020-09692-1

PDF

About this paper

Publisher Name

Springer

Print ISSN

0095-4616

Online ISSN

1432-0606

Google Scholar link

