A mixed iteration for nonnegative matrix factorizations

Abstract

We show that, under appropriate conditions, one can create a hybrid between two given iterations which can perform better than either of the original ones. This fact provides a freedom of choice. We also give numerical examples in which we compare our hybrid with the dedicated Lee–Seung iteration.

Authors

S.M. Soltuz
(Tiberiu Popoviciu Institute of Numerical Analysis, Romanian Academy)

B.E. Rhoades

Keywords

Non-negative matrix factorization; Lee–Seung iteration.


Paper coordinates

Ş.M. Şoltuz, B.E. Rhoades, A mixed iteration for nonnegative matrix factorizations, Appl. Math. Comput., 219 (2013) no. 18, 9847-9855.
doi: 10.1016/j.amc.2013.03.124


About this paper

Print ISSN

0096-3003


Paper (preprint) in HTML form

A mixed iteration for nonnegative matrix factorizations

Ştefan M. Şoltuz a,b,∗, B.E. Rhoades c
a Dawson College, Mathematics Department, 3040 Sherbrooke Street West, Westmount (Montreal), Quebec H3Z 1A4, Canada
b Tiberiu Popoviciu Institute of Numerical Analysis, Cluj-Napoca, Romania
c Indiana University, Mathematics Department, Bloomington, IN, USA
Abstract

We show that, under appropriate conditions, one can create a hybrid between two given iterations which can perform better than either of the original ones. This fact provides a freedom of choice. We also give numerical examples in which we compare our hybrid with the dedicated Lee-Seung iteration.

ARTICLE INFO

Keywords:

Non-negative matrix factorization
Lee-Seung iteration

1. Introduction

1.1. Source separation methods. The non-negative matrix factorization (NMF)

Single-channel source separation problems arise when a number of sources emit signals that are mixed and recorded by a single sensor, and one is interested in estimating the original sources based on the recorded mixture. This problem is ill-posed. Several different single-channel source separation methods have been used. One method is the autoregressive (AR) model, which captures temporal correlations of the sources, as shown in [2,3]. In these papers it was proved that, for a single-channel mixture of stationary (AR) sources, the (AR) coefficients can be uniquely identified and the sources separated. For non-stationary (AR) sources, an adaptive sliding window was introduced to update the process. A complete study of (AR) models can be found in [4,5].

Let $M(m,n)$ denote the collection of $m\times n$ matrices with nonnegative entries. For a given matrix $V\in M(m,n)$, the nonnegative matrix factorization (NMF) procedure is used to find matrices $W\in M(m,r)$ and $H\in M(r,n)$ such that $V=WH$. Matrix factorization has many applications; for example, it is used in source separation, dimensionality reduction, and clustering. In using (NMF) it is necessary to compute

\arg\min_{W,H}\frac{1}{2}\|V-WH\|^{2}. \qquad (1)

An excellent survey on (NMF) and matrix factorization can be found in [15].
Other factorization methods in use are Vector Quantization (VQ), Principal Component Analysis (PCA) and Independent Component Analysis (ICA). These can all be written in the form $V\approx WH$. The differences between these methods and (NMF) are due to the different constraints placed on the factoring matrices. In the (VQ) method the columns of $H$ are constrained to be unary vectors (i.e., all components are zero except for one element equal to 1). In the (PCA) procedure the columns of $W$ and the rows of $H$ must be orthogonal. In the (ICA) procedure the rows of $H$ are maximally statistically independent.

A major problem with the (PCA) procedure is that it allows the basis vectors to have both positive and negative components, and the data are represented by linear combinations of these vectors. In some applications, the presence of negative components contradicts the physical reality. For example, the pixels in a gray-scale image must have nonnegative entries, so any image with negative intensities would have no reasonable interpretation. Likewise, the STFT magnitude is given by nonnegative quantities. The (NMF) procedure was developed in an attempt to address this problem (see [1,6,7,11,18,19]).

∗ Corresponding author at: Dawson College, Mathematics Department, 3040 Sherbrooke Street West, Westmount (Montreal), Quebec H3Z 1A4, Canada. E-mail addresses: smsoltuz@gmail.com (Ş.M. Şoltuz), rhoades@indiana.edu (B.E. Rhoades).

Two iterative (NMF) methods are in common use: Alternating Least Squares (ALS) and that of Lee-Seung (LS); the reader may consult [8-13].

1.2. Multiplicative rule. The pointwise product

Consider two functions $F:\mathbb{R}^{M\times N}\rightarrow\mathbb{R}$ and $G:\mathbb{R}^{M\times N}\times\mathbb{R}^{M\times N}\rightarrow\mathbb{R}$.
Definition 1 [9]. One says that $G(H,H^{\prime})$ is an auxiliary function for $F(H)$ if the conditions

G\left(H,H^{\prime}\right)\geqslant F(H),\quad G(H,H)=F(H), \qquad (2)

are satisfied.
In [9] the following quantity is considered as the cost function

F(H)=\frac{1}{2}\sum_{i}\left(V_{i}-\sum_{a}W_{ia}H_{a}\right)^{2}, \qquad (3)

where $1\leqslant i\leqslant m$, $1\leqslant a\leqslant r$, $V_{i}$ is the $i$th row of $V$ (of dimension $(1,n)$), $H_{a}$ is the $a$th row of $H$ (of dimension $(1,n)$), and $H=\left(H_{a}\right)_{1\leqslant a\leqslant r}$.
Lemma 2 [9]. If $G$ is an auxiliary function, then $F$ is nonincreasing under the update

H_{n+1}=\arg\min_{H}G\left(H,H_{n}\right).

The Taylor expansion for $F$ in (3) leads to

F(H)=F\left(H_{n}\right)+\left(H-H_{n}\right)\nabla F\left(H_{n}\right)+\frac{1}{2}\left(H-H_{n}\right)^{T}\left(W_{n}^{T}W_{n}\right)\left(H-H_{n}\right).

Leaving both $H$ and $H_{n}$ as variables, one obtains as an auxiliary function for $F$

G_{\text{Taylor}}\left(H,H_{n}\right)=F\left(H_{n}\right)+\left(H-H_{n}\right)\nabla F\left(H_{n}\right)+\frac{1}{2}\left(H-H_{n}\right)^{T}\left(W_{n}^{T}W_{n}\right)\left(H-H_{n}\right). \qquad (4)

The matrix $W_{n}^{T}W_{n}$ is positive semidefinite. Moreover, as noted in [9], the difference between

K_{n}=\operatorname{diag}\left(\operatorname{diag}\left(W_{n}^{T}W_{n}H_{n}^{T}./H_{n}^{T}\right)\right) \qquad (5)

and $W_{n}^{T}W_{n}$, that is,

K_{n}-\left(W_{n}^{T}W_{n}\right),

remains positive semidefinite. (In Matlab, "./" denotes pointwise division between matrices and "diag" returns the diagonal of a matrix as a vector; we apply it twice in order to keep the dimensionality right.) Hence, a quantity similar to the above Taylor expansion (4),

G\left(H,H_{n}\right)=F\left(H_{n}\right)+\left(H-H_{n}\right)\nabla F\left(H_{n}\right)+\frac{1}{2}\left(H-H_{n}\right)^{T}K_{n}\left(H-H_{n}\right), \qquad (6)

satisfies the conditions of an auxiliary function for the above $F$; see [9]. In order to update $H_{n+1}$ at each step, the arg min is involved. It is known that, under appropriate conditions on $\Phi,\Psi$, the quadratic form $F$ (with $\Phi$ a positive definite matrix),

F(x)=x^{T}\Phi x+\Psi x+\Theta \qquad (7)

attains its minimum at

\bar{x}=\Phi^{-1}\Psi.

Thus, by setting (6) into (7),

\Phi:=K_{n}\left(=\operatorname{diag}\left(\operatorname{diag}\left(W_{n}^{T}W_{n}H_{n}^{T}./H_{n}^{T}\right)\right)\right),
\Psi:=\nabla F\left(H_{n}\right),
\Theta:=F\left(H_{n}\right),

we obtain $\bar{x}=H_{n+1}-H_{n}$, and therefore the new $H_{n+1}$ is given by

H_{n+1}-H_{n}=\arg\min_{H}G\left(H,H_{n}\right), \qquad (8)
H_{n+1}-H_{n}=K_{n}^{-1}\nabla F\left(H_{n}\right),
H_{n+1}=H_{n}+K_{n}^{-1}\nabla F\left(H_{n}\right).

This leads to the Lee-Seung iterative (multiplicative) method.
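To make the construction concrete, here is a minimal Matlab sketch of the additive update (8) for a single column $h$ of $H$, the per-column form in which $K_{n}$ of (5) is an $r\times r$ diagonal matrix; $W$, the corresponding column $v$ of $V$, and a positive $h$ are assumed to be given.

```matlab
% One additive update (8) for a single column h of H (r-by-1, positive),
% with W (m-by-r) and the corresponding column v of V (m-by-1) given.
K = diag((W'*W*h) ./ h);          % K_n of (5), per-column diagonal form
h = h + K \ ((W'*v) - (W'*W*h));  % update (8); algebraically equal to
                                  % h .* (W'*v) ./ (W'*W*h), the (LS) rule
```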

1.3. The Lee-Seung multiplicative rule (LS)

More specifically, for (8), the matrix $H_{n+1}$ is given by

H_{n+1}=H_{n}-\frac{H_{n}}{\left(W_{n}^{T}W_{n}H_{n}\right)}\left(\left(W_{n}^{T}W_{n}H_{n}\right)-\left(W_{n}^{T}V\right)\right).

Basically, the rule is to reduce the "distance" (i.e., the cost function) by choosing, at each step, appropriate $\eta_{n}$ and $\gamma_{n}$:

H_{n+1}=H_{n}+\eta_{n}\left(\left(W_{n}^{T}V\right)-\left(W_{n}^{T}W_{n}H_{n}\right)\right), \qquad (9)
W_{n+1}=W_{n}+\left(\left(VH_{n+1}^{T}\right)-\left(W_{n}H_{n+1}H_{n+1}^{T}\right)\right)\gamma_{n}.

The heart of each iterative method consists in the choice of such $\eta_{n}$ and $\gamma_{n}$. Specifically, at step $n$, in [9], each matrix is given pointwise by

\eta_{ab}=\frac{H_{nab}}{\left(W_{n}^{T}W_{n}H_{n}\right)_{ab}},\quad\gamma_{ab}=\frac{W_{nab}}{\left(W_{n}H_{n+1}H_{n+1}^{T}\right)_{ab}}, \qquad (10)

where $(\cdot)_{ab}$ gives the location (row and column) within the matrix. Substituting (10) into (9) leads to Hadamard, or pointwise, multiplication, denoted by "$\cdot$". Eventually, the following iteration method is obtained:

H_{n+1}=H_{n}\cdot\frac{\left(W_{n}^{T}V\right)}{\left(W_{n}^{T}W_{n}H_{n}\right)}, \qquad (11)
W_{n+1}=W_{n}\cdot\frac{\left(VH_{n+1}^{T}\right)}{\left(W_{n}H_{n+1}H_{n+1}^{T}\right)}.

It was reported in [11,13] that the convergence result from [9] does not actually provide sufficient conditions for the convergence of (11). A new, more stable iteration based on (11) was presented and its convergence was studied.

Remark 3. "Lin's modification" for (11), from [11], consists in adding a "small" positive quantity (e.g., $\delta=10^{-9}$), such that the Lee-Seung iteration becomes:

H_{n+1}=H_{n}\cdot\frac{\left(W_{n}^{T}V\right)}{\left(W_{n}^{T}W_{n}H_{n}\right)+\delta}, \qquad (12)
W_{n+1}=W_{n}\cdot\frac{\left(VH_{n+1}^{T}\right)}{\left(W_{n}H_{n+1}H_{n+1}^{T}\right)+\delta}.

For this new method, $\eta_{n}$ and $\gamma_{n}$ were set to be:

\eta_{ab}=\frac{H_{nab}}{\left(W_{n}^{T}W_{n}H_{n}\right)_{ab}+\delta},\quad\gamma_{ab}=\frac{W_{nab}}{\left(W_{n}H_{n+1}H_{n+1}^{T}\right)_{ab}+\delta}.
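In Matlab, one sweep of (12) is a pair of elementwise updates; the following minimal sketch assumes nonnegative $V$, $W$, $H$ of compatible sizes are already in the workspace.

```matlab
% One sweep of the Lee-Seung updates with Lin's modification (12);
% V is m-by-n, W is m-by-r, H is r-by-n, all nonnegative.
delta = 1e-9;                             % the "small" quantity from [11]
H = H .* (W'*V) ./ (W'*(W*H) + delta);    % update H
W = W .* (V*H') ./ (W*(H*H') + delta);    % then update W with the new H
```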

1.4. The alternating least squares method (ALS)

In (9) set

\eta_{n}=\left(W_{n}^{T}W_{n}\right)^{-1}

and, alternatively,

\gamma_{n}=\left(H_{n+1}H_{n+1}^{T}\right)^{-1},

to obtain,

H_{n+1}=\left(W_{n}^{T}W_{n}\right)^{-1}\left(W_{n}^{T}V\right), \qquad (13)
W_{n+1}=VH_{n+1}^{T}\left(H_{n+1}H_{n+1}^{T}\right)^{-1}.

The problem with such an iteration is that one or both of $\left(W_{n}^{T}W_{n}\right)^{-1}$ and $\left(H_{n+1}H_{n+1}^{T}\right)^{-1}$ can have negative entries. In order to solve this problem one can consider the projection onto the nonnegative orthant, denoted by $P_{+}[\cdot]$. The above Alternating Least Squares iteration becomes

H_{n+1}=P_{+}\left(\left(W_{n}^{T}W_{n}\right)^{-1}\left(W_{n}^{T}V\right)\right), \qquad (14)
W_{n+1}=P_{+}\left(VH_{n+1}^{T}\left(H_{n+1}H_{n+1}^{T}\right)^{-1}\right).

The convergence of such an iteration is described in [7]. We remark that a quantity such as $\left(A^{T}A\right)^{-1}A^{T}$ is similar to the projection operator obtained for the least squares method (see, for example, [14]). This method was reported to be very fast but unstable (see also our experiments and [17]). In [17], as well as here, the aim is to obtain a new iteration which has the convergence speed of (ALS) and the stability of (LS).
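A minimal Matlab sketch of one sweep of the projected iteration (14) follows; the backslash and slash operators solve the corresponding normal equations rather than forming explicit inverses, and $P_{+}$ is realized with max(·,0).

```matlab
% One sweep of the projected (ALS) iteration (14);
% V is m-by-n, W is m-by-r, H is r-by-n.
H = max((W'*W) \ (W'*V), 0);   % H = P_+((W'W)^{-1} W'V)
W = max((V*H') / (H*H'), 0);   % W = P_+(V H' (HH')^{-1})
```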

Typical convergence and divergence behaviors are indicated in Fig. 1, where the matrix $V$ is generated by real audio data: $V$ is the spectrogram of the audio signal A3_whistle.wav¹, with the frame length of the FFT set to 512. Hence, the dimension of $\mathbf{V}$ was $m=512$, $n=174$. Both $\mathbf{W}$ and $\mathbf{H}$ were randomly initialized. The rank $r$ was set to 7, and a 50% overlap between the windows was used for generating the spectrogram. All three algorithms were applied to decompose the music notes from the audio signal.

¹ Available at www.ee.surrey.ac.uk/Personal/W.Wang/demondata.html.

1.5. The hybrid method

In (9) insert $\eta_{n}=\left(W_{n}^{T}W_{n}\right)^{-1}$ to obtain the iteration

H_{n+1}=H_{n}+\left(W_{n}^{T}W_{n}\right)^{-1}\left(\left(W_{n}^{T}V\right)-\left(W_{n}^{T}W_{n}H_{n}\right)\right). \qquad (15)

A "general" iteration method for such " Hn+1H_{n+1} " as in (9) would have the following structure

X_{n+1}=X_{n}+K\left(Y_{n}\right)^{-1}\left(\left(Y_{n}^{T}V\right)-\left(Y_{n}^{T}Y_{n}X_{n}\right)\right), \qquad (16)
Y_{n+1}=Y_{n}+\left(VX_{n+1}^{T}-Y_{n}X_{n+1}X_{n+1}^{T}\right)N\left(X_{n+1}\right)^{-1},

or,

H_{n+1}=H_{n}+A\left(W_{n}\right)^{-1}\left(W_{n}^{T}V-W_{n}^{T}W_{n}H_{n}\right), \qquad (17)
W_{n+1}=W_{n}+\left(VH_{n+1}^{T}-W_{n}H_{n+1}H_{n+1}^{T}\right)B\left(H_{n+1}\right)^{-1},

where $K\left(Y_{n}\right)$ is a surrogate for $A\left(W_{n}\right)$ which may be different at each step. We introduce $K$ to be a nonnegative matrix "close enough" to $A\left(W_{n}\right)$ so as to avoid the errors introduced by using a projection onto the nonnegative orthant. Note that $Y_{n}$ is different from $W_{n}$ at each step; but, if $Y_{n}$ is close enough to $W_{n}$, then $X_{n}$ behaves analogously to $H_{n}$. We have introduced within (16) and (17) two "general" iterations. The study will cover, in this manner, all possible choices for the $K,N,A$ and $B$ within (16) and (17). Appropriate settings for $K$ and $N$ in (16) will lead to the (LS) iteration; eventually we will compare it with (ALS), obtained when $A$ and $B$ are suitably chosen within (17).

Remark 4. Set $A\left(W_{n}\right)^{-1}=\left(\operatorname{diag}\left(\operatorname{diag}\left(W_{n}^{T}W_{n}H_{n}^{T}./H_{n}^{T}\right)\right)\right)^{-1}$ and $B\left(H_{n+1}\right)^{-1}=\left(\operatorname{diag}\left(\operatorname{diag}\left(W_{n}H_{n+1}H_{n+1}^{T}./W_{n}\right)\right)\right)^{-1}$ to obtain the Lee-Seung iteration (11). As we can see, $A\left(W_{n}\right)$ changes at each step, even within (15). Set $A\left(W_{n}\right)^{-1}=\left(W_{n}^{T}W_{n}\right)^{-1}$ and $B\left(H_{n+1}\right)^{-1}=\left(H_{n+1}H_{n+1}^{T}\right)^{-1}$ to obtain the (ALS) iteration.

Our main purposes are the following: first, to be able to mix the (LS) and (ALS) iterations in order to obtain a better one; second, to show that "structurally" the two algorithms are not very different, and to provide the mathematical background for such a hybrid to converge.
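As an illustration of what such a mixture can look like in practice, here is a minimal Matlab sketch of one hybrid sweep for $H$: it attempts the fast (ALS) step and falls back to the stable (LS) step when the fast step is unreliable. The switching rule (a cost test plus a conditioning check via rcond) is our illustrative choice, not a prescription of the analysis below.

```matlab
% One illustrative hybrid sweep for H: try the fast (ALS) step;
% if it is unstable, fall back to the (LS) multiplicative step (12).
% V, W, H are assumed given, nonnegative, of compatible sizes.
cost = @(W,H) 0.5 * max(max(abs(V - W*H)))^2;  % sup-norm cost, cf. Section 2
c0   = cost(W, H);
Hals = max((W'*W) \ (W'*V), 0);                % tentative (ALS) step, cf. (14)
if rcond(W'*W) > 1e-12 && cost(W, Hals) <= c0
    H = Hals;                                  % accept the fast step
else
    H = H .* (W'*V) ./ (W'*(W*H) + 1e-9);      % stable (LS) fallback, cf. (12)
end
```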

2. Main results

2.1. Convergence of the hybrid method

Recall the following Lemma:
Lemma 5 [16]. Let $\{a_{n}\}$ be a nonnegative sequence that satisfies

a_{n+1}\leqslant(1-w)a_{n}+\sigma_{n}\mathbf{M},

where $w\in(0,1)$ and $\mathbf{M}>0$ are fixed numbers and $\{\sigma_{n}\}$ is a nonnegative sequence which converges to zero. Then $\lim_{n\rightarrow\infty}a_{n}=0$.
The result remains true provided that the coefficient of $a_{n}$ stays within an interval contained in $(0,1)$.
Proposition 6. Let $\{a_{n}\}$ be a nonnegative sequence that satisfies

a_{n+1}\leqslant\lambda_{n}a_{n}+\sigma_{n}\mathbf{M},\quad\forall n\geqslant n_{0},

where $\mathbf{M}>0$ is a fixed number and $\{\lambda_{n}\}$, $\{\sigma_{n}\}$ are nonnegative sequences such that $\{\lambda_{n}\}\subset(0,\Lambda)$ for some $\Lambda<1$ and $\lim_{n\rightarrow\infty}\sigma_{n}=0$. Then $\lim_{n\rightarrow\infty}a_{n}=0$.

Proof. Note that, for each $n\in\mathbb{N}$, $\lambda_{n}\leqslant\Lambda$. By setting $1-w=\Lambda$ (so that $w=1-\Lambda\in(0,1)$), we get $a_{n+1}\leqslant\Lambda a_{n}+\sigma_{n}\mathbf{M}$, and the result follows from Lemma 5.

Remark 7. If $\lambda_{n}>1$ for each $n$, then Proposition 6 fails. As an example, choose $a_{n}=n$, $\lambda_{n}=2$ and $\sigma_{n}=0$ for each $n$. Then $n+1=a_{n+1}\leqslant\lambda_{n}a_{n}+\sigma_{n}\mathbf{M}=2n$. Proposition 6 remains true if $\lambda_{n}>1$ for only a finite subset of $\mathbb{N}$.

For the sake of simplicity, throughout this paper we shall consider the sup-norm for all matrices involved. Within Matlab one uses the "max(max(…))" command.
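In code, this choice of norm is just the entrywise maximum in absolute value; a one-line helper (our naming) used in the checks below could be:

```matlab
% Sup-norm of a matrix: largest entry in absolute value.
supnorm = @(X) max(max(abs(X)));
```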

Theorem 8. If iteration (17) converges, i.e. $\lim_{n\rightarrow\infty}H_{n}=H^{*}$, $\lim_{n\rightarrow\infty}W_{n}=W^{*}$ and $\lim_{n\rightarrow\infty}Y_{n}=W^{*}$, and there exist $\lambda\in(0,1)$ and $\mathbf{M}>0$ such that for each step $n$ the following relations are satisfied:

\left\|I_{r,r}-K\left(Y_{n}\right)^{-1}W_{n}^{T}W_{n}\right\|\leqslant\lambda<1, \qquad (18)
\max\left\{\sup_{n}\left\{\left\|K\left(Y_{n}\right)^{-1}\right\|,\left\|A\left(W_{n}\right)^{-1}\right\|\right\}\right\}\leqslant\mathbf{M},

then iteration (16) is also convergent; i.e. $\lim_{n\rightarrow\infty}X_{n}=H^{*}$.

Proof. Define $M_{n}=Y_{n}-W_{n}$. Note that

X_{n+1}=X_{n}+K\left(Y_{n}\right)^{-1}\left(Y_{n}^{T}V-Y_{n}^{T}Y_{n}X_{n}\right) \qquad (19)
=X_{n}+K\left(Y_{n}\right)^{-1}\left(\left(W_{n}^{T}+M_{n}^{T}\right)V-\left(W_{n}^{T}+M_{n}^{T}\right)\left(W_{n}+M_{n}\right)X_{n}\right)
=X_{n}+K\left(Y_{n}\right)^{-1}\left(W_{n}^{T}V-W_{n}^{T}W_{n}X_{n}\right)
\quad+K\left(Y_{n}\right)^{-1}\left(M_{n}^{T}V-\left(M_{n}^{T}W_{n}+W_{n}^{T}M_{n}+M_{n}^{T}M_{n}\right)X_{n}\right),

where $W_{n}$ is obtained from (17). Using (17) and (19) one obtains

H_{n+1}-X_{n+1}
=\left(H_{n}-X_{n}\right)+A\left(W_{n}\right)^{-1}\left(W_{n}^{T}V-W_{n}^{T}W_{n}H_{n}\right)-K\left(Y_{n}\right)^{-1}\left(W_{n}^{T}V-W_{n}^{T}W_{n}X_{n}\right)
\quad-K\left(Y_{n}\right)^{-1}\left(M_{n}^{T}V-\left(M_{n}^{T}W_{n}+W_{n}^{T}M_{n}+M_{n}^{T}M_{n}\right)X_{n}\right)
=\left(H_{n}-X_{n}\right)+A\left(W_{n}\right)^{-1}\left(W_{n}^{T}V-W_{n}^{T}W_{n}H_{n}\right)-K\left(Y_{n}\right)^{-1}\left(W_{n}^{T}V-W_{n}^{T}W_{n}H_{n}\right)
\quad+K\left(Y_{n}\right)^{-1}\left(W_{n}^{T}V-W_{n}^{T}W_{n}H_{n}\right)-K\left(Y_{n}\right)^{-1}\left(W_{n}^{T}V-W_{n}^{T}W_{n}X_{n}\right)
\quad-K\left(Y_{n}\right)^{-1}\left(M_{n}^{T}V-\left(M_{n}^{T}W_{n}+W_{n}^{T}M_{n}+M_{n}^{T}M_{n}\right)X_{n}\right)
=\left(H_{n}-X_{n}\right)+K\left(Y_{n}\right)^{-1}\left(W_{n}^{T}W_{n}X_{n}-W_{n}^{T}W_{n}H_{n}\right)+\left(A\left(W_{n}\right)^{-1}-K\left(Y_{n}\right)^{-1}\right)\left(W_{n}^{T}V-W_{n}^{T}W_{n}H_{n}\right)
\quad-K\left(Y_{n}\right)^{-1}\left(M_{n}^{T}V-\left(M_{n}^{T}W_{n}+W_{n}^{T}M_{n}+M_{n}^{T}M_{n}\right)X_{n}\right)
=\left(I_{r,r}-K\left(Y_{n}\right)^{-1}W_{n}^{T}W_{n}\right)\left(H_{n}-X_{n}\right)+\left(A\left(W_{n}\right)^{-1}-K\left(Y_{n}\right)^{-1}\right)\left(W_{n}^{T}V-W_{n}^{T}W_{n}H_{n}\right)
\quad-K\left(Y_{n}\right)^{-1}\left(M_{n}^{T}V-\left(M_{n}^{T}W_{n}+W_{n}^{T}M_{n}+M_{n}^{T}M_{n}\right)X_{n}\right).

Thus,

\left\|H_{n+1}-X_{n+1}\right\|
\leqslant\left\|I_{r,r}-K\left(Y_{n}\right)^{-1}W_{n}^{T}W_{n}\right\|\left\|H_{n}-X_{n}\right\|
\quad+\left\|A\left(W_{n}\right)^{-1}-K\left(Y_{n}\right)^{-1}\right\|\left\|W_{n}^{T}V-W_{n}^{T}W_{n}H_{n}\right\|
\quad+\left\|K\left(Y_{n}\right)^{-1}\right\|\left\|V-W_{n}X_{n}\right\|\left\|M_{n}^{T}\right\|
\quad+\left\|K\left(Y_{n}\right)^{-1}\right\|\left\|W_{n}^{T}+M_{n}^{T}\right\|\left\|M_{n}\right\|\left\|X_{n}\right\|.

We consider here the sup-norm, so that $\left\|I_{r,r}\right\|=1$. Denote

a_{n}=\left\|H_{n}-X_{n}\right\|,
\lambda_{n}=\left\|I_{r,r}-K\left(Y_{n}\right)^{-1}W_{n}^{T}W_{n}\right\|,
\sigma_{n}=\max\left\{\left\|W_{n}^{T}V-W_{n}^{T}W_{n}H_{n}\right\|,\left\|M_{n}^{T}\right\|,\left\|M_{n}\right\|\right\},
\mathbf{M}=\sup_{n}\left\{\left\|A\left(W_{n}\right)^{-1}-K\left(Y_{n}\right)^{-1}\right\|,\left\|K\left(Y_{n}\right)^{-1}\right\|,\left\|X_{n}\right\|\right\}.

Note that $\lambda_{n}\in(0,\Lambda)$ and $\sigma_{n}\rightarrow 0$; therefore, from Proposition 6, we obtain $\lim_{n\rightarrow\infty}a_{n}=\lim_{n\rightarrow\infty}\left\|H_{n}-X_{n}\right\|=0$. Using the inequality

\left\|X_{n}-H^{*}\right\|\leqslant\left\|H_{n}-H^{*}\right\|+\left\|H_{n}-X_{n}\right\|,

it follows that $\lim_{n\rightarrow\infty}X_{n}=H^{*}$.

Remark 9. Using Matlab it is easy to verify, at each step, the first condition of (18) by using max(max(eye(r,r) - inv(diag(diag((A'*A*B)./B)))*(A'*A))), where size(A) = (m,r) and size(B) = (r,n). The second condition of (18) simply demands an upper bound for the matrices involved in the process.
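Spelled out as a runnable fragment (with abs added, since the sup-norm is taken over the magnitudes of the entries; here A and B stand for $W_{n}$ and $H_{n}$):

```matlab
% Check the contraction condition of (18) at the current step;
% A is m-by-r (plays W_n), B is r-by-n (plays H_n).
r   = size(A, 2);
K   = diag(diag((A'*A*B) ./ B));              % LS-type surrogate, cf. (5)
lam = max(max(abs(eye(r,r) - K \ (A'*A))));   % ||I - K^{-1} W'W|| in sup-norm
ok  = lam < 1;                                % first condition of (18)
```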

2.2. Further results

In (17), set $A\left(W_{n}\right)=W_{n}^{T}W_{n}$ (respectively, $B\left(H_{n+1}\right)=H_{n+1}H_{n+1}^{T}$) to obtain the (ALS) method. The above theorem leads to the following result.

Corollary 10. If the iteration (15) converges (i.e. $\lim_{n\rightarrow\infty}H_{n}=H^{*}$, $\lim_{n\rightarrow\infty}W_{n}=W^{*}$ and $\lim_{n\rightarrow\infty}Y_{n}=W^{*}$), and there exist a $\lambda\in(0,1)$ and an $\mathbf{M}>0$ such that for each step $n$ the following relations are satisfied:

\left\|I_{r,r}-K\left(Y_{n}\right)^{-1}W_{n}^{T}W_{n}\right\|\leqslant\lambda<1,
\max\left\{\sup_{n}\left\{\left\|K\left(Y_{n}\right)^{-1}\right\|,\left\|\left(W_{n}^{T}W_{n}\right)^{-1}\right\|\right\}\right\}\leqslant\mathbf{M},

then the iteration (16) is also convergent; i.e. $\lim_{n\rightarrow\infty}X_{n}=H^{*}$.

Remark 11.

(a) In other words, this corollary claims that a hybrid is allowed, provided that appropriate assumptions are satisfied.
(b) Note that by considering the diagonal matrix

A\left(W_{n}\right)=\operatorname{diag}\left(\operatorname{diag}\left(W_{n}^{T}W_{n}H_{n}^{T}./H_{n}^{T}\right)\right),

(17) becomes the Lee-Seung iteration (with the Hadamard product).

Proposition 12. In Theorem 8 one can replace

\left\|I_{r,r}-K\left(Y_{n}\right)^{-1}W_{n}^{T}W_{n}\right\|\leqslant\lambda

with

\frac{1-\lambda}{\left\|W_{n}^{T}W_{n}\right\|}\leqslant\left\|K\left(Y_{n}\right)^{-1}\right\|.

Proof. Note that

1-\left\|K\left(Y_{n}\right)^{-1}\right\|\left\|W_{n}^{T}W_{n}\right\|\leqslant 1-\left\|K\left(Y_{n}\right)^{-1}W_{n}^{T}W_{n}\right\|\leqslant\left\|I_{r,r}-K\left(Y_{n}\right)^{-1}W_{n}^{T}W_{n}\right\|\leqslant\lambda

to obtain the conclusion.
By duality, one can consider the case in which $\lim_{n\rightarrow\infty}X_{n}=H^{*}$ to obtain the following.
Corollary 13. If the iteration (17) converges (i.e. $\lim_{n\rightarrow\infty}H_{n}=H^{*}$, $\lim_{n\rightarrow\infty}W_{n}=W^{*}$ and $\lim_{n\rightarrow\infty}X_{n}=H^{*}$), and there exist a $\lambda\in(0,1)$ and an $\mathbf{M}>0$ such that for each step $n$ the following relations are satisfied:

\left\|I_{r,r}-H_{n+1}H_{n+1}^{T}N\left(X_{n}\right)^{-1}\right\|\leqslant\lambda<1,
\max\left\{\sup_{n}\left\{\left\|N\left(X_{n}\right)^{-1}\right\|,\left\|B\left(H_{n+1}\right)^{-1}\right\|\right\}\right\}\leqslant\mathbf{M},

then the iteration (16) is also convergent; i.e. $\lim_{n\rightarrow\infty}Y_{n}=W^{*}$.
As in [9], the next step is to consider the second part of (9) with $B\left(H_{n+1}\right)=H_{n+1}H_{n+1}^{T}$; i.e.

W_{n+1}=W_{n}+\left(V-W_{n}H_{n+1}\right)H_{n+1}^{T}\left(H_{n+1}H_{n+1}^{T}\right)^{-1}

and the following quantity from (16),

Y_{n+1}=Y_{n}+\left(V-Y_{n}X_{n+1}\right)X_{n+1}^{T}N\left(X_{n}\right)^{-1}

to obtain a result similar to Corollary 10.
Corollary 14. If the iteration (15) converges (i.e. $\lim_{n\rightarrow\infty}H_{n}=H^{*}$, $\lim_{n\rightarrow\infty}W_{n}=W^{*}$ and $\lim_{n\rightarrow\infty}X_{n}=H^{*}$), and there exist a $\lambda\in(0,1)$ and an $\mathbf{M}>0$ such that for each step $n$ the following relations are satisfied:

\left\|I_{r,r}-H_{n+1}H_{n+1}^{T}N\left(X_{n}\right)^{-1}\right\|\leqslant\lambda<1,
\max\left\{\sup_{n}\left\{\left\|N\left(X_{n}\right)^{-1}\right\|,\left\|\left(H_{n+1}H_{n+1}^{T}\right)^{-1}\right\|\right\}\right\}\leqslant\mathbf{M},

then the iteration (16) is also convergent; i.e. $\lim_{n\rightarrow\infty}Y_{n}=W^{*}$.
A result similar to Proposition 12 also holds. For practitioners the changed condition may be more useful.


Remark 15. Analogously, one can replace at each step

\left\|I_{r,r}-H_{n+1}H_{n+1}^{T}N\left(X_{n}\right)^{-1}\right\|\leqslant\lambda

by

\frac{1-\lambda}{\left\|H_{n+1}H_{n+1}^{T}\right\|}\leqslant\left\|N\left(X_{n}\right)^{-1}\right\|.

3. Numerical examples. Convergence and divergence behaviors

In the first experiment we let $V$ denote the spectrogram of the audio signal C6_frenchhorn.wav², where the frame length of the FFT was set to 512. Thus the dimension of $V$ was $m=512$ and $n=174$. Both $\mathbf{W}$ and $\mathbf{H}$ were randomly initialized. The rank $r$ was set to 7, and a 50% overlap between the windows was used for generating the spectrogram. All three algorithms were applied to decompose the music notes from the audio signal. The convergence curves were averaged over 20 independent tests. The results are shown in Fig. 2.

In our second experiment, we generated $\mathbf{V}$ synthetically as the absolute value of a zero-mean Gaussian distributed random matrix, and initialized $\mathbf{W}$ and $\mathbf{H}$ in the same way. The dimensions of these matrices were set as $m=500$, $n=300$ and $r=7$. The Matlab program performed 20 independent random tests in which both $\mathbf{W}$ and $\mathbf{H}$ were kept the same for all three algorithms, and the evolution of the cost function was averaged over the 20 tests. Fig. 3 shows the behaviors of the proposed algorithm, as well as of the (LS) and (ALS) algorithms. When $r$ is increased to 13 or higher, the (ALS) algorithm becomes unstable, while the proposed algorithm still converges, even though its rate of convergence becomes slower than that of the (LS) algorithm.
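For orientation, a minimal Matlab sketch of this synthetic setup follows; the iteration count and the plain (LS) sweep inside the loop are our illustrative choices (the hybrid sweep sketched in Section 1.5 would be dropped in at the marked lines).

```matlab
% Synthetic setup of the second experiment: V = |N(0,1)| data.
m = 500; n = 300; r = 7; nIter = 200;        % nIter is our choice
V = abs(randn(m,n));
W = abs(randn(m,r));  H = abs(randn(r,n));
cost = zeros(nIter,1);
for k = 1:nIter
    % (LS) sweep shown here; replace with the hybrid sweep to compare.
    H = H .* (W'*V) ./ (W'*(W*H) + 1e-9);
    W = W .* (V*H') ./ (W*(H*H') + 1e-9);
    cost(k) = 0.5 * norm(V - W*H, 'fro')^2;  % cost (1)
end
semilogy(cost); xlabel('iteration'); ylabel('cost');
```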

Acknowledgement

The first author is indebted to Ioana C. Soltuz and Maria Soltuz for their constant support throughout this journey we made together. The authors are also indebted to a referee for carefully reading the paper and for making useful suggestions.

References

[1] R. Albright, J. Cox, D. Duling, A. Langville, C. Meyer, Algorithms, initializations and convergence for the nonnegative matrix factorization, NCSU Technical Report Math 81706, 2006, submitted for publication, URL http://meyer.math.ncsu.edu/Meyer/Abstracts/Publications.html.
[2] R. Balan, J. Rosca, A spectral power factorization, Siemens Corporate Research. Princeton, NJ, Tech. Rep. SCR-01-TR-703, Sep 2001.
[3] R. Balan, A. Jourjine, J. Rosca, AR processes and sources can be reconstructed from degenerate mixtures, in: Proc. International Conference on Independent Component Analysis and Blind Signal Separation (ICA), Jan. 1999, pp. 467-472.
[4] P.J. Brockwell, R. Davis, Time Series: Theory and Methods, Springer, 1991.
[5] P.J. Brockwell, R. Davis, Introduction to Time Series and Forecasting, Springer, 1996.
[6] A. Ben Hamza, D.J. Brady, Reconstruction of reflectance spectra using robust NMF, IEEE Trans. Signal Process. 54 (2006) 9.
[7] M. Berry, M. Browne, A. Langville, P. Pauca, R.J. Plemmons, Algorithms and applications for approximate nonnegative matrix factorization, Computational Statistics and Data Analysis, 2006.
[8] D. Donoho, V. Stodden, When does non-negative matrix factorization give a correct decomposition into parts?, in: Advances in Neural Information Processing Systems (NIPS), vol. 17.
[9] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, Adv. Neural Inf. Process. Syst. 13 (2) (2005) 556-562.
[10] Daniel D. Lee, H. Sebastian Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (6755) (1999) 788-791.
[11] C.J. Lin, On the convergence of multiplicative update algorithms for nonnegative matrix factorization, IEEE Trans. Neural Networks 18 (6) (2007) 1589-1596.
[12] C.J. Lin, Projected gradient methods for non-negative matrix factorization, Neural Comput. (2007). to be published.
[13] E.F. Gonzalez, Y. Zhang, Accelerating the Lee-Seung algorithm for non-negative matrix factorization, Dept. Comput. Appl. Math., Rice Univ., Houston, TX, Tech. Rep., 2005.
[14] G. Strang, Introduction to Applied Mathematics, Wellesley-Cambridge Press, 1986.
[15] M.N. Schmidt, Single-channel source separation using non-negative matrix factorization, Ph. D. Thesis, Technical University of Denmark, 2008.
[16] Ştefan M. Şoltuz, Sequences supplied by inequalities and applications, Revue d'Analyse Numérique et de Théorie de l'Approximation 29 (2) (2001) 207-212.
[17] Ştefan M. Şoltuz, W. Wang, P. Jackson, A hybrid iterative algorithm for nonnegative matrix factorization, in: Proc. 15th IEEE/SP Workshop on Statistical Signal Processing (SSP '09), Cardiff, 2009.
[18] W. Wang, X. Zou, Nonnegative matrix factorization based on projected nonlinear conjugate gradient algorithm, ICARN 2008.
[19] R. Zdunek, A. Cichocki, Nonnegative matrix factorization with quadratic programming, Neurocomputing 71 (2007) 2309-2320.
