Music signal separation by orthogonality and maximum-distance constrained nonnegative matrix factorization with target signal information


Kosuke Yagi (1), Yu Takahashi (2), Hiroshi Saruwatari (1), Kiyohiro Shikano (1), and Kazunobu Kondo (2)

(1) Nara Institute of Science and Technology, Nara, 630-0192, Japan
(2) Yamaha Corporate Research & Development Center, Shizuoka, 438-0192, Japan

Correspondence should be addressed to Kosuke Yagi ([email protected]. jp)

AES 45th International Conference, Helsinki, Finland, 2012 March 1-4

ABSTRACT

In this paper, we address the separation of multiple instrumental sources based on semi-supervised nonnegative matrix factorization (SNMF) and propose a new constrained SNMF. Recently, various types of SNMF have been proposed. In particular, we focus our attention on one type of SNMF that utilizes information on a priori bases. Indeed, this type of SNMF can achieve better separation performance. However, SNMF without any constraint between a priori bases and other bases often degrades separation performance. Thus, we propose a new SNMF that imposes a constraint between a priori bases and other bases. An experimental result shows the efficacy of the proposed constrained SNMF.

1. INTRODUCTION

In this paper, we address monaural music signal separation with a new semi-supervised nonnegative matrix factorization (NMF) algorithm.

Recently, source separation based on NMF [1] has been a very active area of research. NMF is a type of sparse representation algorithm that decomposes a nonnegative matrix into two nonnegative matrices. In particular, NMF has been used for source separation [2, 3] and automatic music transcription [4] in acoustic signal processing. NMF for acoustic signals has been studied actively and many extensions have been proposed, e.g., [2, 3, 5, 6]. Furthermore, semi-supervised approaches have been proposed, e.g., [7, 8]. Such semi-supervised approaches can eliminate the problem of spectral pattern clustering in blind approaches. In particular, a semi-supervised approach that considers a temporal structure achieves a good performance [8]. However, such semi-supervised techniques involve an inherent problem described below.

The following equation represents the decomposition of a simple semi-supervised NMF (SNMF) used in monaural music signal separation,

$$Y \approx FG + HU, \tag{1}$$

where Y is an observed nonnegative matrix, which represents the time-frequency power spectral components obtained via short-time discrete Fourier transform, and F, G, H and U are also nonnegative matrices. In addition, the matrices F and H are often called 'basis matrices,' which include bases (spectral patterns for acoustic signals) to represent the observed signal Y, and the matrices G and U are often called 'activation matrices,' which involve activation information for F and H. In SNMF, the matrices G, H and U are optimized under the condition that F is known in advance. After the decomposition, FG represents the spectra that correspond to the signal known in advance, and HU expresses the spectra of other sources. However, this decomposition holds even when bases in F and H are the same. As for acoustic signals, this indicates that similar spectral patterns appear in F and H simultaneously. This phenomenon often leads to the degradation of source separation performance.

To solve the above-mentioned problem, in this paper, we propose two types of constrained SNMF algorithms that introduce orthogonality and maximum distance constraints. These algorithms can prevent the formation of the same spectral pattern in F and H, and result in better source separation performance. In this study, we

conduct an evaluation experiment, showing that the proposed methods are superior to the conventional SNMF in separation performance.

2. SEMI-SUPERVISED NMF

2.1. Training of a priori bases

In SNMF, a priori spectral patterns (bases) are required to achieve source separation. This supervision can be constructed by arbitrary algorithms. Here, we suppose that a priori spectral patterns are constructed by NMF as

$$Y_{\mathrm{target}} \approx FQ, \tag{2}$$

where $Y_{\mathrm{target}} \in \mathbb{R}_{\ge 0}^{\Omega \times T_s}$ is an amplitude spectrogram of a specific target signal for training, $F \in \mathbb{R}_{\ge 0}^{\Omega \times K}$ is a nonnegative matrix that involves spectral patterns of the target signal as column vectors, and $Q \in \mathbb{R}_{\ge 0}^{K \times T_s}$ is a nonnegative matrix that corresponds to the activation of each spectrum. In addition, $\Omega$ is the number of frequency bins, $T_s$ is the number of frames of the training signal, and $K$ is the number of bases.

2.2. Update rules of semi-supervised NMF

In SNMF, the following decomposition is addressed under the condition that F is known in advance:

$$Y \approx FG + HU, \tag{3}$$

where $Y \in \mathbb{R}_{\ge 0}^{\Omega \times T}$ is an observed amplitude spectrogram, $G \in \mathbb{R}_{\ge 0}^{K \times T}$ is an activation matrix that corresponds to F, and $H \in \mathbb{R}_{\ge 0}^{\Omega \times L}$ and $U \in \mathbb{R}_{\ge 0}^{L \times T}$ are nonnegative matrices. Moreover, $L$ is the number of bases of H, and $T$ is the number of frames of the observed spectrogram. A cost function based on the Frobenius norm that can be used to achieve the decomposition is given by

$$J_{\mathrm{SNMF}} = \sum_{\omega,t} \Bigl\{ y_{\omega,t} - \Bigl( \sum_{k} f_{\omega,k}\, g_{k,t} + \sum_{l} h_{\omega,l}\, u_{l,t} \Bigr) \Bigr\}^{2}, \tag{4}$$

where $y_{\omega,t}$, $f_{\omega,k}$, $g_{k,t}$, $h_{\omega,l}$, and $u_{l,t}$ are the nonnegative entries of the matrices Y, F, G, H and U, respectively. Another cost function based on I-divergence is given by

$$\mathcal{J}_{\mathrm{SNMF}} = \sum_{\omega,t} \Bigl\{ y_{\omega,t} \log \frac{y_{\omega,t}}{\sum_{k} f_{\omega,k}\, g_{k,t} + \sum_{l} h_{\omega,l}\, u_{l,t}} - \Bigl( y_{\omega,t} - \Bigl( \sum_{k} f_{\omega,k}\, g_{k,t} + \sum_{l} h_{\omega,l}\, u_{l,t} \Bigr) \Bigr) \Bigr\}. \tag{5}$$

Similarly to Ref. [1], the minimization of the cost functions can be achieved by using an auxiliary function approach. The update rules for the Frobenius-norm-based cost function are

$$h_{\omega,l} \leftarrow h_{\omega,l}\, \frac{\sum_{t} y_{\omega,t}\, u_{l,t}}{\sum_{t} u_{l,t} \sum_{l'} h_{\omega,l'}\, u_{l',t} + \sum_{t} u_{l,t} \sum_{k'} f_{\omega,k'}\, g_{k',t}}, \tag{6}$$

$$u_{l,t} \leftarrow u_{l,t}\, \frac{\sum_{\omega} y_{\omega,t}\, h_{\omega,l}}{\sum_{\omega} h_{\omega,l} \sum_{l'} h_{\omega,l'}\, u_{l',t} + \sum_{\omega} h_{\omega,l} \sum_{k'} f_{\omega,k'}\, g_{k',t}}, \tag{7}$$

$$g_{k,t} \leftarrow g_{k,t}\, \frac{\sum_{\omega} y_{\omega,t}\, f_{\omega,k}}{\sum_{\omega} f_{\omega,k} \sum_{k'} f_{\omega,k'}\, g_{k',t} + \sum_{\omega} f_{\omega,k} \sum_{l'} h_{\omega,l'}\, u_{l',t}}. \tag{8}$$

Also, the update rules for the I-divergence-based cost function are

$$h_{\omega,l} \leftarrow h_{\omega,l}\, \frac{\sum_{t} y_{\omega,t}\, u_{l,t} \bigl( \sum_{k'} f_{\omega,k'}\, g_{k',t} + \sum_{l'} h_{\omega,l'}\, u_{l',t} \bigr)^{-1}}{\sum_{t} u_{l,t}}, \tag{9}$$

$$u_{l,t} \leftarrow u_{l,t}\, \frac{\sum_{\omega} y_{\omega,t}\, h_{\omega,l} \bigl( \sum_{k'} f_{\omega,k'}\, g_{k',t} + \sum_{l'} h_{\omega,l'}\, u_{l',t} \bigr)^{-1}}{\sum_{\omega} h_{\omega,l}}, \tag{10}$$

$$g_{k,t} \leftarrow g_{k,t}\, \frac{\sum_{\omega} y_{\omega,t}\, f_{\omega,k} \bigl( \sum_{k'} f_{\omega,k'}\, g_{k',t} + \sum_{l'} h_{\omega,l'}\, u_{l',t} \bigr)^{-1}}{\sum_{\omega} f_{\omega,k}}. \tag{11}$$

Ref. [1] will help the reader understand the derivation based on the auxiliary function approach.

2.3. Problem of SNMF

These SNMF algorithms have no constraint between F and H. Thus, there is no guarantee that H becomes completely different from F after the decomposition. Indeed, as illustrated in Fig. 1(c), the spectrogram of the signal separated without any constraint is incomplete. This is due to the simultaneous formation of similar spectral patterns in F and H. Since SNMF represents the observed spectrogram Y by using F, G, H and U, if spectral patterns in F and H are the same, the activations that correspond to those patterns are split between G and U, degrading the separation performance. To cope with this problem, we impose constraints on the relationship between F and H. The algorithms are explained in detail in the following sections.

3. PROPOSED ALGORITHMS
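As a reference point for the constrained algorithms, the unconstrained Frobenius-norm update rules of Eqs. (6)-(8) can be sketched in Python with NumPy as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names `snmf_frobenius` and `max_basis_similarity`, the random initialization, the small constant `EPS`, and the choice to recompute the model after each factor update are all assumptions made for illustration. The second helper is a simple diagnostic (also an assumption) for the problem of Sect. 2.3: it reports how close any learned basis in H comes to an a priori basis in F.

```python
import numpy as np

EPS = 1e-12  # small guard against division by zero (an assumption, not from the paper)


def snmf_frobenius(Y, F, n_other, n_iter=200, seed=0):
    """Approximate Y by F @ G + H @ U with F held fixed, following Eqs. (6)-(8)."""
    rng = np.random.default_rng(seed)
    n_bins, n_frames = Y.shape
    G = rng.random((F.shape[1], n_frames)) + EPS   # activations of the a priori bases
    H = rng.random((n_bins, n_other)) + EPS        # bases for the other sources
    U = rng.random((n_other, n_frames)) + EPS      # activations of H
    costs = []
    for _ in range(n_iter):
        X = F @ G + H @ U                          # current model of Y
        H *= (Y @ U.T) / (X @ U.T + EPS)           # Eq. (6)
        X = F @ G + H @ U
        U *= (H.T @ Y) / (H.T @ X + EPS)           # Eq. (7)
        X = F @ G + H @ U
        G *= (F.T @ Y) / (F.T @ X + EPS)           # Eq. (8)
        costs.append(float(np.sum((Y - F @ G - H @ U) ** 2)))  # Eq. (4)
    return G, H, U, costs


def max_basis_similarity(F, H):
    """Largest cosine similarity between any column of F and any column of H."""
    Fn = F / (np.linalg.norm(F, axis=0, keepdims=True) + EPS)
    Hn = H / (np.linalg.norm(H, axis=0, keepdims=True) + EPS)
    return float(np.max(Fn.T @ Hn))
```

After the decomposition, `F @ G` plays the role of the target estimate and `H @ U` that of the remaining sources. A value of `max_basis_similarity(F, H)` close to 1 indicates that H has learned a pattern that F already contains, which is the situation the constraints are meant to prevent.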

3.1. Overview

In this section, we propose two constrained SNMF algorithms. To prevent the simultaneous formation of similar spectral patterns in the matrices F and H, we impose a specific constraint between F and H. In this study, we investigate two types of constraints, namely the orthogonality and maximum distance constraints. These constraints correspond to the minimization of the similarities between the spectral patterns in F and those in H. Please note that such constraints are not in conflict with those in other constrained NMFs, e.g., [3]. That is, our algorithm can be combined with other types of constrained NMF. The algorithms are shown in detail in the following sections.

3.2. Orthogonality constrained SNMF

To make F different from H as much as possible, we impose the following constraint in addition to the cost function described in Sect. 2.2:

$$\| F^{\mathsf{T}} H \|_{\mathrm{Fr}}^{2}, \tag{12}$$

where the conditions $\sum_{\omega} f_{\omega,k}^{2} = 1$ and $\sum_{\omega} h_{\omega,l}^{2} = 1$ are applied. This constraint corresponds to the maximization of the orthogonality between the matrices F and H. In other words, it can be interpreted as the minimization of the similarities among all combinations of $f_{\omega,k}$ and $h_{\omega,l}$.

Now, the cost function based on the Frobenius norm criterion with the orthogonality constraint is given by

$$J_{\mathrm{CSNMF}} = J_{\mathrm{SNMF}} + \mu_o \sum_{\omega,k,l} f_{\omega,k}^{2}\, h_{\omega,l}^{2}, \tag{13}$$

where $\mu_o$ is the weighting parameter. On the basis of this cost function, the update rule for $h_{\omega,l}$ can be derived as

$$h_{\omega,l} \leftarrow h_{\omega,l}\, \frac{\sum_{t} y_{\omega,t}\, u_{l,t}}{\sum_{t} u_{l,t} \sum_{l'} h_{\omega,l'}\, u_{l',t} + \sum_{t} u_{l,t} \sum_{k'} f_{\omega,k'}\, g_{k',t} + 2 \mu_o h_{\omega,l} \sum_{k} f_{\omega,k}^{2}}. \tag{14}$$

Similarly, the update rule for $h_{\omega,l}$ based on I-divergence with the orthogonality constraint is expressed as

$$h_{\omega,l} \leftarrow h_{\omega,l}\, \frac{\sum_{t} y_{\omega,t}\, u_{l,t} \bigl( \sum_{k'} f_{\omega,k'}\, g_{k',t} + \sum_{l'} h_{\omega,l'}\, u_{l',t} \bigr)^{-1}}{2 \mu_o h_{\omega,l} \sum_{k} f_{\omega,k}^{2} + \sum_{t} u_{l,t}}. \tag{15}$$

The update rules for $g_{k,t}$ and $u_{l,t}$ are the same as those described in Sect. 2.2.

Here, the question is whether we can instead solve the following problem to make F and H completely different:

$$\text{minimize } J_{\mathrm{SNMF}} \quad \text{subject to} \quad F^{\mathsf{T}} H = O. \tag{16}$$

The update rules for this problem can be derived; for instance, Ref. [9] would help the reader solve this problem. However, note that F and H are both nonnegative. Thus, if $f_{\omega,k}$ is a nonzero element, $h_{\omega,l}$ must always be zero under the condition $F^{\mathsf{T}} H = O$. As a result, the degree of freedom for H decreases drastically. Therefore, the severe condition $F^{\mathsf{T}} H = O$ reduces the overall decomposition performance for the semi-supervised problem we consider, and leads to poor separation performance.

3.3. Maximum-distance constrained SNMF

In this section, we propose another type of constraint for SNMF to make H different from F as much as possible. Unlike the constraint described in Sect. 3.2, we impose the following I-divergence-based constraint:

$$\sum_{\omega,k,l} \Bigl\{ f_{\omega,k} \log \frac{f_{\omega,k}}{h_{\omega,l}} - ( f_{\omega,k} - h_{\omega,l} ) \Bigr\}. \tag{17}$$

This is the constraint for minimizing the similarities of all combinations of $f_{\omega,k}$ and $h_{\omega,l}$ by maximizing the distances among all combinations of $f_{\omega,k}$ and $h_{\omega,l}$. By considering this constraint, the cost function based on the Frobenius norm criterion to be minimized becomes

$$J_{\mathrm{CSNMF}} = J_{\mathrm{SNMF}} - \mu_m \sum_{\omega,k,l} \Bigl\{ f_{\omega,k} \log \frac{f_{\omega,k}}{h_{\omega,l}} - ( f_{\omega,k} - h_{\omega,l} ) \Bigr\}, \tag{18}$$

where $\mu_m$ is the weighting parameter. On the basis of this cost function, the update rule for $h_{\omega,l}$ is given by

$$h_{\omega,l} \leftarrow h_{\omega,l}\, \frac{\sum_{t} y_{\omega,t}\, u_{l,t} + \mu_m \sum_{k} 1}{\sum_{t} u_{l,t} \sum_{l'} h_{\omega,l'}\, u_{l',t} + \sum_{t} u_{l,t} \sum_{k'} f_{\omega,k'}\, g_{k',t} + \mu_m h_{\omega,l}^{-1} \sum_{k} f_{\omega,k}}. \tag{19}$$

The update rules for $u_{l,t}$ and $g_{k,t}$ are the same as those described in Sect. 2.2. Similarly, the update rule for $h_{\omega,l}$ based on the I-divergence criterion with the maximum distance constraint is derived as

$$h_{\omega,l} \leftarrow h_{\omega,l}\, \frac{\sum_{t} y_{\omega,t}\, u_{l,t} \bigl( \sum_{k'} f_{\omega,k'}\, g_{k',t} + \sum_{l'} h_{\omega,l'}\, u_{l',t} \bigr)^{-1} + \mu_m \sum_{k} 1}{\sum_{t} u_{l,t} + \mu_m h_{\omega,l}^{-1} \sum_{k} f_{\omega,k}}. \tag{20}$$
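To make the two I-divergence-based constrained rules concrete, the sketch below applies one update of H per call, reading Eq. (15) as the plain I-divergence update with an extra term 2 mu_o h sum_k f^2 in the denominator, and Eq. (20) as the same update with mu_m K added to the numerator and mu_m (sum_k f) / h added to the denominator. It is a hedged illustration only: the function names, the constant `EPS`, and the matrix form of the sums are assumptions, and the updates of G and U (unchanged from Sect. 2.2) are omitted.

```python
import numpy as np

EPS = 1e-12  # numerical guard against division by zero (an assumption)


def update_h_orthogonality(Y, F, G, H, U, mu_o):
    """One multiplicative update of H following Eq. (15) (I-divergence, orthogonality)."""
    ratio = Y / (F @ G + H @ U + EPS)                # y / model, elementwise
    numer = ratio @ U.T                              # sum_t y u / model
    f_sq = np.sum(F ** 2, axis=1, keepdims=True)     # sum_k f^2, one value per frequency bin
    denom = U.sum(axis=1)[None, :] + 2.0 * mu_o * H * f_sq
    return H * numer / (denom + EPS)


def update_h_max_distance(Y, F, G, H, U, mu_m):
    """One multiplicative update of H following Eq. (20) (I-divergence, maximum distance)."""
    ratio = Y / (F @ G + H @ U + EPS)
    n_prior = F.shape[1]                             # sum_k 1 equals the number of a priori bases K
    numer = ratio @ U.T + mu_m * n_prior
    f_sum = np.sum(F, axis=1, keepdims=True)         # sum_k f, one value per frequency bin
    denom = U.sum(axis=1)[None, :] + mu_m * f_sum / (H + EPS)
    return H * numer / (denom + EPS)
```

With a weight of zero both functions reduce to the unconstrained rule of Eq. (9). With `mu_o > 0` the orthogonality update can only shrink an entry of H relative to the unconstrained update, and it shrinks it more in frequency bins where the a priori bases carry more energy.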

These update rules seem unstable because the term $\sum_{k} 1$ increases the norm of $h_{\omega,l}$ in every iteration. Here, the problem is removed by normalizing the norm of $h_{\omega,l}$ in every iteration. However, to remove this unnecessary term, the following KL-divergence-based constraint is introduced:

$$\sum_{\omega,k,l} f_{\omega,k} \log \frac{f_{\omega,k}}{h_{\omega,l}}. \tag{21}$$

Considering this constraint, the update rule for $h_{\omega,l}$ based on the Frobenius norm criterion is rewritten as

$$h_{\omega,l} \leftarrow h_{\omega,l}\, \frac{\sum_{t} y_{\omega,t}\, u_{l,t}}{\sum_{t} u_{l,t} \sum_{l'} h_{\omega,l'}\, u_{l',t} + \sum_{t} u_{l,t} \sum_{k'} f_{\omega,k'}\, g_{k',t} + \mu_m h_{\omega,l}^{-1} \sum_{k} f_{\omega,k}}. \tag{22}$$

Similarly, using the KL-divergence-based constraint can remove the term $\sum_{k} 1$. Also, the update rule for $h_{\omega,l}$ based on the I-divergence criterion [1] with the KL-divergence-based constraint is given as

$$h_{\omega,l} \leftarrow h_{\omega,l}\, \frac{\sum_{t} y_{\omega,t}\, u_{l,t} \bigl( \sum_{k'} f_{\omega,k'}\, g_{k',t} + \sum_{l'} h_{\omega,l'}\, u_{l',t} \bigr)^{-1}}{\sum_{t} u_{l,t} + \mu_m h_{\omega,l}^{-1} \sum_{k} f_{\omega,k}}. \tag{23}$$

We can easily derive these update rules using Refs. [1, 9].

4. EXPERIMENTS

4.1. Experimental conditions

To confirm the effectiveness of the proposed algorithms, we apply SNMF and our proposed algorithms to the separation of multiple instrumental sources. We produced two types of monaural signals so that their input SNR is 0 dB. Some signals are a mixture of clarinet sound (Cl) and flute sound (Fl), horn sound (Hr) or piano sound (Pf), or a mixture of trumpet sound (Tp) and flute, horn or piano sound, and the other signals are a mixture of violin sound (Vn) and flute, horn or piano sound. These signals were artificially made by a MIDI synthesizer and involve no reverberation or vibrato. In our experiments, we consider the clarinet, trumpet and violin signals to be target signals, and the flute, horn and piano signals to be noise.

To conduct SNMF, we supposed that the target signals are known in advance. Moreover, we prepared training data, made by the MIDI synthesizer, to construct a priori bases of the signals. The training data is a clarinet, trumpet or violin signal containing two octaves of notes that cover all notes of the target signal in the observed signal. To construct a priori bases, we applied a basic NMF algorithm [1] to this signal. The number of a priori bases was 100. The NMF algorithm was iterated 1000 times for each signal, and it was confirmed that the cost function of NMF had sufficiently converged. The number of iterations for the semi-supervised NMF and constrained SNMF was 200, determined empirically. Moreover, the number of bases for the matrix H was 30. Furthermore, $\mu_o$ and $\mu_m$ were empirically determined.

As objective evaluation indexes, we used the signal-to-distortion ratio (SDR) defined in Ref. [10] and the maximum value of the normalized cross-correlation function; we also report subjective evaluation results. Now, the estimated signal $\hat{s}(t)$ is defined as

$$\hat{s}(t) = s_{\mathrm{target}}(t) + e_{\mathrm{interf}}(t) + e_{\mathrm{artif}}(t), \tag{24}$$

where $s_{\mathrm{target}}(t)$ is the allowed deformation of the target source, $e_{\mathrm{interf}}(t)$ is the allowed deformation of the sources that account for the interferences of the unwanted sources, and $e_{\mathrm{artif}}(t)$ is an 'artifact' term that may correspond to the artifacts of the separation algorithm, such as musical noise, or simply to the deformations induced by the separation algorithm that are not allowed. According to this, a formula for SDR is given by

$$\mathrm{SDR} = 10 \log_{10} \frac{\| s_{\mathrm{target}}(t) \|^{2}}{\| e_{\mathrm{interf}}(t) + e_{\mathrm{artif}}(t) \|^{2}}. \tag{25}$$

4.2. Experimental results

Figure 1 shows an example of the separation results obtained by SNMF without and with the orthogonality constraint. From Figs. 1(c) and (d), it can be confirmed that the constraint between F and H prevents the lack of the signal in the separation result. Also, Table 1 shows the average and disparity of the SDR results based on Frobenius-norm semi-supervised NMF. Table 2 shows the average and disparity of the SDR results based on I-divergence semi-supervised NMF.
In these tables, the first column shows the constraint types, the second column shows the evaluated values, and the third to fifth columns show the separation performance for each target. 'Average' means the average of the target-signal SDR and the other-signal SDR.
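The SDR of Eq. (25) and the 'Average' value reported in the tables can be computed as in the following sketch. It assumes that the three components of Eq. (24) are already available as equal-length sequences of samples; obtaining them from an estimated signal (the decomposition procedure of Ref. [10]) is not shown. The function names `sdr_db` and `average_sdr` are hypothetical.

```python
import math


def sdr_db(s_target, e_interf, e_artif):
    """SDR of Eq. (25) in dB, given the three components of Eq. (24)."""
    signal_energy = sum(s * s for s in s_target)
    distortion_energy = sum((ei + ea) ** 2 for ei, ea in zip(e_interf, e_artif))
    return 10.0 * math.log10(signal_energy / distortion_energy)


def average_sdr(target_sdr, other_sdr):
    """'Average' as described in the text: mean of the target and other-signal SDR."""
    return 0.5 * (target_sdr + other_sdr)
```

For instance, a target component `[3.0, 4.0]` with error components `[0.3, 0.0]` and `[0.0, 0.4]` has an energy ratio of 25 to 0.25, that is, an SDR of about 20 dB.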

'Disparity' is defined as

$$\text{Disparity} = (\text{value of target SDR}) - (\text{value of other SDR}). \tag{26}$$

Cl-Fl is the observed signal that is a mixture of clarinet and flute sounds, Tp-Hr is the observed signal that is a mixture of trumpet and horn sounds, and Vn-Pf is the observed signal that is a mixture of violin and piano sounds. The results of the other observed signals are omitted because of their similarity to the results presented here. Table 3 shows the maximum values of the normalized cross-correlation results. In this table, the first column indicates the criteria for NMF. 'Frob' and 'I-div' indicate the criteria of Frobenius norm and I-divergence, respectively. The second column shows the constraint types, and the third to fifth columns show the cross-correlation performance for each target.

Fig. 1: Sample spectrograms of separation results. Spectrograms of (a) clean flute signal, (b) mixture of clarinet and flute signals, (c) extracted flute signal without any constraint, and (d) extracted flute signal with orthogonality constraint.

Table 1: Average and disparity SDR results [dB] of constrained SNMF based on Frobenius norm using two-octave training data

Constraint type   Evaluated value   Cl-Fl   Tp-Hr   Vn-Pf
None              Average            6.72    5.80    5.97
                  Disparity          7.80    3.95    2.84
Orthogonality     Average            7.50   11.23   10.86
                  Disparity          4.87   -0.90   -0.79
I-divergence      Average            5.92    5.25    6.05
                  Disparity          6.72    2.45    1.76
KL-divergence     Average            6.14    7.42    7.09
                  Disparity          6.53    0.81    1.20

Table 2: Average and disparity SDR results [dB] of constrained SNMF based on I-divergence using two-octave training data

Constraint type   Evaluated value   Cl-Fl   Tp-Hr   Vn-Pf
None              Average            5.62    4.75    8.08
                  Disparity          4.78    3.76    4.23
Orthogonality     Average            6.24    5.25    8.52
                  Disparity          3.95    3.24    3.98
I-divergence      Average           12.00   10.53   11.50
                  Disparity         -0.71   -3.78   -0.28
KL-divergence     Average            8.12   11.68   11.46
                  Disparity          2.78    0.84    1.66

Table 3: Maximum values of normalized cross-correlation results obtained using two-octave training data

Criterion   Constraint type   Cl-Fl   Tp-Hr   Vn-Pf
Frob        None               0.73    0.83    0.86
            Orthogonality      0.86    0.96    0.95
            I-divergence       0.72    0.82    0.86
            KL-divergence      0.74    0.90    0.89
I-div       None               0.73    0.83    0.86
            Orthogonality      0.83    0.83    0.92
            I-divergence       0.96    0.94    0.96
            KL-divergence      0.90    0.96    0.96

When the observed signal is separated perfectly, the target SDR and the other SDR are high, the disparity is small, and the maximum cross-correlation is high. From these results, it can be confirmed that the orthogonality and maximum distance constraints for SNMF increase the separation performance. In particular, the maximum distance constraint based on I-divergence achieves the best performance in our experiments.

In addition, Figs. 2 and 3 show the results of subjective evaluation. In this evaluation, we presented the two signals processed by semi-supervised NMF and constrained semi-supervised NMF to ten male and female examinees in random order, who were asked to choose which signal they considered to be a clean target signal. The evaluations made contain three tests that compare the effectiveness of the conventional method with that

of each proposed method. We used nine observed signals for each test. The upper graph shows orthogonality constrained semi-supervised NMF vs semi-supervised NMF. The middle graph shows I-divergence constrained semi-supervised NMF vs semi-supervised NMF. The bottom graph shows KL-divergence constrained semi-supervised NMF vs semi-supervised NMF. In these figures, we observe that the sound qualities of the proposed methods are higher than that of the conventional method.

Fig. 2: Subjective evaluation results based on Frobenius norm. [Bar charts of preference score [%] for the conventional and proposed methods with 95 % confidence intervals; panels: Orthogonality, I-divergence, KL-divergence.]

Fig. 3: Subjective evaluation results based on I-divergence. [Bar charts of preference score [%] for the conventional and proposed methods with 95 % confidence intervals; panels: Orthogonality, I-divergence, KL-divergence.]

5. CONCLUSIONS

In this paper, we proposed a new constrained SNMF that imposes an orthogonality or maximum distance constraint between a priori bases and other bases. From the obtained experimental results, it can be confirmed that the proposed constrained SNMF increases the separation performance compared with SNMF without any constraint. As our next step, we plan to combine our constrained SNMF and existing extensions for NMF or SNMF, e.g., spectral continuity, sparseness [3], and temporal structure [8]. Also, we plan to derive more generalized update rules using more generalized criteria and distances, e.g., β-divergence.

REFERENCES

[1] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," Neural Inf. Process. Syst., vol. 13, pp. 556-562, 2001.

[2] M. N. Schmidt and R. K. Olsson, "Single-channel speech separation using sparse non-negative matrix factorization," Proc. INTERSPEECH, 2006.

[3] T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Trans. ASLP, vol. 15, pp. 1066-1074, 2007.

[4] P. Smaragdis, et al., "Non-negative matrix factorization for polyphonic music transcription," Proc. WASPAA, pp. 177-180, 2003.

[5] T. Virtanen and A. Klapuri, "Analysis of polyphonic audio using source-filter model and non-negative matrix factorization," Proc. NIPS, 2006.

[6] H. Kameoka, et al., "Complex NMF: A new sparse representation for acoustic signals," Proc. ICASSP, pp. 3437-3440, 2009.

[7] P. Smaragdis, et al., "Supervised and semi-supervised separation of sounds from single-channel mixtures," Proc. ICA, 2007.

[8] G. J. Mysore, "Non-negative hidden Markov modeling of audio with application to source separation," Proc. LVA/ICA 2010, LNCS 6365, pp. 140-148, 2010.

[9] S. Choi, "Algorithms for orthogonal nonnegative matrix factorization," Proc. International Joint Conference on Neural Networks, 2008.

[10] E. Vincent, et al., "The 2008 signal separation evaluation campaign: a community-based approach to large-scale evaluation," Proc. ICA, pp. 734-741, 2009.
