Researches on Music Feature Analysis using Chroma Vector and its Applications

クロマベクトルを用いた音楽特徴解析とその応用に関する研究

February 2016

Aiko UEMURA

Waseda University

Graduate School of Fundamental Science and Engineering Department of Computer Science and Engineering

Research on Image Information

Abstract

Chroma vector analysis makes it possible to capture the structure of chord sequences and of music. This thesis focuses on the characteristics of chroma vectors and presents fundamental research and applications related to music feature analysis.

In the fundamental part, this thesis aims to improve chord recognition performance using doubly nested circle of fifths (DNCOF) vectors and evaluates the effect of audio compression on chord recognition performance. The DNCOF feature lies closer to the correct label than the output of the conventional method, which suggests that it can be used to correct erroneous estimates. This study also shows that audio compression has little effect on chord recognition performance and that chroma vectors do not depend on the audio format. These results contribute to improved chord recognition performance and to the reliable use of chord data sets.

In the application part, this thesis detects music parts in television programs using time-frequency analysis and identifies cover songs from live music recorded in a real environment. Since the envelope of the time-series chroma vector of a music part has a horizontal correlation, this characteristic is used to estimate music parts in music television programs. This research also improves cover song identification for live concert performances, which has not been sufficiently investigated. These results open new research areas for chroma vector applications and support the annotation of titles and artists of live performances.

Acknowledgements

I acknowledge the helpful guidance and support of my supervisor, Professor Jiro Katto. I would also like to express my sincere thanks to Professor Tetsunori Kobayashi, Professor Wataru Kameyama and Associate Professor Tetsuro Kitahara, the other members of my thesis committee. They provided generous advice and very helpful reviews of my work.

This dissertation could not have been written without the help and support of the people around me. I could not have come this far without their assistance and encouragement, and it is my pleasure to express my grateful appreciation to them. Finally, I sincerely thank every piece of music that triggered my interest in music information processing.

Aiko Uemura 26 January 2016

List of Figures

2.1 Triads of major, minor, augmented and diminished [1].

2.2 An example of the seventh chords (dominant seventh, major seventh, minor seventh, half-diminished and diminished seventh) [1].

2.3 C major scale tone [1].

2.4 Circle of Fifths.

2.5 Doubly Nested Circle of Fifths (DNCOF) [2] (The capital letter represents major chord and the lower case represents minor chord).

2.6 Tonnetz (The red triangle represents major chord and the blue triangle represents minor chord)

2.7 Overview of chord recognition methods.

2.8 Overview of chroma vector calculation.

2.9 Band width of the constant-Q transformation and Fourier transform [3] (Above: constant Q conversion, below: Fourier transform).

2.10 Weighting functions (uniform, discrete, linear, anti-quadratic, exponential and Gaussian) [4]

3.1 Overview of the proposed method.

3.2 How to map onto DNCOF coordinate (left: a set of twenty-four unit vectors, center: multiplying a chord vector by the left figure, right: obtain a DNCOF vector).

3.3 Each result from the Beatles’ Words of Love (from 120 to 145 frames per beat).

4.1 ODG values and Chord recognition results (top: GMM-HMM, bottom: SVM-HMM).

4.2 Evaluation of different bit rates data.

4.3 Chord recognition results with different bit rates (top: GMM-HMM, bottom: SVM-HMM).

4.4 Chord recognition and PEAQ evaluation System.

5.1 Overview of our proposal.

5.2 Chroma vector behavior changes from non-music to music.

5.3 Chroma vector behavior changes from non-music to music. (a) CP behavior; (b) CLP behavior; (c) CRP behavior.

5.4 Envelope extraction using four neighborhood bins at peak bin.

5.5 Detection of chroma peaks having temporal consecutiveness applied to acoustic signals in Fig. (a) detected chroma peaks per frame; (b) initial mask extracted by the curvature of chroma peaks; (c) smoothed mask by a median filter.

5.6 Mask filtered chroma peaks (as applied to acoustic signals in Fig. 5.2): (a) one-dimension median filtered chroma peaks (N=17); (b) Gaussian filtered (σspace = 0.5) chroma peaks; (c) bilateral filtered (σspace = 3, σcolor = 0.1) chroma peaks.

5.7 Energy ratio calculation and classification without (before) and with (after) temporal smoothing by the 5-tap median filter: (a) transition behaviors of energy ratios from non-music to music. (b) an example of music part classification results if threshold=0.2.

5.8 F-measures of σspace for Gaussian filter.

5.9 F-measures of σspace and σcolor for the bilateral filter. (a) F-measures of CP; (b) F-measures of CLP; (c) F-measures of CRP.

5.10 Recall-precision curve of compared methods (standard signal).

5.11 Recall-precision curve of compared methods (TV program).

5.12 Chroma vector analysis change from non-music to music when the signal includes clapping: (a) Chroma component behavior. (b) Mask filtered chroma peaks.

6.1 Overview of our proposed system.

6.2 The flow of live version music identification.

6.3 Music identification accuracy for each query length.

6.4 Weight change based on the original song length (290 s).

6.5 The flow of audio scene detection.

6.6 Values of RMS, Pulse clarity, and LVID similarity in a time series.

6.7 Scene detection results obtained using Pulse clarity, LVID similarity, and RMS.

6.8 Identification results obtained using method 1 (minor change of the previous method [5]), method 2 (combination of the method 1 and weighted process using constant value), method 3 (combination of the method 1 and weighted process using changing value), and method 4 (method 3 with scene detection).

List of Tables

2.1 Tonic, dominant and sub-dominant.

2.2 Cadence.

3.1 Error estimation performance (precision, recall and F-measure).

3.2 Recognition-rate accuracy in percentage.

3.3 Angular difference from correct label.

4.1 Encoding settings.

4.2 Model output variables (MOV) of PEAQ basic version [61]

5.1 The contents of evaluation.

5.2 F-measures of each image filter (standard signal)

5.3 F-measures of each image filter (TV programs)

5.4 F-measures of compared methods

5.5 Ratio of scenes included in false-negative results.

5.6 Ratio of scenes included in false-negative results at the scene boundary.

6.1 Dataset details.

Chapter 1 Introduction

1.1 Research Background

In recent years, the evolution of high-speed networks, large-capacity storage and efficient data compression technology has made it possible to handle large amounts of music.

Furthermore, since many applications and distribution systems for music are available, we have more opportunities to enjoy music. As a result, we can handle "all" music "anytime" and "anywhere", and the demand for music analysis and search has been increasing. Accordingly, many studies have been carried out on the automatic recognition of different musical elements.

A chord is a compact representation of harmony, which is one of the three components of music along with melody and rhythm. Chords are also essential elements of tonal music. Automatic chord recognition is very useful for analyzing musical content such as melody, key and harmony. In addition, chord recognition has attracted attention in automatic transcription and music information retrieval (MIR) because it can be regarded as a pre-processing step for automatic transcription. Automatic transcription helps people who would otherwise have to write down every note of a score by ear, and it is also helpful for people who have little musical experience. Chord labels and their boundaries are useful in cover song identification as well, because the chord sequences of cover songs are often analogous to those of the originals. The chord is therefore an important clue for music information retrieval.

Generally, the feature used in chord recognition is the chroma vector [6], a 12-dimensional vector that represents the intensity of each of the 12 semitone pitch classes of the chromatic scale irrespective of octave. Great attention has been devoted to the chroma vector, especially in the MIR field, for example in the Music Information Retrieval Evaluation eXchange (MIREX) [7] international competition. The chroma vector is considered suitable for capturing chord transitions and music structure. In this thesis, I apply chroma vector analysis to chord recognition and its applications.

1.2 Thesis Contributions

This thesis contributes to both the fundamentals and the applications of the chord recognition task. In the fundamental part, the chroma vector is extended using the doubly nested circle of fifths (DNCOF) to improve chord recognition performance, and the effect of the compression ratio of different audio data formats on chord recognition performance is investigated. In the application part, music part detection in television programs using time-frequency analysis of chroma vectors, and cover song identification from live music using chroma features, are investigated.

The first contributions (Chapters 3 and 4) concern chord recognition performance and chord data sets.

・ In DNCOF coordinates, the proposed feature lies closer to the correct label than the output label of the conventional method. Chord recognition performance can therefore be improved by exploiting the angular difference.

・ Since audio compression has only a small influence on chord recognition performance, Chapter 4 confirms that chroma vectors are effective irrespective of the audio format.

The second contribution (Chapter 5) is the application of chroma vector analysis to extracting music parts from music TV programs.

・ This study focuses on the characteristics of time-series chroma components and proposes an application that opens new research areas for the chroma vector.

The third contribution (Chapter 6) concerns the analysis of live concert recordings that include several songs, for cover song identification systems.

・ The proposed methods are expected to contribute to the annotation of titles and artists of live recordings. They will also assist digital rights management for streaming music.

1.3 Thesis Organization

Following this introduction, this thesis consists of a background chapter (Chapter 2), four research chapters (Chapters 3-6), and a concluding chapter (Chapter 7).

Chapter 2 describes recognition techniques for musical elements, focusing on signal processing and the elements needed for chord recognition.

Chapter 3 proposes a novel vector based on the Doubly Nested Circle of Fifths (DNCOF), obtained by mapping a chroma vector onto DNCOF coordinates. The DNCOF vector carries both "direction" and "magnitude" information. It is used to estimate and correct errors as post-processing of the conventional method.

Chapter 4 focuses on the effect of audio compression on chord recognition, as an example of frequency variations. I describe the relationship between chord recognition performance and sound quality, and evaluate the recognition performance on data sets encoded at varying bit rates.

Chapter 5 proposes a method of extracting music parts from music TV programs using chroma vectors. The envelope of the chroma vector is extracted using a mask and several image processing filters and is then used for classification.

Chapter 6 proposes a robust approach to identifying a song from live music by improving a traditional cover song identification method [5]. The proposed method has two phases, live version identification and audio scene detection, which solve two problems that occur when the conventional method is applied to live recordings.

Chapter 7 summarizes this thesis and describes future work.

Chapter 2 Background and Related Works

This chapter describes recognition techniques for musical elements, focusing on signal processing and the elements needed for chord recognition. As application aspects, it also covers the effects of audio compression on music information processing, audio scene analysis for TV programs, and music identification towards live version identification.

2.1 Musical Content

2.1.1 Equal Temperament

Equal temperament is a scale that divides one octave into 12 equal parts, so that every pair of adjacent pitches is separated by a constant frequency ratio. The interval of one twelfth of an octave is called a "semitone", and an interval of two semitones is called a "whole tone".

Humans perceive the frequency of a sound as its pitch. Owing to the structure of the ear, we perceive "high" and "low" on a logarithmic scale.

When three notes with frequencies $f_1 < f_2 < f_3$ (Hz) are heard as equally spaced in pitch, their frequencies satisfy the following equation.

\[ \frac{f_2}{f_1} = \frac{f_3}{f_2} \tag{2.1} \]

A note $f_2$ that is one octave higher than a note $f_1$ satisfies the following formula and is perceived as the same note.

\[ f_2 = 2 f_1 \tag{2.2} \]

A note name such as A4 or F#1 consists of a key letter and a key number. The key letter gives the relative position within one octave, and the key number increases by one for each octave. Notes an octave apart share the same key letter. There are 12 key letters: C, C# (D♭), D, D# (E♭), E, F, F# (G♭), G, G# (A♭), A, A# (B♭), B. In theory the key number can be any integer, but in practice it lies in the range 0-9. In contemporary music, $f_{A4} = 440$ Hz is used as the reference value. The following equations give the frequencies of all note names in equal temperament, where $a_{\mathrm{half}}$ is the frequency ratio of one semitone, $L$ is a key letter and $N$ is a key number.

\[
\begin{aligned}
a_{\mathrm{half}} &= 2^{1/12} \\
f_{A\#4} &= a_{\mathrm{half}}\, f_{A4}, \qquad f_{B4} = a_{\mathrm{half}}^{2}\, f_{A4} \\
f_{G\#4} &= a_{\mathrm{half}}^{-1}\, f_{A4}, \qquad \ldots, \qquad f_{C4} = a_{\mathrm{half}}^{-9}\, f_{A4} \\
f_{L(N+1)} &= 2 f_{LN}, \qquad f_{L(N-1)} = 2^{-1} f_{LN}
\end{aligned}
\tag{2.3}
\]
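As a minimal sketch of the equal-temperament relation above, the following Python function computes the frequency of a note name from the 440 Hz reference. The function name `note_frequency`, the lookup table and the ASCII spelling of flats (`b`) are illustrative choices, not part of the original text.

```python
import re

A4_HZ = 440.0                 # reference pitch f_A4 given in the text
A_HALF = 2.0 ** (1.0 / 12.0)  # frequency ratio of one semitone

# Semitone offset of each key letter from C within one octave.
SEMITONE = {"C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4,
            "F": 5, "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8,
            "A": 9, "A#": 10, "Bb": 10, "B": 11}


def note_frequency(name: str) -> float:
    """Return the equal-temperament frequency in Hz of a note such as 'A4'."""
    match = re.fullmatch(r"([A-G][#b]?)(-?\d+)", name)
    if match is None:
        raise ValueError(f"not a note name: {name!r}")
    letter, number = match.group(1), int(match.group(2))
    # Signed distance in semitones from A4.
    steps = SEMITONE[letter] - SEMITONE["A"] + 12 * (number - 4)
    return A4_HZ * A_HALF ** steps
```

For example, `note_frequency("C4")` lies nine semitones below A4, and raising the key number by one doubles the frequency.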

2.1.2 Chord

A chord is a vertical stack of notes sounding together and is one of the basic building blocks of tonal music. A chord may contain octave doublings of its notes without changing its character.

The most basic chord is the triad, which stacks a third and a fifth above the root note. There are four types, which differ in structure: major (major third and perfect fifth), minor (minor third and perfect fifth), augmented (major third and augmented fifth) and diminished (minor third and diminished fifth), as shown in Fig. 2.1.

Figure 2.1: Triads of major, minor, augmented and diminished [1].

A seventh chord combines a triad with a seventh above the root. Examples are the dominant seventh (major triad and minor seventh), major seventh (major triad and major seventh), minor seventh (minor triad and minor seventh), half-diminished (diminished triad and minor seventh) and diminished seventh (diminished triad and diminished seventh), as shown in Fig. 2.2.

Figure 2.2: Examples of the seventh chord (dominant seventh, major seventh, minor seventh, half-diminished and diminished seventh) [1].

In addition, a tension chord adds a tension note, such as the ninth above the root, to the chord.
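The chord structures described in this section can be written as semitone intervals above the root. The sketch below is only an illustration: the quality labels (`maj`, `hdim7`, and so on) and the function name `chord_notes` are hypothetical, and notes are spelled with sharps only.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Intervals above the root in semitones.
CHORD_INTERVALS = {
    "maj": (0, 4, 7),        # major third + perfect fifth
    "min": (0, 3, 7),        # minor third + perfect fifth
    "aug": (0, 4, 8),        # major third + augmented fifth
    "dim": (0, 3, 6),        # minor third + diminished fifth
    "7": (0, 4, 7, 10),      # major triad + minor seventh
    "maj7": (0, 4, 7, 11),   # major triad + major seventh
    "min7": (0, 3, 7, 10),   # minor triad + minor seventh
    "hdim7": (0, 3, 6, 10),  # diminished triad + minor seventh
    "dim7": (0, 3, 6, 9),    # diminished triad + diminished seventh
}


def chord_notes(root: str, quality: str) -> list:
    """Return the note names of a chord, spelled with sharps."""
    start = NOTE_NAMES.index(root)
    return [NOTE_NAMES[(start + step) % 12] for step in CHORD_INTERVALS[quality]]
```

Because pitch classes wrap around modulo 12, octave doublings do not change the result.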

2.1.3 Harmony

Harmony is the progression of chords; in Western music theory it refers to how voices are arranged and how they connect from one chord to the next. In Western music, it is one of the three elements of music together with melody and rhythm. In general, the term denotes tonal harmony based on the major-minor system. In music theory, harmony is the fundamental structure of tonal music.

The most significant degrees of the major scale are I, IV and V, which are called the tonic, sub-dominant and dominant. The chords built on these root notes are called the tonic, sub-dominant and dominant chords. The remaining four chords are each assigned to one of these functions as substitute chords, because each behaves like one of the three. The three chords on I, IV and V are called the primary triads, and together they contain all seven notes of the scale. For example, the C major scale has seven chords, as shown in Fig. 2.3.

Figure 2.3: C major scale tone [1]

Each chord has a function in the context of the key. In Table 2.1, the first chord listed for each function is the one that represents it most strongly; the other chords are referred to as substitute (proxy) chords. Besides, within the dominant function, V is often a stronger function than V7.

Table 2.1: Tonic, dominant and sub-dominant.

Function            Chords
Tonic (T)           I, VI, III (and each seventh chord)
Dominant (D)        V, VII, III (and each seventh and ninth chord)
Sub-dominant (S)    IV, II (and each seventh chord)

Table 2.2: Cadence.

Pattern     Examples
T→D→T       I→V→I, I→V7→VI and so on.
T→S→D→T     I→IV→V→I, IV→II7→V7→I and so on.
T→S→T       I→IV→I and so on.

The cadence is the progression of chords and melody at the end of a phrase. Harmony refers to the general way of connecting chords according to their functions. Music is constructed by combining and chaining the three patterns in Table 2.2.

2.1.4 Musical Model

・ Circle of Fifths

The key is determined by the combination of a root and a scale. The Circle of Fifths is one example of a tonal space (modulatory space) and is drawn as a circle, as shown in Fig. 2.4. Starting from any of the 12 major keys and moving repeatedly to the key a perfect fifth above arranges the keys around a circle like the hours on a clock face.

Since a perfect fifth spans 7 semitones and 7 is coprime to 12, repeatedly moving by a fifth visits all 12 keys before returning to the start, so this model covers all major keys.
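The argument that steps of 7 semitones reach every key can be checked with a short sketch. The function name `circle_of_fifths` and the sharp-only spelling are illustrative assumptions.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def circle_of_fifths(start: str = "C") -> list:
    """List the 12 keys reached by repeatedly moving up a perfect fifth."""
    index = NOTE_NAMES.index(start)
    # A perfect fifth is 7 semitones; pitch classes wrap around modulo 12.
    return [NOTE_NAMES[(index + 7 * step) % 12] for step in range(12)]
```

Replacing 7 with a step that shares a factor with 12, such as 4, would cycle through only a subset of the keys.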

Figure 2.4: Circle of Fifths

Figure 2.5: Doubly Nested Circle of Fifths (DNCOF) [2] (The capital letter represents major chord and the lower case represents minor chord)

・ Doubly Nested Circle of Fifths

The Circle of Fifths can be extended to the Doubly Nested Circle of Fifths (DNCOF) shown in Fig. 2.5. Capital letters represent major chords and lower-case letters represent minor chords.

In the DNCOF, major and minor chords are arranged alternately, and adjacent chords share two of their three notes. All 24 chords are arranged around a circle like the hours on a clock face.
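The adjacency property can be illustrated with a small sketch. It assumes one particular ordering that satisfies the description (each major chord followed by the minor chord a major third above its root, giving C, e, G, b, ...); the actual layout of Fig. 2.5 may start elsewhere or run in the opposite direction. The helper names `triad` and `dncof_ring` are hypothetical.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def triad(root: int, minor: bool) -> frozenset:
    """Pitch classes of a major or minor triad on the given root."""
    third = 3 if minor else 4
    return frozenset({root % 12, (root + third) % 12, (root + 7) % 12})


def dncof_ring() -> list:
    """Return 24 (label, pitch-class set) pairs, alternating major and minor."""
    ring = []
    for step in range(12):
        major_root = (7 * step) % 12        # major chords follow the circle of fifths
        minor_root = (major_root + 4) % 12  # assumed neighbor: minor chord a major third above
        ring.append((NOTE_NAMES[major_root], triad(major_root, minor=False)))
        ring.append((NOTE_NAMES[minor_root].lower(), triad(minor_root, minor=True)))
    return ring
```

Checking every neighboring pair, including the wrap-around from the last chord back to the first, confirms that each pair shares exactly two notes under this ordering.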

・ Tonnetz

The Tonnetz can be used to represent traditional harmonic relationships in Western music.

Its nodes are note names [8], as shown in Fig. 2.6. The Tonnetz is a network whose edges link notes a perfect fifth, a major third and a minor third apart. Each major and minor chord is represented by a triangle on the Tonnetz.

Figure 2.6: Tonnetz (The red triangle represents major chord and the blue triangle represents minor chord) [8]

2.2 Chord Recognition

Fig. 2.7 shows an overview of previous research on chord recognition. The work falls roughly into three approaches to improving recognition performance: improving the chroma vector, improving the recognition method, and improving pre- or post-processing.

Figure 2.7: Overview of chord recognition methods.

2.2.1 Chroma Vector

The chroma vector is defined as a 12-dimensional vector that represents the intensity of each of the 12 semitone pitch classes of the chromatic scale, irrespective of octave.

Chroma vectors capture the harmonic structure of music, as shown in Fig. 2.8, and are often used for chord recognition.

Figure 2.8: Overview of chroma vector calculation.

Let $t$ and $f$ denote the time unit (frame number) and the log-scale frequency, respectively. Then, each element $C_c(t)$ of the 12-bin chroma vector at time $t$ is calculated by

\[ C_c(t) = \sum_{h=\mathrm{Oct}_L}^{\mathrm{Oct}_H} \int_{-\infty}^{\infty} \mathrm{BPF}_{c,h}(f)\, \Psi(f,t)\, df \tag{2.1} \]

where $\mathrm{BPF}_{c,h}(f)$ is a bandpass filter that passes chroma $c$ in octave position $h$, $\Psi(f,t)$ is a spectral magnitude, and $\mathrm{Oct}_L$ and $\mathrm{Oct}_H$ specify an octave range.

・ Frequency Analysis

$\Psi(f,t)$ is often calculated by the Fast Fourier Transform (FFT) or the constant-Q transform [3].

The FFT reduces the number of additions and multiplications of the Discrete Fourier Transform (DFT) to the order of $N \log_2 N$, where $N$ is the DFT size. The $N$-point DFT is defined by

\[ X(k) = \sum_{n=0}^{N-1} x(n) \exp\left(-j \frac{2\pi k n}{N}\right), \qquad 0 \le k \le N-1 \tag{2.2} \]

where $x(n)$ represents the audio signal. The power spectrum is calculated by:

\[ S(k) = \frac{1}{N} \left| \sum_{n=0}^{N-1} x(n) \exp\left(-j \frac{2\pi k n}{N}\right) \right|^{2} = \frac{1}{N} \left[ \operatorname{Re}(X(k))^{2} + \operatorname{Im}(X(k))^{2} \right], \qquad 0 \le k \le N-1 \tag{2.3} \]

When FFT results are mapped onto a logarithmic scale such as musical octaves, the resolution becomes worse in the lower octaves, because the human ear perceives frequency logarithmically while Fourier analysis yields a linear frequency scale. As shown in Figure 2.9, the bandwidth of the constant-Q transform is narrower at low frequencies and wider at high frequencies.
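The DFT and power spectrum definitions in this subsection can be checked numerically. The sketch below assumes NumPy is available; `naive_dft` evaluates the DFT sum directly and `power_spectrum` uses `numpy.fft.fft`. Both function names are illustrative.

```python
import numpy as np


def naive_dft(x):
    """Evaluate X(k) = sum_n x(n) exp(-j 2 pi k n / N) directly (order N^2)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / len(x)) @ x


def power_spectrum(x):
    """S(k) = (Re(X(k))^2 + Im(X(k))^2) / N, with X(k) computed by the FFT."""
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.fft(x)
    return (spectrum.real ** 2 + spectrum.imag ** 2) / len(x)
```

The bins of this spectrum are spaced linearly, so each octave down contains half as many bins as the one above it, which is the resolution problem described above.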

Figure 2.9: Bandwidth of the constant-Q transform and the Fourier transform (above: constant-Q transform, below: Fourier transform) [9].

The constant-Q transform (CQT) is expressed by the following equation.

\[ X^{cq}(k) = \sum_{n=0}^{N(k)-1} w(n,k)\, x(n)\, e^{-j 2\pi f_k n} \tag{2.4} \]

where $w(n,k)$ is the analysis window, whose length $N(k)$ depends on the bin position $k$. The $k$-th center frequency $f_k$ is defined by

\[ f_k = \left(2^{1/B}\right)^{k} f_{\min} \tag{2.5} \]

where $f_{\min}$ is the minimum frequency of the constant-Q transform and $B$ is the number of bins per octave. The window length $N(k)$ of the $k$-th bin is expressed by

\[ N(k) = \frac{S}{f_k}\, Q \tag{2.6} \]

where $S$ is the sampling frequency and $Q$ is the number of cycles processed at each center frequency. The CQT is implemented as a filter bank whose bandwidths are proportional to the center frequencies. The CQT cannot escape the uncertainty relation between time and frequency; the temporal uncertainty is concentrated in the lower octaves.

Figure 2.10: Weighting functions (uniform, discrete, linear, anti-quadratic, exponential and Gaussian) [4]
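The constant-Q definitions above can be sketched in NumPy as follows. Several details are assumptions that the text does not state: Q is set to 1/(2^(1/B) - 1), the usual choice that makes adjacent bins touch; a Hamming window is used for w(n, k); the center frequency is divided by the sampling frequency so that the exponent is in cycles per sample; and each bin is normalized by its window length. The function names are illustrative.

```python
import numpy as np


def cqt_parameters(f_min, bins_per_octave, n_bins, sample_rate):
    """Center frequencies f_k, quality factor Q and window lengths N(k)."""
    k = np.arange(n_bins)
    center = f_min * (2.0 ** (1.0 / bins_per_octave)) ** k
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)  # assumed value of Q
    lengths = np.ceil(sample_rate * q / center).astype(int)
    return center, q, lengths


def naive_cqt(x, f_min, bins_per_octave, n_bins, sample_rate):
    """Direct evaluation of the constant-Q sum on the first N(k) samples of x."""
    x = np.asarray(x, dtype=float)
    center, _, lengths = cqt_parameters(f_min, bins_per_octave, n_bins, sample_rate)
    out = np.zeros(n_bins, dtype=complex)
    for k in range(n_bins):
        n = np.arange(min(int(lengths[k]), len(x)))
        window = np.hamming(len(n))
        # Normalized frequency f_k / S, in cycles per sample.
        kernel = np.exp(-2j * np.pi * center[k] / sample_rate * n)
        out[k] = np.sum(window * x[n] * kernel) / len(n)
    return out
```

With `f_min = 110.0` and 12 bins per octave, bin 24 is centered on 440 Hz, and the low bins use windows several times longer than the high bins, which is where the temporal uncertainty concentrates.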

・ Weighting Functions

An earlier report [4] describes weighting functions for the spectral power bins, corresponding to $\mathrm{BPF}_{c,h}(f)$, that are suitable for chord recognition.

In [10], the BPF is defined by

\[ \mathrm{BPF}_{c,h}(f) = \frac{1}{2} \left( 1 - \cos \frac{2\pi \left( f - (F_{c,h} - 100) \right)}{200} \right) \tag{2.7} \]

where $F_{c,h}$ denotes a log-scale frequency (in cents) given by

\[ F_{c,h} = 1200\, h + 100\, (c - 1) \tag{2.8} \]

This filter is applied to octaves from $\mathrm{Oct}_L$ to $\mathrm{Oct}_H$.
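A discrete version of the chroma calculation with this bandpass filter might look like the sketch below. It rests on assumptions not stated in the text: the 0-cent reference is taken as 440 × 2^(3/12 − 5) Hz (about 16.35 Hz, so that c = 1 corresponds to C), the filter is treated as zero outside ±100 cents of its center, the integral is replaced by a sum over spectral bins, and the default octave range 3 to 8 is arbitrary. All names are illustrative.

```python
import numpy as np

C0_HZ = 440.0 * 2.0 ** (3.0 / 12.0 - 5.0)  # assumed 0-cent reference (about 16.35 Hz)


def hz_to_cents(freq_hz):
    """Convert frequencies in Hz to the log-scale frequency in cents."""
    return 1200.0 * np.log2(np.asarray(freq_hz, dtype=float) / C0_HZ)


def bpf(f_cents, c, h):
    """Raised-cosine bandpass weight centered on F_{c,h} = 1200 h + 100 (c - 1)."""
    center = 1200.0 * h + 100.0 * (c - 1)
    f_cents = np.asarray(f_cents, dtype=float)
    weight = 0.5 * (1.0 - np.cos(2.0 * np.pi * (f_cents - (center - 100.0)) / 200.0))
    # Assumption: the filter is zero outside +/-100 cents of its center.
    return np.where(np.abs(f_cents - center) <= 100.0, weight, 0.0)


def chroma_vector(freqs_hz, magnitudes, oct_low=3, oct_high=8):
    """Sum BPF-weighted spectral magnitudes over octaves for each of 12 chroma."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    magnitudes = np.asarray(magnitudes, dtype=float)
    positive = freqs_hz > 0  # the logarithm is undefined at 0 Hz
    cents = hz_to_cents(freqs_hz[positive])
    chroma = np.zeros(12)
    for c in range(1, 13):
        for h in range(oct_low, oct_high + 1):
            chroma[c - 1] += np.sum(bpf(cents, c, h) * magnitudes[positive])
    return chroma
```

A single spectral peak at 440 Hz falls on the center of the filter for c = 10 (A) in octave 4, so it contributes with full weight to that chroma bin.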

・ Chroma calculation approach

Numerous approaches have been developed to improve the chroma vector [11, 12].

Harte et al. proposed tuning the chroma vector, thereby avoiding the spread of spectral power into neighboring bins [11]. Other approaches combine various musical features to calculate better chroma vectors. Ellis presented a chroma vector that is synchronized with the beat [12]: after the beats are estimated, the chroma vectors are averaged within each beat. This approach depends on the accuracy of the beat estimation. Khadkevich proposed a tuning method based on analyzing the magnitude and phase-change spectrum [13].

Other work transforms the pitch representation. Mauch and Dixon applied a non-negative least-squares formulation to the spectrogram and obtained clearer chroma vectors [14]. Müller proposed the Chroma DCT-Reduced log Pitch (CRP) feature, based on the Discrete Cosine Transform (DCT). The upper coefficients of the pitch-frequency cepstrum are passed to an inverse DCT, and 12-bin chroma vectors are finally obtained from the resulting pitch vectors. For our experiments, we use the number of upper coefficients p = 55, as in [15]. CRP features are computed with the Chroma Toolbox [16].

Chroma features often apply logarithmic compression to the pitch representation. Each energy value e of the pitch representation is replaced by log(ηe + 1), where η is a positive constant. An earlier report [15] uses η = 100.

2.2.2 Modeling

There are three ways of recognizing chords. First, the authors of [6, 11] proposed template-based methods, in which each template defines the pitch-class power of a chord in terms of the chroma vector. In their studies, 12-element bit masks were used, where 1 indicates that a note belongs to the chord and 0 that it does not. Oudre et al. extended this approach and used templates containing four or six harmonics of every note of the chord [17].
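A minimal sketch of bit-mask template matching is given below. It is not the exact procedure of [6, 11]: restricting the vocabulary to the 24 major and minor triads, scoring by cosine similarity, and the label format (`C`, `A:min`, `N` for no chord) are assumptions made for illustration.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chord_templates():
    """Build 24 binary templates: 1 where a note belongs to the chord, else 0."""
    templates = {}
    for root in range(12):
        for suffix, third in (("", 4), (":min", 3)):
            mask = np.zeros(12)
            mask[[root, (root + third) % 12, (root + 7) % 12]] = 1.0
            templates[NOTE_NAMES[root] + suffix] = mask
    return templates


def recognize_chord(chroma):
    """Return the label whose template is most similar to the chroma vector."""
    chroma = np.asarray(chroma, dtype=float)
    norm = np.linalg.norm(chroma)
    if norm == 0.0:
        return "N"  # silent frame: no chord
    scores = {label: float(mask @ chroma) / (np.sqrt(3.0) * norm)
              for label, mask in chord_templates().items()}
    return max(scores, key=scores.get)
```

Because each frame is classified independently, this kind of matcher tends to produce rapidly changing labels, which is one motivation for the HMM-based methods described next.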

Second, the authors of [18, 19] proposed HMM-based methods that treat the chord labels as hidden states of the model. In general, items (a) to (e) are defined as the HMM settings for chord recognition.

(a) State set: $\Sigma = \{ S_i \mid 0 \le i \le M \}$

Each state corresponds to one chord.

(b) State transition probability: $A = \{ a_{ij} = P(S_j \mid S_i) \mid 0 \le i, j \le M \}$

A transition probability is defined between each pair of states. It represents how easily one chord moves to another.

(c) Output probability: $B = \{ f_i(o) \mid 0 \le i \le M \}$

The output probability is the probability of emitting a signal $o$ in state $S_i$. If the signal takes a finite number of values, the probability is defined by

\[ f_i(o) = P(o \mid S_i), \qquad \sum_{o} P(o \mid S_i) = 1 \tag{2.9} \]

(d) Output signal series: $o(t)$ $(t = 0, \ldots, T)$

The series of signals emitted during the state transitions of the HMM. One signal is emitted per time unit.

(e) State series: $s(t)$ $(t = 0, \ldots, T)$

The sequence of states over time $t$.

Bello and Pickens extended Sheh's work using music theory [2], setting the initial values of the HMM state transition probabilities based on chord distance. Another approach is a support vector machine and HMM (SVM-HMM) model trained on ground-truth data: the chord sequences are estimated from the chroma vectors with the trained SVM-HMM [20].

Third, there are hybrid proposals that combine training data and music theory [21, 22]. Some works have been devoted to estimating musical content such as keys and bass lines from audio [19, 22-24]. Each of these elements is closely related to the chord, so they are expected to serve as clues for improving the chord recognition rate.

2.2.3 Pre/Post Processing

Some studies improve recognition performance through pre-processing before the chroma calculation. An earlier report [25] described differences in the time-frequency behavior of the spectrograms of harmonic and percussive instruments. Khadkevich applied the time-frequency reassignment (TFR) technique to chroma vectors [26]. As post-processing to correct the recognition result, a Viterbi decoder finds the most likely sequence of chords based on the chord probabilities [27].
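Viterbi decoding over chord states can be sketched as follows, working in the log domain. This is a generic textbook formulation, not the specific decoder of [27]; the function name and the argument layout (frames in rows, states in columns) are assumptions.

```python
import numpy as np


def viterbi(log_initial, log_transition, log_emission):
    """Return the most likely state sequence as a list of state indices.

    log_initial:    (M,) log initial state probabilities
    log_transition: (M, M) log a_ij, moving from state i to state j
    log_emission:   (T, M) log output probability of each frame in each state
    """
    log_initial = np.asarray(log_initial, dtype=float)
    log_transition = np.asarray(log_transition, dtype=float)
    log_emission = np.asarray(log_emission, dtype=float)
    n_frames, n_states = log_emission.shape
    score = log_initial + log_emission[0]
    backpointer = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        # candidate[i, j]: best score of a path that is in i at t-1 and in j at t.
        candidate = score[:, None] + log_transition
        backpointer[t] = np.argmax(candidate, axis=0)
        score = candidate[backpointer[t], np.arange(n_states)] + log_emission[t]
    # Trace the best path backwards from the final frame.
    path = [int(np.argmax(score))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    return path[::-1]
```

With transition probabilities that favor staying in the same chord, an isolated frame whose output probability weakly prefers another chord is smoothed away, whereas frame-by-frame classification would switch labels.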

2.2.4 Data Set

The music is divided into contiguous segments, each annotated with its start time, end time and chord label. The notation of chord labels was proposed by Harte [28], who also provides ground-truth annotations for 180 songs by the Beatles. In addition, labels for other artists such as C. King and QUEEN are available from Isophonics [29]. Each segment is written as one line, as follows.

0.000000 0.175157 N
0.175157 1.852358 C
1.852358 3.454535 G
3.454535 4.720022 A:min
4.720022 5.126371 A:min/b7

The start and end times are given in seconds from the beginning of the file. The chord name is described as:

root: shorthand(extra-degrees) / bass

The chords represented by the shorthand are triads, seventh chords, sixth chords, extended chords and suspended chords. Degrees that are not implied by the shorthand are listed as extra-degrees. If the bass note differs from the root note, it is written as /bass. The annotations use the .lab format, which has been adopted in the MIREX tasks.
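A line of such a .lab file could be parsed as in the sketch below. The regular expression covers only the pattern described above (root, optional shorthand, optional extra degrees, optional bass) and is not a complete implementation of the notation in [28]; the `ChordSegment` type and `parse_lab_line` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ChordSegment(NamedTuple):
    start: float
    end: float
    root: Optional[str]        # None for the no-chord label "N"
    shorthand: Optional[str]   # None when the label gives no shorthand
    extra_degrees: Tuple[str, ...]
    bass: Optional[str]


_CHORD = re.compile(
    r"(?P<root>[A-G][#b]*)"
    r"(?::(?P<shorthand>[A-Za-z0-9]+)?(?:\((?P<extra>[^)]*)\))?)?"
    r"(?:/(?P<bass>[#b]*\d+))?$"
)


def parse_lab_line(line: str) -> ChordSegment:
    """Parse one 'start end label' line of a chord annotation file."""
    start, end, label = line.split()
    if label == "N":
        return ChordSegment(float(start), float(end), None, None, (), None)
    match = _CHORD.match(label)
    if match is None:
        raise ValueError(f"unrecognized chord label: {label!r}")
    extra = match.group("extra")
    degrees = tuple(d.strip() for d in extra.split(",") if d.strip()) if extra else ()
    return ChordSegment(float(start), float(end), match.group("root"),
                        match.group("shorthand"), degrees, match.group("bass"))
```

For the last example line above, the parser yields root `A`, shorthand `min` and bass `b7`.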

Chord labels are also provided for the popular music of the RWC (Real World Computing) Popular Music Database [30] as the AIST Annotation [31]. The McGill Billboard Project provides .lab files with chord labels for the chord estimation task in MIREX 2013 [32].

2.3 Effects of Audio Compression for Music Information Processing

The effect of bit rate differences on music content analysis has rarely been reported in the literature. Hamawaki analyzed differences in the Mel-Frequency Cepstrum Coefficients (MFCC) of MP3 audio compressed at various bit rates and their effect on content-based MIR results [33]. The MIR results for music data with mixed bit rates differ significantly from those for raw files. The authors of [34] investigated the effect of bit rate on musical genre classification using audio spectrum projection (ASP) and HMM. The classification performance for MP3 files degrades at bit rates lower than 96 kbps. They showed that ASP and MFCC are safe to use on audio collections encoded at 128 kbps or higher. For collections encoded at lower bit rates, or collections with heterogeneous encodings, performance decreases by more than 5%, so homogeneous encoding is preferable for genre classification. Jacobson et al. showed that onset detection performance falls at 32 kbps or less for MP3 audio [35].

2.4 Music scene detection in TV programs

Related tasks have been pursued using various techniques and features. Many features are useful for characterizing audio signals [36-43]. Time-domain features include volume, root mean square (RMS) and zero crossing rate; they are often used for detecting silence [40, 44-46]. Frequency-domain features are based on the discrete Fourier transform (DFT). Examples are the frequency centroid, spectral flux and Mel-Frequency Cepstrum Coefficients (MFCC), which are often used for discriminating speech and music [44, 46].
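Two of the time-domain features mentioned above can be computed per frame as in the following sketch. The definitions are the common textbook ones, and the cited works may differ in normalization or framing; the function names are illustrative.

```python
import numpy as np


def rms(frame):
    """Root mean square amplitude of one frame of samples."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sqrt(np.mean(frame ** 2)))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    frame = np.asarray(frame, dtype=float)
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))
```

A frame whose RMS stays below a small threshold can be treated as silence, while a high zero crossing rate is typical of noisy or unvoiced content.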

Scheirer and Taniguchi also used the 4 Hz modulation energy, because speech has a characteristic energy modulation peak at around 4 Hz [42, 43]. The authors of an earlier report [45] analyzed audio features (short-time average energy, short-time average zero crossing rate, fundamental frequency, and frequency centroid) and classified audio signals into four types using thresholds. To separate speech and music, they used the low energy ratio (LER), which is higher for speech than for music.

Most video scene estimation methods are assisted by visual information [45-50]. Many methods extract visual features such as color, motion and edges from each shot or clip, and then infer the scene boundary from the music and video scene candidates. Wang et al. reported audio and visual features that can selectively characterize scene contents [37].

Fundamentally, audio scenes can be discriminated in two ways. One is the thresholding approach. Chianese et al. detected scene changes using the Euclidean distance between audio features, such as volume, energy and spectral characteristics, of two contiguous video shots [49]. The other is the model-based approach, which uses classifiers such as Gaussian Mixture Models (GMM) [44, 51] and Support Vector Machines (SVM) [46]. Zhu combines the features into a vector, which is then normalized by the standard deviation of the training data [46]. As another classification approach, Nitanda extracts audio cuts using fuzzy c-means clustering based segmentation and classification (FSC) and classifies each segment [41].

Few works have used chroma vectors; an exception is [52], which pointed out the different characteristics of chroma vectors between speech and music. However, it did not sufficiently exploit the characteristics of time-series chroma components. An earlier report [25] emphasized the separation of instruments from musical signals.

2.5 Cover Song and Live Version Identification

The Query by Humming (QBH) approach was used to incorporate audio into early MIR systems. Gradually, the query format has extended to polyphonic music audio, which is used to query a database. An earlier report [53] described the use of cosine distance between the most repeated melodic fragments of songs to calculate similarity.


Tsai specifically investigated QBM and proposed a matching method by vocal melody extraction from music [54].

Alternatively, music similarity can be characterized by harmony rather than melody. Features that are less affected by sound quality, such as LPC and MFCC, are often used for identification. Shazam [55] is a well-known application for iOS and Android developed by Shazam Entertainment Ltd. It extracts an audio fingerprint from an input signal and compares it with the fingerprints in the database. If a match is found, the title associated with the original is obtained from the database [56].

Some studies have specifically addressed cover songs that resemble the original version in their chord sequences [5, 56, 57]. Ellis used chroma vectors synchronized to beats [5]. He handled differences in key and time lag using cross-correlation and circular cross-correlation for similarity calculation. In addition, Ravuri set three reference values of beats per minute (BPM) to remedy failures of beat tracking [57]. He added dynamic programming (DP) matching to calculate similarity. The output was identified based on weights using a Support Vector Machine (SVM). Serra et al. used a tonality descriptor based on chroma. They proposed a new chroma similarity measure and a dynamic programming local alignment algorithm [58].
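The combination of time-lag cross-correlation and circular shifts along the chroma axis can be sketched as below. This is a simplified illustration under our own assumptions, not the implementation of [5]: the function name and the global normalization are hypothetical choices, and no filtering of the correlation curve is applied.

```python
import numpy as np

def best_transposed_xcorr(a, b):
    """Search the key shift and time lag that best align two chroma sequences.

    a, b: (12, n_beats) beat-synchronous chroma matrices.
    Returns (peak value, chroma rotation applied to b, time lag in beats).
    """
    a = a / (np.linalg.norm(a) + 1e-12)
    b = b / (np.linalg.norm(b) + 1e-12)
    best = (-np.inf, 0, 0)
    for shift in range(12):
        # A circular shift along the chroma axis corresponds to a transposition.
        b_rot = np.roll(b, shift, axis=0)
        # Sum the full 1-D cross-correlations over the 12 chroma rows.
        xc = sum(np.correlate(a[k], b_rot[k], mode="full") for k in range(12))
        peak = int(np.argmax(xc))
        if xc[peak] > best[0]:
            # Output index i of mode="full" corresponds to lag i - (len(b) - 1).
            best = (float(xc[peak]), shift, peak - (b.shape[1] - 1))
    return best
```

If `b` is an excerpt of `a` transposed down by three semitones and starting five beats later, the search is expected to return a rotation of 3 and a lag of 5.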

Mahieux specifically explored the concept of high-speed processing and access of vast amounts of data. He applied principal component analysis to 2D Fourier Transform Magnitude and calculated the similarity by Euclidean distance [59].

In contrast to cover songs, only slight differences exist between an original song and a live version because the original artists themselves perform the live version. Riley et al. proposed a system specialized for live versions that can absorb differences in effects and tempo [60]. They separated music into small pieces and extracted a chroma vector for each segment. Vector quantization (VQ) is then applied to the chroma vectors for comparison with prepared audio-words. Rafii et al. applied an image processing approach to an audio fingerprinting system that can deal with live version identification. The fingerprints were derived using a constant-Q spectrogram, and template matching was executed using the Hamming distance and the Hough Transform [61].


When the signal of an entire concert performance that includes several songs is input, these works identify only the one song that has the highest similarity score. Our method performs identification by parsing the audio signal into segments of several tens of seconds.


Chapter 3 Chord Recognition using Doubly Nested Circle of Fifths

This chapter describes a chord recognition method from music signals using chroma vectors and musical knowledge known as “Doubly Nested Circle of Fifths (DNCOF).”

DNCOF represents the relationships of major and minor chords where the neighboring two triads are similar. This study obtains a novel feature from chroma vectors by mapping them onto two-dimensional DNCOF coordinate, which we call “DNCOF vectors.” It is expected that the DNCOF vectors can contribute to correcting false recognition obtained by the chroma vectors when their mapped positions are apart from one another in the DNCOF coordinate. This research evaluated our proposal using the Beatles' datasets and showed its effectiveness.

3.1 Proposed Method

This section describes our proposed method. An earlier report tried to extract key information from audio signals based on the music theory known as the Circle of Fifths [62]. It generated key information by abstracting chroma vectors and mapping them onto the Circle of Fifths, which represents key similarity. We extend this method to chord recognition.

Figure 3.1 displays an overview of our proposal, in which an HMM is prepared for chord recognition and a DNCOF vector is obtained in three steps from audio signals. First, 12-bin chroma vectors are calculated with beat tracking support. Second, initial chord estimation is carried out, and estimation scores for 24 major/minor chords are collected to generate a chord vector. Third, the chord vector is mapped onto the DNCOF coordinate. Finally, the DNCOF vector is utilized for error detection and correction of the recognized chords.


Figure 3.1: Overview of the proposed method.

3.1.1 Chord Vector Calculation

In our proposal, 12-bin chroma vectors are calculated using the ISP (Intelligent Sound Processing) toolbox [63]. We align chroma vectors using a spectrum centered on 400 Hz. We also apply beat tracking so that the chroma vectors are synchronized with beats, similar to [5].

A chord vector is calculated from the chroma vector as a 24-dimensional vector. For this purpose, we prepare an HMM for chord estimation using chroma vectors (its detail is given in 2.4.1). We treat only major and minor chords in this study because most studies consider these 24 chords. Accordingly, the HMM is trained for these chords only and produces 24 chord estimation scores (i.e., probabilities) for the chroma vector. A chord vector is defined as a collection of these probabilities and is given by

\[
C(t) = \bigl[\, C_{C}(t),\ C_{C\#}(t),\ \ldots,\ C_{B\mathrm{min}}(t) \,\bigr] \tag{3.1}
\]

\[
C_{P_n}(t) = p\bigl(P_n \mid \mathrm{chroma}(t)\bigr), \qquad P_1 = C,\ P_2 = C\#,\ \ldots,\ P_{24} = B\mathrm{min}
\]


3.1.2 DNCOF Vector Calculation

DNCOF [2] is a music-theoretic model that represents the relationships among major and minor chords, in which two neighboring triads are similar. A DNCOF vector is calculated from a chord vector by mapping it onto the DNCOF coordinate. It is a two-dimensional vector that expresses a chord component based on similarity.

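It is calculated as:

\[
\mathrm{DNCOF}(t) = \begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = U\, C_{\mathrm{DNCOF}}(t) \tag{3.2}
\]

\[
U = \begin{bmatrix}
\cos\left(2\pi \cdot \frac{0}{24}\right) & \cdots & \cos\left(2\pi \cdot \frac{23}{24}\right) \\
\sin\left(2\pi \cdot \frac{0}{24}\right) & \cdots & \sin\left(2\pi \cdot \frac{23}{24}\right)
\end{bmatrix}
\]

\[
C_{\mathrm{DNCOF}}(t) = \bigl[\, C_{C}(t)\ \ C_{E\mathrm{min}}(t)\ \ \cdots\ \ C_{A\mathrm{min}}(t) \,\bigr]^{\mathsf{T}}
\]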

where U is a matrix formed by a set of twenty-four unit vectors, and C_DNCOF(t) is obtained by reordering the components of C(t) according to chord location on the DNCOF in Figure 3.2. A DNCOF vector can also be represented in polar coordinates by:

\[
r(t) = \bigl\| [\,x(t),\ y(t)\,] \bigr\|, \qquad \theta(t) = \mathrm{angle}\bigl([\,x(t),\ y(t)\,]\bigr) \tag{3.3}
\]

where r(t) is a magnitude of the vector and θ(t) is an angle from the y-axis (direction to C major). We can regard r(t) as the likelihood of chords and θ(t) as chord types.
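The mapping can be sketched in a few lines of NumPy. The chord order on the DNCOF is an assumption here (C, Emin, G, Bmin, ..., F, Amin, so that neighboring triads share two notes), as are the names `CHORDS`, `PERM`, and `dncof_vector`. The angle is measured with `arctan2` from the direction of the first unit vector, which is one possible choice for the angle function and gives zero for C major under the assumed order.

```python
import numpy as np

# Order of the chord vector C(t): 12 major chords, then 12 minor chords.
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
CHORDS = ROOTS + [r + "min" for r in ROOTS]

def dncof_order():
    """Assumed DNCOF order: major triads a fifth apart, interleaved with minors."""
    order, root = [], 0
    for _ in range(12):
        order.append(ROOTS[root])                     # major triad
        order.append(ROOTS[(root + 4) % 12] + "min")  # minor triad a major third above
        root = (root + 7) % 12                        # next major triad, a fifth above
    return order

DNCOF = dncof_order()
PERM = np.array([CHORDS.index(c) for c in DNCOF])  # reorders C(t) into C_DNCOF(t)

angles = 2 * np.pi * np.arange(24) / 24
U = np.vstack([np.cos(angles), np.sin(angles)])    # 2 x 24 matrix of unit vectors

def dncof_vector(chord_vector):
    """Return (x, y, r, theta) for a 24-dimensional chord vector."""
    x, y = U @ np.asarray(chord_vector, dtype=float)[PERM]
    return x, y, np.hypot(x, y), np.arctan2(y, x)
```

A chord vector concentrated on a single chord yields r(t) close to 1 and an angle equal to that chord's position times π/12, whereas a vector spread over distant chords yields a small r(t).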


Figure 3.2: How to map onto DNCOF coordinate (left: a set of twenty-four unit vectors, center: multiplying a chord vector by the left figure, right: obtain a DNCOF vector).

3.1.3 Error Correction

By observing the temporal transition, we obtain DNCOF plots as shown in Figure 3.3. The horizontal axis denotes time and the vertical axis denotes the angle of the DNCOF vector on the DNCOF coordinate (the zero angle is the direction of C major, and the positive direction of the angle is counterclockwise). We expect that DNCOF plots are easy to handle and contribute to detecting false recognition results obtained using chroma vectors.

We first carry out chord recognition using chroma vectors only. This is accomplished by an HMM that uses a single Gaussian model for its output probability. In our model, we use two chord types: major and minor chords. To decide the chord class, we use the maximal gamma values (the chord likelihoods given the observed chroma vectors) from the forward-backward algorithm instead of the Viterbi algorithm.
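The gamma values can be obtained with a scaled forward-backward pass, sketched below for a generic HMM. The function name and argument layout are our own assumptions; in the setting of this chapter, `log_b` would hold the log-likelihoods of each chroma frame under the 24 single-Gaussian chord models.

```python
import numpy as np

def posterior_decode(log_b, A, pi):
    """Frame-wise state posteriors (gamma) and their arg-max labels.

    log_b: (T, N) log observation likelihoods, A: (N, N) transition matrix
    with A[i, j] = p(state j at t+1 | state i at t), pi: (N,) initial distribution.
    """
    T, N = log_b.shape
    # Rescale each frame for numerical stability; gamma is unaffected.
    b = np.exp(log_b - log_b.max(axis=1, keepdims=True))
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * b[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                      # forward pass
        alpha[t] = (alpha[t - 1] @ A) * b[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):             # backward pass
        beta[t] = A @ (b[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma, gamma.argmax(axis=1)
```

Unlike the Viterbi algorithm, which returns only the single best state path, this procedure yields a posterior over all chords for every frame, so the second and third candidates remain available for the later correction step.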

We focus on the distance between the plots of the chroma-based results and those of the DNCOF vectors in DNCOF order, because their positions are apart from one another when the results are false. In fact, the HMM based on chroma vectors often has the correct chord among its second or third candidates even when its top result is false.

This can be exploited for error correction if we handle chord vectors and DNCOF vectors in an adequate manner. Otherwise, the distance between the chroma and DNCOF vectors remains large.

We treat labels of the chroma vector's results as chord types (angle) and maximal gamma values of the HMM as likelihood (magnitude) in DNCOF coordinate.

The distances are calculated frame by frame using the cosine distance. Threshold processing is used for the error judgment. The threshold of distance is 0.966 in this experiment.
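The error judgment can be sketched as follows. We assume here that the reported value 0.966 is applied to the cosine of the angle between the two vectors and that a frame is flagged when the cosine falls below it. This reading is consistent with cos(π/12) ≈ 0.9659, so a displacement of one step in DNCOF order is just enough to be flagged. The function names are hypothetical.

```python
import numpy as np

STEP = 2 * np.pi / 24   # angular spacing of neighboring chords (= pi / 12)
THRESHOLD = 0.966       # value reported in the text

def cosine_similarity(u, v):
    """Cosine of the angle between two 2-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def is_error(chroma_index, gamma_max, dncof_xy):
    """Judge a frame as erroneous when the chroma result and DNCOF vector disagree.

    chroma_index: position (0..23) of the chroma-based label in DNCOF order.
    gamma_max: maximal gamma value of the HMM, used as the magnitude.
    dncof_xy: (x, y) of the DNCOF vector in the same frame.
    """
    angle = chroma_index * STEP
    chroma_xy = gamma_max * np.array([np.cos(angle), np.sin(angle)])
    return cosine_similarity(chroma_xy, dncof_xy) < THRESHOLD
```

Because the cosine ignores magnitudes, only the angular displacement between the chroma result and the DNCOF vector decides the judgment.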

In this study, error correction is carried out using simple compensation methods.

In an error frame, we compare the candidates of the chroma-based HMM with the angle of the DNCOF vector. We choose the candidate nearest to the DNCOF vector in DNCOF order and regard it as the most probable candidate in this frame. We also use the chroma results of preceding frames to correct errors that occur only temporarily within a chord.
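The candidate selection can be sketched as below. This is a minimal illustration under our own assumptions: candidates are given as positions in DNCOF order, the DNCOF angle is quantized to the nearest position, and ties are resolved in favor of the higher-ranked candidate. The function names are hypothetical.

```python
import math

def circular_steps(i, j, n=24):
    """Number of steps between two positions on a circle of n chords."""
    d = abs(i - j) % n
    return min(d, n - d)

def correct_label(candidates, dncof_theta):
    """Pick the chroma-HMM candidate nearest to the DNCOF vector in DNCOF order.

    candidates: DNCOF-order positions (0..23) of the top candidates, best first.
    dncof_theta: angle of the DNCOF vector in radians (zero at C major).
    """
    target = round(dncof_theta / (2 * math.pi / 24)) % 24
    # min() keeps the first (highest-ranked) candidate in case of ties.
    return min(candidates, key=lambda c: circular_steps(c, target))
```

For example, with candidates at positions 5, 1, and 22 and a DNCOF angle of -π/12 (position 23), the candidate at position 22 is selected because it is only one step away.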

3.2 Experimental Results

We implemented our proposed method in MATLAB and ran the experiments on a personal computer. Audio signals are downsampled to 32 kHz and separated into small pieces. The frame size is 4096 samples. We used the annotated Beatles dataset provided by Isophonics [29], which includes 180 Beatles songs. It provides labels other than major and minor chords. We group triads and other chords with the same root into the same category. For instance, we treat the C minor triad and the C minor augment chord as the C minor chord.

3.2.1 Error Correction

To evaluate the effectiveness of our error detection method, we computed recall and precision over 180 Beatles songs. They are defined as:

\[
\mathrm{Precision} = \frac{N_R}{N_R + N_{fp}}, \qquad \mathrm{Recall} = \frac{N_R}{N_R + N_{fn}}, \tag{3.4}
\]

where N_R represents the number of correct results, N_fp denotes the number of false-positive results, and N_fn stands for the number of false-negative results.


Figure 3.3: Each result from the Beatles’ Words of Love (from 120 to 145 frames per beat)

Table 3.1: Error estimation performance (precision, recall and F-measure)

Precision Recall F-measure

0.842 0.826 0.834

Table 3.1 shows the error detection performance in terms of precision, recall, and F-measure.

The DNCOF vectors can indicate whether errors have occurred in chord recognition using chroma vectors. Figure 3.3 supports this result by showing how far apart the DNCOF vectors and the chroma results are. When errors have occurred, each DNCOF plot is slightly displaced from the chroma one. The threshold value of 0.966 corresponds to an angular difference of one step in DNCOF order (one step = π/12) when errors have occurred.
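The measures can be computed as in the following sketch. The F-measure is not defined explicitly in the text; we assume the usual harmonic mean of precision and recall, which reproduces the value in Table 3.1. The counts passed in the test call are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    return 2 * precision * recall / (precision + recall)

def precision_recall_f(n_correct, n_fp, n_fn):
    """Precision and recall as in Eq. (3.4), plus the F-measure."""
    precision = n_correct / (n_correct + n_fp)
    recall = n_correct / (n_correct + n_fn)
    return precision, recall, f_measure(precision, recall)

# Consistency check against Table 3.1 (precision 0.842, recall 0.826).
table_f = round(f_measure(0.842, 0.826), 3)
```

With the precision and recall of Table 3.1, the harmonic mean rounds to 0.834, in agreement with the reported F-measure.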

3.2.2 Chord Recognition Performance

Although training data and test data are prepared separately in many recognition experiments, we follow the MIREX Audio Chord Estimation style [7] in this paper. We computed recognition rates using 3-fold cross validation. We evaluate the following two methods:

・ Chord recognition using chroma (no compensation) [12]

・ Compensation using DNCOF

Table 3.2: Recognition-rate accuracy in percentage

Conventional    Proposed
72.4%           73.1%

Table 3.3: Angular difference from correct label

Conventional (label)    Proposed (label)    DNCOF Vector
2.55                    2.54                0.32

Table 3.2 shows the recognition-rate accuracy in percentage. The proposed method with DNCOF compensation performs slightly better than the conventional method. Errors tend to span intervals and to occur at chord boundaries. The errors at chord boundaries are slightly reduced when the compensation is used. The compensated values are far from the correct ones if a transition happens from one chord to a distant chord in DNCOF order.

3.2.3 Angular Difference on DNCOF coordinate

We also evaluated the angular difference between the output and the correct label in the DNCOF coordinate. Table 3.3 shows how far each result is from the correct label.

In the DNCOF coordinate, the DNCOF vector is closer to the correct label than the output label of the conventional method. It is therefore possible to improve the chord recognition performance by reducing the angular difference.

Fig. 3.3 supports this result by showing how far apart the DNCOF vectors and the chroma results are. When errors have occurred, each DNCOF plot is slightly displaced from the chroma one, as shown in Fig. 3.3. The chroma outputs alone are not sufficient, and the correct answer remains among the other candidates. Therefore, the chroma-based HMM can be improved with the help of DNCOF if we can correctly adjust the false output.

On the other hand, some plots are not corrected in Fig. 3.3. In these frames, the corrected path still follows the erroneous chroma path. This may be due to the mapping method, in which DNCOF vectors are abstracted from chroma vectors, and to the fact that DNCOF vectors therefore depend on chroma vectors.

3.3 Summary

In this chapter, we proposed an approach for chord recognition using DNCOF vectors and chroma vectors. In the DNCOF coordinate, the DNCOF vector is closer to the correct label than the output label of the conventional method. The DNCOF vector can contribute to improving the chord recognition performance by reducing the angular difference.

However, the DNCOF vectors leave much to be improved. To this end, we plan to enhance the precision of DNCOF vectors and to re-evaluate the method in order to improve the chroma vectors themselves. Furthermore, we should identify error intervals clearly and examine better methods for error correction.


Chapter 4 Effects of Audio Compression on Chord Recognition

Feature analysis of audio compression is necessary to achieve high accuracy in musical content recognition and content-based music information retrieval (MIR). Bit rate differences are expected to adversely affect musical content analysis and content-based MIR results because the encoding might change the frequency response. In this chapter, we specifically examine the effect on the chroma vector, which is a commonly used feature vector for music signal processing. We analyze the sound quality of music files encoded at different bit rates and compare their chroma features with those of the original songs, using datasets for chord recognition.

4.1 Background

Audio compression technologies such as MPEG-1 Audio Layer-3 (MP3), Advanced Audio Coding (AAC), and Windows Media Audio (WMA) have made significant contributions to music distribution services via the Internet. These technologies have enabled efficient compression of music files with high quality. Feature analysis of audio compression with different bit rates is necessary to achieve high accuracy for musical content analysis and content-based music information retrieval (MIR). The bit rate difference can have an unexpected adverse effect on musical content recognition and MIR results because of the frequency response change induced by encoding.

4.2 Methods

4.2.1 Chroma vector calculation

We calculate chroma vectors extracted from MP3, Low Complexity AAC (AAC LC), and Ogg Vorbis files at different bit rates. The encoding settings are shown in Table 4.1. All files are encoded in the Constant Bit Rate (CBR) mode, which uses the same bit rate in every frame throughout the entire file.

Table 4.1: Encoding settings.

Codec              MP3          AAC LC             Ogg Vorbis
Extension          mp3          m4a                ogg
Encoder            LAME [64]    NeroAACEnc [65]    oggenc2 [66]
Bit Rate (kbps)    32–320       12–320             32–320

LAME (ver. 3.99.5) [64] is used to obtain MP3 files. Each music file is encoded at 14 bit rates from 32 kbps to 320 kbps. NeroAACEnc (ver. 1.5.1.0) [65] is used to obtain AAC LC files. Each music file is encoded at 16 bit rates from 12 kbps to 320 kbps. Oggenc2.87 [66] is used to obtain Ogg Vorbis files. Ogg Vorbis is a quality-based encoding for which the Variable Bit Rate (VBR) mode, in which the amount of output data varies per time segment, is recommended. For this study, we set the maximum and minimum bit rates to the same value using the management mode to obtain encoded files similar to those of the CBR mode. Each music file is encoded at 14 bit rates from 32 kbps to 320 kbps.

We obtain chroma vectors using the ISP toolbox [63]. The DFT is computed using window frames of length 93 ms with 75% overlap. We align chroma vectors using a spectrum centered on 400 Hz. To obtain the final features, we average the frames within each beat, following the beat synchronization of [5].
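The processing chain can be illustrated with a simplified chroma computation. This sketch is not the ISP toolbox algorithm: the folding of DFT bins onto pitch classes, the frequency range, the A4 = 440 Hz tuning reference, and the function names are all assumptions, and the 400 Hz centered weighting is omitted. A 4096-sample window at 44.1 kHz corresponds to the 93 ms frame length stated above.

```python
import numpy as np

def chroma_frames(x, sr, n_fft=4096, overlap=0.75, fmin=55.0, fmax=2000.0):
    """Fold DFT power onto 12 pitch classes (C = 0) for each frame."""
    hop = int(n_fft * (1 - overlap))
    window = np.hanning(n_fft)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    keep = (freqs >= fmin) & (freqs <= fmax)
    # MIDI note number of each kept bin (A4 = 440 Hz = note 69), then pitch class.
    midi = 69 + 12 * np.log2(freqs[keep] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12
    fold = np.zeros((12, int(keep.sum())))
    fold[pitch_class, np.arange(int(keep.sum()))] = 1.0
    n_frames = 1 + (len(x) - n_fft) // hop
    chroma = np.empty((n_frames, 12))
    for i in range(n_frames):
        spec = np.abs(np.fft.rfft(window * x[i * hop: i * hop + n_fft]))
        chroma[i] = fold @ spec[keep] ** 2
    return chroma / np.maximum(chroma.sum(axis=1, keepdims=True), 1e-12)

def beat_average(chroma, beat_frames):
    """Average the chroma frames between consecutive beat positions (frame indices)."""
    pairs = zip(beat_frames[:-1], beat_frames[1:])
    return np.array([chroma[s:e].mean(axis=0) for s, e in pairs])
```

Since all octaves of a pitch are folded into the same bin, a 440 Hz tone and an 880 Hz tone both contribute to pitch class A (index 9).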

4.2.2 Perceptual Evaluation of Audio Quality (PEAQ)

PEAQ is an algorithm for objectively measuring perceived audio quality, standardized as ITU-R BS.1387-1. The ODG (Objective Difference Grade) value represents the measured audio quality of the signal on a continuous scale from -4 (very annoying impairment) to 0 (imperceptible impairment). The ODG score is calculated from several variables based on human sound perception. These objective measures are called MOVs (Model Output Variables).


Table 4.2: Model output variables (MOV) of PEAQ basic version [67]

Model Output Variable (MOV)    Interpretation
WinModDiffB                    Windowed modulation (envelopes) difference
AvgModDiff1B                   Averaged modulation difference
AvgModDiff2B                   Averaged modulation difference with emphasis on introduced modulations
RmsNoiseLoudB                  Root Mean Square (RMS) value of the noise loudness
BandwidthRefB                  Bandwidth of the Reference Signal
BandwidthTestB                 Bandwidth of the Test Signal
RelDistFramesB                 Frequency of audible distortions
Total NMRB                     Average Total Noise to Mask Ratio
MFDPB                          Detection probability after low pass filtering
ADBB                           Average Distorted Block
EHSB                           Harmonic structure of the error

PEAQ has two versions: a basic version and an advanced version. The basic version features a low-complexity approach, whereas the advanced version offers higher accuracy at the cost of higher complexity. Table 4.2 lists the MOVs of the basic version and their interpretation. For audio quality evaluation, we used the basic version implemented by Kabal [67].

4.2.3 Application to Chord Recognition

To evaluate the effect of audio compression on chord recognition, we computed the accuracy using chroma vectors by reference to [12]. One method is a Hidden Markov Model (HMM) that uses a single Gaussian model for its output probability. To decide the chord class, we use the maximal gamma values (the chord likelihoods given the observed chroma vectors) from the forward-backward algorithm instead of the Viterbi algorithm. The other is a support vector machine and HMM (SVM-HMM) model trained on the ground truth data [20]. The structured SVM training package is provided by [68]. The chord sequences are estimated with the chroma vectors and the trained SVM. Both models use two chord types: major and minor chords. We use the 12 cyclic shifts of a 12-dimensional chroma vector to obtain additional training data for major and minor chords, respectively. We first transpose all chroma features to C major or C minor. Then, the models for these two chords are defined. Finally, we obtain the models of 24 chords by transposing the C major and C minor models.
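The transposition steps can be expressed with cyclic shifts, as in the following sketch for the single-Gaussian case. The function names are hypothetical, and the binary chord templates in the example stand in for trained model parameters.

```python
import numpy as np

def transpose_to_c(chroma, root):
    """Rotate a 12-bin chroma vector so that the chord root lands on bin 0 (C)."""
    return np.roll(chroma, -root, axis=-1)

def shifted_models(mean_c, cov_c):
    """Derive the 12 transposed Gaussian models from the one trained on root C."""
    means = [np.roll(mean_c, k) for k in range(12)]
    # The covariance matrix must be shifted along both rows and columns.
    covs = [np.roll(np.roll(cov_c, k, axis=0), k, axis=1) for k in range(12)]
    return means, covs

# Illustrative binary templates (C = 0, C# = 1, ..., B = 11).
c_major = np.zeros(12)
c_major[[0, 4, 7]] = 1.0
g_major = np.zeros(12)
g_major[[7, 11, 2]] = 1.0
```

Applying `transpose_to_c` to a G major frame with root 7 gives the C major pattern, and shifting the C major model by 7 bins gives the G major model. Applying the same procedure to the C minor model yields the remaining 12 chords.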

4.3 Experimental Results

We implemented our method in MATLAB and ran the experiments on a personal computer. The original music sources are collected from CDs. All audio signals are sampled at 44.1 kHz. As ground truth chord annotations, we use the Isophonics dataset [29] and the AIST annotations [31], for a total of 307 songs.

4.3.1 PEAQ Evaluation

First, we compute the ODG of the 307 songs. Fig. 4.1 shows the average ODG and the chord recognition results for bit rates of 32–320 kbps. We compute chord recognition accuracy to analyze the effects of audio compression on chord recognition results. For each song, accuracy is calculated as the ratio of the total length of the correctly analyzed chords to the total length of the song. For our evaluation, we use four-fold cross validation for each bit rate. The final average score is obtained by averaging the scores of all 307 songs for each bit rate. The dataset provides labels other than major and minor chords. We group triads and other chords with the same root into the same category. For instance, we treat the C minor triad and the C minor augment chord as the C minor chord.
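The per-song accuracy can be computed from labeled time segments as sketched below. The segment representation (start, end, label) and the function name are assumptions made for this example.

```python
def overlap_accuracy(reference, estimate):
    """Ratio of correctly labeled duration to the total reference duration.

    reference, estimate: lists of (start, end, label) segments in seconds.
    """
    total = sum(end - start for start, end, _ in reference)
    correct = 0.0
    for r_start, r_end, r_label in reference:
        for e_start, e_end, e_label in estimate:
            if r_label == e_label:
                # Add the overlap of the two segments, if any.
                correct += max(0.0, min(r_end, e_end) - max(r_start, e_start))
    return correct / total

reference = [(0.0, 4.0, "C"), (4.0, 8.0, "G")]
estimate = [(0.0, 3.0, "C"), (3.0, 8.0, "G")]
accuracy = overlap_accuracy(reference, estimate)
```

In this example the estimate is wrong only between 3 s and 4 s, so 7 of the 8 seconds are correct and the accuracy is 0.875.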

The accuracies of the HMM method and the SVM method for PCM files were 63.5% and 67.8%, respectively. As expected, the recognition rate increases with ODG, which implies that the frequency content of the signal is easily influenced by compression at lower bit rates but is hardly affected at higher bit rates.


Figure 4.1: ODG values and chord recognition results (top: GMM-HMM, bottom: SVM-HMM).

Figure 4.2: Evaluation of different bit rate data.

4.3.2 Chord Recognition Performance with Different Bitrate

The evaluation is executed by dividing the 307 songs into three groups. The models are trained on data of different bit rates as shown in Figure 4.2. We also handle 72 chord types (major, minor, diminished, augmented, major seventh, and minor seventh).


Figure 4.3: Chord recognition results with different bit rates (top: GMM-HMM, bottom: SVM-HMM).

Fig. 4.3 shows the recognition accuracy in percentage. The chroma vectors do not differ much among the various bit rates for each codec, which shows that a lower bit rate has little effect on chord recognition using machine learning.

4.4 Discussions

Audio compression does not strongly affect the power distribution of chroma vectors. This indicates that the frequencies retained after compression, such as the fundamental frequency, are the ones critical for chroma vector calculation, because a chroma vector is defined as a 12-dimensional vector irrespective of the octave. The frequency response might also be changed by the encoding algorithm and by downsampling.

For example, HE-AAC (High-Efficiency Advanced Audio Coding), an extension of AAC LC for low bit rate applications, is of current interest. The influence of downsampling on chroma vectors can be analyzed, but it needs further study.
