Constrained Motif Discovery

(1)

Constrained Motif Discovery

Yasser Mohammad and Toyoaki Nishida

Abstract— The goal of motif discovery algorithms is to efficiently find unknown recurring patterns in time series. Most available algorithms cannot utilize domain knowledge in any way which results in quadratic or at least sub-quadratic time and space complexity. For large time series datasets for which domain knowledge can be available this is a severe limitation. In this paper we define the Constrained Motif Discovery problem which enables utilization of domain knowledge into the motif discovery process. We also show that most unconstrained motif discovery problems be converted into constrained motif discovery problem using a change point detection algorithm. We provide two algorithms for solving this problem and compare their performance to state-of-the-art motif discovery algorithms on a large set of synthetic time series. The proposed algorithms can provide linear time and constant space complexity. The proposed algorithms provided four to ten folds increase in speed compared to two state of the art motif discovery algorithms without loss of accuracy and provided better noise robustness in high noise levels.

I. INTRODUCTION

The research in unsupervised motif discovery have led to many techniques including the PROJECTIONS algorithm [1], PERUSE [2], Gemoda [3] among many others ([4]). With the exception of Gemoda which is quadratic in time and space complexities, these algorithms aim to achieve sub-quadratic time complexity by first looking for candidate motif stems using some heuristic method and then doing an exhaustive motif detection instead of motif discovery which is linear in time. The most used method for finding these stems is the PROJECTIONS algorithm [1] which requires discritization of the data using the SAX [5] algorithm. A major drawback of all methods relying on discritization of the data is the need to specify a word length and a vocabulary size that are difficult to decide in real world situations. Another problem of most of the previously proposed tech-niques is the need to specify an exact or at least roughly correct motif length. A third problem is deciding when to stop searching for new motifs ([4] suggested using density estimation for this purpose). One common problem to all the algorithms based on the PROJECTIONS algorithm [4], [1] is the need to construct and keep the collision matrix which is in general quadratic in the length of the SAX word describing the time series. Given that the optimal word size can be Yasser Mohammad is with the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University

yasser@ii.ist.i.kyoto-u.ac.jp

Toyoaki Nishida is with the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University

nishida@i.kyoto-u.ac.jp

The ﬁrst author would like to thank the Egyptian Ministry of Higher Education for supporting his PhD study during which this research took place

very short for short motifs in signals with high frequency components, the size of this matrix can grow quadratic with the length of the time series. Catalano et al. [6] suggested a very efﬁcient algorithm for locating variable length patterns in data series using random sampling that allows it to run in linear time and constant memory. The main problem of this approach is that it relies on random sampling which can lead to poor performance for long time series with infrequent embedded motifs including long records of human activities. All of the methods proposed for motif discovery that we are aware of assume no prior knowledge of the probable locations of the motifs which leads to this explosion in the processing time or space needed. This paper introduces the Constrained Motif Discovery problem in which we try to ﬁnd motifs utilizing information about their probable locations in the data series. This leads to a linear time algorithm with constant space complexity based on the user choice. The introduction of constraints also allows interactive implementations which can utilize the human eye as the ultimate pattern recognizer to achieve higher performance.

II. CONSTRAINEDMOTIFDISCOVERYPROBLEM Deﬁnition 1: Time series

A time series of length n (Xn) is deﬁned as an ordered set of values x (t) where 1≤ t ≤ n and x (t) ∈ U. U is called the domain of Xn.

Deﬁnition 2: Distance Function

A distance function d over a domain U is a function that implements the transformation: dy1l1, y2l2

: Ul1× Ul2 →

+

for any two time series yl1 _{and y}l2 _{deﬁned over the} domain U . The transformation function be nonzero and transitive.

Deﬁnition 3: Subsequence set

Given a time series Xn of length n, the subsequence set of X of length range lmin: lmax(X_sublmin:lmax) is deﬁned as the

set of all unique time series xl_sub that satisfy: 1) lmin≤ l ≤ lmax

2) xl_sub(t) = x (t + i) for all 1≤ t ≤ l and some integer i≥ 0

Deﬁnition 4: Trivial Match Set

Given a subsequence xsub of a time series X, the trivial match set of xsub (triv (xsub)) is deﬁned as the set of all

time subsequences Tsubthat intersects the subsequence xsub

in one or more points. Deﬁnition 5: Motif

Given a time series Xn, and a distance function d and two length limits (lmin and lmax), a motif M is deﬁned as a set of time series called motif occurrences (mi) that satisfy:

(2)

1) mi ∈ X_sublmin:lmax

2) lmin≤ len (mi)≤ lmax

3) arg min x (d (mi, x)) ∈ M, x ∈ xlmin:lmax_sub − triv (mi) 4) d (mi, x) > d (mi, y)

wherex∈ M, y ∈ Xsub− M − triv (mi)

5) mi is maximal. This means that mi is not a

subse-quence of any larger motif occurrence satisfying the previous conditions.

6) The number of elements in the motif M is larger than the mean of the number of elements in any other subsequence set that satisfy the previous conditions. Formally this can be stated as: There is a real value η > 1 so that:

|M| > η × E {|NM|}

where N M is the set of all time series sets satisfying the other conditions in this list, |S| is the number of elements in the set S, and E{} is the expectation operator.

Deﬁnition 6: Constrained Motif Discovery Problem Given a time series Xn, another time series Pn of the same length called a constraint (where 0.0≤ p (t) < 1.0), two real valued limits lmin and lmax, and a distance function d ﬁnd

all motifs (Mlmin:l_max_{) of length l where l}

min≤ l ≤ lmax.

The unconstrained motif discovery problem solved by available motif discovery algorithms can be considered a special case of the constrained motif discovery problem where p (t1) = p (t2) for all 1≤ t1≤ n and 1 ≤ t2≤ n.

In the remaining of this paper we will assume without loss of generality that the domain of the time series (U ) is the set of real numbers (). Discrete Time Wrapping (DTW) will be used as the distance function.

III. MC ALGORITHMS

Catalano et al. [6] proposed an efﬁcient algorithm for locating recurring motifs in time series that works in linear time and constant memory. The algorithm works by process-ing a ﬁxed sized set of candidate and comparison windows randomly sampled from the time series. The steps of the algorithm can be summarized as:

1) Select a subsequence sw of length w≥ lmax.

2) Select w values randomly from X and concatenate them to form a noise sequence nw

3) select a set of nc comparison subsequences of X

({cwi}) each of length w.

4) ﬁnd the set Swˆ _{of subsequences of of length ˆ}_{w where} ˆ

w ≤ lmin for sw,cwi, and nw. Then normalize all of

the resulting subsequences to have unit mean square. 5) For the candidate subsequence of sw (sw_kˆ) do the

following

a) Randomly select w− ˆw− 1 subsequences from the set of all subsequences of the comparison windows (cwi). call this set the comparison set

ˆ cwˆ

j

b) ﬁnd the distances dkj= dswkˆ, ˆcwj.

c) Group the set ˆcwˆ

j with their parent subsequence

cwi and for every group select ˆcwjˆ that has the

minimum distance dkj. This leads to a set of R

subsequences ˜cwrˆ where R≤ nc

d) keep only the ˆR subsequences of ˜cwrˆ with least

dkr.

e) repeat the previous three steps for subsequences of the noise subsequence nw. This leads to an-other set of ˆR subsequences called ˜nwˆ

r

6) Remove all candidate subsequences swˆ

k that has similar

average distance with both ˜nwˆ

r and ˜cwrˆ and then repeat

the steps above using this reduced set. Repeat this reduction for nr times.

7) If the ﬁnal set swˆ

k is not empty output each of them

as a motif seed Msafter concatenating any continuous

subset of them.

Assuming that the constraint p (t) is specifying the proba-bility that a motif occurrence ends at or near the time step t, we can modify the first three steps of the original algorithm to speed up the motif discovery process. Two alternative ways to do so are presented in the following subsections. These Modified Catalano algorithms are what we call the MC algorithms in the rest of this paper. In section V we will show that this simple modification can speedup the original Catalano algorithm significantly given that the constraint is selected wisely. In section IV we will show an effective domain independent method for calculating the constraint for any real valued time series.

A. MCFull Algorithm

If calculating the constraint P is not computationally expensive or cannot be calculated for any required point of the series X using only local information, we can speed up Catalano algorithm by modifying the ﬁrst three steps as follows:

1) Apply a gaussian smoothing ﬁlter (N (0, σ2)) to the original P constraint which results on the smoothed constraint ˜P .

2) normalize ˜P so that

n

t=1p (t) = 1 and 0ˆ ≤ ˆp(t) ≤ 1.

3) Randomly select a subsequence swof length w≥ lmax

using ˆP as the probability distribution.

4) Randomly select w values from X and concatenate them to form a noise sequence nw using 1− ˆP as the probability distribution.

5) Randomly select a set of nc comparison subsequences of X ({cwi}) each of length w using ˆP as the

probability distribution.

The rest of the algorithm goes exactly as the original algorithm. The smoothing step is required to account for any inaccuracy of the constraint. This way we understand the constraint as specifying the probability that a motif occurrence ends near and not necessary at every time step. This algorithm is called MCFull (standing for Modiﬁed Catalano Full) because the P constraint has to be calculated completely before the algorithm is run.

(3)

B. MCInc Algorithm

If calculating the constraint P is computationally expen-sive and more over if p (t) can be calculated using only local information around t, then an incremental version of MCFull can be constructed (named MCInc hereafter) by modifying the MCFull steps as follows:

1) Randomly select a time step 1 ≤ τ ≤ n using a uniform distribution.

2) Calculate p (t) for τ − m ≤ t ≤ τ + m and calculate ˆ

p (τ ) = 1 2×m+1

τ +m t=τ −mp (t).

3) Repeat steps 1 and 2 as long as ˆp (τ )≤ th

4) select the subsequence sw of length w ≥ lmax that

ends at τ

5) Randomly select w values from X and concatenate them to form a noise sequence nw.

6) Repeat steps 1,2,3 fornctimes to select the comparison

subsequences of X ({cwi}) each of length w.

IV. CALCULATING THE CONSTRAINT

The Constrained Motif Discovery problem as specified in section II does not define any specific way to calculate the constraint value P , because this constraint can come from domain knowledge. In this section we briefly describe a method for calculating the constraint value P given only the time series X. Using this technique it is possible to apply algorithm designed to solve the constrained motif discovery problem even when there is no enough domain knowledge to calculate the constraint. In section V we will use this technique to compare our proposed algorithms with other motif discovery algorithms and will show that in many cases this indirect method of solving the unconstrained motif discovery problem can be even more efficient than using traditional approaches.

To ﬁnd the needed noisy estimation of the motif locations for cases in which domain knowledge cannot be utilized we use our Robust Singular Spectrum Transform (RSST) [7] to ﬁnd change points in the time series under the assumption that motifs are corresponding to changes in the dynamics of the time series or the generating processes. Although pathological cases that does not follow this assumptions can be designed, in most real-world time series mining we are interested in motifs that cause or result from some change either in the time series itself or related variables. In all such cases the proposed algorithms can be of value in discovering the interesting motifs much faster than currently available techniques.

The main idea is to ﬁnd the change score at every point which is deﬁned as the probability that there is a change in the dynamics of the time series and utilize this score as the constraint value at this point.

Only the basic idea of the RSST algorithm is given here. For more details please refer to [7]. The main idea of the RSST algorithm is to calculate the directions of maximal change in the time series before and after every point t of the time series X. The maximal change directions of the past is calculated as the singular vectors associated with the

l largest singular values of the matrix defined by arranging n overlapping subsequences of length w before the point in rows. The parameter l is determined dynamically at every point of the time series. The maximal change directions of the future is defined the same way. When used to find the constraint for a time series X we select w = lmin/2 and

n =2 × lmax/lmin.

The angles between directions estimated using the future of the time series and the subspace deﬁned by the directions of the past of the time series is then used to ﬁnd the change score p (t).

V. EVALUATION

To evaluate the effectiveness of the proposed modiﬁca-tions of Catalano algorithm in solving the constrained motif discovery problem we will compare the performance of the following algorithms on a wide range of synthetic data with embedded motifs:

1) Projections [Pro]: A general motif discovery algorithm introduced in [8] and is the basis of many other state-of-the-art motif discovery algorithms [8], [1]. This algorithm requires the speciﬁcation of the exact motif length.

2) Catalano’s algorithm [Cat]: The original Catalano et al.’s algorithm as speciﬁed in [6].

3) MCFull: proposed in section III-A 4) MCInc: proposed in section III-B

50400 different synthetic time series with controlled em-bedded motifs were produced by changing various features of the time series. Every time series was composed of a recurring pattern called the background signal with embed-ded motifs embedembed-ded at random locations with controllable numbers. Uniform noise was added to the time series. The features of the time series that was changed in the tests are: 1) time series length was changed from 1×103_{to 1}_×106 2) The relative number of embedded motifs was changed

from 1% to 10% of the length of the time series. 3) The length of the embedded motif was changed from

50 to 100 points.

4) The noise level deﬁned as the peak value of the white noise added divided by the peak-to-peak value of the motifs from 0% to 20%.

5) The background signal strength deﬁned as the peak-to-peak value of the background signal divided by the peak-to-peak value of the embedded pattern was changed between 0% to 40%.

6) The generating process of both the background signal and the motif.

For Catalano, MCFull and MCInc, the parameters lmin,

lmax were ﬁxed at 40 and 100 respectively. For the

Pro-jections algorithm the correct motif size was given to the algorithm.

Fig. 1 shows the processing time per point for every one of the tested algorithms. It is clear that the Projections algo-rithm is subquadratic although not linear while the Catalano, MCFull, and MCInc are all linear specially for long time

(4)

Fig. 1. Processing Time per point of the input time series for the four tested motif discovery algorithms. For MCFull and MCInc the time required to calculate the constraint using RSST is also added

series. The average speedup obtained by using MCFull was 4.19 times over the Catalno algorithm and 10.76 times over the Projections algorithm. The average speedup obtained by MCInc was 4.09 over the Catalano algorithm and 9.84 times over the Projections algorithm.

Fig. 2. The Speedup obtained over Catalano et al.’s original algorithm for the two proposed algorithms

To further study the effect of time series length on the speedup obtained by using the MCFull and MCInc algo-rithms, Fig. 2 plots the speedup over the Catalano Algorithm as a function of the time series length. As the ﬁgure shows, MCFull can even be slower than Catalano algorithm for short time series while MCInc shows four folds increase in speed at the same conditions. When the time series length increases, the advantage of MCInc over MCFull decreases until for long time series, MCFull achieves faster processing than MCInc.

VI. CONCLUSION

This paper presented the Constrained Motif Discovery problem and the first two algorithms to solve it. The two algorithms (MCFull and MCInc) are modifications of the recently developed algorithm by Catalano et al. for solving unconstrained motif discovery problems. The paper also briefly introduced the RSST algorithm for change point detection and showed how to use it to convert unconstrained

motif discovery problems into constrained motif discovery problems that can be solved in linear time. The evaluation of the proposed algorithms show four folds increase in speed over the original algorithm in a large set of synthetic time series. Comparison with the Projections algorithm also showed a ten folds increase in speed compared. The proposed algorithms was also more robust against noise for high noise levels even though the Projections algorithm was the best algorithm for low noise levels.

The proposed algorithms can be used to solve many real world problems in which information about the locations of motifs can be inferred from other available time series. For example in interaction contexts, the behavior of the interactors are dependent on each other and the locations of motifs in one of them can correspond (with some delay) to the locations of motifs in the other. The ability to use this kind of knowledge to speedup and increase the accuracy of motif discovery is a unique feature of the proposed algorithms.

Directions of future research include applying the pro-posed algorithms to real world data and modifying them to solve multidimensional constrained motif discovery prob-lems. Speeding up the RSST algorithm in order to solve unconstrained motif discovery problems faster can be another direction of future research. Other algorithms for solving the constrained motif discovery problem that provide the exhaustiveness in ﬁnding motif occurrences guaranteed by PROJECTIONS based algorithms is a third direction of future research.

REFERENCES

[1] Bill Chiu, Eamonn Keogh, and Stefano Lonardi. Probabilistic discovery of time series motifs. In KDD ’03: Proceedings of the ninth ACM

SIGKDD international conference on Knowledge discovery and data mining, pages 493–498, New York, NY, USA, 2003. ACM.

[2] Tom Oates. Peruse: An unsupervised algorithm for ﬁnding recurring patterns in time series. In International Conference on Data Mining, pages 330–337, 2002.

[3] Kyle L. Jensen, Mark P. Styczynxki, Isidore Rigoutsos, and Greogory N. Stephanopoulos. A generic motif discovery algorithm for sequenctial data. BioInformatics, 22(1):21–28, 2006.

[4] D. Minnen, T. Starner, I. Essa, and C.L. Isbell. Improving activity discovery with automatic neighborhood estimation. In Int. Joint Conf.

on Artiﬁcial Intelligence, 2007.

[5] E. Keogh, J. Lin, and A. Fu. Hot sax: efﬁciently ﬁnding the most unusual time series subsequence. Data Mining, Fifth IEEE International

Conference on, pages 8 pp.–, Nov. 2005.

[6] Joe Catalano, Tom Armstrong, and Tim Oates. Discovering patterns in real-valued time series. In Knowledge Discovery in Databases: PKDD

2006, pages 462–469, 2007.

[7] Yasser Mohammad and Toyoaki Nishida. Change point detection using robust singular spectrum transform applied to mining human-human interaction records. In International Conference on Data Mining, 2009. submitted.

[8] J. Buhler and M. Tompa. Finding motifs using random projections. In

5th Internatinal Conference on Computational Biology, pages 69–76,

2001.