Time Series Analysis in Python with statsmodels
Wes McKinney¹, Josef Perktold², Skipper Seabold³
¹ Department of Statistical Science, Duke University
² Department of Economics, University of North Carolina at Chapel Hill
³ Department of Economics, American University
10th Python in Science Conference, 13 July 2011
What is statsmodels?
A library for statistical modeling, implementing standard statistical models in Python using NumPy and SciPy
Includes:
Linear (regression) models of many forms
Descriptive statistics
Statistical tests
Time series analysis
...and much more
What is Time Series Analysis?
Statistical modeling of time-ordered observations
Inferring structure, forecasting and simulation, and testing distributional assumptions about the data
Modeling dynamic relationships among multiple time series
Broad applications, e.g. in economics, finance, neuroscience, signal processing...
Talk Overview
Brief update on statsmodels development
Aside: user interface and data structures
Descriptive statistics and tests
Autoregressive moving average (ARMA) models
Vector autoregression (VAR) models
Filtering tools (Hodrick-Prescott and others)
Near future: Bayesian dynamic linear models (DLMs), ARCH / GARCH volatility models, and beyond
Statsmodels development update
We’re now on GitHub! Join us:
http://github.com/statsmodels/statsmodels
Check out the slick Sphinx docs:
http://statsmodels.sourceforge.net
Development focus has been largely computational, i.e. writing correct, tested implementations of all the common classes of statistical models
Statsmodels development update
Major work to be done on providing a nice integrated user interface. We must work together to close the gap between R and Python! Some important areas:
Formula framework for specifying model design matrices
Integrated rich statistical data structures (pandas)
Data visualization of results, always a few keystrokes away
A “Statsmodels for R users” guide
Aside: statistical data structures and user interface
While I have a captive audience...
Controversial fact: pandas is the only Python library currently providing data structures matching (and in many places exceeding) the richness of R’s data structures (for statistics)
Let’s have a BoF session so I can justify this statement
Feedback I hear is that end users find the fragmented, incohesive set of Python tools for data analysis and statistics to be confusing, frustrating, and certainly not compelling them to use Python...
(Not to mention the packaging headaches)
Aside: statistical data structures and user interface
We need to “commit” ASAP (not 12 months from now) to a high level data structure(s) as the “primary data structure(s) for statistical data analysis” and communicate that clearly to end users
Or we might as well all start programming in R...
Example data: EEG trace data
[Figure: EEG trace, samples 0–4000]
Example data: Macroeconomic data
[Figure: cpi, m1, and realgdp series, 1960–2008]
Example data: Stock data
[Figure: stock prices of AAPL, GOOG, MSFT, and YHOO, 2001–2009]
Descriptive statistics
Autocorrelation, partial autocorrelation plots
Commonly used for identification in ARMA(p,q) and ARIMA(p,d,q) models
acf = tsa.acf(eeg, 50)
pacf = tsa.pacf(eeg, 50)
[Figure: autocorrelation and partial autocorrelation plots for the EEG series]
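What `tsa.acf` computes can be sketched directly in NumPy; the helper below is hypothetical (not the statsmodels implementation) and uses the standard biased estimator that divides by n rather than n − k:

```python
import numpy as np

def sample_acf(x, nlags):
    # Biased sample autocorrelation:
    # r_k = sum_t (x_t - xbar)(x_{t+k} - xbar) / sum_t (x_t - xbar)^2
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:n - k], x[k:]) / denom for k in range(nlags + 1)])

# Sanity check on a simulated AR(1): r_1 should be near the AR coefficient (0.7)
rng = np.random.default_rng(0)
e = rng.standard_normal(5000)
y = np.zeros(5000)
for t in range(1, 5000):
    y[t] = 0.7 * y[t - 1] + e[t]
acf = sample_acf(y, 10)
```

For an AR(1) process the theoretical ACF decays geometrically (r_k ≈ 0.7^k), which is the pattern one looks for when identifying lag orders from plots like the ones above.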
Statistical tests
Ljung-Box test for zero autocorrelation
Unit root tests (Augmented Dickey-Fuller), used in testing for cointegration
Granger causality
Whiteness (iid-ness) and normality
See our conference paper (when the proceedings get published!)
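As a sketch of what the Ljung-Box test does: the statistic is Q = n(n+2) Σ_{k=1..h} r_k² / (n − k), referred to a χ²_h distribution under the null of no autocorrelation. The helper below is hypothetical and omits the degrees-of-freedom correction applied after fitting an ARMA model:

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(x, nlags):
    # Q = n (n + 2) * sum_k r_k^2 / (n - k); p-value from chi-squared with `nlags` df
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    r = np.array([np.dot(x[:n - k], x[k:]) / denom for k in range(1, nlags + 1)])
    q = n * (n + 2) * np.sum(r ** 2 / (n - np.arange(1, nlags + 1)))
    return q, chi2.sf(q, nlags)

rng = np.random.default_rng(1)
q_wn, p_wn = ljung_box(rng.standard_normal(1000), 10)  # white noise: should not reject
e = rng.standard_normal(1000)
ar = np.zeros(1000)
for t in range(1, 1000):
    ar[t] = 0.6 * ar[t - 1] + e[t]
q_ar, p_ar = ljung_box(ar, 10)                         # autocorrelated: should reject
```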
Autoregressive moving average (ARMA) models
One of the most common univariate time series models:

y_t = μ + a_1 y_{t−1} + ... + a_p y_{t−p} + ε_t + b_1 ε_{t−1} + ... + b_q ε_{t−q}

where E(ε_t ε_s) = 0 for t ≠ s, and ε_t ~ N(0, σ²)
Exact log-likelihood can be evaluated via the Kalman filter, but the
“conditional” likelihood is easier and commonly used
statsmodels has tools for simulating ARMA processes with known coefficients a_i, b_i, and for estimation given specified lag orders
import scikits.statsmodels.tsa.arima_process as ap
ar_coef = [1, .75, -.25]
ma_coef = [1, -.5]
nobs = 100
y = ap.arma_generate_sample(ar_coef, ma_coef, nobs)
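The same process can also be generated by iterating the difference equation directly. Note the convention difference: the hypothetical helper below takes the coefficients a_i, b_i exactly as they appear in the equation above, whereas `arma_generate_sample` takes lag-polynomial coefficients (leading 1, with signs flipped on the AR side):

```python
import numpy as np

def arma_simulate(ar, ma, nobs, rng):
    # y_t = a_1 y_{t-1} + ... + a_p y_{t-p} + e_t + b_1 e_{t-1} + ... + b_q e_{t-q}
    p, q = len(ar), len(ma)
    e = rng.standard_normal(nobs)
    y = np.zeros(nobs)
    for t in range(max(p, q), nobs):
        y[t] = (sum(ar[i] * y[t - 1 - i] for i in range(p)) + e[t]
                + sum(ma[j] * e[t - 1 - j] for j in range(q)))
    return y

rng = np.random.default_rng(2)
y = arma_simulate([0.5], [0.3], 500, rng)    # an ARMA(1,1) sample path
x = y - y.mean()
r1 = np.dot(x[:-1], x[1:]) / np.dot(x, x)    # sample lag-1 autocorrelation
```

For a = 0.5, b = 0.3 the theoretical lag-1 autocorrelation is (1 + ab)(a + b) / (1 + 2ab + b²) ≈ 0.66, which the sample r1 should approximate.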
ARMA Estimation
Several likelihood-based estimators implemented (see docs)

model = tsa.ARMA(y)
result = model.fit(order=(2, 1), trend='c', method='css-mle', disp=-1)
result.params
# array([ 3.97, -0.97, -0.05, -0.13])
Standard model diagnostics, standard errors, information criteria (AIC, BIC, ...), etc. are available in the returned ARMAResults object
Vector Autoregression (VAR) models
Widely used model for modeling multiple (K-variate) time series, especially in macroeconomics:

Y_t = A_1 Y_{t−1} + ... + A_p Y_{t−p} + ε_t,  ε_t ~ N(0, Σ)

where the matrices A_i are K × K.
Y_t must be a stationary process (sometimes achieved by differencing). A related class of models (VECM) handles nonstationary, including cointegrated, processes.
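Conditional on the lag order, the VAR coefficients are just equation-by-equation least squares. A minimal NumPy sketch (hypothetical `fit_var` helper; the `VAR` class additionally provides standard errors, diagnostics, etc.):

```python
import numpy as np

def fit_var(Y, p):
    # Regress Y_t on [1, Y_{t-1}, ..., Y_{t-p}]; returns intercept (K,) and A (p, K, K)
    T, K = Y.shape
    X = np.hstack([np.ones((T - p, 1))] + [Y[p - i:T - i] for i in range(1, p + 1)])
    B, *_ = np.linalg.lstsq(X, Y[p:], rcond=None)
    A = B[1:].reshape(p, K, K).transpose(0, 2, 1)  # A[i][k, j]: effect of var j at lag i+1 on var k
    return B[0], A

# Recover a known bivariate VAR(1) from simulated data
rng = np.random.default_rng(3)
A1_true = np.array([[0.5, 0.1],
                    [0.0, 0.3]])
Y = np.zeros((4000, 2))
for t in range(1, 4000):
    Y[t] = A1_true @ Y[t - 1] + 0.1 * rng.standard_normal(2)
c, A = fit_var(Y, 1)
err = np.abs(A[0] - A1_true).max()
```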
Vector Autoregression (VAR) models
>>> model = VAR(data)
>>> model.select_order(8)
VAR Order Selection
=====================================================
aic bic fpe hqic
---
0 -27.83 -27.78 8.214e-13 -27.81
1 -28.77 -28.57 3.189e-13 -28.69
2 -29.00 -28.64* 2.556e-13 -28.85
3 -29.10 -28.60 2.304e-13 -28.90*
4 -29.09 -28.43 2.330e-13 -28.82
5 -29.13 -28.33 2.228e-13 -28.81
6 -29.14* -28.18 2.213e-13* -28.75
7 -29.07 -27.96 2.387e-13 -28.62
=====================================================
Vector Autoregression (VAR) models
>>> result = model.fit(2)
>>> result.summary() # print summary for each variable
<snip>
Results for equation m1
====================================================
            coefficient    std. error    t-stat    prob
----------------------------------------------------
const          0.004968      0.001850     2.685   0.008
L1.m1          0.363636      0.071307     5.100   0.000
L1.realgdp    -0.077460      0.092975    -0.833   0.406
L1.cpi        -0.052387      0.128161    -0.409   0.683
L2.m1          0.250589      0.072050     3.478   0.001
L2.realgdp    -0.085874      0.092032    -0.933   0.352
L2.cpi         0.169803      0.128376     1.323   0.188
====================================================
Vector Autoregression (VAR) models
>>> result = model.fit(2)
>>> result.summary() # print summary for each variable
<snip>
Correlation matrix of residuals
             m1   realgdp       cpi
m1       1.000000 -0.055690 -0.297494
realgdp -0.055690  1.000000  0.115597
cpi     -0.297494  0.115597  1.000000
VAR: Impulse Response analysis
Analyze the systematic impact of a unit “shock” to a single variable

irf = result.irf(10)
irf.plot()
[Figure: “Impulse responses” grid of panels m1→m1, realgdp→m1, cpi→m1, m1→realgdp, realgdp→realgdp, cpi→realgdp, m1→cpi, realgdp→cpi, cpi→cpi, horizon 0–10]
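The non-orthogonalized impulse responses are the coefficients Ψ_i of the VAR's moving-average representation, computed by the recursion Ψ_0 = I, Ψ_i = Σ_{j=1..min(i,p)} Ψ_{i−j} A_j (`result.irf` additionally orthogonalizes the shocks using the residual covariance Σ). A sketch with a hypothetical helper:

```python
import numpy as np

def ma_rep(A, horizon):
    # Psi_0 = I; Psi_i = sum_{j=1..min(i,p)} Psi_{i-j} @ A_j
    p, K, _ = A.shape
    psi = [np.eye(K)]
    for i in range(1, horizon + 1):
        psi.append(sum(psi[i - j] @ A[j - 1] for j in range(1, min(i, p) + 1)))
    return np.array(psi)

A = np.array([[[0.5, 0.1],
               [0.0, 0.3]]])    # a VAR(1), where Psi_i reduces to A1^i
psi = ma_rep(A, 10)
err = np.abs(psi[3] - np.linalg.matrix_power(A[0], 3)).max()
```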
VAR: Forecast Error Variance Decomposition
Analyze the contribution of each variable to forecasting error

fevd = result.fevd(20)
fevd.plot()
[Figure: “Forecast error variance decomposition (FEVD)” panels for m1, realgdp, and cpi, horizon 0–20]
VAR: Statistical tests
In [137]: result.test_causality('m1', ['cpi', 'realgdp'])
Granger causality f-test
=========================================================
Test statistic Critical Value p-value df
---
1.248787 2.387325 0.289 (4, 579)
=========================================================
H_0: ['cpi', 'realgdp'] do not Granger-cause m1
Conclusion: fail to reject H_0 at 5.00% significance level
Filtering
The Hodrick-Prescott (HP) filter separates a time series y_t into a trend τ_t and a cyclical component ζ_t, so that y_t = τ_t + ζ_t.
[Figure: “Inflation” with HP-filtered trend and cyclical components]
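The HP trend solves a penalized least-squares problem: minimize Σ(y_t − τ_t)² + λ Σ(Δ²τ_t)². Setting the gradient to zero gives the linear system (I + λ DᵀD) τ = y, with D the second-difference operator. A sparse-matrix sketch (hypothetical helper; λ = 1600 is the conventional choice for quarterly data):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def hp_filter(y, lamb=1600.0):
    # Trend solves (I + lamb * D'D) tau = y; D is the (n-2) x n second-difference operator
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    trend = spsolve((sparse.eye(n) + lamb * (D.T @ D)).tocsc(), y)
    return trend, y - trend

# A pure linear series has zero second differences, so it comes back as pure trend
t = np.arange(100, dtype=float)
trend, cycle = hp_filter(2.0 + 0.5 * t)
max_cycle = np.abs(cycle).max()
```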
Filtering
In addition to the HP filter, two other filters popular in finance and economics, Baxter-King and Christiano-Fitzgerald, are available. We refer you to our paper and the documentation for details on these:
[Figures: “Inflation and Unemployment: BK Filtered” and “Inflation and Unemployment: CF Filtered” (INFL, UNEMP series, 1960s–2008)]
Preview: Bayesian dynamic linear models (DLM)
A state space model by another name:
y_t = F_t′ θ_t + ν_t,  ν_t ~ N(0, V_t)
θ_t = G θ_{t−1} + ω_t,  ω_t ~ N(0, W_t)

Estimation of the basic model by Kalman filter recursions. Provides an elegant way to do time-varying linear regressions for forecasting.
Extensions: multivariate DLMs, stochastic volatility (SV) models, MCMC-based posterior sampling, mixtures of DLMs
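For the simplest case, the local-level DLM (F_t = 1, G = 1, constant variances V and W), the Kalman filter recursions fit in a few lines; the helper below is a hypothetical sketch, not the DLM API shown on the next slide:

```python
import numpy as np

def kalman_local_level(y, V, W, m0=0.0, C0=1e6):
    # Filtered mean m_t and variance C_t for:
    # y_t = theta_t + nu_t, nu_t ~ N(0, V); theta_t = theta_{t-1} + w_t, w_t ~ N(0, W)
    m, C = m0, C0
    ms, Cs = [], []
    for obs in y:
        a, R = m, C + W          # predict: prior for theta_t
        Q = R + V                # one-step forecast variance
        K = R / Q                # Kalman gain
        m = a + K * (obs - a)    # update with observation
        C = R - K * R
        ms.append(m)
        Cs.append(C)
    return np.array(ms), np.array(Cs)

# Track a latent random walk observed in noise
rng = np.random.default_rng(4)
theta = np.cumsum(0.1 * rng.standard_normal(300))
y = theta + 0.5 * rng.standard_normal(300)
m, C = kalman_local_level(y, V=0.25, W=0.01)
mse_filter = np.mean((m - theta) ** 2)
mse_raw = np.mean((y - theta) ** 2)
```

The filtered variance C_t settles to a steady state regardless of the diffuse prior C0, and the filtered mean tracks the latent state far better than the raw observations do.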
Preview: DLM Example (Constant+Trend model)
model = Polynomial(2)
dlm = DLM(close_px['AAPL'], model.F, G=model.G,  # model
          m0=m0, C0=C0, n0=n0, s0=s0,            # priors
          state_discount=.95)                    # discount factor
[Figure: “Constant + Trend DLM” fit to AAPL prices]
Preview: Stochastic volatility models
[Figure: “JPY-USD Exchange Rate Volatility Process”]
Future: sandbox and beyond
ARCH / GARCH models for volatility
Structural VAR and error correction models (ECM) for cointegrated processes
Models with non-normally distributed errors
Better data description, visualization, and interactive research tools
More sophisticated Bayesian time series models
Conclusions
We’ve implemented many foundational models for time series analysis, but the field is very broad
User interface can and should be much improved
Repo: http://github.com/statsmodels/statsmodels
Docs: http://statsmodels.sourceforge.net
Contact: pystatsmodels@googlegroups.com