Principle of Microarray data analysis

(1)

Principle of Microarray Data

Analysis

Date: 2012.10.16

Speaker: Chien-Yi, Tung [email protected]

http://lucas.genome-analyst.org/activities-

articles/course/nrpb20121016principleofmicroarraydataanalysis

(2)

 ^Hardware:

 CPU: at least P4

 4Gb RAM in 64-bite OS is strongly recommended!

 With great patience and expert’s help, Gb in -bit OS is OK.

 ^Software:

 JAVA environment

 http://www.java.com/zh_TW/download/manual.jsp

 R GUI environment

 http://cran.csie.ntu.edu.tw/

 MultiExperimentViewer

 http://sourceforge.net/projects/mev-tm4/files/latest/download

System requirement

(3)

R is a free software environment for statistical computing and

graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.(http://www.r-project.org/)

The R Project for Statistical Computing

(4)

 ^Download

 Window: (http://cran.csie.ntu.edu.tw/bin/windows/base/)

 Mac OS X: (http://cran.csie.ntu.edu.tw/bin/macosx/)

 Linux: (http://cran.csie.ntu.edu.tw/bin/linux/)

 Install R environment

 ^Window:

 Execute R-2.15.1-win.exe [Software\R]

 Install into C:\R\R_2.15.1 not C:\program files\R\R_2.15.1

 ^{Mac OS X}

 Install R-2.11.0 pkg [Software\R]

 Install tktcl.pkg [Software\R]

 Install R-2.15.1-signed.pkg [Software\R]

Installation instruction of R

environment

(5)

R GUI for window

(6)

R language

(7)

 ^Using R console to Install basic package from bioconductor

source("http://bioconductor.org/biocLite.R")

biocLite() ,Optional step, It will take long time…

 Install package for array analysis

biocLite("affy")

biocLite("tkWidgets")

 Download LazyPack ⁽https://sites.google.com/site/lucastproject/tools/lucaslazypack₎

install.packages(file.choose(),type="source", repos=NULL) library(lucasLazyPack)

lz.GUI()

Install LazyPack

(8)

 Download JAVA [Software\JAVA]

(http://www.java.com/zh_TW/download/manual.jsp)

 Install 32 bit JAVA

Window

 jre-7u7-windows-i586.exe MacOS X

 jre-7u7-macosx-x64.dmg

Installation of JAVA environment

(9)

 Download JAVA3D [Software\JAVA3D]

(http://java3d.java.net/binary-builds.html)

 Insall 32 bit for MeV

Window

 j3d-1_5_2-windows-i586 Mac OSX

 J3d-1_5_2-macosx.zip

 Unzip and Copy /lib/ext/*.jar to

 /system/lib/java/extensions

 ^JOGL

Installation of JAVA 3D environment

(10)

MeV is a desktop application for the analysis, visualization and data- mining of large-scale genomic data. (http://www.tm4.org/mev/)

MultiExperiment Viewer

(11)

 http://sourceforge.net/projects/mev-tm4/files/mev-tm4/MeV 4.8.1/

Download MeV

(12)

 Required JAVA runtime 32-bit

 Required JAVA 3D API(32-bit) for PCA 3D plot

 Download MeV(4.8) and Unzip in your Hard Drive (C: or D:)

 Run TMEV.bat.[Software/MeV]

!problem of JAVA version! For EzMeV_win version

 Run TMEV2.bat

MeV Installation for Window

(13)

Installation guide line

Software_Window

 R/R-2.15.1-win.exe

 Install in C:\R\R-2.15.1

 R package(affy,tkWidgets)

 R package(lucasLazyPack)

 JAVA/jre-7u7-window-i586.exe

 JAVA3D/j3d-1_5_2-windows-i586.exe

 MeV/MeV_4_8_1_r2727_Win.zip

Software_Mac

 R/tcltk.pkg

 R/R-2.11.0.pkg

 R package(affy,tkWidgets)

 R package(lucasLazyPack)

 JAVA/jre-7u7-macosx-x64.dmg

 JAVA3D/j3d-1_5_2-macosx.zip

 MeV/MeV_4_8_1_r2727_mac.gz

(14)

 NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/)

Get Free Data set from GEO

(15)

Exercise I

Get data set from NCBI GEO

(16)

Search data set of Macrophage and monocyte…

Get Demo data set (GSE11540)

URL: (http://www.ncbi.nlm.nih.gov/geo/)

Comparison of monocytes and macrophages to characterize the

differential gene expression profiles of human monocytes and monocyte- derived-macrophages. Samples obtained from patients with acute

coronary syndrome.

This SuperSeries is composed of the following subset Series:

 GSE10213 (Affymetrix)

 GSE11430 (Illumina microarray)

Exercise I: Get Microarray Data

(17)

Demo cases

Normalized Data Matrix

Raw Data [Data\GSE11540]

(18)

Structure of SeriesMatrixFiles

(19)

 Data preprocessing

 Normalization

 Probe summarization

 Signal adjustment flag, flooring…

 Signal transform (fold change, log transform)

 Data QC/QA

 QC index (Hybridization control)

 Visualization (PCA, MDS, Box plot..)

 Find differentially expressed genes

 Statistical analysis

 p-Value correction

 Biological interpretation

 Gene annotation

 Gene set analysis

 Gene network analysis

General Workflow of Data Analysis

Array profiles

A few D.E. genes

Large-scaled screening

Literature studies Biomarkers

Cellular Functions

Genes

(20)

Single or Dual Color System

Gene 1 Gene 2 Gene 3 Gene 4

Sample 2 Sample 1

Sample 3 Sample 2 Sample 1

(21)

 Spotting array system

 Probe flag

 Image process

 Background correction

 R/G Normalization(lowess)

 Signal flooring

 Convert to signal ratio.

 Log Transform

 Affymetrix system

 Image process

 Summarization

 Normalization(quantile)

 Log Transform

 Signal flooring

From signal to expression level..

One gene

(22)

Experimental Errors and Batch Effects

Gene Test sample Control

Gene 1 1230 122

Gene 2 323 25

Gene 3 32 3

Gene 4 541 32

…. 37 37

…. …. ….

Gene 3 have 10 fold up-regulated in test sample?!

(23)

Gene A Gene B Gene C Gene D

mRNA conc.2 Labeling hybridization Probe Affinity

Where is variation come from?

Intensity = F (mRNA conc, E_labeling, Ehybridization^{, Probe}affinity ..⁾

Batch effect Unpredictable Signal

(24)

Dual color system

Intensity = F (mRNA conc, E_labeling, Ehybridization^{, Probe}affinity⁾

mRNA Conc. Labeling hybridization Probe Affinity Should be equal between R/G

Or Validate by dye-swap

The same

Gene A ratio Gene A ratio

Probe A in Array 2 Probe A in Array 1

(25)

 Dual color system

 Probe flag

 Signal flooring

 Log Transform

 Single color system

 Summarization

 Log Transform

 Signal flooring

Data preprocessing - Normalization

(26)

Gene A Gene B Gene C Gene D

mRNA conc.2 Labeling hybridization Probe Affinity

Single Color system

Intensity = F (mRNA conc, E

_labeling,

E

hybridization

^{, Probe}

affinity ..

⁾

Signal Should be equal between chips

(27)

 Assumption

 Assume the signal of global or a subset genes is similar among each samples.

 Assume the distribution of signal value is normal distribution.

 ^Method

 ^Algorithm

 Mean shift

 Regression: Linear, Lowess, Print-tips Lowess

 ^Quantile

 Basal line

 Whole, Spike in, house-keeping gene

 Global and local (print tips)

Normalization

(28)

Quantile Normalization

A B C

1 1 7 0 1 1 3 2 1 1 9 0 2 1 9 3 2 9 0 2 1 5 7 3 6 3 3 1 0 8 3 1 9 4 4 1 3 3 4 2 4 4 1 9 0 5 1 9 8 5 1 0 0 5 1 7 8 6 1 1 0 6 9 1 6 1 1 7

A B C

3 6 8 4 6 8 6 6 8 6 1 1 9 2 1 1 9 2 1 1 9 4 1 3 4 6 1 3 4 5 1 3 4 1 1 5 3 5 1 5 3 1 1 5 3 2 1 6 4 3 1 6 4 4 1 6 4 5 1 7 5 1 1 7 5 3 1 7 5

A B C

1 1 5 3 1 1 7 5 1 1 5 3 2 1 6 4 2 1 1 9 2 1 1 9 3 6 8 3 1 6 4 3 1 7 5 4 1 3 4 4 6 8 4 1 6 4 5 1 7 5 5 1 5 3 5 1 3 4 6 1 1 9 6 1 3 4 6 6 8

A B C Mean

3 63 4 24 6 117 = 68 6 110 2 90 2 157 = 119 4 133 6 91 5 178 = 134 1 170 5 100 1 190 = 153.3 2 193 3 108 4 190 = 163.7 5 198 1 132 3 194 = 174.7

Raw data Rank by Signal

Replace signal by Row mean Return original rank That’s why the

distribution must be equal!

(29)

 Dual color system

 Probe flag

 Signal flooring

 Log Transform

 Single color system

 Summarization

 Log Transform

 Signal flooring

Data preprocessing - Normalization

RMA

(30)

Exercise II

Import and Normalized raw data of Affyemtrix array (.CEL)

(31)

1. Run R (for EzMeV version: RGUI.bat)

2. Command:

1. >Library(lucasLazyPack) 2. >lz.GUI()

Exercise II: Data normalization

(Affymetrix .CEL file)

(32)

1. Step:

1. >Click Load .CEL

2. >Select .CEL file (Data\GSE11524\GSE10213) 3. >Set normalization method

Exercise II: Data normalization

(33)

 Save as GSE10213_RMA.TXT

QC report

(34)

Signal flooring and log

transform

Gene Test sample Control Ratio

Gene 1 150 122 1.23

Gene 2 323 25 12.8

Gene 3 242 0.5 484

Gene 4 541 230 2.35

Gene 5 0.1 25 250

…. …. ….

Gene 3 is the most up-regulated gene!? (484 fold)

(35)

 Dual color system

 Probe flag

 R/G Normalization

 Signal flooring

 Convert to signal ratio

 Log Transform

 Single color system

 Summarization

 Normalization

 Log transform

 Signal flooring

Data preprocessing – Signal flooring

(36)

 Normalization

 Data QC/QA

 Gene annotation

General Workflow of Data Analysis

Array profiles

A few D.E. genes

Cellular Functions

Genes

(37)

Exercise III

Sample reorder Signal flooring

(38)

1. Load Data\GSE11540\GSE10213_RMA.txt 2. Click Reorder

3. Set Order of sample 4. Close dialog

5. Save GSE10213_RMA_Reorder.txt

Reorder Data Matrix (GSE10213)

!Sample_geo_acc!Sample_source_name_ch1 GSM257664 whole blood, monocyte_16 GSM257665 whole blood, macrophage_16 GSM257666 whole blood, monocyte_20 GSM257667 whole blood, macrophage_20 GSM257668 whole blood, monocyte_21 GSM257669 whole blood, macrophage_21 GSM257670 whole blood, monocyte_26 GSM257671 whole blood, macrophage_26 GSM257672 whole blood, monocyte_28 GSM257673 whole blood, macrophage_28

(39)

1. Load Data\GSE11540\GSE11540-GPL6097_series_matrix.txt 2. Click Reorder

3. Set Order of sample 4. Close dialog

5. Save GSE11430_SM_Reorder.txt

Reorder Data Matrix (GSE11430)

!Sample_geo_acc!Sample_source_name_ch1 GSM257770 whole blood, monocyte_16 GSM257793 whole blood, macrophage_16 GSM257794 whole blood, monocyte_20 GSM257795 whole blood, macrophage_20 GSM257796 whole blood, monocyte_21 GSM257797 whole blood, macrophage_21 GSM257798 whole blood, monocyte_26 GSM257799 whole blood, macrophage_26 GSM257801 whole blood, monocyte_28 GSM257805 whole blood, macrophage_28

(40)

 Load DataMatrix

 Reorder RMA result(GSE10213_RMA_Reorder.txt)

 ^Flooring

 Save as GSE10213_RMA_Reorder_F4.txt

Exercise III: Data processing

(41)

 Normalization

 Data QC/QA

 Gene annotation

General Workflow of Data Analysis

Array profiles

A few D.E. genes

Cellular Functions

Genes

(42)

Microarray Data

Quality Control and Assessment

Is hybridization reaction OK? Is signal comparable?

Is expected variations observed?

(43)

 Spike-in control

 Agilent system

 Affymetrix system

Is hybridization reaction OK?

(44)

 Data distribution

 ^Histogram

 ^{Box plot}

Is signal comparable?

(45)

Intra-group similarity << inter-group similarity

 Distance measurement

 ^Euclidean

 ^Manhattan

 Pearson dissimilarity

 Visualization

 Hierarchical CLustering

 Dimension reduction

 Principle Component Analysis

 Matric MultiDimensional Scaling

Is expected variations observed?

http://www.tm4.org/mev_manual/app3.html

(46)

Exercise IV

Data QC and visualization

(47)

 Load DataMatrix

 SeriesMatrix (GSE11430_SM_Reorder.txt)

 ^Boxplot

Exercise IV: Data QC

(48)

 Load DataMatrix

 SeriesMatrix (GSE11430_SM_Reorder.txt)

 Similarity Matrix

Exercise IV: Data QC

(49)

 Load DataMatrix

 ^MDS

Exercise IV: Data QC

(50)

Data quality is the first and the most

important issue in microarray analysis!

Is hybridization reaction OK? Is array signal comparable?

Is expected variations observed?

(51)

 Where is experiment variation?

 Signal: treatment or not.

 Noise: batch, gender, age, cell cycle, growth environment…..

 System errors

 Technique repeats

 Biological repeats

 How many repeats?

Problem from Experiment Design

Good Fat Normal

Batch Age Batch Age

1 1 6

2 2 9

3 3 18

4 1 15

5 2 8

6 3 10

Bad Fat Normal

Batch Age Batch Age

1 1 18

2 1 24

3 1 18

4 3 6

5 3 8

6 3 6

(52)

 Normalization

 Data QC/QA

 Gene annotation

General Workflow of Data Analysis

Array profiles

A few D.E. genes

Cellular Functions

Genes

(53)

 Statistics

 Two samples (ratio)

 One class sample

 Two sample pools

 ^t-test Welch’s t test

 Limma (a modified t-test)

 SAM (based on permutation)

 >2 group (1-way ANOVA)

 Pair comparison (pair t-test)

 Multi-factor analysis (n-way ANOVA)

 Time course

 Select D.E. genes

 Multiple test correction

 p-value correction

 Filtration

 Fold changes-cut-off

 p-Value cut-off

 Volcano plot

Find differentially expressed genes

1 2 3 4 5

(54)

Test Variance modeling Reference Package R Welch’s T-test - Fixed variance

- Heterodasticity ^Welch CIT (internal)

ANOVA - Fixed variance

- homoscedasticity ^Fisher CIT (internal)

Wilcoxon Non parametric _Wilcoxon _stats

SAM Non parametric Tusher et al

2001 ^samr

RVM Inverse gamma distribution on the variance (estimated from all the data set)

Wright & Simon

2003 CIT (internal) Limma

Moderate t-test. Usual variance replaced by a conditional variance. Bayesian

approach

Smyth

2004 ^limma

VARMIXT Gamma mixture model on the variance Delmar et al

2005 ^varmixt

SMVAR Mixed model (fixed condition effect and random gene effect)

Jaffrézic et al

2007 ^SMVar

56

(55)

Type I error rate Power

Stability Ease of use Calculation time Small

sample size

Large sample

size

Small sample

size

Large sample

size

t-test + +++ + +++ +++ +++ +++

anova +++ +++ + +++ +++ +++ +++

wilcoxon + + + ++ ++ +++ ++

SAM +++ +++ + ++ ++ ++ ++

RVM + + +++ +++ ++ + +

Limma +++ +++ +++ +++ +++ ++ +++

VarMixt +++ +++ +++ +++ + + +

SMVar + + ++ +++ + ++ ₅₇ +++

Summary table

Use

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2933223/?tool=pubmed

(56)

 Multiple test issue

 About P < 0.05?

 Multiple-test p-value correction

 False positive discovery rate of p-value

 Validation by different methodology

 Northern blot

 Real-time QPCR

Is gene expression variation true?

(57)

Exercise V

Statistical analysis

(58)

Introduction of MeV interface

Start:

1. TMEV.bat (window) 2. TMEV.sh (Linux)

3. TMEV_MacOSX_4_X (Mac)

Major function:

1. Clustering 2. Statistics 3. Classification 4. Data Reduction 5. Meta Analysis 6. Visualization

(59)

Load Data for Affymetrix RMA file [Select File Loader]

[Affymatrix files] [RMA files]

Select GSE _RMA_Reorder.txt

Download Annotation (Require internet connection) [Homo sapiens]

[hgu133plus2]

[Automatically download] Press [Load]

Import Affymetrix data to MeV

(60)

Load Data for illumina human Bead array V1

[Select File Loader]

[Tab Delimited Multiple sample] Select GSE _SM_Reorder.txt

Download Annotation (Require internet connection) [Homo sapiens]

[illuminaHumanv1BeadID] [Automatically download] Press [Load]

Import illumina array data to MeV

(61)

[Display]>[Gene/Row Labels]>[label by ??]

Setup gene label

(62)

1. [Cluster Manager]/[Sample Cluster]

2. Select all _ma sample and [Store Rows as Cluster]

=> Marcophage(Green)

3. Select all -mo sample and [Store Rows as Cluster]

=> Monocyte(Red)

Setup sample cluster

(63)

T-test

Equal sample sizes, equal variance

Unequal sample sizes, equal variance

Unequal sample sizes, unequal variance

(64)

 Welch’s t-test (unequal variation)

1. [Analysis]>[Statistics]>[TTEST]

2. Setup:

1. Between Subjects

2. Macrophage: Group 1, Monocyte Group 2

3. Welch approximation(unequal group variance) 4. Overall alpha: 0.01

5. p-Value based on t-distribution 6. Adjusted Bonferroni correction

7. Construct Hierarchical tree for Significant genes only

Exercise V: Welch’s t-test

(65)

Assume: k gene in a chip.

Experiment-wise significant level: αe Comparison-wise significance level: αc Bonferroni correction

K independent test, so all α is equal. so overall number of type ) errors = K x α αe = αc/K => αc = αe x K

Corrected p = p x K

Dunn-Sidak

αe = 1-(1-αc) ^K => αc = 1-(1-αe) ^1/k Corrected p = 1-(1-p) ^1/k

Bonferroni Step-down (Holm) Sort p-values (ascending)

1st. corrected p-value = p-value x n(gene number) 2nd. corrected p-value = p-value x n-1

3rd. corrected p-value = p-value x n-2

…

nth. corrected p-value = p-value x 1

Multiple test correction

(66)

Family-wise error rate (FWER) is the probability of making one or more false discoveries. (p-value of p- value)

Step up False Discovery Rate p-Value sort ascending

1st. Corrected p = p x n/(n-0) 2nd. Corrected p = p x n/(n-1) 3rd. Corrected p = p x n/(n-2) ...

nth. Corrected p = p x n /1

Step down False Discovery Rate p-Value sort descending

1st. Corrected p = p x n/(n) 2nd. Corrected p = p x n/(n+1-2) 3rd. Corrected p = p x n/(n+1-3) ...

nth. Corrected p = p x n /1 q-Value

False positive discovery (FDR)

(67)

Significant genes

# of Significant Genes: 61

% of Genes that are Signficant: 0%

Non-significant genes

# of non-significant Genes: 54614

% of Genes that are not signficant: 100%

Result of Welch’s t test

(68)

Gene View

(69)

Volcano Plot

(70)

Volcano plot

Pos.Only >> Store selected gene as Cluster [Up] (red) Neg.Only >> Store selected gene as Cluster [Dn] (Green)

(71)

Gene cluster

(72)

 Paired t-test (unequal variation)

1. [Analysis]>[Statistics]>[TTEST]

2. Setup:

1. Paired

2. Select paried samples

3. Welch approximation(unequal group variance) 4. Overall alpha: 0.01

5. p-Value based on t-distribution 6. Adjusted Bonferroni correction

Exercise V: Paired t-test

(73)

Result of Paired t test

Significant genes

% of Genes that are Signficant: 0% Non-significant genes

(74)

 ^Limma

1. [Analysis]>[Statistics]>[LIMMA]

2. Setup:

1. Two Class

2. Significance Level: Alpha = .000001

Exercise V: LIMMA

Significant genes # of Significant Genes: 399

Non-significant genes # of non-significant Genes: 54276

(75)

Venn Diagram

(76)

Analysis of Variance (ANOVA)

GSE11324

MCF7 cells were stimulated with 100 nM estrogen for 0, 3, 6, or 12h. All experiments were performed in triplicate.

(77)

ANOVA

Data pre-procession (GSE11324.anl)

1. Use LucasLazyPack create GSE11324_RMA.txt 2. Import GSE11324_RMA.txt to MeV

3. Setup sample cluster (0,3,6,12hr) Statistics

1. [Analysis]>[Statistics]>[ANOVA] 2. Number of groups: 4

3. p-Value based on F-distribution 4. Overall alpha: 0.01

5. Adjusted Bonferroni correction

Exercise V: ANOVA

Significant genes

Non-significant genes

(78)

1. Get Data set for NCBI GEO

1. GSE11540( included GSE10213,GSE11430) 2. Read sample annotation of GEO data set

2. Pre-process raw data by R (Affymetrix .cel files)

1. Use GSE10213

2. RMA normalization 3. Signal flooring 4. Sample reordering

3. Data QC by R

1. Use GSE10213/GSE11430 2. Box plot

3. Similary matrix 4. MDS plot

4. Statistical analysis by MeV

1. Import Data to MeV (GSE10213/GSE11430 ) 2. PCA plot (GSE10213)

3. class comparison of GSE Welch’s t test, limma) 4. Pair sample comparison GSE10213 (Paired t test)

5. Multiple class comparison GSE11324(ANOVA)

Guide line of course exercise.

(79)

Principle of Microarray data analysis