Principle of Microarray Data
Analysis
Date: 2012.10.16
Speaker: Chien-Yi, Tung [email protected]
http://lucas.genome-analyst.org/activities-
articles/course/nrpb20121016principleofmicroarraydataanalysis
Hardware:
CPU: at least P4
4Gb RAM in 64-bite OS is strongly recommended!
With great patience and expert’s help, Gb in -bit OS is OK.
Software:
JAVA environment
http://www.java.com/zh_TW/download/manual.jsp
R GUI environment
http://cran.csie.ntu.edu.tw/
MultiExperimentViewer
http://sourceforge.net/projects/mev-tm4/files/latest/download
System requirement
R is a free software environment for statistical computing and
graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.(http://www.r-project.org/)
The R Project for Statistical Computing
Download
Window: (http://cran.csie.ntu.edu.tw/bin/windows/base/)
Mac OS X: (http://cran.csie.ntu.edu.tw/bin/macosx/)
Linux: (http://cran.csie.ntu.edu.tw/bin/linux/)
Install R environment
Window:
Execute R-2.15.1-win.exe [Software\R]
Install into C:\R\R_2.15.1 not C:\program files\R\R_2.15.1
Mac OS X
Install R-2.11.0 pkg [Software\R]
Install tktcl.pkg [Software\R]
Install R-2.15.1-signed.pkg [Software\R]
Installation instruction of R
environment
R GUI for window
R language
Using R console to Install basic package from bioconductor
source("http://bioconductor.org/biocLite.R")
biocLite() ,Optional step, It will take long time…
Install package for array analysis
biocLite("affy")
biocLite("tkWidgets")
Download LazyPack (https://sites.google.com/site/lucastproject/tools/lucaslazypack)
install.packages(file.choose(),type="source", repos=NULL) library(lucasLazyPack)
lz.GUI()
Install LazyPack
Download JAVA [Software\JAVA]
(http://www.java.com/zh_TW/download/manual.jsp)
Install 32 bit JAVA
Window
jre-7u7-windows-i586.exe MacOS X
jre-7u7-macosx-x64.dmg
Installation of JAVA environment
Download JAVA3D [Software\JAVA3D]
(http://java3d.java.net/binary-builds.html)
Insall 32 bit for MeV
Window
j3d-1_5_2-windows-i586 Mac OSX
J3d-1_5_2-macosx.zip
Unzip and Copy /lib/ext/*.jar to
/system/lib/java/extensions
JOGL
Installation of JAVA 3D environment
MeV is a desktop application for the analysis, visualization and data- mining of large-scale genomic data. (http://www.tm4.org/mev/)
MultiExperiment Viewer
http://sourceforge.net/projects/mev-tm4/files/mev-tm4/MeV 4.8.1/
Download MeV
Required JAVA runtime 32-bit
Required JAVA 3D API(32-bit) for PCA 3D plot
Download MeV(4.8) and Unzip in your Hard Drive (C: or D:)
Run TMEV.bat.[Software/MeV]
!problem of JAVA version! For EzMeV_win version
Run TMEV2.bat
MeV Installation for Window
Installation guide line
Software_Window
R/R-2.15.1-win.exe
Install in C:\R\R-2.15.1
R package(affy,tkWidgets)
R package(lucasLazyPack)
JAVA/jre-7u7-window-i586.exe
JAVA3D/j3d-1_5_2-windows-i586.exe
MeV/MeV_4_8_1_r2727_Win.zip
Software_Mac
R/tcltk.pkg
R/R-2.11.0.pkg
R package(affy,tkWidgets)
R package(lucasLazyPack)
JAVA/jre-7u7-macosx-x64.dmg
JAVA3D/j3d-1_5_2-macosx.zip
MeV/MeV_4_8_1_r2727_mac.gz
NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/)
Get Free Data set from GEO
Exercise I
Get data set from NCBI GEO
Search data set of Macrophage and monocyte…
Get Demo data set (GSE11540)
URL: (http://www.ncbi.nlm.nih.gov/geo/)
Comparison of monocytes and macrophages to characterize the
differential gene expression profiles of human monocytes and monocyte- derived-macrophages. Samples obtained from patients with acute
coronary syndrome.
This SuperSeries is composed of the following subset Series:
GSE10213 (Affymetrix)
GSE11430 (Illumina microarray)
Exercise I: Get Microarray Data
Demo cases
Normalized Data Matrix
Raw Data [Data\GSE11540]
Structure of SeriesMatrixFiles
Data preprocessing
Normalization
Probe summarization
Signal adjustment flag, flooring…
Signal transform (fold change, log transform)
Data QC/QA
QC index (Hybridization control)
Visualization (PCA, MDS, Box plot..)
Find differentially expressed genes
Statistical analysis
p-Value correction
Biological interpretation
Gene annotation
Gene set analysis
Gene network analysis
General Workflow of Data Analysis
Array profiles
A few D.E. genes
Large-scaled screening
Literature studies Biomarkers
Cellular Functions
Genes
Single or Dual Color System
Gene 1 Gene 2 Gene 3 Gene 4
Sample 2 Sample 1
Gene 1 Gene 2 Gene 3 Gene 4
Sample 3 Sample 2 Sample 1
Gene 1 Gene 2 Gene 3 Gene 4
Gene 1 Gene 2 Gene 3 Gene 4
Spotting array system
Probe flag
Image process
Background correction
R/G Normalization(lowess)
Signal flooring
Convert to signal ratio.
Log Transform
Affymetrix system
Image process
Background correction
Summarization
Normalization(quantile)
Log Transform
Signal flooring
From signal to expression level..
One gene
One gene
Experimental Errors and Batch Effects
Gene Test sample Control
Gene 1 1230 122
Gene 2 323 25
Gene 3 32 3
Gene 4 541 32
…. 37 37
…. …. ….
Gene 3 have 10 fold up-regulated in test sample?!
Gene A Gene B Gene C Gene D
mRNA conc.2 Labeling hybridization Probe Affinity
Where is variation come from?
Intensity = F (mRNA conc, Elabeling, Ehybridization , Probeaffinity ..)
Batch effect Unpredictable Signal
Dual color system
Intensity = F (mRNA conc, Elabeling, Ehybridization , Probeaffinity)
mRNA Conc. Labeling hybridization Probe Affinity Should be equal between R/G
Or Validate by dye-swap
The same
Gene A ratio Gene A ratio
Probe A in Array 2 Probe A in Array 1
Dual color system
Probe flag
R/G Normalization(lowess)
Signal flooring
Convert to signal ratio.
Log Transform
Single color system
Background correction
Summarization
Normalization(quantile)
Log Transform
Signal flooring
Data preprocessing - Normalization
Gene A Gene B Gene C Gene D
mRNA conc.2 Labeling hybridization Probe Affinity
Single Color system
Intensity = F (mRNA conc, E
labeling,E
hybridization, Probe
affinity ..)
Signal Should be equal between chips
Assumption
Assume the signal of global or a subset genes is similar among each samples.
Assume the distribution of signal value is normal distribution.
Method
Algorithm
Mean shift
Regression: Linear, Lowess, Print-tips Lowess
Quantile
Basal line
Whole, Spike in, house-keeping gene
Global and local (print tips)
Normalization
Quantile Normalization
A B C
1 1 7 0 1 1 3 2 1 1 9 0 2 1 9 3 2 9 0 2 1 5 7 3 6 3 3 1 0 8 3 1 9 4 4 1 3 3 4 2 4 4 1 9 0 5 1 9 8 5 1 0 0 5 1 7 8 6 1 1 0 6 9 1 6 1 1 7
A B C
3 6 8 4 6 8 6 6 8 6 1 1 9 2 1 1 9 2 1 1 9 4 1 3 4 6 1 3 4 5 1 3 4 1 1 5 3 5 1 5 3 1 1 5 3 2 1 6 4 3 1 6 4 4 1 6 4 5 1 7 5 1 1 7 5 3 1 7 5
A B C
1 1 5 3 1 1 7 5 1 1 5 3 2 1 6 4 2 1 1 9 2 1 1 9 3 6 8 3 1 6 4 3 1 7 5 4 1 3 4 4 6 8 4 1 6 4 5 1 7 5 5 1 5 3 5 1 3 4 6 1 1 9 6 1 3 4 6 6 8
A B C Mean
3 63 4 24 6 117 = 68 6 110 2 90 2 157 = 119 4 133 6 91 5 178 = 134 1 170 5 100 1 190 = 153.3 2 193 3 108 4 190 = 163.7 5 198 1 132 3 194 = 174.7
Raw data Rank by Signal
Replace signal by Row mean Return original rank That’s why the
distribution must be equal!
Dual color system
Probe flag
R/G Normalization(lowess)
Signal flooring
Convert to signal ratio.
Log Transform
Single color system
Background correction
Summarization
Normalization(quantile)
Log Transform
Signal flooring
Data preprocessing - Normalization
RMA
Exercise II
Import and Normalized raw data of Affyemtrix array (.CEL)
1. Run R (for EzMeV version: RGUI.bat)
2. Command:
1. >Library(lucasLazyPack) 2. >lz.GUI()
Exercise II: Data normalization
(Affymetrix .CEL file)
1. Step:
1. >Click Load .CEL
2. >Select .CEL file (Data\GSE11524\GSE10213) 3. >Set normalization method
Exercise II: Data normalization
Save as GSE10213_RMA.TXT
QC report
Signal flooring and log
transform
Gene Test sample Control Ratio
Gene 1 150 122 1.23
Gene 2 323 25 12.8
Gene 3 242 0.5 484
Gene 4 541 230 2.35
Gene 5 0.1 25 250
…. …. ….
Gene 3 is the most up-regulated gene!? (484 fold)
Dual color system
Probe flag
R/G Normalization
Signal flooring
Convert to signal ratio
Log Transform
Single color system
Background correction
Summarization
Normalization
Log transform
Signal flooring
Data preprocessing – Signal flooring
Data preprocessing
Normalization
Probe summarization
Signal adjustment flag, flooring…
Signal transform (fold change, log transform)
Data QC/QA
QC index (Hybridization control)
Visualization (PCA, MDS, Box plot..)
Find differentially expressed genes
Statistical analysis
p-Value correction
Biological interpretation
Gene annotation
Gene set analysis
Gene network analysis
General Workflow of Data Analysis
Array profiles
A few D.E. genes
Large-scaled screening
Literature studies Biomarkers
Cellular Functions
Genes
Exercise III
Sample reorder Signal flooring
1. Load Data\GSE11540\GSE10213_RMA.txt 2. Click Reorder
3. Set Order of sample 4. Close dialog
5. Save GSE10213_RMA_Reorder.txt
Reorder Data Matrix (GSE10213)
!Sample_geo_acc!Sample_source_name_ch1 GSM257664 whole blood, monocyte_16 GSM257665 whole blood, macrophage_16 GSM257666 whole blood, monocyte_20 GSM257667 whole blood, macrophage_20 GSM257668 whole blood, monocyte_21 GSM257669 whole blood, macrophage_21 GSM257670 whole blood, monocyte_26 GSM257671 whole blood, macrophage_26 GSM257672 whole blood, monocyte_28 GSM257673 whole blood, macrophage_28
1. Load Data\GSE11540\GSE11540-GPL6097_series_matrix.txt 2. Click Reorder
3. Set Order of sample 4. Close dialog
5. Save GSE11430_SM_Reorder.txt
Reorder Data Matrix (GSE11430)
!Sample_geo_acc!Sample_source_name_ch1 GSM257770 whole blood, monocyte_16 GSM257793 whole blood, macrophage_16 GSM257794 whole blood, monocyte_20 GSM257795 whole blood, macrophage_20 GSM257796 whole blood, monocyte_21 GSM257797 whole blood, macrophage_21 GSM257798 whole blood, monocyte_26 GSM257799 whole blood, macrophage_26 GSM257801 whole blood, monocyte_28 GSM257805 whole blood, macrophage_28
Load DataMatrix
Reorder RMA result(GSE10213_RMA_Reorder.txt)
Flooring
Save as GSE10213_RMA_Reorder_F4.txt
Exercise III: Data processing
Data preprocessing
Normalization
Probe summarization
Signal adjustment flag, flooring…
Signal transform (fold change, log transform)
Data QC/QA
QC index (Hybridization control)
Visualization (PCA, MDS, Box plot..)
Find differentially expressed genes
Statistical analysis
p-Value correction
Biological interpretation
Gene annotation
Gene set analysis
Gene network analysis
General Workflow of Data Analysis
Array profiles
A few D.E. genes
Large-scaled screening
Literature studies Biomarkers
Cellular Functions
Genes
Microarray Data
Quality Control and Assessment
Is hybridization reaction OK? Is signal comparable?
Is expected variations observed?
Spike-in control
Agilent system
Affymetrix system
Is hybridization reaction OK?
Data distribution
Histogram
Box plot
Is signal comparable?
Intra-group similarity << inter-group similarity
Distance measurement
Euclidean
Manhattan
Pearson dissimilarity
Visualization
Hierarchical CLustering
Dimension reduction
Principle Component Analysis
Matric MultiDimensional Scaling
Is expected variations observed?
http://www.tm4.org/mev_manual/app3.html
Exercise IV
Data QC and visualization
Load DataMatrix
Reorder RMA result(GSE10213_RMA_Reorder.txt)
SeriesMatrix (GSE11430_SM_Reorder.txt)
Boxplot
Exercise IV: Data QC
Load DataMatrix
Reorder RMA result(GSE10213_RMA_Reorder.txt)
SeriesMatrix (GSE11430_SM_Reorder.txt)
Similarity Matrix
Exercise IV: Data QC
Load DataMatrix
Reorder RMA result(GSE10213_RMA_Reorder.txt)
MDS
Exercise IV: Data QC
Data quality is the first and the most
important issue in microarray analysis!
Is hybridization reaction OK? Is array signal comparable?
Is expected variations observed?
Where is experiment variation?
Signal: treatment or not.
Noise: batch, gender, age, cell cycle, growth environment…..
System errors
Technique repeats
Biological repeats
How many repeats?
Problem from Experiment Design
Good Fat Normal
Batch Age Batch Age
1 1 6
2 2 9
3 3 18
4 1 15
5 2 8
6 3 10
Bad Fat Normal
Batch Age Batch Age
1 1 18
2 1 24
3 1 18
4 3 6
5 3 8
6 3 6
Data preprocessing
Normalization
Probe summarization
Signal adjustment flag, flooring…
Signal transform (fold change, log transform)
Data QC/QA
QC index (Hybridization control)
Visualization (PCA, MDS, Box plot..)
Find differentially expressed genes
Statistical analysis
p-Value correction
Biological interpretation
Gene annotation
Gene set analysis
Gene network analysis
General Workflow of Data Analysis
Array profiles
A few D.E. genes
Large-scaled screening
Literature studies Biomarkers
Cellular Functions
Genes
Statistics
Two samples (ratio)
One class sample
Two sample pools
t-test Welch’s t test
Limma (a modified t-test)
SAM (based on permutation)
>2 group (1-way ANOVA)
Pair comparison (pair t-test)
Multi-factor analysis (n-way ANOVA)
Time course
Select D.E. genes
Multiple test correction
p-value correction
Filtration
Fold changes-cut-off
p-Value cut-off
Volcano plot
Find differentially expressed genes
1 2 3 4 5
Test Variance modeling Reference Package R Welch’s T-test - Fixed variance
- Heterodasticity Welch CIT (internal)
ANOVA - Fixed variance
- homoscedasticity Fisher CIT (internal)
Wilcoxon Non parametric Wilcoxon stats
SAM Non parametric Tusher et al
2001 samr
RVM Inverse gamma distribution on the variance (estimated from all the data set)
Wright & Simon
2003 CIT (internal) Limma
Moderate t-test. Usual variance replaced by a conditional variance. Bayesian
approach
Smyth
2004 limma
VARMIXT Gamma mixture model on the variance Delmar et al
2005 varmixt
SMVAR Mixed model (fixed condition effect and random gene effect)
Jaffrézic et al
2007 SMVar
56
Type I error rate Power
Stability Ease of use Calculation time Small
sample size
Large sample
size
Small sample
size
Large sample
size
t-test + +++ + +++ +++ +++ +++
anova +++ +++ + +++ +++ +++ +++
wilcoxon + + + ++ ++ +++ ++
SAM +++ +++ + ++ ++ ++ ++
RVM + + +++ +++ ++ + +
Limma +++ +++ +++ +++ +++ ++ +++
VarMixt +++ +++ +++ +++ + + +
SMVar + + ++ +++ + ++ 57 +++
Summary table
Use
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2933223/?tool=pubmed
Multiple test issue
About P < 0.05?
Multiple-test p-value correction
False positive discovery rate of p-value
Validation by different methodology
Northern blot
Real-time QPCR
Is gene expression variation true?
Exercise V
Statistical analysis
Introduction of MeV interface
Start:
1. TMEV.bat (window) 2. TMEV.sh (Linux)
3. TMEV_MacOSX_4_X (Mac)
Major function:
1. Clustering 2. Statistics 3. Classification 4. Data Reduction 5. Meta Analysis 6. Visualization
Load Data for Affymetrix RMA file [Select File Loader]
[Affymatrix files] [RMA files]
Select GSE _RMA_Reorder.txt
Download Annotation (Require internet connection) [Homo sapiens]
[hgu133plus2]
[Automatically download] Press [Load]
Import Affymetrix data to MeV
Load Data for illumina human Bead array V1
[Select File Loader]
[Tab Delimited Multiple sample] Select GSE _SM_Reorder.txt
Download Annotation (Require internet connection) [Homo sapiens]
[illuminaHumanv1BeadID] [Automatically download] Press [Load]
Import illumina array data to MeV
[Display]>[Gene/Row Labels]>[label by ??]
Setup gene label
1. [Cluster Manager]/[Sample Cluster]
2. Select all _ma sample and [Store Rows as Cluster]
=> Marcophage(Green)
3. Select all -mo sample and [Store Rows as Cluster]
=> Monocyte(Red)
Setup sample cluster
T-test
Equal sample sizes, equal variance
Unequal sample sizes, equal variance
Unequal sample sizes, unequal variance
Welch’s t-test (unequal variation)
1. [Analysis]>[Statistics]>[TTEST]
2. Setup:
1. Between Subjects
2. Macrophage: Group 1, Monocyte Group 2
3. Welch approximation(unequal group variance) 4. Overall alpha: 0.01
5. p-Value based on t-distribution 6. Adjusted Bonferroni correction
7. Construct Hierarchical tree for Significant genes only
Exercise V: Welch’s t-test
Assume: k gene in a chip.
Experiment-wise significant level: αe Comparison-wise significance level: αc Bonferroni correction
K independent test, so all α is equal. so overall number of type ) errors = K x α αe = αc/K => αc = αe x K
Corrected p = p x K
Dunn-Sidak
αe = 1-(1-αc) K => αc = 1-(1-αe) 1/k Corrected p = 1-(1-p) 1/k
Bonferroni Step-down (Holm) Sort p-values (ascending)
1st. corrected p-value = p-value x n(gene number) 2nd. corrected p-value = p-value x n-1
3rd. corrected p-value = p-value x n-2
…
nth. corrected p-value = p-value x 1
Multiple test correction
Family-wise error rate (FWER) is the probability of making one or more false discoveries. (p-value of p- value)
Step up False Discovery Rate p-Value sort ascending
1st. Corrected p = p x n/(n-0) 2nd. Corrected p = p x n/(n-1) 3rd. Corrected p = p x n/(n-2) ...
nth. Corrected p = p x n /1
Step down False Discovery Rate p-Value sort descending
1st. Corrected p = p x n/(n) 2nd. Corrected p = p x n/(n+1-2) 3rd. Corrected p = p x n/(n+1-3) ...
nth. Corrected p = p x n /1 q-Value
False positive discovery (FDR)
Significant genes
# of Significant Genes: 61
% of Genes that are Signficant: 0%
Non-significant genes
# of non-significant Genes: 54614
% of Genes that are not signficant: 100%
Result of Welch’s t test
Gene View
Volcano Plot
Volcano plot
Pos.Only >> Store selected gene as Cluster [Up] (red) Neg.Only >> Store selected gene as Cluster [Dn] (Green)
Gene cluster
Paired t-test (unequal variation)
1. [Analysis]>[Statistics]>[TTEST]
2. Setup:
1. Paired
2. Select paried samples
3. Welch approximation(unequal group variance) 4. Overall alpha: 0.01
5. p-Value based on t-distribution 6. Adjusted Bonferroni correction
7. Construct Hierarchical tree for Significant genes only
Exercise V: Paired t-test
Result of Paired t test
Significant genes
# of Significant Genes: 1
% of Genes that are Signficant: 0% Non-significant genes
# of non-significant Genes: 54674
% of Genes that are not signficant: 100%
Limma
1. [Analysis]>[Statistics]>[LIMMA]
2. Setup:
1. Two Class
2. Significance Level: Alpha = .000001
3. Construct Hierarchical tree for Significant genes only
Exercise V: LIMMA
Significant genes # of Significant Genes: 399
% of Genes that are Signficant: 1%
Non-significant genes # of non-significant Genes: 54276
% of Genes that are not signficant: 99%
Venn Diagram
Analysis of Variance (ANOVA)
GSE11324
MCF7 cells were stimulated with 100 nM estrogen for 0, 3, 6, or 12h. All experiments were performed in triplicate.
ANOVA
Data pre-procession (GSE11324.anl)
1. Use LucasLazyPack create GSE11324_RMA.txt 2. Import GSE11324_RMA.txt to MeV
3. Setup sample cluster (0,3,6,12hr) Statistics
1. [Analysis]>[Statistics]>[ANOVA] 2. Number of groups: 4
3. p-Value based on F-distribution 4. Overall alpha: 0.01
5. Adjusted Bonferroni correction
6. Construct Hierarchical tree for Significant genes only
Exercise V: ANOVA
Significant genes
# of Significant Genes: 162
% of Genes that are Signficant: 0%
Non-significant genes
# of non-significant Genes: 54513
% of Genes that are not signficant: 100%
1. Get Data set for NCBI GEO
1. GSE11540( included GSE10213,GSE11430) 2. Read sample annotation of GEO data set
2. Pre-process raw data by R (Affymetrix .cel files)
1. Use GSE10213
2. RMA normalization 3. Signal flooring 4. Sample reordering
3. Data QC by R
1. Use GSE10213/GSE11430 2. Box plot
3. Similary matrix 4. MDS plot
4. Statistical analysis by MeV
1. Import Data to MeV (GSE10213/GSE11430 ) 2. PCA plot (GSE10213)
3. class comparison of GSE Welch’s t test, limma) 4. Pair sample comparison GSE10213 (Paired t test)
5. Multiple class comparison GSE11324(ANOVA)