Microarray Data Analysis
Date: 2012.07.17
What is Microarray Technique?
Principle of microarray
ID
0 100000 200000 300000 400000 500000 600000 700000 800000 900000 1000000
0 1000002000003000004000005000006000007000008000009000001000000
Target Preparation
& Deposition
Probe Preparation
& Hybridization
Scanning & Data Analysis
Array Spotter
Array Scanner
General Workflow of Data Analysis
•
Data preprocessing
– Normalization – Probe summarization– Signal adjustment (flag, flooring…)
– Signal transform (fold change, log transform)
•
Data QC/QA
– QC index (Hybridization control) – Visualization (PCA, MDS, Box plot..)
•
Find differentially expressed genes
– Statistical analysis– p-Value correction
•
Biological interpretation
– Gene annotation – Gene set analysis – Gene network analysisArray profiles
A few D.E. genes
Large‐scaled screening
Literature studies Biomarkers
Cellular Functions
Genes
From signal to expression level..
• Dual color system
– Probe flag
– Background correction
– R/G Normalization(lowess)
– Signal flooring
– Convert to signal ratio.
– Log Transform
• Single color system
– Background correction
– Summarization
– Normalization(quantile)
– Log Transform
– Signal flooring
One gene One gene
One gene
Experimental Errors and Batch Effects
Gene Test sample Control
Gene 1 1230 122
Gene 2 323 25
Gene 3 32 3
Gene 4 541 32
…. 37 37
…. …. ….
Gene 3 have 10 fold up-regulated in test sample?!
Normalization
• Assumption
– Assume the signal of global or a subset genes is similar among each samples.
– Assume the distribution of signal value is normal distribution.
• Method
–
Algorithm
• Mean shift
• Regression: Linear, Lowess, Print‐tips Lowess
• Quantile –
Basal line
• Whole, Spike in, house‐keeping gene
• Global and local (print tips)
Variation Sources
Intensity = F (mRNA conc, E
labeling,E
hybridization, Probe
affinity ..)
0 5 10 15
Gene A Gene B Gene C Gene D
mRNA conc.2 Labeling hybridization Probe Affinity Batch effect Unpredictable Signal
Single or Dual Color System
0 1 2 3 4 5 6 7 8
Gene 1 Gene 2 Gene 3 Gene 4 Sample 2
Sample 1
0 2 4 6
Gene 1 Gene 2 Gene 3 Gene 4
Sample 3 Sample 2 Sample 1
0 2 4 6
Gene 1 Gene 2 Gene 3 Gene 4
0 2 4 6
Gene 1 Gene 2 Gene 3 Gene 4
Dual color system
Intensity = F (mRNA conc, Elabeling,Ehybridization, Probeaffinity)
0 5 10 15
Chip 1 Chip 2
mRNA Conc. Labeling hybridization Probe Affinity Gene A ratio Gene A ratio
Data preprocessing ‐ Normalization
• Dual color system
– Probe flag
– R/G Normalization(lowess)
– Signal flooring
– Convert to signal ratio.
– Log Transform
• Single color system
– Background correction
– Summarization
– Normalization(quantile)
– Log Transform
– Signal flooring
Single Color system
Intensity = F (mRNA conc, E
labeling,E
hybridization, Probe
affinity ..)
0 5 10 15
Gene A Gene B Gene C Gene D
mRNA conc.2 Labeling hybridization Probe Affinity Signal Should be equal between chips
Data preprocessing ‐ Normalization
• Dual color system
– Probe flag
– R/G Normalization(lowess)
– Signal flooring
– Convert to signal ratio.
– Log Transform
• Single color system
– Background correction
– Summarization
– Normalization(quantile)
– Log Transform
– Signal flooring
RMA
Signal flooring
Gene Test sample Control Ratio
Gene 1 150 122 1.23
Gene 2 323 25 12.8
Gene 3 242 0.5 484
Gene 4 541 230 2.35
Gene 5 0.1 25 250
…. …. ….
Gene 3 is the most significantly changed genes!?
Data preprocessing – Signal flooring
• Dual color system
– Probe flag
– R/G Normalization
– Signal flooring
– Convert to signal ratio
– Log Transform
• Single color system
– Background correction
– Summarization
– Normalization
– Log transform
– Signal flooring
General Workflow of Data Analysis
•
Data preprocessing
– Normalization – Probe summarization– Signal adjustment (flag, flooring…)
– Signal transform (fold change, log transform)
•
Data QC/QA
– QC index (Hybridization control) – Visualization (PCA, MDS, Box plot..)
•
Find differentially expressed genes
– Statistical analysis– p-Value correction
•
Biological interpretation
– Gene annotation – Gene set analysis – Gene network analysisArray profiles
A few D.E. genes
Large‐scaled screening
Literature studies Biomarkers
Cellular Functions
Genes
Microarray Data
Quality Control and Assessment
Is hybridization reaction OK?
Is signal comparable?
Is expected variations observed?
Is hybridization reaction OK?
• Spike ‐in control
– Agilent system
– Affymetrix system
Is signal comparable?
• Data distribution
–
Histogram
–Box plot
Is expected variations observed?
• Distance measurement
HCLPCA or MDS
Distance matrics
• Euclidean
• Manhattan
• Pearson dissimilarity
Data quality is the first and the most
important issue in microarray analysis!
Is hybridization reaction OK?
Is array signal comparable?
Is expected variations observed?
Problem from Experiment Design
• Where is experiment variation?
– Signal: treatment or not.– Noise: batch, gender, age, cell cycle, growth environment…..
• System errors
– Technique repeats – Biological repeats – How many repeats?Good Fat Normal
Batch Age Batch Age
1 1 6
2 2 9
3 3 18
Bad Fat Normal
Batch Age Batch Age
1 1 18
2 1 24
3 1 18
4 3 6
5 3 8
6 3 6
General Workflow of Data Analysis
•
Data preprocessing
– Normalization – Probe summarization– Signal adjustment (flag, flooring…)
– Signal transform (fold change, log transform)
•
Data QC/QA
– QC index (Hybridization control) – Visualization (PCA, MDS, Box plot..)
•
Find differentially expressed genes
– Statistical analysis– p-Value correction
•
Biological interpretation
– Gene annotation – Gene set analysis – Gene network analysisArray profiles
A few D.E. genes
Large‐scaled screening
Literature studies Biomarkers
Cellular Functions
Genes
Look for differentially Expressed
genes
Hypothesis testing
Differential analysis : comparison of 2 populations (or more)
according to a variable of interest (expression level).
Statistical hypothesis : assumption about a population parameter :
2 types of statistical hypotheses :
H
0: Expression level is the same between the 2 populations
H
1: Expression level differs between the 2 populations
25
Power
H0accepted H0rejected
H0TRUE True negative False positive (type I error) H0FALSE False negative
(type II error) True positive
Type I error rate : (0.05)
Type II error rate : Power= 1 –
Ability of a test to detect a gene as differentially expressed, given that this gene is actually differentially
expressed.
Test Variance modeling Reference Package R Welch’s T‐test ‐ Fixed variance
‐ Heterodasticity Welch CIT (internal)
ANOVA ‐ Fixed variance
‐ homoscedasticity Fisher CIT (internal)
Wilcoxon Non parametric Wilcoxon stats
SAM Non parametric Tusher et al
2001 samr
RVM Inverse gamma distribution on the
variance (estimated from all the data set) Wright & Simon
2003 CIT (internal) Limma Moderate t‐test. Usual variance replaced
by a conditional variance. Bayesian approach
Smyth
2004 limma
VARMIXT Gamma mixture model on the variance Delmar et al
2005 varmixt
SMVAR Mixed model (fixed condition effect and
random gene effect) Jaffrézic et al
2007 SMVar
27
28
Summary table
Type I error rate Power
Stability Ease of use Calculation time Small
sample size
Large sample
size
Small sample
size
Large sample
size
t‐test + +++ + +++ +++ +++ +++
anova +++ +++ + +++ +++ +++ +++
wilcoxon + + + ++ ++ +++ ++
SAM +++ +++ + ++ ++ ++ ++
RVM + + +++ +++ ++ + +
Limma +++ +++ +++ +++ +++ ++ +++
VarMixt +++ +++ +++ +++ + + +
SMVar + + ++ +++ + ++ +++
Use
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2933223/?tool=pubmed
Find differentially expressed genes
• Statistics
–
Two samples (ratio)
–One class sample
–Two sample pools
• t‐test(Welch’s t test)
• Limma (a modified t‐test)
• SAM (based on permutation)
–
>2 group (1‐way ANOVA)
–Pair comparison (pair t‐test)
–Multi ‐factor analysis (n‐way ANOVA)
–Time course
• Select D.E. genes
–
Multiple test correction
• p‐value correction
–
Filtration
• Fold changes‐cut‐off
• p‐Value cut‐off
–
Volcano plot
1 2 3 4 5
T ‐test
Equal sample sizes, equal variance Unequal sample sizes, equal variance
Unequal sample sizes, unequal variance
Analysis of Variance (ANOVA)
Is gene expression variation true?
• Multiple test issue
– About P < 0.05?
– Multiple ‐test p‐value correction
– False positive discovery rate of p‐value
• Validation by different methodology
– Northern blot
– Real ‐time QPCR
Multiple test correction
Assume: k gene in a chip.
Experiment‐wise significant level: αe Comparison‐wise significance level: αc Bonferroni correction
K independent test, so all α is equal. so overall number of type I errors = K x α αe = αc/K => αc = αe x K
Corrected p = p x K Dunn‐Sidak
αe = 1‐(1‐αc) K => αc = 1‐(1‐αe) 1/k Corrected p = 1‐(1‐p) 1/k Bonferroni Step‐down (Holm) Sort p‐values (ascending)
1st. corrected p‐value = p‐value x n(gene number) 2nd. corrected p‐value = p‐value x n‐1
3rd. corrected p‐value = p‐value x n‐2
…
nth. corrected p‐value = p‐value x 1
False positive discovery (FDR)
Family‐wise error rate (FWER) is the probability of making one or more false discoveries. (p‐value of p‐ value)
Step up False Discovery Rate p‐Value sort ascending 1st. Corrected p = p x n/(n‐0) 2nd. Corrected p = p x n/(n‐1) 3rd. Corrected p = p x n/(n‐2) ...
nth. Corrected p = p x n /1 Step down False Discovery Rate p‐Value sort descending 1st. Corrected p = p x n/(n) 2nd. Corrected p = p x n/(n+1‐2) 3rd. Corrected p = p x n/(n+1‐3) ...
nth. Corrected p = p x n /1 q‐Value
General Workflow of Data Analysis
•
Data preprocessing
– Normalization – Probe summarization– Signal adjustment (flag, flooring…)
– Signal transform (fold change, log transform)
•
Data QC/QA
– QC index (Hybridization control) – Visualization (PCA, MDS, Box plot..)
•
Find differentially expressed genes
– Statistical analysis– p-Value correction
•
Biological interpretation
– Gene annotation – Gene set analysis – Gene network analysisArray profiles
A few D.E. genes
Large‐scaled screening
Literature studies Biomarkers
Cellular Functions
Genes
Gene Annotation
X gene is a W kinase
X gene is involved in Y function
X gene is involved in Z pathway
Gene Ontology
• Systemic biological function vocabulary
• Acyclic network structure
• Gene association
www.geneontology.org
KEGG: Kyoto Encyclopedia of Genes and Genomes
www.genome.jp/kegg/
The Reactome Project
OMIM ‐ Online Mendelian Inheritance in Man
http://www.ncbi.nlm.nih.gov/omim
Biological interpretation
• Functional annotation of individual gene
– Gene ontology (GO, KEGG..)
– Pathway (KEGG, BIOCARTA, GenMAPP)
– OMIM
– Database gateway (GeneCards, Uniprot..)
D.E. Genes Array:
Probe description
Annotation Database
Annotation Database
Annotation
Database
Gateway
GeneCards
http://www.genecards.org/
Uniprot
• http://www.uniprot.org/
What happen in cellular function?
• Individual gene studies
–
Find differential expressed gene (DEG)
–Look up gene annotation
–
Validate function by genetic manipulations
• over‐express, knock‐down..
• The risks of the straight forward strategy
–
Which DEG?
• A try and error game –
Which function?
• Do you check right functional assay? –
Do it work?
• Function regulation is a team work.
Function 2 Function 1
Cross‐talk 3
ClassA ClassB
ttest cut‐off
FDR<0.05
FDR<0.05 Biological meaning?
Expression Analysis Systematic Explorer
• Cut off issues
• Gene set issues
Expression Analysis Systematic Explorer
• Also:
• Functional annotation enrichment analysis
• GeneOntology analysis
• Interpretation of gene list – Gene Ontology
– KEGG pathway – Biocarta pathway
•
Statistic methods
– Fish exact test – Hypergeometric test – Binomial testFisher Exact Test
• When members of two independent groups can fall into one of two mutually exclusive categories, Fisher Exact test is used to determine whether the proportions of those falling into each category differs by group. In DAVID annotation system, Fisher Exact is adopted to measure the gene‐enrichment in annotation terms.
A Hypothetical Example:
In human genome background (30,000 gene total), 40 genes are involved in p53 signalling pathway. A given gene list has found that 3 out of 300 belong to p53 signalling pathway. Then we ask the question if 10/300 is more than random chance comparing to the human background of 40/30000.
A 2x2 contigency table is built on above numbers:
•
Fisher Exact P‐Value = 0.008 and since P‐Value <= 0.01, this user gene list is specifically associated (enriched) in p53 signalling pathway than random chance.
User Genes Genome
In Pathway 3 40
Not In Pathway 297 29960
Adapted from DAVID (http://niaid.abcc.ncifcrf.gov/)
The Database for Annotation, Visualization and
Integrated Discovery (DAVID)
• ID conversion
• Gene annotation
• EASE analysis
• ….
http://david.abcc.ncifcrf.gov/
Gene set approaches
• From Gene list (IGA)
• Expression Analysis Systematic Explorer
• From Gene set score(GSA)
• Gene Set Enrichment Analysis
• event‐based Gene Set Analysis
Gene Set Enrichment Analysis
http://www.pnas.org/content/102/43/15545.abstract
ES/NES statistic
-
+
ClassA ClassB
Gene Set 1
ttest cut‐off
Gene Set 2
Gene Set 3
Gene set 3 enriched in Class B
Gene set 2 enriched in Class A
Gene Set Enrichment Analysis
Dataset distribution
Number of genes
Gene Expression Level
The Kolmogorov–Smirnov test is used to determine whether two underlying one-dimensional probability distributions differ, or whether an underlying probability distribution differs from a hypothesized distribution, in either case based on finite samples.
The one-sample KS test compares the empirical distribution function with the cumulative distribution functionspecified by the null hypothesis. The main applications are testing goodness of fit with the normal and uniform distributions.
The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples.
Gene set 1 distribution
Gene set 2 distribution
http//:Fbioinfo.cnio.es/files/training/Fourth_Functional_Analysis_of_Gene_Expression/CNIO_GSEA_16_02_2010.p pt
NES NES
pval pval
FDR FDR
Benjamini‐Hochberg
ES(S) = Max(|Phit‐Pmiss|) Kolmogorov‐Smirnoff (K‐S) distance
Gene Set Enrichment Analysis
GSEA Implementation
http://www.broadinstitute.org/gsea/msigdb/index.jsp
• Gene set (.gmt files)
• GSEA in MeV
• GSEA for JAVA
• GSEA for R
Something else..
• Microarray is dead! NGS is new king!
• Great! My magic gene can link to cell cycle,
apoptosis and P53 pathway!
• Follow what you’re interested and biological
clues.
General Workflow of Data Analysis
• Data preprocessing – Normalization – Probe summarization
– Signal adjustment (flag, flooring…)
– Signal transform (fold change, log transform)
• Data QC/QA
– QC index (Hybridization control) – Visualization (PCA, MDS, Box plot..)
• Find differentially expressed genes – Statistical analysis
– p‐Value correction
• Biological interpretation – Gene annotation – Gene set analysis – Gene network analysis
Array profiles
A few D.E. genes
Large‐scaled screening
Literature studies Biomarkers
Cellular Functions
Genes