Microarray Data Analysis

(1)

Microarray Data Analysis

Date: 2012.07.17

What is Microarray Technique?

(2)

Principle of microarray

ID

0 100000 200000 300000 400000 500000 600000 700000 800000 900000 1000000

0 1000002000003000004000005000006000007000008000009000001000000

Target Preparation

& Deposition

Probe Preparation

& Hybridization

Scanning & Data Analysis

Array Spotter

Array Scanner

General Workflow of Data Analysis

•

Data preprocessing

– Normalization – Probe summarization

– Signal adjustment (flag, flooring…)

– Signal transform (fold change, log transform)

•

Data QC/QA

– QC index (Hybridization control) – Visualization (PCA, MDS, Box plot..)

•

Find differentially expressed genes

– Statistical analysis

– p-Value correction

•

Biological interpretation

– Gene annotation – Gene set analysis – Gene network analysis

Array_profiles

A few D.E. genes

Large‐scaled screening

Literature_studies Biomarkers

Cellular Functions

Genes

(3)

From signal to expression level..

• Dual color system

– Probe _flag

– Background correction

– R/G Normalization(lowess)

– Signal _flooring

– Convert to signal ratio.

– Log _Transform

• Single color system

– Background correction

– Summarization

– Normalization(quantile)

– Log _Transform

– Signal _flooring

One_gene One_gene

One_gene

Experimental Errors and Batch Effects

Gene Test sample Control

Gene 1 1230 122

Gene 2 323 25

Gene 3 32 3

Gene 4 541 32

…. 37 37

…. …. ….

Gene 3 have 10 fold up-regulated in test sample?!

(4)

Normalization

• Assumption

– Assume the signal of global or a subset genes is similar among each samples.

– Assume the distribution of signal value is normal distribution.

• Method

–

Algorithm

• Mean_shift

• Regression: Linear, Lowess, Print‐tips Lowess

• Quantile –

Basal _line

• Whole, Spike in, house‐keeping gene

• Global and local (print tips)

Variation _Sources

Intensity = F (mRNA conc, E

labeling,

^E

hybridization

^, Probe

affinity ..

⁾

0 5 10 15

Gene_A Gene_B Gene_C Gene_D

mRNA conc.2 Labeling hybridization Probe Affinity Batch_effect Unpredictable Signal

(5)

Single or Dual Color System

0 1 2 3 4 5 6 7 8

Gene 1 Gene 2 Gene 3 Gene 4 Sample 2

Sample 1

0 2 4 6

Gene 1 Gene 2 Gene 3 Gene 4

Sample 3 Sample 2 Sample 1

0 2 4 6

Dual color system

Intensity = F (mRNA conc, Elabeling,^Ehybridization, Probeaffinity⁾

0 5 10 15

Chip₁ Chip₂

mRNA Conc. Labeling hybridization Probe Affinity Gene A ratio ^Gene^A ratio

(6)

Data preprocessing ‐ Normalization

• Dual color system

– Probe _flag

– R/G Normalization(lowess)

– Signal _flooring

– Convert to signal ratio.

– Log _Transform

• Single color system

– Background correction

– Summarization

– Normalization(quantile)

– Log _Transform

– Signal _flooring

Single Color system

Intensity = F (mRNA conc, E

labeling,

^E

hybridization

^, Probe

affinity ..

⁾

0 5 10 15

Gene_A Gene_B Gene_C Gene_D

mRNA_conc.2 Labeling hybridization Probe_Affinity Signal Should be equal between chips

(7)

Data preprocessing ‐ Normalization

• Dual color system

– Probe _flag

– R/G Normalization(lowess)

– Signal _flooring

– Convert to signal ratio.

– Log _Transform

• Single color system

– Background correction

– Summarization

– Normalization(quantile)

– Log _Transform

– Signal _flooring

RMA

Signal flooring

Gene Test sample Control Ratio

Gene 1 150 122 1.23

Gene 2 323 25 12.8

Gene 3 242 0.5 484

Gene 4 541 230 2.35

Gene 5 0.1 25 250

…. …. ….

Gene 3 is the most significantly changed genes!?

(8)

Data preprocessing – Signal flooring

• Dual color system

– Probe _flag

– R/G Normalization

– Signal _flooring

– Convert to signal ratio

– Log _Transform

• Single color system

– Background correction

– Summarization

– Normalization

– Log _transform

– Signal _flooring

General Workflow of Data Analysis

•

Data preprocessing

•

Data QC/QA

•

Find differentially expressed genes

•

Biological interpretation

Array_profiles

A few D.E. genes

Cellular Functions

Genes

(9)

Microarray _Data

Quality Control and Assessment

Is hybridization reaction OK?

Is signal comparable?

Is expected variations observed?

Is hybridization reaction OK?

• Spike ‐in control

– Agilent _system

– Affymetrix system

(10)

Is signal comparable?

• Data distribution

–

Histogram

–

Box _plot

Is expected variations observed?

• Distance measurement

_HCL

PCA_or MDS

Distance matrics

• Euclidean

• Manhattan

• Pearson dissimilarity

(11)

Data quality is the first and the most

important issue in microarray analysis!

Is hybridization reaction OK?

Is array signal comparable?

Is expected variations observed?

Problem from Experiment Design

• Where is experiment variation?

– Signal: treatment or not.

– Noise: batch, gender, age, ^cell cycle^, growth environment…..

• System _errors

– Technique_repeats – Biological repeats – How many repeats?

Good Fat Normal

Batch Age Batch Age

1 1 6

2 2 9

3 3 18

Bad Fat Normal

Batch Age Batch Age

1 1 18

2 1 24

3 1 18

4 3 6

5 3 8

6 3 6

(12)

General Workflow of Data Analysis

•

Data preprocessing

•

Data QC/QA

•

Find differentially expressed genes

•

Biological interpretation

Array_profiles

A few D.E. genes

Cellular Functions

Genes

Look for differentially Expressed

genes

(13)

Hypothesis _testing

Differential _analysis : comparison of 2 populations (or more)

according to a variable of interest (expression level).

Statistical hypothesis ^: assumption about a population parameter :

2 types of statistical hypotheses :

H

₀

: Expression level is the same between the 2 populations

H

₁

: Expression level differs between the 2 populations

25

Power

H0accepted H0rejected

H0TRUE True negative False positive (type I error) H0FALSE ^False^negative

(type II error) ^True^positive

Type I error rate : _{ (0.05)}

Type II error rate : _ ^Power^⁼^1 –^

Ability of a test to detect a gene as differentially expressed, given that this gene is actually differentially

expressed.

(14)

Test Variance modeling Reference Package R Welch’s T‐test ‐ Fixed variance

‐ Heterodasticity ^Welch CIT (internal)

ANOVA ‐ Fixed variance

‐ homoscedasticity ^Fisher ^CIT (internal)

Wilcoxon Non parametric Wilcoxon stats

SAM Non parametric Tusher et al

2001 ^samr

RVM Inverse gamma distribution on the

variance (estimated from all the data set) Wright & Simon

2003 ^CIT (internal) Limma Moderate t‐test. Usual variance replaced

by a conditional variance. Bayesian approach

Smyth

2004 ^limma

VARMIXT Gamma mixture model on the variance Delmar et al

2005 ^varmixt

SMVAR Mixed model (fixed condition effect and

random gene effect) Jaffrézic et al

2007 ^SMVar

27

28

Summary _table

Type I error rate Power

Stability Ease of use Calculation time Small

sample size

Large sample

size

Small sample

size

Large sample

size

t‐test + +++ + +++ +++ +++ +++

anova +++ +++ + +++ +++ +++ +++

wilcoxon + + + ++ ++ +++ ++

SAM +++ +++ + ++ ++ ++ ++

RVM + + +++ +++ ++ + +

Limma +++ +++ +++ +++ +++ ++ +++

VarMixt +++ +++ +++ +++ + + +

SMVar + + ++ +++ + ++ +++

Use

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2933223/?tool=pubmed

(15)

Find differentially expressed genes

• Statistics

–

Two samples (ratio)

–

One class sample

–

Two sample pools

• t‐test(Welch’s t test)

• Limma (a modified t‐test)

• SAM (based on permutation)

–

>2 group (1‐way ANOVA)

–

Pair comparison (pair t‐test)

–

Multi ‐factor analysis (n‐way ANOVA)

–

Time _course

• Select D.E. genes

–

Multiple test correction

• p‐value correction

–

Filtration

• Fold changes‐cut‐off

• p‐Value cut‐off

–

Volcano _plot

1 2 3 4 5

T _‐test

Equal sample sizes, equal variance Unequal sample sizes, equal variance

Unequal sample sizes, unequal variance

(16)

Analysis of Variance (ANOVA)

Is gene expression variation true?

• Multiple test issue

– About P < 0.05?

– Multiple ‐test p‐value correction

– False positive discovery rate of p‐value

• Validation by different methodology

– Northern _blot

– Real _{‐time QPCR}

(17)

Multiple test correction

Assume: k gene in a chip.

Experiment‐wise significant level: αe Comparison‐wise significance level: αc Bonferroni correction

K independent test, so all α is equal. so overall number of type I errors = K x α αe = αc/K => αc = αe x K

Corrected p = p x K Dunn_‐Sidak

αe = 1‐(1‐αc) ^K=> αc = 1‐(1‐αe) ^1/k Corrected p = 1‐(1‐p) ^1/k Bonferroni Step‐down (Holm) Sort p‐values (ascending)

1st. corrected p‐value = p‐value x n(gene number) 2nd. corrected p‐value = p‐value x n‐1

3rd. corrected p‐value = p‐value x n‐2

…

nth. corrected p‐value = p‐value x 1

False positive discovery (FDR)

Family‐wise error rate (FWER) is the probability of making one or more false discoveries. (p‐value of p‐ value)

Step up False Discovery Rate p‐Value sort ascending 1st. Corrected p = p x n/(n‐0) 2nd. Corrected p = p x n/(n‐1) 3rd. Corrected p = p x n/(n‐2) ...

nth. Corrected p = p x n /1 Step down False Discovery Rate p‐Value sort descending 1st. Corrected p = p x n/(n) 2nd. Corrected p = p x n/(n+1‐2) 3rd. Corrected p = p x n/(n+1‐3) ...

nth. Corrected p = p x n /1 q‐Value

(18)

General Workflow of Data Analysis

•

Data preprocessing

•

Data QC/QA

•

Find differentially expressed genes

•

Biological interpretation

Array_profiles

A few D.E. genes

Cellular Functions

Genes

Gene Annotation

X gene is a W kinase

X gene is involved in Y function

X gene is involved in Z pathway

(19)

Gene _Ontology

• Systemic biological function vocabulary

• Acyclic network structure

• Gene association

www.geneontology.org

KEGG: Kyoto Encyclopedia of Genes and Genomes

www.genome.jp/kegg/

(20)

The Reactome Project

OMIM ‐ Online Mendelian Inheritance in Man

http://www.ncbi.nlm.nih.gov/omim

(21)

Biological interpretation

• ^Functional annotation of individual gene

– Gene ontology (GO, KEGG..)

– Pathway (KEGG, BIOCARTA, GenMAPP)

– OMIM

– Database gateway (GeneCards, Uniprot..)

D.E._Genes _Array:

Probe description

Annotation Database

Annotation

Database

^Gateway

GeneCards

http://www.genecards.org/

(22)

Uniprot

• http://www.uniprot.org/

What happen in cellular function?

• Individual gene studies

–

Find differential expressed gene (DEG)

–

Look up gene annotation

–

Validate function by genetic manipulations

• over‐express, knock‐down..

• The risks of the straight forward strategy

–

Which _DEG?

• A try and error game –

Which _function?

• Do you check right functional assay? –

Do _it work?

• Function regulation is a team work.

Function₂ Function 1

Cross‐talk 3

(23)

ClassA ClassB

ttest cut‐off

FDR<0.05

FDR<0.05 Biological meaning?

Expression Analysis Systematic Explorer

• Cut off issues

• Gene set issues

Expression Analysis Systematic Explorer

• Also:

• Functional annotation enrichment analysis

• GeneOntology analysis

• Interpretation of gene list – Gene_Ontology

– KEGG_pathway – Biocarta pathway

•

Statistic _methods

– Fish exact test – Hypergeometric test – Binomial_test

(24)

Fisher Exact Test

• When members of two independent groups can fall into one of two mutually exclusive categories, Fisher Exact test is used to determine whether the proportions of those falling into each category differs by group. In DAVID annotation system, Fisher Exact is adopted to measure the gene‐enrichment in annotation terms.

A Hypothetical Example:

In human genome background (30,000 gene total), 40 genes are involved in p53 signalling pathway. A given gene list has found that 3 out of 300 belong to p53 signalling pathway. Then we ask the question if 10/300 is more than random chance comparing to the human background of 40/30000.

A 2x2 contigency table is built on above numbers:

•

Fisher Exact P‐Value = 0.008 and since P‐Value <= 0.01, this user gene list is specifically associated (enriched) in p53 signalling pathway than random chance.

User Genes Genome

In Pathway 3 40

Not In Pathway 297 29960

Adapted from DAVID (http://niaid.abcc.ncifcrf.gov/)

The Database _for Annotation, Visualization _and

Integrated Discovery (DAVID)

• ID conversion

• Gene annotation

• EASE _analysis

• ….

http://david.abcc.ncifcrf.gov/

(25)

Gene set approaches

• From Gene list (IGA)

• Expression Analysis Systematic Explorer

• From Gene set score(GSA)

• Gene Set Enrichment Analysis

• event‐based Gene Set Analysis

Gene Set Enrichment Analysis

http://www.pnas.org/content/102/43/15545.abstract

(26)

ES/NES statistic

-

+

ClassA ClassB

Gene Set 1

ttest cut‐off

Gene Set 2

Gene Set 3

Gene set 3 enriched in Class B

Gene set 2 enriched in Class A

Gene Set Enrichment Analysis

Dataset distribution

Number of genes

Gene Expression Level

The Kolmogorov–Smirnov test is used to determine whether two underlying one-dimensional probability distributions differ, or whether an underlying probability distribution differs from a hypothesized distribution, in either case based on finite samples.

The one-sample KS test compares the empirical distribution function with the cumulative distribution functionspecified by the null hypothesis. The main applications are testing goodness of fit with the normal and uniform distributions.

The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples.

Gene set 1 distribution

Gene set 2 distribution

http//:Fbioinfo.cnio.es/files/training/Fourth_Functional_Analysis_of_Gene_Expression/CNIO_GSEA_16_02_2010.p pt

(27)

NES NES

pval pval

FDR FDR

Benjamini‐Hochberg

ES(S)_= Max(|P_hit_‐P_miss|) Kolmogorov‐Smirnoff (K‐S) distance

Gene Set Enrichment Analysis

(28)

GSEA Implementation

http://www.broadinstitute.org/gsea/msigdb/index.jsp

• Gene set (.gmt files)

• GSEA _in MeV

• GSEA _for JAVA

• GSEA _for R

Something _else..

• Microarray is dead! NGS is new king!

• Great! My magic gene can link to cell cycle,

apoptosis and P53 pathway!

• Follow what you’re interested and biological

clues.

(29)

General Workflow of Data Analysis

• Data preprocessing – Normalization – Probe summarization

• Data QC/QA

• Find differentially expressed genes – Statistical_analysis

– p‐Value correction

• Biological interpretation – Gene annotation – Gene set analysis – Gene network analysis

Array_profiles

A few D.E. genes

Cellular Functions

Genes

The _End

(30)

Microarray Data Analysis