• 検索結果がありません。

Microarray Data Analysis

N/A
N/A
Protected

Academic year: 2018

シェア "Microarray Data Analysis"

Copied!
30
0
0

読み込み中.... (全文を見る)

全文

(1)

Microarray  Data Analysis

Date:  2012.07.17

What  is Microarray Technique?

(2)

Principle of microarray

ID

0 100000 200000 300000 400000 500000 600000 700000 800000 900000 1000000

0 1000002000003000004000005000006000007000008000009000001000000

Target Preparation

& Deposition

Probe Preparation

& Hybridization

Scanning & Data Analysis

Array Spotter

Array Scanner

General Workflow of Data Analysis

Data preprocessing

– Normalization – Probe summarization

– Signal adjustment (flag, flooring…)

– Signal transform (fold change, log transform)

Data QC/QA

– QC index (Hybridization control) – Visualization (PCA, MDS, Box plot..)

Find differentially expressed genes

– Statistical analysis

– p-Value correction

Biological interpretation

– Gene annotation – Gene set analysis – Gene network analysis

Array profiles

A few D.E. genes

Large‐scaled screening

Literature studies Biomarkers

Cellular Functions

Genes

(3)

From  signal to expression level..

Dual  color system 

Probe  flag

– Background  correction

– R/G  Normalization(lowess)

– Signal  flooring

– Convert  to signal ratio.

– Log  Transform

Single  color system

– Background  correction

– Summarization

– Normalization(quantile)

– Log  Transform

– Signal  flooring

One gene One gene

One gene

Experimental Errors and Batch Effects

Gene Test sample Control

Gene 1 1230 122

Gene 2 323 25

Gene 3 32 3

Gene 4 541 32

…. 37 37

…. …. ….

Gene 3 have 10 fold up-regulated in test sample?!

(4)

Normalization

Assumption  

– Assume the signal of global or a subset genes is similar among each  samples.

– Assume the distribution of  signal value is normal distribution.

Method

Algorithm

• Mean shift

• Regression: Linear, Lowess, Print‐tips Lowess

• Quantile –

Basal  line

• Whole, Spike in, house‐keeping gene

• Global and local (print tips)

Variation  Sources

Intensity  = F (mRNA conc, E

labeling,

E

hybridization

,  Probe

affinity ..

)

0 5 10 15

Gene A Gene B Gene C Gene D

mRNA conc.2 Labeling hybridization Probe Affinity Batch effect Unpredictable Signal

(5)

Single or Dual Color System

0 1 2 3 4 5 6 7 8

Gene 1 Gene 2 Gene 3 Gene 4 Sample 2

Sample 1

0 2 4 6

Gene 1 Gene 2 Gene 3 Gene 4

Sample 3 Sample 2 Sample 1

0 2 4 6

Gene 1 Gene 2 Gene 3 Gene 4

0 2 4 6

Gene 1 Gene 2 Gene 3 Gene 4

Dual  color system

Intensity = F (mRNA conc, Elabeling,Ehybridization, Probeaffinity)

0 5 10 15

Chip 1 Chip 2

mRNA Conc. Labeling hybridization Probe Affinity Gene A ratio Gene A ratio

(6)

Data  preprocessing ‐ Normalization

Dual  color system 

– Probe  flag

– R/G  Normalization(lowess)

– Signal  flooring

– Convert  to signal ratio.

– Log  Transform

Single  color system

– Background  correction

– Summarization

– Normalization(quantile)

– Log  Transform

– Signal  flooring

Single  Color system

Intensity  = F (mRNA conc, E

labeling,

E

hybridization

,  Probe

affinity ..

)

0 5 10 15

Gene A Gene B Gene C Gene D

mRNA conc.2 Labeling hybridization Probe Affinity Signal Should be equal between chips

(7)

Data  preprocessing ‐ Normalization

Dual  color system 

– Probe  flag

– R/G  Normalization(lowess)

– Signal  flooring

– Convert  to signal ratio.

– Log  Transform

Single  color system

– Background  correction

– Summarization

– Normalization(quantile)

– Log  Transform

– Signal  flooring

RMA

Signal flooring

Gene Test sample Control Ratio

Gene 1 150 122 1.23

Gene 2 323 25 12.8

Gene 3 242 0.5 484

Gene 4 541 230 2.35

Gene 5 0.1 25 250

…. …. ….

Gene 3 is the most significantly changed genes!?

(8)

Data  preprocessing – Signal flooring

Dual  color system 

– Probe  flag

– R/G  Normalization

– Signal  flooring

– Convert  to signal ratio

– Log  Transform

Single  color system

– Background  correction

– Summarization

– Normalization

– Log  transform

– Signal  flooring

General Workflow of Data Analysis

Data preprocessing

– Normalization – Probe summarization

– Signal adjustment (flag, flooring…)

– Signal transform (fold change, log transform)

Data QC/QA

– QC index (Hybridization control) – Visualization (PCA, MDS, Box plot..)

Find differentially expressed genes

– Statistical analysis

– p-Value correction

Biological interpretation

– Gene annotation – Gene set analysis – Gene network analysis

Array profiles

A few D.E. genes

Large‐scaled screening

Literature studies Biomarkers

Cellular Functions

Genes

(9)

Microarray  Data 

Quality  Control and Assessment

Is  hybridization reaction OK?

Is  signal comparable? 

Is  expected variations observed?

Is  hybridization reaction OK?

• Spike ‐in control

– Agilent  system

– Affymetrix system

(10)

Is  signal comparable? 

Data  distribution

Histogram

Box  plot

Is  expected variations observed?

• Distance  measurement

HCL

PCA or MDS

Distance matrics

• Euclidean

• Manhattan

• Pearson dissimilarity

(11)

Data  quality is the first and the most 

important  issue in microarray analysis!

Is  hybridization reaction OK?

Is  array signal comparable? 

Is  expected variations observed?

Problem  from Experiment Design

• Where is experiment variation?

Signal: treatment or not.

Noise: batch, gender, age, cell cycle,  growth environment…..

• System  errors

– Technique repeats – Biological repeats – How many repeats?

Good Fat Normal

Batch Age Batch Age

1 1 6

2 2 9

3 3 18

Bad Fat Normal

Batch Age Batch Age

1 1 18

2 1 24

3 1 18

4 3 6

5 3 8

6 3 6

(12)

General Workflow of Data Analysis

Data preprocessing

– Normalization – Probe summarization

– Signal adjustment (flag, flooring…)

– Signal transform (fold change, log transform)

Data QC/QA

– QC index (Hybridization control) – Visualization (PCA, MDS, Box plot..)

Find differentially expressed genes

– Statistical analysis

– p-Value correction

Biological interpretation

– Gene annotation – Gene set analysis – Gene network analysis

Array profiles

A few D.E. genes

Large‐scaled screening

Literature studies Biomarkers

Cellular Functions

Genes

Look  for differentially Expressed 

genes

(13)

Hypothesis  testing

Differential  analysis :  comparison of 2 populations (or more) 

according  to a variable of interest (expression level).

Statistical  hypothesis :  assumption about a population  parameter :

2  types of statistical hypotheses :

H

0

:  Expression level is the same between the 2 populations

H

1

:  Expression level differs between the 2 populations

25

Power

H0accepted H0rejected

H0TRUE True negative False positive (type I error) H0FALSE False negative

(type II error) True positive

Type I error rate :  (0.05)

Type II error rate :  Power= 1 –

Ability of a test to detect a gene as  differentially expressed, given that  this gene is actually differentially 

expressed.

(14)

Test Variance modeling Reference Package R Welch’s T‐test ‐ Fixed variance

‐ Heterodasticity Welch CIT (internal)

ANOVA ‐ Fixed variance

‐ homoscedasticity Fisher CIT (internal)

Wilcoxon Non parametric Wilcoxon stats

SAM Non parametric Tusher et al

2001 samr

RVM Inverse gamma distribution on the 

variance (estimated from all the data set) Wright & Simon

2003 CIT (internal) Limma Moderate t‐test. Usual variance replaced 

by a conditional variance. Bayesian  approach

Smyth

2004 limma

VARMIXT Gamma mixture model on the variance Delmar et al 

2005 varmixt

SMVAR Mixed model (fixed condition effect and 

random gene effect) Jaffrézic et al 

2007 SMVar

27

28

Summary  table

Type I error rate Power

Stability Ease of use Calculation time Small

sample size

Large  sample

size

Small sample

size

Large sample

size

t‐test + +++ + +++ +++ +++ +++

anova +++ +++ + +++ +++ +++ +++

wilcoxon + + + ++ ++ +++ ++

SAM +++ +++ + ++ ++ ++ ++

RVM + + +++ +++ ++ + +

Limma +++ +++ +++ +++ +++ ++ +++

VarMixt +++ +++ +++ +++ + + +

SMVar + + ++ +++ + ++ +++

Use

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2933223/?tool=pubmed

(15)

Find  differentially expressed genes

Statistics

Two  samples (ratio)

One  class sample

Two  sample pools 

• t‐test(Welch’s t test)

• Limma (a modified t‐test)

• SAM (based on permutation)

>2  group (1‐way ANOVA)

Pair  comparison (pair t‐test)

Multi ‐factor analysis (n‐way ANOVA)

Time  course

Select  D.E. genes

Multiple  test correction

p‐value correction

Filtration

• Fold changes‐cut‐off

p‐Value cut‐off

Volcano  plot

1 2 3 4 5

T ‐test

Equal sample sizes, equal variance Unequal sample sizes, equal variance

Unequal sample sizes, unequal variance

(16)

Analysis  of Variance (ANOVA)

Is  gene expression variation true?

• Multiple  test issue

– About  P < 0.05?

– Multiple ‐test p‐value correction

– False  positive discovery rate of p‐value

• Validation  by different methodology

– Northern  blot

– Real ‐time QPCR

(17)

Multiple  test correction

Assume: k gene in a chip.

Experiment‐wise significant level: αe Comparison‐wise significance level: αc Bonferroni correction

K independent test, so all α is equal. so overall number of type I errors = K x α αe = αc/K => αc = αe x K

Corrected p = p x K Dunn‐Sidak

αe = 1‐(1‐αc) => αc = 1‐(1‐αe) 1/k Corrected p = 1‐(1‐p) 1/k Bonferroni Step‐down (Holm) Sort p‐values (ascending) 

1st. corrected p‐value = p‐value x n(gene number) 2nd. corrected p‐value = p‐value x n‐1

3rd. corrected p‐value = p‐value x n‐2

nth. corrected p‐value = p‐value x 1

False  positive discovery (FDR)

Family‐wise error rate (FWER) is the probability of making one or more false discoveries. (p‐value of p‐ value)

Step up False Discovery Rate p‐Value sort ascending 1st. Corrected p = p x n/(n‐0) 2nd. Corrected p = p x n/(n‐1) 3rd. Corrected p = p x n/(n‐2) ...

nth. Corrected p = p x n /1 Step down False Discovery Rate p‐Value sort descending 1st. Corrected p = p x n/(n) 2nd. Corrected p = p x n/(n+1‐2) 3rd. Corrected p = p x n/(n+1‐3) ...

nth. Corrected p = p x n /1 q‐Value

(18)

General Workflow of Data Analysis

Data preprocessing

– Normalization – Probe summarization

– Signal adjustment (flag, flooring…)

– Signal transform (fold change, log transform)

Data QC/QA

– QC index (Hybridization control) – Visualization (PCA, MDS, Box plot..)

Find differentially expressed genes

– Statistical analysis

– p-Value correction

Biological interpretation

– Gene annotation – Gene set analysis – Gene network analysis

Array profiles

A few D.E. genes

Large‐scaled screening

Literature studies Biomarkers

Cellular Functions

Genes

Gene  Annotation

X  gene is a W kinase

X  gene is involved in Y function

X  gene is involved in Z pathway

(19)

Gene  Ontology

• Systemic  biological function vocabulary

• Acyclic  network structure

• Gene  association

www.geneontology.org

KEGG:  Kyoto Encyclopedia of Genes and Genomes

www.genome.jp/kegg/

(20)

The  Reactome Project

OMIM   ‐ Online Mendelian Inheritance in Man 

http://www.ncbi.nlm.nih.gov/omim

(21)

Biological  interpretation

Functional  annotation of individual gene

– Gene  ontology (GO, KEGG..)

– Pathway  (KEGG, BIOCARTA, GenMAPP)

– OMIM

– Database  gateway  (GeneCards, Uniprot..)

D.E. Genes Array:

Probe description

Annotation  Database

Annotation  Database

Annotation 

Database

Gateway

GeneCards

http://www.genecards.org/

(22)

Uniprot

• http://www.uniprot.org/

What happen  in cellular function?

Individual  gene studies

Find  differential expressed gene (DEG)

Look  up gene annotation

Validate  function by genetic manipulations 

• over‐express, knock‐down..

The  risks of the straight forward strategy

Which  DEG?

• A try and error game –

Which  function?

• Do you check right functional assay?  –

Do  it work?

• Function regulation is a team work.

Function 2 Function 1

Cross‐talk 3

(23)

ClassA ClassB

ttest cut‐off

FDR<0.05

FDR<0.05 Biological meaning?

Expression  Analysis Systematic Explorer

• Cut  off issues

• Gene  set issues

Expression  Analysis Systematic Explorer

Also:

• Functional annotation enrichment analysis

• GeneOntology analysis

Interpretation of gene list – Gene Ontology

– KEGG pathway – Biocarta pathway

Statistic  methods

– Fish exact test – Hypergeometric test – Binomial test

(24)

Fisher  Exact Test

• When members of two independent groups can fall into one of two mutually exclusive  categories, Fisher Exact test is used to determine whether the proportions of those  falling into each category differs by group. In DAVID annotation system, Fisher Exact is  adopted to measure the gene‐enrichment in annotation terms.

A Hypothetical Example:

In human genome background (30,000 gene total), 40 genes are involved in p53  signalling pathway. A given gene list has found that 3 out of 300 belong to p53  signalling pathway. Then we ask the question if 10/300 is more than random chance  comparing to the human background of 40/30000.

A 2x2 contigency table is built on above numbers:

Fisher Exact P‐Value = 0.008 and since P‐Value <= 0.01, this user gene list is specifically  associated (enriched) in p53 signalling pathway than random chance.

User Genes Genome

In Pathway 3 40

Not In Pathway 297 29960

Adapted from DAVID (http://niaid.abcc.ncifcrf.gov/)

The Database  for Annotation, Visualization  and 

Integrated Discovery (DAVID)

• ID  conversion

• Gene  annotation

• EASE  analysis

• ….

http://david.abcc.ncifcrf.gov/

(25)

Gene  set approaches

From Gene list (IGA)

Expression Analysis Systematic Explorer

From Gene set score(GSA)

Gene Set Enrichment Analysis

event‐based Gene Set Analysis

Gene   Set   Enrichment   Analysis

http://www.pnas.org/content/102/43/15545.abstract

(26)

ES/NES statistic

-

+

ClassA ClassB

Gene Set 1

ttest cut‐off

Gene Set 2

Gene Set 3

Gene set 3 enriched in Class B

Gene set 2 enriched in Class A

Gene  Set Enrichment Analysis

Dataset distribution

Number of genes

Gene Expression Level

The Kolmogorov–Smirnov test is used to determine whether two underlying one-dimensional probability distributions differ, or whether an underlying probability distribution differs from a hypothesized distribution, in either case based on finite samples.

The one-sample KS test compares the empirical distribution function with the cumulative distribution functionspecified by the null hypothesis. The main applications are testing goodness of fit with the normal and uniform distributions.

The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples.

Gene set 1 distribution

Gene set 2 distribution

http//:Fbioinfo.cnio.es/files/training/Fourth_Functional_Analysis_of_Gene_Expression/CNIO_GSEA_16_02_2010.p pt

(27)

NES NES

pval pval

FDR FDR

Benjamini‐Hochberg

ES(S) = Max(|Phit‐Pmiss|)   Kolmogorov‐Smirnoff (K‐S) distance

Gene  Set Enrichment Analysis

(28)

GSEA   Implementation

http://www.broadinstitute.org/gsea/msigdb/index.jsp

• Gene  set (.gmt files)

• GSEA  in MeV

• GSEA  for JAVA

• GSEA  for R

Something  else.. 

• Microarray  is dead! NGS is new king!

• Great!  My magic gene can link to cell cycle, 

apoptosis   and P53 pathway!

• Follow  what you’re interested and biological 

clues.

(29)

General  Workflow of Data Analysis

Data preprocessing – Normalization – Probe summarization 

– Signal adjustment (flag, flooring…)

– Signal transform (fold change, log transform)

Data QC/QA

– QC index (Hybridization control) – Visualization (PCA, MDS, Box plot..)

Find differentially expressed genes – Statistical analysis

p‐Value correction

Biological interpretation – Gene annotation – Gene set analysis – Gene network analysis

Array profiles

A few D.E. genes

Large‐scaled screening

Literature studies Biomarkers

Cellular Functions

Genes

The  End

(30)

Gene  expression omnibus

• NCBI  GEO ( http://www.ncbi.nlm.nih.gov/geo/)

参照

関連したドキュメント

Furthermore, the same techniques are applied to determine the tail probability density function for a ratio statistic, and for a sum with more than two lognormally distributed

One dimension is whether they are valid in the plane or in space, a second is whether we consider quasiballs, quasiconformal mappings and Apollonian bilipschitz mappings or

But before maximizing the entropy function one has to see whether the given moment values are consistent or not i.e whether there is any probability distribution which corresponds

This paper presents an investigation into the mechanics of this specific problem and develops an analytical approach that accounts for the effects of geometrical and material data on

Keywords: Cramér-Wold theorem, random projections, Hilbert spaces, goodness-of-fit tests, Kolmogorov-Smirnov projected test, single null hypothesis, two samples.. Mathematical

Exact asymptotic behavior is established for (a) the transition probability density of a real-valued Lévy process; (b) the transition probability density and the invariant

When dealing with both SDEs and RDEs, the main goals are to compute, exact or numerically, the solution stochastic process, say x(t), and its main statistical functions (mostly mean,

We will study the spreading of a charged microdroplet using the lubrication approximation which assumes that the fluid spreads over a solid surface and that the droplet is thin so