Functional annotation and gene set analysis
General workflow
• Data preprocessing
– Normalization
– Signal adjustment (flag, flooring…)
– Probe summarization
– Signal transform (fold change, log transform)
• Data QC/QA
– QC index (hybridization control)
– Visualization (PCA, MDS, box plot…)
• Find differentially expressed genes
– Statistical analysis
– Set cut-off
• Biological interpretation
– Gene annotation
– Gene set analysis
– Gene network analysis
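The flooring and log-transform steps in the preprocessing list can be sketched as follows. This is a minimal illustration, not the slides' own pipeline: the function names, the floor value and the toy signals are all assumptions made for the example.

```python
import math

def floor_and_log2(signals, floor=1.0):
    """Floor raw signals at `floor`, then log2-transform them."""
    return [math.log2(max(value, floor)) for value in signals]

def log2_fold_change(test_signals, control_signals, floor=1.0):
    """Difference of mean log2 signals (test minus control)."""
    test = floor_and_log2(test_signals, floor)
    control = floor_and_log2(control_signals, floor)
    return sum(test) / len(test) - sum(control) / len(control)

# Toy probe measured on three control and three test arrays (made-up numbers).
control = [100.0, 120.0, 80.0]
test = [400.0, 480.0, 320.0]
lfc = log2_fold_change(test, control, floor=16.0)
print(f"log2 fold change = {lfc:.2f}")
```

Flooring before the log transform keeps very low or zero signals from producing large negative or undefined values.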
[Figure: array profiles → a few D.E. genes → large-scale screening, literature studies, biomarkers (cellular functions, genes)]
Biological interpretation
• Functional annotation of individual genes
– Gene ontology (GO, KEGG…)
– Pathway (KEGG, BIOCARTA, GenMAPP)
– OMIM
– Database gateway (GeneCards, UniProt…)
• Gene set analysis
– Functional annotation enrichment analysis
• DAVID, GOTM, …
– Gene set enrichment analysis
• GSEA
Gene Ontology
• Systematic vocabulary of biological functions
• Acyclic network structure
• Gene association
www.geneontology.org
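Because the vocabulary is an acyclic network rather than a tree, a term can have several parents, and a gene annotated to a term is usually also counted for all of that term's ancestors. The sketch below shows that propagation on a toy ontology. The term IDs, gene names and function names are hypothetical and are not real GO identifiers.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in a DAG given as child -> list of parents."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def propagate(gene_to_terms, parents):
    """Expand direct annotations so each gene also carries all ancestor terms."""
    expanded = {}
    for gene, terms in gene_to_terms.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene] = full
    return expanded

# Hypothetical toy ontology: TERM:D has two parents, which a tree could not express.
toy_parents = {
    "TERM:B": ["TERM:A"],
    "TERM:C": ["TERM:A"],
    "TERM:D": ["TERM:B", "TERM:C"],
}
toy_annotations = {"geneX": ["TERM:D"], "geneY": ["TERM:C"]}
expanded = propagate(toy_annotations, toy_parents)
print(sorted(expanded["geneX"]))
```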
KEGG: Kyoto Encyclopedia of Genes and Genomes
www.genome.jp/kegg/
The Reactome Project
OMIM ‐ Online Mendelian Inheritance in Man
http://www.ncbi.nlm.nih.gov/omim
Gene Expression Omnibus
• NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/)
[Figure: D.E. genes and array probe descriptions are looked up in annotation databases through a database gateway]
GeneCards
• http://www.genecards.org/
UniProt
• http://www.uniprot.org/
Biological interpretation
• Functional annotation of individual genes
– Gene ontology (GO, KEGG…)
– Pathway (KEGG, BIOCARTA, GenMAPP)
– OMIM
– Database gateway (GeneCards, Swiss-Prot…)
• Gene set analysis
– Functional annotation enrichment analysis
• DAVID, GOTM, …
– Gene set enrichment analysis
• GSEA
What happens in cellular function?
• Individual gene studies
– Find differentially expressed genes (DEGs)
• Gene-phenotype correlation (t-test, ANOVA, limma…)
– Predict functional role from annotation (PubMed, OMIM, GeneCards…)
– Validate function by genetic manipulation (over-expression, knock-down…)
• The risks of the straightforward strategy
– Which DEG?
• A trial-and-error game
– Which function?
• Are you checking the right functional assay?
– Does it work?
• Functional regulation is teamwork.
[Figure: cross-talk between cellular functions (labels: Function 1, Function 2, 3)]
Expression Analysis Systematic Explorer (EASE)
• Also known as:
– Functional annotation enrichment analysis
– Gene Ontology analysis
• Interpretation of gene list
– Gene Ontology
– KEGG pathway
– Biocarta pathway
• Statistical methods
– Fisher's exact test
– Hypergeometric test
– Binomial test
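To make the listed tests concrete, the sketch below computes upper-tail p-values for "at least k genes of the list fall in the gene set" under a hypergeometric model (sampling without replacement) and under a binomial model (independent draws). It uses only the Python standard library. The function names and the toy counts are assumptions for illustration, not values from the slides.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N, K of them in the set."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total

def binomial_upper_tail(k, n, p):
    """P(X >= k) for n independent draws, each in the set with probability p."""
    lower = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - lower

# Made-up example: universe of 20,000 genes, a 100-gene set,
# and a 200-gene list containing 5 members of the set.
N, K, n, k = 20000, 100, 200, 5
p_hyper = hypergeom_upper_tail(k, n, K, N)
p_binom = binomial_upper_tail(k, n, K / N)
print(f"hypergeometric p = {p_hyper:.4f}, binomial p = {p_binom:.4f}")
```

When the gene list is small relative to the universe, the two models give similar answers. The binomial version here subtracts the lower tail from 1, so it loses precision for extremely small p-values.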
The Database for Annotation, Visualization and Integrated Discovery (DAVID)
• ID conversion
• Gene annotation
• EASE analysis
• ….
Fisher Exact Test
• When members of two independent groups can fall into one of two mutually exclusive categories, Fisher's exact test is used to determine whether the proportions falling into each category differ by group. In the DAVID annotation system, Fisher's exact test is adopted to measure gene enrichment in annotation terms.
A hypothetical example:
In the human genome background (30,000 genes in total), 40 genes are involved in the p53 signalling pathway. In a given gene list, 3 out of 300 genes belong to the p53 signalling pathway. We then ask whether 3/300 is more than expected by random chance compared with the human background of 40/30,000.
A 2x2 contingency table is built from the above numbers:

                  User Genes    Genome
In Pathway                 3        40
Not In Pathway           297     29960

Fisher exact P-value = 0.008. Since the P-value <= 0.01, this user gene list is more strongly associated with (enriched in) the p53 signalling pathway than expected by random chance.
Adapted from DAVID (http://niaid.abcc.ncifcrf.gov/)
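The p53 example can be checked with a short standard-library sketch of a one-sided Fisher exact test. The function name is an assumption, and this is not DAVID's own code. The result should be close to the 0.008 quoted above; small differences can come from rounding or from how the tail is defined.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose top-left cell is at least `a`.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denominator = comb(total, col1)
    numerator = sum(
        comb(row1, x) * comb(total - row1, col1 - x)
        for x in range(a, min(row1, col1) + 1)
    )
    return numerator / denominator

# Table from the slide: rows = in / not in pathway, columns = user genes / genome.
p_value = fisher_exact_greater(3, 40, 297, 29960)
print(f"one-sided Fisher exact p = {p_value:.4f}")
```

Exact integer arithmetic via `math.comb` avoids overflow in the very large binomial coefficients. For routine use, a statistics library would be the more usual choice.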
[Figure: Class A vs. Class B comparison with a t-test cut-off (FDR < 0.05 for each group of selected genes)]
Biological meaning?
Pitfalls of EASE
• Cut-off issues
• Gene set issues
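The cut-off issue can be illustrated with a sketch of Benjamini-Hochberg adjustment, the FDR procedure named later with the GSEA results. The p-values below are made up, and the function name is an assumption. The point is only that the gene list handed to EASE changes size with the chosen cut-off.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Made-up per-gene p-values, purely for illustration.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.061, 0.074, 0.205, 0.212, 0.216]
toy_fdr = benjamini_hochberg(toy_p)
for cutoff in (0.05, 0.10, 0.20):
    selected = [i for i, q in enumerate(toy_fdr) if q < cutoff]
    print(f"FDR < {cutoff}: {len(selected)} genes selected")
```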
Gene set approaches
• From gene list (IGA)
– Expression Analysis Systematic Explorer
• From gene set score (GSA)
– Gene Set Enrichment Analysis
– event-based Gene Set Analysis
Gene Set Enrichment Analysis
http://www.pnas.org/content/102/43/15545.abstract
[Figure: genes ordered between Class A and Class B with the t-test cut-off marked, and the positions of Gene Sets 1-3 scored by the ES/NES statistic (- to +). Gene set 2 is enriched in Class A; gene set 3 is enriched in Class B.]
Gene Set Enrichment Analysis
[Figure: dataset distribution; axes: gene expression level vs. number of genes]
The Kolmogorov–Smirnov test is used to determine whether two underlying one-dimensional probability distributions differ, or whether an underlying probability distribution differs from a hypothesized distribution, in either case based on finite samples.
The one-sample KS test compares the empirical distribution function with the cumulative distribution function specified by the null hypothesis. The main applications are testing goodness of fit with the normal and uniform distributions.
The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples.
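The two-sample statistic described above is the largest vertical gap between the two empirical cumulative distribution functions. A minimal standard-library sketch follows. The function name and the toy samples are assumptions, and only the statistic is computed, not its p-value.

```python
def ks_two_sample_statistic(sample_a, sample_b):
    """Maximum absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d_max = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every observation equal to x in both samples.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d_max = max(d_max, abs(i / len(a) - j / len(b)))
    return d_max

# Toy samples: the second is the first shifted upwards by 2.
shifted = ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6])
print(f"D = {shifted}")
```

Once one sample is exhausted its empirical CDF is already 1, so the gap can only shrink afterwards and the loop can stop.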
[Figure panels: gene set 1 distribution; gene set 2 distribution]
http://bioinfo.cnio.es/files/training/Fourth_Functional_Analysis_of_Gene_Expression/CNIO_GSEA_16_02_2010.ppt
[Figure: GSEA result columns NES, pval and FDR (Benjamini-Hochberg)]
$ES(S) = \max \left| P_{hit} - P_{miss} \right|$ (Kolmogorov-Smirnov (K-S) distance)
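The enrichment score can be sketched as a running sum along the ranked gene list: P_hit steps up at each gene in the set, P_miss steps up at each gene outside it, and ES is the largest deviation between them. The version below is unweighted, which is the form that corresponds to the K-S distance. The published GSEA method additionally weights hits by their ranking metric, and that weighting is not shown here. The function and gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum ES: the signed P_hit - P_miss of largest magnitude."""
    members = set(gene_set)
    n_hit = sum(1 for gene in ranked_genes if gene in members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    p_hit = p_miss = 0.0
    best = 0.0
    for gene in ranked_genes:
        if gene in members:
            p_hit += 1.0 / n_hit
        else:
            p_miss += 1.0 / n_miss
        if abs(p_hit - p_miss) > abs(best):
            best = p_hit - p_miss
    return best

# Hypothetical ranked list g1..g10 (g1 = most up-regulated in one class).
ranked = [f"g{i}" for i in range(1, 11)]
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})
es_bottom = enrichment_score(ranked, {"g8", "g9", "g10"})
print(f"top set ES = {es_top:.2f}, bottom set ES = {es_bottom:.2f}")
```

Keeping the sign shows which end of the ranking the set concentrates at, while the magnitude is the K-S style distance.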
Gene Set Enrichment Analysis
Practice of GSEA
http://www.broadinstitute.org/gsea/msigdb/index.jsp
• Gene set (.gmt files)
• GSEA in MeV
• GSEA for Java
• GSEA for R
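Gene set (.gmt) files are tab-delimited text with one gene set per line: the set name, a description field, then the member genes. A minimal reader could look like the sketch below. The function name and the example content are hypothetical, not taken from MSigDB.

```python
def parse_gmt(lines):
    """Parse .gmt lines into {set_name: (description, [genes])}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, description, genes = fields[0], fields[1], fields[2:]
        gene_sets[name] = (description, [g for g in genes if g])
    return gene_sets

# Hypothetical two-line .gmt content (set names and genes are made up).
example = [
    "TOY_SET_A\tna\tGENE1\tGENE2\tGENE3\n",
    "TOY_SET_B\thttp://example.org/toy_b\tGENE2\tGENE4\n",
]
sets = parse_gmt(example)
print(sets["TOY_SET_A"][1])
```

The same function accepts an open file handle, since iterating over a text file yields its lines.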
The problem of heterogeneous gene expression
• Simple regulatory mechanism
• Multiple regulatory routes
– Multiple regulatory pathways
• Pathway 1
• Pathway 2
– Multiple regulatory points
• Pathway 1 (points 1-3)
• Pathway 2 (points 1-n)
• Noisy crosstalk
[Figure: a gene set producing a functional change through Pathway 1 (points 1-3) and Pathway 2, alongside noisy crosstalk and other genes]
event‐based Gene Set Analysis
Regulatory event‐based approach
[Figure: up- and down-regulatory events (Up-RE, Down-RE) called per gene (Gene 1 … Gene X) of a functional gene set across samples 1-6; RE f = (3 - 1)/6]
[Figure: control vs. test; a gene set producing a functional change through Pathway 1 (points 1-3) and Pathway 2, alongside noisy crosstalk and other genes]
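The "RE f (3-1)/6" label in the regulatory event figure reads as a frequency: up events minus down events, divided by the number of samples. The sketch below encodes that reading only. The +1/-1/0 event encoding and the function name are assumptions, and how eGSA actually calls an event for a given sample is not specified on the slides.

```python
def re_frequency(events):
    """(number of up events - number of down events) / number of samples."""
    if not events:
        raise ValueError("at least one sample is required")
    n_up = sum(1 for e in events if e > 0)
    n_down = sum(1 for e in events if e < 0)
    return (n_up - n_down) / len(events)

# Six samples, as in the figure: three up events, one down event, two unchanged.
# The encoding (+1 up, -1 down, 0 none) is hypothetical.
gene1_events = [+1, +1, +1, -1, 0, 0]
frequency = re_frequency(gene1_events)
print(f"RE frequency = {frequency:.3f}")
```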
event‐based Gene Set Analysis
Q1: sample randomized
Organizing HCC functional map
• Map structure
– GO term structure
• Map refinement
– Branch-reducing logic
• Map presentation
– Cytoscape
4 stages of HCV‐induced hepatocarcinogenesis
Event table of HCC1 data set
KEGG Cell Cycle Pathway
[Figure panels: A. Workflow of eGSA-R; B. Function pattern of HCV-induced HCCs; C. Event heatmap of GO:0045087]
[Panel A labels: user data (e.g. GSE6764); eGSA.Read(); eGSA.Dataset (gene signal, sample info); eGSA() with Event(), NormDist(), GeneSummary; gene signal table, p-values table, gene set table; changed p-values; events; FetchEvent(); WriteGoMap()]