Introduction to Bioconductor
I. Overview of the Bioconductor Project
Bioinformatics and Biostatistics Lab., Seoul National Univ. Seoul, Korea
Eun-Kyung Lee
Outline
z What is R?
z Overview of the Bioconductor Project
z Microarray
z Installing R and Bioconductor
z R package structure
z Vignette
z Sweave
z Bioconductor software design
What is R?
z A language and environment for statistical computing and graphics
z GNU project
z Combined Math and Stat library
z Fine graphics
z Easy and efficient handling of data
z Rich modern statistical routines
Strengths of R
z Available on a wide variety of UNIX platforms and similar systems, Windows, and MacOS
z Easy to produce well-designed, publication-quality plots, including mathematical symbols and formulae
z For computationally-intensive tasks, C, C++, and Fortran code can be linked and called at run time.
z Able to write C code to manipulate R objects directly
Differences between R and the other statistical SW
z R environment
z characterized as a fully planned and coherent system
z command line interface (CLI)
z Preferred for power users
z Intimidating for beginners
z Longer learning curve
z Other SW
z an incremental accretion of very specific and inflexible tools
z Graphical user interface (GUI)
==> For novice users, a GUI is needed!
Bioconductor Project
z An open-source and open-development software project for the analysis of omics data
z R add-on packages
z Provide access to powerful statistical and graphical methods for the analysis
z Facilitate the integration of biological metadata from WWW
z Promote high-quality documentation and reproducible research
Bioconductor Packages
z Release 1.9 (Oct. 4, 2006) : 188 packages
z Designed for R 2.4.0
z Statistical methods : cluster analysis, estimation and multiple testing for linear and non-linear models, resampling, visualization, etc.
z Biological assays : cell-based assay, DNA microarray, proteomics, SAGE, SNP, etc
z Biological metadata from WWW : GenBank, GO, KEGG, PubMed, etc
z Interfaces with other languages : C, Java, Perl, Python, XML, etc.
Bioconductor Task View : SW
z Microarray
z one channel, two channel, data input, quality control, preprocessing, transcription, DNA copy number, SNPs and genetic variability
z affy, marray, multtest, limma, etc.
z Annotation
z GO, Pathways, Proprietary platforms, Report writing
z annotate, AnnBuilder,etc.
z Visualization
z chromoViz, arrayQCplot, etc
Bioconductor Task View : SW
z Statistics
z differential expression, clustering, classification, multiple comparison, time course, sequence matching
z limma, qvalue, affylmGUI, timecourse, etc
z GraphAndNetworks
z GeneTS, Rgraphviz, etc.
z Technology
z microarray, proteomics, Mass spectrometry, SAGE, Cell based assays, genetics
z Infrastructure
z Biobase, Rdbi, tkWidgets, widgetTools, etc.
Microarray
z DNA microarray chip
z cDNA chip : usually custom based chip
z Two channel
z One channel
z Oligonucleotide chip
z Affymetrix
z One channel, e.g., Agilent, Illumina
Central Dogma
cDNA (complementary DNA) : a DNA molecule made in vitro using mRNA as a template and the enzyme reverse transcriptase. A cDNA molecule therefore corresponds to a gene, but lacks the introns present in the DNA of the genome.
cDNA Microarrays
Affymetrix Microarrays
Microarray
z Preprocessing
z Background correction
z Normalization
z Summarization
Affymetrix : affy
Two-channel cDNA : marray
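The three preprocessing steps can be sketched in base R on a toy matrix. This is only an illustration under stated assumptions: the intensities are simulated, the background is taken to be a constant of 50, normalization is done by quantile normalization on the log2 scale, and all probes are treated as one probe set. It is not the procedure used by the affy or marray packages, whose own functions should be used for real data.

```r
# Toy intensity matrix: rows are probes, columns are arrays (simulated values).
set.seed(1)
raw <- matrix(rexp(20, rate = 1 / 200) + 50, nrow = 5,
              dimnames = list(paste0("probe", 1:5), paste0("array", 1:4)))

# Step 1, background correction (assumption: constant background of 50;
# values are floored at 1 so that the log transform stays defined).
bg_corrected <- pmax(raw - 50, 1)

# Step 2, normalization (assumption: quantile normalization on the log2 scale).
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  target <- rowMeans(apply(x, 2, sort))              # mean of the k-th smallest values
  out <- apply(ranks, 2, function(r) target[r])      # map each rank to the target value
  dimnames(out) <- dimnames(x)
  out
}
normalized <- quantile_normalize(log2(bg_corrected))

# Step 3, summarization (assumption: the five probes form one probe set;
# the probe-set value for each array is the median over its probes).
probe_set_summary <- apply(normalized, 2, median)
probe_set_summary
```

After quantile normalization every array has the same empirical distribution, which is why the four summary values coincide in this toy case.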
Microarray
z exprSet : class for microarray data and methods for processing them
z exprs : the observed expression levels.
z annotation : character string identifying the annotation that may be used for the exprSet instance
z description
z notes
z phenoData : containing the patient (or case) level data
Microarray
z Analysis
z Differential expression
z Graph and Networks
z Clustering
z Classification
z Multiple comparison
z Time course
z Sequence matching
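As a rough illustration of the first and fifth items (differential expression followed by multiple comparison), the sketch below runs a two-sample t-test per gene on simulated data and adjusts the p-values. The data, group labels, and effect size are made up for the example, and the Benjamini-Hochberg adjustment is just one common choice; packages such as limma and multtest provide more suitable methods for real experiments.

```r
set.seed(42)
n_genes <- 100
groups <- factor(rep(c("control", "treated"), each = 5))

# Simulated log-expression matrix: genes in rows, 10 arrays in columns.
expr <- matrix(rnorm(n_genes * 10), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Assumption for the example: the first 10 genes are truly up-regulated.
expr[1:10, groups == "treated"] <- expr[1:10, groups == "treated"] + 3

# Differential expression: one Welch t-test per gene.
p_raw <- apply(expr, 1, function(y) t.test(y ~ groups)$p.value)

# Multiple comparison: control the false discovery rate (Benjamini-Hochberg).
p_bh <- p.adjust(p_raw, method = "BH")

# Genes with the smallest adjusted p-values.
head(sort(p_bh))
```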
Installing R and Bioconductor
z R 2.4.0
z Download from CRAN
http://cran.r-project.org
z Bioconductor 1.9
z Download from the Bioconductor website : http://www.bioconductor.org
z Or install from within R
> source("http://www.bioconductor.org/getBioC.R")
> getBioC()
Starting and quitting R
z Start : R command
z Quit : q()
z Save : save the current environment with save.image()
z Working directory : getwd, setwd
z List objects : ls, objects
z Remove objects : rm, remove
z Search path : search, attach, detach
z Help : help(), ?
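A short session using the commands listed above might look like the following. The object name and the file name session.RData are arbitrary choices for the example, and the image is written to a temporary directory so that nothing in the working directory is overwritten.

```r
wd <- getwd()                          # current working directory
wd

x <- c(1.5, 2.5, 4)                    # create an object (example data)
ls()                                   # list objects in the workspace

# Save the current environment; the file name is an arbitrary choice.
image_file <- file.path(tempdir(), "session.RData")
save.image(file = image_file)

rm(x)                                  # remove the object again

path <- search()                       # packages and environments on the search path
path

# ?mean or help("mean") opens the help page for mean(); q() quits R.
```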
R package structure
z A structured collection of code, documentation and/or data
z files : DESCRIPTION, INDEX
z subdirectories
z R
z man
z doc
z src, data, demo, exec, inst
* package.skeleton
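The layout above can be generated with package.skeleton from the utils package. The sketch below is a minimal example, assuming a made-up package name (demoPkg) and a single toy function; it writes into a temporary directory. The exact set of files created may differ between R versions.

```r
# A toy function to put into the package (hypothetical example).
e <- new.env()
assign("add_one", function(x) x + 1, envir = e)

# Create the skeleton in a temporary directory.
pkg_parent <- tempfile("skeleton")
dir.create(pkg_parent)
utils::package.skeleton(name = "demoPkg", list = "add_one",
                        environment = e, path = pkg_parent)

# Inspect what was created: DESCRIPTION plus the R and man subdirectories.
pkg_dir <- file.path(pkg_parent, "demoPkg")
list.files(pkg_dir)
```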
How to make R package
z Linux/Unix
z R CMD check
z R CMD build
z Windows (need to install a couple of SW)
z Rcmd check
z Rcmd build
z Rcmd build --binary
* http://cran.r-project.org/doc/contrib/cross-build.pdf
How to install and load R package
INSTALL
z Linux/Unix
z R CMD INSTALL ---.tar.gz
z Windows
z click
LOAD
z library(package.name)
z packageDescription()
z .find.package()
z system.file()
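The loading and lookup functions above can be tried on any installed package; the sketch below uses stats, which ships with R. Note one substitution: it calls find.package(), the name used in current versions of R, in place of .find.package().

```r
library(stats)                               # attach an installed package

desc <- packageDescription("stats")          # parsed DESCRIPTION file
desc$Package
desc$Version

pkg_path <- find.package("stats")            # directory where the package is installed
desc_file <- system.file("DESCRIPTION", package = "stats")  # path to a file inside it
desc_file
```

system.file() returns an empty string when the requested file does not exist, which makes it convenient for checking whether a package ships a given file.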
Vignettes
z new documentation paradigm
z an executable document consisting of a collection of code chunks and documentation text chunks
z provide dynamic, integrated, and reproducible statistical documents that can be automatically updated if either data or analyses are changed
z Vignettes can be generated using the Sweave function from the R utils package
Vignettes
z Each Bioconductor package should contain at least one vignette, providing task-oriented descriptions of the package’s functionality.
z located in the doc subdirectory of an installed package
z accessible from the help browser, via the help.start function
z available separately from the Bioconductor website
Vignettes
z Biobase package – openVignette function
z menu of available vignettes and interface for viewing vignettes (PDF)
z tkWidgets package – vExplorer function
z interactive use of vignettes, stepping through code chunks
z reposTools package
Sweave
z allows the generation of dynamic, integrated, and reproducible statistical documents, intermixing text, code, and code output (text and graphics)
z source file : an executable document consisting of a collection of code chunks and documentation text chunks.
z utils package : functions
z Sweave, Stangle
Sweave
z Input (noweb file)
z Documentation text chunks
z start with @
z text in a markup language like LaTeX
z code chunks
z start with <<name>>=
z R code
z file extension
z .rnw, .Rnw, .snw, .Snw
Sweave
z Sweave function
z extracts the code chunks, runs them, and includes their output (text and graphs) in a .tex file and .eps or .pdf figure files
z Stangle function
z concatenates all the code chunks into a .R file
z Output (.tex or .pdf)
z a single document containing the documentation text, the R code, the code output (text and graphs)
z automatically regenerated whenever the data, code or documentation text change
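A minimal end-to-end run might look like the sketch below. The file name main.Rnw and the chunk contents are invented for the example; the code writes a tiny noweb file to a temporary directory, then calls Sweave and Stangle on it. Turning the resulting .tex file into a PDF additionally requires a LaTeX installation, which is not exercised here.

```r
# Write a small noweb source file: text chunks plus one named code chunk.
out_dir <- tempfile("sweave")
dir.create(out_dir)
rnw <- file.path(out_dir, "main.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of the integers 1 to 10 is computed below.",
  "<<mean-chunk>>=",
  "x <- 1:10",
  "mean(x)",
  "@",
  "\\end{document}"
), rnw)

# Sweave and Stangle write their output to the working directory.
old_wd <- setwd(out_dir)
utils::Sweave(rnw)     # runs the chunk and writes main.tex
utils::Stangle(rnw)    # extracts the code and writes main.R
setwd(old_wd)

list.files(out_dir)
```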
Sweave
z main.Rnw -> Sweave -> main.tex (with figures fig.pdf, fig.eps)
z main.Rnw -> Stangle -> main.R
z main.tex -> latex -> main.dvi -> dvips -> main.ps
z main.tex -> pdflatex -> main.pdf
Bioconductor SW design
z programming approaches used in Bioconductor
z Object-oriented S4 class/method framework
z to deal with data complexity,
z to represent and manipulate various data types
z Environments
z to provide mappings between different gene identifiers in the annotation metadata packages
z closures
z for software modularity and extensibility
Experimental Metadata
z gene expression measures
z scanned image (TIFF)
z image quantitation data (.gpr or .CEL )
z normalized gene * array matrices of expression measures ( log ratios or summary measures)
z reliability/quality information
z probe sequence information
z information on the target samples hybridized to the arrays : clinical covariates, experimental condition, etc.
Experimental Metadata
z Standard form
z MIAME : minimum information about a microarray experiment
z MAGE-ML : microarray gene expression
Annotation Metadata
z Biological attributes that can be applied to the experimental data
z for gene,
z chromosome location
z gene annotation (LocusLink, GO)
z relevant literature (PubMed)
z Biological metadata sets are large, of different types, evolving rapidly, and typically distributed via the WWW
z Packages : annotate, annaffy, AnnBuilder
Data complexity
z large p, small n
z dynamic/evolving data
z multiple data sources : WWW, in-house
z multiple data types
z quantitative
z qualitative
z text, graphical
z image, sound
z censored, missing, erroneous data
z various levels of processing
Object-Oriented Programming
z adopt the OOP paradigm in order to deal with the complexity of experimental and annotation metadata
z S4 class/method design allows efficient and reliable representation and manipulation of large and complex biological datasets of multiple types
z advantages of class/method design
z keep all relevant information in one object
z print, summary, accessor/assignment, subsetting, and more specialized methods
z Tools for programming using S4 : methods package
OOP : Classes
z provide a software abstraction of a real world object.
z It reflects how we think of certain objects and what information these objects should contain
z defined in terms of slots which contain the relevant data
z An object is an instance of a class
z A class defines the structure, inheritance, and initialization of objects
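The points above can be illustrated with a small S4 sketch. The class names ChipData and TwoChannelChipData and their slots are hypothetical and are not part of any Bioconductor package; they only show slots, inheritance, a validity check, and the creation of an instance.

```r
library(methods)

# A class is defined in terms of slots that hold the relevant data.
setClass("ChipData",
         representation = representation(id = "character",
                                         intensities = "numeric"))

# Inheritance: a two-channel chip is a chip with a second set of intensities.
# The validity function is checked whenever an object is initialized.
setClass("TwoChannelChipData",
         contains = "ChipData",
         representation = representation(reference = "numeric"),
         validity = function(object) {
           if (length(object@intensities) != length(object@reference))
             "intensities and reference must have the same length"
           else TRUE
         })

# An object is an instance of a class, created with new().
chip <- new("TwoChannelChipData", id = "slide1",
            intensities = c(120, 340), reference = c(100, 300))
is(chip, "ChipData")      # TRUE: the object is also a ChipData
chip@id                   # slot access

# An invalid object is rejected at initialization.
bad <- try(new("TwoChannelChipData", id = "slide2",
               intensities = c(1, 2, 3), reference = c(1, 2)),
           silent = TRUE)
inherits(bad, "try-error")
```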
OOP : Methods and Documentation
Methods
z function that performs an action on data
z define how a particular function should behave depending on the class of its arguments
z allow computations to be adapted to particular data types
Documentation
z special commands can be used to provide and access documentation for S4 classes and methods, using the type?topic syntax (e.g., class?exprSet)
z Methods available for a particular class are listed in the class help file
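A small sketch of method dispatch follows. The generic logExpression and the classes OneChannel and TwoChannel are invented for the example; the point is only that the same function call behaves differently depending on the class of its argument.

```r
library(methods)

# Two hypothetical data types.
setClass("OneChannel", representation = representation(signal = "numeric"))
setClass("TwoChannel", representation = representation(red = "numeric",
                                                       green = "numeric"))

# A generic function and one method per class.
setGeneric("logExpression", function(object) standardGeneric("logExpression"))
setMethod("logExpression", "OneChannel",
          function(object) log2(object@signal))          # log intensity
setMethod("logExpression", "TwoChannel",
          function(object) log2(object@red / object@green))  # log ratio

one <- logExpression(new("OneChannel", signal = c(8, 16)))             # 3 4
two <- logExpression(new("TwoChannel", red = c(8, 2), green = c(2, 2)))  # 2 0

# List the methods defined for the generic.
showMethods("logExpression")
```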
OOP : exprSet class
z defined in the Biobase package
z used to represent processed expression measures from either Affymetrix or two-color spotted microarrays
z slots
z exprs : matrix of expression measures
z se.exprs : matrix of SEs for expression measures
z phenoData : sample level covariates and responses
z description : MIAME information
z annotation : name of annotation data
z notes : any notes
OOP : phenoData class
z defined in the Biobase package
z used to keep track of information on target samples hybridized to the microarray
z varLabels : list of variable labels
z pData : data frame of sample-level variables (arrays x variables)
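The relationship between the expression matrix and the sample-level data can be sketched with plain base R objects. This does not use the Biobase classes themselves; the gene names, sample names, and values are made up, and the sketch only shows why the columns of the expression matrix and the rows of pData have to stay aligned.

```r
# Hypothetical expression measures: 3 genes (rows) on 4 arrays (columns).
exprs_mat <- matrix(c(7.1, 8.3, 6.9, 7.4,
                      5.2, 5.0, 9.8, 9.5,
                      6.6, 6.4, 6.7, 6.5),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("geneA", "geneB", "geneC"),
                                    c("s1", "s2", "s3", "s4")))

# Sample-level variables: one row per array, one column per variable.
pdata <- data.frame(group = c("control", "control", "treated", "treated"),
                    age = c(34, 51, 47, 29),
                    row.names = c("s1", "s2", "s3", "s4"))

# Variable labels describing each column of pdata.
var_labels <- list(group = "treatment group", age = "age in years")

# Arrays must appear in the same order in both objects.
stopifnot(identical(colnames(exprs_mat), rownames(pdata)))

# Subsetting samples has to be applied to both objects together.
treated <- rownames(pdata)[pdata$group == "treated"]
exprs_treated <- exprs_mat[, treated, drop = FALSE]
pdata_treated <- pdata[treated, , drop = FALSE]
```

Keeping both pieces in a single object, as exprSet does, removes the need to repeat this bookkeeping by hand.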
affy : Affymetrix Oligonucleotide chips
z affy package
z class definitions for probe-level data and basic methods for manipulating microarray objects (printing, plotting, subsetting, class conversions, etc.)
z AffyBatch : probe-level intensity data for a batch of arrays
z ProbeSet : PM, MM intensities for individual probe-sets
marray: two-channel spotted microarrays
z marray package
z class definitions for two-color spotted DNA microarray data and basic methods for manipulating microarray objects
z marrayLayout : information on microarray layout
z marrayRaw : pre-normalization intensity data for a batch of arrays (same layout)
z marrayNorm : post-normalization intensity data for a batch of arrays.
pubMedAbst class
z annotate package
z provide a pubMedAbst class for storing PubMed abstracts
z pmid
z authors
z abstText
z articleTitle
z journal
z pubDate
Annotation : matching IDs
z Accessing annotation information from databases such as GenBank, GO, or PubMed presupposes the ability to perform the following essential bookkeeping task
z mapping between the different identifiers (IDs) for a given gene.
z one GENENAME
z one GenBank accession number
z several different GO term IDs
z several different PubMed IDs
R Environments
z provide key-value mappings
z similar to hash tables in other languages
z The term key refers to the name of a variable, which can have different values in different environments.
z functions for working with environments include
z ls
z get, mget(base), multiget(Biobase)
z assign(base), multiassign(Biobase)
R Environments
z keys can be accessed using
z ls(name of the environment)
z Values can be accessed using
z get(key, envir=name of the environment)
z mget(keys, envir=name of the environment)
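The access functions above can be tried on a small environment built by hand. The probe identifiers and their mapping to gene symbols below are made up for illustration; in the annotation metadata packages such environments are provided ready-made.

```r
# Hypothetical probe-to-symbol mapping, for illustration only.
probe2symbol <- new.env(hash = TRUE)
assign("probe_001", "TP53", envir = probe2symbol)
assign("probe_002", "BRCA1", envir = probe2symbol)
assign("probe_003", "EGFR", envir = probe2symbol)

keys <- ls(probe2symbol)                                    # all keys
one_value <- get("probe_002", envir = probe2symbol)         # value for one key
several <- mget(c("probe_001", "probe_003"),
                envir = probe2symbol)                       # named list of values

keys
one_value
several
```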
Closures
z A closure consists of the body of the function along with an enclosing environment containing all variable bindings needed for evaluating the function.
z Closures facilitate software modularity and extensibility
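A small sketch of a closure follows. The function make_scaler is a made-up example: it returns a new function whose enclosing environment keeps the value computed when the closure was created.

```r
# Returns a function that subtracts the median of a reference sample.
make_scaler <- function(reference) {
  center <- median(reference)      # stored in the enclosing environment
  function(x) x - center           # the closure can still see 'center'
}

scale_to_controls <- make_scaler(c(2, 4, 9))

scaled <- scale_to_controls(c(4, 10))                 # 0 6
captured <- environment(scale_to_controls)$center     # 4
scaled
captured
```

Each call to make_scaler creates an independent closure, so different reference samples can be handled by different functions without any shared global state.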
Summary
z Overview of the Bioconductor Project
z Microarray
z R package structure
z Vignette
z Sweave
z Bioconductor software design
z Next session : focusing on Microarray data analysis