I. Overview of the Bioconductor Project

(1)

Introduction to Bioconductor

Bioinformatics and Biostatistics Lab., Seoul National Univ. Seoul, Korea

Eun-Kyung Lee

(2)

Outline

z What is R?

z Overview of the Bioconductor Project

z Microarray

z Installing R and Bioconductor

z R package structure

z Vignette

z Sweave

z Bioconductor software design

(3)

What is R?

z A language and environment for statistical computing and graphics

z GNU project

z Combined Math and Stat library

z Fine graphics

z Easy and efficient handling of data

z Rich modern statistical routines

(4)

Strengths of R

z Available on a wide variety of UNIX platforms and similar systems, Windows, and MacOS

z Easy to produce well-designed, publication-quality plots, including mathematical symbols and formulae

z For computationally-intensive tasks, C, C++, and Fortran code can be linked and called at run time.

z Able to write C code to manipulate R objects directly

(5)

Differences between R and the other statistical SW

z R environment

z characterized as a fully planned and coherent system

z command line interface (CLI)

z Preferred for power users

z Intimidating for beginners

z Longer learning curve

z Other SW

z an incremental accretion of very specific and inflexible tools

z Graphical user interface (GUI)

==> Novice users need a GUI!

(6)

Bioconductor Project

z An open-source and open-development software project for the analysis of omics data

z R add-on packages

z Provide access to powerful statistical and graphical methods for the analysis

z Facilitate the integration of biological metadata from WWW

z Promote high-quality documentation and reproducible research

(7)

Bioconductor Packages

z Release 1.9 (Oct. 4, 2006) : 188 packages

z Designed for R 2.4.0

z Statistical methods : cluster analysis, estimation and multiple testing for linear and non-linear models, resampling, visualization, etc.

z Biological assays : cell-based assay, DNA microarray, proteomics, SAGE, SNP, etc

z Biological metadata from WWW : GenBank, GO, KEGG, PubMed, etc

z Interfaces with other languages : C, Java, Perl, Python, XML, etc.

(8)

Bioconductor Task View : SW

z Microarray

z one channel, two channel, data input, quality control, preprocessing, transcription, DNA copy number, SNPs and genetic variability

z affy, marray, multtest, limma, etc.

z Annotation

z GO, Pathways, Proprietary platforms, Report writing

z annotate, AnnBuilder, etc.

z Visualization

z chromoViz, arrayQCplot, etc

(9)

Bioconductor Task View : SW

z Statistics

z differential expression, clustering, classification, multiple comparison, time course, sequence matching

z limma, qvalue, affylmGUI, timecourse, etc

z GraphAndNetworks

z GeneTS, Rgraphviz, etc.

z Technology

z microarray, proteomics, mass spectrometry, SAGE, cell-based assays, genetics

z Infrastructure

z Biobase, Rdbi, tkWidgets, widgetTools, etc.

(10)

Microarray

z DNA microarray chip

z cDNA chip : usually a custom-made chip

z Two channel

z One channel

z Oligonucleotide chip

z Affymetrix

z One channel, e.g. Agilent, Illumina

(11)

Central Dogma

(12)

cDNA Microarrays

cDNA (complementary DNA) : a DNA molecule made in vitro using mRNA as a template and the enzyme reverse transcriptase. A cDNA molecule therefore corresponds to a gene, but lacks the introns present in the DNA of the genome.

(13)

cDNA Microarrays

(14)

Affymetrix Microarrays

(15)

Affymetrix Microarrays

(16)

Microarray

z Preprocessing

z Background correction

z Normalization

z Summarization

Affymetrix : affy

Two-channel cDNA : marray
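The three preprocessing steps can be illustrated with a toy sketch in base R. This is not the algorithm implemented in affy or marray: the simulated intensities, the constant background of 100, and the grouping of four probes per gene are all assumptions made for the example.

```r
# Toy preprocessing sketch in base R (NOT the affy/marray implementation).
set.seed(1)
# Simulated raw intensities: 12 probes (rows) x 3 arrays (columns).
raw <- matrix(rexp(36, rate = 1 / 500) + 100, nrow = 12,
              dimnames = list(paste0("probe", 1:12), paste0("array", 1:3)))

# 1. Background correction: subtract an assumed constant background of 100
#    and floor at 1 so that log2 is defined.
bg <- pmax(raw - 100, 1)

# 2. Normalization: quantile normalization, so every array gets the same
#    empirical distribution (the mean of the sorted columns).
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out    <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}
norm <- quantile_normalize(log2(bg))

# 3. Summarization: combine probes into genes (4 probes per gene, assumed)
#    using the median of the normalized probe values.
gene <- rep(paste0("gene", 1:3), each = 4)
expr <- apply(norm, 2, function(col) tapply(col, gene, median))
expr
```

After step 2 every array has the same set of values, only in a different order; step 3 reduces the 12 x 3 probe matrix to a 3 x 3 gene matrix.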

(17)

Microarray

z exprSet : class for microarray data and methods for processing them

z exprs : the observed expression levels.

z annotation : character string identifying the annotation that may be used for the exprSet instance

z description

z notes

z phenoData : containing the patient (or case) level data

(18)

Microarray

z Analysis

z Differential expression

z Graph and Networks

z Clustering

z Classification

z Multiple comparison

z Time course

z Sequence matching
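As a rough illustration of differential expression followed by multiple comparison, the sketch below runs one t-test per gene on simulated data and then adjusts the p-values. It uses only base R functions (t.test and p.adjust), not limma, multtest, or qvalue; the data, group labels, and the choice of the Benjamini-Hochberg method are assumptions made for the example.

```r
# Toy differential-expression sketch: one t-test per gene, then
# multiple-testing adjustment. Simulated data; not a Bioconductor workflow.
set.seed(42)
n_genes <- 200
group <- factor(rep(c("control", "treated"), each = 5))
expr <- matrix(rnorm(n_genes * 10), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Assumption for the example: the first 10 genes are truly up-regulated.
expr[1:10, group == "treated"] <- expr[1:10, group == "treated"] + 3

# Welch two-sample t-test for each gene (row).
p_raw <- apply(expr, 1, function(y) t.test(y ~ group)$p.value)

# Benjamini-Hochberg adjustment controls the false discovery rate.
p_adj <- p.adjust(p_raw, method = "BH")

# Genes called differentially expressed at a 5% FDR.
names(p_adj)[p_adj < 0.05]
```

With 200 genes and only 10 arrays this is a small instance of the "large p, small n" setting, which is why the adjustment step matters.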

(19)

Installing R and Bioconductor

z R 2.4.0

z Download from CRAN

http://cran.r-project.org

z Bioconductor 1.9

z Download from the Bioconductor website : http://www.bioconductor.org

z From R

> source("http://www.bioconductor.org/getBioC.R")

> getBioC()

(20)

Starting and quitting R

z Start : R command

z Quit : q()

z Save : save the current environment (workspace) with save.image

z Working directory : getwd, setwd

z List objects : ls, objects

z Remove objects : rm, remove

z Search path : search, attach, detach

z Help : help(), ?
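A short session using the functions listed above might look like the following sketch. The object names and the file name session.RData are arbitrary choices for the example, and a temporary directory is used so that nothing is written to the user's own working directory.

```r
# Basic session housekeeping with base R functions.
old_dir <- getwd()          # current working directory
setwd(tempdir())            # switch to a scratch directory
x <- 1:10                   # create two objects in the workspace
y <- letters
ls()                        # list objects; includes "x" and "y"
rm(y)                       # remove one object
search()                    # packages and environments on the search path
save.image("session.RData") # save the workspace to a file
setwd(old_dir)              # restore the original working directory
# help(mean) or ?mean opens the help page; q() ends the session.
```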

(21)

R package structure

z A structured collection of code, documentation and/or data

z files : DESCRIPTION, INDEX

z subdirectories

z R

z man

z doc

z src, data, demo, exec, inst

* package.skeleton
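To see this layout on disk, package.skeleton can generate a starting point from objects in the current session. The sketch below is only an illustration: the package name toyPkg and the function hello are hypothetical, and the skeleton is written to a temporary directory.

```r
# Create a skeleton package in a temporary directory with package.skeleton().
# "toyPkg" and hello() are hypothetical names used only for this example.
hello <- function(name = "Bioconductor") paste("Hello,", name)

pkg_parent <- tempfile("skeleton_")
dir.create(pkg_parent)
utils::package.skeleton(name = "toyPkg", list = "hello",
                        environment = environment(), path = pkg_parent)

# Inspect the generated layout: DESCRIPTION plus the R and man subdirectories.
list.files(file.path(pkg_parent, "toyPkg"), recursive = TRUE)
```

The generated files are stubs; the DESCRIPTION fields and the help pages under man still have to be filled in by hand before the package passes a check.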

(22)

How to make R package

z Linux/Unix

z R CMD check

z R CMD build

z Windows (requires installing some additional software)

z Rcmd check

z Rcmd build

z Rcmd build --binary

* http://cran.r-project.org/doc/contrib/cross-build.pdf

(23)

How to install and load R package

INSTALL

z Linux/Unix

z R CMD INSTALL ---.tar.gz

z Windows

z click LOAD

z library(package.name)

z packageDescription()

z .find.package()

z system.file()
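The query functions above can be tried on any installed package. The sketch below uses stats, chosen only because it ships with every R installation; in current R versions the lookup function is spelled find.package() rather than .find.package().

```r
# Load an installed package and query its metadata and files.
# "stats" is used because it ships with every R installation.
library(stats)

desc <- utils::packageDescription("stats")
desc$Version                        # version string of the installed package

pkg_dir <- find.package("stats")    # directory where the package is installed
desc_file <- system.file("DESCRIPTION", package = "stats")  # file inside it
desc_file
```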

(24)

Vignettes

z new documentation paradigm

z an executable document consisting of a collection of code chunks and documentation text chunks

z provide dynamic, integrated, and reproducible statistical documents that can be automatically updated if either data or analyses are changed

z Vignettes can be generated using the Sweave function from the R utils package

(25)

Vignettes

z Each Bioconductor package should contain at least one vignette, providing task-oriented descriptions of the package’s functionality.

z located in the doc subdirectory of an installed package

z accessible from the help browser, via the help.start function

z available separately from the Bioconductor website

(26)

Vignettes

z Biobase package – openVignette function

z menu of available vignettes and interface for viewing vignettes (PDF)

z tkWidgets package – vExplorer function

z interactive use of vignettes, stepping through code chunks

z reposTools package

(27)

Sweave

z allows the generation of dynamic, integrated, and reproducible statistical documents, intermixing text, code, and code output (text and graphics)

z source file : an executable document consisting of a collection of code chunks and documentation text chunks.

z utils package : functions

z Sweave, Stangle

(28)

Sweave

z Input (noweb file)

z Documentation text chunks

z start with @

z text in a markup language like LaTeX

z code chunks

z start with <<name>>=

z R code

z file extension

z .rnw, .Rnw, .snw, .Snw

(29)

Sweave

z Sweave function

z extracts the code chunks, runs them, and includes their output (text and graphs) in a .tex file and .eps or .pdf figure files

z Stangle function

z concatenates all the code chunks into a .R file

z Output (.tex or .pdf)

z a single document containing the documentation text, the R code, the code output (text and graphs)

z automatically regenerated whenever the data, code or documentation text change
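The behavior of the two functions can be checked on a minimal noweb file. The sketch below writes a small main.Rnw to a temporary directory and runs Stangle and Sweave on it; the file name and chunk name are arbitrary, and the LaTeX step that would turn main.tex into a PDF is not run here.

```r
# Write a minimal noweb (.Rnw) file, then run Stangle and Sweave on it.
# File names are arbitrary; only the .tex step is run, not LaTeX itself.
old_dir <- setwd(tempdir())

rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of a small data vector is computed below.",
  "<<summary>>=",
  "x <- c(2, 4, 6, 8)",
  "mean(x)",
  "@",
  "\\end{document}"
)
writeLines(rnw, "main.Rnw")

utils::Stangle("main.Rnw")   # extracts the code chunks into main.R
utils::Sweave("main.Rnw")    # runs the chunks and writes main.tex

tex <- readLines("main.tex")
setwd(old_dir)
```

The resulting main.tex contains the documentation text, the echoed R code, and the printed result of mean(x), so rerunning Sweave after changing the data regenerates the document.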

(30)

Sweave

main.Rnw --Stangle--> main.R

main.Rnw --Sweave--> main.tex, fig.pdf, fig.eps

main.tex --latex--> main.dvi --dvips--> main.ps

main.tex --pdflatex--> main.pdf

(31)

Bioconductor SW design

z programming approaches used in Bioconductor

z Object-oriented S4 class/method framework

z to deal with data complexity,

z to represent and manipulate various data types

z Environments

z to provide mappings between different gene identifiers in the annotation metadata packages

z closures

z for software modularity and extensibility

(32)

Experimental Metadata

z gene expression measures

z scanned image (TIFF)

z image quantitation data (.gpr or .CEL )

z normalized gene × array matrices of expression measures (log ratios or summary measures)

z reliability/quality information

z probe sequence information

z information on the target samples hybridized to the arrays : clinical covariates, experimental condition, etc.

(33)

Experimental Metadata

z Standard form

z MIAME : minimum information about a microarray experiment

z MAGE-ML : microarray gene expression

(34)

Annotation Metadata

z Biological attributes that can be applied to the experimental data

z for gene,

z chromosome location

z gene annotation (LocusLink, GO)

z relevant literature (PubMed)

z Biological metadata sets are large, of different types, evolving rapidly, and typically distributed via the WWW

annotate, annaffy, AnnBuilder

(35)

Data complexity

z large p, small n

z dynamic/evolving data

z multiple data sources : WWW, in-house

z multiple data types

z quantitative

z qualitative

z text, graphical

z image, sound

z censored, missing, erroneous data

z various levels of processing

(36)

Object-Oriented Programming

z adopt the OOP paradigm in order to deal with the complexity of experimental and annotation metadata

z S4 class/method design allows efficient and reliable representation and manipulation of large and complex biological datasets of multiple types

z advantages of class/method design

z keep all relevant information in one object

z print, summary, accessor/assignment, subsetting, and more specialized methods

z Tools for programming using S4 : methods package

(37)

OOP : Classes

z provide a software abstraction of a real world object.

z It reflects how we think of certain objects and what information these objects should contain

z defined in terms of slots which contain the relevant data

z An object is an instance of a class

z A class defines the structure, inheritance, and initialization of objects
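These ideas can be sketched with the methods package. The classes below are hypothetical toy classes written for this example, not the Biobase exprSet class; they show slots, a validity check run at initialization, inheritance, and creating an instance with new().

```r
library(methods)

# A toy class with two slots; "ToyExprSet" is a hypothetical name,
# not the Biobase exprSet class.
setClass("ToyExprSet",
         representation = representation(exprs = "matrix",
                                         pheno = "data.frame"),
         validity = function(object) {
           if (ncol(object@exprs) != nrow(object@pheno))
             "number of arrays in 'exprs' and rows in 'pheno' differ"
           else TRUE
         })

# A subclass inherits the slots and adds one of its own.
setClass("ToyAnnotatedSet", contains = "ToyExprSet",
         representation = representation(annotation = "character"))

# An object is an instance of a class, created with new().
obj <- new("ToyAnnotatedSet",
           exprs = matrix(rnorm(6), nrow = 3),
           pheno = data.frame(group = c("a", "b")),
           annotation = "toyChip")

is(obj, "ToyExprSet")   # TRUE: inheritance
slotNames(obj)          # slots defined by the class and its parent
```

Because the validity function is part of the class definition, an object whose expression matrix and sample table disagree in size cannot be created in the first place.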

(38)

OOP : Methods and Documentation

Methods

z function that performs an action on data

z define how a particular function should behave depending on the class of its arguments

z allow computations to be adapted to particular data types

Documentation

z special commands can be used to provide and access documentation for S4 classes and methods, using the type?topic syntax (for example, class?exprSet).

z Methods available for a particular class are listed in the class help file
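Method dispatch can be illustrated with a small sketch. The two classes and the generic logExpr are hypothetical names invented for this example; the point is only that the same function call behaves differently depending on the class of its argument.

```r
library(methods)

# Two hypothetical classes used only to demonstrate dispatch.
setClass("OneChannel", representation(intensity = "numeric"))
setClass("TwoChannel", representation(red = "numeric", green = "numeric"))

# A generic function declares the action; methods supply the behavior
# for each class.
setGeneric("logExpr", function(object) standardGeneric("logExpr"))

setMethod("logExpr", "OneChannel",
          function(object) log2(object@intensity))
setMethod("logExpr", "TwoChannel",
          function(object) log2(object@red / object@green))

one <- logExpr(new("OneChannel", intensity = c(8, 16)))            # 3 4
two <- logExpr(new("TwoChannel", red = c(8, 2), green = c(2, 8)))  # 2 -2

# List the methods defined for the generic.
showMethods("logExpr")
```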

(39)

OOP : exprSet class

z defined in the Biobase package

z used to represent processed expression measures from either Affymetrix or two-color spotted microarrays

z slots

z exprs : matrix of expression measures

z se.exprs : matrix of SEs for expression measures

z phenoData : sample level covariates and responses

z description : MIAME information

z annotation : name of annotation data

z notes : any notes

(40)

OOP : phenoData class

z defined in the Biobase package

z used to keep track of information on target samples hybridized to the microarray

z varLabels : list of variable labels

z pData : data frame of sample-level variables (arrays × variables)

(41)

affy : Affymetrix Oligonucleotide chips

z affy package

z class definitions for probe-level data and basic methods for manipulating microarray objects (printing, plotting, subsetting, class conversions, etc.)

z AffyBatch : probe-level intensity data for a batch of arrays

z ProbeSet : PM, MM intensities for individual probe-sets

(42)

marray: two-channel spotted microarrays

z marray package

z class definitions for two-color spotted DNA microarray data and basic methods for manipulating microarray objects

z marrayLayout : information on microarray layout

z marrayRaw : pre-normalization intensity data for a batch of arrays (same layout)

z marrayNorm : post-normalization intensity data for a batch of arrays.

(43)

pubMedAbst class

z annotate package

z provide a pubMedAbst class for storing PubMed abstracts

z pmid

z authors

z abstText

z articleTitle

z journal

z pubDate

(44)

Annotation : matching IDs

z Accessing annotation information from databases such as GenBank, GO, or PubMed presupposes the ability to perform the following essential bookkeeping task

z mapping between the different identifiers (IDs) for a given gene.

z one GENENAME

z one GenBank accession number

z several different GO term IDs

z several different PubMed IDs

(45)

R Environments

z provide key-value mappings

z similar to hash tables in other languages

z The term key refers to the name of a variable, which can have different values in different environments.

z functions for working with environments include

z ls

z get, mget(base), multiget(Biobase)

z assign(base), multiassign(Biobase)

(46)

R Environments

z keys can be accessed using

z ls(name of the environment)

z Values can be accessed using

z get(key, envir=name of the environment)

z mget(keys, envir=name of the environment)
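A minimal sketch of an environment used as a key-value store follows. It uses only base R functions; the probe IDs and gene symbols are invented for the example and do not come from any annotation package.

```r
# An environment used as a hash table from probe IDs to gene symbols.
# The IDs and symbols are invented for this example.
probe2symbol <- new.env(hash = TRUE)

assign("probe_001", "GENE_A", envir = probe2symbol)
assign("probe_002", "GENE_B", envir = probe2symbol)
assign("probe_003", c("GENE_C", "GENE_D"), envir = probe2symbol)  # one-to-many

keys <- sort(ls(probe2symbol))                       # the keys
one  <- get("probe_002", envir = probe2symbol)       # a single value: "GENE_B"
many <- mget(c("probe_001", "probe_003"), envir = probe2symbol)  # a named list
exists("probe_999", envir = probe2symbol, inherits = FALSE)      # FALSE
```

Because a value can be a vector of any length, one key can map to several IDs, which matches the one-to-many relationships that arise when matching gene identifiers.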

(47)

Closures

z A closure consists of the body of the function along with an enclosing environment containing all variable bindings needed for evaluating the function.

z Closures facilitate software modularity and extensibility
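A function factory gives a small illustration of a closure. The names make_centerer and center_on_controls are hypothetical and exist only for this sketch.

```r
# A function factory: each returned function is a closure that keeps the
# value computed from 'reference' in its enclosing environment.
make_centerer <- function(reference) {
  center <- median(reference)       # captured by the closure
  function(x) x - center
}

center_on_controls <- make_centerer(c(5, 7, 9))
centered <- center_on_controls(c(7, 10))   # 0 3

# The enclosing environment holds the bindings the function needs.
ls(environment(center_on_controls))
```

Different calls to the factory produce independent functions, each with its own stored center, which is how closures support modular and extensible code.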

(48)

Summary

z Overview of the Bioconductor Project

z Microarray

z R package structure

z Vignette

z Sweave

z Bioconductor software design

z Next session : focusing on Microarray data analysis