HITS-‐CLIP

(Licatalosi+ 2010)

データをどうやって処理するか？

•  Base calling

•  Read mapping

•  Data complexity reducnon

–  Peek detecnon

•  Data integranon

–  Data visualizanon

–  Unsupervised integranon

–  Supervised integranon

Base calling

• 

次世代シークエンサーはエラー率が比較的高い。（

Illumina GAII

の場合

1%

ぐらい）

•  base quality

を一緒に出力する。

• 

エラー補正アルゴリズムが多数開発されている。

Read mapping

• 

ここ数年で最も開発競争が激しい分野

•  Hash

を基にしたツール

–  SeqMap, MAQ, SOAP, Stampy, Novoalign

•  Suﬃx Array/BWT

を基にしたツール

–  SOAP2, Bowne, BWA, BWA-‐SW

Read mapping

• 

しかし、課題はまだ沢山ある。

– 

ギャップ付アラインメント

–  pair-‐end read

のアラインメント

–  base quality

を考慮したアラインメント

– 

長いリードのアラインメント

–  SOLiD

が出力する

color-‐code read

–  bisulﬁte

処理されたリードのアラインメント

–  spliced read, muln read

のアラインメント

Peek detecnon

• 

シグナルプロファイルを作る。

• 

バックグランドモデルを用意する。

• 

バックグランドと比較して有意な領域をコールする。

(Pepke+ 2009)

Data visualizanon

(Hawkins+ 2010)

Database biology

• 

ゲノムブラウザを見るといろいろなことがわかる。

– 

配列に対して様々なデータが張り付けられている。

– 

ゲノムはただの座標軸に過ぎない。

Unsupervised integranon

• 

与えられたデータにおける特徴的なパターンを教師なし学習アルゴリズムで発見する。

– 

クラスタリング

– 

主成分分析

– 

頻出パターンマイニング

Supervised integranon

• 

発見したパターンを仮説としてモデルを構築し、教師あり学習アルゴリズムで検証する。

– 

相関分析

– 

回帰分析

– 

ベイジアンネットワーク

どのようなことがわかるのか？

• 

ゲノムの機能アノテーション

• 

遺伝多型の機能推定

• 

遺伝子制御メカニズムの解明

ゲノムの機能アノテーション

• 

転写因子結合部位やヒストン修飾によりゲノムに機能アノテーションをすることができる。

(Hawkins+ 2010)

遺伝多型の機能推定

•  Genome-‐wide associanon study (GWAS)

– 

一塩基多型

(SNPs)

などの遺伝多型をゲノムワイドに解析する。

– 

パーソナルゲノムによる

mulnple rare variant

解析

•  non-‐coding region

の

SNPs

が

ChIP-‐seq

で多く見つかっている。

à regulatory SNPs

(Hawkins+ 2010)

遺伝子制御メカニズムの解明

•  transcriptomic data

と

epigenomic data

の統合はとても強力

à

エピゲノム

– 

ヒストンコードの解読

– 

ゲノムインプリンティング

–  iPS

細胞、ガン遺伝子

•  RNA-‐seq

による

exon expression data

と、

HITS-‐

CLIP

による

splicing factor

と

mRNA

の

interactome data

を統合して、スプライシング

機構を解明する。

Case study (1)

• 

ヒストン修飾データを用いると転写因子結合部位の予測精度が向上する。

(Sato+ 2010)

転写因子結合部位予測

• 

これまでの方法

–  Posinon Weight Matrix (PWM)

を使って配列データから予測する。

–  False posinve

（間違えて予測する）がとても多い。

– 

組織特異的な発現を説明できない。

本手法

• 

ヒストン修飾や配列保存度などを統合して予測精度の向上を図る。

– 

ヒストン修飾（メチル化、アセチル化など）

–  RNA Pol II

の結合

– 

転写開始点（

AceView

）

– 

配列保存度（

phastCons

）

–  PWM

スコア

本手法

• 

機械学習を使ってデータを統合する。

–  Naïve Bayes

–  Random Forest with Naïve Bayes Trees

結果

• 

予測精度の

AUC

（

1

に近いほど良い）

結果

Case Study (2)

• 

ヒストン修飾データを多変量

HMM

でモデル化し、ヒトゲノムの機能アノテーションを行う。

This resulted in 90 chromatin maps corresponding to ,2,400,000,000 reads covering ,100,000,000,000 bases across nine cell types, which we set out to interpret computationally.

Learning a common set of chromatin states across cell types To summarize these data sets into nine readily interpretable annota-tions, one per cell type, we applied a multivariate hidden Markov model that uses combinatorial patterns of chromatin marks to distin-guish chromatin states⁸. The approach explicitly models mark com-binations in a set of ‘emission’ parameters and spatial relationships between neighbouring genomic segments in a set of ‘transition’ para-meters (Methods). It has the advantage of capturing regulatory ele-ments with greater reliability, robustness and precision than is possible by studying individual marks⁸.

We learned chromatin states jointly by creating a virtual conca-tenation of all chromosomes from all cell types. We selected 15 states that showed distinct biological enrichments and were consistently recovered (Fig. 1a, b and Supplementary Fig. 1). Even though states

were learnedde novosolely on the basis of the patterns of chromatin marks and their spatial relationships, they showed distinct associa-tions with transcriptional start sites (TSSs), transcripts, evolutionarily conserved non-coding regions, DNase hypersensitive sites¹², binding sites for the regulators c-Myc¹³ (MYC) and NF-kB¹⁴, and inactive genomic regions associated with the nuclear lamina¹⁵(Fig. 1c).

We distinguished six broad classes of chromatin states, which we refer to as promoter, enhancer, insulator, transcribed, repressed and inactive states (Fig. 1c). Within them, active, weak and poised⁴ promo-ters (states 1–3) differ in expression level, strong and weak candidate enhancers (states 4–7) differ in expression of proximal genes, and strongly and weakly transcribed regions (states 9–11) also differ in their positional enrichments along transcripts. Similarly, Polycomb-repressed regions (state 12) differ from heterochromatic and repetitive states (states 13–15), which are also enriched for H3K9me3 (Sup-plementary Figs 2–4).

The states vary widely in their average segment length (,500 base pairs (bp) for promoter and enhancer states versus 10 kb for inactive

0 10 20 30

Luciferase relative light units

b c d

Chromatin mark observation frequency (%) (%) (fold) (kb) (%) Functional enrichments (fold)

Candidate state annotation

CTCF

State H3K27me3 H3K36me3 H4K20me1 H3K4me1 H3K4me2 H3K4me3 H3K27ac H3K9ac WCE Median Median length ±2 kb TSS Conserved non-exon DNase (K562) c-Myc (K562) NF-κB (GM12878) Transcript Nuclear lamina

HepG2 enhancer activity Coverage

H1 ES GM

16 2 2 6 17 93 99 96 98 2 0.6 1.0 83 3.8 23.3 82.0 40.7 0.2 0.15

12 2 6 9 53 94 95 14 44 1 0.5 0.4 58 2.8 15.3 12.6 5.8 0.6 0.30

13 72 0 9 48 78 49 1 10 1 0.2 0.6 49 4.3 10.8 3.1 1.0 0.4 0.68

11 1 15 11 96 99 75 97 86 4 0.7 0.6 23 2.7 23.1 31.8 49.0 1.3 0.05

5 0 10 3 88 57 5 84 25 1 1.2 0.6 3 1.8 13.6 6.3 15.8 1.4 0.10

7 1 1 3 58 75 8 6 5 1 0.9 0.2 17 2.4 11.9 5.7 7.0 1.1 0.31

2 1 2 1 56 3 0 6 2 1 1.9 0.4 4 1.5 5.1 0.6 2.4 1.3 0.20

92 2 1 3 6 3 0 0 1 1 0.5 0.4 3 1.5 12.8 2.5 1.2 1.1 0.61

5 0 43 43 37 11 2 9 4 1 0.7 0.8 4 1.1 4.5 0.7 0.8 2.4 0.02

1 0 47 3 0 0 0 0 0 1 4.3 3.0 1 0.9 0.3 0.0 0.0 2.5 0.11

0 0 3 2 0 0 0 0 0 0 12.5 2.6 2 0.9 0.3 0.0 0.1 1.9 0.24

1 27 0 2 0 0 0 0 0 0 4.1 2.8 5 1.4 0.3 0.0 0.1 0.8 0.63

0 0 0 0 0 0 0 0 0 0 71.4 10.0 1 0.9 0.1 0.0 0.0 0.7 1.30

22 28 19 41 6 5 26 5 13 37 0.1 0.6 3 0.4 1.9 0.3 0.2 0.4 1.44

85 85 91 88 76 77 91 73 85 78 0.1 0.2 1 0.2 5.9 9.5 7.4 0.4 1.30

Active promoter Weak promoter

Inactive/poised promoter Strong enhancer Strong enhancer Weak/poised enhancer Weak/poised enhancer Insulator

Transcriptional transition Transcriptional elongation Weak transcribed Polycomb repressed Heterochrom; low signal Repetitive/CNV Repetitive/CNV 12

34 56 78 109 1112 1314

15 State 4 in

HepG2 State 7 in

HepG2 State 4 in GM12878 0.5 1.2

1.2 1.3 4.0 1.0 0.1 1.1 0.2 0.7 1.3 1.0 1.2 1.1 1.4 1.0 1.3 1.0 0.6 1.2 1.3 0.8 0.3 0.7 1.0 1.0 0.9 1.2 0.9 1.0

H1 ES K562 GM12878 HepG2 HUVEC HSMM NHLF NHEK HMEC Genes

GM12878 H1 ES

H3K27me3CTCF H3K36me3 H4K20me1 H3K4me1 H3K4me2 H3K4me3 H3K27ac H3K9ac ChromatinWCE state

Chromatin states

WLS gene

DIRAS3 GNG12 GADD45A

WLS in H1 ES cells (poised) WLS in GM12878 cells (repressed) WLS in HUVEC (active) WLS in NHLF (active)

RPE65 DEPDC1

HUVEC NHLF

100 kb

n = 7 n = 7

n = 8

(NHLF)

Figure 1|Chromatin state discovery and characterization. a, Top: profiles for nine chromatin marks (greyscale) are shown across the WLS gene in four cell types, and summarized in a single chromatin state annotation track for each (coloured according tob). WLS is poised in ESCs, repressed in GM12878 and transcribed in HUVEC and NHLF. Its TSS switches accordingly between poised (purple), repressed (grey) and active (red) promoter states; enhancer regions within the gene body become activated (orange, yellow); and its gene body changes from low signal (white) to transcribed (green). These chromatin state changes summarize coordinated changes in many chromatin marks; for example, H3K27me3, H3K4me3 and H3K4me2 jointly mark a poised

promoter, whereas loss of H3K27me3 and gain of H3K27ac and H3K9ac mark promoter activation. WCE, whole-cell extract. Bottom: nine chromatin state tracks, one per cell type, in a 900-kb region centred at WLS, summarizing 90 chromatin tracks in directly interpretable dynamic annotations and showing activation and repression patterns for six genes and hundreds of regulatory regions, including enhancer states.b, Chromatin states learned jointly across

cell types by a multivariate hidden Markov model. The table shows emission parameters learnedde novoon the basis of genome-wide recurrent

combinations of chromatin marks. Each entry denotes the frequency with which a given mark is found at genomic positions corresponding to the chromatin state.c, Genome coverage, functional enrichments and candidate annotations for each chromatin state. Blue shading indicates intensity, scaled by column. CNV, copy number variation; GM, GM12878.d, Box plots depicting enhancer activity for predicted regulatory elements. Sequences 250 bp long corresponding either to strong or weak/poised HepG2 enhancer elements or to GM12878-specific strong enhancer elements were inserted upstream of a luciferase gene and transfected into HepG2. Reporter activity was measured in relative light units. Robust activity is seen for strong enhancers in the matched cell type, but not for weak/poised enhancers or for strong enhancers specific to a different cell type. Boxes indicate 25th, 50th and 75th percentiles, and whiskers indicate 5th and 95th percentiles.

ドキュメント内レポート課題 : 次世代シークエンサーが産出する大規模シークエンシングデータを生命科学にどう活かすか? というお題で A4レポート用紙 1 枚程度の文章 ( 講義の感想自分で調べたことなどなど)を書いてください提出 (ページ 33-54)

HITS-­‐CLIP

(Licatalosi+ 2010)

データをどうやって処理するか？

• Base calling

• Read mapping

• Data complexity reducnon

– Peek detecnon

• Data integranon

– Data visualizanon

– Unsupervised integranon

– Supervised integranon

Base calling

•

Illumina GAII

1%

• base quality

•

Read mapping

•

• Hash

– SeqMap, MAQ, SOAP, Stampy, Novoalign

• Suﬃx Array/BWT

– SOAP2, Bowne, BWA, BWA-­‐SW

Read mapping

•

–

– pair-­‐end read

– base quality

–

– SOLiD

color-­‐code read

– bisulﬁte

– spliced read, muln read

Peek detecnon

•

•

•

(Pepke+ 2009)

Data visualizanon

(Hawkins+ 2010)

Database biology

•

–

–

Unsupervised integranon

•

–

–

–

Supervised integranon

•

–

–

–

どのようなことがわかるのか？

•

•

•

ゲノムの機能アノテーション

•

(Hawkins+ 2010)

遺伝多型の機能推定

• Genome-­‐wide associanon study (GWAS)

–

(SNPs)

–

mulnple rare variant

• non-­‐coding region

SNPs

ChIP-­‐seq

à regulatory SNPs

(Hawkins+ 2010)

遺伝子制御メカニズムの解明

• transcriptomic data

epigenomic data

à

–

–

– iPS

• RNA-­‐seq

•  Base calling

•  Read mapping

•  Data complexity reducnon

–  Peek detecnon

•  Data integranon

–  Data visualizanon

–  Unsupervised integranon

–  Supervised integranon

• 

•  base quality

• 

• 

•  Hash

–  SeqMap, MAQ, SOAP, Stampy, Novoalign

•  Suﬃx Array/BWT

–  SOAP2, Bowne, BWA, BWA-‐SW

• 

– 

–  pair-‐end read

–  base quality

– 

–  SOLiD

color-‐code read

–  bisulﬁte

–  spliced read, muln read

• 

• 

• 

• 

– 

– 

• 

– 

– 

– 

• 

– 

– 

– 

• 

• 

• 

• 

•  Genome-‐wide associanon study (GWAS)

– 

– 

•  non-‐coding region

ChIP-‐seq

•  transcriptomic data

– 

– 

–  iPS

•  RNA-‐seq

HITS-‐

• 

• 

–  Posinon Weight Matrix (PWM)

–  False posinve

– 

• 

– 

–  RNA Pol II

– 

– 

–  PWM

• 

–  Naïve Bayes

–  Random Forest with Naïve Bayes Trees

• 

•