File repository: Japan Cassandra Community

(1)

Trying Out Lucandra

2010/6/25

佐藤 史彦

(2)

Agenda

What is Lucandra?

Lucandra architecture

What it can do

Trying it out

Summary

(3)

What is Lucandra?

(4)

A Cassandra-based Lucene backend

Author : Jake Luciani

(5)

In other words: a Lucene backend built on Cassandra

Author: Jake Luciani

(6)

Rather than adding an indexing feature to Cassandra, Lucandra adopts Cassandra as the place to store Lucene/Solr indexes, so that indexes can be built in real time and scaled out easily.

(7)

Example deployment: http://sparse.ly/

(8)

Lucandra architecture

(9)

[Figure: plain Lucene. A Java application builds Documents (made of Fields) and indexes them through Analyzer and IndexWriter. For search, QueryParser and Analyzer produce a Query, which IndexSearcher and IndexReader run to return Hits and Documents. The Lucene Index is stored on disk.]

(10)

[Figure: Lucandra. The Java application uses the same components (Document, Field, Analyzer, Query, IndexWriter, IndexSearcher, IndexReader) for indexing and search, but the index is stored in Cassandra instead of on disk.]

(11)

Index layout

Keyspace: Lucandra

  ColumnFamily: Documents
    Key: hash of the index name + document ID
    Column Name: field name
    Value: field value

  SuperColumnFamily: TermInfo
    Key: hash of (index name + field name) + field name + term
    SuperColumn: document ID
      Column Name: Frequencies
      Value: number of occurrences of the term in the document
      Column Name: Norms
      Value: norm of the document for the term
      Column Name: Offsets
      Value: byte offsets of the term in the document
      Column Name: Position
      Value: positions at which the term occurs in the document
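As a rough illustration of the key layout above, the sketch below assembles row keys for the two column families. It is not Lucandra's code: the hash function (first 8 hex characters of an MD5 digest), the delimiter, and the class and method names are all assumptions made for the example. Only the order of the key parts follows the slide.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class LucandraKeySketch {
    // Hypothetical delimiter; the real separator is not given in the slides.
    static final String DELIM = "/";

    // Hypothetical hash: the first 8 hex characters of the MD5 digest.
    static String shortHash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                sb.append(String.format("%02x", digest[i]));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Documents row key: hash of the index name + document ID.
    static String documentKey(String indexName, String docId) {
        return shortHash(indexName) + DELIM + docId;
    }

    // TermInfo row key: hash of (index name + field name) + field name + term.
    static String termKey(String indexName, String field, String term) {
        return shortHash(indexName + field) + field + DELIM + term;
    }
}
```

With this shape, every term of one field in one index shares the same key prefix (hash plus field name), so the terms of a field sit next to each other whenever rows are kept in key order.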

(12)

What it can do

(13)

From the README

1 Real-Time indexing

(documents become available almost

immediately)

2 No optimizing

3 Search

4 Sort

5 Range Queries

6 Delete

7 Wildcards and other Lucene magic

8 Faceting/Highlighting

Items 4, 5, and 7 are not available with RandomPartitioner.
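Why the partitioner matters can be sketched with an ordinary sorted map. In the example below a TreeMap stands in for rows kept in key order (the situation an order-preserving partitioner gives you), and a wildcard such as linu* becomes a scan over one contiguous key range. The class name, method name, and key format are hypothetical; this is not Cassandra or Lucandra code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class OrderedKeyScanSketch {
    // Returns every row key that starts with the given prefix.
    // Because the map is sorted, the matches form one contiguous range:
    // prefix <= key < prefix + '\uffff'.
    static List<String> prefixScan(TreeMap<String, Integer> rows, String prefix) {
        String upperBound = prefix + Character.MAX_VALUE;
        return new ArrayList<>(rows.subMap(prefix, true, upperBound, false).keySet());
    }
}
```

If rows were placed by a hash of the key instead, keys sharing a prefix would be scattered, and a prefix or range lookup could no longer be answered from one contiguous slice.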

(14)

What it currently cannot do

You can't walk the documents with

index reader.

What is currently slow

Indexes with many documents and

very dense terms.

(15)

Trying it out

(16)

Environment

Cassandra 0.6.2 (single node)

(17)

Build

Download the tarball from:

http://github.com/tjake/Lucandra

Build lucandra.jar with ant.

Supported versions: Lucene-2.9.1, Cassandra-0.6

$ tar xzf tjake-Lucandra-c632677.tar.gz

$ cd tjake-Lucandra-c632677

$ ant lucandra.jar

(18)

Replacing storage-conf.xml

Replace storage-conf.xml and start Cassandra.

Note: this assumes Cassandra holds no data yet.

$ cp config/storage-conf.xml \
    /usr/local/cassandra/conf/

$ /usr/local/cassandra/bin/cassandra

(19)

Key points of storage-conf.xml

<Keyspace Name="Lucandra">
  <ColumnFamily CompareWith="BytesType" Name="Documents" KeysCached="10%" />
  <ColumnFamily ColumnType="Super" CompareWith="BytesType" CompareSubcolumnsWith="BytesType"
                Name="TermInfo" KeysCached="10%" />
  :
  :
</Keyspace>

<Partitioner>org.apache.cassandra.dht.OrderPreservingPartitioner</Partitioner>

On cluster nodes, InitialToken should also be set appropriately.

(20)

Trying the demo (BookmarksDemo)

[Figure: Bookmarks Demo. With -index, a TSV file is read into Documents (fields: url, title, tags) and written to Cassandra through SimpleAnalyzer and IndexWriter. With -search, a Query is analyzed with SimpleAnalyzer and run through IndexSearcher and IndexReader, returning Documents.]

(21)

Verifying that it works

$ ./run_demo.sh -index bookmark.tsv

$ ./run_demo.sh -search title:linu*
Search matched: 5 item(s)
1. ZFS on FUSE/Linux
http://zfs-on-fuse.blogspot.com/
2. Set Up Postfix For Relaying Emails Through Another Mailserver | HowtoForge - Linux Howtos and Tutorials
http://www.howtoforge.com/postfix_relaying_through_another_mailserver
3. Debian GNU/Linux System Administration Resources
http://www.debian-administration.org/
4. Linux Scalability
http://www.cs.wisc.edu/condor/condorg/linux_scalability.html
5. LinuxDevCenter.com -- Cache-Friendly Web Pages
http://www.linuxdevcenter.com/pub/a/linux/2002/02/28/cachefriendly.html

(22)

Now that the demo is confirmed to work, let's build a Japanese-language sample.

(23)

Sample data

Using a certain restaurant-search API, I collected 980 records for this neighborhood and saved them as TSV. Fields and index settings:

id docID, INDEX, STORE

name INDEX(ANALYZED), STORE

url STORE

address INDEX(ANALYZED), STORE

tel INDEX(ANALYZED), STORE

budget INDEX, STORE

(24)

Sample program

Copy BookmarksDemo and make the following changes:

* Change the Analyzer: SimpleAnalyzer → CJKAnalyzer

* Change the index name: bookmarks → shopsearch

* Change the document fields to match the sample data

(25)

Running the sample

$ ./run_shop.sh -index data.tsv

$ ./run_shop.sh -search name:丸の内
Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8
12:08:03,436 INFO CassandraProxyClient:145 - Connected to cassandra at 127.0.0.1:9160
name:"丸の の内"
12:08:03,863 DEBUG LucandraTermEnum:237 - Found 2 keys in range:OxSo2Td8name 丸の to in 95ms
12:08:03,863 DEBUG LucandraTermEnum:246 - name 丸の has 115
12:08:03,869 DEBUG LucandraTermEnum:246 - name 丸ノ has 10
12:08:03,871 DEBUG LucandraTermEnum:285 - loadTerms: OxSo2Td8name 丸の (3) took 103ms
12:08:03,872 INFO IndexReader:153 - docFreq() took: 232ms
12:08:03,916 DEBUG LucandraTermEnum:237 - Found 2 keys in range:OxSo2Td8name の内 to in 43ms
12:08:03,916 DEBUG LucandraTermEnum:246 - name の内 has 115
12:08:03,925 DEBUG LucandraTermEnum:246 - name の勘 has 2

(26)

Running the sample (continued)

12:08:03,927 DEBUG LucandraTermEnum:285 - loadTerms: OxSo2Td8name の内 (3) took 54ms
12:08:03,927 INFO IndexReader:153 - docFreq() took: 54ms
12:08:03,947 DEBUG LucandraTermEnum:176 - Found OxSo2Td8name 丸の in cache
12:08:03,953 DEBUG LucandraTermEnum:176 - Found OxSo2Td8name の内 in cache
Search matched: 0 item(s)

Huh? No hits...

(Even though it looks fine up to the halfway point.)

(27)

Root-cause investigation

When CJKAnalyzer is used, QueryParser.parse() returns a Query in which the CJK string has been split into bi-grams.

name:丸の内

↓

name:"丸の の内"
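The split can be mimicked in a few lines. The sketch below is a simplified stand-in for what CJKAnalyzer does to a run of CJK characters (overlapping two-character tokens). It ignores the analyzer's handling of non-CJK text and of characters outside the Basic Multilingual Plane, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramSketch {
    // Splits text into overlapping 2-character tokens.
    // A single character is returned as a one-character token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // Renders the tokens the way a phrase query is printed: field:"t1 t2 ...".
    static String asPhraseQuery(String field, String text) {
        return field + ":\"" + String.join(" ", bigrams(text)) + "\"";
    }
}
```

Calling asPhraseQuery("name", "丸の内") yields the two tokens 丸の and の内 inside one quoted phrase, which is the shape of the query printed in the log.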

(28)

Root-cause investigation

The Query in this case is an instance of PhraseQuery.

When a PhraseQuery is used, LucandraTermDocs.nextPosition() does not seem to work properly (?), and the matching documents are not retrieved.

(29)

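Root-cause investigation

That appears to be the problem, but without digging deeper I cannot judge how far the impact reaches, so for now let's look for a workaround...

Update: the cause has since been identified. See the supplementary slides (Supplement) for details.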

(30)

Workaround

While looking into why bi-grams are treated as a PhraseQuery in the first place, I found the following:

関口宏司の Lucene ブログ (Koji Sekiguchi's Lucene blog)

http://lucene.jugem.jp/?cid=5

(31)

Workaround

According to the blog, it has been proposed that from Lucene 3.1 onward, when an Analyzer produces multiple terms, the parser should generate a BooleanQuery instead of a PhraseQuery, and a patch is available.

(32)

Workaround

Let's force this patch onto 2.9.1.

$ tar xzf lucene-2.9.1.tar.gz

$ cd lucene-2.9.1

$ curl -O https://issues.apache.org/jira/secure/attachment/12445136/LUCENE-2458.patch

$ patch -b -p1 < LUCENE-2458.patch

(33)

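Workaround

The build does not pass as-is, so fetch the following two files from the Lucene repository:

org/apache/lucene/util/Version.java

org/apache/lucene/util/VirtualMethod.java

Note: these are classes for method versioning, so they probably have little effect on the actual processing?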

(34)

Workaround

The patched source uses Java 5 and later syntax, so change the javac options and build.

common-build.xml:

<property name="javac.source" value="6"/>
<property name="javac.target" value="6"/>

<property name="javadoc.link" value="http://java.sun.com/javase/6/docs/api/"/>

$ ant

(35)

Workaround

Replace Lucandra's lib/lucene-core-2.9.1.jar with build/lucene-core-2.9.1-dev.jar.

Set the QueryParser default operator to AND, and try again!!

ShopSearchDemo.java:

QueryParser qp = new QueryParser(Version.LUCENE_CURRENT, "name", analyzer);

qp.setDefaultOperator(QueryParser.Operator.AND);
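What the AND operator buys here can be sketched without Lucene: each bi-gram becomes a required term, and a document matches when it appears in the posting list of every term. The sketch below uses a plain map from term to document IDs. The class name, method name, and data shape are assumptions for illustration, not Lucene or Lucandra internals.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class AndQuerySketch {
    // Returns the IDs of documents that contain every one of the given terms.
    // A term with no postings makes the result empty.
    static Set<Integer> matchAll(Map<String, Set<Integer>> postings, List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term seeds the candidates
            } else {
                result.retainAll(docs);         // later terms narrow them down
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }
}
```

Note that this is looser than a phrase match: a document containing both bi-grams in unrelated places also matches, because no position information is consulted.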

(36)

$ ./run_shop.sh -search name:丸の内
Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8
12:08:03,436 INFO CassandraProxyClient:145 - Connected to cassandra at 127.0.0.1:9160
+name:丸の +name:の内
18:03:39,127 DEBUG LucandraTermEnum:237 - Found 2 keys in range:OxSo2Td8name 丸の to in 109ms
18:03:39,127 DEBUG LucandraTermEnum:246 - name 丸の has 115
18:03:39,128 DEBUG LucandraTermEnum:246 - name 丸ノ has 10
18:03:39,130 DEBUG LucandraTermEnum:285 - loadTerms: OxSo2Td8name 丸の (3) took 112ms
18:03:39,131 INFO IndexReader:153 - docFreq() took: 222ms
18:03:39,189 DEBUG LucandraTermEnum:237 - Found 2 keys in range:OxSo2Td8name の内 to in 57ms
18:03:39,189 DEBUG LucandraTermEnum:246 - name の内 has 115
18:03:39,190 DEBUG LucandraTermEnum:246 - name の勘 has 2
18:03:39,190 DEBUG LucandraTermEnum:285 - loadTerms: OxSo2Td8name の内 (3) took 58ms
18:03:39,190 INFO IndexReader:153 - docFreq() took: 59ms
18:03:39,196 DEBUG LucandraTermEnum:176 - Found OxSo2Td8name 丸の in cache
18:03:39,202 DEBUG LucandraTermEnum:176 - Found OxSo2Td8name の内 in cache

(37)

Search matched: 115 item(s)
09:52:16,739 DEBUG IndexReader:293 - Document read took: 10ms
1. Ｌｕｘｏｒ丸の内
http://r.gnavi.co.jp/g763393/ ¥9000
09:52:16,741 DEBUG IndexReader:293 - Document read took: 1ms
2. ｔｈｅＰａｎｔｒｙ丸の内店
09:52:16,743 DEBUG IndexReader:293 - Document read took: 1ms
3. ＭＡＩＳＯＮ・ＢＡＲＳＡＣ丸の内
09:52:16,750 DEBUG IndexReader:293 - Document read took: 2ms
4. 丸の内やんも
09:52:16,751 DEBUG IndexReader:293 - Document read took: 1ms
5. Ｖｉｎｐｉｃｏｅｕｒ～丸の内～
09:52:16,753 DEBUG IndexReader:293 - Document read took: 1ms
6. ＤＥＡＮ＆ＤＥＬＵＣＡ～丸の内～
09:52:16,755 DEBUG IndexReader:293 - Document read took: 2ms
7. Ｓ．Ｓｔｅｆａｎｏ～丸の内～
http://r.gnavi.co.jp/g763359/ ¥4500
:
:

(38)

Oh, it seems to be working.

(39)

Trying sort as well

When the sort option is given, make IndexSearcher.search() sort the results by the budget field in descending order.

Let's run it ☞

ShopSearchDemo.java:

TopDocs docs = indexSearcher.search(q, null, 10,
    new Sort(new SortField("budget", SortField.INT, true)));
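The third argument of SortField (true) asks for reverse order, and 10 caps the number of hits returned. The intended ordering can be sketched in plain Java as "sort by budget, highest first, keep the top n". The Shop class, its fields, and the method name below are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BudgetSortSketch {
    // Hypothetical stand-in for a search hit with an integer budget field.
    static final class Shop {
        final String name;
        final int budget;

        Shop(String name, int budget) {
            this.name = name;
            this.budget = budget;
        }
    }

    // Returns at most n hits, ordered by budget from highest to lowest.
    static List<Shop> topByBudget(List<Shop> hits, int n) {
        List<Shop> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Shop s) -> s.budget).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }
}
```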

(40)

$ ./run_shop.sh -search name:丸の内 sort
Search matched: 115 item(s)
09:59:17,396 DEBUG IndexReader:293 - Document read took: 9ms
1. レストランモナリザ丸の内店～丸ビル～
09:59:17,398 DEBUG IndexReader:293 - Document read took: 1ms
2. センチュリーコート丸の内
09:59:17,399 DEBUG IndexReader:293 - Document read took: 1ms
3. Ｌｕｘｏｒ丸の内
09:59:17,401 DEBUG IndexReader:293 - Document read took: 1ms
4. 丸の内やんも
09:59:17,402 DEBUG IndexReader:293 - Document read took: 1ms
5. たまさか丸の内店
http://r.gnavi.co.jp/e533319/ ¥8000
09:59:17,403 DEBUG IndexReader:293 - Document read took: 1ms
6. ワインショップエノテカ丸の内ザ・ラウンジ
09:59:17,405 DEBUG IndexReader:293 - Document read took: 1ms
7. 寿し屋の勘八旬～丸の内～
http://r.gnavi.co.jp/g763366/ ¥6000
:

(41)

It seems to be sorted.

(42)

Summary

(43)

Findings 1

As advertised, Lucandra is Lucene with Cassandra as its backend, and applications can keep using their existing Lucene assets (the API) almost unchanged.

# PhraseQuery needs further investigation

(44)

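Findings 2

To make full use of Lucene's features, you have to choose OrderPreservingPartitioner.

At present the Partitioner is shared across the whole cluster, so you lose the simple and effective data distribution of RandomPartitioner. Introducing Lucandra into a shared environment therefore needs careful consideration.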

(45)

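Findings 3

By exploiting Cassandra's internal characteristics, the design makes index optimization unnecessary and improves real-time behavior.

It seems well suited to cases such as Twitter clients or RSS readers, where indexes are split per user, the total data volume is large, and results must be searchable immediately.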

(46)

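Findings 4

Naturally, it can only be used from Java.

In other environments, you would presumably use the bundled Solrandra over HTTP.

Even from Java, it may be better to go through SolrJ and take advantage of Solr's index management and caching.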

(47)

Future work

Maybe study Lucene/Solr a bit more?

Evaluate practicality on a Solrandra basis

Evaluate data volume and performance

Verify behavior with RandomPartitioner

PhraseQuery . . .

(48)

The end

(49)

References

■ A Cassandra-based Lucene backend
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/

■ slideshare - Lucandra
http://www.slideshare.net/otisg/lucandra

■ Cassandra: RandomPartitioner vs OrderPreservingPartitioner
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

■ 関口宏司の Lucene ブログ (Koji Sekiguchi's Lucene blog)
http://lucene.jugem.jp/

(50)

Trying Out Lucandra: Supplement

I looked into PhraseQuery ...

2010/7/15

佐藤 史彦

(51)

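Previously

➲ Lucandra keeps Lucene's usability as-is while storing the index in Cassandra.

➲ With Lucandra, searches did not work for the PhraseQuery produced when parsing with CJKAnalyzer (or any Analyzer that generates multiple terms).

➲ I worked around the problem by applying a patch to Lucene that generates a BooleanQuery instead of a PhraseQuery.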

(52)

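But something still bothers me ...

➲ The example deployment, http://sparse.ly/, does not seem to have this problem (and I doubt they parse queries on their own).

➲ When I carried the workaround over to solrandra (the Solr implementation), it somehow did not work.

➲ I started to feel it would be quicker to take another look at Lucandra itself...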

(53)

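So I traced through the source ...

and found that PhraseQuery does not work unless TermInfo (the inverted index) contains Position (the positions at which each term occurs).

Note: to record Position, you have to request it at indexing time.

Note: stock Lucene works without doing this ...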
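Why a phrase match depends on positions can be shown with a small sketch: a document matches the phrase only if the terms occur at consecutive positions, so with no recorded positions there is nothing to check and nothing matches. This is a generic illustration of phrase matching, not the code of Lucene's PhraseQuery or of LucandraTermDocs. The class name, method name, and data shape are assumptions.

```java
import java.util.List;
import java.util.Map;

final class PhraseMatchSketch {
    // positionsInDoc maps each term to the positions where it occurs in ONE document.
    // Returns true if the terms occur at consecutive positions in the given order.
    static boolean matchesPhrase(List<String> terms, Map<String, List<Integer>> positionsInDoc) {
        if (terms.isEmpty()) {
            return false;
        }
        List<Integer> starts = positionsInDoc.get(terms.get(0));
        if (starts == null) {
            return false; // no positions recorded for the first term
        }
        for (int start : starts) {
            boolean allAdjacent = true;
            for (int i = 1; i < terms.size() && allAdjacent; i++) {
                List<Integer> positions = positionsInDoc.get(terms.get(i));
                // The i-th term must sit exactly i positions after the first one.
                allAdjacent = positions != null && positions.contains(start + i);
            }
            if (allAdjacent) {
                return true;
            }
        }
        return false;
    }
}
```

With an empty position map the method always returns false, which resembles the "0 item(s)" result seen earlier even though both bi-grams were found in the term dictionary.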

(54)

Index layout (excerpt from the previous slides)

Keyspace: Lucandra

  ColumnFamily: Documents
    Key: hash of the index name + document ID
    Column Name: field name
    Value: field value

  SuperColumnFamily: TermInfo
    Key: hash of (index name + field name) + field name + term
    SuperColumn: document ID
      Column Name: Frequencies
      Value: number of occurrences of the term in the document
      Column Name: Norms
      Value: norm of the document for the term
      Column Name: Offsets
      Value: byte offsets of the term in the document
      Column Name: Position   <- this one!
      Value: positions at which the term occurs in the document

(55)

So, what do we do?

Specify Positions when building the index.

➲ With Lucandra (.java)

doc.add(new Field("name", name,
    Store.YES,
    Index.ANALYZED,
    Field.TermVector.WITH_POSITIONS));

➲ With Solrandra (schema.xml)

<field name="name" type="text_cjk"
       indexed="true" stored="true"
       termPositions="true" />

(56)

Results (partially omitted)

$ ./run_shop.sh -index data.tsv

$ ./run_shop.sh -search name:丸の内
name:"丸の の内"
Search matched: 115 item(s)
1. Ｌｕｘｏｒ丸の内
http://r.gnavi.co.jp/g763393/ ¥9000
2. ｔｈｅＰａｎｔｒｙ丸の内店
http://r.gnavi.co.jp/g763381/ ¥1300
3. ＭＡＩＳＯＮ・ＢＡＲＳＡＣ丸の内
http://r.gnavi.co.jp/g763375/ ¥5500
4. 丸の内やんも
http://r.gnavi.co.jp/g763373/ ¥8000
5. Ｖｉｎｐｉｃｏｅｕｒ～丸の内～
http://r.gnavi.co.jp/g763372/ ¥3500
6. ＤＥＡＮ＆ＤＥＬＵＣＡ～丸の内～
http://r.gnavi.co.jp/g763365/ ¥1500
7. Ｓ．Ｓｔｅｆａｎｏ～丸の内～
