実験結果と考察

第 6 章実験 35

6.2 CSE 法の応用に関する実験

6.2.2 実験結果と考察

k-グラム統計の計算速度の計測結果を図6.3–6.7に示す．図6.3と図6.4より，小さなkについてはCOUNT_gzipに比べCOUNT_cstの方が高速であることが確認できる．また，図6.5 と図6.6より，ある程度大きなkについては，ファイルサイズが小さければCOUNT_gzipが高速であることが確認できる．しかし，ファイルサイズが大きい場合はCOUNT_cstの方が高速となっている．加えて，ファイルサイズの増加に伴う計算速度の低下も，COUNT_gzip に比べCOUNT_cstの方が小さいことが確認できる．ただし，図6.7より，大きなkについてはある程度の大きさのファイルサイズについてもCOUNT_gzipの方が高速であることが確認できる．以上より，kが十分小さく，更にファイルサイズが大きいほどCOUNT_cst

は高速であり，COUNT_gzipと比較し十分高速なことが確認された．また，圧縮されていない元データを用いて数え上げを行うCOUNT_naiveと比較してもCOUNT_cst は，kが十分小さく，更にファイルサイズが大きい場合は高速なことが確認された．

0 0.05 0.1 0.15 0.2 0.25 0.3

0 200 400 600 800 1000

Computation time [sec]

File size [KB]

COUNTnaive COUNTgzip COUNTcst

図 6.3: 2-グラム統計の計算時間

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35

0 200 400 600 800 1000

Computation time [sec]

File size [KB]

COUNTnaive COUNTgzip COUNTcst

図 6.4: 3-グラム統計の計算時間

0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45

0 200 400 600 800 1000

Computation time [sec]

File size [KB]

COUNTnaive COUNTgzip COUNTcst

図 6.5: 4-グラム統計の計算時間

0 0.1 0.2 0.3 0.4 0.5 0.6

0 200 400 600 800 1000

Computation time [sec]

File size [KB]

COUNTnaive COUNTgzip COUNTcst

図 6.6: 5-グラム統計の計算時間

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 200 400 600 800 1000

Computation time [sec]

File size [KB]

COUNTnaive COUNTgzip COUNTcst

図 6.7: 6-グラム統計の計算時間

第 7 ^章まとめ

7.1 ^{まとめと今後の課題}

本論文では，CSE法の拡張に対する効率的な実現の手法とCSE法の応用について提案を行った．まず，BWT行列を用いたCSE法に改良を加えることで，k-フェーズCSE法と今まで具体的な実装手法が示されていなかった多値アルファベット版CSE法に対する BWT行列を用いた効率的な実現の手法を提案した．また，それぞれについて圧縮率と圧縮速度に関する実験を行った．BWT行列を用いたk-フェーズCSE法では，BWT行列を用いた通常のCSE法に比べ計算時間がほとんど変化することなく，圧縮率が向上することが実験的に確認された．しかし，BWT行列を用いた多値アルファベット版CSE法については復元速度はCSE法に比べ高速だったが，圧縮率や計算速度についてBWT行列を用いたCSE法やk-フェーズCSE法に比べ優位性を示すことができなかった．よって，今後の課題としては，まず多値アルファベット版CSE法の数え上げアルゴリズムの改良が考えられる．現在のアルゴリズムの時間計算量はアルファベットサイズに大きく依存しており，同じ文字列を複数回数え上げている部分が存在するなどの理由で通常の CSE法に比べ圧縮速度が悪化している．そのため，より効率的に多値アルファベット版 CSE法における部分文字列の数え上げを行うことができるアルゴリズムが求められる．

また，現状で多値アルファベット版CSE法は圧縮率の点でもCSE法やk-フェーズCSE 法に劣っているが，本手法ではIwataらの符号化を簡略化した符号化を用いているため，

部分文字列の出現数の符号化に用いる符号化を工夫することで圧縮率の向上を目指すことも今後の課題の一つである．現在はCSE法全般において復号に圧縮部分文字列木を用

いているため，圧縮時と同様に圧縮部分文字列木を用いない復号アルゴリズムを開発することも今後の課題のひとつとして考えられる．

CSE法の応用については，多値アルファベット版CSE法の効率的な実現について提案したことにより，多値アルファベットで構成されるテキストをそのまま圧縮することができるようになり，多値アルファベット版CSE法により圧縮されたデータを用いた部分文字列の数え上げが，条件によっては他の圧縮法で圧縮されたデータ全体を復元してから愚直な数え上げを行う手法より高速なことを実験的に示したことで，k-グラム統計への応用の可能性を示した．

7.2 ^発表論文

本論文の研究成果は以下の学会，研究会で発表を行った．

• 佐久間俊平, 成澤和志, 篠原歩:“部分文字列数え上げ圧縮法の効率的な実現の一般化—多値化とフェーズの導入—”コンピュテーション研究会 2015年9月.

• Shumpei SAKUMA, Kazuyuki NARISAWA, and Ayumi SHINOHARA: “Gener-alization of Eﬃcient Implementation of Compression by Substring Enumeration”, Data Compression Conference 2016 (poster) .

• Shumpei Sakuma, Kazuyuki Narisawa, Ayumi Shinohara: “Eﬃcient Implementa-tion and ApplicaImplementa-tion for CSE with Finite Alphabet”, Workshop on Compression, Text and Algorithms 2016.

参考文献

[1] The canterbury corpus. http://corpus.canterbury.ac.nz/index.html, 1987.

[2] The large corpus. http://corpus.canterbury.ac.nz/index.html, 1987.

[3] Tetsuo Asano, Sergey Bereg, and David Kirkpatrick. Finding nearest larger neigh-bors. In Eﬃcient Algorithms, pp. 249–260. 2009.

[4] Mathieu B´eliveau and Danny Dub´e. Improving compression via substring enumer-ation by explicit phase awareness. In Data Compression Conference (DCC), 2014, pp. 399–399. IEEE, 2014.

[5] Timo Beller, Simon Gog, Enno Ohlebusch, and Thomas Schnattinger. Computing the longest common prefix array based on the burrows–wheeler transform. Journal of Discrete Algorithms, Vol. 18, pp. 22–31, 2013.

[6] Michael A. Bender, Mart´ın Farach-Colton, Giridhar Pemmasani, Steven Skiena, and Pavel Sumazin. Lowest common ancestors in trees and directed acyclic graphs.

Journal of Algorithms, Vol. 57, No. 2, pp. 75–94, 2005.

[7] Francisco Claude and Gonzalo Navarro. The wavelet matrix. In Proceedings of the 19th International Conference on String Processing and Information Retrieval, SPIRE’12, pp. 167–179, Berlin, Heidelberg, 2012. Springer-Verlag.

[8] John G Cleary and Ian Witten. Data compression using adaptive coding and partial string matching. Communications, IEEE Transactions on, Vol. 32, No. 4, pp. 396–

402, 1984.

[9] Danny Dub´e and Vincent Beaudoin. Lossless data compression via substring enu-meration. InData Compression Conference (DCC), 2010, pp. 229–238. IEEE, 2010.

[10] Johannes Fischer. Combined data structure for previous- and next-smaller-values.

Theoretical Computer Science, Vol. 412, No. 22, pp. 2451–2456, 2011.

[11] Roberto Grossi, Ankur Gupta, and Jeﬀrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, pp. 841–850, Philadelphia, PA, USA, 2003. Society for Industrial and Applied Mathematics.

[12] Roberto Grossi and Jeﬀrey Scott Vitter. Compressed suﬃx arrays and suﬃx trees with applications to text indexing and string matching. SIAM Journal on Comput-ing, Vol. 35, No. 2, pp. 378–407, 2005.

[13] Kenichi Iwata and Mitsuharu Arimura. Lossless data compression via substring enumeration for k-th order markov sources with a finite alphabet. IEICE Trans.

Fundamentals of Electronics, Communications and Computer Sciences, Vol. E99.A, No. 12, pp. 2130–2135, 2016.

[14] Mark Adler Jean-loup Gailly. The gzip compresser. http://www.gzip.org/.

[15] Sho Kanai, Hidetoshi Yokoo, Kosumo Yamazaki, and Hideaki Kaneyasu. Eﬃcient implementation and empirical evaluation of compression by substring enumeration.

IEICE Trans. Fundamentals of Electronics, Communications and Computer Sci-ences, Vol. E99-A, No. 2, pp. 601–611, 2016.

[16] Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. Linear-time longest-common-prefix computation in suﬃx arrays and its applications. In Combinatorial pattern matching, pp. 181–192, 2001.

[17] Udi Manber and Gene Myers. Suﬃx arrays: A new method for on-line string searches. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete

Algorithms, SODA ’90, pp. 319–327, Philadelphia, PA, USA, 1990. Society for In-dustrial and Applied Mathematics.

[18] Ge Nong, Sen Zhang, and Wai Hong Chan. Linear suﬃx array construction by almost pure induced-sorting. In Data Compression Conference (DCC), 2009, pp.

193–202. IEEE, 2009.

[19] Takahiro Ota and Hiroyoshi Morita. On a universal antidictionary coding for station-ary ergodic sources with finite alphabet. InInformation Theory and its Applications (ISITA), 2014 International Symposium on, pp. 294–298, Oct 2014.

[20] Julian Seward. bzip2. http://www.bzip.org/.

ドキュメント内部分文字列数え上げ圧縮法の効率的な実現と応用に関する研究 (ページ 43-51)

第 6 章 実験 35

6.2 CSE 法の応用に関する実験

6.2.2 実験結果と考察

第 7 章 まとめ

7.1 まとめと今後の課題

7.2 発表論文

参考文献

第 6 章実験 35

第 7 ^章まとめ

7.1 ^{まとめと今後の課題}

7.2 ^発表論文