Efficient String Dictionary Compression
Using String Dictionaries
♥ ♦ ♦
Shunsuke KANDA Kazuhiro MORITA Masao FUKETA
Trie
Front-Coding Re-Pair
Re-Pair
A string dictionary is a data structure to store a set of strings. Recently, instances have emerged in prac-tice where the size of string dictionaries has become a critical problem in many applications. Consequently, compressed string dictionaries have been proposed by leveraging efficient implementation techniques, such as Trie and Front-Coding, and powerful text compres-sion techniques, such as Re-Pair. In this paper, we propose new dictionary structures based on a strategy using string dictionaries for the compression in order to improve existing compressed ones. We show that our string dictionaries can be constructed in a shorter time compared to the Re-Pair versions with competi-tive space usage and operation speed, through exper-iments on real-world datasets.
1.
ID ID Lookup ID Access 2 [1] [1–5] ♥ [email protected] ♦ {kam,fuketa}@is.tokushima-u.ac.jp Mart´ınez-Prieto [1] Front-Coding [6] Grossi Ottaviano [2] Trie [7] Path Decomposion [8]Trie Front-Coding Re-Pair [9] Lookup/Access Re-Pair Re-Pair Re-Pair Re-Pair 1 [5] Trie Trie Re-Pair
2.
Σ σ =|Σ| 2 Rank/Select B[1...n] Rank/Select 2 • Rankb(B, i) B[1...i] b ∈ {0, 1} • Selectb(B, i) i b ∈ {0, 1} o(n) [10]LOUDS LOUDS Level-Ordered Unary Degree Se-quence [10] 1 LOUDS d d ‘1’ 1 ‘0’ “10” Rank/Select n 2n + o(n)
DFUDS DFUDS Depth-First Unary Degree
Se-quence [11] LOUDS 1
DFUDS d d ( 1 ) ( DFUDS o(n) [10, 12] n 2n + o(n) Elias-Fano Elias-Fano [13,14] n m 2m + m⌈log n m⌉ + o(m) Elias-Fano Succinct [15]
3.
S ⊂ Σ∗ ID [1, |S |] 2 • Lookup(s) s ∈ S ID • Access(i) ID i ⟨s1,s2, ...,sn⟩ ⟨i1,i2, ...,in⟩ Trie Front-CodingCompact Trie C-Trie Path-Decomposed Trie PD-Trie Front-Coding 3
3. 1 Compact Trie C-Trie
Trie
C-Trie Trie
1a Sex={ideal, ideas, ideology, tea,
techie, technology, tie, trie} C-Trie C-Trie
MARISA Matching Algorithm with Recursively Implemented StorAge [5]
MARISA C-Trie C-Trie
MARISA C-Trie C-Trie MARISA 1 2 v L[v] L L′ F • |L[v]| = 1 L′[v] ← L[v] F[v] ← 0 • |L[v]| ≥ 2 L[v] ID iv L′[v] ← iv F[v] ← 1 1a C-Trie MARISA 1b MARISA LOUDS L L L′ F B 1 7 11 13 2 3 4 6 8 12 14 t rie ie e ide ch a nology ie a ology 9 10 l s 5 (a) C-Trie Dictionary encoding _BtaEeCFlsaACD 01001011000111 L’ F 12345678901234 |ide|t|a|ology|e|ie|rie|l|s|a|ch|ie|nology L 12___3_4_5_____6_7__8___9_0_1_2__3__4 ch ide ie nology ology rie A B C D E F 10110110111011001100000011000 00001011111011 B T 12345678901234567890123456789 (b) C-Trie 1: Sex C-Trie LOUDS T v T[v] = 1 T Rank/Select ID [1, |S |] ID Lookup/Access Lookup(s) s v Rank1(T, v) ID Access(i) Select1(T, i) ID F[v] = 1 L′[v] ID
3. 2 Path-Decomposed Trie PD-Trie
PD-Trie Trie Path
Decomposion 2 Grossi Ottaviano [2] PD-Trie 2b 2a Trie PD-Trie Trie 1, 2, ... Σ PD-Trie Trie π π Trie Trie ∈ {1, 2, ...} PD-Trie vπ vπ π Trie PDT Trie Trie Heavy Path Heavy Path
8 6 7 4 5 3 1 t rie ie e a c h 2 ie nology i d e a l s ology (a) Trie 3| 6|de1a1l 2|ology i n a r 8|logy o s i 4|e 5|ie 7| 1|1t2e1ch1ie (b) PD-Trie ID Dictionary encoding GFADCBAE L’ 12345678 1t2e1ch1ie|ology||ie|e|de1a1l||logy L 1__________2____3_4__5_6_____7_8 de1a1l e ie logy ology 1t2e1ch1ie A B C D E F G irianos ((((((()))))))(())) E B 1234567890123456789 (c) PD-Trie 2: Sex Trie PD-Trie
Heavy Path Trie
Centroid Path Decomposition CPD CPD PD-Trie O(log |S |) CPD Trie CPD PD-Trie PD-Trie v Lv v Ev v DFUDS Bv Lv,Ev,Bv L, E, B PD-Trie 2c PD-Trie Lookup/Access [2] L Σ′= Σ∪ {1, 2, ..., σ − 1} Σ =[0, 256) Σ′ =[0, 511) L [2] 2 1 L Vbyte [16] Σ′ L 0 1 1 L Re-Pair [9] 1 [2] [0, 128) ideal ideas ideology tea techie technology tie trie ideal 4 s 3 ology 0 tea techie 4 nology 1 ie 1 rie ideal 4 E 3 C 0 F techie 4 B 1 A 1 D Dictionary encoding Front-Coding ideal$techie$ 17 430411 ECFBAD H P C L 1234567890123 ie nology ology rie s tea A B C D E F 3: Sex Front-Coding Re-Pair Approximate Re-Pair [17] L Elias-Fano PD-Trie L 2c L′ L ID Vbyte L′ 3. 3 Front-Coding Front-Coding URL ID 3 4 Sex Front-Coding Lookup Access ID Front-Coding [1] Plain Front-Coding PFC PFC Vbyte [1] Re-Pair PFC 2 Navarro Re-Pair https://www.dcc.uchile.cl/∼gnavarro/software/ 2 [1] Hu-Tucker [7] Re-Pair
Front-Coding PFC ID 3 Front-Coding H P C ID L
4.
Trie Front-Coding ID Lookup/Access • Restore(i) ID i • Compare(i, q) Restore(i) q Restore Access Compare Lookup Restore Compare Restore Compare• Lookup Access Compare
Restore
• Trie Front-Coding
4. 1
ID
TAIL [4, 5]
Reverse Trie Back-Coding
4. 2 Reverse Trie
Trie Reverse Trie Reverse Trie ID
Restore Compare
Com-pact Trie Reverse Trie Reverse Compact Trie RC-Trie PD-Trie Reverse Path-Decomposed Trie RPD-Trie 2
RC-Trie MARISA
RPD-Trie RC-Trie
Reverse Compact Trie RC-Trie RC-Trie Reverse Trie ID ID T RC-Trie MARISA e h y i i r F $ E g o l o A c C B d D n
(a) Reverse Trie
1|$eir 3|golon 4|i y d 2|c h $eirhcygolondi 00001010000010 112 L B P __CF_A____ED_B 12345678901234 (b) RPD-Trie
4: Reverse Trie RPD-Trie
RC-Trie TAIL
Reverse Path-Decomposed Trie RPD-Trie RC-Trie Reverse Trie Path Decomposion [2] Compare PD-Trie RPD-Trie Reverse Trie RPD-Trie [2] 1b Reverse Trie 4a RPD-Trie 4b Reverse Trie $ 4b RPD-Trie Reverse Trie CPD ID 1, 2, ... RPD-Trie 3 L, B, P L RPD-Trie ID B L[i] B[i] = 1 B[i] = 0 P L ID RPD-Trie L ID RPD-Trie Restore(i) (1) str (2) L[i] = $ str (3) str L[i]
(4) B[i] = 1 i ← P[Rank1(B, i)] B[i] = 0 i
(2) Reverse Trie n L n⌈log σ⌉ B n + o(n) P Elias-Fano |P| = m P 2m + m⌈logn m⌉ + o(m) Reverse
Trie 2n+n⌈log σ⌉+o(n)
2m + m⌈logn
m⌉ < n RPD-Trie
m Reverse Trie
1 4 n
4. 3 Back-Coding Front-Coding Back-Coding Back-Coding Front-Coding Back-Coding PFC [18] Fast Back-Coding FBC FBC 2
5.
5. 1Intel Core i7 4.0 GHz CPU 16 GB RAM (L2 cache 256 KB, L3 cache 8 MB)
OS OS X 10.12 C++
-O3 Apple LLVM version 8.0.0 (clang-800.0.42.1)
std::chrono::duration cast
C-Trie PD-Trie Front-Coding
TAIL RC-Trie RPD-Trie Back-Coding FBC 5
RC-Trie 1 RC-Trie TAIL
2 RC-Trie TAIL
RCT1 RCT2 Back-Coding FBC
4 8 16 3 BC4 BC8 BC16 FBC4 FBC8 FBC16
TAIL RC-Trie RPD-Trie
Back-Coding FBC PD-Trie Front-Coding
PD-Trie
Vbyte Re-Pair 2
Plain Re-Pair Front-Coding PFC
Re-Pair 2 Plain Re-Pair Re-Pair Re-Pair Front-Coding 8 3 • SW Wikipedia https://dumps. wikimedia.org/enwiki/ • SI URL http://data.law.di.unimi.it/webdata/ indochina-2004/indochina-2004.urls.gz • SU .uk URL http:// data.law.di.unimi.it/webdata/uk-2005/uk-2005.urls.gz 1 MiB σ 1: σ SW 227.2 11,519,354 20.7 199 SI 612.9 7,414,866 86.7 98 SU 2,723.3 39,459,925 72.4 103 2: (a) SW - -C-Trie 90.9 11,580,413 3,027,555 12.8 PD-Trie 90.3 11,519,354 3,762,732 14.4 Front-Coding 83.0 10,079,434 2,921,168 13.5 (b) SI - -C-Trie 135.8 6,472,460 1,251,149 39.1 PD-Trie 134.6 7,414,866 1,300,192 44.7 Front-Coding 121.6 6,488,007 1,204,600 42.5 (c) SU - -C-Trie 663.3 41,001,910 11,396,164 32.5 PD-Trie 655.4 39,459,925 11,958,074 35.2 Front-Coding 591.7 34,527,434 10,756,301 34.2
C-Trie PD-Trie Front-Coding 2 MiB -2 18–33% 5. 2 3 % Lookup Access 1 Lookup/Access 100 100 ID 10 C-Trie 3a TAIL PRDT 1.4 RCT2 Lookup/Access TAIL SW RPDT SI SU FBC4 TAIL RCT1 C-Trie PD-Trie 3b Plain Plain 3.4 Re-Pair Plain 15 RCT2 BC16 Lookup/Access
3:
(a) C-Trie
SW SI SU
Lookup Access Lookup Access Lookup Access TAIL 8.5 34.0 1.11 1.13 8.3 11.7 1.47 1.87 48.1 19.0 2.46 2.65 RCT1 10.1 26.1 1.61 1.59 9.1 8.1 2.87 3.20 57.0 13.4 4.47 4.51 RCT2 10.4 24.9 1.68 1.67 9.3 7.2 3.44 3.82 59.5 12.1 5.22 5.22 RPDT 11.8 27.6 1.39 1.42 9.6 9.5 2.25 2.73 65.2 15.0 3.53 3.70 BC4 11.2 30.8 2.05 1.37 9.2 9.6 4.26 2.55 61.4 15.5 5.24 3.59 BC8 11.1 29.2 2.23 1.45 9.2 9.1 5.67 3.56 61.3 14.6 5.87 3.95 BC16 11.1 28.3 2.59 1.63 9.2 8.8 7.45 4.66 61.1 14.1 7.07 4.64 FBC4 11.1 31.6 1.46 1.32 9.2 9.9 2.07 2.34 61.7 15.9 3.28 3.19 FBC8 11.1 30.9 1.53 1.36 9.3 9.8 2.37 2.61 61.2 15.5 3.47 3.29 FBC16 11.0 30.9 1.65 1.43 9.2 9.9 2.86 2.84 61.4 15.5 3.83 3.50 (b) PD-Trie SW SI SU
Lookup Access Lookup Access Lookup Access Plain 2.1 49.6 1.25 1.42 2.5 24.5 1.39 1.78 12.8 27.1 1.89 2.26 Re-Pair 29.0 31.6 1.31 1.48 20.3 11.8 1.63 2.00 185.6 17.5 2.06 2.42 TAIL 4.7 41.7 1.13 1.26 4.0 13.5 1.23 1.56 24.1 20.7 1.67 1.96 RCT1 6.7 31.3 2.51 2.57 4.8 9.1 2.91 3.16 32.3 14.8 4.14 4.28 RCT2 7.2 29.3 3.66 3.63 5.0 8.2 4.55 4.75 35.5 13.4 6.55 6.57 RPDT 7.2 32.4 1.51 1.60 4.8 10.7 1.68 1.92 33.8 16.4 2.48 2.67 BC4 6.5 36.0 3.38 3.66 4.5 10.9 5.24 5.75 31.0 16.9 5.71 6.24 BC8 6.5 33.8 3.71 3.96 4.5 10.3 6.09 6.64 31.1 15.9 6.58 7.12 BC16 6.5 32.7 4.48 4.77 4.5 10.0 7.73 8.27 30.9 15.4 8.28 8.79 FBC4 6.5 37.0 1.61 1.74 4.5 11.3 1.78 2.08 31.1 17.3 2.42 2.69 FBC8 6.5 35.9 1.73 1.85 4.5 11.2 2.00 2.30 31.2 16.9 2.64 2.92 FBC16 6.5 35.9 1.97 2.12 4.5 11.3 2.40 2.73 31.2 16.9 3.03 3.33 (c) Front-Coding SW SI SU
Lookup Access Lookup Access Lookup Access Plain 0.8 59.6 1.11 0.37 0.8 34.9 1.31 0.33 3.8 37.3 1.60 0.37 Re-Pair 470.9 36.5 2.09 1.31 1363.8 18.2 2.67 1.66 5861.6 22.3 3.29 1.91 TAIL 3.7 45.3 1.39 0.69 2.7 25.0 1.72 0.78 17.7 31.4 2.26 0.88 RCT1 5.2 37.4 2.42 1.50 3.6 21.1 2.30 1.22 26.7 25.7 3.84 2.13 RCT2 5.6 36.0 2.63 1.69 3.8 20.1 2.76 1.62 29.2 24.4 4.66 2.85 RPDT 6.2 38.7 1.95 1.10 3.9 22.3 2.13 1.08 30.6 27.1 3.31 1.73 BC4 5.7 41.9 1.82 1.03 3.6 22.5 2.27 1.22 27.3 27.7 3.00 1.54 BC8 5.6 40.2 2.01 1.19 3.6 22.0 2.60 1.54 27.2 26.8 3.40 1.87 BC16 5.6 39.4 2.36 1.56 3.6 21.7 3.37 2.30 27.3 26.4 4.31 2.72 FBC4 5.7 42.7 1.73 0.95 3.6 22.9 1.98 0.98 27.4 28.1 2.64 1.22 FBC8 5.7 42.0 1.78 1.01 3.6 22.7 2.04 1.04 27.4 27.7 2.72 1.30 FBC16 5.6 42.0 1.90 1.14 3.6 22.9 2.20 1.21 27.3 27.7 2.92 1.49 RPDT FBC4 FBC8 Re-Pair SW FBC RPDT FBC4 FBC8 Plain TAIL Lookup/Access Elias-Fano 1 TAIL SI SU Re-Pair 2–3% TAIL Lookup/Access TAIL Front-Coding 3c Lookup/Access Plain Plain Re-Pair SW 127 SI 505 SU 331 Front-Coding Re-Pair Re-Pair SI SU Lookup/Access TAIL 1.5–2.2 FBC4 1.2–1.7 RPDT FBC4 Re-Pair RPDT FBC4 2–6% Lookup/Access 1.1–1.7 76–379 Re-Pair
6.
PD-Trie Front-Coding Re-Pair Re-Pair RPDT FBC Re-Pair Lookup/Access 1 PD-Trie Elias-Fano[
]
[1] Miguel A Mart´ınez-Prieto, Nieves Brisaboa, Rodrigo C´anovas, Francisco Claude, and Gonzalo Navarro. Prac-tical compressed string dictionaries. Information Sys-tems, 56:73–108, 2016.
[2] Roberto Grossi and Giuseppe Ottaviano. Fast com-pressed tries through path decompositions. ACM Jour-nal of Experimental Algorithmics, 19(1):Article 1.8, 2014.
[3] Julian Arz and Johannes Fischer. LZ-compressed string dictionaries. In DCC, pages 322–331, 2014.
[4] Shunsuke Kanda, Kazuhiro Morita, and Masao Fuketa. Compressed double-array tries for string dictionaries supporting fast lookup. Knowledge and Information Systems, 51(3):1023–1042, 2017.
[5] . Prefix/patricia trie . In
, 2011.
[6] Ian H Witten, Alistair Moffat, and Timothy C Bell. Managing gigabytes: compressing and indexing docu-ments and images. Morgan Kaufmann, San Francisco, CA, USA, 1999.
[7] Donald E Knuth. The art of computer programming, 3: sorting and searching. Addison Wesley, Redwood City, CA, USA, 2nd edition, 1998.
[8] Paolo Ferragina, Roberto Grossi, Ankur Gupta, Rahul Shah, and Jeffrey Scott Vitter. On searching com-pressed string collections cache-obliviously. In PODS, pages 181–190. ACM, 2008.
[9] N. Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proc. IEEE, 88(11):1722–1732, 2000.
[10] Guy Jacobson. Space-efficient static trees and graphs. In FOCS, pages 549–554. IEEE, 1989.
[11] David Benoit, Erik D Demaine, J Ian Munro, Rajeev Raman, Venkatesh Raman, and S Srinivasa Rao. Repre-senting trees of higher degree. Algorithmica, 43(4):275– 292, 2005.
[12] J Ian Munro and Venkatesh Raman. Succinct represen-tation of balanced parentheses and static trees. SIAM Journal on Computing, 31(3):762–776, 2001.
[13] Peter Elias. Efficient storage and retrieval by con-tent and address of static files. Journal of the ACM, 21(2):246–260, 1974.
[14] Robert Mario Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT, Cambridge, MA, 1971.
[15] Roberto Grossi and Giuseppe Ottaviano. Design of prac-tical succinct data structures for large data collections. In SEA, pages 5–17, 2013.
[16] Hugh E Williams and Justin Zobel. Compressing inte-gers for fast file access. Computer Journal, 42(3):193– 201, 1999.
[17] Francisco Claude and Gonzalo Navarro. Fast and com-pact web graph representations. ACM Transactions on the Web, 4(4):16, 2010.
[18] Ingo M ¨uller, Cornelius Ratsch, and Franz Faerber. Adaptive string dictionary compression in in-memory column-store database systems. In EDBT, pages 283– 294, 2014.