TY - GEN
T1 - Collapsing the Hierarchy of Compressed Data Structures
T2 - 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023
AU - Kempa, Dominik
AU - Kociumaka, Tomasz
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - The last two decades have witnessed a dramatic increase in the amount of highly repetitive datasets consisting of sequential data (strings, texts). Processing these massive amounts of data using conventional data structures is infeasible. This fueled the development of compressed text indexes, which efficiently answer various queries on a given text, typically in polylogarithmic time, while occupying space proportional to the compressed representation of the text. There exist numerous structures supporting queries ranging from simple 'local' queries, such as random access, through more complex ones, including longest common extension (LCE) queries, to the most powerful queries, such as the suffix array (SA) functionality. Alongside the rich repertoire of queries followed a detailed study of the trade-off between the size and functionality of compressed indexes (see: Navarro; ACM Comput. Surv. 2021). It is widely accepted that this hierarchy of structures tells a simple story: the more powerful the queries, the more space is needed. On the one hand, random access, the most basic query, can be supported using O(Δ log n log ΣΔ log n) space (where n is the length of the text, Σ is the alphabet size, and Δ is the text's substring complexity), which is known to be the asymptotically smallest space sufficient to represent any string with parameters n, Σ, and Δ (Kociumaka, Navarro, and Prezza; IEEE Trans. Inf. Theory 2023). The other end of the hierarchy is occupied by indexes supporting the suffix array queries. The currently smallest one takes O(r log nr) space, where r ≥ Δ is the number of runs in the Burrows-Wheeler Transform of the text (Gagie, Navarro, and Prezza; J. ACM 2020). We present a new compressed index, referred to as Δ SA, that supports the powerful SA functionality and needs only O(Δ log n log ΣΔ log n) space. This collapses the hierarchy of compressed data structures into a single point: The space required to represent the text is simultaneously sufficient to efficiently support the full SA functionality. Since suffix array queries are the most widely utilized queries in string processing and data compression, our result immediately improves the space complexity of dozens of algorithms, which can now be executed in Δ-optimal compressed space. The Δ-SA supports both suffix array and inverse suffix array queries in O(log 4+ϵ n) time (where ϵ > 0 is any predefined constant). Our second main result is an O(Δ polylog n)-time construction of the Δ-SA from the Lempel-Ziv (LZ77) parsing of the text. This is the first algorithm that builds an SA index in compressed time, i.e., time nearly linear in the compressed input size. For highly repetitive texts, this is up to exponentially faster than the previously best algorithm, which builds an O(r log nr)-size index in O(√Δ n polylog n) time. To obtain our results, we develop numerous new techniques of independent interest. This includes deterministic restricted recompression, Δ-compressed string synchronizing sets, and their construction in compressed time. We also improve many other auxiliary data structures; e.g., we show the first O(Δ log n log ΣΔ log n)-size index for LCE queries along with its efficient construction from the LZ77 parsing.
AB - The last two decades have witnessed a dramatic increase in the amount of highly repetitive datasets consisting of sequential data (strings, texts). Processing these massive amounts of data using conventional data structures is infeasible. This fueled the development of compressed text indexes, which efficiently answer various queries on a given text, typically in polylogarithmic time, while occupying space proportional to the compressed representation of the text. There exist numerous structures supporting queries ranging from simple 'local' queries, such as random access, through more complex ones, including longest common extension (LCE) queries, to the most powerful queries, such as the suffix array (SA) functionality. Alongside the rich repertoire of queries followed a detailed study of the trade-off between the size and functionality of compressed indexes (see: Navarro; ACM Comput. Surv. 2021). It is widely accepted that this hierarchy of structures tells a simple story: the more powerful the queries, the more space is needed. On the one hand, random access, the most basic query, can be supported using O(Δ log n log ΣΔ log n) space (where n is the length of the text, Σ is the alphabet size, and Δ is the text's substring complexity), which is known to be the asymptotically smallest space sufficient to represent any string with parameters n, Σ, and Δ (Kociumaka, Navarro, and Prezza; IEEE Trans. Inf. Theory 2023). The other end of the hierarchy is occupied by indexes supporting the suffix array queries. The currently smallest one takes O(r log nr) space, where r ≥ Δ is the number of runs in the Burrows-Wheeler Transform of the text (Gagie, Navarro, and Prezza; J. ACM 2020). We present a new compressed index, referred to as Δ SA, that supports the powerful SA functionality and needs only O(Δ log n log ΣΔ log n) space. This collapses the hierarchy of compressed data structures into a single point: The space required to represent the text is simultaneously sufficient to efficiently support the full SA functionality. Since suffix array queries are the most widely utilized queries in string processing and data compression, our result immediately improves the space complexity of dozens of algorithms, which can now be executed in Δ-optimal compressed space. The Δ-SA supports both suffix array and inverse suffix array queries in O(log 4+ϵ n) time (where ϵ > 0 is any predefined constant). Our second main result is an O(Δ polylog n)-time construction of the Δ-SA from the Lempel-Ziv (LZ77) parsing of the text. This is the first algorithm that builds an SA index in compressed time, i.e., time nearly linear in the compressed input size. For highly repetitive texts, this is up to exponentially faster than the previously best algorithm, which builds an O(r log nr)-size index in O(√Δ n polylog n) time. To obtain our results, we develop numerous new techniques of independent interest. This includes deterministic restricted recompression, Δ-compressed string synchronizing sets, and their construction in compressed time. We also improve many other auxiliary data structures; e.g., we show the first O(Δ log n log ΣΔ log n)-size index for LCE queries along with its efficient construction from the LZ77 parsing.
KW - compressed indexing
KW - data compression
KW - suffix array
KW - text indexing
UR - https://www.scopus.com/pages/publications/85180399816
U2 - 10.1109/FOCS57990.2023.00114
DO - 10.1109/FOCS57990.2023.00114
M3 - Conference contribution
T3 - Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS
SP - 1877
EP - 1886
BT - Proceedings - 2023 IEEE 64th Annual Symposium on Foundations of Computer Science, FOCS 2023
PB - IEEE Computer Society
Y2 - 6 November 2023 through 9 November 2023
ER -