|
ZPAQ is an open source (GPL) command line archiver for Windows and Linux. It uses a journaling or append-only format which can be rolled back to an earlier state to retrieve older versions of files and directories. It supports fast incremental update by adding only files whose last-modified date has changed since the previous update. It compresses using deduplication and several algorithms (LZ77, BWT, and context mixing) depending on the data type and the selected compression level. To preserve forward and backward compatibility between versions as the compression algorithm is improved, it stores the decompression algorithm in the archive. The ZPAQ source code includes a public domain API, libzpaq, which provides compression and decompression services to C++ applications. The format is believed to be unencumbered by patents. == Archive format == Files are saved in the ZPAQ level 2 journaling format.〔Mahoney, M. () The ZPAQ Open Standard for Highly Compressed Data - Level 2, June 3, 2013〕 The standard defines two formats - streaming and journaling. Only the journaling format supports deduplication, directory attributes, and multiple dated file versions. The streaming archive format is designed to be extracted in a single pass. An archive is divided into a sequence of blocks that can be decompressed independently in parallel. Blocks are divided into segments that must be decompressed sequentially in order. Each block header contains a description of the decompression algorithm. Each segment has a header containing an optional file name and an optional comment for meta-data such as size, date, and attributes, and an optional trailing SHA-1 checksum of the original data for integrity checking. If the file name is omitted, it is assumed to be a continuation of the last named file, which may be in the previous block. Thus, inserting, removing, or reordering the blocks in a streaming archive has the effect of performing the same operations on the data that the blocks represent. The journaling format consists of a sequence of transactions, or updates. An update contains 4 types of blocks: a transaction header block, a sequence of data blocks, a corresponding sequence of fragment tables, and a sequence of index blocks. A transaction header block contains the transaction date and a pointer skipping over the data blocks to allow the archive index to be read quickly. The data blocks contain a sequence of file fragments compressed together. The fragment tables give the size and SHA-1 hash of each fragment. The index blocks contain a list of edits to the global archive index. An edit is either a file update or a file deletion. An update includes a file name, last modified date, attributes, and a list of fragment pointers into the current and previous transactions. Fragments may be shared by more than one file. A deletion does not remove any data from the archive, but rather indicates that the file is not to be extracted unless the archive is rolled back to an earlier date. The ZPAQ standard does not specify a compression algorithm. Rather, it specifies a format for representing the decompression algorithm in the block headers. Decompression algorithms are written in a language called ZPAQL and stored as a byte code which can either be interpreted or converted directly to 32 or 64 bit x86 code and executed. A ZPAQL program has 3 parts. * COMP - An optional chain of context modeling components. * HCOMP - Machine code for computing contexts for the COMP components. * PCOMP - Optional machine code for post-processing the decoded data. The COMP models are based on PAQ, which compresses one bit at a time using arithmetic coding. There are 9 types of components. Each component takes a context and possibly the predictions of earlier components, and outputs a prediction or probability that the next bit will be a 1. The output of the last component is arithmetic coded. The component types are: * CONST - A fixed prediction. * CM - Context model. The context is used to look up a prediction in a table. On update, the selected entry is adjusted to reduce the prediction error. * ICM - Indirect context model. The context is used to look up an 8 bit state representing a recent bit history. The history selects a prediction as with a CM. * MIX - A group of predictions are combined by weighted averaging in the logistic domain, or log(p/(1-p)). The weights are selected by a context. On update, the weights are adjusted to favor the more accurate inputs. * MIX2 - A 2 input MIX with weights constrained to add to 1. * AVG - A MIX2 with fixed weights. * SSE - Secondary symbol estimator. Looks up a prediction from an interpolated table given a context and quantized prediction from another component. * ISSE - Indirect secondary symbol estimator. The context selects a bit history as with an ICM, and then the bit history selects a pair of weights to mix the input with a constant 1. * MATCH - Searches for the previous occurrence of the context and predicts whatever bit followed, with a strength depending on the length of the match. The HCOMP section computes the contexts for the components in the COMP section. It is a virtual machine whose state is 4 32-bit registers (A, B, C, D), a 16 bit program counter, a condition flag bit, and two memory arrays, one of bytes (M) and one of 32 bit words (H). The beginning of H forms the array of contexts. An assembly language-like program is called once for each coded or decoded byte with that byte as input in A. The final context seen by the COMP section is the computed context combined with the previously seen bits of the current byte. The optional PCOMP section is used for post-processing the decoded data. It runs in a separate virtual machine like that of HCOMP. However, unlike the COMP and HCOMP sections which are used for both compression and decompression, the PCOMP section is run only during decompression. The compressor is responsible for performing the inverse operation on the input data prior to coding. 抄文引用元・出典: フリー百科事典『 ウィキペディア(Wikipedia)』 ■ウィキペディアで「ZPAQ」の詳細全文を読む スポンサード リンク
|