# Lossless Text Data Compression

Abstract: Over the last decade the amount of textual information available in electronic form has exploded. It is estimated that text data currently comprises nearly half of all Internet traffic, but as of yet, no lossless compression standard for text has been proposed.

A number of lossless text compression algorithms exist, however, none of these methods is able to consistently reach its theoretical best-case compression ratio.

This paper evaluates the performance characteristics of several popular compression algorithms and explores two strategies for improving ratios without significantly impacting computation time.

#### Solution Preview

Please see the attached file.

=======================================================================

Improving the Efficiency of Lossless Text Data Compression Algorithms

A comparison of two reversible transforms

James R. Achuff

Penn State Great Valley

School of Graduate Professional Studies

30 East Swedesford Road, Malvern, PA 19355, USA

achuffj@safeplace.net

Abstract: Over the last decade the amount of textual information available in electronic form has exploded. It is estimated that text data currently comprises nearly half of all Internet traffic, but as of yet, no lossless compression standard for text has been proposed.

A number of lossless text compression algorithms exist, however, none of these methods is able to consistently reach its theoretical best-case compression ratio.

This paper evaluates the performance characteristics of several popular compression algorithms and explores two strategies for improving ratios without significantly impacting computation time.

Key words: Text Compression, Lossless Compression, Reversible Transform

1. INTRODUCTION

Compression means making things smaller by applying pressure. Data compression means reducing the amount of bits needed to represent a particular piece of data. Text compression means reducing the amount of bits or bytes needed to store textual information. It is necessary that the compressed form can be decompressed to reconstitute the original text, and it is usually important that the original is recreated exactly, not approximately. This differentiates text compression from many other kinds of data reduction, such as voice or picture coding, where some degradation of the signal may be tolerable if the compression achieved is worth the reduction in quality. [Bell, Cleary & Witten, 1990]

The immutable yardstick by which data compression is measured is the "compression ratio", or ratio of the size of a compressed file to the original uncompressed file. For example, suppose a data file takes up 100 kilobytes (KB). Using data compression software, that file could be reduced in size to, say, 50 KB, making it easier to store on disk and faster to transmit over a network connection. In this specific case, the data compression software reduces the size of the data file by a factor of two, or results in a "compression ratio" of 2:1.

There are "lossless" and "lossy" forms of data compression. Lossless data compression is used when the data has to be uncompressed exactly as it was before compression. Text files are stored using lossless techniques, since losing a single character can in the worst case make the text dangerously misleading. Lossless compression ratios are generally in the range of 2:1 to 8:1.

Compression algorithms reduce the redundancy in data to decrease the storage requirements for that data. Data compression offers an attractive approach to reducing communications and storage costs by using available bandwidth effectively. With the trend of increasing amounts of digital data being transmitted over public and private networks expected to continue, it makes sense to pursue research on developing algorithms that can most effectively use available network bandwidth by maximally compressing data. This paper is focused on addressing this problem for lossless compression of text files. It is well known that there are theoretical predictions on how far a source file can be losslessly compressed [Shannon, 1951], but no existing compression approaches consistently attain these bounds over wide classes of text files.

One approach to tackling the problem of developing methods to improve compression is to develop better compression algorithms. However, given the sophistication of existing algorithms such as arithmetic coding, Lempel-Ziv algorithms, Dynamic Markov Coding, Prediction by Partial Match and their variants, it seems unlikely that major new progress will be made in this area.

An alternate approach, which is taken in this paper, is to perform a lossless, reversible transformation to a source file prior to applying an existing compression algorithm. This transformation is designed to make it easier to compress the source file. Figure 1 illustrates this strategy. The original text file is provided as input to the transformation, which outputs the transformed text. This output is provided to an existing, unmodified data compression algorithm, which compresses the transformed text. To decompress, on simply reverses the process by first invoking the appropriate decompression algorithm and then providing the resulting text to the inverse transform.

Figure 1. Text compression process involving a lossless, reversible transform

There are several important observations about this strategy. The transformation must be exactly reversible, so that the overall lossless text compression requirement is not compromised. The data compression and decompression algorithms are unmodified, so they do not exploit information about the transformation while compressing. The intent is to use the strategy to improve the overall compression ratio of the text in comparison with that achieved by the compression algorithm alone. A similar strategy has been employed in the compression of images and video transmissions using the Fourier transform, Discrete Cosine Transform or wavelet transforms. In these cases, however, the transforms are usually lossy, meaning that some data can be lost without compromising the interpretation of the image by a human.

One well-known example of the text compression strategy outlined in Figure 1 is the Burrows Wheeler Transform (BWT). BWT combines ad-hoc compression techniques (Run Length Encoding, Move to Front) and Huffman coding to provide one of the best compression ratios available on a wide range of data.

1.1 Lossless Text Compression Algorithms

As stated above, text compression ought to be exact - the reconstructed message should be identical to the original. Exact compression is also called noiseless (because it does not introduce any noise into the signal), lossless (since no information is lost), or reversible (because compression can be reversed to recover the original input exactly).

The task of finding a suitable model for text is an extremely important problem in compression. Data compression is inextricably bound up with prediction. In the extreme case, if one can predict infallibly what is going to come next, one can achieve perfect compression by dispensing with transmission altogether. Even if one can only predict approximately what is coming next, one can get by with transmitting just enough information to disambiguate the prediction. Once predictions are available, the are processed by an encoder that turns them into binary digits to be transmitted.

There are three ways that the encoder and decoder can maintain the same model: static, semiadaptive, and adaptive modelling. In static modelling the encoder and decoder agree on a fixed model, regardless of the text to be encoded. This is the method employed when sending a message via Morse Code. In semiadaptive modelling, a "codebook" of the most frequently used words or phrases is transmitted first and then used to encode and decode the message. Adaptive modelling builds it's "codebook" as it progresses according to a predefined method. In this way, both the encoder and decoder use the same codebook without ever having to transmit the codes with the data.

1.1.1 Huffman Coding

In 1952, D. A. Huffman introduced his method for the construction of minimum redundancy codes - now more commonly known as "Huffman Coding". In Huffman Coding, the characters in a data file are converted to a binary code, where the most common characters in the file ...

#### Solution Summary

Full Paper: Abstract: Over the last decade the amount of textual information available in electronic form has exploded. It is estimated that text data currently comprises nearly half of all Internet traffic, but as of yet, no lossless compression standard for text has been proposed.

A number of lossless text compression algorithms exist, however, none of these methods is able to consistently reach its theoretical best-case compression ratio.

This paper evaluates the performance characteristics of several popular compression algorithms and explores two strategies for improving ratios without significantly impacting computation time.