The goal of our adventurous endeavour run against conventional and approved conduct that compression algorithms can not make any substantial use of powerful vector processing capacity of CUDA-enabled GPUs was to explore the performance speedup that can be achieved by utilizing the multi-core Compute Unified Device Architecture.
Among the available data compressors we chose bzip2, which in its turn uses the Burrows–Wheeler transform (which again has been used for memory reduction in human genome indexing), move-to-front transform and Huffman coding, and adapted BWT to CUDA to exploit the advantages of GPGPU computing.
For comparison testing we have only performed Burrows–Wheeler transform over 2 pixel-based images and 2 PDF files of different size on one CPU core and on GPU and so far have not approached the Huffman encoding which in fact doesn't offer any chances of parallelization for GPU-computations due to sequential nature and requirement for synchronization between blocks since they are dependent on each previous block's output. Therefore, theoretically, a significant increase in performance is subject to making full use of both GPU and CPU main assets in terms of their computational faculty.
As is evident from the execution time figures below, there is a noticeable 2x to 4x speedup.
BMP, 540 Kb
Full bzip2 compression on CPU - 218 ms
BW Transform on CPU - 171 ms
Full bzip2 compression on GPU - 93 ms [ minus 53% ]
BW Transform on GPU - 46 ms [ minus 73% ]
BMP, 1112 Kb
Full bzip2 compression on CPU - 467 ms
BW Transform on CPU - 343 ms
Full bzip2 compression on GPU - 249 ms [ minus 46% ]
BW Transform on GPU - 140 ms [ minus 59% ]
PDF, 1919 Kb
Full bzip2 compression on CPU - 1513 ms
BW Transform on CPU - 731 ms
Full bzip2 compression on GPU - 1107 ms [ minus 26% ]
BW Transform on GPU - 311 ms [ minus 57% ]
PDF, 3425 Kb
Full bzip2 compression on CPU - 2168 ms
BW Transform on CPU - 793 ms
Full bzip2 compression on GPU - 1856 ms [ minus 14% ]
BW Transform on GPU - 481 ms [ minus 39% ]
Inteli7 860 2,8 GHz, 4 Gb RAM with nVidiaGeForceGTX 260.
Current bzip2 application processes one bzip2-block per time. We use one CPU core versus GPU.
At the present time we are developing parallel bzip2 for complete comparison of full-power CPU and GPU.