2011-11-08

BGZF - Blocked, Bigger & Better GZIP!

BAM files are compressed using a variant of GZIP (GNU ZIP), called BGZF (Blocked GNU Zip Format). Anyone who has read the SAM/BAM Specification will have seen the terms BGZF and virtual offsets, but what you may not realise is how general purpose this is for random access to any large compressed file. The take home message is:
BGZF files are bigger than GZIP files, but they are much faster for random access.

Tucked away in samtools' sister tool set tabix is a little command line tool called bgzip (blocked gzip), which lets you compress any file in BGZF format. It also lets you decompress a BGZF file (nothing special here, they are a GZIP variant so you can use gunzip for this if you like). However, it can also decompress a requested region of the file - without the excessive IO required to do this with a traditional gzipped file.

This next paragraph assumes you are familiar with the programming ideas of seek and tell for low level file IO. If you seek on a file handle, it means jump to a particular point in the file - usually a simple offset which counts the number of bytes from the start of the file. If you then read some data, the position in the file changes. Calling tell tells you where in the file you are now. This idea also works for writing data (you can open files for writing with random access).
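
For example, in Python (assuming some file called example.txt of at least a few hundred bytes exists - the filename here is just made up):

# Plain seek/tell on an ordinary uncompressed file
handle = open("example.txt", "rb")
print(handle.tell())    # 0 - we start at the beginning
handle.seek(100)        # jump straight to byte offset 100
data = handle.read(50)  # read 50 bytes from there
print(handle.tell())    # 150 - reading moved us forward
handle.close()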

BGZF files (including BAM files) consist of many GZIP blocks, each limited to 64kb on disk and holding at most 64kb of uncompressed data. The virtual offset is an unsigned 64 bit integer which combines a 48 bit block start offset and a 16 bit within block offset. What this means is that to get a particular bit of data from the BGZF file, you seek to a virtual offset, which is translated into a seek to the relevant BGZF block. That block can be read and decompressed in memory (or incrementally), and then the within block offset tells you where the requested data begins. Just one disk seek is needed (except for the special case where the data continues into the subsequent block).
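
As a minimal sketch of how that packing works (Biopython's Bio.bgzf module ended up with helper functions along these lines, but this is just for illustration):

def make_virtual_offset(block_start_offset, within_block_offset):
    # Pack a BGZF virtual offset: the high 48 bits are the block's start
    # offset on disk, the low 16 bits are the offset within the
    # decompressed block.
    assert 0 <= within_block_offset < 2**16
    assert 0 <= block_start_offset < 2**48
    return (block_start_offset << 16) | within_block_offset

def split_virtual_offset(virtual_offset):
    # Reverse the packing, giving (block_start_offset, within_block_offset)
    return virtual_offset >> 16, virtual_offset & 0xFFFF

# e.g. byte 42 within the BGZF block starting at (compressed) offset 65536:
voffset = make_virtual_offset(65536, 42)
assert split_virtual_offset(voffset) == (65536, 42)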

BGZF files are bigger than GZIP files!  (Boo!)

BGZF files are bigger than traditional GZIP files. Each block requires its own header data, and compressing many units of data separately isn't as space efficient as compressing one large data file.

As an example of this size penalty, let's look at the UniProt/TrEMBL data files - the final column shows the extra disk space taken using BGZF rather than GZIP, and this is noticeably higher for the "SwissProt" plain text format and the XML files (which have a lot of repeated text that each BGZF block must compress independently):

Database              Format          Raw       GZIP      BGZF      Overhead
Uniprot SProt         FASTA           252.7 Mb  78.9 Mb   82.0 Mb   4%
(532,792 records)     SwissProt Text  2.52 Gb   448.6 Mb  534.9 Mb  19%
                      UniProt XML     5.26 Gb   773.2 Mb  896.0 Mb  16%
Uniprot TrEMBL        FASTA           7.94 Gb   3.50 Gb   3.58 Gb   2%
(17,651,715 records)  SwissProt Text  44.66 Gb  7.23 Gb   8.29 Gb   15%
                      UniProt XML     90.83 Gb  14.21 Gb  16.21 Gb  14%

The GZIP sizes are for the files as downloaded from the EBI Mirror FTP site. I then decompressed them, and re-compressed them as BGZF using bgzip from samtools. I presume that both the original GZIP files and the BGZF files use the same default ZLIB compression settings.

The 64kb block size currently used in BGZF hurts its compression of repeat-rich files like XML, although this isn't an issue with compact records like FASTA, FASTQ, SAM, or BAM. This suggests an increased block size would benefit applying BGZF to XML file formats, with the significant downside that it would require a different virtual offset scheme.

BGZF files are better than GZIP files!  (Yay!)

BGZF is intended to improve on GZIP for random access. For testing, my use case is to randomly select some SProt record IDs, extract those records (in that random order) from the parent file, and write them to a new subset file. Note that for the XML file, you'd also want to include an XML header/footer. I'm going to do this with a modified version of Biopython's index support in Bio.SeqIO, currently on my bgzf branch on github.

Uniprot SProt (532,792 records)
Format  Task             Raw       GZIP      BGZF
FASTA   Size             252.7 Mb  78.9 Mb   82.0 Mb
FASTA   Index            9s        40s       18s
FASTA   Extract 10       0s        10s       0s
        Extract 50       0s        50s       0s
        Extract 100      0s        106s      0s
        Extract 500      0s        505s      0s
        Extract 1,000    0s        1,025s    1s
        Extract 5,000    0s        5,043s    2s
        Extract 100,000  3s        100,464s  48s
        Extract 500,000  16s       -         242s
        Extract 532,792  17s       -         258s
FASTA   Shuffle 532,792  26s       -         277s
FASTA   Copy/decompress  2s        3s        3s

Uniprot SProt (532,792 records)
Format          Task             Raw      GZIP      BGZF
SwissProt Text  Size             2.52 Gb  448.6 Mb  534.9 Mb
SwissProt Text  Index            80s      479s      197s
SwissProt Text  Extract 10       0s       77s       0s
                Extract 50       0s       369s      0s
                Extract 100      0s       787s      0s
                Extract 500      0s       3,740s    0s
                Extract 1,000    0s       -         1s
                Extract 5,000    1s       -         3s
                Extract 100,000  17s      -         65s
                Extract 500,000  87s      -         325s
                Extract 532,792  93s      -         346s
SwissProt Text  Shuffle 532,792  172s     -         543s
SwissProt Text  Copy/decompress  21s      27s       27s

Uniprot SProt (532,792 records)
Format       Task             Raw      GZIP      BGZF
UniProt XML  Size             5.26 Gb  773.2 Mb  896.0 Mb
UniProt XML  Index            170s     1,194s    341s
UniProt XML  Extract 10       0s       157s      0s
             Extract 50       0s       642s      0s
             Extract 100      0s       1,352s    1s
             Extract 500      0s       6,508s    1s
             Extract 1,000    0s       -         1s
             Extract 5,000    2s       -         6s
             Extract 100,000  48s      -         112s
             Extract 500,000  240s     -         560s
             Extract 532,792  256s     -         593s
UniProt XML  Shuffle 532,792  427s     -         934s
UniProt XML  Copy/decompress  41s      54s       55s

The times are all in seconds, rounded to the nearest whole number. The "shuffle" time is the combination of indexing the file plus extracting all the records in the random order. The "-" entries are where I was not patient enough to let the GZIP random access task complete. The copy/decompress times used the Unix commands time and cat (plus gunzip for the GZIP and BGZF files) to give an idea of the minimal disk IO and decompression time if we didn't need random access.

The extract rows of the tables should make the point convincingly: while indexing and random access to a large BGZF file are slower than with the uncompressed file, they are reasonable, whereas trying to do this with a GZIP file is not a good idea. BGZF clearly trounces GZIP for random access.

Now there is a small catch - you can't do this with Biopython at home yet (update - you can with Biopython 1.60 onwards). Currently the Bio.SeqIO code doesn't support indexing compressed files. Indexing gzipped files was easy to add (we knew from earlier work the performance would be poor, which is why it was never included), but support for BGZF was a bit harder to implement for this experiment. The main problem was that the current code did things like taking the difference between the start and end offsets to get an entry's length. You can't take the difference of BGZF virtual offsets.
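
To illustrate why not, here's a tiny sketch using the helper functions from the Bio.bgzf module (as shipped in Biopython 1.60 onwards) - the offsets themselves are made up:

from Bio.bgzf import make_virtual_offset, split_virtual_offset

start = make_virtual_offset(0, 100)    # 100 bytes into the first block
end = make_virtual_offset(65536, 10)   # 10 bytes into a later block
# The high 48 bits hold compressed (on disk) block offsets, so the
# numeric difference has nothing to do with the record's length:
print(end - start)                # a huge, meaningless number
print(split_virtual_offset(end))  # (65536, 10)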

Note that the "official" Bio.SeqIO code (which currently only handles uncompressed files) could be a little faster to build the indexes because it assumes a normal file handle where the offsets behave naturally. Also, I'm deliberately just timing the smaller Uniprot SProt database, because it has few enough records to safely work with the index in memory. We could ask Biopython to index Uniprot TrEMBL with an SQLite3 database for the offsets (see the sketch below), but it isn't necessary to demonstrate my point here, which is the raw IO side of things, not the indexing scheme.
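
For reference, the SQLite3-backed indexing mentioned above would look something like this (a sketch only - the filenames are hypothetical, and SeqIO.index_db needs a reasonably recent Biopython):

from Bio import SeqIO

# Store the record offsets in an SQLite3 database file on disk, rather
# than holding the whole index in memory:
index = SeqIO.index_db("uniprot_trembl.idx", "uniprot_trembl.dat", "swiss")
print("Indexed %i records" % len(index))
record = index["F2D7G4"]  # random access to any record by its identifier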

Another proviso is that it may be possible to improve the GZIP speed. I'm just using the Python gzip library via gzip.open(...) and the seek/tell methods it provides, so any optimisation (or lack of it) falls to the Python standard library. Internally this calls the Python zlib library, which is what I'm using to implement BGZF in Biopython. On the other hand, I'm pretty sure my BGZF code can be sped up after some profiling, which I haven't tried yet.
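
This is roughly what the GZIP timings above boil down to under the hood (with one of the real filenames from this experiment) - the seek and tell work on decompressed offsets, so a seek can mean decompressing everything up to that point:

import gzip

handle = gzip.open("uniprot_sprot.fasta.gz")
handle.seek(1000000)   # slow - has to decompress about 1 MB to get here
data = handle.read(50)
print(handle.tell())   # 1000050, a decompressed offset (unlike BGZF)
handle.close()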

Test Code

First, the following quick shell commands extracted all the record names from the FASTA file (using grep and cut), randomised them (using a little Perl), and saved them as plain text, one name per line.

$ time grep "^>" uniprot_sprot.fasta
 | cut -c 2- | cut -f 1 -d " "
 | perl -MList::Util=shuffle -e 'print shuffle(<stdin>);'
 > uniprot_sprot.fasta.rnd

real 0m3.570s
user 0m5.902s
sys 0m0.523s

$ time cut -d "|" -f 2 uniprot_sprot.fasta.rnd
 > uniprot_sprot.swiss.rnd

real 0m0.361s
user 0m0.329s
sys 0m0.028s

Lines wrapped at the pipe and redirection characters. These are the input files for my random access trials - the two ID files handle the different ID formats used by default for the different filetypes in Biopython (e.g. tr|F2D7G4|F2D7G4_HORVD from the FASTA header vs just F2D7G4 from the SwissProt plain text file).

The Python code boils down to some outer loops to set the filenames and formats, a call to build an in-memory index (with the experimental support for GZIP and BGZF compression), and another loop to time extracting various numbers of records from the file. The timings shown above are the best of three runs:

import time
from Bio import SeqIO

# filename, format and comp (compression) are set by the outer loops (not shown)
start = time.time()
index = SeqIO.index(filename, format, compression=comp)
time_index = time.time()-start

print "SeqIO.index(%r, %r, compression=%r) took %0.1fs" \
      % (filename, format, comp, time_index)

Then I have a generator function to load the identifiers from the text file created earlier:

def load_ids(filename, limit = 10):
    count = 0
    with open(filename) as handle:
        for line in handle:
            name = line.strip()
            if name:
                yield name
                count += 1
                if count == limit:
                    break

And finally, we time using the index to extract various numbers of records:

for lim in [10,50,100,500,1000,5000,100000,500000,len(index)]:
    start = time.time()
    handle = open(out_filename, "w")
    for name in load_ids(id_filename, lim):
        handle.write(index.get_raw(name))
    handle.close()
    time_write = time.time()-start
    print "Writing %i in random order took %0.1fs" \
          % (limit, time_write)
    if lim == len(index):
        print "Shuffling %i took %0.1fs" \
              % (limit, time_index + time_write)

This requires the BGZF and GZIP support currently on my bgzf branch on github (update - the BGZF support is included in Biopython 1.60 onwards). That's about it.

BGZF is bigger but better than GZIP for random access

I'm still tweaking the BGZF code but will probably propose merging it into Biopython later this month.

However, the main point of this blog post was to try to raise awareness of BGZF as a suitable alternative to GZIP for when you want to compress a data file while retaining reasonably fast random access:
BGZF files are bigger than GZIP files, but they are much faster for random access.
For High Throughput Sequencing (HTS or HTSeq, aka Next Gen Seq or NGS) data, I still believe you should just use SAM/BAM for your unaligned reads, but for FASTQ / FASTA / QUAL files using BGZF is a more flexible alternative to plain GZIP (with a fairly small size overhead).

Also on the plus side, the separate self-contained GZIP blocks used in BGZF offer the possibility of multi-threading the decompression and data analysis at the block level, in addition to the random access benefits I've been focusing on here today.
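
As a rough sketch of what that block level parallelism could look like, using the Bio.bgzf module and Python's multiprocessing (example.fasta.bgz is a made up filename, and a real implementation would want to stream the blocks rather than load them all into memory first):

import zlib
from multiprocessing import Pool
from Bio import bgzf

def decompress_block(raw_bytes):
    # Each BGZF block is a self-contained gzip member, so it can be
    # decompressed independently of its neighbours:
    return zlib.decompress(raw_bytes, zlib.MAX_WBITS | 16)

filename = "example.fasta.bgz"
with open(filename, "rb") as handle:
    # Find where each block starts and how long it is on disk...
    blocks = list(bgzf.BgzfBlocks(handle))
    # ...then pull out the raw compressed bytes for each block:
    raw_blocks = []
    for start, raw_length, data_start, data_length in blocks:
        handle.seek(start)
        raw_blocks.append(handle.read(raw_length))

# Decompress the blocks in parallel, then stitch the results together:
pool = Pool()
data = b"".join(pool.map(decompress_block, raw_blocks))
pool.close()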

If you are interested in the Biopython development side, please join the Biopython-dev mailing list and read this thread, while for any discussion/feedback/comments on the general use of BGZF, I've started a thread on SEQanswers.com.

Update (June 2012)

Biopython 1.60 includes module Bio.bgzf for working with BGZF files, and Bio.SeqIO's indexing functions will use this to support BGZF compressed sequence files.
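
In outline, that looks something like this sketch (the filenames and record identifier are made up):

from Bio import SeqIO, bgzf

# Make a BGZF compressed copy of a FASTA file from within Python
# (the bgzip command line tool would do the same job):
writer = bgzf.BgzfWriter("example.fasta.bgz", "wb")
with open("example.fasta") as source:
    writer.write(source.read())  # fine for a demo; stream in chunks for big files
writer.close()  # flushes the final block and writes the BGZF EOF marker

# Bio.SeqIO's indexing detects the BGZF compression automatically:
index = SeqIO.index("example.fasta.bgz", "fasta")
raw = index.get_raw("some_id")  # raw record text via a single disk seek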

2 comments:

  1. do you have a snippet which can explain how to extract sequence from a bam file, without samtools, with biopython. thx

    1. No. The standard Biopython releases as of Biopython 1.62 do NOT include SAM/BAM support...
