Compressed text files

Theoretically, I had been aware that archiving and compressing are two separate matters, but still, when I first saw a .gz file that was not a .tar.gz file, I was surprised.

It was an old log file that contained some data I needed to grep. My first instinct? Decompress it, creating a new file, and then grep the new file.

That was completely unnecessary, because gzip can print the decompressed contents to standard output instead. If I had bothered to read the gzip man page back then, I would have known that. Fortunately, human interaction is an OK substitute for reading manuals: a coworker told me I could just use zcat.

Compressing text files matters most when keeping old log files around. When I look into my /var/log/ directory, I can see a lot of .log.gz and .log.bz2 files. I want to know how to deal with those two types.

.gz

gzip is a compression/decompression tool using Lempel-Ziv coding (LZ77). It deals with files with the extension .gz.

Compress a file

This will create a new karma-police.txt.gz file:

$ gzip -k karma-police.txt

I’m using -k to keep the original karma-police.txt. The default behavior of gzip is to delete the input file.
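
Not every gzip I have run into accepts -k (as far as I know, older GNU gzip releases lack it). Here is a minimal sketch of a fallback, assuming a throwaway directory and a two-line sample file made up for the demo: -c sends the compressed bytes to standard output, so the original file is never touched.

```shell
# Work in a throwaway directory (an assumption for this demo).
workdir=$(mktemp -d)
printf 'Karma police, arrest this man\nHe talks in maths\n' > "$workdir/karma-police.txt"

# -c writes the compressed stream to stdout; the input file stays in place.
gzip -c "$workdir/karma-police.txt" > "$workdir/karma-police.txt.gz"

# Both the original and the compressed copy should now exist.
ls "$workdir"
```

The redirect decides the output name, so this form also works when the compressed copy should land in a different directory.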

Decompress a file

This will create a new karma-police.txt file:

$ gzip -dk karma-police.txt.gz

The -d flag means that the tool should decompress instead of compressing.
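
Before deleting an original, I like to check that the compressed copy really decompresses back to the same bytes. A small sketch under the same assumptions as before (temporary directory, made-up sample content): -t tests the archive without writing anything, and cmp compares the decompressed stream with the original.

```shell
workdir=$(mktemp -d)
printf 'Karma police, arrest this man\n' > "$workdir/karma-police.txt"
gzip -c "$workdir/karma-police.txt" > "$workdir/karma-police.txt.gz"

# -t only checks the integrity of the compressed file; no file is created.
gzip -t "$workdir/karma-police.txt.gz" && echo "archive OK"

# Decompress to stdout and compare byte for byte with the original.
# "-" makes cmp read its first operand from standard input.
gzip -dc "$workdir/karma-police.txt.gz" | cmp -s - "$workdir/karma-police.txt" && echo "round trip OK"
```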

Read a compressed file

This is not a useful way to do it:

$ cat karma-police.txt.gz
�+LXkarma-police.txt�QAN�0
                          ��s��~h�>`Z���&+�٪�z\H ��R�h&�g<9�.�KI�����`Q*���0Js�d��μ�ۍ+��
��I����՟�"Z��=t���ꕔ:�Qg�x.ΥM����                                                        ¨2L����Z�J����JL��;'��W$ѡ@jx�Y�12'HJ�.X�k��^�FW��t!�-��}��5�a+
|PF5�%{��7-���(꿱Hnƻ�z�R��r?��y=b�mۿ	�۱
                                          "�angelika at MBP in ~

On Linux, there is a tool called zcat that reads .Z and .gz files. On Mac OS X, zcat looks only for the .Z suffix, so it does not work for .gz files:

$ zcat karma-police.txt.gz
zcat: can't stat: karma-police.txt.gz (karma-police.txt.gz.Z): No such file or directory

I have to use gzcat instead:

$ gzcat karma-police.txt.gz | head -n 2
Karma police, arrest this man
He talks in maths

Using zcat or gzcat is the same as using gunzip -c. The -c flag means that the output will go to the standard output stream, leaving files intact.

$ gunzip -c karma-police.txt.gz | head -n 4 | tail -n 2
He buzzes like a fridge
He's like a detuned radio

In turn, using gunzip is the same as using gzip -d. Instead of having to guess if the system I am currently logged into uses zcat or gzcat, I can just use:

$ gzip -dc karma-police.txt.gz | head -n 7 | tail -n 3
Karma police, arrest this girl
Her Hitler hairdo is
Making me feel ill
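
To stop thinking about zcat versus gzcat altogether, the gzip -dc form can be wrapped in a tiny shell function. This is only a sketch: gzread is a name I made up, not a standard command, and the sample file is created just for the demo.

```shell
# gzread is a hypothetical helper name, not a standard tool.
# It forwards all its arguments to gzip -dc, which exists wherever gzip does.
gzread() {
  gzip -dc "$@"
}

workdir=$(mktemp -d)
printf 'Karma police, arrest this man\nHe talks in maths\n' | gzip -c > "$workdir/karma-police.txt.gz"

# Read by file name...
gzread "$workdir/karma-police.txt.gz" | head -n 1

# ...or from standard input, where the file suffix plays no role at all.
gzread < "$workdir/karma-police.txt.gz" | tail -n 1
```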

For reading very long files comfortably, there are zmore and zless.

Grep a compressed file

$ zgrep Karma karma-police.txt.gz
Karma police, arrest this man
Karma police, arrest this girl
Karma police

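On Mac OS X, zgrep is equivalent to grep -Z. That holds for the BSD grep shipped there; in GNU grep, -Z means --null, which is something else entirely.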
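
When I am not sure which grep or zgrep flavor a machine has, decompressing to standard output and piping into plain grep always works. A rough sketch, assuming two rotated log files whose names (app.log.1.gz, app.log.2.gz) and contents I invented for the demo:

```shell
workdir=$(mktemp -d)
printf 'Karma police, arrest this man\nHe talks in maths\n' | gzip -c > "$workdir/app.log.1.gz"
printf 'Karma police, arrest this girl\n' | gzip -c > "$workdir/app.log.2.gz"

for f in "$workdir"/*.gz; do
  # grep -c prints the number of matching lines instead of the lines themselves.
  count=$(gzip -dc "$f" | grep -c 'Karma')
  # ${f##*/} strips the directory part, leaving only the file name.
  printf '%s: %s\n' "${f##*/}" "$count"
done
```

Looping file by file keeps the file name next to each result, which a single gzip -dc over all files would lose.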

.bz2

bzip2 is a compression/decompression tool using the Burrows–Wheeler algorithm. It deals with files with the extension .bz2.

Compress a file

$ bzip2 -k paranoid-android.txt

Just like gzip, bzip2 removes its input files unless told to keep them with -k.

Decompress a file

$ bzip2 -dk paranoid-android.txt.bz2

Read a compressed file

Just as there is gzcat for reading .gz files, there is bzcat for reading .bz2 files:

$ bzcat paranoid-android.txt.bz2 | head -n 2
Please could you stop the noise, I'm trying to get some rest
From all the unborn chicken voices in my head

Using bzcat is equivalent to using bunzip2 -c:

$ bunzip2 -c paranoid-android.txt.bz2 | head -n 4 | tail -n 2
What's that...? (I may be paranoid, but not an android)
What's that...? (I may be paranoid, but not an android)

Using bunzip2 -c is equivalent to using bzip2 -dc:

$ bzip2 -dc paranoid-android.txt.bz2 | head -n 6 | tail -n 2
When I am king, you will be first against the wall
With your opinion which is of no consequence at all

For reading very long files comfortably, there are bzmore and bzless.
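
Since a log directory usually mixes plain, .gz and .bz2 files, it can be handy to pick the decompressor by file suffix. The following is a sketch only: logcat is a hypothetical helper name, the sample files are invented, and the bzip2 part of the demo is skipped when bzip2 is not installed.

```shell
# logcat is a hypothetical helper: print each file, decompressing by suffix.
logcat() {
  for f in "$@"; do
    case "$f" in
      *.gz)  gzip -dc "$f" ;;
      *.bz2) bzip2 -dc "$f" ;;
      *)     cat "$f" ;;
    esac
  done
}

workdir=$(mktemp -d)
printf 'plain line\n' > "$workdir/a.log"
printf 'gzip line\n' | gzip -c > "$workdir/b.log.gz"
logcat "$workdir/a.log" "$workdir/b.log.gz"

# Only exercise the .bz2 branch when bzip2 is actually available.
if command -v bzip2 >/dev/null 2>&1; then
  printf 'bzip2 line\n' | bzip2 -c > "$workdir/c.log.bz2"
  logcat "$workdir/c.log.bz2"
fi
```

Matching on the suffix is a simplification; a compressed file with an unexpected name would fall through to cat and print raw bytes.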

Grep a compressed file

$ bzgrep android paranoid-android.txt.bz2
What's that...? (I may be paranoid, but not an android)
What's that...? (I may be paranoid, but not an android)
What's that...? (I may be paranoid, but no android)
What's that...? (I may be paranoid, but no android)

On Mac OS X, bzgrep is equivalent to grep -J, an option of the BSD grep shipped there.

Other tools

Vim

You can open, read, edit and save .gz and .bz2 files using Vim with no problems.

Comparing files

For comparing compressed files, there are zdiff and zcmp, which are the equivalents of diff and cmp. On Linux, you can use them to compare .gz files. On Mac OS X, in theory, you could use them to compare many different compressed files, as this fragment of my /usr/bin/zdiff suggests:

#!/bin/sh -
check_suffix() {
  case "$1" in
  *[._-][Zz])
    setvar $2 "${1%??}"
    setvar $3 "gzip -cdqf"
    ;;
  *[._-]bz)
    setvar $2 "${1%???}"
    setvar $3 "bzip2 -cdqf"
    ;;
  *[._-]gz)
    setvar $2 "${1%???}"
    setvar $3 "gzip -cdqf"
    ;;
  # lots of different extensions omitted here
  esac
}

However, setvar is not a valid command in my sh, so zdiff never picks a decompressor and ends up comparing the raw compressed bytes:

$ zdiff karma-police.txt.gz karma-police-NO-FRIDGES.txt.gz
/usr/bin/zdiff: line 49: setvar: command not found
/usr/bin/zdiff: line 50: setvar: command not found
/usr/bin/zdiff: line 49: setvar: command not found
/usr/bin/zdiff: line 50: setvar: command not found
Binary files karma-police.txt.gz and karma-police-NO-FRIDGES.txt.gz differ

Fortunately, there is a different way to see that diff without recreating decompressed files:

$ diff <(gzip -dc karma-police.txt.gz) <(gzip -dc karma-police-NO-FRIDGES.txt.gz)
3c3
< He buzzes like a fridge
---
> He buzzes like a ******
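
The <( ... ) syntax is process substitution, which bash and zsh offer but plain POSIX sh does not. When only one side is compressed, for example an old rotated file against the current one, a pipe is enough. A minimal sketch with invented file names and contents:

```shell
workdir=$(mktemp -d)
printf 'He talks in maths\nHe buzzes like a fridge\n' > "$workdir/current.txt"
printf 'He talks in maths\nHe buzzes like a ******\n' | gzip -c > "$workdir/old.txt.gz"

# "-" tells diff to read that operand from standard input.
# diff exits with status 1 when the inputs differ, which is not an error here.
gzip -dc "$workdir/old.txt.gz" | diff - "$workdir/current.txt" || echo "(exit status 1: the files differ)"
```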