Fast and easy gzipping on HiPerGator!

Published: June 16th, 2017

Category: DiBiG News, FrontPage

Compressing files with gzip is a very common operation, especially when trying to save disk space to stay within our quotas (see my previous post). Unfortunately, compressing a file is a very I/O-intensive operation. Combined with the fact that disk access on the HiPerGator head nodes is pretty slow, this means that gzipping a large number of files can take a very long time and slow down the system for all logged-in users. The right way to do it, of course, is to submit the gzip commands as jobs to the cluster. In this post I will show how to do this easily, even for a very large number of files.

The dibig_tools module includes a script called gzip.qsub that handles gzipping one or more files. As with all other dibig_tools scripts, you can use the submit command with the -vv option to display usage information:

> submit -vv gzip.qsub
gzip.qsub -  Compress one or more files

Arguments:

  Files to compress...
  If the first argument starts with @, the arguments are interpreted as:
  arg1 = file containing list of filenames to compress
  arg2 = index in arg1 file of first file to compress
  arg3 = index in arg1 file of last file to compress, 
         or number of lines to read if in the form +N.

Let’s go through this in detail. The basic way of using gzip.qsub is simply to list the files to be compressed on the command line, as you would do for the regular gzip command:

> submit gzip.qsub file1 file2 file3 ...

Just run this, and your files will get compressed quickly and efficiently, without tying up your terminal or affecting the head node.
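
The job runs on a compute node in the background. Assuming HiPerGator's SLURM scheduler, you can check whether your job is still running with the standard squeue command:

> squeue -u $USER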

If you have a lot of files to compress, it may be easier to list all their pathnames in a file, and use the ‘@’ notation. For example, assuming that you listed the names of all files you want to compress in the file bigfiles.txt, you can do:

> submit gzip.qsub @bigfiles.txt

Please note that when using the ‘@’ notation you can only supply a single file list to gzip.qsub (plus the optional range arguments described below).
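
As an illustration, here is one hypothetical way to build such a file list and submit it in a single job. The *.fastq pattern is just an example; adapt it to the files you actually want to compress:

> find . -name "*.fastq" > bigfiles.txt
> submit gzip.qsub @bigfiles.txt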

If you really have a lot of files to compress, you may want to distribute the load across different jobs. To do this, provide two additional arguments indicating which section of the bigfiles.txt file to read: the first specifies the first line of the file to read, while the second specifies either the last line, or the number of lines to read (if it starts with ‘+’). For example, let’s assume that bigfiles.txt contains 300 filenames. You can distribute this over three jobs, assigning 100 files to each, in the following way:

> submit gzip.qsub @bigfiles.txt 1 100
> submit gzip.qsub @bigfiles.txt 101 200
> submit gzip.qsub @bigfiles.txt 201 300

Or:

> submit gzip.qsub @bigfiles.txt 1 +100
> submit gzip.qsub @bigfiles.txt 101 +100
> submit gzip.qsub @bigfiles.txt 201 +100
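
Rather than typing these commands by hand, you can generate them with a small shell loop. This is just a sketch: it assumes that bigfiles.txt is in the current directory and that gzip.qsub simply stops at the end of the list if the final chunk runs past it.

# Submit one gzip.qsub job per chunk of 100 filenames
TOTAL=$(wc -l < bigfiles.txt)
CHUNK=100
for START in $(seq 1 $CHUNK $TOTAL); do
  submit gzip.qsub @bigfiles.txt $START +$CHUNK
done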

One final note: normally, gzip writes the compressed file to the /tmp folder while running, and only copies it to the destination directory when done. On HiPerGator, the /tmp partition may be too small to hold a large compressed file. A further advantage of gzip.qsub is that it writes its temporary file to the current directory instead, avoiding this problem.
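
If you are curious how much space /tmp actually has on the machine you are logged into, the standard df command will show it (the size varies from node to node):

> df -h /tmp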
