Getting a handle on your storage

Published: May 7th, 2018

By: Alberto Riva

Category: Biocoding, BlogPosts

If you’re working in genomics and bioinformatics, most likely you’ve run out of disk space at some point in your life. We deal with large datasets that have a way of proliferating beyond our control, and it’s not always easy to know how much space our files are using, especially when they’re down in deeply nested subdirectories. Various combinations of the unix find, grep, and du commands can help, but they are not terribly easy to use.
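For instance, one classic combination for totaling the size of one file type across a directory tree might look like this (this is an illustration, not part of fs.py; GNU find's -printf is assumed, and the awk step just sums the sizes):

```shell
# Total size, in bytes, of all fastq.gz files under the current
# directory -- workable, but not exactly friendly (GNU find assumed):
find . -name '*.fastq.gz' -printf '%s\n' | awk '{sum += $1} END {print sum + 0}'
```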

The dibig_tools module includes a command called fs.py (standing for file size) that attempts to solve all these problems in a user-friendly way. As usual, you can use the -h option to display a short help message:

fs.py - Display total size of a set of files.

Usage: fs.py [options] [filespecs...]

Each filespec can be: the name of an existing file or directory, or a specifier of the form @filename, in which case the list of filespecs is read from file `filename'. If no filespecs are supplied, uses all files in the current directory.

Options:
  -u | Output sizes in the appropriate unit of measure (e.g. MB, GB) instead of number of bytes.
  -c | Print output as tab-delimited: sizes in first column, filenames in second.
  -r | Recurse into subdirectories.
  -t | Only print final total to standard output.
  -s | Read list of filespecs from standard input.

The simplest way to use fs.py is to call it with no arguments: it will list all files in the current directory and print their total size at the end. For example, assuming we are in a directory that contains three files:

$ fs.py 
        4994 file1
       47623 file3
        1648 file2

       54265 *** Total ***

We can display the sizes in human-readable format with the -u flag:

$ fs.py -u
      4.9 KB file1
     46.5 KB file3
      1.6 KB file2

     53.0 KB *** Total ***

If instead we need the results to be processed by another script, we can output them in tab-delimited format with the -c flag:

$ fs.py -c
4994    file1
47623   file3
1648    file2
54265   *** Total ***
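Since the -c format is just size, tab, name, any line-oriented tool can consume it. As a sketch (the printf below simulates the fs.py -c output shown above, so the example is self-contained), awk can keep only the files above a size threshold:

```shell
# Keep only entries larger than 2000 bytes; the printf stands in
# for the tab-delimited output of "fs.py -c":
printf '4994\tfile1\n47623\tfile3\n1648\tfile2\n' \
  | awk -F'\t' '$1 > 2000 {print $2}'
```

In practice the left side of the pipe would simply be the fs.py -c invocation itself.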

We can recurse into subdirectories with the -r option, and fs.py will print the grand total at the end:

$ fs.py -r -u
      4.9 KB file1
     46.5 KB file3
      6.5 KB subdir1/file4
     51.4 KB subdir1/file5
      1.6 KB file2

    110.9 KB *** Total ***

Note that, in general, the output will not be sorted. The easiest way to get sorted output is to use the -c option and pipe the results to the sort command. If you’re only interested in the total size, you can use the -t option and fs.py will only print that:

$ fs.py -t
113524
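To get the sorted listing mentioned above, pipe the -c output through sort -n; the printf here again just simulates fs.py's tab-delimited output so the idiom is easy to see:

```shell
# Sort tab-delimited size/name pairs numerically, smallest first;
# in practice the left side of the pipe would be "fs.py -c -r":
printf '4994\tfile1\n47623\tfile3\n1648\tfile2\n' | sort -n
```

Adding -r to sort, or tacking on head, gives you the biggest files first or just the top few.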

So far, we’ve asked fs.py to look at all files. If we instead need to determine the total size of a specific set of files, we have several options. First of all, we can specify the filenames on the command line directly:

$ fs.py -u file1 file2
      4.9 KB file1
      1.6 KB file2

      6.5 KB *** Total ***

The second option applies if the names of the files we want to count are stored in a file (one per line). For example, let’s assume that the file FILES contains the following:

file1
file2

We can then tell fs.py to read filenames from this file using the @ prefix:

$ fs.py -u @FILES
      4.9 KB file1
      1.6 KB file2

      6.5 KB *** Total ***

Note that you can mix the two preceding options. For example:

$ fs.py -u @FILES subdir1
      4.9 KB file1
      1.6 KB file2
      6.5 KB subdir1/file4
     51.4 KB subdir1/file5

     64.4 KB *** Total ***

A caveat: fs.py does not check for duplicates in its arguments, so if you specify the same file twice (for example because you list it on the command line but it is also included in FILES), then its size will be added twice.
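If duplicates are a concern, deduplicating the list before handing it to fs.py is a one-liner (FILES.uniq is just an illustrative name):

```shell
# file1 appears twice in the list; sort -u collapses the duplicates,
# so "fs.py -u @FILES.uniq" would count it only once.
printf 'file1\nfile2\nfile1\n' > FILES
sort -u FILES > FILES.uniq
```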

Finally, we can tell fs.py to read filenames from its standard input, either with the -s option or by specifying a single `-' as an argument, per unix convention. For example, to find the total size of all the fastq.gz files in the current directory and all its subdirectories, we can do:

$ find . -name \*.fastq.gz | fs.py -t -s 

- or -

$ find . -name \*.fastq.gz | fs.py -t -

But wait, there’s more!

If you are on HiPerGator and are having disk space issues, it is useful to be able to check your quota. Unfortunately the standard lfs command for doing that is not particularly user-friendly, especially when you belong to multiple groups. For example, I currently belong to six groups (renamed group1 through group6 for privacy 😉) and for each one I would need to run the following:

$ lfs quota -g group3 .
Disk quotas for grp group3 (gid 4659):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
              . 22147764864  37580963840 37580963840       - 1800794       0       0       -

The qu command is designed to quickly give you a complete picture of your quota situation. By default, it prints quota information for all groups you belong to:

$ qu
Group        Used         Total        Perc    Over?
============ ============ ============ ======= =====
group1            2980 GB      4294 GB   69.4%
group2             256 GB      2147 GB   11.9%
group3           22147 GB     37580 GB   58.9%
group4           19244 GB     23622 GB   81.5%
group5            4293 GB      4294 GB  100.0%
group6            3087 GB      3221 GB   95.8%

For each group the output includes used space, total space (i.e., your investment), percent of space used, and a * in the last column if your usage is over 100%. If you are only interested in some of your groups, you can list them on the command line:

$ qu group1 group3 group5
Group        Used         Total        Perc    Over?
============ ============ ============ ======= =====
group1            2980 GB      4294 GB   69.4%
group3           22147 GB     37580 GB   58.9%
group5            4293 GB      4294 GB  100.0%

Tip: put the ‘qu’ command at the end of your shell initialization script (~/.bash_profile), and you will see the status of your quotas every time you log in!
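For example, the lines below could go at the end of ~/.bash_profile (the command -v guard is an optional extra; it just keeps logins quiet on machines where qu happens not to be installed):

```shell
# Print quota status at every login, but only if qu is available:
if command -v qu > /dev/null; then
    qu
fi
```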

About the Author

Alberto Riva

Associate Scientist, Bioinformatics

I am a bioinformatics scientist with the Bioinformatics Core of the UF Interdisciplinary Center for Biotechnology Research. I work primarily on the analysis of Next-Gen Sequencing data, developing software tools and analysis…
