Getting a handle on your storage
If you’re working in genomics and bioinformatics, most likely you’ve run out of disk space at some point in your life. We deal with large datasets that have a way of proliferating beyond our control, and it’s not always easy to know how much space our files are using, especially when they’re down in deeply nested subdirectories. Various combinations of the unix find, grep, and du commands can help, but they are not terribly easy to use.
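For comparison, a typical du-based incantation to find which subdirectories are eating your space might look like this (standard coreutils; the -h flag to sort requires GNU sort):

```shell
# Summarize the size of each subdirectory of the current directory,
# in human-readable units, sorted smallest to largest.
# du -s: one summary line per argument; -h: human-readable units;
# sort -h: sort by those human-readable sizes (GNU coreutils).
du -sh -- */ | sort -h
```

This works, but it only goes one level deep unless you fiddle with du's options, and it can't easily total an arbitrary list of files, which is where fs.py comes in.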
The dibig_tools module includes a command called fs.py (standing for file size) that attempts to solve all these problems in a user-friendly way. As usual, you can use the -h option to display a short help message:
fs.py - Display total size of a set of files.

Usage: fs.py [options] [filespecs...]

Each filespec can be: the name of an existing file or directory, or a
specifier of the form @filename, in which case the list of filespecs
is read from file `filename'. If no filespecs are supplied, uses all
files in the current directory.

Options:
 -u | Output sizes in the appropriate unit of measure (e.g. MB, GB)
      instead of number of bytes.
 -c | Print output as tab-delimited: sizes in first column, filenames
      in second.
 -r | Recurse into subdirectories.
 -t | Only print final total to standard output.
 -s | Read list of filespecs from standard input.
The simplest way of using fs.py is to call it with no arguments, and it will show all files in the current directory, printing their total size at the end. For example, assuming we are in a directory that contains three files:
$ fs.py
 4994  file1
47623  file3
 1648  file2
54265  *** Total ***
We can display the sizes in human-readable format with the -u flag:
$ fs.py -u
 4.9 KB  file1
46.5 KB  file3
 1.6 KB  file2
53.0 KB  *** Total ***
If instead we need the results to be processed by another script, we can output them in tab-delimited format with the -c flag:
$ fs.py -c
4994	file1
47623	file3
1648	file2
54265	*** Total ***
We can automatically recurse into subdirectories with the -r option, and fs.py will print the grand total at the end:
$ fs.py -r -u
  4.9 KB  file1
 46.5 KB  file3
  6.5 KB  subdir1/file4
 51.4 KB  subdir1/file5
  1.6 KB  file2
110.9 KB  *** Total ***
Note that, in general, the output will not be sorted. The easiest way to get sorted output is to use the -c option and pipe the results to the sort command. If you’re only interested in the total size, you can use the -t option and fs.py will only print that:
$ fs.py -r -t
113524
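As noted above, sorted output is easiest to get by combining -c with sort. For instance, to list files smallest first, you would run `fs.py -c | sort -n`; the pipeline below emulates the tab-delimited fs.py -c output with printf, just to illustrate what sort -n does with it:

```shell
# In practice:  fs.py -c | sort -n
# Emulating fs.py -c output (size<TAB>name) for illustration:
printf '4994\tfile1\n47623\tfile3\n1648\tfile2\n' | sort -n
# sort -n orders the lines numerically by the leading size field:
# file2 (1648) comes first, file3 (47623) last.
```

Add `-r` to sort to see the largest files first instead.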
So far, we’ve asked fs.py to look at all files. If we instead need to determine the total size of a specific set of files, we have several options. First of all, we can specify the filenames on the command line directly:
$ fs.py -u file1 file2
4.9 KB  file1
1.6 KB  file2
6.5 KB  *** Total ***
The second option applies if the names of the files we want to count are stored in a file (one per line). For example, let's assume that the file FILES contains the following:

file1
file2
We can then tell fs.py to read filenames from this file using the @ prefix:
$ fs.py -u @FILES
4.9 KB  file1
1.6 KB  file2
6.5 KB  *** Total ***
Note that you can mix the two preceding options. For example:
$ fs.py -u @FILES subdir1
 4.9 KB  file1
 1.6 KB  file2
 6.5 KB  subdir1/file4
51.4 KB  subdir1/file5
64.4 KB  *** Total ***
A caveat: fs.py does not check for duplicates in its arguments, so if you specify the same file twice (for example because you list it on the command line but it is also included in FILES), then its size will be added twice.
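One way to work around this (a sketch, relying on the -s mode described below, which reads filenames from standard input) is to deduplicate the combined list with sort -u before handing it to fs.py:

```shell
# In practice, something like:
#   { cat FILES; printf 'file1\n'; } | sort -u | fs.py -s -
# The deduplication step alone, shown with a repeated name:
printf 'file1\nfile2\nfile1\n' | sort -u
# sort -u emits each unique name exactly once: file1, file2.
```

This way each file is counted once no matter how many lists it appears in.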
Finally, we can tell fs.py to read filenames from its standard input using the -s option, or by specifying a single `-' as an argument, per unix convention. For example, to find the total size of all your fastq.gz files in the current directory and all its subdirectories, we can do:
$ find . -name \*.fastq.gz | fs.py -t -s

- or -

$ find . -name \*.fastq.gz | fs.py -t -
But wait, there’s more!
If you are on HiPerGator and are having disk space issues, it is useful to be able to check your quota. Unfortunately the standard lfs command to do that is not particularly user-friendly, especially when you belong to multiple groups. For example, I currently belong to six groups (renamed group1 to group6 for privacy 😉 ) and for each one I would need to run the following:
$ lfs quota -g group3 .
Disk quotas for grp group3 (gid 4659):
     Filesystem      kbytes       quota       limit   grace   files  quota  limit  grace
              . 22147764864 37580963840 37580963840       - 1800794      0      0      -
The qu command is designed to quickly give you a complete picture of your quota situation. By default, it prints quota information for all groups you belong to:
$ qu
Group                Used        Total     Perc  Over?
============ ============ ============ =======  =====
group1            2980 GB      4294 GB    69.4%
group2             256 GB      2147 GB    11.9%
group3           22147 GB     37580 GB    58.9%
group4           19244 GB     23622 GB    81.5%
group5            4293 GB      4294 GB   100.0%
group6            3087 GB      3221 GB    95.8%
For each group the output includes used space, total space (i.e., your investment), percent of space used, and a * in the last column if your usage is over 100%. If you are only interested in some of your groups, you can list them on the command line:
$ qu group1 group3 group5
Group                Used        Total     Perc  Over?
============ ============ ============ =======  =====
group1            2980 GB      4294 GB    69.4%
group3           22147 GB     37580 GB    58.9%
group5            4293 GB      4294 GB   100.0%
Tip: put the ‘qu’ command at the end of your shell initialization script (~/.bash_profile), and you will see the status of your quotas every time you log in!
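The end of ~/.bash_profile might then look something like this (a sketch; the `module load` line is a guess at how dibig_tools is loaded on your system, so adjust it to whatever you normally use):

```shell
# ~/.bash_profile (at the end)
module load dibig_tools   # hypothetical module name; adjust for your site
qu                        # print quota status for all my groups at login
```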