Lecture 2 – notes

Please download MobaXTerm from: https://mobaxterm.mobatek.net/

Note: lines starting with # are comments; lines starting with $ and in bold are commands to be executed, all other lines are command outputs.


# Go to your directory:
$ cd /ufrc/bioinf_workshop/YOURNAME

# or:
$ cd /ufrc/bioinf_workshop/share/
$ mkdir YOURNAME
$ cd YOURNAME

# Check that you are in the correct place:
$ pwd
/ufrc/bioinf_workshop/...

# Copy test file to your directory:
$ cp /ufrc/bioinf_workshop/share/genes.csv .

# Check that the file is there:
$ ls -l genes.csv
-rw-rw---- 1 ariva bioinf_workshop 2162697 Feb 23 14:50 genes.csv

# How many lines, words, characters?
$ wc genes.csv
65850 329250 2162697 genes.csv

# How many lines?
$ wc -l genes.csv
65850 genes.csv

# Read file contents
$ more genes.csv
# (press 'q' to exit)

# Show beginning of file:
$ head genes.csv
chr1 11873 14409 DDX11L1 +
chr1 17368 17436 MIR6859-1 -
...

# Show end of file:
$ tail genes.csv
...
chr1 11873 14409 DDX11L1 +
chr1 17368 17436 MIR6859-1 -

# Show first column of file:
$ cut -f 1 genes.csv
chr1
chr1
....

# Is there a gene called ABCD1?
$ grep ABCD1 genes.csv
chrX 153724867 153744762 ABCD1 +

# How many times does APOE appear?
$ grep -c APOE genes.csv
5

# Redirection: write the entries for APOE to a different file
$ grep APOE genes.csv > APOE.csv
$ wc -l APOE.csv

5

# Pipes allow the output of one command to be sent to another one.
# How many genes are on chromosome 5?
$ grep chr5 genes.csv | wc -l
2694

# How many genes are in the file?
$ cut -f 4 genes.csv | sort | uniq | wc -l
27858

# Display number of genes in each chromosome
$ cut -f 1 genes.csv | sort | uniq -c
6620 chr1
3077 chr10
3709 chr11
3350 chr12
1298 chr13
...

## Working with Fastq files
# Display first two reads from fastq file
$ zcat reads.fastq.gz | head -8
@seq_1
GCGCAATTCCTTAATCAGCGCCAGGAGAAAAAACTAACGGTCCGATTGGTTCAAACG
+
IIHGGIGIHIHIHIGHHIGGIIFDDGHIGIIIDGIEIHIIFHFGGFIFIDDGIABDI
@seq_2
CGTGCGGCGCGCTTAGACGCTAGACGTCATTTGCGTTATGGTGCTGGGGGACCGACA
+
HIIIIFIHGFIHGFGIHIEFHIFGEGIIGIEHIFDIDHIGFEEHIGCABIDBEIDCD

# How many reads?
$ zcat sample.fastq.gz | wc -l
400000
# (then divide number by 4)

# Alternatively, although slightly error-prone:
$ zcat sample.fastq.gz | grep -c ^@
100000

# Use the NGS simulator (in dibig_tools) to write simulated reads
$ module load dibig_tools

# First, let's create a synthetic reference sequence
$ simseq.py reference.fa
Random sequence of 1000000bp written to file reference.fa
$ ls -l reference.fa
-rw-rw---- 1 ariva bioinf_workshop 1016672 Feb 23 16:24 reference.fa

# Now generate 100,000 paired-end reads from the reference:
$ simreads.py -nr 100000 -p reference.fa
Writing 100000 paired-end reads to reads_R1.fastq.gz and reads_R2.fastq.gz

$ ls reads_R*
reads_R1.fastq.gz reads_R2.fastq.gz

# What average coverage can we expect?
$ simcov.py -g 1000000 -n 100000 -p
=== Configuration ===
Genome size: 1,000,000bp
Number of reads: 100,000
Read length: 150bp
Mode: paired
Insert size: 400bp
Expected average coverage: 30.0X

=== Results ===
Effective average coverage: 30.0X
Max coverage: 54
Bases covered at 1X: 1,000,000bp (100.00%)
Bases covered at 5X: 1,000,000bp (100.00%)
Bases covered at 10X: 1,000,000bp (100.00%)
Bases covered at 20X: 979,196bp (97.92%)
Bases covered at 30X: 523,396bp (52.34%)
Bases covered at 50X: 197bp (0.02%)
Bases covered at 100X: 0bp (0.00%)
25th percentile - bases covered at 14X: 999,669bp (99.97%)
50th percentile - bases covered at 27X: 734,722bp (73.47%)
75th percentile - bases covered at 40X: 43,567bp (4.36%)