Page History

...

samtools view converts each alignment record in yeast_pairedend.sort.bam to text
- the -F 0x4 filter says to output records only for mapped sequences (ones assigned a contig and position)
  - BAM files often contain records for both mapped and unmapped reads
  - -F filters out records where the specified bit(s) are not set (i.e., they are 0)
    - so technically we're asking for "not unmapped" reads since bit 0x4 = 1 means unmapped
- the -f 0x2 filter says to output only reads that are flagged as properly paired by the aligner
  - these are reads where both R1 and R2 reads mapped within a "reasonable" genomic distance
  - -f filters out records where the specified bit(s) are set (i.e., they are 1)
- alignment records that pass both filters are written to standard output
| awk '
- the pipe | connects the standard output of samtools view to the standard input of awk
- the single quote denots the start of the awk script
  - we don't have to use line continuation characters ( \ followed by a linefeed) within the script because newline characters within the quotes are part of the script
'
BEGIN{ ... }{...}END{...}'
- these 3 lines of text, enclosed in single quotes, are the awk script
- the BEGIN{ FS="\t"; sum=0; nrec=0; } block is executed once before the script processes any input data
  - it says to use Tab ("\t") as the input (FS) field separator (default is whitespace), and initialize the variables sum and nrec to 0.
- { if ($9 > 0) {sum += $9; nrec++} }
  - this is the body of the awk script, which is executed for each line of input
  - $9 represents the 9th tab-delimited field of the input
    - the 9th field of an alignment record is the insert size, according to the SAM format spec
  - we only execute the main part of the body when the 9th field is positive: if ($9 > 0)
    - since each proper pair will have one alignment record with a positive insert size and one with a negative insert size, this check keeps us from double-counting insert sizes for pairs
  - when the 9th field is positive, we add its value to sum (sum += $9) and add one to our record count (nrec++)
- END{ print sum/nrec; }
  - the END block between the curly brackets { } is executed once after the script has processed all input data
  - this prints the average insert size (sum/nrec) to standard output

...

Page tree

Versions Compared

Old Version 65

New Version 66

Key