Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • samtools view converts each alignment record in yeast_pairedend.sort.bam to text
    • the -F 0x4 filter says to output records only for mapped sequences (ones assigned a contig and position)
      • BAM files often contain records for both mapped and unmapped reads
      • -F filters out records where the specified bit(s) are not set (i.e., they are 0)
        • so technically we're asking for "not unmapped" reads since bit 0x4 = 1 means unmapped
    • the -f 0x2 filter says to output only reads that are flagged as properly paired by the aligner
      • these are reads where both R1 and R2 reads mapped within a "reasonable" genomic distance
      • -f filters out records where the specified bit(s) are set (i.e., they are 1)
    • alignment records that pass both filters are written to standard output
  • | awk '
    • the pipe | connects the standard output of samtools view to the standard input of awk
    • the single quote denots the start of the awk script
      • we don't have to use line continuation characters ( \ followed by a linefeed) within the script because newline characters within the quotes are part of the script
  • '
    BEGIN{ ... }{...}END{...}'
    • these 3 lines of text, enclosed in single quotes, are the awk script
    • the BEGIN{ FS="\t"; sum=0; nrec=0; } block is executed once before the script processes any input data
      • it says to use Tab ("\t") as the input (FS) field separator (default is whitespace), and initialize the variables sum and nrec to 0.
    • { if ($9 > 0) {sum += $9; nrec++}  }
      • this is the body of the awk script, which is executed for each line of input
      • $9 represents the 9th tab-delimited field of the input
        • the 9th field of an alignment record is the insert size, according to the SAM format spec
      • we only execute the main part of the body when the 9th field is positive: if ($9 > 0)
        • since each proper pair will have one alignment record with a positive insert size and one with a negative insert size, this check keeps us from double-counting insert sizes for pairs
      • when the 9th field is positive, we add its value to sum (sum += $9) and add one to our record count (nrec++)
    •  END{ print sum/nrec; }
      • the END block between the curly brackets { } is executed once after the script has processed all input data
      • this prints the average insert size (sum/nrec) to standard output

...