...
- samtools view converts each alignment record in yeast_pairedend.sort.bam to text
- the -F 0x4 filter says to output records only for mapped sequences (ones assigned a contig and position)
- BAM files often contain records for both mapped and unmapped reads
- -F filters out records where the specified bit(s) are not set (i.e., they are 0)
- so technically we're asking for "not unmapped" reads since bit 0x4 = 1 means unmapped
- the -f 0x2 filter says to output only reads that are flagged as properly paired by the aligner
- these are reads where both R1 and R2 reads mapped within a "reasonable" genomic distance
- -f filters out records where the specified bit(s) are set (i.e., they are 1)
- alignment records that pass both filters are written to standard output
- the -F 0x4 filter says to output records only for mapped sequences (ones assigned a contig and position)
- | awk '
- the pipe | connects the standard output of samtools view to the standard input of awk
- the single quote denots the start of the awk script
- we don't have to use line continuation characters ( \ followed by a linefeed) within the script because newline characters within the quotes are part of the script
- '
BEGIN{ ... }{...}END{...}'- these 3 lines of text, enclosed in single quotes, are the awk script
- the BEGIN{ FS="\t"; sum=0; nrec=0; } block is executed once before the script processes any input data
- it says to use Tab ("\t") as the input (FS) field separator (default is whitespace), and initialize the variables sum and nrec to 0.
- it says to use Tab ("\t") as the input (FS) field separator (default is whitespace), and initialize the variables sum and nrec to 0.
- { if ($9 > 0) {sum += $9; nrec++} }
- this is the body of the awk script, which is executed for each line of input
- $9 represents the 9th tab-delimited field of the input
- the 9th field of an alignment record is the insert size, according to the SAM format spec
- we only execute the main part of the body when the 9th field is positive: if ($9 > 0)
- since each proper pair will have one alignment record with a positive insert size and one with a negative insert size, this check keeps us from double-counting insert sizes for pairs
- when the 9th field is positive, we add its value to sum (sum += $9) and add one to our record count (nrec++)
- END{ print sum/nrec; }
- the END block between the curly brackets { } is executed once after the script has processed all input data
- this prints the average insert size (sum/nrec) to standard output
...