...
There are many ways to measure sequence capture. You might care more about minimizing off-target capture, to make your sequencing dollars go as far as possible. Or you might care more about maximizing on-target capture, to make sure you get data from every region of interest. These two are usually negatively correlated.
Note | ||
---|---|---|
| ||
As with the velvet tutorial, picard is not a module that is available on lonestar5 hopefully it will be soon, but we can make no promises. As such you must first copy all the programs listed below before entering an idev session |
Using Picard's "
...
CalculateHsMetrics" function to evaluate capture
Here is a link to the full documentation.
To run the program on Lonestar, there are three prerequisites: 1) A bam file and 2) a list of the genomic intervals that were to be captured and 3) the reference (.fa). As you would guess, the BAM and interval list both have to be based on exactly the same genomic reference file.
...
Code Block | ||
---|---|---|
| ||
/corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa.fai |
...
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cds
mkdir BDIB_Exome_Capture
cd BDIB_Exome_Capture
cp $SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp $SCRATCH/BDIB_Human_tutorial/raw_files/target_intervals.chr20.reduced.withhead.intervallist .
cp $SCRATCH/BDIB_Human_tutorial/raw_files/ref/hs37d5.fa .
cp $SCRATCH/BDIB_Human_tutorial/raw_files/ref/hs37d5.fa.fai . |
Code Block | ||||
---|---|---|---|---|
| ||||
cds mkdir BDIB_Exome_Capture cd BDIB_Exome_Capture cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam . cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam . cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam . cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.chr20.reduced.withhead.intervallist . cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.reduced.withhead.intervallist . cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa . cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa.fai . |
The run command looks long but isn't that complicated (like most java programs):
Code Block | ||
---|---|---|
| ||
idev -A UT-2015-05-18
module load picard
java -Xmx4g -Djava.io.tmpdir=/tmp -jar $TACC_PICARD_DIR/CalculateHsMetrics.jar BI=target_intervals.chr20.reduced.withhead.intervallist TI=target_intervals.chr20.reduced.withhead.intervallist I=NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam R=hs37d5.fa O=exome.picard.stats PER_TARGET_COVERAGE=exome.pertarget.stats |
...
Code Block |
---|
grep -A 1 '^BAIT' exome.picard.stats | awk 'BEGIN {FS="\t"} {for (i=1;i<=NF;i++) {a[NR"_"i]=$i}} END {for (i=1;i<=NF;i++) {print a[1"_"i]"\t"a[2"_"i]}}' |
Here is the output of the above command, DO NOT! paste this into the command line.
No Format |
---|
## below here is the output of the above command, do not paste this into the command line
BAIT_SET target_intervals
GENOME_SIZE 3137454505
BAIT_TERRITORY 1843371
TARGET_TERRITORY 1843371
BAIT_DESIGN_EFFICIENCY 1
TOTAL_READS 4579959
PF_READS 4579959
PF_UNIQUE_READS 4208881
PCT_PF_READS 1
PCT_PF_UQ_READS 0.918978
PF_UQ_READS_ALIGNED 4114249
PCT_PF_UQ_READS_ALIGNED 0.977516
PF_UQ_BASES_ALIGNED 283708397
ON_BAIT_BASES 85464280
NEAR_BAIT_BASES 49788346
OFF_BAIT_BASES 148455771
ON_TARGET_BASES 85464280
PCT_SELECTED_BASES 0.476731
PCT_OFF_BAIT 0.523269
ON_BAIT_VS_SELECTED 0.631886
MEAN_BAIT_COVERAGE 46.363038
MEAN_TARGET_COVERAGE 46.76568
PCT_USABLE_BASES_ON_BAIT 0.245533
PCT_USABLE_BASES_ON_TARGET 0.245533
FOLD_ENRICHMENT 512.716312
ZERO_CVG_TARGETS_PCT 0.009438
FOLD_80_BASE_PENALTY 23.38284
PCT_TARGET_BASES_2X 0.849372
PCT_TARGET_BASES_10X 0.484824
PCT_TARGET_BASES_20X 0.435911
PCT_TARGET_BASES_30X 0.401622
PCT_TARGET_BASES_40X 0.36876
PCT_TARGET_BASES_50X 0.335459
PCT_TARGET_BASES_100X 0.173683
HS_LIBRARY_SIZE 5325189
HS_PENALTY_10X 232.05224
HS_PENALTY_20X -1
HS_PENALTY_30X -1
HS_PENALTY_40X -1
HS_PENALTY_50X -1
HS_PENALTY_100X -1
AT_DROPOUT 2.143632
GC_DROPOUT 10.000011
SAMPLE
LIBRARY
READ_GROUP |
If you have multiple files you could redirect this to a file and then repetitively using linux "join" or use "paste" with some more awk to select columns.
...
Since I don't actually know what capture kit was used to produce these libraries, these may or may not accurately reflect how well the library prep went, but generally speaking having >40x average coverage on your baits (the target regions) is good! A , but a lot of sequence (52.3%) was "off bait".