
There are many ways to measure sequence capture.  You might care more about minimizing off-target capture, to make your sequencing dollars go as far as possible.  Or you might care more about maximizing on-target capture, to make sure you get data from every region of interest.  These two goals usually trade off against each other.

Note
titleLonestar 5 does not have a Picard module, so this tutorial must be completed on Stampede

As with the Velvet tutorial, Picard is not available as a module on Lonestar 5. Hopefully it will be soon, but we can make no promises. As such, you must first copy all the files listed below before entering an idev session.

 

Using Picard's "

...

CalculateHsMetrics" function to evaluate capture

Here is a link to the full documentation.

To run the program, there are three prerequisites: 1) a BAM file, 2) a list of the genomic intervals that were targeted for capture, and 3) the reference (.fa).  As you would guess, the BAM file and the interval list both have to be based on exactly the same genomic reference.
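For reference, a Picard interval list is a SAM-style header (@HD/@SQ lines) followed by one tab-separated record per interval: chromosome, 1-based start, end, strand, and name.  Here is a minimal sketch of what the first lines of such a file might look like (these values are illustrative, not taken from the actual course file):

Code Block
titleSketch of the Picard interval list format (illustrative values)
@HD	VN:1.4	SO:coordinate
@SQ	SN:20	LN:63025520
20	68350	68408	+	target_1
20	76643	76791	+	target_2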

...

Code Block
titleReference for exome metrics
/corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa
/corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa.fai

...

Code Block
languagebash
titleIf "module load picard" ever works on Lonestar 5, the following will work with the same files you used in the previous day's human trios analysis
collapsetrue
cds
mkdir BDIB_Exome_Capture
cd BDIB_Exome_Capture
cp $SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp $SCRATCH/BDIB_Human_tutorial/raw_files/target_intervals.chr20.reduced.withhead.intervallist .
cp $SCRATCH/BDIB_Human_tutorial/raw_files/ref/hs37d5.fa .
cp $SCRATCH/BDIB_Human_tutorial/raw_files/ref/hs37d5.fa.fai .
Code Block
languagebash
titleSince we are on Stampede, this is the relevant code block to use to initialize your data
cds
mkdir BDIB_Exome_Capture
cd BDIB_Exome_Capture
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.chr20.reduced.withhead.intervallist .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.reduced.withhead.intervallist .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa.fai .
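Before running anything, it never hurts to confirm that the BAM and the interval list really were built against the same reference.  One quick check (a sketch, assuming samtools is available on your PATH or via a module) is to compare the @SQ lines in the BAM header against the reference index:

Code Block
languagebash
titleOptional sanity check: do the BAM and reference agree? (assumes samtools is available)
# sequence names and lengths according to the BAM header
samtools view -H NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | grep '^@SQ' | head
# sequence names and lengths according to the reference index
cut -f 1,2 hs37d5.fa.fai | head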

 

The run command looks long but isn't that complicated (like most Java programs).  BI is the bait interval list, TI is the target interval list (here they are the same file), I is the input BAM, R is the reference, and O is the output stats file:

Code Block
languagebash
titleHow to run exactly these files on Stampede
idev -A UT-2015-05-18
module load picard
java -Xmx4g -Djava.io.tmpdir=/tmp -jar $TACC_PICARD_DIR/CalculateHsMetrics.jar \
    BI=target_intervals.chr20.reduced.withhead.intervallist \
    TI=target_intervals.chr20.reduced.withhead.intervallist \
    I=NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
    R=hs37d5.fa \
    O=exome.picard.stats \
    PER_TARGET_COVERAGE=exome.pertarget.stats
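Since the Stampede block above copied BAMs for all three members of the trio, you could generate stats for each with a simple loop.  This is just a sketch reusing the file names from the copies above; adjust it if you named things differently:

Code Block
languagebash
titleA sketch: looping CalculateHsMetrics over all three trio BAMs
for s in NA12878 NA12891 NA12892; do
  java -Xmx4g -Djava.io.tmpdir=/tmp -jar $TACC_PICARD_DIR/CalculateHsMetrics.jar \
    BI=target_intervals.chr20.reduced.withhead.intervallist \
    TI=target_intervals.chr20.reduced.withhead.intervallist \
    I=${s}.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
    R=hs37d5.fa \
    O=${s}.picard.stats \
    PER_TARGET_COVERAGE=${s}.pertarget.stats
done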

...

Code Block
languagebash
# transpose the two-line stats output (header row, then value row) into one "metric<tab>value" pair per line
grep -A 1 '^BAIT' exome.picard.stats | awk 'BEGIN {FS="\t"} {for (i=1;i<=NF;i++) {a[NR"_"i]=$i}} END {for (i=1;i<=NF;i++) {print a[1"_"i]"\t"a[2"_"i]}}'

Here is the output of the above command.  Do NOT paste this into the command line; it is shown only so you can follow along.

No Format

 
BAIT_SET	target_intervals
GENOME_SIZE	3137454505
BAIT_TERRITORY	1843371
TARGET_TERRITORY	1843371
BAIT_DESIGN_EFFICIENCY	1
TOTAL_READS	4579959
PF_READS	4579959
PF_UNIQUE_READS	4208881
PCT_PF_READS	1
PCT_PF_UQ_READS	0.918978
PF_UQ_READS_ALIGNED	4114249
PCT_PF_UQ_READS_ALIGNED	0.977516
PF_UQ_BASES_ALIGNED	283708397
ON_BAIT_BASES	85464280
NEAR_BAIT_BASES	49788346
OFF_BAIT_BASES	148455771
ON_TARGET_BASES	85464280
PCT_SELECTED_BASES	0.476731
PCT_OFF_BAIT	0.523269
ON_BAIT_VS_SELECTED	0.631886
MEAN_BAIT_COVERAGE	46.363038
MEAN_TARGET_COVERAGE	46.76568
PCT_USABLE_BASES_ON_BAIT	0.245533
PCT_USABLE_BASES_ON_TARGET	0.245533
FOLD_ENRICHMENT	512.716312
ZERO_CVG_TARGETS_PCT	0.009438
FOLD_80_BASE_PENALTY	23.38284
PCT_TARGET_BASES_2X	0.849372
PCT_TARGET_BASES_10X	0.484824
PCT_TARGET_BASES_20X	0.435911
PCT_TARGET_BASES_30X	0.401622
PCT_TARGET_BASES_40X	0.36876
PCT_TARGET_BASES_50X	0.335459
PCT_TARGET_BASES_100X	0.173683
HS_LIBRARY_SIZE	5325189
HS_PENALTY_10X	232.05224
HS_PENALTY_20X	-1
HS_PENALTY_30X	-1
HS_PENALTY_40X	-1
HS_PENALTY_50X	-1
HS_PENALTY_100X	-1
AT_DROPOUT	2.143632
GC_DROPOUT	10.000011
SAMPLE	
LIBRARY	
READ_GROUP

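The PER_TARGET_COVERAGE file is also worth a look.  For example (a sketch, assuming the output names from the run command above; check the header line first, since the column order may differ between Picard versions), you can list the least-covered targets:

Code Block
languagebash
titleA sketch: peeking at the per-target coverage file
head -1 exome.pertarget.stats                                   # column headers
tail -n +2 exome.pertarget.stats | sort -t$'\t' -k7,7g | head   # lowest mean coverage first (assuming mean_coverage is column 7; verify against the header)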
 

If you have multiple files, you could redirect this output to a separate file per sample and then combine the files with the Linux "join" or "paste" commands, plus some more awk to select columns.
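For example, here is a sketch of that idea, assuming you produced NA12878.picard.stats, NA12891.picard.stats, and NA12892.picard.stats with the loop shown earlier (adjust the names to whatever you actually generated):

Code Block
languagebash
titleA sketch: combining transposed stats from several samples
# transpose each sample's stats into a two-column "metric<tab>value" file
for s in NA12878 NA12891 NA12892; do
  grep -A 1 '^BAIT' ${s}.picard.stats | awk 'BEGIN {FS="\t"} {for (i=1;i<=NF;i++) {a[NR"_"i]=$i}} END {for (i=1;i<=NF;i++) {print a[1"_"i]"\t"a[2"_"i]}}' > ${s}.stats.txt
done
# paste the files side by side, then keep the metric name plus one value column per sample
paste NA12878.stats.txt NA12891.stats.txt NA12892.stats.txt | awk 'BEGIN {FS=OFS="\t"} {print $1, $2, $4, $6}'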

...

Since I don't actually know what capture kit was used to produce these libraries, these numbers may or may not accurately reflect how well the library prep went.  Generally speaking, though, having >40x average coverage on your baits (the targeted regions) is good, but here a lot of sequence (52.3%) was "off bait".
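To see where that 52.3% comes from: at least in this version of Picard, PCT_OFF_BAIT is OFF_BAIT_BASES divided by PF_UQ_BASES_ALIGNED, and PCT_SELECTED_BASES is (ON_BAIT_BASES + NEAR_BAIT_BASES) divided by PF_UQ_BASES_ALIGNED.  You can check this against the numbers in the output above:

Code Block
languagebash
titleSanity-checking the percentages from the metrics above
# 148455771 / 283708397 = 0.523269 (matches PCT_OFF_BAIT)
echo "148455771 283708397" | awk '{printf "%.6f\n", $1/$2}'
# (85464280 + 49788346) / 283708397 = 0.476731 (matches PCT_SELECTED_BASES)
echo "85464280 49788346 283708397" | awk '{printf "%.6f\n", ($1+$2)/$3}'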