Date: Mon, 18 Mar 2024 20:33:14 -0500 (CDT) Message-ID: <1656616048.12737.1710811994586@ip-10-0-60-17.ec2.internal> Subject: Exported From Confluence MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_Part_12736_1735564787.1710811994586" ------=_Part_12736_1735564787.1710811994586 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Content-Location: file:///C:/exported.html
SAMtools is a suite of commands for dealing with databases of ma= pped reads. You'll be using it quite a bit throughout the course. It includ= es programs for performing variant calling (mpileup-bcftools).
Right now, we'll be using it to call variants (find mutations) in the re= -sequenced E. coli genome from the Mapping tutorial. You will need the output SAM files fr= om that tutorial to continue here.
If you do not have the output from the Mapping tutorial, run these commands to copy over the out= put that would have been produced. Then, you can immediately start this tut= orial!
cds mkdir intro_to_mapping cd intro_to_mapping cp -r $BI/ngs_course/intro_to_mapping/bowtie . cp -r $BI/ngs_course/intro_to_mapping/bwa .=20 cp -r $BI/ngs_course/intro_to_mapping/bowtie2 .
We assume that you are still working in the main directory called =
intro_to_mapping
data that you created on $SCRATCH
.
Load the SAMtools module (if not already loaded).
Can you figure out what version of samtools is loaded on TACC and where = it is installed?
Create a new output directory called samtools_bowtie
or wha=
tever makes sense to you.
Let's copy over just the read alignment file in the SAM format and the r= eference genome in FASTA format to this new directory, so that we don't hav= e so many files cluttering our space.
First, you need to index the reference file. (This isn'= t indexing it for read mapping. It's indexing it so that SAMtools can quick= ly jump to a certain base in the reference.)
Then run this command to index the reference file.
samtool= s faidx samtools_bowtie/NC_012967.1.fasta
Take a look at the new *.fai file that was created by this command. Any = idea what some of the numbers mean?
SAM is a text file, so it is slow to access information about how any gi= ven read was mapped. SAMtools and many of the commands that we will run lat= er work on BAM files (essentially GZIP compressed binary forms of the text = SAM files). These can be loaded much more quickly. Typically, they also nee= d to be sorted, so that when the program wants to look at all reads overlap= ping position 4,129,888, it can easily find them all at once without having= to search through the entire BAM file.
Convert from SAM to BAM format.
samtool= s view -b -S -o samtools_bowtie/SRR030257.bam samtools_bowtie/SRR030257.sam
Sort and index the BAM file.
samtool= s sort samtools_bowtie/SRR030257.bam samtools_bowtie/SRR030257.sorted samtools index samtools_bowtie/SRR030257.sorted.bam
This is a really common sequence of commands, so you might want to add i= t to your personal cheat sheet.
SRR030257.sorted.bam
in the =
samtools sort
command?=20
Hint: You might be tempted to gzip
BAM fil=
es when copying them from one computer to another. Don't bother! They are a=
lready internally compressed, so you won't be able to shrink the file. On t=
he other hand, compressing SAM files will save a fair bit of space.
Now we use the mpileup
command from samtools
t=
o compile information about the bases mapped to each reference position.
Output BCF file. This is a binary form of the text Variant Call Format (= VCF).
samtool= s mpileup -u -f samtools_bowtie/NC_012967.1.fasta samtools_bowtie/SRR030257= .sorted.bam > samtools_bowtie/SRR030257.bcf
What are all the options doing?
Convert BCF to human-readable VCF:
bcftool= s view -v -c -g samtools_bowtie/SRR030257.bcf > samtools_bowtie/SRR03025= 7.vcf
What are these options doing?
Take a look at the samtools_bowtie/SRR030257.vcf
file using=
less
. It has a nice header explaining what the columns mean. =
Below this are the rows of data describing potential genetic variants.
Follow the same directions to call variants in the BWA or Bowtie2 mapped= reads.
Just be sure you don't write over your old files. Maybe create new direc=
tories like samtools_bwa
and samtools_bowtie2
for=
the output in each case.
You could also try running all of the commands from inside of the =
samtools_bwa
directory, just for a change of pace.
VCF format has alternative Allele Frequency = tags denoted by AF1. Try the following command to see what values we have i= n our files.
grep AF= 1 samtools_bowtie/SRR030257.vcf
Optional: For the data we are dealing with, predictions= with an allele frequency not equal to 1 are not really applicable. (The re= ference genome is haploid. There aren't any heterozygotes.) How can we remo= ve these lines from the file?
You will need the output from #Calling variant= s in reads mapped by BWA or Bowtie2 to complete this exercise.
Often you want to compare the results of variant calling on different sa= mples or using different pipelines. Bedtools is a suite of= utility programs that work on a variety of file formats, one of which is c= onveniently VCF format. It provides many ways of slicing, dicing, and compa= ring the information in VCF files. For example, we can use it to find out w= hat predictions are the same and which are different from the variant calli= ng on reads mapped with different programs.
Set up a new output directory and copy the respective VCF files to it, r= enaming them so that we know where they came from:
mkdir c= omparison cp samtools_bowtie/SRR030257.vcf comparison/bowtie.vcf cp samtools_bwa/SRR030257.vcf comparison/bwa.vcf cp samtools_bowtie2/SRR030257.vcf comparison/bowtie2.vcf cd comparison
Use the subcommands bedtools intersect
and bedtools s=
ubtract
we can find equal and different predictions between mappers.=
Try to figure out how to to do this on your own first. Hint: Remember that adding > output.vcf
to the end of a comma=
nd will pipe the output that is to the terminal into a file, so that you ca=
n save it.
*Look at how the reads supporting these variants were aligned to the ref=
erence genome in the Integrative Genomics Viewer (IGV) tutorial.
*Look into more sophisticated Variant+c=
alling+with+GATK.