This is a selection of linux one-liners useful in NGS, roughly categorized. Almost none of these will run "as-is" - some are cut straight from shell scripts and so refer to variables, others have file names which you'd need to customize.
Note also that some of the grep strings are hardwired to the UT Illumina sequencer - the introductory "@HWI" of a fastq file is instrument-specific. You can't just use '^@' by itself to grep a fastq because some fraction of the ASCII-encoded quality values start with "@".
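If your instrument's prefix differs, you can sidestep the grep entirely by selecting lines by position, since fastq records are exactly four lines each. A minimal sketch (the file name is just a placeholder):
# sequence lines are line 2 of every 4-line record
awk 'NR % 4 == 2' seqs.fastq | head
# header lines, regardless of instrument prefix
awk 'NR % 4 == 1' seqs.fastq | head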
These are intended to be useful and general-purpose; you can do almost all of these examples in a more optimized way, in 10 different ways, with other tools, Python, Perl, etc. etc. But that's not the point.
The point for me is to have a set of core commands I know well enough that they will give me the answer I want, correctly, and quickly, to a very wide variety of questions.
grep uses regular expressions to search for strings in files. Some good online resources to understand it are at robelle.com and at thegeekstuff.com (http://www.thegeekstuff.com/2011/01/advanced-regular-expressions-in-grep-command-with-10-examples-–-part-ii/).

awk is an actual programming language, like perl or python. But it's built into linux, is very general-purpose, and is ideal for parsing & filtering large text files.

uniq simply recognizes when two or more adjacent lines are the same. Some useful switches: -c counts how many identical lines there were and prepends the count to the line itself; -w N only uses the first N characters of each line to decide if they're the same.

sort orders a bunch of lines. It's very fast and memory efficient - sorting a few million lines on a current linux system should only take a few minutes.

There are others I use often, but not as much as these four. Examples: sed, paste, join, cut, find.
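These four compose naturally, and the pattern below is the backbone of most of the counting one-liners on this page. A trivial illustration (any text file works - the name is a placeholder):
# tally the distinct lines in a file, most frequent first
sort some_file.txt | uniq -c | sort -n -r | head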
Find out if you have some unusually abundant sequences:
head -100000 seqs.fastq | grep -A 1 '^@HWI' | grep -v '^@HWI' | sort | uniq -c | sort -n -r | head
This will produce a ranked list of the most abundant sequences from among the first 25,000 sequences (remember 4 lines per sequence in a fastq).
It can be easy to fool this algorithm, especially if your sequences are long and have a higher error rate at the end, in which case you can adjust the uniq command to compare, say, only the first 30 bp:
head -100000 seqs.fastq | grep -A 1 '^@HWI' | grep -v '^@HWI' | sort | uniq -c -w 30 | sort -n -r | head
If you are expecting a constant motif somewhere in all your data, it's a good idea to see if it's there. Just add an awk to snip out where the CR should be (bp 12 to 22 in this example):
head -100000 seqs.fastq | grep -A 1 '^@HWI' | grep -v '^@HWI' | awk '{print substr($1,12,10)}' | sort | uniq -c | sort -n -r | head
Take two fastq files and turn them into a paired fasta file - drop the quality info and put each read pair one after the other:
paste $read1s1 $read2s1 | awk 'BEGIN {c=0} {c++; if (tag!="") {print tag "\n" $1 "\n" tag "\n" $2; tag=""} if (substr($1,1,4)=="@HWI") {tag=$1}}' \
| sed s/'^@HWI'/'>HWI'/ > $output.paired_s1.fasta 2> $output.log &
If you'd like to filter only on read pairs with quality values no smaller than 10, it's a bit more work:
paste $read1s1 $read2s1 | \
awk 'BEGIN {c=0} {c++; if (c==4) {print $1 "\t" $2 "\t" seq1 "\t" seq2 "\t" tag; c=0} if (c==2) {seq1=$1; seq2=$2;} if (c==1) {tag=$1}}' \
| awk 'BEGIN \
{ convert="!\"#$%&\x27\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f\x40"; \
  convert=convert "ABCDEFGHIJKLMNOPQRSTUVWXYZ" } \
{ badQual=0; \
  for (i=1;i<=length($1);i++) {if (index(convert, substr($1,i,1))-1<10) {badQual=1} }; \
  for (i=1;i<=length($2);i++) {if (index(convert, substr($2,i,1))-1<10) {badQual=1} }; \
  if (badQual!=1) { print $5 "\n" $3 "\n" $5 "\n" $4 } }' \
| sed s/'^@HWI'/'>HWI'/ > $output.paired_s1.fasta 2> $output.log &
Note that you could have combined the two awk commands - I don't like to do that because it's harder for me to see what I'm doing, and harder for me to copy it for the next problem I need to solve. Like that handy awk script to convert ASCII quality values to numbers?
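That conversion trick stands on its own, too. A minimal sketch that prints the numeric Phred scores for one quality string - this assumes the Sanger/Illumina 1.8+ ASCII-33 encoding, which is what the lookup string above encodes (the example string is made up):
echo 'IIIIHHGF#' | awk 'BEGIN \
{ convert="!\"#$%&\x27\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f\x40"; \
  convert=convert "ABCDEFGHIJKLMNOPQRSTUVWXYZ" } \
{ for (i=1;i<=length($1);i++) printf "%d ", index(convert, substr($1,i,1))-1; print "" }'
# prints: 40 40 40 40 39 39 38 37 2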
If you have, or need, bam files, use samtools to interconvert:
BAM->SAM:
samtools view <infile.bam> > outfile.sam
SAM->BAM:
samtools view -S -b <infile.sam> > outfile.bam
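One caveat: samtools view omits the header by default, so the SAM produced by the BAM->SAM line above can't be turned back into a BAM on its own. Add -h to keep the header:
samtools view -h <infile.bam> > outfile.sam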
If you need to globally change the names of all your references in a sam file, you can use something like this:
cat notrrna.genome.sam | awk 'BEGIN {FS="\t"; OFS="\t"} \
{if ($3=="mitochondria") {$3="ChrM"} if ($3=="chloroplast") {$3="ChrC"} \
if ($3=="2") {$3="Chr2"} if ($3=="1") {$3="Chr1"} if ($3=="3") {$3="Chr3"} \
if ($3=="4") {$3="Chr4"} if ($3=="5") {$3="Chr5"} \
print}' > notrrna.genome.tair.sam
Of course you can also use samtools reheader to accomplish the same thing. It works more reliably on large .bam files with many contigs.
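The reheader route looks like this: pull out just the header with -H, edit it, and graft it back on (samtools reheader accepts '-' to read the new header from stdin; same hypothetical rename as above):
samtools view -H notrrna.genome.bam | sed s/'mitochondria'/'ChrM'/g | samtools reheader - notrrna.genome.bam > notrrna.genome.tair.bam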
If you only need to change the name of one reference, I'd use sed and a trick like this to pipe straight from binary to text and back to binary:
samtools view -h notrrna.genome.bam | sed s/'mitochondria'/'ChrM'/g | samtools view -S -b - > blah.bam
You can put any combination of grep/sed/awk you like in between the two samtools commands; this might be useful if you have a mixed population. You should map all the data at once to a concatenated reference, but then might want to separate the BAM file into two - one for organism A and one for organism B.
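As a sketch of that split, suppose you prefixed organism A's contig names with "orgA_" when building the concatenated reference (the prefix and file names here are hypothetical):
# keep header lines plus alignments whose reference name (field 3) belongs to organism A
samtools view -h mixed.bam | awk 'BEGIN {FS="\t"} /^@/ || $3 ~ /^orgA_/' | samtools view -S -b - > orgA.bam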
You can also use this for things like searching for indels quickly by scanning the CIGAR string, or parsing any of the custom fields that your mapper/genotyper has produced.
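For instance, a quick scan for gapped alignments - reads whose CIGAR string (field 6) contains an insertion or deletion operation (file name is a placeholder):
samtools view input.bam | awk 'BEGIN {FS="\t"} $6 ~ /[ID]/' | head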
To convert a bam file back to fastq:
samtools view input.bam | \
awk 'BEGIN {FS="\t"} {print "@" $1 "\n" $10 "\n+\n" $11}' > output.bam.fastq
You can also do this with bamtools and picard.
This seqanswers post has a great two-liner to add a read group to a BAM file, and contrasts it with a perl program (many more than 2 lines) to do the same thing.
If you'd like to find out the most abundant sequence start site, try this:
cat in.sam | head -100000 | awk '{print $3"_"$4}' | sort | uniq -c | sort -n -r | head
Note that this samples just the first 100,000 mapped sequences - works fine on an unsorted sam file, but doesn't make sense on a sorted bam/sam file. On a sorted file, you have to process the whole file. Even so, doing this command on a few tens of millions of sequences should take only a few minutes.
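The whole-file version is the same pipeline minus the sampling; reading straight from the bam avoids an intermediate sam (file name is a placeholder):
samtools view in.sorted.bam | awk '{print $3"_"$4}' | sort | uniq -c | sort -n -r | head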
If you'd like to join mated sequences in a sam/bam file onto one line (and discard the unmated sequences) so you can process them jointly, this method is pretty robust:
cat in.sam | awk '{if (and(0x0040,$2)&&!(and(0x0008,$2))&&!(and(0x0004,$2))) {print $1 "\t" $2 "\t" $3 "\t" $4}}' > r1
cat in.sam | awk '{if (and(0x0080,$2)&&!(and(0x0008,$2))&&!(and(0x0004,$2))) {print $1 "\t" $2 "\t" $3 "\t" $4}}' > r2
paste r1 r2 > paired.out
It uses the SAM file's flag field (field 2) and some bitwise operations to confirm that both the read and its mate are mapped, puts the separate alignments into separate files, then uses paste to join them. Since mappers like BWA may output a variable number of fields for each read, I've just used the first four fields. You can also do this with a one-line awk command and do more processing within that command if you want to (left as an exercise to the reader).
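For reference, the bits tested are from the SAM spec: 0x4 = read unmapped, 0x8 = mate unmapped, 0x40 = first in pair, 0x80 = second in pair. One possible shape of the single-pass version caches first-in-pair lines by read name and emits on seeing the mate (gawk, since and() is a gawk extension; this buffers unmatched R1 records in memory and assumes each first-in-pair line appears before its mate, as in mapper-order or name-sorted output):
gawk '!/^@/ && !and(0x4,$2) && !and(0x8,$2) {
  if (and(0x40,$2)) r1[$1] = $1 "\t" $2 "\t" $3 "\t" $4
  else if ($1 in r1) print r1[$1] "\t" $1 "\t" $2 "\t" $3 "\t" $4
}' in.sam > paired.out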
It's very easy to get some quick stats on VCF files with linux utilities.
Count all the variant records (skipping the header lines):
cat *.vcf | grep -v '^#' | wc -l
Count the chromosome/position/alt-allele combinations that appear more than once (i.e., called in multiple files):
cat *.vcf | grep -v '^#' | awk '{print $1 "\t" $2 "\t" $5}' | sort | uniq -d | wc -l
Count the combinations seen exactly 3 times (e.g., shared by exactly 3 of the files):
cat *.raw.vcf | grep -v '^#' | awk '{print $1 "\t" $2 "\t" $5}' | sort | uniq -c | grep ' 3 ' | wc -l
Count the records tagged INDEL in each file:
grep -c INDEL *.vcf
Histogram the AF1 allele-frequency estimates, binned to one decimal place:
cat *.vcf | awk 'BEGIN {FS=";"} {for (i=1;i<=NF;i++) {if (index($i,"AF1")!=0) {print $i} }}' | \
awk 'BEGIN {FS="="} {print int($2*10)/10}' | sort | uniq -c | sort -n -r | head
Take Illumina CASAVA >= 1.8 reads and add /1 or /2 to the end of the first field of the tag, throwing out the second field of the tag.
gawk '{print((NR % 4 == 1) ? $1"/1" : $0)}' Sample1_R1.fq > Sample1_newtags_R1.fq
gawk '{print((NR % 4 == 1) ? $1"/2" : $0)}' Sample1_R2.fq > Sample1_newtags_R2.fq
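To make the renaming concrete, here is what it does to one CASAVA 1.8-style header line (this read name is invented for illustration):
# before:  @M00123:55:000000000-A1B2C:1:1101:15589:1331 1:N:0:ATCACG
# after:   @M00123:55:000000000-A1B2C:1:1101:15589:1331/1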
If a FASTQ file has paired-end reads intermingled and you want to separate them into /1 and /2 files, use this script. It assumes the /1 reads precede the /2 reads.
seqtk seq -l0 Shuffled.fq | gawk '{if ((NR-1) % 8 < 4) print >> "Separate_1.fq"; else print >> "Separate_2.fq"}'
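If you don't have seqtk handy and each record is already exactly 4 lines, paste/cut/tr from the core toolbox do the same deinterleaving (same assumption that the /1 read precedes its /2 mate):
paste - - - - - - - - < Shuffled.fq | cut -f1-4 | tr '\t' '\n' > Separate_1.fq
paste - - - - - - - - < Shuffled.fq | cut -f5-8 | tr '\t' '\n' > Separate_2.fq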
After some operations on separate R1 and R2 FASTQ files, the paired-end records may no longer be matched, and you'll need to eliminate the records that have lost their mates. Flatten each file to one tab-separated record per line, sort and join them on the read name, then split the joined lines back into two matched fastq files:
gawk '{printf((NR % 4 == 0) ? $0"\n" : $0"\t")}' Sample_R1.fq > Sample_1-line_R1.tab
gawk '{printf((NR % 4 == 0) ? $0"\n" : $0"\t")}' Sample_R2.fq > Sample_1-line_R2.tab
sort Sample_1-line_R1.tab > Sample_1-line_sorted_R1.tab
sort Sample_1-line_R2.tab > Sample_1-line_sorted_R2.tab
join Sample_1-line_sorted_R1.tab Sample_1-line_sorted_R2.tab > Sample_1-line_joined.tab
gawk '{printf($1"\n"$2"\n"$3"\n"$4"\n") >> "Sample_matched_R1.fq"; printf($1"\n"$5"\n"$6"\n"$7"\n") >> "Sample_matched_R2.fq"}' Sample_1-line_joined.tab
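A quick sanity check on the result - the matched files should have identical line counts. (Note this whole approach assumes read names contain no internal spaces, e.g., after the /1 and /2 renaming above, since sort/join/awk here split on whitespace.)
wc -l Sample_matched_R1.fq Sample_matched_R2.fq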