09 September 2016

Playing with #magicblast, the #NCBI Short read mapper. My notebook

NCBI MAGIC Blast was recently mentioned by BioMickWatson on twitter.



Here, I'll be playing with magicblast and I'll compare its output with bwa (Makefile below).

First, here is an extract of the manual for magicblast.
USAGE
DESCRIPTION
   Short read mapper

 *** Input query options
 -query <File_In>
   Input file name
   Default = `-'
    * Incompatible with:  sra
 -infmt <String, `asn1', `asn1b', `fasta', `fastc', `fastq'>
   Input format for sequences
   Default = `fasta'
    * Incompatible with:  sra
 -paired
   Input query sequences are paired
 -query_mate <File_In>
   FASTA file with mates for query sequences (if given in another file)
    * Requires:  query
 -sra <String>
   Comma-separated SRA accessions
    * Incompatible with:  query, infmt

 *** Formatting options
 -outfmt <String, Permissible values: 'asn' 'sam' 'tabular' >
   alignment view options:
   sam = SAM format,
   tabular = Tabular format,
   text ASN.1
   Default = `sam'

Indexing the reference for magicblast

I've indexed the human chr22 using the good old NCBI makeblastdb:
$ wget -O "chr22.fa.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz"
$ gunzip -f chr22.fa.gz
$ makeblastdb -in chr22.fa -dbtype nucl -title chr22

Mapping Paired-End reads with magicblast

First I've removed the read names' suffixes '/1' and '/2' because they still appear in the final bam.
gunzip -c R1.aa.fq.gz | sed 's/\/1$$//' | gzip > R1.fq.gz
gunzip -c R2.aa.fq.gz | sed 's/\/2$$//' | gzip > R2.fq.gz
The reads are then mapped with magicblast using the following command:
magicblast -db chr22 \
 -infmt fastq -paired \
 -query R1.fq.gz \
 -query_mate R2.fq.gz |\
 sed 's/gnl\|BL_ORD_ID\|0/chr22/g' |\
 samtools sort -o child.magic.bam -T child.magic --reference chr22.fa -O BAM
As far as I could see, there was no option to specify the read group (RG).

Here,I've transformed the name of the contig because it looks like a blast-id identifier:
I've also mapped the reads using bwa:

bwa mem  chr22.fa R1.fq.gz R2.fq.gz |\
 samtools sort -o child.bwa.bam -T child.bwa --reference chr22.fa -O BAM

BAM headers

Here is the BAM header for magicblast: Magic blast uses the spec v1.2 and has a Group-Order flag.

@HD VN:1.2 GO:query SO:coordinate
@SQ SN:chr22 LN:50818468
@PG ID:0 PN:magicblast magicblast -db chr22.fa -infmt fastq -paired -query R1.fq.gz -query_mate R2.fq.gz

And the BAM header for bwa:
@HD VN:1.3 SO:coordinate
@SQ SN:chr22 LN:50818468
@PG ID:bwa PN:bwa VN:0.7.15-r1140 bwa mem chr22.fa R1.fq.gz R2.fq.gz

Samtools samflags

for magicblast:
24387 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
24387 + 0 mapped (100.00% : N/A)
24387 + 0 paired in sequencing
12222 + 0 read1
12222 + 0 read2
24330 + 0 properly paired (99.77% : N/A)
24387 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
57 + 0 with mate mapped to a different chr
57 + 0 with mate mapped to a different chr (mapQ>=5)

for bwa:
24001 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1 + 0 supplementary
0 + 0 duplicates
23976 + 0 mapped (99.90% : N/A)
24000 + 0 paired in sequencing
12000 + 0 read1
12000 + 0 read2
23840 + 0 properly paired (99.33% : N/A)
23952 + 0 with itself and mate mapped
23 + 0 singletons (0.10% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Validationg with picard

Validation of child.magic.bam with picard generates many errors:
$ java -jar picard.jar  ValidateSamFile  I=child.magic.bam IGNORE=RECORD_MISSING_READ_GROUP
ERROR: Read groups is empty
ERROR: Header version: 1.2 does not match any of the acceptable versions: 1.0, 1.3, 1.4, 1.5
ERROR: Record 4, Read name HWI-1KL149:97:C6Y6VACXX:4:2306:7588:23627, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 6, Read name HWI-1KL149:97:C6Y6VACXX:5:2303:6772:28953, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 15, Read name HWI-1KL149:97:C6Y6VACXX:4:1207:16508:83772, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 17, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 19, Read name HWI-1KL149:97:C6Y6VACXX:5:1315:20109:45547, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 29, Read name HWI-1KL149:97:C6Y6VACXX:4:2313:12478:85202, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 36, Read name HWI-1KL149:97:C6Y6VACXX:4:1202:11437:96159, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 40, Read name HWI-1KL149:97:C6Y6VACXX:4:1213:16611:87818, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 45, Read name HWI-1KL149:97:C6Y6VACXX:4:1208:7944:83271, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 49, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 22, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate
ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate
ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 25, Read name HWI-1KL149:97:C6Y6VACXX:4:2209:8319:82198, Mate negative strand flag does not match read negative strand flag of mate
(...)
while there is ~no error for child.bwa.bam:
$ java -jar picard.jar  ValidateSamFile  I=child.bwa.bam IGNORE=RECORD_MISSING_READ_GROUP
ERROR: Read groups is empty
[Fri Sep 09 16:09:28 CEST 2016] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=78643200



Variant Calling

Both BAMs are mapped with samtools/vcftools (min DP=10)
samtools mpileup  --uncompressed --output-tags DP --reference chr22.fa child.bwa.bam |\
 bcftools call --ploidy GRCh38  --multiallelic-caller --variants-only --format-fields GQ,GP - |\
 bcftools filter -e 'DP<10'  --output-type v -o child.bwa.vcf -
samtools mpileup  --uncompressed --output-tags DP --reference chr22.fa child.magic.bam |\
 bcftools call --ploidy GRCh38  --multiallelic-caller --variants-only --format-fields GQ,GP - |\
 bcftools filter -e 'DP<10'  --output-type v -o child.magic.vcf -
Number of variants:
$ grep -v "#" child.magic.vcf  | wc -l
111
$ grep -v "#" child.bwa.vcf  | wc -l
166

Here is the diff for CHROM/POS/REF/ALT:
  • Uniq to BWA: 'comm -13 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 64 variants
  • Uniq to Magic: 'comm -23 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 9 variants
  • Common variants: 'comm -12 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 102 variants

The Makefile

SHELL=/bin/bash
data.dir=${HOME}/src/DATA
ncbi.bin=${HOME}/packages/magicblast/bin
REF=chr22.fa
samtools.exe=${HOME}/packages/samtools/samtools 
bwa.exe=${HOME}/packages/bwa-0.7.15/bwa 

all: child.magic.bam child.bwa.bam

R1.fq.gz : ${HOME}/src/DATA/child.R1.aa.fq.gz 
 gunzip -c $< | sed 's/\/1$$//' | gzip > $@
R2.fq.gz : ${HOME}/src/DATA/child.R2.aa.fq.gz 
 gunzip -c $< | sed 's/\/2$$//' | gzip > $@

child.bwa.bam : ${REF}.bwt R1.fq.gz R2.fq.gz 
 ${bwa.exe} mem  ${REF} $(word 2,$^) $(word 3,$^) |\
 ${samtools.exe} sort -o $@ -T $(basename $@) --reference ${REF} -O BAM 


child.magic.bam : ${REF}.nin R1.fq.gz R2.fq.gz 
 ${ncbi.bin}/magicblast -db ${REF} -infmt fastq -paired -query $(word 2,$^) -query_mate $(word 3,$^) |\
 sed 's/gnl\|BL_ORD_ID\|0/$(basename ${REF})/g' |\
 ${samtools.exe} sort -o $@ -T $(basename $@) --reference ${REF} -O BAM 

${REF}.bwt : ${REF}
 ${bwa.exe} index $<

${REF}.nin : ${REF}
 ${ncbi.bin}/makeblastdb -in $< -dbtype nucl -title $(basename ${REF})

${REF} :
 wget -O "$@.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/$@.gz"
 gunzip -f $@.gz

That's it,
Pierre

1 comment:

Jeremy said...

Hi Pierre

Thanks for posting the results. How come you are getting a weird result with magic-BLAST indicating a few read pairs mapping to different chromosomes when your ref sequence database contain only one chromosome?

Jeremy