26 July 2013

g1kv37 vs hg19

In order to create a class to translate the chromosome names from one naming convention to another. I've compared the MD5 sums of the human genome versions g1k/v37 and ucsc/hg19. Here is the java program to create the MD5s:

The MD5 sums were extracted as follow:



Here are the common chromosomes, joined on the hash-sum:


And here are the unpairable data:


I knew the problem for chrY ( http://www.biostars.org/p/58143/) but not for chr3.. What is the problem for this chromosome ?

Edit: Here are the number of bases for UCSC/chr3:

{T=58760485, G=38670110, A=58713343, C=38653197, N=3225295}
and for g1kv37:
{T=58760485, G=38670110, A=58713343, R=2, C=38653197, M=1, N=3225292}

That's it,



Pierre.

18 July 2013

Running a picard tool in the #KNIME workflow engine

http://www.knime.org/ is "a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting". In this post, I'll show how to invoke an external java program, and more precisely a tool from the picard library from with knime. The workflow: load a list of BAM filenames, invoke SortSam and display the names of the sorted files.

Construct the following workflow:



Edit the FileReader node and load a list of paths to the BAMs


Edit the properties of the java node, in the "Additional Libraries" tab, load the jar of SortSam.jar



Edit the java snippet node, create a new column SORTED_BAM for the output.



and copy the following code:

// Your custom imports:
import net.sf.picard.sam.SortSam;
import java.io.File;
----------------------------------------------------------
// Enter your code here:


File input=new File(c_BAM);

//build the output filename 
out_SORTED = input.getName();
if(!(out_SORTED.endsWith(".sam") || out_SORTED.endsWith(".bam")))
{
 throw new Abort("not a SAM/BAM :"+c_BAM);
}
int dot=out_SORTED.lastIndexOf('.');
out_SORTED=new File(input.getParentFile(),out_SORTED.substring(0, dot)+"_sorted.bam").getPath();

//create a new instance of SortSam
SortSam cmd=new SortSam();

//invoke the instance
int ret=cmd.instanceMain(new String[]{
 "I="+input.getPath(),
 "O="+out_SORTED,
 "SO=coordinate",
 "VALIDATION_STRINGENCY=LENIENT",
 "CREATE_INDEX=true",
 "MAX_RECORDS_IN_RAM=500000"
 });

if(ret!=0)
{
 throw new Abort("SortSam failed with: "+c_BAM+" "+out_SORTED);
}
Execute KNIME, picard runs the job, and get the names of the sorted BAMs:



Edit:

The workflow was uplodaded on MyExperiment at http://www.myexperiment.org/workflows/3654.html.


That's it,

Pierre


Inside the Variation toolkit: annotating a VCF with the data of NCBI biosystems mapped to BED.

Let's annotate a VCF file with the data from the NCBI biosystem.

First the 'NCBI biosystem' data are mapped to a BED file using the following script. It joins "ncbi;biosystem2gene", "ncbi:biosystem-label" and "biomart-ensembl:gene"


It produces a tabix-inded BED mapping the data of 'NCBI biosystem':

$ gunzip -c ncbibiosystem.bed.gz | head
1 69091 70008 79501 106356 30 Signaling_by_GPCR
1 69091 70008 79501 106383 50 Olfactory_Signaling_Pathway
1 69091 70008 79501 119548 40 GPCR_downstream_signaling
1 69091 70008 79501 477114 30 Signal_Transduction
1 69091 70008 79501 498 40 Olfactory_transduction
1 69091 70008 79501 83087 60 Olfactory_transduction
1 367640 368634 26683 106356 30 Signaling_by_GPCR
1 367640 368634 26683 106383 50 Olfactory_Signaling_Pathway
1 367640 368634 26683 119548 40 GPCR_downstream_signaling
1 367640 368634 26683 477114 30 Signal_Transduction

I wrote a tool named VCFBed: it takes as input a VCF and annotate it with the data of a tabix-ed BED. This tool is available on github at https://github.com/lindenb/jvarkit#vcfbed. Let's annotate a remote VCF with ncbibiosystem.bed.gz :
java -jar dist/vcfbed.jar \
   TABIXFILE=~/ncbibiosystem.bed.gz \
   TAG=NCBIBIOSYS \
   FMT='($4|$5|$6|$7)'|\
grep -E '(^#CHR|NCBI)'

##INFO=<ID=NCBIBIOSYS,Number=.,Type=String,Description="metadata added from /home/lindenb/ncbibiosystem.bed.gz . Format was ($4|$5|$6|$7)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1094PC0005 1094PC0009 1094PC0012 1094PC0013
1 69270 . A G 2694.18 . AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=32.86 GT:AD:DP:GQ:PL ./. ./. 1/1:0,3:3:9.03:106,9,0 1/1:0,6:6:18.05:203,18,0
1 69511 . A G 77777.27 . AC=49;AF=0.875;AN=56;BaseQRankSum=0.150;DP=2816;DS;Dels=0.00;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=21.286;HRun=0;HaplotypeScore=3.8956;InbreedingCoeff=0.0604;MQ=32.32;MQ0=0;MQRankSum=1.653;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=27.68;ReadPosRankSum=2.261 GT:AD:DP:GQ:PL ./. ./. 0/1:2,4:6:15.70:16,0,40 0/1:2,2:4:21.59:22,0,40

That's it,

Pierre

15 July 2013

Playing with the "UCSC Genome Browser Track Hubs". my notebook

The UCSC has recently created the Genome Browser Track Hubs: " Track hubs are web-accessible directories of genomic data that can be viewed on the UCSC Genome Browser. ". I've created a Hub for the Rotavirus Genome hosted on github at:https://github.com/lindenb/genomehub.
My data were primarily described as a XML file. It contains a description of the genome, of the tracks, the path to the fasta sequence etc... The FASTA sequence was provided by Dr Didier Poncet (CNRS/Gig). As far as I understand, it is not currently possible to specify that a track describes a protein.

<?xml version="1.0" encoding="UTF-8"?>
<genomeHub >
  <name>Rotavirus</name>
  <shortLabel>Rotavirus</shortLabel>
  <longLabel>Rotavirus</longLabel>
  (...)
  <accessions id="set1">
   <acn>GU144588</acn>
 <acn source="uniprot">Q0H8C5</acn>
 <acn source="uniprot">Q45UF6</acn>
 (..)
  <genome id="rf11">
    <description>Rotavirus RF11</description>
    <organism>Rotavirus</organism>
    <defaultPos>RF01:1-10</defaultPos>
    <scientificName>Rotavirus</scientificName>
    <organism>Rotavirus</organism>
    <orderKey>10970</orderKey>
    <fasta>rotavirus/rf/rf.fa</fasta>
     (...)
 <group id="active_site"><accessions ref="set1"/><include>active site</include></group>
 <group id="calcium-binding_region"><accessions ref="set1"/><include>calcium-binding region</include></group>
 <group id="chain"><accessions ref="set1"/><include>chain</include></group>
   (...)
This XML file is then processed with the following xsl stylsheet: https://github.com/lindenb/genomehub/blob/master/data/genomehub.xml : it generates a Makefile that will translate the fasta sequence to 2bit, create the bed files by aligning some annotated files to the reference with blast and convert them to bigbed.
At the end, my directory contains the following files:
./data/genomehub.xml
./data/genomehub2make.xsl
./data/sequence2fasta.xsl
./data/hub.txt
./data/genomes.txt
./data/rotavirus
./data/rotavirus/rf
./data/rotavirus/rf/signal_peptide.bed
./data/rotavirus/rf/CDS.bed
./data/rotavirus/rf/turn.bb
./data/rotavirus/rf/chrom.sizes
./data/rotavirus/rf/site.bed
./data/rotavirus/rf/coiled-coil_region.bed
./data/rotavirus/rf/mutagenesis_site.bb
./data/rotavirus/rf/UTR.bed
./data/rotavirus/rf/reference.fa~
./data/rotavirus/rf/misc_feature.bed
./data/rotavirus/rf/CDS.bb
./data/rotavirus/rf/helix.bed
./data/rotavirus/rf/strand.bb
./data/rotavirus/rf/sequence_conflict.bb
./data/rotavirus/rf/modified_residue.bb
./data/rotavirus/rf/coiled-coil_region.bb
./data/rotavirus/rf/topological_domain.bb
./data/rotavirus/rf/active_site.bed
./data/rotavirus/rf/sequence_variant.bb
./data/rotavirus/rf/transmembrane_region.bb
./data/rotavirus/rf/zinc_finger_region.bed
./data/rotavirus/rf/region_of_interest.bb
./data/rotavirus/rf/glycosylation_site.bb
./data/rotavirus/rf/domain.bb
./data/rotavirus/rf/region_of_interest.bed
./data/rotavirus/rf/misc_feature.bb
./data/rotavirus/rf/topological_domain.bed
./data/rotavirus/rf/sequence_conflict.bed
./data/rotavirus/rf/UTR.bb
./data/rotavirus/rf/compositionally_biased_region.bed
./data/rotavirus/rf/chain.bed
./data/rotavirus/rf/glycosylation_site.bed
./data/rotavirus/rf/trackDb.txt
./data/rotavirus/rf/modified_residue.bed
./data/rotavirus/rf/disulfide_bond.bed
./data/rotavirus/rf/strand.bed
./data/rotavirus/rf/helix.bb
./data/rotavirus/rf/compositionally_biased_region.bb
./data/rotavirus/rf/transmembrane_region.bed
./data/rotavirus/rf/rf.fa
./data/rotavirus/rf/rf.2bit
./data/rotavirus/rf/splice_variant.bed
./data/rotavirus/rf/short_sequence_motif.bed
./data/rotavirus/rf/rf.fa.nsq
./data/rotavirus/rf/ALL.bed.blast.xml~
./data/rotavirus/rf/gene.bed
./data/rotavirus/rf/sequence_variant.bed
./data/rotavirus/rf/disulfide_bond.bb
./data/rotavirus/rf/signal_peptide.bb
./data/rotavirus/rf/rf.fa.nin
./data/rotavirus/rf/short_sequence_motif.bb
./data/rotavirus/rf/turn.bed
./data/rotavirus/rf/domain.bed
./data/rotavirus/rf/mutagenesis_site.bed
./data/rotavirus/rf/zinc_finger_region.bb
./data/rotavirus/rf/chain.bb
./data/rotavirus/rf/rf.fa.nhr
./data/rotavirus/rf/splice_variant.bb
./data/rotavirus/rf/active_site.bb
./data/rotavirus/rf/site.bb
./data/rotavirus/rf/description.html
./README.md

The files required by the UCSC are then pushed on github and the URL pointing to hub.txt (https://raw.github.com/lindenb/genomehub/master/data/hub.txt) is registered at http://genome.ucsc.edu/cgi-bin/hgHubConnect. And a few clicks later...





That's it,

Pierre




12 July 2013

Inside the Variation Toolkit: Gene Ontology for VCF, GUI for VCF

A quick note about three java-based tools for VCF files I wrote today.

VcfViewGui

VcfViewGui : a Simple java-Swing-based VCF viewer.


VCFGeneOntology

vcfgo reads a VCF annotated with VEP or SNPEFF, loads the data from GeneOntology and GOA and adds a new field in the INFO column for the GO terms for each position.
Example:
$ java -jar dist/vcfgo.jar I="https://raw.github.com/arq5x/gemini/master/test/tes.snpeff.vcf" |\
    grep -v -E '^##' | head -n 3

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1094PC0005  1094PC0009  1094PC0012  1094PC0013
chr1    30860   .   G   C   33.46   .   AC=2;AF=0.053;AN=38;BaseQRankSum=2.327;DP=49;Dels=0.00;EFF=DOWNSTREAM(MODIFIER||||85|FAM138A|protein_coding|CODING|ENST00000417324|),DOWNSTREAM(MODIFIER|||||FAM138A|processed_transcript|CODING|ENST00000461467|),DOWNSTREAM(MODIFIER|||||MIR1302-10|miRNA|NON_CODING|ENST00000408384|),INTRON(MODIFIER|||||MIR1302-10|antisense|NON_CODING|ENST00000469289|),INTRON(MODIFIER|||||MIR1302-10|antisense|NON_CODING|ENST00000473358|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000423562|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000430492|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000438504|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000488147|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000538476|);FS=3.128;HRun=0;HaplotypeScore=0.6718;InbreedingCoeff=0.1005;MQ=36.55;MQ0=0;MQRankSum=0.217;QD=16.73;ReadPosRankSum=2.017 GT:AD:DP:GQ:PL  0/0:7,0:7:15.04:0,15,177    0/0:2,0:2:3.01:0,3,39   0/0:6,0:6:12.02:0,12,143    0/0:4,0:4:9.03:0,9,119
chr1    69270   .   A   G   2694.18 .   AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;GOA=OR4F5|GO:0004984&GO:0005886&GO:0004930&GO:0016021;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;QD=32.86   GT:AD:DP:GQ:PL  ./. ./. 1/1:0,3:3:9.03:106,9,0  1/1:0,6:6:18.05:203,18,0

VCFFilterGeneOntology

vcffiltergo reads a VCF annotated with VEP or SNPEFF, loads the data from GeneOntology and GOA and adds a filter in the FILTER column if a gene at the current genomic location is a descendant of a given GO term.
Example:
$  java -jar dist/vcffiltergo.jar I="https://raw.github.com/arq5x/gemini/master/test/test1.snpeff.vcf"  \
    CHILD_OF=GO:0005886 FILTER=MEMBRANE  |\
    grep -v "^##"   | head -n 3

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1094PC0005  1094PC0009  1094PC0012  1094PC0013
chr1    30860   .   G   C   33.46   PASS    AC=2;AF=0.053;AN=38;BaseQRankSum=2.327;DP=49;Dels=0.00;EFF=DOWNSTREAM(MODIFIER||||85|FAM138A|protein_coding|CODING|ENST00000417324|),DOWNSTREAM(MODIFIER|||||FAM138A|processed_transcript|CODING|ENST00000461467|),DOWNSTREAM(MODIFIER|||||MIR1302-10|miRNA|NON_CODING|ENST00000408384|),INTRON(MODIFIER|||||MIR1302-10|antisense|NON_CODING|ENST00000469289|),INTRON(MODIFIER|||||MIR1302-10|antisense|NON_CODING|ENST00000473358|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000423562|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000430492|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000438504|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000488147|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000538476|);FS=3.128;HRun=0;HaplotypeScore=0.6718;InbreedingCoeff=0.1005;MQ=36.55;MQ0=0;MQRankSum=0.217;QD=16.73;ReadPosRankSum=2.017 GT:AD:DP:GQ:PL  0/0:7,0:7:15.04:0,15,177    0/0:2,0:2:3.01:0,3,39   0/0:6,0:6:12.02:0,12,143    0/0:4,0:4:9.03:0,9,119
chr1    69270   .   A   G   2694.18 MEMBRANE    AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;QD=32.86 GT:AD:DP:GQ:PL  ./. ./. 1/1:0,3:3:9.03:106,9,0  1/1:0,6:6:18.05:203,18,0



That's it,

Pierre

09 July 2013

Mapping the annotations of a query sequence on a BLAST hit, my notebook.

This post is the answer to my own question on biostar "BLASTN / TBLASTN : mapping the features of the query to the hit.". I wrote a java program to map the annotations of a sequence to the Hit of a Blast result. The tool is available on github at https://github.com/lindenb/jvarkit.
For example, say you want to map the features of the Uniprot record for Rotavirus NSP3 (http://www.uniprot.org/uniprot/P04514) on Genbank http://www.ncbi.nlm.nih.gov/nuccore/AY065842.1 (RNA).

Download P04514 as XML

TblastN P04514(protein) vs AY065842 (RNA) and save the BLAST result as XML:

<BlastOutput_program>tblastn</BlastOutput_program>
  <BlastOutput_version>TBLASTN 2.2.28+</BlastOutput_version>
(...)
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|18139606|gb|AY065842.1|</Hit_id>
  <Hit_def>AY065842</Hit_def>
  <Hit_accession>AY065842</Hit_accession>
  <Hit_len>1078</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>546.584</Hsp_bit-score>
      <Hsp_score>1407</Hsp_score>
      <Hsp_evalue>0</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>313</Hsp_query-to>
      <Hsp_hit-from>26</Hsp_hit-from>
      <Hsp_hit-to>964</Hsp_hit-to>
      <Hsp_query-frame>0</Hsp_query-frame>
      <Hsp_hit-frame>2</Hsp_hit-frame>
      <Hsp_identity>260</Hsp_identity>
      <Hsp_positive>260</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>313</Hsp_align-len>
      <Hsp_qseq>MLKMESTQQMASSIINTSFEAAVVAATSTLELMGIQYDYNEIYTRVKSKFDYVMDDSGVKNNLLGKAATIDQALNGKFGSVMRNKNWMTDSRTVAKLDEDVNKLRMMLSSKGIDQKMRVLNACFSVKRIPGKSSSVIKCTRLMKDKIERGAVEVDDSFVEEKMEVDTVDWKSRYDQLERRFESLKQRVNEKYTTWVQKAKKVNENMYSLQNVISQQQNQIADLQNYCSKLEADLQNKVGSLVSSVEWYLKSMELPDEVKTDIEQQLNSIDTISPINAIDDLEILIRNLIHDYDRTFLMFKGLLRQCNYEYAYE</Hsp_qseq>
      <Hsp_hseq>MLKMESTQQMASSIINSSFEAAVVAATSTLELMGIQYDYNEVYTRVKSKFDLVMDDSGVKNNLIGKAITIDQALNGKFSSAIRNRNWMTDSRTVAKLDEDVNKLRIMLSSKGIDQKMRVLNACFSVKRIPGKSSSIVKCTRLMKDKLERGEVEVDDSFVEEKMEVDTIDWKSRYEQLEKRFESLKHRVNEKYNHWVLKARKVNENMNSLQNVISQQQAHINELQMYNNKLERDLQSKIGSVVSSIEWYLRSMELSDDVKSDIEQQLNSIDQLNPVNAIDDFESILRNLISDYDRLFIMFKGLLQQCNYTYTYE</Hsp_hseq>
      <Hsp_midline>MLKMESTQQMASSIIN SFEAAVVAATSTLELMGIQYDYNE YTRVKSKFD VMDDSGVKNNL GKA TIDQALNGKF S  RN NWMTDSRTVAKLDEDVNKLR MLSSKGIDQKMRVLNACFSVKRIPGKSSS  KCTRLMKDK ERG VEVDDSFVEEKMEVDT DWKSRY QLE RFESLK RVNEKY  WV KA KVNENM SLQNVISQQQ  I  LQ Y  KLE DLQ K GS VSS EWYL SMEL D VK DIEQQLNSID   P NAIDD E   RNLI DYDR F MFKGLL QCNY Y YE</Hsp_midline>
    </Hsp>
   (..)
   
Now generate a BED file to map the features of P04514 on AY065842:
$ java -jar dist/blastmapannots.jar I=P04514.xml B=blast.xml

AY065842 25 961 Non-structural_protein_3 943 + 25961 255,255,255 1 936 25
AY065842 34 469 RNA-binding 970 + 34 469 255,255,255 1 435 34
AY065842 472 640 Dimerization 947 + 472 640 255,255,255 1 168 472
AY065842 532 724 Interaction_with_ZC3H7B 917 + 532 724 255,255,255 1 192 532
AY065842 646 961 Interaction_with_EIF4G1 905 + 646 961 255,255,255 1 315 646
AY065842 520 733 coiled-coil_region 916 + 520 733 255,255,255 1 213 520

That's it,

Pierre