21 May 2016

Playing with the @ORCID_Org / @ncbi_pubmed graph. My notebook.

"ORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized. "
I've recently discovered that pubmed now integrates ORCID identfiers.

And there are several minor problems, I found some articles where the ORCID id is malformed or where different people use the same ORCID-ID:

You can download the papers containing some orcid Identifiers using the entrez query http://www.ncbi.nlm.nih.gov/pubmed/?term=orcid[AUID].
I've used one of my tools pubmeddump to download the articles asXML and I wrote PubmedOrcidGraph to extract the author's orcid.
<?xml version="1.0" encoding="UTF-8"?>
  <!--Generated with PubmedOrcidGraph https://github.com/lindenb/jvarkit/wiki/PubmedOrcidGraph - Pierre Lindenbaum.-->
  <PubmedArticle pmid="27197243" doi="10.1101/gr.199760.115">
    <journal>Genome Res.</journal>
    <title>Improved definition of the mouse transcriptome via targeted RNA sequencing.</title>
    <Author orcid="0000-0002-4078-7413">
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    <Author orcid="0000-0002-4449-1863">
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    <Author orcid="0000-0002-6090-3100">
      <foreName>Anton J</foreName>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
  <PubmedArticle pmid="27197225" doi="10.1101/gr.204479.116">
    <journal>Genome Res.</journal>
Now, I want to insert those data into a sqlite3 database. I use the XSLT stylesheet below to convert the XML into some SQL statement.
<?xml version="1.0"?>
    exclude-result-prefixes="xalan str"
<xsl:output method="text"/>
<xsl:variable name="q">'</xsl:variable>

<xsl:template match="/">
create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
<xsl:apply-templates select="PubmedArticleSet/PubmedArticle"/>

<xsl:template match="PubmedArticle">
<xsl:for-each select="Author">
<xsl:variable name="o1" select="@orcid"/>insert or ignore into author(orcid,name,affiliation) values ('<xsl:value-of select="$o1"/>','<xsl:value-of select="translate(concat(lastName,' ',foreName),$q,' ')"/>','<xsl:value-of select="translate(affiliation,$q,' ')"/>');
<xsl:for-each select="following-sibling::Author">insert or ignore into collab(orcid1,orcid2) values(<xsl:variable name="o2" select="@orcid"/>
 <xsl:when test="str:strcmp( $o1 , $o2) < 0">'<xsl:value-of select='$o1'/>','<xsl:value-of select='$o2'/>'</xsl:when>
 <xsl:otherwise>'<xsl:value-of select='$o2'/>','<xsl:value-of select='$o1'/>'</xsl:otherwise>

This stylesheet contains an extension 'strmcp' for the xslt processor xalan to compare two XML strings
This extension is just used to always be sure that the field "orcid1" in the table "collab" is always lower than "orcid2" to avoid duplicates pairs.
./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml

create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4078-7413','Bussotti Giovanni','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-4449-1863');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4449-1863','Leonardi Tommaso','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4449-1863','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-6090-3100','Enright Anton J','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
and those sql statetements are loaded into sqlite3:
./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml |\
 sqlite3 orcid.sqlite

The next step is to produce a gexf+xml file to play with the orcid graph in gephi.
I use the following bash script to convert the sqlite3 database to gexf+xml.

cat << EOF
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.1draft/viz" version="1.2">
<creator>Pierre Lindenbaum</creator>
<description>Orcid Graph</description>
<graph defaultedgetype="undirected" mode="static">

<attributes class="node">
<attribute type="string" title="affiliation" id="0"/>

sqlite3 -separator ' ' -noheader  ${DB} 'select orcid,name,affiliation from author' |\
 sed  -e 's/&/&/g' -e "s/</\</g" -e "s/>/\>/g" -e "s/'/\'/g"  -e 's/"/\"/g' |\
 awk -F ' ' '{printf("<node id=\"%s\" label=\"%s\"><attvalue for=\"0\" value=\"%s\"/></node>\n",$1,$2,$3);}'

echo "</nodes><edges>"
sqlite3 -separator ' ' -noheader  ${DB} 'select orcid1,orcid2 from collab' |\
 awk -F ' ' '{printf("<edge source=\"%s\" target=\"%s\"/>\n",$1,$2);}'
echo "</edges></graph></gexf>"

The output is saved and then loaded into gephi.

That's it,


17 May 2016

finding new intron-exon junctions using the public Encode RNASeq data

I've been asked to look for some new / suspected / previously uncharacterized intron-exon junctions in public RNASeq data.
I've used the BAMs under http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/.

The following command is used to build the list of BAMs:

curl -s  "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/" |\
tr ' <>"' "\n" | grep -F .bam | grep -v bai | sort | uniq | sed 's/.bam$//' | sed 's/$/ \\/' 

wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep1V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dSplicesRep1V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dSplicesRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200SplicesRep1V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200SplicesRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il400AlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il400SplicesRep2V2 \

This list is inserted as a list named SAMPLES a Makefile.

For each BAM, we use samtools to retrieve the reads in the region(s)) of interest. The reads are then filtered with samjs (https://github.com/lindenb/jvarkit/wiki/SamJS) to only keep the reads carrying an intron-exon junction at the desired location(s). Basically, the javascript-based filter loops over the CIGAR string of the read, computes the genomic interval skipped when the cigar operator is a deletion or a skipped region/intron. The read is printed if it describes the new intron-exon junction.

All in one:

That's it,


04 March 2016

Now in picard: two javascript-based tools filtering BAM and VCF files.

SamJS and VCFFilterJS are two tools I wrote for jvarkit. Both tools use the embedded java javascript engine to filter BAM or VCF file.
To get a broader audience, I've copied those functionalities to Picard in 'FilterSamReads' and 'FilterVcf'.


FilterSamReads filters a SAM or BAM file with a javascript expression using the java javascript-engine.
The script puts the following variables in the script context: 'record' a SamRecord and 'header' a SAMFileHeader. Last value of the script should be a boolean to tell wether we should accept or reject the record.

The script samFilter.js

/** accept record if second base of DNA is a A */
function accept(r)
 return r.getReadString().length()>2 &&


Invoke and output

$ java -jar picard-tools-2.1.1/picard.jar \
 FilterSamReads I=in.sam  O=out.sam \
 JS=samFilter.js FILTER=includeJavascript
$ cat out.sam | cut -f 10 | cut -c 2 | sort | uniq



FilterVcf one or more hard filters to a VCF file to filter out genotypes and variants.
Filters a VCF file with a javascript expression interpreted by the java javascript engine. The script puts the following variables in the script context: 'variant' a VariantContext and 'header' a VCFHeader. Last value of the script should be a boolean to tell wether we should accept or reject the record.

The script variantFilter.js

/** prints a VARIATION if two samples at least have a DP>100 */ 
function myfilterFunction(thevariant)
    var samples=header.genotypeSamples;
    var countOkDp=0;

    for(var i=0; i< samples.size();++i)
        var sampleName=samples.get(i);
        if(! variant.hasGenotype(sampleName)) continue;
        var genotype = thevariant.genotypes.get(sampleName);
        if( ! genotype.hasDP()) continue;
        var dp= genotype.getDP();
        if(dp > 100 ) countOkDp++;
    return (countOkDp>2)

Invoke and output

java -jar picard-tools-2.1.1/picard.jar FilterVcf \
 I=in.vcf O=out.vcf \
$ grep -v '#' jeter.vcf | cut -f 7 | grep variantFilter | wc -l

That's it,

Reading a VCF file faster with java 8, htsjdk and java.util.stream.Stream

java 8 streams "support functional-style operations on streams of elements, such as map-reduce transformations on collections". In this post, I will show how I've implemented a java.util.stream.Stream of VCF variants that counts the number of items in dbsnp.

This example uses the java htsjdk API for reading variants.

When using parallel streams, the main idea is to implement a java.util.Spliterator that will split the sequence dictionary (the genome) into a maximum of N (here N=5) parts. Each part will count the number of variants in 1/N genome in its own thread. As we're using an tribble indexed VCF, it's easy to start counting at a given position of the genome.


the class ContigPos defines a chromosome and a position in the whole genome.

class ContigPos {
    /** contig/chromosome index in the dictionary */
    final int tid; 
    /** contig/chromosome name */
    final String contig;
    /** position in the chromosome */
    final int pos;

it contains a function to convert its' position to an index in the whole genome using the genome dictionary (htsjdk.samtools.SAMSequenceDictionary) .

long genomicIndex() {
    long n=0L;
    for(int i=0;i< this.tid;++i) {
        n += dict.getSequence(i).getSequenceLength();
    return n + this.pos;


VariantContextSpliterator is the main class. It splits the VCF file into parts and implements Spliterator<VariantContext> .

public class VariantContextSpliterator
    implements Closeable,Spliterator<VariantContext> {

VariantContextSpliterator contains the sequence dictionary and the path to the indexed VCF file

/** current VCF File reader */
private VCFFileReader vcfFileReader = null;
/** genome dictionary */
private final SAMSequenceDictionary dict ;

Each VariantContextSpliterator has is own private VCFileReader and CloseableIterator. Both should be closed when the is no more variant to be read.

/** current VCF File reader */
private VCFFileReader vcfFileReader = null;
/** current variant iterator */
private CloseableIterator<VariantContext> variantIter = null;

Each VariantContextSpliterator has a dedicated genomic region.

/* region start */
private ContigPos begin;
/** region end */
private ContigPos end ;

The very first VariantContextSpliterator will scan :

  • from begin = new ContigPos("chr1",0)
  • to end = new ContigPos("chrY",(size_of_chrY))

We don't want to open to many threads, so we're tracking the number of opened iterators in a AtomicInteger

AtomicInteger nsplits


VariantContextSpliterator.peek() is a method peeking the next Variant in the genomic interval.

We open the VCFFileReader if it was never opened, the number of opened files is incremented.

/** VCF reader was never opened */
if( this.vcfFileReader == null ) {
    /** open the VCF reader */
    this.vcfFileReader = new VCFFileReader(this.vcfFile, true);
    /** open a first iterator on the first chromosome */
    this.variantIter = this.vcfFileReader.query(
            this.dict.getSequence(this.begin.tid).getSequenceLength() /* whole chromosome size */
    /** increase the number of opened streams */

while there is no more variant available on this chromosome , open the next chromosome for reading:

while(!this.variantIter.hasNext()) {
    this.variantIter = null;
    if(this.begin.tid == this.end.tid) /* this is the last chromosome */
        return null;
        this.begin = new ContigPos(this.begin.tid+1, 0);
        this.variantIter = this.vcfFileReader.query(
            this.dict.getSequence(this.begin.tid).getSequenceLength() /* whole chromosome size */

get the next variant, update 'begin' with this variant. We close the VCFfileReader if we have reached the end of the genomic window.

/* get the next variant */
final VariantContext ctx = this.variantIter.next();
/* update 'begin' */
this.begin= new ContigPos(ctx.getContig(), ctx.getStart());

/** close if the end of the genomic location was reached */
if((this.begin.tid > this.end.tid) ||
   (this.begin.tid == this.end.tid && this.begin.pos >= this.end.pos) ) {
    return null;
this._peeked = ctx;
return this._peeked;


If a remaining variants exists, performs the given action on it, returning true; else returns false.

public boolean tryAdvance(Consumer<? super VariantContext> action) {
    final VariantContext ctx = this.next();
    if(ctx==null) {
        return false;
    return true;


trySplit returns a VariantContextSpliterator covering elements, that will, upon return from this method, not be covered by this VariantContextSpliterator. We can split if the remaining window size is greater than 1Mb and if the number of opened VCFReaderFile is lower than 10.

public Spliterator<VariantContext> trySplit() {
    final VariantContext ctx = this.peek();
    /** no more variant to read, can't split */
    if(ctx==null) return null;
    /** too many opened VCFFile reader, can't split */
    if( this.nsplits.get()>5) return null;

    long start = this.begin.genomicIndex();
    long distance = this.end.genomicIndex() - start;

    /** distance between begin and end is greater than 1Mb */
    while(distance > 1E6 )
        distance = distance/2;
        /** middle genomic index */
        final ContigPos mid = new ContigPos(start + distance);
        /** create a new VariantContextSpliterator starting from mid and ending at this.end */
        final VariantContextSpliterator next = new VariantContextSpliterator(this,mid,this.end);
        if(next.peek()==null) {
        /* update this.end to 'mid' */
        this.end= mid;
        //System.err.println("split successful:after split "+toString()+" and next="+next);
        return next;

    return null;


to get a stream , we the static function java.util.stream.StreamSupport.stream is called.

stream() Creates a new sequential or parallel Stream from a Spliterator. The spliterator is only traversed, split, or queried for estimated size after the terminal operation of the stream pipeline commences.

private Stream<VariantContext> stream(boolean parallel) {
    return StreamSupport.stream(new VariantContextSpliterator(this.vcfFile), parallel);

We count the number of variants in dbSNP. We print the duration for stream(), parallelStream() and a standard iterator.

final File vcFile =new File(args[0]);
StreameableVcfFile test= new StreameableVcfFile(vcFile);
long start1 = System.currentTimeMillis();
long end1 = System.currentTimeMillis();
System.out.println(" parallelstream: " + ((end1 - start1) / 1000));

long start2 = System.currentTimeMillis();
long end2 = System.currentTimeMillis();
System.out.println("stream : " + ((end2 - start2) / 1000));

long start3 = System.currentTimeMillis();
CloseableIterator<VariantContext>  r= new VCFFileReader(vcFile).iterator();
int n=0;
while(r.hasNext()) { r.next(); ++n;}
long end3 = System.currentTimeMillis();
System.out.println("iter : " + ((end3 - start3) / 1000));


count: 61045456 snps
parallelstream: 80 seconds

count: 61045456 snps
stream : 365 seconds

count: 61045456 snps
iter : 355 seconds

That's it,


Source code

24 February 2016

Registering a tool in the @ELIXIREurope regisry using XML, XSLT, JSON and curl. My notebook.

The Elixir Registry / pmid:26538599 "A portal to bioinformatics resources world-wide. With community support, the registry can become a standard for dissemination of information about bioinformatics resources: we welcome everyone to join us in this common endeavour. The registry is freely available at https://bio.tools."
In this post, I will describe how I've used the bio.tools API to register some tools from jvarkit.

Authenticate with your credentials

using curl, the 'bio.tools' service returns a authentication token.

$ curl -s \
 -H "Content-type: application/json" \
 -X POST \
 -d '{"username":"my-login@univ-nantes.fr","password":"password1234"}' \
 https://bio.tools/api/auth/login |\
 python -m json.tool
    "token": "74dedea023dbad8ecda49ac57bb1074acd794f"

Creating a JSON describing the tool.

The tool I'm goind to use is VCFhead. A very simple tool printing the first variants of a VCF file. In jvarkit I don't write the code parsing the arguments, everything is described using a XML file that is going to be processed with a XSTL stylesheet to generate an abstract java code handling the options, etc....

xsltproc command2java VcfHead.xml > AbstractVcfHead.java

For VcfHead the XML descriptor is available here: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/misc/VcfHead.xml.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<app xmlns="http://github.com/lindenb/jvarkit/" xmlns:h="http://www.w3.org/1999/xhtml" app="VcfHead" package="com.github.lindenb.jvarkit.tools.misc" >
  <description>Print the first variants of a VCF.</description>
  <input type="vcf"/>
  <output type="vcf"/>
    <option name="count" type="int" opt="n" longopt="count" min-inclusive="0" default="10">
      <description>number of variants to be printed</description>

Using a first XSLT stylesheet https://github.com/lindenb/jvarkit/blob/master/src/main/resources/xsl/jsonxelixir.xsl, 'VcfHead.xml' is firstly converted to the 'infamous' JSONx (JSON+XML) format .
xsltproc jsonxelixir VcfHead.xml > VcfHead.jsonx
The JSONx file:
<?xml version="1.0"?>
<jsonx:object xmlns:jsonx="http://www.ibm.com/xmlns/prod/2009/jsonx" xmlns:c="http://github.com/lindenb/jvarkit/" xmlns="http://www.w3.org/1999/xhtml" xmlns:x="http://www.ibm.com/xmlns/prod/2009/jsonx">
  <jsonx:string name="accessibility">Public</jsonx:string>
  <jsonx:string name="affiliation">univ-nantes.fr</jsonx:string>
  <jsonx:string name="cost">Free</jsonx:string>
  <jsonx:array name="platform">
  <jsonx:string name="version">1.0</jsonx:string>
  <jsonx:string name="homepage">https://github.com/lindenb/jvarkit/wiki/VcfHead</jsonx:string>
  <jsonx:array name="function">
      <jsonx:array name="input">
          <jsonx:object name="dataType">
            <jsonx:string name="term">File name</jsonx:string>
            <jsonx:string name="uri">http://edamontology.org/data_1050</jsonx:string>
          <jsonx:array name="dataFormat">
              <jsonx:string name="term">VCF</jsonx:string>
              <jsonx:string name="uri">http://edamontology.org/format_3016</jsonx:string>
      <jsonx:array name="output">
          <jsonx:object name="dataType">
            <jsonx:string name="term">File name</jsonx:string>
            <jsonx:string name="uri">http://edamontology.org/data_1050</jsonx:string>
          <jsonx:string name="dataDescription">any format</jsonx:string>
          <jsonx:array name="dataFormat">
              <jsonx:string name="term">VCF</jsonx:string>
              <jsonx:string name="uri">http://edamontology.org/format_3016</jsonx:string>
      <jsonx:array name="functionName">
          <jsonx:string name="term">Formatting</jsonx:string>
          <jsonx:string name="uri">http://edamontology.org/operation_0335</jsonx:string>
      <jsonx:string name="functionDescription">Print the first variants of a VCF.</jsonx:string>
  <jsonx:string name="description">Print the first variants of a VCF.</jsonx:string>
  <jsonx:object name="docs">
    <jsonx:string name="docsTermsOfUse">https://opensource.org/licenses/MIT</jsonx:string>
    <jsonx:string name="docsGithub">https://github.com/lindenb/jvarkit/wiki/VcfHead</jsonx:string>
    <jsonx:string name="docsHome">https://github.com/lindenb/jvarkit/wiki/VcfHead</jsonx:string>
    <jsonx:string name="docsCitationInstructions">https://github.com/lindenb/jvarkit/wiki/Citing</jsonx:string>
    <jsonx:string name="docsDownloadSource">https://github.com/lindenb/jvarkit/archive/master.zip</jsonx:string>
    <jsonx:string name="docsDownload">https://github.com/lindenb/jvarkit/archive/master.zip</jsonx:string>
  <jsonx:array name="collection">
  <jsonx:object name="credits">
    <jsonx:array name="creditsInstitution">
      <jsonx:string>Institut du Thorax, Nantes, France</jsonx:string>
    <jsonx:array name="creditsDeveloper">
      <jsonx:string>Pierre Lindenbaum</jsonx:string>
  <jsonx:array name="interface">
      <jsonx:string name="interfaceType">Command line</jsonx:string>
  <jsonx:string name="name">VcfHead</jsonx:string>
  <jsonx:array name="topic">
      <jsonx:string name="term">Omics</jsonx:string>
      <jsonx:string name="uri">http://edamontology.org/topic_3391</jsonx:string>
  <jsonx:string name="license">MIT License</jsonx:string>
  <jsonx:array name="language">
  <jsonx:array name="resourceType">
  <jsonx:string name="maturity">Stable</jsonx:string>
  <jsonx:array name="contact">
      <jsonx:string name="contactURL">https://github.com/lindenb</jsonx:string>
      <jsonx:string name="contactName">Pierre Lindenbaum</jsonx:string>
      <jsonx:array name="contactRole">
  <jsonx:object name="publications">
    <jsonx:string name="publicationsPrimaryID">doi:10.6084/m9.figshare.1425030.v1</jsonx:string>

Using another XSLT stylesheet jsonx2json.xsl, the JSONx is converted to a JSON file.
xsltproc jsonx2json.xsl VcfHead.jsonx > VcfHead.json
the JSON file:
    "accessibility": "Public", 
    "affiliation": "univ-nantes.fr", 
    "collection": [
    "contact": [
            "contactName": "Pierre Lindenbaum", 
            "contactRole": [
            "contactURL": "https://github.com/lindenb"
    "cost": "Free", 
    "credits": {
        "creditsDeveloper": [
            "Pierre Lindenbaum"
        "creditsInstitution": [
            "Institut du Thorax, Nantes, France"
    "description": "Print the first variants of a VCF.", 
    "docs": {
        "docsCitationInstructions": "https://github.com/lindenb/jvarkit/wiki/Citing", 
        "docsDownload": "https://github.com/lindenb/jvarkit/archive/master.zip", 
        "docsDownloadSource": "https://github.com/lindenb/jvarkit/archive/master.zip", 
        "docsGithub": "https://github.com/lindenb/jvarkit/wiki/VcfHead", 
        "docsHome": "https://github.com/lindenb/jvarkit/wiki/VcfHead", 
        "docsTermsOfUse": "https://opensource.org/licenses/MIT"
    "function": [
            "functionDescription": "Print the first variants of a VCF.", 
            "functionName": [
                    "term": "Formatting", 
                    "uri": "http://edamontology.org/operation_0335"
            "input": [
                    "dataFormat": [
                            "term": "VCF", 
                            "uri": "http://edamontology.org/format_3016"
                    "dataType": {
                        "term": "File name", 
                        "uri": "http://edamontology.org/data_1050"
            "output": [
                    "dataDescription": "any format", 
                    "dataFormat": [
                            "term": "VCF", 
                            "uri": "http://edamontology.org/format_3016"
                    "dataType": {
                        "term": "File name", 
                        "uri": "http://edamontology.org/data_1050"
    "homepage": "https://github.com/lindenb/jvarkit/wiki/VcfHead", 
    "interface": [
            "interfaceType": "Command line"
    "language": [
    "license": "MIT License", 
    "maturity": "Stable", 
    "name": "VcfHead", 
    "platform": [
    "publications": {
        "publicationsPrimaryID": "doi:10.6084/m9.figshare.1425030.v1"
    "resourceType": [
    "topic": [
            "term": "Omics", 
            "uri": "http://edamontology.org/topic_3391"
    "version": "1.0"

Registering the tool

Now we have the Token and the json descriptor we can add VcfHead to Elixir using curl:
curl  -H "Content-type: application/json" \
 -H "Authorization: Token 74dedea023dbad8ecda49ac57bb1074acd794f" \
 -X POST \
 -d  @path/to/VcfHead.json \
 "https://bio.tools/api/tool" |\
 python -m json.tool
    "accessibility": "Public", 
    "additionDate": "2016-02-24T11:37:17.458Z", 
    "affiliation": "univ-nantes.fr", 
    "collection": [
    "contact": [
            "contactName": "Pierre Lindenbaum", 
            "contactRole": [
            "contactURL": "https://github.com/lindenb"

VCfhead is now visible from the Elixir Registry at https://bio.tools/tool/univ-nantes.fr/VcfHead/1.0.

That's it,

05 December 2015

Happy birthday my blog. You are now ten-year-old.

Happy birthday my blog. You are now 10-year-old.

03 December 2015

GATK-UI : a java-swing interface for the Genome Analysis Toolkit.

I've just pushed GATK-UI, a java swing interface for the Genome Analysis Toolkit GATK at https://github.com/lindenb/gatk-ui. This tool is also available as a WebStart/JNLP application.


Why did you create this tool ?

Some non-bioinformatician collaborators often want some coverage data for a defined set of BAM, for a specific region...

Did you test every tool ?


How did you create an interface for each GATK tool ?

Each tool in the GATK is documented in a web page: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php and
each web page is associated to a structured JSON page https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php.json

  "summary": "Select a subset of variants from a larger callset",
  "parallel": [
      "arg": "-nt",
      "link": "http://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#-nt",
      "name": "TreeReducible"
  "activeregion": {},

This json is transformed to XML in order to process it with XSLT . A XSLT stylesheet generates some java code

That's it,

13 July 2015

Playing with #Docker , my notebook

This post is my notebook about docker after we had a very nice introduction about docker by François Moreews (INRIA/IRISA, Rennes). I've used docker today for the first time, my aim was just to create an image containing https://github.com/lindenb/verticalize, a small tool I wrote to verticalize text files.

Install docker

you hate running this kind of command-lines, aren't you ?
$ wget -qO- https://get.docker.com/ | sh
sudo password for tatayoyo:
apparmor is enabled in the kernel and apparmor utils were already installed
+ [ https://get.docker.com/ = https://get.docker.com/ ]
+ sudo -E sh -c apt-key adv --keyserver

add my linux account to the "docker" group

sudo usermod -aG docker tatayoyo
logout and relog...

I'm working behind a $@!# proxy: edit /etc/default/docker to set the proxy-things

$ cat /etc/default/docker
# If you need Docker to use an HTTP proxy, it can also be specified
export http_proxy="http://(proxy-port):(proxy-host)/"
export https_proxy="http://(proxy-port):(proxy-host)/"
export ftp_proxy="http://(proxy-port):(proxy-host)/"
export HTTP_PROXY="http://(proxy-port):(proxy-host)/"
export FTP_PROXY="http://(proxy-port):(proxy-host)/"
export HTTPS_PROXY="http://(proxy-port):(proxy-host)/"

start the docker service

$ sudo start docker
[sudo] password for tatayoyo:
docker start/running, process 5023

create the Dockerfile

Create a new directory, in this directory we create a file named "Dockerfile". It contains
  • the name of the base-image we're using (here the latest ubuntu)
  • the $@!# proxy settings (again ???!!!!)
  • some calls to `apt` to download git, gcc, make ...
  • some statements to clone https://github.com/lindenb/verticalize, compile and install my tool into /usr/local/bin
FROM ubuntu:latest
ENV http_proxy "http://(proxy-port):(proxy-host)/"
ENV https_proxy "http://(proxy-port):(proxy-host)/"
ENV ftp_proxy "http://(proxy-port):(proxy-host)/"
ENV HTTP_PROXY "http://(proxy-port):(proxy-host)/"
ENV HTTPS_PROXY "http://(proxy-port):(proxy-host)/"
ENV FTP_PROXY "http://(proxy-port):(proxy-host)/"
RUN apt-get update
RUN apt-get install -y wget gcc make git
RUN git clone "https://github.com/lindenb/verticalize.git" /tmp/verticalize && make -C /tmp/verticalize && cp /tmp/verticalize/verticalize /usr/local/bin && rm -rf /tmp/verticalize

create the image 'verticalize' from the Dockerfile

sudo docker build -t verticalize .

List the images

$ docker images
REPOSITORY  TAG                 IMAGE ID         CREATED        VIRTUAL SIZE
verticalize latest              5f7159b4921a     12 seconds ago 317 MB

Tag the 'verticalize' image as hosted on my dockerhub repo https://registry.hub.docker.com/u/lindenb

$ docker tag 5f7159b4921a lindenb/verticalize:latest

$ docker images
REPOSITORY            TAG                 IMAGE ID            CREATED              VIRTUAL SIZE
verticalize           latest              5f7159b4921a        About a minute ago   317 MB
lindenb/verticalize   latest              5f7159b4921a        About a minute ago   317 MB

Push the image to dockerhub

$ docker push lindenb/verticalize

The push refers to a repository [lindenb/verticalize] (len: 1)
5f7159b4921a: Image push failed

Please login prior to push:
Username: lindenb
Email: xxxxxxx@yahoo.fr
WARNING: login credentials saved in /home/tatyoyo/.docker/config.json
Login Succeeded
The push refers to a repository [lindenb/verticalize] (len: 1)
5f7159b4921a: Image already exists
68f6ddc7de15: Buffering to Disk

We can now remove the local image ...

$ docker rmi -f  5f7159b4921a

.. and pull the image from dockerhub

$ docker pull lindenb/verticalize
latest: Pulling from lindenb/verticalize
83e4dde6b9cf: Downloading [==================> ] 24.82 MB/65.79 MB
b670fb0c7ecd: Download complete
29460ac93442: Download complete
d2a0ecffe6fa: Download complete
48e98a1c03ae: Download complete
94ac1beb0514: Download complete
e12eda8693a9: Download complete
5eb1952afbb7: Download complete
fb4ac6e6a264: Download complete
0f8372bacf03: Download complete
789c4f122778: Downloading [=================>  ] 7.511 MB/20.92 MB
68f6ddc7de15: Downloading [=====>              ]  4.99 MB/44.61 MB
5f7159b4921a: Download complete

At the end, run a command inside the docker container

My tool verticalize is installed in the image 'lindenb/verticalize:latest' :
$ cat << EOF |  docker run -i lindenb/verticalize:latest
> echo -e "#X\tY\n1\t2\n3\t4" | verticalize

>>> 2
$1 #X 1
$2 Y 2
<<< 2

>>> 3
$1 #X 3
$2 Y 4
<<< 3

That's it,

29 June 2015

A BLAST to SAM converter.

Some times ago, I've received a set of Ion-Torrent /mate-reads with a poor quality. I wasn't able to align much things using bwa. I've always wondered if I could get better alignments using NCBI-BLASTN (short answer: no) . That's why I asked guyduche, my intership student to write a C program to convert the output of blastn to SAM. His code is available on github at :

The input for blast2sam is
  • the XML output of NCBI blastn (or stdin)
  • The single or pair of fastq file(s)
  • The reference sequence indexed with picard


fastq2fasta in.R1.fq.gz in.R2.fq.gz |\
blastn -db REFERENCE   -outfmt 5 | \
blast2bam -o result.bam -W 40 -R '@RG   ID:foo  SM:sample' - REFERENCE.dict  in.R1.fq.gz in.R2.fq.gz

@SQ SN:gi|9629357|ref|NC_001802.1|  LN:9181 
@RG ID:foo  SM:sample
@PG ID:Blast2Bam    PN:Blast2Bam    VN:0.1  CL:../../bin/blast2bam -o results.sam -W 40 -R @RG  ID:foo  SM:sample - db.dict test_1.fastq.gz test_2.fastq.gz
Now, I would be interested in finding another dataset where this tool could be successfully used.
That's it,

18 June 2015

Playing with the #GA4GH schemas and #Avro : my notebook

After watching David Haussler's talk "Beacon Project and Data Sharing ApIs", I wanted to play with Avro and the models and APIs defined by the Global Alliance for Genomics and Health (ga4gh) coalition Here is my notebook.
(Wikipedia) Avro: "Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services."
First, we download the java tools and libraries for apache Avro
curl -L -o avro-tools-1.7.7.jar "http://www.eng.lsu.edu/mirrors/apache/avro/avro-1.7.7/java/avro-tools-1.7.7.jar"
Next, we download the schemas defined by the ga4gh from github
curl -L -o schema.zip "https://github.com/ga4gh/schemas/archive/v0.5.1.zip"
unzip schema.zip
rm schema.zip

$ find -name "*.avdl"
Those schema can be compiled to java using the avro-tools
$ java -jar avro-tools-1.7.7.jar compile protocol schemas-0.5.1/src/main/resources/avro/ ./generated
Input files to compile:
$ find generated/org/ -name "*.java"
As a test, the following java source uses the classes generated by avro to create nine variants and serialize them to Avro

Compile, archive and execute:
#compile classes
javac -d generated -cp avro-tools-1.7.7.jar -sourcepath generated:src generated/org/ga4gh/*.java src/test/TestAvro.java
# archive
jar cvf generated/ga4gh.jar -C generated org -C generated test
# run
java -cp avro-tools-1.7.7.jar:generated/ga4gh.jar test.TestAvro > variant.avro
We use the avro-tools to convert the generated file variant.avro to json
java -jar avro-tools-1.7.7.jar tojson variant.avro


The complete Makefile

That's it,