29 April 2009

CouchDB for Bioinformatics: Storing SNPs - My Notebook


In a previous post, I played with Apache Hadoop. I ran into some technical difficulties with it (starting with the simple question: "How can I read my data from a file stored on the Hadoop Distributed File System (HDFS)?"), so I'm now having a look at Apache CouchDB.

(Via http://couchdb.apache.org/:) Apache CouchDB is:
  • A document database server, accessible via a RESTful JSON API
  • Ad-hoc and schema-free with a flat address space
  • Distributed, featuring robust, incremental replication with bi-directional conflict detection and management
  • Query-able and index-able, featuring a table oriented reporting engine that uses Javascript as a query language


Installation


Installing CouchDB on my computer was as easy as described on the wiki, and after starting the default CouchDB server (host: localhost, port: 5984) I got a {"couchdb":"Welcome","version":"0.10.0a769334"} when I opened my browser on http://localhost:5984/, as well as an interactive console at http://localhost:5984/_utils/.
(image: the CouchDB console)

Sending Requests to CouchDB


Documents and databases are created, deleted, found, etc. using a REST protocol based on HTTP and its methods (GET/POST/PUT/DELETE/...). In order to play with CouchDB, I wrote a small Java program using the Apache HTTP Client library (see the source at the end of this post). I know there is already a Java API for CouchDB, but I wanted to be sure that I understood the concepts.

Creating a Database


First I want to create a database of Single Nucleotide Polymorphisms (SNPs): a PUT request is sent with the name of the database http://localhost:5984/position2snp
public void createDatabase(String database) throws IOException
{
PutMethod method= new PutMethod("http://localhost:5984/"+database);
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();
}

And couchdb returns the following JSON object.
{"ok":true}


Listing the databases


A special identifier (_all_dbs) in the request path and a GET method are used to get all the databases:
public void getAllDatabases() throws IOException
{
GetMethod method= new GetMethod("http://localhost:5984/_all_dbs");
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();
}

And couchdb returns the following JSON array containing the one and only database:
["position2snp"]


Adding some SNPs in the Database


CouchDB is a key/value store of JSON documents. Here the KEY will be a fixed-size concatenation of the chromosome and the genomic position of each SNP, padded with '0', e.g. "chr01_00000006".
static String formatPosition(int chromosome,int position)
{
StringWriter w= new StringWriter();
PrintWriter out= new PrintWriter(w);
out.printf("chr%02d_%08d", chromosome,position);
return w.toString();
}

The VALUE will be a structured description of the SNP. Here I've simulated a set of random SNPs.
Random rand= new Random();
for(int i=0;i< 50;++i)
{
int chromosome= 1+rand.nextInt(4);
int position = 1+rand.nextInt(100);
putDocument(POS2SNP,formatPosition(chromosome, position),
"{\"rs\":\"rs"+(i+1)+"\"," +
"\"avHet\":"+(rand.nextFloat()*0.5f)+"," +
"\"snpClass\":\""+(i%2==0?"mutation":"silent")+"\"," +
"\"mapping\":{" +
"\"chromosome\":\"chr"+chromosome+"\"," +
"\"position\":"+position+"" +
"}}");
}

The new document is loaded by sending a PUT request to "http://localhost:5984/position2snp/THE_KEY", with the structured description of the SNP in the body of the request.
public void putDocument(String id,String json) throws IOException
{
PutMethod method= new PutMethod("http://localhost:5984/position2snp/"+id);
method.setRequestEntity(new StringRequestEntity(json,"application/json", "UTF-8"));
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();
}

(Note: the POST method can be used instead of PUT to generate a unique key for a document)
For example, when I post a new SNP, couchdb returns the following JSON object containing the status ("ok"), the id/key of the new document/snp and its revision-id:
{"ok":true,"id":"chr02_00000081","rev":"1-948851818"}

Retrieving only one row in the Database


A GET request is sent with the special keyword _all_docs and the parameter limit, e.g. "http://localhost:5984/position2snp/_all_docs?limit=1".
Couchdb returns the following JSON object:
{"total_rows":49,"offset":0,"rows":[
{"id":"chr01_00000006","key":"chr01_00000006","value":{"rev":"1-1656436337"}}
]}
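The same request can be built with HttpClient's query-string support (a minimal sketch; the full source below does this in its getDocuments method):
GetMethod method= new GetMethod("http://localhost:5984/position2snp/_all_docs");
// limit=1 : return only the first row
method.setQueryString(new NameValuePair[]{new NameValuePair("limit","1")});
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();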

Retrieving the document from its KEY/ID


A GET request is sent to http://localhost:5984/position2snp/THE_KEY.
For example sending http://localhost:5984/position2snp/chr03_00000046 returns:
{"_id":"chr03_00000046","_rev":"1-3263055450","rs":"rs9","avHet":0.09515628,"snpClass":"mutation","mapping":{"chromosome":"chr3","position":46}}

If the KEY is not found, e.g. when sending http://localhost:5984/position2snp/chr03_00000000, Couchdb returns:
{"error":"not_found","reason":"missing"}


Finding all the SNPs in a defined Genomic Segment


As the KEYs are sorted, have the same length, and encode the position of each SNP on the genome, we can ask CouchDB for all the SNPs in a defined region of the genome by sending a GET HTTP request with the parameters startkey and endkey. For example, if we want all the SNPs on chromosome 2 between bases 30 and 60, we send http://localhost:5984/position2snp/_all_docs?startkey=%22chr02_00000030%22&endkey=%22chr02_00000060%22 and the result is:
{"total_rows":47,"offset":18,"rows":[
{"id":"chr02_00000032","key":"chr02_00000032","value":{"rev":"1-4178679245"}},
{"id":"chr02_00000040","key":"chr02_00000040","value":{"rev":"1-392133644"}},
{"id":"chr02_00000056","key":"chr02_00000056","value":{"rev":"1-3847661844"}}
]}
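Note that startkey and endkey are JSON strings, so the double quotes are part of the key (URL-encoded as %22). In the test program below, this request is issued through the getDocuments and quote helpers:
getDocuments(POS2SNP, quote(formatPosition(2, 30)), quote(formatPosition(2, 60)), null, null);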

Views


Views are the primary tool used for querying and reporting on CouchDB documents. Views are stored inside special documents called design documents, and can be accessed via an HTTP GET request to the URI /{dbname}/_design/{designdocid}/_view/{viewname}. A view is defined by a JavaScript function that MAPs keys to values. If a view has a REDUCE function, it is used to produce aggregate results for that view: a reduce function is passed a set of intermediate values and combines them into a single value. Reduce functions must accept, as input, results emitted by their corresponding MAP function.

We upload our design document, called _design/genotypage:
putDocument("position2snp","_design/genotypage",
"{\"views\":{" +
"\"snpMutation\":{ \"map\":\"function(doc) {if(doc.snpClass='mutation') emit(null,doc); }\"},"+
"\"snpByClass\":{ \"map\":\"function(doc) { emit(doc.snpClass,doc.avHet); }\"},"+
"\"snpByName\":{ \"map\":\"function(doc) { emit(doc.rs,doc); }\"},"+
"\"snpByClassMaxHet\":{" +
"\"map\":\"function(doc) { emit(doc.snpClass,doc.avHet); }\"," +
"\"reduce\":\"function(keys, values) { var mean=0.0; for ( var i = 0; i < values.length; ++i ) { mean+=values[i];} return mean/(values.length);}\"" +
"}"+
"}}" +
"");

This design document contains four views: snpMutation, snpByClass, snpByName and snpByClassMaxHet. (Despite its name, the reduce function of snpByClassMaxHet computes the mean avHet; it also ignores CouchDB's rereduce phase, which is acceptable for this small data set.)

Find all SNPs having a class='mutation'


A GET request is sent to "http://localhost:5984/position2snp/_design/genotypage/_view/snpMutation" and couchdb returns
{"total_rows":47,"offset":0,"rows":[
{"id":"chr01_00000001","key":null,"value":{"_id":"chr01_00000001","_rev":"1-3508311375","rs":"rs12","avHet":0.30527565,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":1}}},
{"id":"chr01_00000006","key":null,"value":{"_id":"chr01_00000006","_rev":"1-3309871741","rs":"rs31","avHet":0.30192727,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":6}}},
{"id":"chr01_00000009","key":null,"value":{"_id":"chr01_00000009","_rev":"1-4077528375","rs":"rs3","avHet":0.44473252,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":9}}},
{"id":"chr01_00000015","key":null,"value":{"_id":"chr01_00000015","_rev":"1-247108112","rs":"rs17","avHet":0.42058986,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":15}}},
{"id":"chr01_00000016","key":null,"value":{"_id":"chr01_00000016","_rev":"1-3568315779","rs":"rs43","avHet":0.4113328,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":16}}},
(...)
{"id":"chr04_00000033","key":null,"value":{"_id":"chr04_00000033","_rev":"1-3823284043","rs":"rs20","avHet":0.17454243,"snpClass":"mutation","mapping":{"chromosome":"chr4","position":33}}},
{"id":"chr04_00000035","key":null,"value":{"_id":"chr04_00000035","_rev":"1-1400920328","rs":"rs10","avHet":0.33354515,"snpClass":"mutation","mapping":{"chromosome":"chr4","position":35}}},
{"id":"chr04_00000045","key":null,"value":{"_id":"chr04_00000045","_rev":"1-3632023176","rs":"rs49","avHet":0.44040334,"snpClass":"mutation","mapping":{"chromosome":"chr4","position":45}}},
{"id":"chr04_00000058","key":null,"value":{"_id":"chr04_00000058","_rev":"1-1711768614","rs":"rs14","avHet":0.3784455,"snpClass":"mutation","mapping":{"chromosome":"chr4","position":58}}}
]}


Create a table containing the snpClass and the avHet for each SNP


A GET request is sent to "method http://localhost:5984/position2snp/_design/genotypage/_view/snpByClass" and couchdb returns:
{"total_rows":47,"offset":0,"rows":[
{"id":"chr01_00000006","key":"mutation","value":0.30192727},
{"id":"chr01_00000009","key":"mutation","value":0.44473252},
{"id":"chr01_00000015","key":"mutation","value":0.42058986},
{"id":"chr01_00000016","key":"mutation","value":0.4113328},
(...)
{"id":"chr04_00000019","key":"silent","value":0.069098294},
{"id":"chr04_00000033","key":"silent","value":0.17454243},
{"id":"chr04_00000035","key":"silent","value":0.33354515},
{"id":"chr04_00000058","key":"silent","value":0.3784455}
]}


Find the snp named "rs1"


A GET request is sent to "http://localhost:5984/position2snp/_design/genotypage/_view/snpByName?key=%22rs1%22" , and using the request parameter key="rs1". Couchdb returns:
{"total_rows":47,"offset":0,"rows":[
{"id":"chr01_00000023","key":"rs1","value":{"_id":"chr01_00000023","_rev":"1-49396577","rs":"rs1","avHet":0.035366535,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":23}}}
]}


Map/Reduce: map all the SNPs by their snpClass and avHet, then reduce this map to get the mean avHet for the two snpClass values


A GET request is sent to "http://localhost:5984/position2snp/_design/genotypage/_view/snpByClassMaxHet?group=true" using the request parameter group=true to group by snpClass. Couchdb returns:
{"rows":[
{"key":"mutation","value":0.29830500208},
{"key":"silent","value":0.2255268866772728}
]}
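In the test program, this map/reduce view is queried by adding group=true to the query string:
GetMethod method= new GetMethod("http://localhost:5984/position2snp/_design/genotypage/_view/snpByClassMaxHet");
method.setQueryString("group=true");
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();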


Dropping the Database


A DELETE request with the name of the database is sent to "http://localhost:5984/position2snp".
public void deleteDatabase(String database) throws IOException
{
DeleteMethod method= new DeleteMethod("http://localhost:5984/"+database);
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();
}

And couchdb returns the following JSON object:
{"ok":true}

Source Code


package test;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.net.URLEncoder;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.DeleteMethod;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.methods.PutMethod;
import org.apache.commons.httpclient.methods.StringRequestEntity;
import org.lindenb.io.TInputStream;
import org.lindenb.json.Parser;
import org.lindenb.util.C;

public class CouchDBTest01
{
private static final int DEFAULT_PORT=5984;
private static final String DEFAULT_HOST="localhost";
private String host= DEFAULT_HOST;
private int port= DEFAULT_PORT;
private HttpClient client;

CouchDBTest01()
{
this.client = new HttpClient();
}

public String getHost()
{
return host;
}
public int getPort() {
return port;
}

private String getPath()
{
return "http://"+getHost()+":"+getPort();
}

private Object parseResult(String message,HttpMethod method)throws IOException
{
System.err.println("\n#################" + message+"#####################");

System.err.println("Sending :"+ method.getName()+" method "+getPath()+method.getPath()+(method.getQueryString()==null?"":"?"+method.getQueryString()));


this.client.executeMethod(method);
InputStream in=method.getResponseBodyAsStream();
in= new TInputStream(in,System.out);
Object o= new Parser().parse(in);
in.close();
method.releaseConnection();
return o;
}


public void createDatabase(String database) throws IOException
{
PutMethod method= new PutMethod(getPath()+"/"+database);
parseResult("Create database "+database,method);
}


public void deleteDatabase(String database) throws IOException
{
DeleteMethod method= new DeleteMethod(getPath()+"/"+database);
parseResult("Drop database "+database,method);
}

public void getAllDatabases() throws IOException
{
GetMethod method= new GetMethod(getPath()+"/"+"_all_dbs");
parseResult("Get All Databases",method);

}

public void getDocument(String database,String docid) throws IOException
{
GetMethod method= new GetMethod(getPath()+"/"+database+"/"+docid);
parseResult("Get document "+docid,method);
}

public void putDocument(String database,String docid,String json) throws IOException
{
PutMethod method= new PutMethod(getPath()+"/"+database+"/"+docid);
method.setRequestEntity(new StringRequestEntity(json,"application/json", "UTF-8"));
parseResult("Create document "+docid,method);
}

public void putDocument(String database,String json) throws IOException
{
PostMethod method= new PostMethod(getPath()+"/"+database+"/");
method.setRequestEntity(new StringRequestEntity(json,"application/json", "UTF-8"));
parseResult("Create document",method);
}

public void getDocuments(String database,
String startkey,
String endkey,
Integer limit,
Boolean descending
) throws IOException
{
List<NameValuePair> params= new ArrayList<NameValuePair>();
if(startkey!=null) params.add(new NameValuePair("startkey",startkey));
if(endkey!=null) params.add(new NameValuePair("endkey",endkey));
if(limit!=null) params.add(new NameValuePair("limit",limit.toString()));
if(descending!=null) params.add(new NameValuePair("descending",descending.toString()));

GetMethod method= new GetMethod(getPath()+"/"+database+"/"+"_all_docs");
method.setQueryString(params.toArray(new NameValuePair[params.size()]));
parseResult("Get Documents",method);
}

static String formatPosition(int chrom,int position)
{
StringWriter w= new StringWriter();
PrintWriter out= new PrintWriter(w);
out.printf("chr%02d_%08d", chrom,position);
return w.toString();
}

static String quote(String s)
{
return "\""+C.escape(s)+"\"";
}


void makeTest() throws IOException
{
final String POS2SNP="position2snp";
createDatabase(POS2SNP);
getAllDatabases();


Random rand= new Random();
for(int i=0;i< 50;++i)
{
int chromosome= 1+rand.nextInt(4);
int position = 1+rand.nextInt(100);
putDocument(POS2SNP,formatPosition(chromosome, position),
"{\"rs\":\"rs"+(i+1)+"\"," +
"\"avHet\":"+(rand.nextFloat()*0.5f)+"," +
"\"snpClass\":\""+(i%2==0?"mutation":"silent")+"\"," +
"\"mapping\":{" +
"\"chromosome\":\"chr"+chromosome+"\"," +
"\"position\":"+position+"" +
"}}");
getDocument(POS2SNP,formatPosition(chromosome, 0));
}



getDocuments(POS2SNP,null,null,1,null);

getDocuments(POS2SNP,quote(formatPosition(2, 30)),quote(formatPosition(2, 60)),null,null);

putDocument(POS2SNP,"_design/genotypage",
"{\"views\":{" +
"\"snpMutation\":{ \"map\":\"function(doc) {if(doc.snpClass='mutation') emit(null,doc); }\"},"+
"\"snpByClass\":{ \"map\":\"function(doc) { emit(doc.snpClass,doc.avHet); }\"},"+
"\"snpByName\":{ \"map\":\"function(doc) { emit(doc.rs,doc); }\"},"+
"\"snpByClassMaxHet\":{" +
"\"map\":\"function(doc) { emit(doc.snpClass,doc.avHet); }\"," +
"\"reduce\":\"function(keys, values) { var mean=0.0; for ( var i = 0; i < values.length; ++i ) { mean+=values[i];} return mean/(values.length);}\"" +
"}"+
"}}" +
"");

GetMethod method= new GetMethod(getPath()+"/"+POS2SNP+"/_design/genotypage/_view/snpMutation");
parseResult("Map1",method);
method= new GetMethod(getPath()+"/"+POS2SNP+"/_design/genotypage/_view/snpByClass");
parseResult("Map2",method);
method= new GetMethod(getPath()+"/"+POS2SNP+"/_design/genotypage/_view/snpByClassMaxHet");
method.setQueryString("group=true");
parseResult("Map3",method);
method= new GetMethod(getPath()+"/"+POS2SNP+"/_design/genotypage/_view/snpByName");
method.setQueryString("key="+URLEncoder.encode("\"rs1\"","UTF8"));
parseResult("Map4",method);


deleteDatabase(POS2SNP) ;
}

public static void main(String[] args) {
try {
int optind=0;
CouchDBTest01 app=new CouchDBTest01();
while(optind<args.length)
{
if(args[optind].equals("-h"))
{
System.err.println("Pierre Lindenbaum PhD.");
System.err.println("-h this screen");
System.err.println("-H (host)");
System.err.println("-p (port)");
System.err.println("-d debug");
return;
}
else if (args[optind].equals("-H"))
{
app.host=args[++optind];
}
else if (args[optind].equals("-p"))
{
app.port=Integer.parseInt(args[++optind]);
}
else if (args[optind].equals("--"))
{
++optind;
break;
}
else if (args[optind].startsWith("-"))
{
System.err.println("bad argument " + args[optind]);
System.exit(-1);
}
else
{
break;
}
++optind;
}
app.makeTest();

} catch (Exception e) {
e.printStackTrace();
}
}
}

23 April 2009

A Tag Cloud for my Resume.

I'm revising my CV as I'll move to Nantes, and I wanted to create a Tag Cloud to illustrate my resume. Paul and Richard suggested using wordle to generate the cloud, but I wanted to generate it on the fly, for any language, whenever I want, etc...
So I stored my skills in an RDF file which looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE rdf:RDF [
<!ENTITY info "plum">
<!ENTITY bio "blue">
<!ENTITY other "lightgray">
<!ENTITY devtool "magenta">
<!ENTITY devlang "darkRed">
<!ENTITY os "purple">
<!ENTITY database "orange">
]>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://ontology.lindenb.org/tagcloud/">
<Tag rdf:about="https://javacc.dev.java.net/">
<weight>25</weight>
<label>Javacc</label>
<title xml:lang="en">JavaCC is a parser/scanner generator for java</title>
<title xml:lang="fr">Un generateur de parser pour java</title>
<color>magenta</color>
</Tag>
(...)
<Tag rdf:about="http://en.wikipedia.org/wiki/Awk">
<weight>25</weight>
<label>Awk</label>
<title>Awk</title>
<color>darkRed</color>
</Tag>

<Tag rdf:about="http://en.wikipedia.org/wiki/GNU_bison">
<weight>25</weight>
<label>Lex/Yacc</label>
<title>Lex/Yacc & Flex/Bison</title>
<color>magenta</color>
</Tag>
(...)
</rdf:RDF>

Advantages: I can store the labels for various languages, use xml entities like <!ENTITY devlang "darkRed"> to quickly change a color, etc...

This XML file is then transformed with the following XSLT stylesheet


And (tada!) here is the result:
(image: the generated tag cloud)
Waves... :-)

(And the icing on the cake: it is RDFa output.)
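The stylesheet itself is not reproduced here, but as a minimal sketch, the same transformation can be driven from Java with the built-in XSLT engine (the file names tagcloud.xsl and skills.rdf are assumptions):
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class MakeTagCloud
{
public static void main(String[] args) throws Exception
{
// compile the stylesheet, then transform the RDF skill file to stdout
Transformer t= TransformerFactory.newInstance().newTransformer(new StreamSource("tagcloud.xsl"));
t.transform(new StreamSource("skills.rdf"), new StreamResult(System.out));
}
}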

Note: Pawel Szczesny did a great job with his CV too.


Pierre

21 April 2009

Hadoop, my notebook: HDFS

This post is about Apache Hadoop, an open-source framework implementing the MapReduce algorithm. This first notebook focuses on HDFS, the Hadoop file system, and follows the great Yahoo! Hadoop Tutorial Home. Forget the clusters: I'm running this hadoop engine on my one and only laptop.

Downloading & Installing


~/tmp/HADOOP> wget "http://apache.multidist.com/hadoop/core/hadoop-0.19.1/hadoop-0.19.1.tar.gz"
Saving to: `hadoop-0.19.1.tar.gz'

100%[======================================>] 55,745,146 487K/s in 1m 53s

2009-04-21 20:52:04 (480 KB/s) - `hadoop-0.19.1.tar.gz' saved [55745146/55745146]
~/tmp/HADOOP> tar xfz hadoop-0.19.1.tar.gz
~/tmp/HADOOP> rm hadoop-0.19.1.tar.gz
~/tmp/HADOOP> mkdir -p hdfs/data
~/tmp/HADOOP> mkdir -p hdfs/name
#hum... this step was not clear as I'm not an ssh guru. I had to give my root password to make the server start
~/tmp/HADOOP> ssh-keygen -t rsa -P 'password' -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /home/pierre/.ssh/id_rsa.
Your public key has been saved in /home/pierre/.ssh/id_rsa.pub.
The key fingerprint is:
17:c0:29:b4:56:d1:d3:dd:ae:d5:ba:3e:5b:33:b0:99 pierre@linux-zfgk
~/tmp/HADOOP> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Editing the Cluster configuration


Edit the file hadoop-0.19.1/conf/hadoop-site.xml.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>This is the URI (protocol specifier, hostname, and port) that describes the NameNode (main Node) for the cluster.</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/pierre/tmp/HADOOP/hdfs/data</value>
<description>This is the path on the local file system in which the DataNode instance should store its data</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/pierre/tmp/HADOOP/hdfs/name</value>
<description>This is the path on the local file system of the NameNode instance where the NameNode metadata is stored.</description>
</property>
</configuration>

Formatting HDFS


HDFS is the Hadoop Distributed File System: "HDFS is a block-structured file system: individual files are broken into blocks of a fixed size. These blocks are stored across a cluster of one or more machines with data storage capacity. A file can be made of several blocks, and they are not necessarily stored on the same machine (...) If several machines must be involved in the serving of a file, then a file could be rendered unavailable by the loss of any one of those machines. HDFS combats this problem by replicating each block across a number of machines (3, by default)."
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop namenode -format
09/04/21 21:11:18 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = linux-zfgk.site/127.0.0.2
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.19.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 745977; compiled by 'ndaley' on Fri Feb 20 00:16:34 UTC 2009
************************************************************/
Re-format filesystem in /home/pierre/tmp/HADOOP/hdfs/name ? (Y or N) Y
09/04/21 21:11:29 INFO namenode.FSNamesystem: fsOwner=pierre,users,dialout,video
09/04/21 21:11:29 INFO namenode.FSNamesystem: supergroup=supergroup
09/04/21 21:11:29 INFO namenode.FSNamesystem: isPermissionEnabled=true
09/04/21 21:11:29 INFO common.Storage: Image file of size 96 saved in 0 seconds.
09/04/21 21:11:29 INFO common.Storage: Storage directory /home/pierre/tmp/HADOOP/hdfs/name has been successfully formatted.
09/04/21 21:11:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at linux-zfgk.site/127.0.0.2
************************************************************/

Starting HDFS


~/tmp/HADOOP> hadoop-0.19.1/bin/start-dfs.sh
starting namenode, logging to /home/pierre/tmp/HADOOP/hadoop-0.19.1/bin/../logs/hadoop-pierre-namenode-linux-zfgk.out
Password:
localhost: starting datanode, logging to /home/pierre/tmp/HADOOP/hadoop-0.19.1/bin/../logs/hadoop-pierre-datanode-linux-zfgk.out
Password:
localhost: starting secondarynamenode, logging to /home/pierre/tmp/HADOOP/hadoop-0.19.1/bin/../logs/hadoop-pierre-secondarynamenode-linux-zfgk.out

Playing with HDFS


First, download a few SNPs from UCSC/dbSNP into ~/local.xls.
~/tmp/HADOOP> mysql -N --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 -e 'select name,chrom,chromStart,avHet from snp129 where avHet!=0 and name like "rs12345%" ' > ~/local.xls

Creating directories
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -mkdir /user
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -mkdir /user/pierre

Copying a file "local.xls" from your local file system to HDFS
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -put ~/local.xls stored.xls

Recursive listing of HDFS
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -lsr /
drwxr-xr-x - pierre supergroup 0 2009-04-21 21:45 /user
drwxr-xr-x - pierre supergroup 0 2009-04-21 21:45 /user/pierre
-rw-r--r-- 3 pierre supergroup 308367 2009-04-21 21:45 /user/pierre/stored.xls

'cat' the first lines of the SNP file stored on HDFS:
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -cat /user/pierre/stored.xls | head
rs12345003 chr9 1765426 0.02375
rs12345004 chr9 2962430 0.055768
rs12345006 chr9 74304094 0.009615
rs12345007 chr9 73759324 0.112463
rs12345008 chr9 88421765 0.014184
rs12345013 chr9 78951530 0.104463
rs12345014 chr9 78542260 0.490608
rs12345015 chr9 10121973 0.201446
rs12345016 chr9 2698257 0.456279
rs12345027 chr9 8399632 0.04828
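As an aside, this is a chance to answer the question from the CouchDB post above ("How can I read my data from a file stored on HDFS?"). A minimal Java sketch using the org.apache.hadoop.fs API, assuming hadoop-site.xml is on the classpath so that fs.default.name points to hdfs://localhost:9000:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadHDFS
{
public static void main(String[] args) throws Exception
{
// picks up fs.default.name from hadoop-site.xml
Configuration conf= new Configuration();
FileSystem fs= FileSystem.get(conf);
// open the file we just copied into HDFS and print it line by line
FSDataInputStream in= fs.open(new Path("/user/pierre/stored.xls"));
BufferedReader r= new BufferedReader(new InputStreamReader(in));
String line;
while((line=r.readLine())!=null) System.out.println(line);
r.close();
fs.close();
}
}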

Removing a file. Note: "On startup, the NameNode enters a special state called Safemode." I could not delete a file before I used "dfsadmin -safemode leave".
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfsadmin -safemode leave
Safe mode is OFF
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -rm /user/pierre/stored.xls
Deleted hdfs://localhost:9000/user/pierre/stored.xls

Check that there is NO file named stored.xls in the local file system: HDFS stores its own blocks!
~/tmp/HADOOP> find hdfs/
hdfs/
hdfs/data
hdfs/data/detach
hdfs/data/in_use.lock
hdfs/data/tmp
hdfs/data/current
hdfs/data/current/blk_3340572659657793789
hdfs/data/current/dncp_block_verification.log.curr
hdfs/data/current/blk_3340572659657793789_1002.meta
hdfs/data/current/VERSION
hdfs/data/storage
hdfs/name
hdfs/name/in_use.lock
hdfs/name/current
hdfs/name/current/edits
hdfs/name/current/VERSION
hdfs/name/current/fsimage
hdfs/name/current/fstime
hdfs/name/image
hdfs/name/image/fsimage


Stopping HDFS


~/tmp/HADOOP> hadoop-0.19.1/bin/stop-dfs.sh
stopping namenode
Password:
localhost: stopping datanode
Password:
localhost: stopping secondarynamenode



Pierre

10 April 2009

Resolving LSID: my notebook

This post is about LSID (The Life Science Identifier) and was inspired by the recent activity of Roderic Page on Twitter and by Roderic's paper "LSID Tester, a tool for testing Life Science Identifier resolution services".

OK.
At the beginning, there is an LSID:

urn:lsid:ubio.org:namebank:11815

ubio.org is the authority. It is followed by a database (namebank) and an id.
We need to resolve this authority to find some metadata about this LSID object. On unix, we put _lsid._tcp before this authority and use the host command to ask the DNS for the LSID service record, with TCP as the network protocol (I'm not really sure what that really means, and I guess this can be a problem for other bioinformaticians too).
%host -t srv _lsid._tcp.ubio.org
_lsid._tcp.ubio.org has SRV record 1 0 80 ANIMALIA.ubio.org.
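For the curious, the same SRV lookup can be done from Java through JNDI's DNS provider (a minimal sketch):
import java.util.Hashtable;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class LsidSrvLookup
{
public static void main(String[] args) throws Exception
{
Hashtable<String,String> env= new Hashtable<String,String>();
env.put("java.naming.factory.initial","com.sun.jndi.dns.DnsContextFactory");
DirContext ctx= new InitialDirContext(env);
// fetch the SRV record for the LSID service of ubio.org
Attributes attrs= ctx.getAttributes("_lsid._tcp.ubio.org",new String[]{"SRV"});
System.out.println(attrs.get("SRV").get()); // prints: 1 0 80 ANIMALIA.ubio.org.
}
}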

So http://ANIMALIA.ubio.org is the location of the LSID service. We append /authority and we get a WSDL file at http://animalia.ubio.org/authority/ (this WSDL is another issue for me: are there really that many bioinformaticians who know how to read such a format?).

<wsdl:definitions targetNamespace="http://www.hyam.net/lsid/Authority">
<import namespace="http://www.omg.org/LSID/2003/AuthorityServiceHTTPBindings"
location="LSIDAuthorityServiceHTTPBindings.wsdl"
/>

<wsdl:service name="MyAuthorityHTTPService">
<wsdl:port name="MyAuthorityHTTPPort" binding="httpsns:LSIDAuthorityHTTPBinding">
<httpsns:address location="http://animalia.ubio.org/authority/index.php"/>
</wsdl:port>
</wsdl:service>
</wsdl:definitions>

At http://animalia.ubio.org/authority/LSIDAuthorityServiceHTTPBindings.wsdl we get the HTTP bindings.
<definitions targetNamespace="http://www.omg.org/LSID/2003/AuthorityServiceHTTPBindings">
<import namespace="http://www.omg.org/LSID/2003/Standard/WSDL" location="LSIDPortTypes.wsdl"/>
<binding name="LSIDAuthorityHTTPBinding" type="sns:LSIDAuthorityServicePortType">
<http:binding verb="GET"/>
<operation name="getAvailableServices">
<http:operation location="/authority/"/>
<input>
<http:urlEncoded/>
</input>
<output>
<mime:multipartRelated>
<mime:part>
<mime:content part="wsdl" type="application/octet-stream"/>
</mime:part>
</mime:multipartRelated>
</output>
</operation>
</binding>
</definitions>

This WSDL tells us that http://animalia.ubio.org/authority/ is the URL where we can find some metadata about the LSID, using HTTP GET. And, by appending metadata.php (why this .php extension? it is not clear to me), you'll get the following RDF metadata about urn:lsid:ubio.org:namebank:11815 (very cool, I like this idea of getting RDF from one identifier). The process of resolving the WSDL can be done once and cached.

<rdf:RDF>
<rdf:Description rdf:about="urn:lsid:ubio.org:namebank:11815">
<dc:identifier>urn:lsid:ubio.org:namebank:11815</dc:identifier>
<dc:creator rdf:resource="http://www.ubio.org"/>
<dc:subject>Pternistis leucoscepus (Gray, GR) 1867</dc:subject>
<ubio:taxonomicGroup>Aves</ubio:taxonomicGroup>
<ubio:recordVersion>4</ubio:recordVersion>
<ubio:canonicalName>Pternistis leucoscepus</ubio:canonicalName>
<dc:title>Pternistis leucoscepus</dc:title>
<dc:type>Scientific Name</dc:type>
<ubio:lexicalStatus>Unknown (Default)</ubio:lexicalStatus>
<gla:rank>Species</gla:rank>
<gla:vernacularName rdf:resource="urn:lsid:ubio.org:namebank:954940"/>
<gla:vernacularName rdf:resource="urn:lsid:ubio.org:namebank:954941"/>
<gla:vernacularName rdf:resource="urn:lsid:ubio.org:namebank:1564236"/>
<gla:vernacularName rdf:resource="urn:lsid:ubio.org:namebank:783787"/>
<gla:vernacularName rdf:resource="urn:lsid:ubio.org:namebank:1580313"/>
<gla:mapping rdf:resource="http://starcentral.mbl.edu/microscope/portal.php?pagetitle=classification&BLCHID=12-4498"/>
<gla:mapping rdf:resource="http://www.cbif.gc.ca/pls/itisca/next?v_tsn=553857&taxa=&p_format=&p_ifx=cbif&p_lang="/>
<gla:hasBasionym rdf:resource="urn:lsid:ubio.org:namebank:12292"/>
<gla:objectiveSynonym rdf:resource="urn:lsid:ubio.org:namebank:12292"/>
<gla:objectiveSynonym rdf:resource="urn:lsid:ubio.org:namebank:1762007"/>
<gla:objectiveSynonym rdf:resource="urn:lsid:ubio.org:namebank:1762032"/>
<gla:objectiveSynonym rdf:resource="urn:lsid:ubio.org:namebank:1762051"/>
<gla:objectiveSynonym rdf:resource="urn:lsid:ubio.org:namebank:3408791"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1116259"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1137821"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1173817"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1174615"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1416177"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1672192"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:2233032"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:13853963"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1909656"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:2304281"/>
<dcterms:bibliographicCitation>Sclater, W.L., Systema Avium Ethiopicarum, p. 91</dcterms:bibliographicCitation>
</rdf:Description>
</rdf:RDF>
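For reference, a minimal sketch of that final metadata request using the same HttpClient style as the CouchDB post above (the lsid query parameter is an assumption about this authority's URL scheme):
GetMethod method= new GetMethod("http://animalia.ubio.org/authority/metadata.php");
// the 'lsid' parameter name is an assumption
method.setQueryString(new NameValuePair[]{new NameValuePair("lsid","urn:lsid:ubio.org:namebank:11815")});
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();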


notebook EOF.

XML to DOM using XSLT

A short post. I was fed up with writing javascript/java code for creating dynamic web interfaces (you know, all those document.createElementNS, document.createTextNode, node.appendChild, etc. statements for building the DOM), so I wrote an XSL stylesheet taking an XML file as input and echoing the code that should be used to build the document. The stylesheet is available at:


For example the following XUL document:
<window id="example-window" title="Example 2.5.4">
<listbox>
<listhead>
<listheader label="Name"/>
<listheader label="Occupation"/>
</listhead>
<listcols>
<listcol/>
<listcol flex="1"/>
</listcols>
<listitem>
<listcell label="George"/>
<listcell label="House Painter"/>
</listitem>
<listitem>
<listcell label="Mary Ellen"/>
<listcell label="Candle Maker"/>
</listitem>
<listitem>
<listcell label="Roger"/>
<listcell label="Swashbuckler"/>
</listitem>
</listbox>
</window>

Will be transformed (using xsltproc xml2dom.xsl file.xul) into the following javascript code:
var window_id2244179= document.createElementNS(XUL.NS,"window");
window_id2244179.setAttribute("id","example-window");
window_id2244179.setAttribute("title","Example 2.5.4");
var listbox_id2244186= document.createElementNS(XUL.NS,"listbox");
window_id2244179.appendChild(listbox_id2244186);
var listhead_id2244188= document.createElementNS(XUL.NS,"listhead");
listbox_id2244186.appendChild(listhead_id2244188);
var listheader_id2244190= document.createElementNS(XUL.NS,"listheader");
listhead_id2244188.appendChild(listheader_id2244190);
listheader_id2244190.setAttribute("label","Name");
var listheader_id2244194= document.createElementNS(XUL.NS,"listheader");
listhead_id2244188.appendChild(listheader_id2244194);
listheader_id2244194.setAttribute("label","Occupation");
var listcols_id2244200= document.createElementNS(XUL.NS,"listcols");
listbox_id2244186.appendChild(listcols_id2244200);
var listcol_id2244202= document.createElementNS(XUL.NS,"listcol");
listcols_id2244200.appendChild(listcol_id2244202);
var listcol_id2244204= document.createElementNS(XUL.NS,"listcol");
listcols_id2244200.appendChild(listcol_id2244204);
listcol_id2244204.setAttribute("flex","1");
var listitem_id2244209= document.createElementNS(XUL.NS,"listitem");
listbox_id2244186.appendChild(listitem_id2244209);
var listcell_id2244211= document.createElementNS(XUL.NS,"listcell");
listitem_id2244209.appendChild(listcell_id2244211);
listcell_id2244211.setAttribute("label","George");
var listcell_id2244215= document.createElementNS(XUL.NS,"listcell");
listitem_id2244209.appendChild(listcell_id2244215);
listcell_id2244215.setAttribute("label","House Painter");
var listitem_id2244221= document.createElementNS(XUL.NS,"listitem");
listbox_id2244186.appendChild(listitem_id2244221);
var listcell_id2244223= document.createElementNS(XUL.NS,"listcell");
listitem_id2244221.appendChild(listcell_id2244223);
listcell_id2244223.setAttribute("label","Mary Ellen");
var listcell_id2244227= document.createElementNS(XUL.NS,"listcell");
listitem_id2244221.appendChild(listcell_id2244227);
listcell_id2244227.setAttribute("label","Candle Maker");
var listitem_id2244232= document.createElementNS(XUL.NS,"listitem");
listbox_id2244186.appendChild(listitem_id2244232);
var listcell_id2244234= document.createElementNS(XUL.NS,"listcell");
listitem_id2244232.appendChild(listcell_id2244234);
listcell_id2244234.setAttribute("label","Roger");
var listcell_id2244238= document.createElementNS(XUL.NS,"listcell");
listitem_id2244232.appendChild(listcell_id2244238);
listcell_id2244238.setAttribute("label","Swashbuckler");


Note: I also have an XML2HTML stylesheet here.

That's it.
Pierre

06 April 2009

Go West !

After one year at the Center for the Study of Human Polymorphisms, I will follow my wife to Nantes (France) on September 1st, 2009. Hum... that is not the best period to find a new occupation, so I hope I'll find a new job there (related to science or to the semantic web). Wanna hire me? Here is my profile on LinkedIn.


(image: Nantes, via Wikipedia)

03 April 2009

Consequences: SNP, cDNA, proteins, etc.

This post is about Consequences, a tool that finds the consequences of a set of mutations mapped on the human genome. It was motivated by a recent post on FriendFeed, where Daniel MacArthur asked: "Given a list of human b36 coordinates for a list of genic SNPs (most not in dbSNP), what would be the quickest way to get a list of the genes they're found in and, if possible, the amino acid position they would affect?".

About one year ago, I wrote a tool named "Consequences" answering this question, but the sources are somewhere in a tar.gz, burned on an old CD, in a cardboard box, in my cellar... so it was faster to re-write this simple code from scratch. The result should be fine, but please tell me if you find a bug.

This tool takes as input a tab delimited file containing the following fields:

  1. A Name for your SNP
  2. the chromosome e.g. 'chr2' (at this time only one chromosome per input is supported)
  3. the position on the chromosome. The first base is indexed at 0
  4. The base observed ON THE PLUS STRAND OF THE GENOME
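For example, the first mutation of the output below would be encoded as a single tab-delimited line:
snp1	chr1	1116	A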
The sequence of the chromosome is then downloaded using the DAS server of the UCSC, the genes are downloaded using the MySQL server of the UCSC and the 'knownGene' table, and then, for each mutation, I simply look at its consequence. Here is a sample of the output:

<consequences chrom="chr1">
<observed-mutation position="1116" name="snp1" base="A">
<gene name="uc001aaa.2" exon-count="3" strand="+" txStart="1115" txEnd="4121" cdsStart="1115" cdsEnd="1115">
<in-utr-3/>
</gene>
<gene name="uc009vip.1" exon-count="2" strand="+" txStart="1115" txEnd="4272" cdsStart="1115" cdsEnd="1115">
<in-utr-3/>
</gene>
</observed-mutation>
(...)
</observed-mutation>
<observed-mutation position="1149167" name="snp282" base="A">
<gene name="uc009vjv.1" exon-count="6" strand="-" txStart="1142150" txEnd="1157310" cdsStart="1142754" cdsEnd="1149171">
<in-exon name="Exon 2" codon-wild="CAG" codon-mut="TAG" aa-wild="Q" aa-mut="*" base-wild="C" base-mut="T" index-cdna="3" index-protein="1">
<wild-cDNA>ATG C AGCGCTGGATCATGGAGAAGACGGCCGAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACCCTGACGGGGACGGTCACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCCGACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAACACAGGAAGTCCTGGAGAACCTGAAGGACCGCTGGTACCAGGCGGACAGCCCCCCTGCAGACCTGCTGCTGACGGAGGAGGAGTTCCTGTCGTTCCTCCACCCCGAGCACAGCCGGGGAATGCTCAGGTTCATGGTGAAGGAGATCGTCCGGGACCTGGACCAGGACGGTGACAAGCAGCTCTCTGTGCCCGAGTTCATCTCCCTGCCCGTGGGCACCGTGGAGAACCAGCAGGGCCAGGACATTGACGACAACTGGGTGAAAGACAGAAAAAAGGAGTTTGAGGAGCTCATTGACTCCAACCACGACGGCATCGTGACCGCCGAGGAGCTGGAGAGCTACATGGACCCCATGAACGAGTACAACGCGCTGAACGAGGCCAAGCAGATGATCGCCGTCGCCGACGAGAACCAGAACCACCACCTGGAGCCCGAGGAGGTGCTCAAGTACAGCGAGTTCTTCACGGGCAGCAAGCTGGTGGACTACGCGCGCAGCGTGCACGAGGAGTTTTGA</wild-cDNA>
<mut-cDNA>ATG T AGCGCTGGATCATGGAGAAGACGGCCGAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACCCTGACGGGGACGGTCACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCCGACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAACACAGGAAGTCCTGGAGAACCTGAAGGACCGCTGGTACCAGGCGGACAGCCCCCCTGCAGACCTGCTGCTGACGGAGGAGGAGTTCCTGTCGTTCCTCCACCCCGAGCACAGCCGGGGAATGCTCAGGTTCATGGTGAAGGAGATCGTCCGGGACCTGGACCAGGACGGTGACAAGCAGCTCTCTGTGCCCGAGTTCATCTCCCTGCCCGTGGGCACCGTGGAGAACCAGCAGGGCCAGGACATTGACGACAACTGGGTGAAAGACAGAAAAAAGGAGTTTGAGGAGCTCATTGACTCCAACCACGACGGCATCGTGACCGCCGAGGAGCTGGAGAGCTACATGGACCCCATGAACGAGTACAACGCGCTGAACGAGGCCAAGCAGATGATCGCCGTCGCCGACGAGAACCAGAACCACCACCTGGAGCCCGAGGAGGTGCTCAAGTACAGCGAGTTCTTCACGGGCAGCAAGCTGGTGGACTACGCGCGCAGCGTGCACGAGGAGTTTTGA</mut-cDNA>
<wild-protein>M Q RWIMEKTAEHFQEAMEESKTHFRAVDPDGDGHVSWDEYKVKFLASKGHSEKEVADAIRLNEELKVDEETQEVLENLKDRWYQADSPPADLLLTEEEFLSFLHPEHSRGMLRFMVKEIVRDLDQDGDKQLSVPEFISLPVGTVENQQGQDIDDNWVKDRKKEFEELIDSNHDGIVTAEELESYMDPMNEYNALNEAKQMIAVADENQNHHLEPEEVLKYSEFFTGSKLVDYARSVHEEF*</wild-protein>
<mut-protein>M * RWIMEKTAEHFQEAMEESKTHFRAVDPDGDGHVSWDEYKVKFLASKGHSEKEVADAIRLNEELKVDEETQEVLENLKDRWYQADSPPADLLLTEEEFLSFLHPEHSRGMLRFMVKEIVRDLDQDGDKQLSVPEFISLPVGTVENQQGQDIDDNWVKDRKKEFEELIDSNHDGIVTAEELESYMDPMNEYNALNEAKQMIAVADENQNHHLEPEEVLKYSEFFTGSKLVDYARSVHEEF*</mut-protein>
</in-exon>
</gene>
<gene name="uc009vjw.1" exon-count="7" strand="-" txStart="1142150" txEnd="1157310" cdsStart="1142150" cdsEnd="1142150">
<in-utr-5/>
</gene>
</observed-mutation>
(...)
<observed-mutation position="1205906" name="snp195" base="A">
<gene name="uc001adt.1" exon-count="18" strand="+" txStart="1205678" txEnd="1217272" cdsStart="1205904" cdsEnd="1216853">
<in-exon name="Exon 1" codon-wild="ATG" codon-mut="ATA" aa-wild="M" aa-mut="I" base-wild="G" base-mut="A" index-cdna="2" index-protein="0">
<wild-cDNA>AT G AGGGCAGTGCTGTCACAGAAGACAACACCGCTCCCTCGTTACCTGTGGCCCGGCCACCTCAGCGGCCCAAGGAGGCTCACCTGGTCATGGTGCAGTGACCACAGGACCCCCACATGCCGGGAGCTGGGTTCGCCCCACCCCACCCCCTGCACCGGGCCAGCGAGGGGATGGCCCAGAAGAGGGGGAGGACCATGTGGATTCACCAGTGCTGGACATGTGCTCTGTGGCTACCCCCTCTGCCTACTCTCTGGCCCGATACAGGGGTGTGGGACAGGCCTGGGTGACTCCAGCATGGCTTTCCTCTCCAGGACGTCACCGGTGGCAGCTGCTTCCTTCCAGAGCCGGCAGGAGGCCAGAGGCTCCATCCTGCTTCAGAGCTGCCAGCTGCCCCCGCAATGGCTGAGCACCGAAGCATGGACGGGAGAATGGAAGCAGCCACACGGGGGGGCTCTCACCTCCAGATCGCCTGGGCCTGTGGCTCCCCAGAGGCCCTGCCACCTGAAGGGATGGCAGCACAGACCCACTCAGCACAACGCTGCCTGCAAACAGGGCCAGGCTGCAGCCCAGACGCCCCCCAGGCCGGGGCCACCATCAGCACCACCACCACCACCCAAGGAGGGGCACCAGGAGGGGCTGGTGGAGCTGCCCGCCTCGTTCCGGGAGCTGCTCACCTTCTTCTGCACCAATGCCACCATCCACGGCGCCATCCGCCTGGTCTGCTCCCGCGGGAACCGCCTCAAGACGACGTCCTGGGGGCTGCTGTCCCTGGGAGCCCTGGTCGCGCTCTGCTGGCAGCTGGGGCTCCTCTTTGAGCGTCACTGGCACCGCCCGGTCCTCATGGCCGTCTCTGTGCACTCGGAGCGCAAGCTGCTCCCGCTGGTCACCCTGTGTGACGGGAACCCACGTCGGCCGAGTCCGGTCCTCCGCCATCTGGAGCTGCTGGACGAGTTTGCCAGGGAGAACATTGACTCCCTGTACAACGTCAACCTCAGCAAAGGCAGAGCCGCCCTCTCCGCCACTGTCCCCCGCCACGAGCCCCCCTTCCACCTGGACCGGGAGATCCGTCTGCAGAGGCTGAGCCACTCGGGCAGCCGGGTCAGAGTGGGGTTCAGACTGTGCAACAGCACGGGCGGCGACTGCTTTTACCGAGGCTACACGTCAGGCGTGGCGGCTGTCCAGGACTGGTACCACTTCCACTATGTGGATATCCTGGCCCTGCTGCCCGCGGCATGGGAGGACAGCCACGGGAGCCAGGACGGCCACTTCGTCCTCTCCTGCAGTTACGATGGCCTGGACTGCCAGGCCCGACAGTTCCGGACCTTCCACCACCCCACCTACGGCAGCTGCTACACGGTCGATGGCGTCTGGACAGCTCAGCGCCCCGGCATCACCCACGGAGTCGGCCTGGTCCTCAGGGTTGAGCAGCAGCCTCACCTCCCTCTGCTGTCCACGCTGGCCGGCATCAGGGTCATGGTTCACGGCCGTAACCACACGCCCTTCCTGGGGCACCACAGCTTCAGCGTCCGGCCAGGGACGGAGGCCACCATCAGCATCCGAGAGGACGAGGTGCACCGGCTCGGGAGCCCCTACGGCCACTGCACCGCCGGCGGGGAAGGCGTGGAGGTGGAGCTGCTACACAACACCTCCTACACCAGGCAGGCCTGCCTGGTGTCCTGCTTCCAGCAGCTGATGGTGGAGACCTGCTCCTGTGGCTACTACCTCCACCCTCTGCCGGCGGGGGCTGAGTACTGCAGCTCTGCCCGGCACCCTGCCTGGGGACACTGCTTCTACCGCCTCTACCAGGACCTGGAGACCCACCGGCTCCCCTGTACCTCCCGCTGCCCCAGGCCCTGCAGGGAGTCTGCATTCAAGCTCTCCACTGGGACCTCCAGGTGGCCTTCCGCCAAGTCAGCTGGATGGACTCTGGCCACGCTAGGTGAACAGGGGCTGCCGCATCAGAGCCACAGACAGAGGAGCAGCCTGGCCAAAATCAACATCGTCTACCAGGAGCTCAACTACCGCTCAGTGGAGGAGGCGCCCGTGTACTCGGTGCCGCAGCTGCTCTCGGCCATGGGCAGCCTCTGCAGCCTGTGGTTTGGGGCCTCCGTCCTCTCCCTCCTGGAGCTCCTGGAGCTGCTGCTCGATGCTTCTGCCCTCACCCTGGTGCTAGGCGGCCGCCGGCTCCGCAGGGCGTGGTTCTCCTGGCCCAGAGCCAGCCCTGCCTCAGGGGCGTCCAGCATCAAGCCAGAGGCCAGTCAGATGCCCCCGCCTGCAGGCGGCACGTCAGATGACCCGGAGCCCAGCGGGCCTCATCTCCCACGGGTGATGCTTCCAGGGGTTCTGGCGGGAGTCTCAGCCGAAGAGAGCTGGGCTGGGCCCCAGCCCCTTGAGACTCTGGACACCTGA</wild-cDNA>
<mut-cDNA>AT A AGGGCAGTGCTGTCACAGAAGACAACACCGCTCCCTCGTTACCTGTGGCCCGGCCACCTCAGCGGCCCAAGGAGGCTCACCTGGTCATGGTGCAGTGACCACAGGACCCCCACATGCCGGGAGCTGGGTTCGCCCCACCCCACCCCCTGCACCGGGCCAGCGAGGGGATGGCCCAGAAGAGGGGGAGGACCATGTGGATTCACCAGTGCTGGACATGTGCTCTGTGGCTACCCCCTCTGCCTACTCTCTGGCCCGATACAGGGGTGTGGGACAGGCCTGGGTGACTCCAGCATGGCTTTCCTCTCCAGGACGTCACCGGTGGCAGCTGCTTCCTTCCAGAGCCGGCAGGAGGCCAGAGGCTCCATCCTGCTTCAGAGCTGCCAGCTGCCCCCGCAATGGCTGAGCACCGAAGCATGGACGGGAGAATGGAAGCAGCCACACGGGGGGGCTCTCACCTCCAGATCGCCTGGGCCTGTGGCTCCCCAGAGGCCCTGCCACCTGAAGGGATGGCAGCACAGACCCACTCAGCACAACGCTGCCTGCAAACAGGGCCAGGCTGCAGCCCAGACGCCCCCCAGGCCGGGGCCACCATCAGCACCACCACCACCACCCAAGGAGGGGCACCAGGAGGGGCTGGTGGAGCTGCCCGCCTCGTTCCGGGAGCTGCTCACCTTCTTCTGCACCAATGCCACCATCCACGGCGCCATCCGCCTGGTCTGCTCCCGCGGGAACCGCCTCAAGACGACGTCCTGGGGGCTGCTGTCCCTGGGAGCCCTGGTCGCGCTCTGCTGGCAGCTGGGGCTCCTCTTTGAGCGTCACTGGCACCGCCCGGTCCTCATGGCCGTCTCTGTGCACTCGGAGCGCAAGCTGCTCCCGCTGGTCACCCTGTGTGACGGGAACCCACGTCGGCCGAGTCCGGTCCTCCGCCATCTGGAGCTGCTGGACGAGTTTGCCAGGGAGAACATTGACTCCCTGTACAACGTCAACCTCAGCAAAGGCAGAGCCGCCCTCTCCGCCACTGTCCCCCGCCACGAGCCCCCCTTCCACCTGGACCGGGAGATCCGTCTGCAGAGGCTGAGCCACTCGGGCAGCCGGGTCAGAGTGGGGTTCAGACTGTGCAACAGCACGGGCGGCGACTGCTTTTACCGAGGCTACACGTCAGGCGTGGCGGCTGTCCAGGACTGGTACCACTTCCACTATGTGGATATCCTGGCCCTGCTGCCCGCGGCATGGGAGGACAGCCACGGGAGCCAGGACGGCCACTTCGTCCTCTCCTGCAGTTACGATGGCCTGGACTGCCAGGCCCGACAGTTCCGGACCTTCCACCACCCCACCTACGGCAGCTGCTACACGGTCGATGGCGTCTGGACAGCTCAGCGCCCCGGCATCACCCACGGAGTCGGCCTGGTCCTCAGGGTTGAGCAGCAGCCTCACCTCCCTCTGCTGTCCACGCTGGCCGGCATCAGGGTCATGGTTCACGGCCGTAACCACACGCCCTTCCTGGGGCACCACAGCTTCAGCGTCCGGCCAGGGACGGAGGCCACCATCAGCATCCGAGAGGACGAGGTGCACCGGCTCGGGAGCCCCTACGGCCACTGCACCGCCGGCGGGGAAGGCGTGGAGGTGGAGCTGCTACACAACACCTCCTACACCAGGCAGGCCTGCCTGGTGTCCTGCTTCCAGCAGCTGATGGTGGAGACCTGCTCCTGTGGCTACTACCTCCACCCTCTGCCGGCGGGGGCTGAGTACTGCAGCTCTGCCCGGCACCCTGCCTGGGGACACTGCTTCTACCGCCTCTACCAGGACCTGGAGACCCACCGGCTCCCCTGTACCTCCCGCTGCCCCAGGCCCTGCAGGGAGTCTGCATTCAAGCTCTCCACTGGGACCTCCAGGTGGCCTTCCGCCAAGTCAGCTGGATGGACTCTGGCCACGCTAGGTGAACAGGGGCTGCCGCATCAGAGCCACAGACAGAGGAGCAGCCTGGCCAAAATCAACATCGTCTACCAGGAGCTCAACTACCGCTCAGTGGAGGAGGCGCCCGTGTACTCGGTGCCGCAGCTGCTCTCGGCCATGGGCAGCCTCTGCAGCCTGTGGTTTGGGGCCTCCGTCCTCTCCCTCCTGGAGCTCCTGGAGCTGCTGCTCGATGCTTCTGCCCTCACCCTGGTGCTAGGCGGCCGCCGGCTCCGCAGGGCGTGGTTCTCCTGGCCCAGAGCCAGCCCTGCCTCAGGGGCGTCCAGCATCAAGCCAGAGGCCAGTCAGATGCCCCCGCCTGCAGGCGGCACGTCAGATGACCCGGAGCCCAGCGGGCCTCATCTCCCACGGGTGATGCTTCCAGGGGTTCTGGCGGGAGTCTCAGCCGAAGAGAGCTGGGCTGGGCCCCAGCCCCTTGAGACTCTGGACACCTGA</mut-cDNA>
<wild-protein> M RAVLSQKTTPLPRYLWPGHLSGPRRLTWSWCSDHRTPTCRELGSPHPTPCTGPARGWPRRGGGPCGFTSAGHVLCGYPLCLLSGPIQGCGTGLGDSSMAFLSRTSPVAAASFQSRQEARGSILLQSCQLPPQWLSTEAWTGEWKQPHGGALTSRSPGPVAPQRPCHLKGWQHRPTQHNAACKQGQAAAQTPPRPGPPSAPPPPPKEGHQEGLVELPASFRELLTFFCTNATIHGAIRLVCSRGNRLKTTSWGLLSLGALVALCWQLGLLFERHWHRPVLMAVSVHSERKLLPLVTLCDGNPRRPSPVLRHLELLDEFARENIDSLYNVNLSKGRAALSATVPRHEPPFHLDREIRLQRLSHSGSRVRVGFRLCNSTGGDCFYRGYTSGVAAVQDWYHFHYVDILALLPAAWEDSHGSQDGHFVLSCSYDGLDCQARQFRTFHHPTYGSCYTVDGVWTAQRPGITHGVGLVLRVEQQPHLPLLSTLAGIRVMVHGRNHTPFLGHHSFSVRPGTEATISIREDEVHRLGSPYGHCTAGGEGVEVELLHNTSYTRQACLVSCFQQLMVETCSCGYYLHPLPAGAEYCSSARHPAWGHCFYRLYQDLETHRLPCTSRCPRPCRESAFKLSTGTSRWPSAKSAGWTLATLGEQGLPHQSHRQRSSLAKINIVYQELNYRSVEEAPVYSVPQLLSAMGSLCSLWFGASVLSLLELLELLLDASALTLVLGGRRLRRAWFSWPRASPASGASSIKPEASQMPPPAGGTSDDPEPSGPHLPRVMLPGVLAGVSAEESWAGPQPLETLDT*</wild-protein>
<mut-protein> I RAVLSQKTTPLPRYLWPGHLSGPRRLTWSWCSDHRTPTCRELGSPHPTPCTGPARGWPRRGGGPCGFTSAGHVLCGYPLCLLSGPIQGCGTGLGDSSMAFLSRTSPVAAASFQSRQEARGSILLQSCQLPPQWLSTEAWTGEWKQPHGGALTSRSPGPVAPQRPCHLKGWQHRPTQHNAACKQGQAAAQTPPRPGPPSAPPPPPKEGHQEGLVELPASFRELLTFFCTNATIHGAIRLVCSRGNRLKTTSWGLLSLGALVALCWQLGLLFERHWHRPVLMAVSVHSERKLLPLVTLCDGNPRRPSPVLRHLELLDEFARENIDSLYNVNLSKGRAALSATVPRHEPPFHLDREIRLQRLSHSGSRVRVGFRLCNSTGGDCFYRGYTSGVAAVQDWYHFHYVDILALLPAAWEDSHGSQDGHFVLSCSYDGLDCQARQFRTFHHPTYGSCYTVDGVWTAQRPGITHGVGLVLRVEQQPHLPLLSTLAGIRVMVHGRNHTPFLGHHSFSVRPGTEATISIREDEVHRLGSPYGHCTAGGEGVEVELLHNTSYTRQACLVSCFQQLMVETCSCGYYLHPLPAGAEYCSSARHPAWGHCFYRLYQDLETHRLPCTSRCPRPCRESAFKLSTGTSRWPSAKSAGWTLATLGEQGLPHQSHRQRSSLAKINIVYQELNYRSVEEAPVYSVPQLLSAMGSLCSLWFGASVLSLLELLELLLDASALTLVLGGRRLRRAWFSWPRASPASGASSIKPEASQMPPPAGGTSDDPEPSGPHLPRVMLPGVLAGVSAEESWAGPQPLETLDT*</mut-protein>
</in-exon>
</gene>
<gene name="uc001adu.1" exon-count="17" strand="+" txStart="1205678" txEnd="1217272" cdsStart="1209267" cdsEnd="1216853">
<in-utr-5/>
</gene>
</observed-mutation>
(...)
</consequences>


The source code is available here:

A 'jar' is available for download at http://lindenb.googlecode.com/files/consequences.jar.
Running the tool:
java -cp {path}/mysql-connector-java-xxxx-bin.jar:consequences.jar org.lindenb.tinytools.Consequences your-list-of-snp.txt


Well, that is not big science but it might be helpful.
That's it.

Pierre