28 April 2011

Mapping the people I'm following on Twitter (KML, Java, geonames.org)

I wrote a Java tool to map the people I'm following on Twitter. The tool invokes the Twitter API to fetch the profiles of my contacts and uses the geonames.org web services to guess the geolocation of the places they declare.

The source code is available on github at https://github.com/lindenb/jsandbox/blob/master/src/sandbox/TwitterToKML.java, together with the build.xml, in the same repository.

Compilation

ant twitterkml
# get your numeric twitter-id from "http://api.twitter.com/1/users/show.xml?screen_name=<your-twitter-username>"
java -jar dist/twitterkml.jar -g <geonames-id> -o result.kml <twitter-numeric-id>
# NOTE: I don't use the OAuth API, so when the 'rate limit' is reached my program waits until it is reset.
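The output is a KML file with one placemark per geolocated contact. A minimal, hypothetical sketch of that last step (the `User` record and `buildKml()` are my simplifications, not the actual TwitterToKML code):

```java
// Sketch: write one KML <Placemark> per geolocated Twitter contact.
// The User class and buildKml() are hypothetical simplifications,
// not the actual TwitterToKML.java code.
import java.util.List;

public class KmlSketch {
    static class User {
        final String screenName;
        final double longitude, latitude; // as guessed from geonames.org
        User(String screenName, double longitude, double latitude) {
            this.screenName = screenName;
            this.longitude = longitude;
            this.latitude = latitude;
        }
    }

    static String buildKml(List<User> users) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<kml xmlns=\"http://www.opengis.net/kml/2.2\"><Document>\n");
        for (User u : users) {
            sb.append(" <Placemark><name>").append(u.screenName).append("</name>");
            // KML coordinates are longitude,latitude[,altitude]
            sb.append("<Point><coordinates>")
              .append(u.longitude).append(',').append(u.latitude)
              .append("</coordinates></Point></Placemark>\n");
        }
        sb.append("</Document></kml>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildKml(List.of(new User("yokofakun", -1.55, 47.22))));
    }
}
```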

'following': [map embedded in the original post]

'followers': [map embedded in the original post]

That's it,
Pierre

dbNSFP: a lightweight database of human non-synonymous SNPs and their functional predictions

People from the "Human Genetics Center" in Houston have compiled a new resource named dbNSFP and described it in http://www.ncbi.nlm.nih.gov/pubmed/21520341.

Hum Mutat. 2011 Apr 21. doi:10.1002/humu.21517.
dbNSFP: a lightweight database of human non-synonymous SNPs and their functional predictions.
Liu X, Jian X, Boerwinkle E.


They have compiled the "prediction scores from four new and popular algorithms (SIFT, Polyphen2, LRT and MutationTaster), along with a conservation score (PhyloP) and other related information, for every potential NS in the human genome (a total of 75,931,005)".

So you don't have to submit new jobs to SIFT or Polyphen: everything has already been computed and merged here.

The database is available from http://sites.google.com/site/jpopgen/dbNSFP.

Downloading

lindenb@yokofakun:~$ wget "http://dl.dropbox.com/u/17001647/dbNSFP/dbNSFP.chr1-22XY.zip"
--2011-04-27 13:50:26-- http://dl.dropbox.com/u/17001647/dbNSFP/dbNSFP.chr1-22XY.zip
Proxy request sent, awaiting response... 200 OK
Length: 1200703405 (1.1G) [application/zip]
Saving to: `dbNSFP.chr1-22XY.zip'

100%[=================================================================================================================>] 1,200,703,405 1.82M/s in 10m 11s

2011-04-27 14:00:38 (1.87 MB/s) - `dbNSFP.chr1-22XY.zip' saved [1200703405/1200703405]

Content

unzip -t dbNSFP.chr1-22XY.zip
Archive: dbNSFP.chr1-22XY.zip
testing: dbNSFP.chr1 OK
testing: dbNSFP.chr10 OK
testing: dbNSFP.chr11 OK
testing: dbNSFP.chr12 OK
testing: dbNSFP.chr13 OK
testing: dbNSFP.chr14 OK
testing: dbNSFP.chr15 OK
testing: dbNSFP.chr16 OK
testing: dbNSFP.chr17 OK
testing: dbNSFP.chr18 OK
testing: dbNSFP.chr19 OK
testing: dbNSFP.chr2 OK
testing: dbNSFP.chr20 OK
testing: dbNSFP.chr21 OK
testing: dbNSFP.chr22 OK
testing: dbNSFP.chr3 OK
testing: dbNSFP.chr4 OK
testing: dbNSFP.chr5 OK
testing: dbNSFP.chr6 OK
testing: dbNSFP.chr7 OK
testing: dbNSFP.chr8 OK
testing: dbNSFP.chr9 OK
testing: dbNSFP.chrX OK
testing: dbNSFP.chrY OK
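Each entry is a plain tab-delimited text file, one per chromosome. If you'd rather not inflate the whole 1.1G archive on disk, the entries can be read directly with the standard java.util.zip API; a generic sketch (nothing here is specific to dbNSFP):

```java
// Sketch: read the entries of a zip archive (e.g. dbNSFP.chr1-22XY.zip)
// directly, without inflating it on disk. Plain java.util.zip.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipPeek {
    /** returns "entryName\tfirstLine" for each entry of the archive */
    public static List<String> firstLines(String zipPath) throws Exception {
        List<String> result = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                try (BufferedReader r = new BufferedReader(
                        new InputStreamReader(zf.getInputStream(entry)))) {
                    // the first line of each dbNSFP file is the column header
                    result.add(entry.getName() + "\t" + r.readLine());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String s : firstLines(args[0])) System.out.println(s);
    }
}
```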

Sample (verticalized)

>>2
$1 #chr : 22
$2 pos(1-based) : 15453440
$3 ref : T
$4 alt : G
$5 aaref : M
$6 aaalt : L
$7 hg19pos(1-based) : 17073440
$8 genename : CCT8L2
$9 geneid : 150160
$10 CCDSid : CCDS13738.1
$11 refcodon : ATG
$12 codonpos : 1
$13 fold-degenerate : 0
$14 aapos : 1
$15 cds_strand : -
$16 LRT_Omega : 1.116940
$17 PhyloP_score : 0.963611
$18 PlyloP_pred : C
$19 SIFT_score : 1.0
$20 SIFT_pred : D
$21 Polyphen2_score : 0.25
$22 Polyphen2_pred : P
$23 LRT_score : 0.419288
$24 LRT_pred : U
$25 MutationTaster_score : 1.0
$26 MutationTaster_pred : D
<<2
>>3
$1 #chr : 22
$2 pos(1-based) : 15453440
$3 ref : T
$4 alt : C
$5 aaref : M
$6 aaalt : V
$7 hg19pos(1-based) : 17073440
$8 genename : CCT8L2
$9 geneid : 150160
$10 CCDSid : CCDS13738.1
$11 refcodon : ATG
$12 codonpos : 1
$13 fold-degenerate : 0
$14 aapos : 1
$15 cds_strand : -
$16 LRT_Omega : 1.116940
$17 PhyloP_score : 0.963611
$18 PlyloP_pred : C
$19 SIFT_score : 1.0
$20 SIFT_pred : D
$21 Polyphen2_score : 0.25
$22 Polyphen2_pred : P
$23 LRT_score : 0.419288
$24 LRT_pred : U
$25 MutationTaster_score : 1.0
$26 MutationTaster_pred : D
<<3
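Each row is tab-delimited with the 26 columns shown in the verticalized sample above ($1..$26). A quick, hypothetical sketch of extracting a few of the prediction columns from one line (this is my code, not part of dbNSFP):

```java
// Sketch: parse one tab-delimited dbNSFP row and pull out a few columns.
// The 0-based array indexes follow the verticalized sample ($1=chr ... $26).
public class DbNsfpRow {
    final String chrom; final int pos; final String ref, alt;
    final double siftScore; final String siftPred;
    final double polyphen2Score; final String polyphen2Pred;

    DbNsfpRow(String line) {
        String[] tokens = line.split("\t");
        this.chrom = tokens[0];                 // $1  #chr
        this.pos = Integer.parseInt(tokens[1]); // $2  pos(1-based)
        this.ref = tokens[2];                   // $3  ref
        this.alt = tokens[3];                   // $4  alt
        this.siftScore = Double.parseDouble(tokens[18]);      // $19 SIFT_score
        this.siftPred = tokens[19];             // $20 SIFT_pred
        this.polyphen2Score = Double.parseDouble(tokens[20]); // $21 Polyphen2_score
        this.polyphen2Pred = tokens[21];        // $22 Polyphen2_pred
    }

    public static void main(String[] args) {
        // the first record of the sample above, flattened back to one line
        String line = "22\t15453440\tT\tG\tM\tL\t17073440\tCCT8L2\t150160"
            + "\tCCDS13738.1\tATG\t1\t0\t1\t-\t1.116940\t0.963611\tC"
            + "\t1.0\tD\t0.25\tP\t0.419288\tU\t1.0\tD";
        DbNsfpRow row = new DbNsfpRow(line);
        System.out.println(row.chrom + ":" + row.pos + " " + row.ref + ">" + row.alt
            + " SIFT=" + row.siftScore + "/" + row.siftPred);
    }
}
```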


That's it,

Pierre

22 April 2011

Playing with the HTML5 File API: translating a Fasta file.

In the current post, I'm using the new HTML5 File API. This API can read the content of a file on the client side, without sending it to a remote server. Let me repeat this:

YOU DO NOT NEED A SERVER
YOU DO NOT NEED TO COPY AND PASTE THE CONTENT OF THE FILE INTO A TEXTAREA

As an example, the following code reads a whole DNA fasta file stored on your computer and translates each DNA sequence into a protein. When the user selects a new file, a FileReader object is created, and a callback function translating the DNA is invoked once the fasta file has been loaded.
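The translation step itself is independent of the File API. Here is a sketch of it in Java (the demo itself runs as JavaScript in the browser; this mirrors the logic, it is not the actual page code):

```java
// Sketch of the DNA -> protein step of the demo: standard genetic code,
// '*' for stop codons, '?' for codons containing a non-ACGT base.
public class Translate {
    // 64 amino acids, codon bases ordered T,C,A,G (index = 16*b1 + 4*b2 + b3)
    private static final String AA =
        "FFLLSSSSYY**CC*W" + // TTT..TGG
        "LLLLPPPPHHQQRRRR" + // CTT..CGG
        "IIIMTTTTNNKKSSRR" + // ATT..AGG
        "VVVVAAAADDEEGGGG";  // GTT..GGG

    private static int baseIndex(char c) {
        switch (Character.toUpperCase(c)) {
            case 'T': return 0;
            case 'C': return 1;
            case 'A': return 2;
            case 'G': return 3;
            default:  return -1; // N, gaps, etc.
        }
    }

    public static String translate(String dna) {
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 2 < dna.length(); i += 3) {
            int b1 = baseIndex(dna.charAt(i));
            int b2 = baseIndex(dna.charAt(i + 1));
            int b3 = baseIndex(dna.charAt(i + 2));
            protein.append(b1 < 0 || b2 < 0 || b3 < 0
                ? '?' : AA.charAt(16 * b1 + 4 * b2 + b3));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGGAATTCTAA")); // prints MEF*
    }
}
```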

Test (your browser must support HTML5): [interactive demo embedded in the original post]

Source code: [embedded in the original post]

That's it,

Pierre

15 April 2011

"404 not found": An update for "bioinformatics/cabios"

Yesterday, I blogged about the persistence of the URLs present in the abstracts of NAR. Today, I've updated my tool and used it to scan the abstracts returned by the following pubmed query: "Bioinformatics"[JOUR] OR "Comput Appl Biosci"[JOUR].

Here is the result:

Year       Total  Alive  %
(no year)  18     1      5
1995       1      0      0
1996       9      3      33
1997       13     3      23
1998       86     19     22
1999       70     17     24
2000       83     25     30
2001       110    64     58
2002       121    78     64
2003       284    170    59
2004       402    257    63
2005       495    359    72
2006       374    297    79
2007       448    381    85
2008       466    415    89
2009       507    462    91
2010       605    566    93
2011       283    268    94


Again, even if we can reach a web site, it doesn't mean that the service described in an article is still available or maintained.

As suggested by Egon Willighagen, I've uploaded the RDF output of my program on figshare: http://figshare.com/figures/index.php/Bioinformatics.404_20110415.rdf.

That's it,

Pierre

14 April 2011

"404 not found": a database of non-functional resources in the NAR database collection

Today, Andra Waagmeester asked on Biostar: "NAR nicely lists all their database issues on http://www.oxfordjournals.org/nar/database/c/. Is the list also available in a downloadable format?".

I suggested downloading from pubmed all the articles published in an annual database issue of NAR, extracting the URLs from the abstracts, and checking whether they were still active. I just wrote a java program doing this job (it is available on github at https://github.com/lindenb/jsandbox/blob/master/src/sandbox/NucleicAcidsResearch404.java).

A few comments:

  • The connection timeout was set to 10 seconds.
  • Some URLs are poorly written, e.g.: http://www.ncbi.nlm.nih.gov/pubmed/14681415
  • An abstract can contain more than one URL.
  • There can be different URLs for the same database.
  • Getting an HTTP 404 error doesn't mean that the database has really been discontinued.
  • Getting an HTTP 200 status doesn't mean that the database is still active and/or maintained.
  • 1155 URLs were extracted from this pubmed query: `"Nucleic Acids Res"[JOUR] "Database issue"[ISS]` (as far as I can see, this query only goes back to 2004). Edit: OK, that was because NCBI eFetch is limited to 10K records.
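The core of the job is just two steps: pull URLs out of the abstract text with a regular expression, then issue a request with a 10-second timeout and look at the HTTP status. A simplified, hypothetical sketch (not the actual NucleicAcidsResearch404.java):

```java
// Simplified sketch: extract URLs from an abstract and test each one with
// a 10-second timeout. Hypothetical code, not the actual program.
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Url404 {
    // deliberately naive pattern; abstracts contain poorly written URLs
    private static final Pattern URL_RX =
        Pattern.compile("https?://[^\\s\\)\\(\"']+");

    public static List<String> extractUrls(String abstractText) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_RX.matcher(abstractText);
        while (m.find()) {
            String url = m.group();
            // trim trailing punctuation glued to the URL by the sentence
            while (url.endsWith(".") || url.endsWith(",")) {
                url = url.substring(0, url.length() - 1);
            }
            urls.add(url);
        }
        return urls;
    }

    /** returns the HTTP status code, or -1 on timeout/error */
    public static int status(String url) {
        try {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            con.setConnectTimeout(10 * 1000); // the 10-second timeout mentioned above
            con.setReadTimeout(10 * 1000);
            con.setRequestMethod("HEAD");
            return con.getResponseCode();
        } catch (Exception err) {
            return -1;
        }
    }
}
```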


Year   Count(URL)   Count(Active)   %
2004   157          100             63
2005   162          114             70
2006   186          147             79
2007   194          158             81
2008   206          180             87
2009   208          186             89
2010   147          136             92
2011   200          193             96


... a snapshot of the output...

(...)

Credit for the Title: Neil Saunders ;-)


Update:
It seems that the URLs in the abstracts are broken where they were cut across lines in the PDF!
(via openwetware.org)


That's it,
Pierre