22 September 2016

Writing a Custom ReadFilter for the GATK, my notebook.


The GATK contains a set of predefined read filters that "filter or transfer incoming SAM/BAM data files":

With the help of the modular architecture of the GATK, it's possible to write a custom ReadFilter. In this post I'll write a ReadFilter that removes the reads if they contain a specific sequence in a soft-clipped segment.
The Read filters extend the class org.broadinstitute.gatk.engine.filters.ReadFilter:
public abstract class ReadFilter implements SamRecordFilter {
    public void initialize(GenomeAnalysisEngine engine) {}
    public boolean filterOut(final SAMRecord first, final SAMRecord second) {
        throw new UnsupportedOperationException("Paired filter not implemented: " + this.getClass());
    }
}
Thus the implementation of our filter, named My is defined below:
The class is compiled and packaged using the gatk itself as a library :
mkdir -p  org/broadinstitute/gatk/engine/filters
cp MyFilter.java org/broadinstitute/gatk/engine/filters
javac -cp /path/to/GenomeAnalysisTK.jar org/broadinstitute/gatk/engine/filters/MyFilter.java
jar cvf myfilter.jar org
rm -rf org
The GATK main class is org.broadinstitute.gatk.engine.CommandLineGATK, we can now check that there is a filter named My in the list of read Filters.
$ java -cp myfilter.jar:/path/to/GenomeAnalysisTK.jar  org.broadinstitute.gatk.engine.CommandLineGATK  \
  -T ReadLengthDistribution \
  --read_filter generate_and_error_please
(...)
##### ERROR MESSAGE: Invalid command line: Malformed read filter: Read filter generate_and_error_please not found. Available read filters:
##### ERROR 
##### ERROR                                 FilterName        Documentation
##### ERROR                           MissingReadGroup        https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_MissingReadGroupFilter.php
##### ERROR                                         My        https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_MyFilter.php
##### ERROR                    NoOriginalQualityScores        https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_NoOriginalQualityScoresFilter.php
We can now use the GATK tools with or without the 'My' filter .
java -cp myfilter.jar:/path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.engine.CommandLineGATK \
  -T ReadLengthDistribution \
  -R /path/to/ref.fa \
  -I /path/to/test.bam \
  -L 22 -o dist.ATGATA.tbl \
  --read_filter My --clippattern ATGATA 
(...)
INFO  10:53:32,412 MicroScheduler - 32 reads were filtered out during the traversal out of approximately 12434 total reads (0.26%) 
INFO  10:53:32,412 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
INFO  10:53:32,412 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
INFO  10:53:32,413 MicroScheduler -   -> 32 reads (0.26% of total) failing MyFilter

$ cat dist.ATGATA.tbl
#:GATKReport.v1.1:1
#:GATKTable:2:130:%s:%s:;
#:GATKTable:ReadLengthDistribution:Table of read length distributions
readLength  Sample
        19        6
        20        4
        21        4
        22       13
        23        4
        24       12



while without 'My' the output is:

java -cp myfilter.jar:/path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.engine.CommandLineGATK \
  -T ReadLengthDistribution \
  -R /path/to/ref.fa \
  -I /path/to/test.bam \
  -L 22 -o dist.all.tbl
(...)
INFO  10:53:35,713 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 12434 total reads (0.00%) 
INFO  10:53:35,713 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
INFO  10:53:35,714 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
------------------------------------------------------------------------------------------
Done. There were no warn messages.
------------------------------------------------------------------------------------------

$ cat  dist.all.tbl
#:GATKReport.v1.1:1
#:GATKTable:2:130:%s:%s:;
#:GATKTable:ReadLengthDistribution:Table of read length distributions
readLength  Sample
        19        6
        20        4
        21        4
        22       13
        23        4
        24       12

The Makefile



That's it,
Pierre

No comments: