kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic datasets

Christopher Fletez-Brant*, Dongwon Lee*†, Andrew McCallion and Michael A. Beer†

*, These authors contributed equally to this work.
†, To whom correspondence should be addressed:

McKusick-Nathans Institute of Genetic Medicine,
Johns Hopkins University School of Medicine,
733 N. Broadway, BRB Suite 573,
Baltimore, MD 21205, USA

Abstract

Massively parallel sequencing technologies have made the generation of genomic datasets a routine component of many biological investigations. For example, ChIP-seq and DNase-seq assays detect genomic regions bound by specific factors or open chromatin regulatory regions. A bottleneck in the interpretation of these data is the identification of the underlying molecular mechanisms that define these transcription factor (TF) bound or open chromatin regions. We have recently developed a novel computational methodology that uses a Support Vector Machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short TF binding sites that determine the tissue specificity of these genomic assays [Lee et al. Genome Research 2011]. This regulatory information can 1) give confidence in genomic experiments by recovering previously known binding sites, and 2) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here we describe the development and implementation of a web server that allows the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic datasets. We analyze five recently published datasets and demonstrate how this tool identifies novel accessory factors and repressive sequence elements. kmer-SVM is available at http://kmersvm.beerlab.org.

Installation for kmer-SVM Tool from Galaxy ToolShed

kmer-SVM can be downloaded from the Galaxy Tool Shed at http://toolshed.g2.bx.psu.edu/ by searching for 'kmersvm'. It is also available as kmersvm.tar.gz. Once downloaded, extract the archive. The resulting directory contains 'kmersvm'; move it to /path/to/galaxy-dist/tools. Generally:

tar -xzvf kmersvm-{changeno}
cd kmersvm-{changeno}
mv kmersvm /path/to/galaxy-dist/tools
                
where {changeno} is the changeset number assigned by Galaxy.

Dependencies

Everyone:

  1. Galaxy Project Server
  2. Swig (needed specifically to install python-modular package from Shogun Toolbox)
  3. Numpy
  4. Shogun Toolbox, v0.9.3 - v1.10
  5. Bitarray
  6. R
  7. ROCR R Package (Available through CRAN)

Mac Users:

  1. Xcode (Mac App Store)
  2. Fortran Compiler

Note that Fortran compiler binaries are provided for Mac users. However, if difficulties are encountered during installation, it may help to build the Fortran compiler from source. Additionally, be sure to add the location of your Shogun installation to the PYTHONPATH.

kmer-SVM has been tested with Python 2.6 and 2.7 on Linux and Mac OS X. At this time kmer-SVM has not been tested on Windows.
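Before configuring Galaxy, it can be useful to confirm that the Python dependencies are importable from the interpreter Galaxy will use. The following is a minimal sketch, not part of the kmer-SVM distribution. The helper name missing_modules is hypothetical, and the import name "shogun" is an assumption: the Shogun Python interface may be exposed under a different module name depending on the release.

```python
def missing_modules(names):
    """Return the names in `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names for the Python dependencies listed above.
# "shogun" in particular is a guess; adjust it to match your Shogun build.
wanted = ["numpy", "bitarray", "shogun"]
for name in missing_modules(wanted):
    print("missing: " + name)
```

If Shogun is reported as missing even though it is installed, the PYTHONPATH setting mentioned above is the first thing to check.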

Required Files

Tool and Test Files

Use the install.sh script to install the files required by the kmer-SVM tools and tool tests. Call install.sh with the path to the tools folder, as follows:

sh install.sh /path/to/galaxy-dist/tools

Index Files

For efficient access to genome-wide data, "Generate Null Sequence" and "Sequence Profiles" rely on binary index files generated by the script nullseq_build_indices.py. Download the *.tar or *.zip files for each genome to be analyzed. To create indices for a specific genome, call nullseq_build_indices.py. For example:

python nullseq_build_indices.py mm8.zip mm8
          

Alternatively, we offer a handful of prepared index files, which should be downloaded and then extracted.

Next, open the file tool-data/nullseq_indices.loc and add the path to each set of indices, following the instructions included in that file. For the mm8, mm9, hg18 and hg19 genomes, for example, you would add the following lines to nullseq_indices.loc:

mm8	Mouse(mm8)	/path/to/nullseq_indices_mm8
mm9	Mouse(mm9)	/path/to/nullseq_indices_mm9
hg18	Human(hg18)	/path/to/nullseq_indices_hg18
hg19	Human(hg19)	/path/to/nullseq_indices_hg19
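The entries above can also be generated rather than typed by hand, which helps avoid stray spaces where tabs are expected. This is a minimal sketch that assumes the three fields (build ID, display name, index path) are tab-separated, as in the lines above. The function name loc_line is hypothetical and not part of kmer-SVM.

```python
def loc_line(build, label, index_dir):
    """Format one entry: build ID, display name and index path, tab-separated."""
    return "\t".join([build, label, index_dir])


# Placeholder paths, as in the example lines above.
entries = [
    ("mm8", "Mouse(mm8)", "/path/to/nullseq_indices_mm8"),
    ("hg19", "Human(hg19)", "/path/to/nullseq_indices_hg19"),
]
loc_text = "\n".join(loc_line(*entry) for entry in entries) + "\n"
print(loc_text)
```

The resulting text can be appended to tool-data/nullseq_indices.loc once the paths point at the real index directories.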

FASTA Files - "Fetch Sequences" and Genomes

To generate FASTA files for training or scoring purposes, kmer-SVM uses the built-in Galaxy tool "Fetch Sequences", which looks for genomes in *.nib or *.2bit format. Download genomes related to your data and update the tool-data/alignseq.loc file to include the location of these genomes according to directions in that file. FASTA files can also be provided by the user. "Fetch Sequences" should be set up as follows:

  1. Download 2bit files from the UCSC genome browser. For example:
    http://hgdownload.cse.ucsc.edu/goldenPath/mm8/bigZips/mm8.2bit
    http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.2bit
    http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit
    http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
  2. Add the following lines to galaxy-dist/tool-data/alignseq.loc:
    seq	mm8	/path/to/mm8.2bit
    seq	mm9	/path/to/mm9.2bit
    seq	hg18	/path/to/hg18.2bit
    seq	hg19	/path/to/hg19.2bit
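The download URLs and alignseq.loc entries follow a regular pattern per genome build, so both can be derived from the build name. The sketch below only constructs strings and downloads nothing. The URL template is taken from the example links in this section; the function names and the genome directory argument are hypothetical.

```python
# Pattern observed in the example UCSC links in this section.
UCSC_TEMPLATE = "http://hgdownload.cse.ucsc.edu/goldenPath/{build}/bigZips/{build}.2bit"


def twobit_url(build):
    """Return the 2bit download URL for a genome build such as 'mm9'."""
    return UCSC_TEMPLATE.format(build=build)


def alignseq_line(build, genome_dir):
    """Return a tab-separated alignseq.loc entry for a 2bit file in genome_dir."""
    path = genome_dir.rstrip("/") + "/" + build + ".2bit"
    return "\t".join(["seq", build, path])


for build in ["mm8", "mm9", "hg18", "hg19"]:
    print(twobit_url(build))
    print(alignseq_line(build, "/path/to"))
```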

tool_conf.xml

Add the following lines to galaxy-dist/tool_conf.xml to make kmer-SVM suite tools visible in Galaxy:

    <section name="SVM Tools" id="kmersvm">
      <tool file="kmersvm/classify.xml"/>
      <tool file="kmersvm/nullseq.xml"/>
      <tool file="kmersvm/rocprcurve.xml"/>
      <tool file="kmersvm/train.xml"/>
      <tool file="kmersvm/split_genome.xml"/>
      <tool file="kmersvm/seqprofile.xml" />
    </section>
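A malformed edit to tool_conf.xml can keep tools from appearing, so it may be worth checking that the section parses and lists the expected tool files. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. It parses the section shown above as a string; the function name tool_files is hypothetical.

```python
import xml.etree.ElementTree as ET

# The section from above, reproduced as a string for the check.
SECTION = """<section name="SVM Tools" id="kmersvm">
  <tool file="kmersvm/classify.xml"/>
  <tool file="kmersvm/nullseq.xml"/>
  <tool file="kmersvm/rocprcurve.xml"/>
  <tool file="kmersvm/train.xml"/>
  <tool file="kmersvm/split_genome.xml"/>
  <tool file="kmersvm/seqprofile.xml" />
</section>"""


def tool_files(section_xml):
    """Return the 'file' attribute of each <tool> directly under the section."""
    root = ET.fromstring(section_xml)
    return [tool.get("file") for tool in root.findall("tool")]


for path in tool_files(SECTION):
    print(path)
```

ET.fromstring raises a parse error on malformed XML, so an unclosed tag shows up here rather than at Galaxy startup.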
         

Tool Tests

Galaxy tools come with functional tests to determine if tools are operating correctly. To run tests on Galaxy tools, use the script run_functional_tests.sh. We offer tests for the tools "Train SVM", "Score Sequences of Interest" and "Split Genome".

IDs for kmer-SVM tests can be found by calling run_functional_tests.sh with the '-list' flag.

Non-Galaxy-Based Usage

The kmer-SVM suite can be run without the Galaxy framework. Each tool exists as a standalone Python script (all located in /scripts) which can be called from the command line. Specific documentation can be found within each tool's Python file, or by calling the script with no arguments. A general workflow is described in the paper; it can be followed by calling the relevant Python scripts as outlined below, except that users will have to provide the needed FASTA files themselves.

A simple workflow for the kmer-SVM suite is as follows:

  1. python nullseq_build_indices.py mm8.zip mm8
  2. python nullseq_generate.py sample_input.bed mm8 /path/to/mm8/indices # This assumes no negative data sets. Output will need to be converted to FASTA. Skip if negative data is provided.
  3. python kmersvm_train.py positive.fa negative.fa # Outputs will be two files, one containing SVM weights and the other SVM predictions.
  4. python split_genome.py input.bed # Skip if you already have a list of regions you want to test. Output is a BED file, which will need to be converted to FASTA.
  5. python kmersvm_classify.py weights.out test_seq.fa
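
The five steps above can be collected into a small driver script. The sketch below only assembles the command lines and prints them, so nothing is executed. Script names and positional arguments are taken from the steps above, with the assumption that the null sequence script is named nullseq_generate.py. The function name workflow_commands is hypothetical, and the BED to FASTA conversions noted in steps 2 and 4 are not covered.

```python
def workflow_commands(genome_zip, build, index_dir, sample_bed, regions_bed,
                      positive_fa, negative_fa, weights, test_fa):
    """Return the five workflow steps, in order, as argument lists."""
    return [
        ["python", "nullseq_build_indices.py", genome_zip, build],
        # Assumed file name: nullseq_generate.py.
        ["python", "nullseq_generate.py", sample_bed, build, index_dir],
        ["python", "kmersvm_train.py", positive_fa, negative_fa],
        ["python", "split_genome.py", regions_bed],
        ["python", "kmersvm_classify.py", weights, test_fa],
    ]


commands = workflow_commands(
    "mm8.zip", "mm8", "/path/to/mm8/indices",
    "sample_input.bed", "input.bed",
    "positive.fa", "negative.fa", "weights.out", "test_seq.fa",
)
for argv in commands:
    print(" ".join(argv))
```

Each argument list can be passed to subprocess.check_call to run the step, once the intermediate FASTA files exist.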

Additionally, for any BED file, sequence composition (in terms of length, GC content and repeat fraction) can be obtained by calling make_profile.py as follows:

python make_profile.py input.bed mm8 /path/to/mm8/indices profile.out
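
To illustrate the three quantities in a profile, the sketch below computes them for a single sequence string. It is not the implementation used by make_profile.py, which works from the prebuilt indices. Here the repeat fraction is assumed to be the fraction of lowercase (soft-masked) bases, following the UCSC convention of marking repeats in lowercase. The function name sequence_profile is hypothetical.

```python
def sequence_profile(seq):
    """Return length, GC fraction and lowercase (repeat) fraction of seq."""
    length = len(seq)
    if length == 0:
        return {"length": 0, "gc": 0.0, "repeat": 0.0}
    upper = seq.upper()
    gc_count = upper.count("G") + upper.count("C")
    # Assumption: soft-masked (lowercase) bases mark repeats.
    masked_count = sum(1 for base in seq if base.islower())
    return {
        "length": length,
        "gc": gc_count / float(length),
        "repeat": masked_count / float(length),
    }


print(sequence_profile("ACgtGCAT"))
```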

Note that each tool has its own parameters, which allow the user to further customize the analysis. To learn more about a particular tool, call it without any arguments.

Sample Data Sets

ESRRB

  1. positives: ESRRB_mm8.bed
  2. negatives: ESRRB_mm8_neg10x.bed

GR(3134)

  1. positives: GR_3134_mm8.bed
  2. negatives: GR_3134_mm8_neg10x.bed

GR(att20)

  1. positives: GR_att20_mm8.bed
  2. negatives: GR_att20_mm8_neg10x.bed

EWS-FLI (EWS502)

  1. positives: EWSFLI_ews502_hg18.bed
  2. negatives: EWSFLI_ews502_hg18_neg10x.bed

EWS-FLI (HUVEC)

  1. positives: EWSFLI_huvec_hg18.bed
  2. negatives: EWSFLI_huvec_hg18_neg10x.bed