kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic datasets

Christopher Fletez-Brant*, Dongwon Lee*†, Andrew McCallion and Michael A. Beer†

*, These authors contributed equally to this work.
†, To whom correspondence should be addressed:

McKusick-Nathans Institute of Genetic Medicine,
Johns Hopkins University School of Medicine,
733 N. Broadway, BRB Suite 573,
Baltimore, MD 21205, USA


Massively parallel sequencing technologies have made the generation of genomic datasets a routine component of many biological investigations. For example, ChIP-seq and DNase-seq assays detect genomic regions bound by specific factors or open chromatin regulatory regions. A bottleneck in the interpretation of these data is the identification of the underlying molecular mechanisms which define these TF bound or open chromatin regions. We have recently developed a novel computational methodology which uses a Support Vector Machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short TF binding sites which determine the tissue specificity of these genomic assays [Lee et al. Genome Research 2011]. This regulatory information can 1) give confidence in genomic experiments by recovering previously known binding sites, and 2) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here we describe the development and implementation of a web server to allow the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic datasets. We analyze five recently published datasets and demonstrate how this tool identifies novel accessory factors and repressive sequence elements. kmer-SVM is available at

Installation for kmer-SVM Tool from Galaxy ToolShed

kmer-SVM can be downloaded from the Galaxy Tool Shed, located at by searching for 'kmersvm'. You can also find kmer-SVM here (kmersvm.tar.gz). Once downloaded, extract the file. Inside the resulting directory is the file 'kmersvm'. Move that file to /path/to/galaxy-dist/tools. Generally:

tar -xzvf kmersvm-{changeno}
cd kmersvm-{changeno}
mv kmersvm /path/to/galaxy-dist/tools
where {changeno} is the changeset number assigned by Galaxy.



  1. Galaxy Project Server
  2. Swig (needed specifically to install python-modular package from Shogun Toolbox)
  3. Numpy
  4. Shogun Toolbox, v0.9.3 - v1.10
  5. Bitarray
  6. R
  7. ROCR R Package (Available through CRAN)

Mac Users:

  1. Xcode (Mac App Store)
  2. Fortran Compiler

Note that prebuilt Fortran compiler binaries are available for Mac users. However, if you run into installation difficulties, it may help to build the Fortran compiler from source. Also, be sure to add the location of your Shogun installation to PYTHONPATH.

kmer-SVM has been tested with Python 2.6 and 2.7 on Linux and Mac OS X. At this time kmer-SVM has not been tested on Windows.
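
Before configuring Galaxy, it can be useful to confirm that the Python dependencies listed above are importable from the interpreter Galaxy will use. The following is a minimal sketch, not part of the kmer-SVM distribution. The helper name check_modules is hypothetical, and the import name "shogun" is an assumption that may differ between Shogun versions (some releases expose the modular interface under a different module name).

```python
import importlib


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # "shogun" is an assumed import name; adjust it to match your Shogun build.
    report = check_modules(["numpy", "bitarray", "shogun"])
    for name in sorted(report):
        print("%-10s %s" % (name, "found" if report[name] else "MISSING"))
```

If Shogun is reported as missing even though it is installed, the most likely cause is that its installation directory has not been added to PYTHONPATH.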

Required Files

Tool and Test Files

Use the script to install files required for the use of kmer-SVM tools and tool tests. Call with the path to the tools folder, as follows:

sh /path/to/galaxy-dist/tools

Index Files

For efficient access to genome-wide data "Generate Null Sequence" and "Sequence Profiles" rely on access to binary files (indices) generated by using the script Download the *.tar or *.zip files for each genome to be analyzed. To create indices for a specific genome, call For example:

python  mm8

Alternatively, we offer a handful of prepared index files, which should be downloaded and then extracted:

Next, open the file tool-data/nullseq_indices.loc and add the path to the created indices following the instructions included in that file. For the genomes listed above, you would add the following lines to nullseq_indices.loc:

mm8	Mouse(mm8)	/path/to/nullseq_indices_mm8
mm9	Mouse(mm9)	/path/to/nullseq_indices_mm9
hg18	Human(hg18)	/path/to/nullseq_indices_hg18
hg19	Human(hg19)	/path/to/nullseq_indices_hg19
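
Entries in nullseq_indices.loc are tab-separated, and a stray space in place of a tab is an easy mistake to make. The sketch below shows one way to sanity-check entries before restarting Galaxy. The function name parse_loc_lines is hypothetical, and the three-column layout (build, display name, path) is inferred from the example lines above; lines starting with '#' are assumed to be comments.

```python
def parse_loc_lines(lines):
    """Parse tab-separated .loc entries into (build, label, path) tuples.

    Blank lines and lines starting with '#' are skipped (assumed comments).
    """
    entries = []
    for lineno, raw in enumerate(lines, 1):
        line = raw.rstrip("\n")
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                "line %d: expected 3 tab-separated fields, got %d"
                % (lineno, len(fields))
            )
        entries.append(tuple(field.strip() for field in fields))
    return entries


if __name__ == "__main__":
    sample = [
        "# build, display name, path to indices\n",
        "mm8\tMouse(mm8)\t/path/to/nullseq_indices_mm8\n",
        "hg19\tHuman(hg19)\t/path/to/nullseq_indices_hg19\n",
    ]
    for build, label, path in parse_loc_lines(sample):
        print("%s -> %s (%s)" % (build, path, label))
```

To check a real file, pass the result of open("tool-data/nullseq_indices.loc").readlines() instead of the sample list.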

FASTA Files - "Fetch Sequences" and Genomes

To generate FASTA files for training or scoring purposes, kmer-SVM uses the built-in Galaxy tool "Fetch Sequences", which looks for genomes in *.nib or *.2bit format. Download genomes related to your data and update the tool-data/alignseq.loc file to include the location of these genomes according to directions in that file. FASTA files can also be provided by the user. "Fetch Sequences" should be set up as follows:

  1. Download 2bit files from the UCSC genome browser. For example,
  3. Add the following lines to galaxy-dist/tool-data/alignseq.loc:

    seq	mm8	/path/to/mm8.2bit
    seq	mm9	/path/to/mm9.2bit
    seq	hg18	/path/to/hg18.2bit
    seq	hg19	/path/to/hg19.2bit
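
When several genomes are installed, the alignseq.loc lines can be generated rather than typed by hand. This is a small sketch under the assumption that each entry has the form shown above ("seq", build, path, separated by tabs). The function name alignseq_lines is hypothetical, and the extension check simply mirrors the statement that "Fetch Sequences" looks for *.nib or *.2bit genomes.

```python
def alignseq_lines(genome_paths):
    """Build tab-separated alignseq.loc entries from a {build: path} dict."""
    lines = []
    for build in sorted(genome_paths):
        path = genome_paths[build]
        # "Fetch Sequences" is described as reading .2bit or .nib genomes only.
        if not path.endswith((".2bit", ".nib")):
            raise ValueError("%s: expected a .2bit or .nib path, got %r" % (build, path))
        lines.append("seq\t%s\t%s" % (build, path))
    return lines


if __name__ == "__main__":
    genomes = {
        "mm8": "/path/to/mm8.2bit",
        "hg19": "/path/to/hg19.2bit",
    }
    print("\n".join(alignseq_lines(genomes)))
```

The printed lines can then be appended to galaxy-dist/tool-data/alignseq.loc.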


Add the following lines to galaxy-dist/tool_conf.xml to make kmer-SVM suite tools visible in Galaxy:

    <section name="SVM Tools" id="kmersvm">
      <tool file="kmersvm/classify.xml"/>
      <tool file="kmersvm/nullseq.xml"/>
      <tool file="kmersvm/rocprcurve.xml"/>
      <tool file="kmersvm/train.xml"/>
      <tool file="kmersvm/split_genome.xml"/>
      <tool file="kmersvm/seqprofile.xml" />
    </section>
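
A malformed section (for example, one with an unclosed tag) will prevent the tool configuration from parsing, so it can help to check the snippet before pasting it into tool_conf.xml. The sketch below uses Python's standard xml.etree.ElementTree module to parse a complete section and list the tool files it references. The function name tool_files is hypothetical, and the snippet is checked in isolation rather than inside a full tool_conf.xml.

```python
import xml.etree.ElementTree as ET

SECTION = """<section name="SVM Tools" id="kmersvm">
  <tool file="kmersvm/classify.xml"/>
  <tool file="kmersvm/nullseq.xml"/>
  <tool file="kmersvm/rocprcurve.xml"/>
  <tool file="kmersvm/train.xml"/>
  <tool file="kmersvm/split_genome.xml"/>
  <tool file="kmersvm/seqprofile.xml" />
</section>"""


def tool_files(section_xml):
    """Return the 'file' attribute of every <tool> directly inside the section.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(section_xml)
    return [tool.get("file") for tool in root.findall("tool")]


if __name__ == "__main__":
    for path in tool_files(SECTION):
        print(path)
```

Each listed path is relative to the Galaxy tools directory, so every file should exist under /path/to/galaxy-dist/tools after installation.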

Tool Tests

Galaxy tools come with functional tests to determine if tools are operating correctly. To run tests on Galaxy tools, use the script We offer tests for the tools "Train SVM", "Score Sequences of Interest" and "Split Genome".

IDs for kmer-SVM tests can be found by calling with the '-list' flag.

Non-Galaxy-Based Usage

The kmer-SVM suite can be run without the Galaxy framework. Each tool exists as a standalone Python script (all located in /scripts) that can be called from the command line. Tool-specific documentation can be found within each tool's Python file, or by calling the script with no arguments. The paper describes a general workflow, which can be followed by calling the relevant Python scripts as outlined below, except that users must provide the needed FASTA files themselves.

A simple workflow for the kmer-SVM suite is as follows:

  1. python mm8
  2. python nullseq_generate sample_input.bed mm8 /path/to/mm8/indices #This assumes no negative data sets. Output will need to be converted to FASTA. Skip if negative data is provided.
  3. python positive.fa negative.fa #Outputs will be two files, one containing SVM weights and the other SVM predictions.
  4. python input.bed #Skip if you already have a list of regions you want to test. Output is a BED file, which will need to be converted to FASTA.
  5. python weights.out test_seq.fa

Additionally, for any BED file, sequence composition (in terms of length, GC content and repeat fraction) can be obtained by calling 'make profile' as follows:

python input.bed mm8 /path/to/mm8/indices profile.out
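
To make the three profile quantities concrete, the following sketch computes them for a single sequence string. It is an illustration only and not the suite's implementation: the function name sequence_profile is hypothetical, and treating lowercase (soft-masked) bases as repeats is an assumption about how the repeat fraction is defined.

```python
def sequence_profile(seq):
    """Return (length, GC fraction, repeat fraction) for one DNA sequence.

    Assumption: lowercase letters mark soft-masked repeat bases.
    """
    length = len(seq)
    if length == 0:
        return (0, 0.0, 0.0)
    gc_count = sum(1 for base in seq if base in "GCgc")
    repeat_count = sum(1 for base in seq if base.islower())
    return (length, gc_count / float(length), repeat_count / float(length))


if __name__ == "__main__":
    length, gc, repeats = sequence_profile("ACGTacgt")
    print("length=%d gc=%.2f repeats=%.2f" % (length, gc, repeats))
```

For the example sequence, half of the bases are G or C and half are lowercase, so both fractions are 0.5.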

Note that each tool has its own parameters, which allow the user to further customize the analysis. To learn more about a particular tool, call it without passing any arguments.

Sample Data Sets


  1. positives: ESRRB_mm8.bed
  2. negatives: ESRRB_mm8_neg10x.bed


  1. positives: GR_3134_mm8.bed
  2. negatives: GR_3134_mm8_neg10x.bed


  1. positives: GR_att20_mm8.bed
  2. negatives: GR_att20_mm8_neg10x.bed


  1. positives: EWSFLI_ews502_hg18.bed
  2. negatives: EWSFLI_ews502_hg18_neg10x.bed


  1. positives: EWSFLI_huvec_hg18.bed
  2. negatives: EWSFLI_huvec_hg18_neg10x.bed