gkmSVM-R Tutorial notes

INSTALLATION for linux or mac (R version 3.5 or later):

# Installation of Bioconductor packages:

$ R

> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

> BiocManager::install()

> BiocManager::install(c('GenomicRanges','rtracklayer','BSgenome', 'BSgenome.Hsapiens.UCSC.hg38.masked', 'BSgenome.Hsapiens.UCSC.hg19.masked'))

> install.packages('ROCR','kernlab','seqinr')

> quit()

# If you want to use the mm10 genome, you need to remove the Bioconductor version of BSgenome.Mmusculus.UCSC.mm10.masked and add one that has repeatmasking added, with these two additional steps:

> install.packages('remotes')

> library(remotes)

> BiocManager::install('BSgenome.Mmusculus.UCSC.mm10')

> remotes::install_github('mbeer3/BSgenome.Mmusculus.UCSC.mm10.masked')

# Installation of gkmSVM package:

$ git clone https://github.com/mghandi/gkmSVM.git

$ R CMD INSTALL gkmSVM

-- or --

> install.packages('gkmSVM')

INSTALLATION for linux or mac (R version 3.4 or earlier):

# Installation of Bioconductor packages:

$ R

> source('https://bioconductor.org/biocLite.R')

> biocLite('GenomicRanges')

> biocLite('rtracklayer')

> biocLite('BSgenome')

> biocLite('BSgenome.Hsapiens.UCSC.hg38.masked')

> biocLite('BSgenome.Hsapiens.UCSC.hg19.masked') (or other genomes)

> install.packages('ROCR')

> install.packages('kernlab')

> install.packages('seqinr')

> quit()

# Installation of gkmSVM package:

$ git clone https://github.com/mghandi/gkmSVM.git

$ R CMD INSTALL gkmSVM

Now, to run gkmSVM-R on a CTCF ChIP-seq set similar to Ghandi, Lee, Mohammad-Noori, Beer, PLOS CompBio 2014:

Input files: CTCF_GM12878_hg38_top5k.bed (use 'Save link as..') nr10mers.fa ref.fa alt.fa from https://www.beerlab.org/gkmsvm

1. Generate GC, length, and repeat matched negative set and extract fasta sequence files for ctcfpos.fa and ctcfneg_1x.fa: (Larger negative sets can be generated by increasing xfold, and running time can be decreased by reducing nMaxTrials, at the cost of not matching difficult sequences. In general training on larger sequence sets will produce more accurate and robust models.)

$ R

> library(gkmSVM)

> library(BSgenome.Hsapiens.UCSC.hg38.masked)

> genNullSeqs('CTCF_GM12878_hg38_top5k.bed',nMaxTrials=10,xfold=1,genome=BSgenome.Hsapiens.UCSC.hg38.masked, outputPosFastaFN='CTCF_GM12878_hg38_top5k.fa', outputBedFN='neg1x_CTCF_GM12878_hg38_top5k.bed', outputNegFastaFN='neg1x_CTCF_GM12878_hg38_top5k.fa')

2. Calculate kernel matrix:

> gkmsvm_kernel('CTCF_GM12878_hg38_top5k.fa','neg1x_CTCF_GM12878_hg38_top5k.fa', 'CTCF_GM12878_vs_neg1x_kernel.out')

3. Perform SVM training with cross-validation:

> gkmsvm_trainCV('CTCF_GM12878_vs_neg1x_kernel.out','CTCF_GM12878_hg38_top5k.fa','neg1x_CTCF_GM12878_hg38_top5k.fa',svmfnprfx='CTCF_GM12878_vs_neg1x', outputCVpredfn='CTCF_GM12878_vs_neg1x_cvpred.out', outputROCfn='CTCF_GM12878_vs_neg1x_roc.out')

4. Generate 10-mer weights:

> gkmsvm_classify('nr10mers.fa',svmfnprfx='CTCF_GM12878_vs_neg1x', 'CTCF_GM12878_vs_neg1x_weights.out')

This should get AUROC=.965 and AUPRC=.964 with some small variation arising from the randomly sampled negative sets. You can then select the top weights in unix with:

$ sort -grk 2 CTCF_GM12878_vs_neg1x_weights.out | head -12

or from within R with:

> wts<-read.table('CTCF_GM12878_vs_neg1x_weights.out',header=F)

> wts[order(wts$V2,decreasing=T)[1:12],]

which should give weights very similar to:

CCACTAGGGG 6.832130

CCACTAGAGG 6.473275

CCACCAGGGG 6.409469

CACCTGGTGG 6.213327

CACCTAGTGG 6.134386

CACCAGGTGG 5.822585

CACTAGGGGG 5.806617

CACTAGAGGG 5.769022

CACCAGGGGG 5.469279

CACTAGGTGG 5.446346

CCACCAGAGG 5.126879

CAGCAGAGGG 5.044466

5. To calculate the impact of a variant, in this case on CTCF binding, we use gkmsvm_classify to find the score difference between two alleles given in FASTA format in 'ref.fa' and 'alt.fa'. This is only different by a scale factor from deltaSVM calculated directly from SVM weights, as described in (Lee, Gorkin, Baker, Strober, Aasoni, McCallion, Beer, Nature Genetics 2015).

> gkmsvm_delta('ref.fa','alt.fa',svmfnprfx='CTCF_GM12878_vs_neg1x', 'dsvm_CTCF_GM12878_vs_neg1x.out')

6. Simlar example for mm10 mouse genome:

Input file: CTCF_MEL_mm10_top5k.bed (use 'Save link as..')

> library(gkmSVM)

> library(BSgenome.Mmusculus.UCSC.mm10.masked) (note: Bioconductor version will not work because missing repeat fields RM and TRF, fix using installation instructions above)

> genNullSeqs('CTCF_MEL_mm10_top5k.bed',nMaxTrials=10,xfold=1,genome=BSgenome.Mmusculus.UCSC.mm10.masked, outputPosFastaFN='CTCF_MEL_mm10_top5k.fa', outputBedFN='neg1x_CTCF_MEL_mm10_top5k.bed', outputNegFastaFN='neg1x_CTCF_MEL_mm10_top5k.fa')

> gkmsvm_kernel('CTCF_MEL_mm10_top5k.fa','neg1x_CTCF_MEL_mm10_top5k.fa', 'CTCF_MEL_vs_neg1x_kernel.out')

> gkmsvm_trainCV('CTCF_MEL_vs_neg1x_kernel.out','CTCF_MEL_mm10_top5k.fa','neg1x_CTCF_MEL_mm10_top5k.fa',svmfnprfx='CTCF_MEL_vs_neg1x', outputCVpredfn='CTCF_MEL_vs_neg1x_cvpred.out', outputROCfn='CTCF_MEL_vs_neg1x_roc.out')

> gkmsvm_classify('nr10mers.fa',svmfnprfx='CTCF_MEL_vs_neg1x', 'CTCF_MEL_vs_neg1x_weights.out')

> wts<-read.table('CTCF_MEL_vs_neg1x_weights.out',header=F)

> wts[order(wts$V2,decreasing=T)[1:12],]

Which should have AUROC around 0.97 and give high scoring weights to similar kmers as the human trained model, as CTCF binding specificity is conserved between human and mouse.

If you find this tool useful, please cite:

Ghandi, Mohammad-Noori, Ghareghani, Lee, Garraway, and Beer, Bioinformatics (2016); and

Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014).