gkmSVM-R Tutorial notes
INSTALLATION for linux or mac (R version 3.5 or later):
# Installation of Bioconductor packages:
$ R
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install()
> BiocManager::install(c('GenomicRanges','rtracklayer','BSgenome', 'BSgenome.Hsapiens.UCSC.hg38.masked', 'BSgenome.Hsapiens.UCSC.hg19.masked'))
> install.packages('ROCR','kernlab','seqinr')
> quit()
# If you want to use the mm10 genome, you need to remove the Bioconductor version of BSgenome.Mmusculus.UCSC.mm10.masked and add one that has repeatmasking added, with these two additional steps:
> install.packages('remotes')
> library(remotes)
> BiocManager::install('BSgenome.Mmusculus.UCSC.mm10')
> remotes::install_github('mbeer3/BSgenome.Mmusculus.UCSC.mm10.masked')
# Installation of gkmSVM package:
$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM
-- or --
> install.packages('gkmSVM')
INSTALLATION for linux or mac (R version 3.4 or earlier):
# Installation of Bioconductor packages:
$ R
> source('https://bioconductor.org/biocLite.R')
> biocLite('GenomicRanges')
> biocLite('rtracklayer')
> biocLite('BSgenome')
> biocLite('BSgenome.Hsapiens.UCSC.hg38.masked')
> biocLite('BSgenome.Hsapiens.UCSC.hg19.masked') (or other genomes)
> install.packages('ROCR')
> install.packages('kernlab')
> install.packages('seqinr')
> quit()
# Installation of gkmSVM package:
$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM
Now, to run gkmSVM-R on a CTCF ChIP-seq set similar to Ghandi, Lee, Mohammad-Noori, Beer, PLOS CompBio 2014:
Input files: CTCF_GM12878_hg38_top5k.bed (use 'Save link as..') nr10mers.fa ref.fa alt.fa from http://www.beerlab.org/gkmsvm
1. Generate GC, length, and repeat matched negative set and extract fasta sequence files for ctcfpos.fa and ctcfneg_1x.fa: (Larger negative sets can be generated by increasing xfold, and running time can be decreased by reducing nMaxTrials, at the cost of not matching difficult sequences. In general training on larger sequence sets will produce more accurate and robust models.)
$ R
> library(gkmSVM)
> library(BSgenome.Hsapiens.UCSC.hg38.masked)
> genNullSeqs('CTCF_GM12878_hg38_top5k.bed',nMaxTrials=10,xfold=1,genome=BSgenome.Hsapiens.UCSC.hg38.masked, outputPosFastaFN='CTCF_GM12878_hg38_top5k.fa', outputBedFN='neg1x_CTCF_GM12878_hg38_top5k.bed', outputNegFastaFN='neg1x_CTCF_GM12878_hg38_top5k.fa')
2. Calculate kernel matrix:
> gkmsvm_kernel('CTCF_GM12878_hg38_top5k.fa','neg1x_CTCF_GM12878_hg38_top5k.fa', 'CTCF_GM12878_vs_neg1x_kernel.out')
3. Perform SVM training with cross-validation:
> gkmsvm_trainCV('CTCF_GM12878_vs_neg1x_kernel.out','CTCF_GM12878_hg38_top5k.fa','neg1x_CTCF_GM12878_hg38_top5k.fa',svmfnprfx='CTCF_GM12878_vs_neg1x', outputCVpredfn='CTCF_GM12878_vs_neg1x_cvpred.out', outputROCfn='CTCF_GM12878_vs_neg1x_roc.out')
4. Generate 10-mer weights:
> gkmsvm_classify('nr10mers.fa',svmfnprfx='CTCF_GM12878_vs_neg1x', 'CTCF_GM12878_vs_neg1x_weights.out')
This should get AUROC=.965 and AUPRC=.964 with some small variation arising from the randomly sampled negative sets. You can then select the top weights in unix with:
$ sort -grk 2 CTCF_GM12878_vs_neg1x_weights.out | head -12
or from within R with:
> wts<-read.table('CTCF_GM12878_vs_neg1x_weights.out',header=F)
> wts[order(wts$V2,decreasing=T)[1:12],]
which should give weights very similar to:
CCACTAGGGG 6.832130
CCACTAGAGG 6.473275
CCACCAGGGG 6.409469
CACCTGGTGG 6.213327
CACCTAGTGG 6.134386
CACCAGGTGG 5.822585
CACTAGGGGG 5.806617
CACTAGAGGG 5.769022
CACCAGGGGG 5.469279
CACTAGGTGG 5.446346
CCACCAGAGG 5.126879
CAGCAGAGGG 5.044466
5. To calculate the impact of a variant, in this case on CTCF binding, we use gkmsvm_classify to find the score difference between two alleles given in FASTA format in 'ref.fa' and 'alt.fa'. This is only different by a scale factor from deltaSVM calculated directly from SVM weights, as described in (Lee, Gorkin, Baker, Strober, Aasoni, McCallion, Beer, Nature Genetics 2015).
> gkmsvm_delta('ref.fa','alt.fa',svmfnprfx='CTCF_GM12878_vs_neg1x', 'dsvm_CTCF_GM12878_vs_neg1x.out')
6. Simlar example for mm10 mouse genome:
Input file: CTCF_MEL_mm10_top5k.bed (use 'Save link as..')
> library(gkmSVM)
> library(BSgenome.Mmusculus.UCSC.mm10.masked) (note: Bioconductor version will not work because missing repeat fields RM and TRF, fix using installation instructions above)
> genNullSeqs('CTCF_MEL_mm10_top5k.bed',nMaxTrials=10,xfold=1,genome=BSgenome.Mmusculus.UCSC.mm10.masked, outputPosFastaFN='CTCF_MEL_mm10_top5k.fa', outputBedFN='neg1x_CTCF_MEL_mm10_top5k.bed', outputNegFastaFN='neg1x_CTCF_MEL_mm10_top5k.fa')
> gkmsvm_kernel('CTCF_MEL_mm10_top5k.fa','neg1x_CTCF_MEL_mm10_top5k.fa', 'CTCF_MEL_vs_neg1x_kernel.out')
> gkmsvm_trainCV('CTCF_MEL_vs_neg1x_kernel.out','CTCF_MEL_mm10_top5k.fa','neg1x_CTCF_MEL_mm10_top5k.fa',svmfnprfx='CTCF_MEL_vs_neg1x', outputCVpredfn='CTCF_MEL_vs_neg1x_cvpred.out', outputROCfn='CTCF_MEL_vs_neg1x_roc.out')
> gkmsvm_classify('nr10mers.fa',svmfnprfx='CTCF_MEL_vs_neg1x', 'CTCF_MEL_vs_neg1x_weights.out')
> wts<-read.table('CTCF_MEL_vs_neg1x_weights.out',header=F)
> wts[order(wts$V2,decreasing=T)[1:12],]
Which should have AUROC around 0.97 and give high scoring weights to similar kmers as the human trained model, as CTCF binding specificity is conserved between human and mouse.
If you find this tool useful, please cite:
Ghandi, Mohammad-Noori, Ghareghani,
Lee, Garraway, and Beer, Bioinformatics (2016); and
Ghandi, Lee, Mohammad-Noori, and Beer, PLOS
Computational Biology (2014).