Difference between revisions of "Tutorial"

From BeerLab

Latest revision as of 19:23, 5 August 2019

gkmSVM-R Tutorial notes

INSTALLATION for Linux or Mac (R 3.5 or later)

$ R
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install()
> BiocManager::install(c('GenomicRanges','rtracklayer','BSgenome', 'BSgenome.Hsapiens.UCSC.hg19.masked', 'BSgenome.Hsapiens.UCSC.hg18.masked'))
> install.packages(c('ROCR','kernlab','seqinr'))

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')

INSTALLATION for Linux or Mac (R 3.4 or earlier)

$ R
> source("https://bioconductor.org/biocLite.R")
> biocLite('GenomicRanges')
> biocLite('rtracklayer')
> biocLite('BSgenome')
> biocLite('BSgenome.Hsapiens.UCSC.hg19.masked')   # or other genomes
> biocLite('BSgenome.Hsapiens.UCSC.hg18.masked')
> install.packages('ROCR')
> install.packages('kernlab')
> install.packages('seqinr')
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')
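
After either installation route, it can help to confirm that the packages are actually available before starting the tutorial. The following is a minimal sketch using only base R; the package list is simply the set of names installed above (the BSgenome data packages are left out), and the variable names are illustrative.

```r
# Packages the installation steps above are meant to provide.
required <- c("GenomicRanges", "rtracklayer", "BSgenome",
              "ROCR", "kernlab", "seqinr", "gkmSVM")

# requireNamespace() returns TRUE/FALSE without attaching the package.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs to be installed.
missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) == 0) {
  message("All required packages are installed.")
} else {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```

Any package reported as missing can be installed with the corresponding command from the installation section.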


Now run gkmSVM-R on the CTCF test set from Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014):

Input files: ctcfpos.bed, nr10mers.fa, ref.fa, alt.fa from www.beerlab.org/gkmsvm

1. generate a GC-, length-, and repeat-matched negative set and write the FASTA sequence files ctcfpos.fa and ctcfneg_1x.fa. Larger negative sets can be generated by increasing xfold, and running time can be reduced by lowering nMaxTrials, at the cost of leaving difficult sequences unmatched. In general, training on larger sequence sets produces more accurate and robust models:

$ R
> library(gkmSVM)
> genNullSeqs('ctcfpos.bed',nMaxTrials=10,xfold=1,genomeVersion='hg18', outputPosFastaFN='ctcfpos.fa', outputBedFN='ctcfneg_1x.bed', outputNegFastaFN='ctcfneg_1x.fa')
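
To make "GC-matched" concrete, here is a small base-R illustration of comparing GC fractions between a positive sequence and candidate negatives of the same length. It is only a sketch of the idea, not the procedure genNullSeqs uses internally; the sequences and the helper name gc_content are hypothetical.

```r
# Fraction of G and C bases in a DNA string (case-insensitive).
gc_content <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  sum(bases %in% c("G", "C")) / length(bases)
}

# Toy sequences (hypothetical, not taken from ctcfpos.fa).
pos <- "CCACTAGGGGGCGCTA"
neg_candidates <- c(a = "ATATATTTAAATATAT", b = "GCACTTGGCGGAGCTC")

gc_pos <- gc_content(pos)
gc_neg <- vapply(neg_candidates, gc_content, numeric(1))

# Pick the same-length candidate whose GC fraction is closest to the positive's.
best <- names(which.min(abs(gc_neg - gc_pos)))
print(best)
```

Here candidate b has the same GC fraction as the positive sequence, so it is the better match; candidate a contains no G or C at all.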

2. calculate kernel matrix:

> gkmsvm_kernel('ctcfpos.fa','ctcfneg_1x.fa', 'ctcf_1x_kernel.out')

3. perform SVM training with cross-validation:

> gkmsvm_trainCV('ctcf_1x_kernel.out','ctcfpos.fa','ctcfneg_1x.fa',svmfnprfx='ctcf_1x', outputCVpredfn='ctcf_1x_cvpred.out', outputROCfn='ctcf_1x_roc.out')
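
Cross-validation performance is commonly summarised by the area under the ROC curve (AUROC). As a rough sketch of what that number means, the base-R function below computes AUROC from scores and class labels using the rank-sum (Mann-Whitney) identity. The scores and the +1/-1 label coding are made up for illustration and are not read from the gkmsvm_trainCV output files.

```r
# AUROC = probability that a random positive outscores a random negative,
# computed from the sum of the ranks of the positive examples.
auroc <- function(scores, labels) {
  r <- rank(scores)                 # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == -1)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical cross-validation scores with +1 / -1 labels.
scores <- c(2.1, 1.3, 0.4, -0.2, -1.0, -1.7)
labels <- c(1, 1, -1, 1, -1, -1)

print(auroc(scores, labels))
```

In this toy case 8 of the 9 positive-negative pairs are ranked correctly, so the function returns 8/9.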

4. generate 10-mer weights:

> gkmsvm_classify('nr10mers.fa',svmfnprfx='ctcf_1x', 'ctcf_1x_weights.out')

The cross-validation in step 3 should give AUROC = 0.955 and AUPRC = 0.954, with small variation arising from the randomly sampled negative set. You can then select the top weights from step 4 with:

$ sort -grk 2 ctcf_1x_weights.out | head -12

which should give weights very similar to:

CACCTGGTGG      5.133463 
CACCAGGTGG      5.090566 
CACCAGGGGG      5.038873 
CCACTAGGGG      4.833398 
CCACCAGGGG      4.832404 
CACCTAGTGG      4.782613 
CACCAGAGGG      4.707206 
CACTAGGGGG      4.663015 
CACTAGAGGG      4.610800 
CACTAGGTGG      4.580834 
CCACTAGAGG      4.529869 
CAGCAGAGGG      4.335304 
...
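
The same ranking can be done without leaving R. The sketch below assumes the weights file has the two whitespace-separated columns shown in the listing (10-mer, weight) and, to stay self-contained, reads a few of those lines from a string rather than from ctcf_1x_weights.out; the column names are illustrative.

```r
# A few lines in the two-column layout of the listing (10-mer, weight).
weights_txt <- "CACCAGGGGG 5.038873
CAGCAGAGGG 4.335304
CACCTGGTGG 5.133463
CACCAGGTGG 5.090566"

# For the real file, replace text = weights_txt with the file name
# 'ctcf_1x_weights.out' as the first argument.
w <- read.table(text = weights_txt, col.names = c("kmer", "weight"),
                stringsAsFactors = FALSE)

# Sort by weight, largest first, and show the top entries.
top <- w[order(w$weight, decreasing = TRUE), ]
print(head(top, 3))
```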

5. To calculate the impact of a variant, in this case on CTCF binding, we use gkmsvm_delta to find the score difference between two alleles given in FASTA format in 'ref.fa' and 'alt.fa'. This differs only by a scale factor from deltaSVM calculated directly from SVM weights, as described in (Lee, Gorkin, Baker, Strober, Asoni, McCallion, Beer, Nature Genetics 2015).

> gkmsvm_delta('ref.fa','alt.fa',svmfnprfx='ctcf_1x', 'dsvm_ctcf_1x.out')
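
For intuition about computing deltaSVM directly from k-mer weights, the toy sketch below scores each allele as the sum of the weights of the k-mers it contains and takes the difference. It is an illustration under stated assumptions, not the gkmsvm_delta implementation: it uses hypothetical 3-mer weights instead of 10-mers, ignores reverse complements, treats k-mers without a weight as 0, and assumes the sign convention alt minus ref.

```r
# Hypothetical toy weights for 3-mers (the tutorial itself uses 10-mers).
w <- c(ACG = 1.0, CGT = 0.5, GTA = -0.2, AGG = 0.1, GGT = -0.4)

# Sum the weights of every k-mer in a sequence; unseen k-mers count as 0.
score_seq <- function(dna, w, k = 3) {
  starts <- seq_len(nchar(dna) - k + 1)
  kmers <- substring(dna, starts, starts + k - 1)
  s <- w[kmers]
  sum(s[!is.na(s)])
}

ref <- "ACGTA"   # reference allele C at position 2
alt <- "AGGTA"   # alternate allele G at position 2

# Score change caused by the variant (alt minus ref, by assumption).
delta <- score_seq(alt, w) - score_seq(ref, w)
print(delta)
```

Only k-mers that overlap the variant contribute to the difference; the shared k-mer GTA cancels out.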


If you find this tool useful, please cite:

Ghandi, Mohammad-Noori, Ghareghani, Lee, Garraway, and Beer, Bioinformatics (2016); and
Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014).