Difference between pages "Publications" and "Tutorial"

Revision as of 19:19, 5 August 2019

gkmSVM-R Tutorial notes

INSTALLATION for Linux or Mac (R 3.5 or later)

$ R
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install()
> BiocManager::install(c('GenomicRanges','rtracklayer','BSgenome', 'BSgenome.Hsapiens.UCSC.hg19.masked', 'BSgenome.Hsapiens.UCSC.hg18.masked'))
> install.packages(c('ROCR','kernlab','seqinr'))

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')

INSTALLATION for Linux or Mac (R 3.4 or earlier)

$ R
> source("https://bioconductor.org/biocLite.R")
> biocLite('GenomicRanges')
> biocLite('rtracklayer')
> biocLite('BSgenome')
> biocLite('BSgenome.Hsapiens.UCSC.hg19.masked')  # or other genomes
> biocLite('BSgenome.Hsapiens.UCSC.hg18.masked')
> install.packages('ROCR')
> install.packages('kernlab')
> install.packages('seqinr')
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')

Now run gkmSVM-R on the CTCF test set from Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014):

Input files: ctcfpos.bed, nr10mers.fa, ref.fa, alt.fa from www.beerlab.org/gkmsvm
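
The four input files listed above can be fetched from the command line. The following is a minimal sketch, assuming curl is installed and that each file lives directly under the site path given above; the helper names list_input_urls and fetch_inputs are illustrative and not part of gkmSVM.

```shell
# Base URL and file names as given in the tutorial.
BASE_URL="http://www.beerlab.org/gkmsvm"
INPUT_FILES="ctcfpos.bed nr10mers.fa ref.fa alt.fa"

# Print one download URL per input file (hypothetical helper).
list_input_urls() {
  for f in $INPUT_FILES; do
    echo "$BASE_URL/$f"
  done
}

# Download every input file that is not already present (hypothetical helper).
fetch_inputs() {
  for f in $INPUT_FILES; do
    [ -s "$f" ] || curl -fL -o "$f" "$BASE_URL/$f"
  done
}

# Only list the URLs here; call fetch_inputs to actually download.
list_input_urls
```

Running fetch_inputs in the working directory where R will be started keeps the relative file names used in the steps below valid.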

1. generate a GC-, length-, and repeat-matched negative set and extract the FASTA sequence files ctcfpos.fa and ctcfneg_1x.fa: (Larger negative sets can be generated by increasing xfold, and running time can be decreased by reducing nMaxTrials, at the cost of leaving difficult sequences unmatched. In general, training on larger sequence sets produces more accurate and robust models.)

$ R
> library(gkmSVM)
> genNullSeqs('ctcfpos.bed',nMaxTrials=10,xfold=1,genomeVersion='hg18', outputPosFastaFN='ctcfpos.fa', outputBedFN='ctcfneg_1x.bed', outputNegFastaFN='ctcfneg_1x.fa')
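
Because a low nMaxTrials can leave difficult sequences unmatched, it may be worth checking how many negative sequences were actually written. The sketch below assumes the usual FASTA convention of one '>' header line per sequence; count_fasta is a hypothetical helper, not part of gkmSVM, and the demo file is a stand-in so the example runs on its own.

```shell
# Count sequences in a FASTA file by counting '>' header lines.
# count_fasta is a hypothetical helper, not part of gkmSVM.
count_fasta() {
  grep -c '^>' "$1"
}

# Tiny stand-in FASTA file with two sequences.
demo=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$demo"
echo "sequences in demo file: $(count_fasta "$demo")"
rm -f "$demo"
```

From the shell, comparing count_fasta ctcfpos.fa with count_fasta ctcfneg_1x.fa gives a quick sanity check: with xfold=1 the two counts should be close, and a noticeably smaller negative count suggests that some sequences went unmatched.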

2. calculate kernel matrix:

> gkmsvm_kernel('ctcfpos.fa','ctcfneg_1x.fa', 'ctcf_1x_kernel.out')

3. perform SVM training with cross-validation:

> gkmsvm_trainCV('ctcf_1x_kernel.out','ctcfpos.fa','ctcfneg_1x.fa',svmfnprfx='ctcf_1x', outputCVpredfn='ctcf_1x_cvpred.out', outputROCfn='ctcf_1x_roc.out')

4. generate 10-mer weights:

> gkmsvm_classify('nr10mers.fa',svmfnprfx='ctcf_1x', 'ctcf_1x_weights.out')

This should give AUROC = 0.955 and AUPRC = 0.954, with some small variation arising from the randomly sampled negative sets. You can then select the top weights with:

$ sort -grk 2 ctcf_1x_weights.out | head -12

which should give weights very similar to:

CACCTGGTGG 5.133463
CACCAGGTGG 5.090566
CACCAGGGGG 5.038873
CCACTAGGGG 4.833398
CCACCAGGGG 4.832404
CACCTAGTGG 4.782613
CACCAGAGGG 4.707206
CACTAGGGGG 4.663015
CACTAGAGGG 4.610800
CACTAGGTGG 4.580834
CCACTAGAGG 4.529869
CAGCAGAGGG 4.335304
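
To see what the sort flags do (-g compares general numeric values, -r reverses to descending order, -k 2 keys on the second field), here is a small sketch on a stand-in file. The helper name top_weights is hypothetical, and the demo file holds three rows copied from the weights listed above, deliberately out of order; a tab-separated layout is assumed.

```shell
# top_weights FILE [N]: print the N rows with the largest second-column value.
# Hypothetical helper wrapping the same sort/head pipeline used above.
top_weights() {
  sort -g -r -k 2 "$1" | head -n "${2:-12}"
}

# Stand-in weights file: three rows from the list above, out of order.
demo=$(mktemp)
printf 'CACTAGGTGG\t4.580834\nCACCTGGTGG\t5.133463\nCCACTAGGGG\t4.833398\n' > "$demo"
top_weights "$demo" 2
rm -f "$demo"
```

On the real output, top_weights ctcf_1x_weights.out 12 is equivalent to the command shown above.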

5. To calculate the impact of a variant, in this case on CTCF binding, we use gkmsvm_delta to find the score difference between two alleles given in FASTA format in 'ref.fa' and 'alt.fa'. This differs only by a scale factor from deltaSVM calculated directly from SVM weights, as described in Lee, Gorkin, Baker, Strober, Asoni, McCallion, and Beer, Nature Genetics (2015).

> gkmsvm_delta('ref.fa','alt.fa',svmfnprfx='ctcf_1x', 'dsvm_ctcf_1x.out')
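
The idea of a per-variant score difference can be illustrated by hand. The sketch below is only an illustration under stated assumptions: it takes two hypothetical tab-separated score files with one id and score per row, listing the same sequences in the same order. The file layout, the helper name delta_scores, the numeric values, and the alt-minus-ref sign convention are all assumptions, not the documented output of gkmsvm_delta.

```shell
# delta_scores REF ALT: print id and (alt score - ref score) for each row.
# Assumes both files list the same sequences in the same order.
delta_scores() {
  paste "$1" "$2" | awk -F '\t' '{ printf "%s\t%.4f\n", $1, $4 - $2 }'
}

ref=$(mktemp); alt=$(mktemp)
printf 'snp1\t1.50\nsnp2\t-0.25\n' > "$ref"   # made-up reference-allele scores
printf 'snp1\t0.50\nsnp2\t0.75\n'  > "$alt"   # made-up alternate-allele scores
delta_scores "$ref" "$alt"
rm -f "$ref" "$alt"
```

Under these assumptions, a negative difference means the alternate allele scores lower than the reference. In practice gkmsvm_delta produces the per-sequence differences directly, so this helper is not needed for the tutorial itself.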

If you find this tool useful, please cite:

Ghandi, Mohammad-Noori, Ghareghani, Lee, Garraway, and Beer, Bioinformatics (2016); and
Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014).

