Difference between pages "Lab Members" and "Tutorial"

From BeerLab

Revision as of 17:49, 5 August 2019

gkmSVM-R Tutorial notes

INSTALLATION for Linux or macOS (R 3.5 or later)

$ R

> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install()
> BiocManager::install(c('GenomicRanges','rtracklayer','BSgenome', 'BSgenome.Hsapiens.UCSC.hg19.masked', 'BSgenome.Hsapiens.UCSC.hg18.masked'))
> install.packages(c('ROCR','kernlab','seqinr'))
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')

INSTALLATION for Linux or macOS (R 3.4 or earlier)

$ R
> source("https://bioconductor.org/biocLite.R")
> biocLite('GenomicRanges')
> biocLite('rtracklayer')
> biocLite('BSgenome')
> biocLite('BSgenome.Hsapiens.UCSC.hg19.masked')    # or other genomes
> biocLite('BSgenome.Hsapiens.UCSC.hg18.masked')
> install.packages('ROCR')
> install.packages('kernlab')
> install.packages('seqinr')
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')
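
Before moving on, it can help to confirm that every package named above actually installed. The following is a minimal sketch in base R; missing_packages is a hypothetical helper written for this note, not part of gkmSVM, and the package list is simply copied from the steps above (swap in the genome package you installed).

```r
# Packages named in the installation steps above.
required <- c("GenomicRanges", "rtracklayer", "BSgenome",
              "BSgenome.Hsapiens.UCSC.hg18.masked",
              "ROCR", "kernlab", "seqinr", "gkmSVM")

# Hypothetical helper: return the subset of `pkgs` that is not installed
# in any of the current library paths.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

not_found <- missing_packages(required)
if (length(not_found) == 0) {
  message("All required packages are installed.")
} else {
  message("Still missing: ", paste(not_found, collapse = ", "))
}
```

If anything is reported as missing, rerun the matching install command before continuing.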


Now to run gkmSVM-R on the CTCF test set from Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014):

Input files: ctcfpos.bed and nr10mers.fa

1. generate a GC-, length-, and repeat-matched negative set and extract FASTA sequences into ctcfpos.fa and ctcfneg_1x.fa: (Larger negative sets can be generated by increasing xfold, and running time can be reduced by lowering nMaxTrials, at the cost of leaving difficult sequences unmatched. In general, training on larger sequence sets produces more accurate and robust models.)

$ R
> library(gkmSVM)
> genNullSeqs('ctcfpos.bed',nMaxTrials=10,xfold=1,genomeVersion='hg18', outputPosFastaFN='ctcfpos.fa', outputBedFN='ctcfneg_1x.bed', outputNegFastaFN='ctcfneg_1x.fa')
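
One way to sanity-check the matching is to compare sequence counts, mean length, and mean GC fraction between the two FASTA files. The sketch below uses only base R; read_fasta and gc_fraction are hypothetical helpers written for this note (not gkmSVM functions), and the two small temporary files are made-up stand-ins for ctcfpos.fa and ctcfneg_1x.fa.

```r
# Hypothetical helper: read a FASTA file into a named character vector,
# joining multi-line records into one string per sequence.
read_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)                       # record index for every line
  seqs <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")
  ids <- sub("^>", "", lines[is_header])
  setNames(as.character(seqs), ids[as.integer(names(seqs))])
}

# Hypothetical helper: fraction of G/C among the A/C/G/T bases of each sequence.
gc_fraction <- function(seqs) {
  s <- toupper(seqs)
  n_gc <- nchar(gsub("[^GC]", "", s))
  n_acgt <- nchar(gsub("[^ACGT]", "", s))
  n_gc / n_acgt
}

# Toy stand-ins for ctcfpos.fa and ctcfneg_1x.fa (made-up sequences).
pos_file <- tempfile(fileext = ".fa")
neg_file <- tempfile(fileext = ".fa")
writeLines(c(">pos1", "ACGTGGCC", ">pos2", "GGGCCCAT", "AT"), pos_file)
writeLines(c(">neg1", "TGCAGCGC", ">neg2", "CCGGGCTATA"), neg_file)

pos <- read_fasta(pos_file)
neg <- read_fasta(neg_file)

summary_table <- data.frame(
  set = c("positive", "negative"),
  n = c(length(pos), length(neg)),
  mean_length = c(mean(nchar(pos)), mean(nchar(neg))),
  mean_gc = c(mean(gc_fraction(pos)), mean(gc_fraction(neg)))
)
print(summary_table)
```

On the real files, replace the temporary paths with 'ctcfpos.fa' and 'ctcfneg_1x.fa'. With xfold=1 the two rows should look similar, although some positive sequences may go unmatched when nMaxTrials is small.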

2. calculate kernel matrix:

> gkmsvm_kernel('ctcfpos.fa','ctcfneg_1x.fa', 'ctcf_1x_kernel.out')

3. perform SVM training with cross-validation:

> gkmsvm_trainCV('ctcf_1x_kernel.out','ctcfpos.fa','ctcfneg_1x.fa',svmfnprfx='ctcf_1x', outputCVpredfn='ctcf_1x_cvpred.out', outputROCfn='ctcf_1x_roc.out')
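
As a rough illustration of what the area under the ROC curve measures, the sketch below computes it from a vector of scores and a vector of labels using the rank-sum (Mann-Whitney) identity. This is a generic base-R calculation, not the code gkmsvm_trainCV runs internally; auroc is a hypothetical helper, and the scores and labels are made up.

```r
# Hypothetical helper: AUROC as the probability that a randomly chosen
# positive scores higher than a randomly chosen negative (ties count 1/2).
auroc <- function(scores, labels) {
  is_pos <- labels > 0
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)                                  # average ranks for ties
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up cross-validation scores: three positives (1) and three negatives (-1).
scores <- c(2.1, 1.3, 0.4, -0.2, -1.5, 0.9)
labels <- c(1, 1, 1, -1, -1, -1)
print(auroc(scores, labels))

# To try this on the real output, one could read ctcf_1x_cvpred.out with
# read.table(); which columns hold the score and the label is an assumption
# to check against the file itself before relying on the result.
```

Eight of the nine positive-negative pairs in the toy data are ordered correctly, so the function returns 8/9.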

4. generate 10-mer weights:

> gkmsvm_classify('nr10mers.fa',svmfnprfx='ctcf_1x', 'ctcf_1x_weights.out')

This should give AUROC = 0.955 and AUPRC = 0.954, with small variations arising from the randomly sampled negative set. You can then select the top weights with:

$ sort -grk 2 ctcf_1x_weights.out | head -12

which should give weights very similar to:

CACCTGGTGG      5.133463
CACCAGGTGG      5.090566
CACCAGGGGG      5.038873
CCACTAGGGG      4.833398
CCACCAGGGG      4.832404
CACCTAGTGG      4.782613
CACCAGAGGG      4.707206
CACTAGGGGG      4.663015
CACTAGAGGG      4.610800
CACTAGGTGG      4.580834
CCACTAGAGG      4.529869
CAGCAGAGGG      4.335304
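
The same selection can be done without leaving R. The sketch below assumes the weights file has two whitespace-separated columns (sequence, then weight), as in the listing above; top_weights is a hypothetical helper, and the temporary file holds four of the lines shown above in shuffled order.

```r
# Hypothetical helper: read a two-column weights file and return the
# n rows with the largest weights, in decreasing order.
top_weights <- function(path, n = 12) {
  w <- read.table(path, col.names = c("kmer", "weight"),
                  stringsAsFactors = FALSE)
  head(w[order(w$weight, decreasing = TRUE), ], n)
}

# Toy stand-in for ctcf_1x_weights.out, using values from the listing above.
demo_file <- tempfile(fileext = ".out")
writeLines(c("CAGCAGAGGG 4.335304",
             "CACCAGGTGG 5.090566",
             "CACTAGGTGG 4.580834",
             "CACCTGGTGG 5.133463"), demo_file)

print(top_weights(demo_file, n = 2))
```

On the real output, call top_weights('ctcf_1x_weights.out') to get the same twelve rows as the sort command.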

If you find this tool useful, please cite:

Ghandi, Mohammad-Noori, Ghareghani, Lee, Garraway, and Beer, Bioinformatics (2016); and
Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014).