Difference between pages "Recent News" and "Tutorial"

From BeerLab
Revision as of 19:19, 5 August 2019

gkmSVM-R Tutorial notes

INSTALLATION for Linux or Mac (R 3.5 or later)

$ R
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install()
> BiocManager::install(c('GenomicRanges','rtracklayer','BSgenome', 'BSgenome.Hsapiens.UCSC.hg19.masked', 'BSgenome.Hsapiens.UCSC.hg18.masked'))
> install.packages(c('ROCR','kernlab','seqinr'))
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')

INSTALLATION for Linux or Mac (R 3.4 or earlier)

$ R
> source("https://bioconductor.org/biocLite.R")
> biocLite('GenomicRanges')
> biocLite('rtracklayer')
> biocLite('BSgenome')
> biocLite('BSgenome.Hsapiens.UCSC.hg19.masked')  # or other genomes
> biocLite('BSgenome.Hsapiens.UCSC.hg18.masked')
> install.packages('ROCR')
> install.packages('kernlab')
> install.packages('seqinr')
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')
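
Before moving on, it may help to confirm that R can see the dependencies. The following is a minimal sketch using only base R. The package list is copied from the installation steps above, and the variable names (required_pkgs, installed, missing_pkgs) are illustrative only.

```r
# Packages named in the installation steps above, plus gkmSVM itself.
required_pkgs <- c("GenomicRanges", "rtracklayer", "BSgenome",
                   "BSgenome.Hsapiens.UCSC.hg19.masked",
                   "BSgenome.Hsapiens.UCSC.hg18.masked",
                   "ROCR", "kernlab", "seqinr", "gkmSVM")

# requireNamespace() returns TRUE or FALSE without attaching the package.
installed <- vapply(required_pkgs, requireNamespace, logical(1),
                    quietly = TRUE)

# Report anything that still needs to be installed.
missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If anything is reported as missing, rerun the matching installation command above.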


Now run gkmSVM-R on the CTCF test set from Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014):

Input files: ctcfpos.bed, nr10mers.fa, ref.fa, alt.fa from www.beerlab.org/gkmsvm
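
The input files can also be fetched from within R. This is a sketch that assumes each file is served directly under the address given above; fetch_inputs() is a hypothetical helper written for this example and is not part of gkmSVM.

```r
# Input files named above; the base URL is the site given in the text.
base_url <- "http://www.beerlab.org/gkmsvm"
input_files <- c("ctcfpos.bed", "nr10mers.fa", "ref.fa", "alt.fa")
input_urls <- file.path(base_url, input_files)

# Download each file into dest_dir unless it is already there.
# fetch_inputs() is a hypothetical helper, not part of gkmSVM.
fetch_inputs <- function(dest_dir = ".") {
  for (i in seq_along(input_files)) {
    dest <- file.path(dest_dir, input_files[i])
    if (!file.exists(dest)) {
      download.file(input_urls[i], destfile = dest, mode = "wb")
    }
  }
  invisible(file.path(dest_dir, input_files))
}
```

Calling fetch_inputs() from the working directory leaves the four files next to the outputs produced in the steps below.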

1. Generate a GC-, length-, and repeat-matched negative set and extract the FASTA sequence files ctcfpos.fa and ctcfneg_1x.fa: (Larger negative sets can be generated by increasing xfold, and running time can be decreased by reducing nMaxTrials, at the cost of not matching difficult sequences. In general, training on larger sequence sets will produce more accurate and robust models.)

$ R
> library(gkmSVM)
> genNullSeqs('ctcfpos.bed',nMaxTrials=10,xfold=1,genomeVersion='hg18', outputPosFastaFN='ctcfpos.fa', outputBedFN='ctcfneg_1x.bed', outputNegFastaFN='ctcfneg_1x.fa')
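
Because difficult sequences may go unmatched, the negative set can end up smaller than xfold times the positive set. One way to check is to count the FASTA records in both files. The sketch below does this in base R on two toy files that stand in for ctcfpos.fa and ctcfneg_1x.fa; count_fasta_records() is a hypothetical helper, not a gkmSVM function.

```r
# Count FASTA records by counting header lines (those starting with ">").
# count_fasta_records() is a hypothetical helper, not part of gkmSVM.
count_fasta_records <- function(path) {
  sum(grepl("^>", readLines(path)))
}

# Toy files standing in for ctcfpos.fa and ctcfneg_1x.fa.
pos_fa <- tempfile(fileext = ".fa")
neg_fa <- tempfile(fileext = ".fa")
writeLines(c(">pos1", "CACCAGGTGG", ">pos2", "CCACTAGGGG"), pos_fa)
writeLines(c(">neg1", "ATATTGCAAT"), neg_fa)

n_pos <- count_fasta_records(pos_fa)
n_neg <- count_fasta_records(neg_fa)
cat("positives:", n_pos, " negatives:", n_neg,
    " ratio:", n_neg / n_pos, "\n")
```

On real output, replace the temporary paths with 'ctcfpos.fa' and 'ctcfneg_1x.fa'. A ratio well below xfold suggests that raising nMaxTrials may be worthwhile.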

2. Calculate the kernel matrix:

> gkmsvm_kernel('ctcfpos.fa','ctcfneg_1x.fa', 'ctcf_1x_kernel.out')

3. Perform SVM training with cross-validation:

> gkmsvm_trainCV('ctcf_1x_kernel.out','ctcfpos.fa','ctcfneg_1x.fa',svmfnprfx='ctcf_1x', outputCVpredfn='ctcf_1x_cvpred.out', outputROCfn='ctcf_1x_roc.out')
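
Cross-validation performance is usually summarised by the area under the ROC curve (AUROC). As a reminder of what that number measures, here is a small base-R sketch that computes AUROC from scores and labels with the rank-sum formulation. The scores are made up, auroc() is a hypothetical helper, and nothing is assumed about the layout of the files written by gkmsvm_trainCV.

```r
# AUROC as the probability that a random positive outscores a random
# negative (rank-sum formulation). auroc() is a hypothetical helper.
auroc <- function(scores, labels) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # average ranks, so a tie counts as one half
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores: three positives (label 1) and three negatives (label -1).
toy_scores <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.1)
toy_labels <- c(1, 1, 1, -1, -1, -1)
toy_auc <- auroc(toy_scores, toy_labels)
print(toy_auc)
```

In the toy data, eight of the nine positive-negative pairs are ranked correctly, so the value is 8/9.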

4. Generate 10-mer weights:

> gkmsvm_classify('nr10mers.fa',svmfnprfx='ctcf_1x', 'ctcf_1x_weights.out')

Training should give AUROC = 0.955 and AUPRC = 0.954, with some small variation arising from the randomly sampled negative set. You can then select the top weights with:

$ sort -grk 2 ctcf_1x_weights.out | head -12

which should give weights very similar to:

CACCTGGTGG 5.133463
CACCAGGTGG 5.090566
CACCAGGGGG 5.038873
CCACTAGGGG 4.833398
CCACCAGGGG 4.832404
CACCTAGTGG 4.782613
CACCAGAGGG 4.707206
CACTAGGGGG 4.663015
CACTAGAGGG 4.610800
CACTAGGTGG 4.580834
CCACTAGAGG 4.529869
CAGCAGAGGG 4.335304
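
The same selection can be done without leaving R. The sketch below assumes the weights file has two whitespace-separated columns (10-mer, then weight), which is what the sort command above relies on. To stay self-contained it reads a toy file built from four of the rows listed above rather than the real ctcf_1x_weights.out.

```r
# Toy weights file standing in for ctcf_1x_weights.out, using four of the
# rows listed above in shuffled order.
weights_file <- tempfile(fileext = ".out")
writeLines(c("CACTAGGTGG\t4.580834",
             "CACCTGGTGG\t5.133463",
             "CAGCAGAGGG\t4.335304",
             "CACCAGGTGG\t5.090566"), weights_file)

# Read the two columns, keeping the 10-mers as plain strings.
w <- read.table(weights_file, header = FALSE,
                col.names = c("kmer", "weight"),
                stringsAsFactors = FALSE)

# Sort by decreasing weight and keep the top rows (12 in the tutorial).
top <- head(w[order(w$weight, decreasing = TRUE), ], 12)
print(top)
```

Pointing weights_file at 'ctcf_1x_weights.out' should reproduce the ranking printed by the shell command.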

5. To calculate the impact of a variant, in this case on CTCF binding, we use gkmsvm_delta to find the score difference between two alleles given in FASTA format in ‘ref.fa’ and ‘alt.fa’. This differs only by a scale factor from deltaSVM calculated directly from SVM weights, as described in (Lee, Gorkin, Baker, Strober, Asoni, McCallion, Beer, Nature Genetics 2015).

> gkmsvm_delta('ref.fa','alt.fa',svmfnprfx='ctcf_1x', 'dsvm_ctcf_1x.out')
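
As a rough illustration of the weight-based calculation mentioned above, the sketch below sums k-mer weights over every k-mer in each allele and takes the difference (alt minus ref). It is a toy under stated assumptions: it uses k = 3 and made-up weights to stay readable where the tutorial uses 10-mers, it looks k-mers up on one strand only, and the helper names are hypothetical. It is not the gkmSVM implementation.

```r
# All overlapping k-mers of a sequence. kmers() is a hypothetical helper.
kmers <- function(seq, k) {
  n <- nchar(seq)
  if (n < k) return(character(0))
  substring(seq, 1:(n - k + 1), k:n)
}

# Sum of weights over the k-mers of a sequence; k-mers that are missing
# from the weight table contribute zero.
sum_weights <- function(seq, weights) {
  k <- nchar(names(weights)[1])
  w <- weights[kmers(seq, k)]
  w[is.na(w)] <- 0
  sum(w)
}

# Made-up 3-mer weights and a made-up pair of alleles differing at one base.
toy_weights <- c(ACG = 1.0, CGT = 0.5, GTA = 0.3, ACT = -0.2, CTT = 0.1)
ref_seq <- "ACGTA"
alt_seq <- "ACTTA"

delta <- sum_weights(alt_seq, toy_weights) - sum_weights(ref_seq, toy_weights)
print(delta)
```

A negative value in this toy means the alternate allele scores lower than the reference under the made-up weights.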


If you find this tool useful, please cite:

Ghandi, Mohammad-Noori, Ghareghani, Lee, Garraway, and Beer, Bioinformatics (2016); and
Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014).