Difference between pages "Computational Regulatory Genomics" and "Tutorial"

From BeerLab
(Difference between pages)
Jump to navigation Jump to search
 
>Admin
 
Line 1: Line 1:
__NOTOC__
+
gkmSVM-R Tutorial notes
__NOTITLE__
 
<metadesc>Computational Regulatory Genomics for graduate PhD and postdoctoral research in BME and IGM at Johns Hopkins, top ranking programs modeling DNA and systems biology.</metadesc>
 
<!-- <h2>Computational Regulatory Genomics</h2> -->
 
<h3>[[Recent News  ]]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[[Lab Members  ]]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  [[Publications]]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [[Postdoctoral Positions Available]]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [[Resources]]</h3>
 
  
[[File:Beer_Michael_small.jpg|link=http://www.bme.jhu.edu/people/primary]] [[File:EncodeNatureGraphic_small.png]] [[File:dyn_net.gif]]
+
INSTALLATION for linux or mac (R 3.5 or later)
  
We are in the '''[http://www.bme.jhu.edu/people/primary.php?id=384 Department of Biomedical Engineering]''' and the '''[https://www.hopkinsmedicine.org/profiles/results/directory/profile/8377361/michael-beer McKusick-Nathans Department of Genetic Medicine]''' at Johns Hopkins University. You can apply for graduate study in my lab through '''[http://www.bme.jhu.edu/people/primary.php?id=384 BME]''' or the Ph.D. program in '''[https://www.hopkinsmedicine.org/profiles/results/directory/profile/8377361/michael-beer Human Genetics.]'''
+
$ R <br/>
   
+
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") <br/>
<h3>Research Interests: </h3> The ultimate goal of our research is to understand how gene regulatory information is encoded in genomic DNA sequence.  
+
> BiocManager::install() <br/>
We have recently made significant progress in understanding how DNA sequence features specify cell-type specific mammalian enhancer activity using machine learning, and how enhancer-gene networks control cell-fate decisions and contribute to human disease. For details, see:
+
> BiocManager::install(c('GenomicRanges','rtracklayer','BSgenome', 'BSgenome.Hsapiens.UCSC.hg19.masked', 'BSgenome.Hsapiens.UCSC.hg18.masked')) <br/>
 +
> install.packages('ROCR','kernlab','seqinr') <br/>
 +
 
 +
$ git clone https://github.com/mghandi/gkmSVM.git <br/>
 +
$ R CMD INSTALL gkmSVM <br/>
 +
 
 +
--or--
 +
 
 +
> install.packages('gkmSVM') <br/>
 +
 
 +
INSTALLATION for linux or mac (R 3.4 or earlier)
 +
 
 +
$ R <br/>
 +
> source("https://bioconductor.org/biocLite.R") <br/>
 +
> biocLite('GenomicRanges') <br/>
 +
> biocLite('rtracklayer') <br/>
 +
> biocLite('BSgenome') <br/>
 +
> biocLite('BSgenome.Hsapiens.UCSC.hg19.masked')    (or other genomes) <br/>
 +
> biocLite('BSgenome.Hsapiens.UCSC.hg18.masked') <br/>
 +
> install.packages('ROCR') <br/>
 +
> install.packages('kernlab') <br/>
 +
> install.packages('seqinr') <br/>
 +
> quit() <br/>
 +
 
 +
$ git clone https://github.com/mghandi/gkmSVM.git <br/>
 +
$ R CMD INSTALL gkmSVM <br/>
 +
 
 +
--or--
 +
 
 +
> install.packages('gkmSVM') <br/>
 +
 
 +
 
 +
Now to run gkmSVM-R on the ctcf test set from Ghandi Lee, Mohammad-Noori, Beer, PLOS CompBio 2014:
 +
 
 +
Input files: [http://www.beerlab.org/gkmsvm/ctcfpos.bed ctcfpos.bed], [http://www.beerlab.org/gkmsvm/nr10mers.fa nr10mers.fa], [http://www.beerlab.org/gkmsvm/ref.fa ref.fa], [http://www.beerlab.org/gkmsvm/alt.fa alt.fa] from [http://www.beerlab.org/gkmsvm www.beerlab.org/gkmsvm]
 +
 
 +
1. generate GC, length, and repeat matched negative set and extract fasta sequence files for ctcfpos.fa and ctcfneg_1x.fa:  (Larger negative sets can be generated by increasing xfold, and running time can be decreased by reducing nMaxTrials, at the cost of not matching difficult sequences. In general training on larger sequence sets will produce more accurate and robust models.)
 +
 
 +
$ R <br/>
 +
> library(gkmSVM) <br/>
 +
> genNullSeqs('ctcfpos.bed',nMaxTrials=10,xfold=1,genomeVersion='hg18',  outputPosFastaFN='ctcfpos.fa', outputBedFN='ctcfneg_1x.bed', outputNegFastaFN='ctcfneg_1x.fa') <br/>
  
* '''[https://www.annualreviews.org/doi/abs/10.1146/annurev-genom-121719-010946?journalCode=genom Enhancer Predictions and Genome-Wide Regulatory Circuits.]''' Beer MA, Shigaki D, Huangfu D.  Ann. Rev. Genomics and Human Genetics 2020.
+
2. calculate kernel matrix:
  
* '''[https://www.nature.com/articles/s41588-019-0408-9 Genome-scale screens identify JNK–JUN signaling as a barrier for pluripotency exit and endoderm differentiation. ]''' Li Q, et al.  Nature Genetics 2019.
+
> gkmsvm_kernel('ctcfpos.fa','ctcfneg_1x.fa', 'ctcf_1x_kernel.out')
  
* '''[http://onlinelibrary.wiley.com/doi/10.1002/humu.23185/full Predicting enhancer activity and variant impact using gkm-SVM.]''' Beer, MA.  Human Mutation 2017.
+
3. perform SVM training with cross-validation:
  
* '''[http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3331.html A method to predict the impact of regulatory variants from DNA sequence.]''' Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, Beer, MA. Nature Genetics 2015.
+
> gkmsvm_trainCV('ctcf_1x_kernel.out','ctcfpos.fa','ctcfneg_1x.fa',svmfnprfx='ctcf_1x', outputCVpredfn='ctcf_1x_cvpred.out', outputROCfn='ctcf_1x_roc.out')
  
* '''[http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003711 Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features.]''' Ghandi M, Lee D, Mohammad-Noori M, and Beer MA.  2014. PLOS Computational Biology. July 17, 2014.
+
4. generate 10-mer weights:
  
and other [[Publications]].
+
> gkmsvm_classify('nr10mers.fa',svmfnprfx='ctcf_1x', 'ctcf_1x_weights.out')
  
Our work uses functional genomics DNase-seq, ATAC-seq, ChIP-seq, RNA-seq, MPRA, Hi-C, and chromatin state data to computationally identify combinations of transcription factor binding sites which operate to define the activity of cell-type specific enhancersWe are currently focused on:
+
This should get AUROC=.955 and AUPRC=.954 with some small variation arising from the randomly sampled negative setsYou can then select the top weights with:                  
 +
 +
$ sort –grk  2 ctcf_1x_weights.out | head -12
  
* improving on the SVM methodology by including more general sequence features and constraints
+
which should give weights very similar to:
* predicting the impact of SNPs on enhancer activity (delta-SVM) and GWAS association for specific diseases
 
* experimentally assessing the predicted impact of regulatory element mutation in mammalian cells
 
* systematically determining regulatory element logic from ENCODE human and mouse data
 
* using this sequence based regulatory code to assess common modes of regulatory element evolution and variation
 
  
We are located in the McKusick-Nathans Institute of Genetic Medicine, and the Department of Biomedical Engineering, which has long been a leader in the development of rigorous quantitative modeling of biological systems, and is a natural home for graduate studies in Bioinformatics and Computational Biology at Johns Hopkins, including research in Genomics, Systems Biology, Machine Learning, and Network Modeling.  
+
<code>
 +
CACCTGGTGG      5.133463 <br/>
 +
CACCAGGTGG      5.090566 <br/>
 +
CACCAGGGGG      5.038873 <br/>
 +
CCACTAGGGG      4.833398 <br/>
 +
CCACCAGGGG      4.832404 <br/>
 +
CACCTAGTGG      4.782613 <br/>
 +
CACCAGAGGG      4.707206 <br/>
 +
CACTAGGGGG      4.663015 <br/>
 +
CACTAGAGGG      4.610800 <br/>
 +
CACTAGGTGG      4.580834 <br/>
 +
CCACTAGAGG      4.529869 <br/>
 +
CAGCAGAGGG      4.335304 <br/>
 +
</code>
  
 +
5. To calculate the impact of a variant, in this case on CTCF binding, we use gkmsvm_classify to find the
 +
score difference between two alleles given in FASTA format in ‘ref.fa’ and ‘alt.fa’. This is only different
 +
by a scale factor from deltaSVM calculated directly from SVM weights, as described in (Lee, Gorkin,
 +
Baker, Strober, Aasoni, McCallion, Beer, Nature Genetics 2015).
  
<h3>About Computational Biology in JHU Biomedical Engineering:</h3>
+
> gkmsvm_delta('ref.fa','alt.fa',svmfnprfx='ctcf_1x', 'dsvm_ctcf_1x.out')
The Department of Biomedical Engineering has long been a leader in the development of rigorous quantitative modeling of biological systems, and is a natural home for graduate studies in Bioinformatics and Computational Biology at Johns Hopkins. Students with backgrounds in Physics, Mathematics, Computer Science and Engineering are encouraged to apply. Opportunities for research include: Computational Medicine, Genomics, Systems Biology, Machine Learning, and Network Modeling. Graduate students in Johns Hopkins' Biomedical Engineering programs can select research advisors from throughout Johns Hopkins' Medical Institutions, Whiting School of Engineering, and Krieger School of Arts and Sciences.
 
  
<h3>[http://karchinlab.org/bme-compbio-jhu Visit Some Computational Labs at Johns Hopkins]</h3>
 
  
<h3>[http://ccb.jhu.edu/ Center for Computational Biology at Johns Hopkins]</h3>
+
If you find this tool useful, please cite:
  
[[File:bmesmall.png]]
+
Ghandi, Mohammad-Noori, Ghareghani, Lee, Garraway, and Beer, Bioinformatics (2016); and <br/>
 +
Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014).

Revision as of 19:19, 5 August 2019

gkmSVM-R Tutorial notes

INSTALLATION for linux or mac (R 3.5 or later)

$ R
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install()
> BiocManager::install(c('GenomicRanges','rtracklayer','BSgenome', 'BSgenome.Hsapiens.UCSC.hg19.masked', 'BSgenome.Hsapiens.UCSC.hg18.masked'))
> install.packages('ROCR','kernlab','seqinr')

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')

INSTALLATION for linux or mac (R 3.4 or earlier)

$ R
> source("https://bioconductor.org/biocLite.R")
> biocLite('GenomicRanges')
> biocLite('rtracklayer')
> biocLite('BSgenome')
> biocLite('BSgenome.Hsapiens.UCSC.hg19.masked') (or other genomes)
> biocLite('BSgenome.Hsapiens.UCSC.hg18.masked')
> install.packages('ROCR')
> install.packages('kernlab')
> install.packages('seqinr')
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')


Now to run gkmSVM-R on the ctcf test set from Ghandi Lee, Mohammad-Noori, Beer, PLOS CompBio 2014:

Input files: ctcfpos.bed, nr10mers.fa, ref.fa, alt.fa from www.beerlab.org/gkmsvm

1. generate GC, length, and repeat matched negative set and extract fasta sequence files for ctcfpos.fa and ctcfneg_1x.fa: (Larger negative sets can be generated by increasing xfold, and running time can be decreased by reducing nMaxTrials, at the cost of not matching difficult sequences. In general training on larger sequence sets will produce more accurate and robust models.)

$ R
> library(gkmSVM)
> genNullSeqs('ctcfpos.bed',nMaxTrials=10,xfold=1,genomeVersion='hg18', outputPosFastaFN='ctcfpos.fa', outputBedFN='ctcfneg_1x.bed', outputNegFastaFN='ctcfneg_1x.fa')

2. calculate kernel matrix:

> gkmsvm_kernel('ctcfpos.fa','ctcfneg_1x.fa', 'ctcf_1x_kernel.out')

3. perform SVM training with cross-validation:

> gkmsvm_trainCV('ctcf_1x_kernel.out','ctcfpos.fa','ctcfneg_1x.fa',svmfnprfx='ctcf_1x', outputCVpredfn='ctcf_1x_cvpred.out', outputROCfn='ctcf_1x_roc.out')

4. generate 10-mer weights:

> gkmsvm_classify('nr10mers.fa',svmfnprfx='ctcf_1x', 'ctcf_1x_weights.out')

This should get AUROC=.955 and AUPRC=.954 with some small variation arising from the randomly sampled negative sets. You can then select the top weights with:

$ sort –grk 2 ctcf_1x_weights.out | head -12

which should give weights very similar to:

CACCTGGTGG 5.133463
CACCAGGTGG 5.090566
CACCAGGGGG 5.038873
CCACTAGGGG 4.833398
CCACCAGGGG 4.832404
CACCTAGTGG 4.782613
CACCAGAGGG 4.707206
CACTAGGGGG 4.663015
CACTAGAGGG 4.610800
CACTAGGTGG 4.580834
CCACTAGAGG 4.529869
CAGCAGAGGG 4.335304

5. To calculate the impact of a variant, in this case on CTCF binding, we use gkmsvm_classify to find the score difference between two alleles given in FASTA format in ‘ref.fa’ and ‘alt.fa’. This is only different by a scale factor from deltaSVM calculated directly from SVM weights, as described in (Lee, Gorkin, Baker, Strober, Aasoni, McCallion, Beer, Nature Genetics 2015).

> gkmsvm_delta('ref.fa','alt.fa',svmfnprfx='ctcf_1x', 'dsvm_ctcf_1x.out')


If you find this tool useful, please cite:

Ghandi, Mohammad-Noori, Ghareghani, Lee, Garraway, and Beer, Bioinformatics (2016); and
Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014).