Difference between pages "Computational Regulatory Genomics" and "Tutorial"

From BeerLab

Revision as of 19:19, 5 August 2019

gkmSVM-R Tutorial notes

INSTALLATION for linux or mac (R 3.5 or later)

$ R
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install()
> BiocManager::install(c('GenomicRanges','rtracklayer','BSgenome', 'BSgenome.Hsapiens.UCSC.hg19.masked', 'BSgenome.Hsapiens.UCSC.hg18.masked'))
> install.packages(c('ROCR','kernlab','seqinr'))
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')

INSTALLATION for linux or mac (R 3.4 or earlier)

$ R
> source("https://bioconductor.org/biocLite.R")
> biocLite('GenomicRanges')
> biocLite('rtracklayer')
> biocLite('BSgenome')
> biocLite('BSgenome.Hsapiens.UCSC.hg19.masked')  # or other genomes
> biocLite('BSgenome.Hsapiens.UCSC.hg18.masked')
> install.packages('ROCR')
> install.packages('kernlab')
> install.packages('seqinr')
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')


Now run gkmSVM-R on the CTCF test set from Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014):

Input files: ctcfpos.bed, nr10mers.fa, ref.fa, and alt.fa, available from http://www.beerlab.org/gkmsvm

1. Generate a GC-, length-, and repeat-matched negative set, and extract the FASTA sequence files ctcfpos.fa and ctcfneg_1x.fa. Larger negative sets can be generated by increasing xfold. Running time can be decreased by reducing nMaxTrials, at the cost of leaving difficult sequences unmatched. In general, training on larger sequence sets produces more accurate and robust models.

$ R
> library(gkmSVM)
> genNullSeqs('ctcfpos.bed',nMaxTrials=10,xfold=1,genomeVersion='hg18', outputPosFastaFN='ctcfpos.fa', outputBedFN='ctcfneg_1x.bed', outputNegFastaFN='ctcfneg_1x.fa')
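
To check how well the negative set from step 1 matches the positives, it can help to compare sequence length and GC content across the two FASTA files. The following is a minimal shell sketch, not part of gkmSVM: the function names fasta_gc_len and fasta_summary are hypothetical, and it assumes standard FASTA records (a '>' header line followed by one or more sequence lines).

```shell
#!/usr/bin/env bash
# Sketch only: fasta_gc_len and fasta_summary are illustrative helpers, not gkmSVM functions.

# Print "id<TAB>length<TAB>GC fraction" for every record in a FASTA file.
fasta_gc_len() {
  awk '
    function flush_record() {
      if (id != "") printf "%s\t%d\t%.3f\n", id, len, (len > 0 ? gc / len : 0)
    }
    /^>/ { flush_record(); id = substr($1, 2); len = 0; gc = 0; next }
    {
      seq = toupper($0)
      len += length(seq)
      gc  += gsub(/[GC]/, "", seq)   # gsub returns how many G/C characters it replaced
    }
    END { flush_record() }
  ' "$1"
}

# Print the number of sequences, mean length, and mean GC fraction of a FASTA file.
fasta_summary() {
  fasta_gc_len "$1" | awk -F'\t' '
    { n++; total_len += $2; total_gc += $3 }
    END { if (n) printf "n=%d mean_len=%.1f mean_gc=%.3f\n", n, total_len / n, total_gc / n }
  '
}

# Small self-contained demo on a temporary two-record FASTA file.
demo=$(mktemp)
printf '>seq1\nACGT\nGGCC\n>seq2\nAATT\n' > "$demo"
fasta_gc_len "$demo"
rm -f "$demo"
```

After step 1, running fasta_summary ctcfpos.fa and fasta_summary ctcfneg_1x.fa should report similar mean lengths and GC fractions if the matching worked as intended.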

2. Calculate the kernel matrix:

> gkmsvm_kernel('ctcfpos.fa','ctcfneg_1x.fa', 'ctcf_1x_kernel.out')

3. Perform SVM training with cross-validation:

> gkmsvm_trainCV('ctcf_1x_kernel.out','ctcfpos.fa','ctcfneg_1x.fa',svmfnprfx='ctcf_1x', outputCVpredfn='ctcf_1x_cvpred.out', outputROCfn='ctcf_1x_roc.out')
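
The AUROC from cross-validation can be read as the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one. As a rough sanity check, the sketch below recomputes it from a predictions file by counting pairs. It is an illustration, not gkmSVM code: the function name auroc_from_cvpred is hypothetical, and the column layout (score in column 2, a label in column 3 that is greater than zero for positives) is an assumption about ctcf_1x_cvpred.out that should be checked against the actual file.

```shell
#!/usr/bin/env bash
# Sketch only: auroc_from_cvpred is an illustrative helper, not a gkmSVM function.
# Usage: auroc_from_cvpred FILE [SCORE_COLUMN] [LABEL_COLUMN]
# ASSUMPTION: whitespace-separated columns, score in column 2, label in column 3
# (label > 0 means positive). Adjust the column numbers to match the real file.
auroc_from_cvpred() {
  awk -v sc="${2:-2}" -v lc="${3:-3}" '
    NF == 0 { next }                                 # skip blank lines
    ($lc + 0) > 0 { pos[np++] = $sc + 0; next }      # positive-set scores
                  { neg[nn++] = $sc + 0 }            # negative-set scores
    END {
      if (np == 0 || nn == 0) { print "NA"; exit }
      # Count positive/negative pairs ranked correctly; ties count as half.
      for (i = 0; i < np; i++)
        for (j = 0; j < nn; j++) {
          if (pos[i] > neg[j]) wins += 1
          else if (pos[i] == neg[j]) wins += 0.5
        }
      printf "%.3f\n", wins / (np * nn)
    }
  ' "$1"
}

# Small self-contained demo: 3 of the 4 positive/negative pairs are ordered correctly.
demo=$(mktemp)
printf 'p1\t0.9\t1\np2\t0.4\t1\nn1\t0.5\t-1\nn2\t0.1\t-1\n' > "$demo"
auroc_from_cvpred "$demo"    # prints 0.750
rm -f "$demo"
```

The pairwise loop is quadratic in the number of sequences, which is acceptable for a few thousand sequences but slow for much larger sets.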

4. Generate 10-mer weights:

> gkmsvm_classify('nr10mers.fa',svmfnprfx='ctcf_1x', 'ctcf_1x_weights.out')

The cross-validation in step 3 should give AUROC=0.955 and AUPRC=0.954, with small variations arising from the randomly sampled negative sets. You can then select the top weights with:

$ sort -grk 2 ctcf_1x_weights.out | head -12

which should give weights very similar to:

CACCTGGTGG 5.133463
CACCAGGTGG 5.090566
CACCAGGGGG 5.038873
CCACTAGGGG 4.833398
CCACCAGGGG 4.832404
CACCTAGTGG 4.782613
CACCAGAGGG 4.707206
CACTAGGGGG 4.663015
CACTAGAGGG 4.610800
CACTAGGTGG 4.580834
CCACTAGAGG 4.529869
CAGCAGAGGG 4.335304

5. To calculate the impact of a variant, in this case on CTCF binding, we use gkmsvm_delta to find the score difference between the two alleles, given in FASTA format in ‘ref.fa’ and ‘alt.fa’. This differs only by a scale factor from deltaSVM calculated directly from SVM weights, as described in Lee, Gorkin, Baker, Strober, Asoni, McCallion, and Beer, Nature Genetics (2015).

> gkmsvm_delta('ref.fa','alt.fa',svmfnprfx='ctcf_1x', 'dsvm_ctcf_1x.out')
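
Conceptually, the variant score is a per-variant difference between the scores of the two alleles. The sketch below illustrates that arithmetic on two score files; it is not how gkmsvm_delta is implemented. The function name delta_scores is hypothetical, the two-column "id score" layout is assumed, the inputs are assumed to list the variants in the same order, and the sign convention (alt minus ref) is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: delta_scores is an illustrative helper, not a gkmSVM function.
# Usage: delta_scores REF_SCORES ALT_SCORES
# ASSUMPTIONS: both files have "id score" per line, list the variants in the same
# order, and the reported difference is alt minus ref.
delta_scores() {
  paste "$1" "$2" | awk '
    NF >= 4 { printf "%s\t%.6f\n", $1, $4 - $2 }   # $2 = ref score, $4 = alt score
  '
}

# Small self-contained demo with two made-up variants.
ref=$(mktemp); alt=$(mktemp)
printf 'snp1\t1.5\nsnp2\t-0.25\n' > "$ref"
printf 'snp1\t0.5\nsnp2\t0.75\n'  > "$alt"
delta_scores "$ref" "$alt"
rm -f "$ref" "$alt"
```

Under these assumptions, a negative value means the alt allele scores lower than the ref allele, and a positive value means it scores higher.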


If you find this tool useful, please cite:

Ghandi, Mohammad-Noori, Ghareghani, Lee, Garraway, and Beer, Bioinformatics (2016); and
Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014).