Difference between pages "Publications" and "Tutorial"

From BeerLab
Full publication list on '''[https://scholar.google.com/citations?hl=en&user=9aH8_eEAAAAJ google scholar]'''.
  
* '''[https://www.nature.com/articles/s41467-021-21368-0]''' Xi W, Beer MA.  Nature Comm. 2021.
  
* '''[https://www.annualreviews.org/doi/abs/10.1146/annurev-genom-121719-010946?journalCode=genom Enhancer Predictions and Genome-Wide Regulatory Circuits.]''' Beer MA, Shigaki D, Huangfu D. Ann. Rev. Genomics and Human Genetics 2020.
  
* '''[https://www.nature.com/articles/s41586-020-2493-4 Expanded encyclopaedias of DNA elements in the human and mouse genomes.]''' ENCODE Project Consortium.  Nature 2020.
  
* '''[https://www.jci.org/articles/view/126726 Genomic and epigenomic EBF1 alterations modulate TERT expression in gastric cancer.]''' Xing M, Ooi WF, Tan J, Qamra A, Lee PH, Li Z, Xu C, Padmanabhan N, Lim JQ, Guo YA, Yao X, Amit M, Ng LM, Sheng T, Wang J, Huang KK, Anene-Nzelu CG, Ho SWT, Ray M, Ma L, Fazzi G, Lim KJ, Wijaya GC, Zhang S, Nandi T, Yan T, Chang MM, Das K, Isa ZFA, Wu J, Yean PPS, Lam YN, Lin JS, Tay ST, Lee M, Keng ATL, Ong X, White K, Rozen SG, Beer MA, Foo RSY, Grabsch H, Skanderup AJ, Li S, Teh BT, Tan P. J. Clin. Invest. 2020
  
* '''[https://www.nature.com/articles/s41588-019-0408-9 Genome-scale screens uncover JNK/JUN signaling as a key barrier from pluripotency to human endoderm differentiation.]''' Li QV, Dixon G, Verma N, Rosen BP, Gordillo M, Luo R, Xu C, Wang Q, Soh C-L, Yang D, Crespo M, Shukla A, Xiang Q, Dundar F, Zumbo P, Witkin M, Koche R, Betel D, Chen S, Massague J, Garippa R, Evans T, Beer MA, and Huangfu D. Nature Genetics 2019.
  
* '''[https://onlinelibrary.wiley.com/doi/full/10.1002/humu.23797 Integration of multiple epigenomic marks improves prediction of variant impact in saturation mutagenesis reporter assay.]''' Shigaki D, Adato O, Adhikar A, Dong S, Hawkins-Hooker A, Inoue F, Juven-Gershon T, Kenlay H, Martin B, Patra A, Penzar D, Schubach M, Xiong C, Yan Z, Boyle A, Kreimer A, Kulakovskiy IV, Reid J, Unger R, Yosef N, Shendure J, Ahituv N, Kircher M, and Beer MA. Human Mutation 2019.  [http://www.beerlab.org/deltasvm_models (models)]
  
* '''[https://stm.sciencemag.org/content/11/497/eaaw0790.abstract Epigenetic activation and memory at a novel TGFβ2 enhancer in systemic sclerosis fibroblasts.]''' Shin JY, Beckett JD, Shah A, McMahan Z, Paik J, Sampedro MM, MacFarlane EG, Beer MA, Warren D, Wigley FM, and Dietz HC. Science Translational Medicine 2019.
  
* '''[https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006625 Local epigenomic state cannot discriminate interacting and non-interacting enhancer–promoter pairs with high accuracy.]''' Xi, W and Beer, MA.  PLOS Comp Biol 2018.
  
* '''[https://www.sciencedirect.com/science/article/pii/S0002929718303707 Parkinson-Associated SNCA Enhancer Variants Revealed by Open Chromatin in Mouse Dopamine Neurons.]''' McClymont SA, Hook PW, Soto AI, Reed X, Law WD, Kerans SJ, Waite EL, Briceno NJ, Thole JF, Heckman MG, Diehl NN, Wszolek  ZK, Moore CD, Zhu H, Akiyama JA, Dickel DE, Visel A, Pennacchio LA, Ross OA, Beer MA, & McCallion AS. Am. Jour. Human Genetics 2018
  
* '''[https://www.nature.com/articles/s41588-018-0156-2 Genetic determinants of co-accessible chromatin regions in activated T cells across humans.]''' Gate RE, Cheng CS, Aiden AP, Siba A, Tabaka M, Lituiev D, Machol I, Gordon MG, Subramaniam M, Shamim M, Hougen KL, Wortman I, Huang S-C, Durand NC, Feng T, De Jager PL, Chang HY, Lieberman Aiden E, Benoist C, Beer MA, Ye CJ & Regev A.  Nature Genetics 2018.
  
* '''[http://onlinelibrary.wiley.com/doi/10.1002/humu.23185/full Predicting enhancer activity and variant impact using gkm-SVM.]''' Beer, MA.  Human Mutation 2017.
 
  
* '''[http://onlinelibrary.wiley.com/doi/10.1002/humu.23197/full Predicting gene expression in massively parallel reporter assays: A comparative study.]''' Kreimer A, Zeng H, Edwards M, Guo Y, Tian K, Shin S, Welch R, Wainberg M, Mohan R, Sinnott-Armstrong N, Li Y, Amin T, Goke J, Mueller N, Kellis, M, Kundaje A, Beer MA, Keles S, Gifford D, and Yosef, N. Human Mutation 2017.
  
* '''[http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0170403 Embryonic loss of human females with partial trisomy 19 identifies region critical for the single active X.]''' Migeon BR, Beer MA, and Bjornsson HT. Plos ONE 2017.
  
* '''[http://bioinformatics.oxfordjournals.org/content/early/2016/05/05/bioinformatics.btw203 gkmSVM: an R package for gapped-kmer SVM.]''' [[Media:Bioinformatics-2016-Ghandi-bioinformatics-btw203.pdf|pdf]] [[tutorial]] Ghandi, M, Mohammad-Noori M, Ghareghani N, Lee D, Garraway L, and Beer MA. Bioinformatics 2016.
  
* '''[http://elifesciences.org/content/5/e11613v1 Epigenomic landscapes of retinal rods and cones.]''' Mo, A, Luo, C, Davis, FP, Mukamel, EA, Henry, GL, Nery JR, Urich, MA, Picard, S, Lister, R, Eddy, SR, Beer, MA, Ecker, JR, and Nathans, J. eLife 2016.
  
* '''[http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3331.html A method to predict the impact of regulatory variants from DNA sequence.]''' Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, Beer, MA. Nature Genetics 2015.
  
* '''[http://www.ncbi.nlm.nih.gov/pubmed/25582907 Enhanced transcriptome maps from multiple mouse tissues reveal evolutionary constraint in gene expression.]''' Pervouchine DD, Djebali S, Breschi A, Davis CA, Barja PP, Dobin, A, Tanzer A, Lagarde J, Zaleski C, See L-H, Fastuca M, Drenkow J, Wang H, Bussotti G, Pei B, Balasubramanian S, Monlong J, Harmanci A, Gerstein M, Beer MA, Notredame C, Guigó R, Gingeras TR. Nat. Comm 2015.
  
* '''[http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0140557 Identification of predictive cis-regulatory elements using a discriminative objective function and dynamic search spaces.]''' Karnik, R, and Beer MA. PLOS One 2015.
  
* '''[http://www.ncbi.nlm.nih.gov/pubmed/25413365 Comparison of the transcriptional landscapes between human and mouse tissues.]''' Lin S, Lin Y, Nery JR, Urich MA, Breschi A, Davis CA, Dobin A, Zaleski C, Beer MA, Chapman WC, Gingeras TR, Ecker JR, Snyder MP. PNAS 2014.
  
* '''[http://www.ncbi.nlm.nih.gov/pubmed/25409824 A comparative encyclopedia of DNA elements in the mouse genome.]''' Mouse ENCODE Consortium (includes Lee D and Beer MA). 2014. Nature 515:355–364.
  
* '''[http://www.ncbi.nlm.nih.gov/pubmed/25319996 Divergent functions of hematopoietic transcription factors in lineage priming and differentiation during erythro-megakaryopoiesis.]''' Pimkin M, Kossenkov AV, Mishra T, Morrissey CS, Wu W, Keller CA, Blobel GA, Lee D, Beer MA, Hardison RC, Weiss MJ. 2014. Genome Research.
  
* '''[http://www.ncbi.nlm.nih.gov/pubmed/25033408 Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features.]''' Ghandi M*, Lee D*, Mohammad-Noori M, and Beer MA. 2014. PLoS Computational Biology 10(7):e1003711.
  
* '''[http://www.horizonpress.com/genomeanalysis Mammalian Enhancer Prediction.]''' Lee D, Beer MA. 2014. Genome Analysis: Current Procedures and Applications. Horizon Press.  [[Media:book.pdf|pdf]]
  
* '''[http://www.ncbi.nlm.nih.gov/pubmed/23861010 Robust k-mer Frequency Estimation Using Gapped k-mers.]''' Ghandi M, Mohammad-Noori M, and Beer MA. 2013. Journal of Mathematical Biology. (Epub ahead of print)
  
* '''[http://www.ncbi.nlm.nih.gov/pubmed/23771147 kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic datasets.]''' Fletez-Brant C*, Lee D*, McCallion AS and Beer MA. 2013. Nucleic Acids Research 41: W544–W556.
  
* '''[http://www.ncbi.nlm.nih.gov/pubmed/23019145 Integration of ChIP-seq and Machine Learning Reveals Enhancers and a Predictive Regulatory Sequence Vocabulary in Melanocytes.]''' Gorkin DU, Lee D, Reed X, Fletez-Brant C, Blessling SL, Loftus SK, Beer MA, Pavan WJ, and McCallion AS. 2012. Genome Research 22:2290-2301.
  
* '''[http://www.ncbi.nlm.nih.gov/pubmed/22912661 Group Normalization for Genomic Data.]''' Ghandi M, and Beer MA. 2012. PLoS ONE 7:e38695.
 
  
* '''[http://www.ncbi.nlm.nih.gov/pubmed/21875935 Discriminative prediction of mammalian enhancers from DNA sequence.]''' Lee D, Karchin R, and Beer MA. 2011. Genome Research 21:2167-2180.
  
* '''[http://www.ncbi.nlm.nih.gov/pubmed/21720494 Identification of Novel Phosphorylation Motifs Through an Integrative Computational and Experimental Analysis of the Human Phosphoproteome.]''' Amanchy R, Kandasamy K, Mathivanan S, Periaswamy B, Reddy R, Yoon WH, Joore J, Beer MA, Cope L, Pandey A. 2011.  J Proteomics Bioinform 4:22-35.
* '''[http://www.ncbi.nlm.nih.gov/pubmed/19430481 A common allele in RPGRIP1L is a modifier of retinal degeneration in ciliopathies.]''' Khanna H, Davis EE, Murga-Zamalloa CA, Estrada-Cuzcano A, Lopez I, den Hollander AI, Zonneveld MN, Othman MI, Waseem N, Chakarova CF, Maubaret C, Diaz-Font A, MacDonald I, Muzny DM, Wheeler DA, Morgan M, Lewis LR, Logan CV, Tan PL, Beer MA, Inglehearn CF, Lewis RA, Jacobson SG, Bergmann C, Beales PL, Attié-Bitach T, Johnson CA, Otto EA, Bhattacharya SS, Hildebrandt F, Gibbs RA, Koenekoop RK, Swaroop A, Katsanis N. 2009. Nat Genet. 41:739-45.
 
 
 
* '''[http://www.ncbi.nlm.nih.gov/pubmed/19253296 Identification of miR-21 targets in breast cancer cells using a quantitative proteomic approach.]''' Yang Y, Chaerkady R, Beer MA, Mendell JT, and Pandey A. 2009. Proteomics 9:1374-1384.
 
 
 
* '''[http://www.ncbi.nlm.nih.gov/pubmed/19211792 Lin-28B transactivation is necessary for Myc-mediated let-7 repression and proliferation.]''' Chang T-C, Zeitels LR, Hwang H-W, Chivukula RR, Wentzel EA, Dews M, Jung J, Gao P, Dang CV, Beer MA, Thomas-Tikhonenko A, and Mendell JT. 2009. PNAS 106:3384-3389.
 
 
 
* '''[http://www.ncbi.nlm.nih.gov/pubmed/18071029  Metrics of sequence constraint overlook regulatory sequences in an exhaustive analysis at phox2b.]''' McGaughey DM, Vinton RM, Huynh J, Al-Saif A, Beer MA, and McCallion AS. 2008. Genome Research 18:252-260.
 
 
 
* '''[http://www.ncbi.nlm.nih.gov/pubmed/17540599 Transactivation of miR-34a by p53 Broadly Influences Gene Expression and Promotes Apoptosis.]''' Chang T-C, Wentzel EA, Kent OA, Ramachandran K, Mullendore M, Lee KH, Feldmann G, Yamakuchi M, Ferlito M, Lowenstein CJ,  Arking DE, Beer MA, Maitra A, and Mendell JT. 2007. Molecular Cell 26: 745-752.
 
 
 
* '''[http://www.ncbi.nlm.nih.gov/pubmed/15870260 Functional Characterization of a Novel Ku70/80 Pause Site at the H19/Igf2 Imprinting Control Region.]''' Katz DJ, Beer MA, Levorse JM, and Tilghman SM. 2005. Mol Cell Biol 25:3855-3863.
 
 
 
* '''[http://www.ncbi.nlm.nih.gov/pubmed/14672978 Whole-Genome Discovery of Transcription Factor Binding Sites by Network-Level Conservation.]''' Pritsker M, Liu Y-C, Beer MA, and Tavazoie S. 2004. Genome Research 14:99-108.
 
 
 
* '''[http://www.ncbi.nlm.nih.gov/pubmed/15084257 Predicting Gene Expression from Sequence.]''' Beer MA and Tavazoie S. 2004. Cell 117:185-198.
 
 
 
 
 
----
 
 
 
For a full list including my prior work in simulations of plasma turbulence, see '''[https://scholar.google.com/citations?user=9aH8_eEAAAAJ&hl=en&oi=ao my google scholar page.]'''
 

Revision as of 19:19, 5 August 2019

gkmSVM-R Tutorial notes

INSTALLATION for linux or mac (R 3.5 or later)

$ R
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install()
> BiocManager::install(c('GenomicRanges','rtracklayer','BSgenome', 'BSgenome.Hsapiens.UCSC.hg19.masked', 'BSgenome.Hsapiens.UCSC.hg18.masked'))
> install.packages(c('ROCR','kernlab','seqinr'))
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')

INSTALLATION for linux or mac (R 3.4 or earlier)

$ R
> source("https://bioconductor.org/biocLite.R")
> biocLite('GenomicRanges')
> biocLite('rtracklayer')
> biocLite('BSgenome')
> biocLite('BSgenome.Hsapiens.UCSC.hg19.masked')   # or other genomes
> biocLite('BSgenome.Hsapiens.UCSC.hg18.masked')
> install.packages('ROCR')
> install.packages('kernlab')
> install.packages('seqinr')
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')


Now run gkmSVM-R on the CTCF test set from Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014):

Input files: ctcfpos.bed, nr10mers.fa, ref.fa, alt.fa from www.beerlab.org/gkmsvm
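
The four input files can be fetched from the command line. The sketch below is a dry run that only prints the download commands; it assumes each file sits directly under the address given above and that curl is installed.

```shell
# Base URL taken from the input-file note above; file names as listed there.
base_url="http://www.beerlab.org/gkmsvm"
files="ctcfpos.bed nr10mers.fa ref.fa alt.fa"

# Dry run: print one download command per file.
# Drop the leading "echo" to actually fetch the files (needs curl and network access).
cmds=$(for f in $files; do
  echo curl -O "$base_url/$f"
done)
echo "$cmds"
```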

1. Generate a GC-, length-, and repeat-matched negative set and extract FASTA sequence files for the positive set (ctcfpos.fa) and the negative set (ctcfneg_1x.fa). Larger negative sets can be generated by increasing xfold, and running time can be decreased by reducing nMaxTrials, at the cost of leaving difficult sequences unmatched. In general, training on larger sequence sets will produce more accurate and robust models.

$ R
> library(gkmSVM)
> genNullSeqs('ctcfpos.bed',nMaxTrials=10,xfold=1,genomeVersion='hg18', outputPosFastaFN='ctcfpos.fa', outputBedFN='ctcfneg_1x.bed', outputNegFastaFN='ctcfneg_1x.fa')
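
A quick sanity check after this step is to compare the number of sequences in the two FASTA files: with xfold=1 the negative set should be about as large as the positive set, minus any sequences that could not be matched within nMaxTrials. A minimal sketch follows; count_fasta is a hypothetical helper, and the tiny stand-in files only make it runnable anywhere.

```shell
# count_fasta (hypothetical helper): count FASTA records by counting header lines.
count_fasta() {
  grep -c '^>' "$1"
}

# Stand-in files so the sketch runs on its own; in the tutorial you would
# pass ctcfpos.fa and ctcfneg_1x.fa instead.
tmpdir=$(mktemp -d)
printf '>p1\nACGT\n>p2\nGGCC\n' > "$tmpdir/pos.fa"
printf '>n1\nTTAA\n>n2\nCCGG\n' > "$tmpdir/neg.fa"

npos=$(count_fasta "$tmpdir/pos.fa")
nneg=$(count_fasta "$tmpdir/neg.fa")
echo "positives: $npos  negatives: $nneg"
```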

2. Calculate the kernel matrix:

> gkmsvm_kernel('ctcfpos.fa','ctcfneg_1x.fa', 'ctcf_1x_kernel.out')

3. Perform SVM training with cross-validation:

> gkmsvm_trainCV('ctcf_1x_kernel.out','ctcfpos.fa','ctcfneg_1x.fa',svmfnprfx='ctcf_1x', outputCVpredfn='ctcf_1x_cvpred.out', outputROCfn='ctcf_1x_roc.out')
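
Cross-validation performance is usually summarized by the area under the ROC curve (AUROC), which equals the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one. The sketch below computes it from ranks on a made-up table; the two-column layout (score, then label 1 or -1) is an assumption for illustration and is not claimed to be the format of ctcf_1x_cvpred.out. Tied scores are not handled.

```shell
# auroc (hypothetical helper). Input: one line per sequence, "score<TAB>label",
# with label 1 (positive) or -1 (negative). This layout is assumed for the sketch.
auroc() {
  LC_ALL=C sort -g -k 1,1 "$1" | awk '
    { if ($2 > 0) { npos++; ranksum += NR } else nneg++ }   # NR is the rank after sorting
    END { printf "%.3f\n", (ranksum - npos * (npos + 1) / 2) / (npos * nneg) }'
}

# Made-up predictions: three positives and three negatives.
preds=$(mktemp)
printf '0.9\t1\n0.8\t1\n0.7\t-1\n0.6\t1\n0.2\t-1\n0.1\t-1\n' > "$preds"
auc=$(auroc "$preds")
echo "AUROC = $auc"
```

In this toy table 8 of the 9 positive/negative pairs are ranked correctly, so the result is 8/9.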

4. Generate 10-mer weights:

> gkmsvm_classify('nr10mers.fa',svmfnprfx='ctcf_1x', 'ctcf_1x_weights.out')

Cross-validation (step 3) should give AUROC=0.955 and AUPRC=0.954, with some small variation arising from the randomly sampled negative set. You can then select the top weights with:

$ sort -grk 2 ctcf_1x_weights.out | head -12

which should give weights very similar to:

CACCTGGTGG 5.133463
CACCAGGTGG 5.090566
CACCAGGGGG 5.038873
CCACTAGGGG 4.833398
CCACCAGGGG 4.832404
CACCTAGTGG 4.782613
CACCAGAGGG 4.707206
CACTAGGGGG 4.663015
CACTAGAGGG 4.610800
CACTAGGTGG 4.580834
CCACTAGAGG 4.529869
CAGCAGAGGG 4.335304
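
The same sort can be reversed to list the most negative weights as well. A minimal sketch on a stand-in file: three rows are copied from the list above, and the AAAAAAAAAA row is made up for illustration. It assumes a sort that supports -g (GNU and BSD sort do).

```shell
# Stand-in weights file: k-mer<TAB>weight.
wfile=$(mktemp)
printf 'CACCTGGTGG\t5.133463\nAAAAAAAAAA\t-0.250000\nCACCAGGTGG\t5.090566\nCAGCAGAGGG\t4.335304\n' > "$wfile"

# Largest weights first: -g general numeric sort, -r reverse, -k 2 sort on column 2.
top=$(LC_ALL=C sort -grk 2 "$wfile" | head -n 2)
# Smallest (most negative) weights first: same key without -r.
bottom=$(LC_ALL=C sort -gk 2 "$wfile" | head -n 1)

echo "top:";    echo "$top"
echo "bottom:"; echo "$bottom"
```

Note that the options use a plain ASCII hyphen; a typographic dash pasted from a web page will make sort treat the option as a file name.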

5. To calculate the impact of a variant, in this case on CTCF binding, we use gkmsvm_delta to find the score difference between two alleles given in FASTA format in ‘ref.fa’ and ‘alt.fa’. This differs only by a scale factor from deltaSVM calculated directly from SVM weights, as described in Lee, Gorkin, Baker, Strober, Asoni, McCallion, and Beer, Nature Genetics (2015).

> gkmsvm_delta('ref.fa','alt.fa',svmfnprfx='ctcf_1x', 'dsvm_ctcf_1x.out')
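
To illustrate what "deltaSVM calculated directly from SVM weights" means, the sketch below scores each allele as the sum of the weights of its k-mers and takes the difference (alt minus ref); k-mers that do not overlap the variant cancel out. It is a toy under stated assumptions, not the package's implementation: the weights are made up, k=3 instead of 10, score_seq is a hypothetical helper, and reverse-complement handling is ignored.

```shell
# Toy weight table (made up): k-mer<TAB>weight, with k=3 to keep it small.
weights=$(mktemp)
printf 'ACG\t1.0\nCGT\t0.5\nGTA\t-0.2\nAGG\t2.0\nGGT\t0.3\n' > "$weights"

# score_seq (hypothetical helper): sum the weights of all k-mers in a sequence.
# $1 = weights file, $2 = sequence, $3 = k
score_seq() {
  awk -v seq="$2" -v k="$3" '
    { w[$1] = $2 }                      # load the weight table
    END {
      s = 0
      for (i = 1; i + k - 1 <= length(seq); i++)
        s += w[substr(seq, i, k)]       # k-mers missing from the table contribute 0
      printf "%.3f\n", s
    }' "$1"
}

# Reference and alternate alleles differ at position 2 (C > G).
ref_score=$(score_seq "$weights" ACGTA 3)
alt_score=$(score_seq "$weights" AGGTA 3)
delta=$(awk -v a="$alt_score" -v r="$ref_score" 'BEGIN { printf "%.3f\n", a - r }')
echo "ref=$ref_score alt=$alt_score deltaSVM=$delta"
```

A positive difference means the alternate allele scores higher under the model than the reference allele.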


If you find this tool useful, please cite:

Ghandi, Mohammad-Noori, Ghareghani, Lee, Garraway, and Beer, Bioinformatics (2016); and
Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014).