Difference between pages "Postdoctoral Positions Available" and "Tutorial"

Revision as of 19:19, 5 August 2019

gkmSVM-R Tutorial notes

INSTALLATION for linux or mac (R 3.5 or later)

$ R
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install()
> BiocManager::install(c('GenomicRanges','rtracklayer','BSgenome', 'BSgenome.Hsapiens.UCSC.hg19.masked', 'BSgenome.Hsapiens.UCSC.hg18.masked'))
> install.packages(c('ROCR','kernlab','seqinr'))
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')

INSTALLATION for linux or mac (R 3.4 or earlier)

$ R
> source("https://bioconductor.org/biocLite.R")
> biocLite('GenomicRanges')
> biocLite('rtracklayer')
> biocLite('BSgenome')
> biocLite('BSgenome.Hsapiens.UCSC.hg19.masked')  # or other genomes
> biocLite('BSgenome.Hsapiens.UCSC.hg18.masked')
> install.packages('ROCR')
> install.packages('kernlab')
> install.packages('seqinr')
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')
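
Before moving on, it can help to confirm that every package named in the installation steps is actually available. The following is a minimal sketch using base R only; `missing_pkgs` is a hypothetical helper name, not part of gkmSVM, and the package list is copied from the commands above (the genome packages are left out because the one you need depends on your data).

```r
# Packages named in the installation steps above.
required <- c('GenomicRanges', 'rtracklayer', 'BSgenome',
              'ROCR', 'kernlab', 'seqinr', 'gkmSVM')

# missing_pkgs() is a hypothetical helper: it returns the subset of
# 'pkgs' that cannot be loaded in the current R library.
missing_pkgs <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

# An empty result means everything is installed.
print(missing_pkgs(required))
```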

Now run gkmSVM-R on the CTCF test set from Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014):

Input files: ctcfpos.bed, nr10mers.fa, ref.fa, alt.fa from www.beerlab.org/gkmsvm
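
The input files can also be fetched from within R. This is a sketch under the assumption that each file is served directly under the www.beerlab.org/gkmsvm address given above; `fetch_inputs` is a hypothetical helper name, and the download call is left commented out so nothing is fetched by accident.

```r
# Base address and file names as listed above.
base_url <- "http://www.beerlab.org/gkmsvm"
input_files <- c("ctcfpos.bed", "nr10mers.fa", "ref.fa", "alt.fa")

# Full URL for each input file (assumes files sit directly under base_url).
input_urls <- file.path(base_url, input_files)

# fetch_inputs() is a hypothetical helper: it downloads only the files
# that are not already present in 'dest_dir'.
fetch_inputs <- function(urls = input_urls, dest_dir = ".") {
  for (u in urls) {
    dest <- file.path(dest_dir, basename(u))
    if (!file.exists(dest)) download.file(u, dest, mode = "wb")
  }
  invisible(file.path(dest_dir, basename(urls)))
}

# fetch_inputs()   # uncomment to download into the working directory
```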

1. generate a GC-, length-, and repeat-matched negative set and extract the FASTA sequence files ctcfpos.fa and ctcfneg_1x.fa: (Larger negative sets can be generated by increasing xfold. Running time can be decreased by reducing nMaxTrials, at the cost of leaving difficult sequences unmatched. In general, training on larger sequence sets produces more accurate and robust models.)

$ R
> library(gkmSVM)
> genNullSeqs('ctcfpos.bed',nMaxTrials=10,xfold=1,genomeVersion='hg18', outputPosFastaFN='ctcfpos.fa', outputBedFN='ctcfneg_1x.bed', outputNegFastaFN='ctcfneg_1x.fa')
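
Because a low nMaxTrials can leave some sequences unmatched, it may be worth checking how many negatives were actually produced. The sketch below uses base R only and simply counts FASTA header lines; `count_fasta` and `check_neg_set` are hypothetical helper names, not gkmSVM functions.

```r
# Count FASTA records by counting header lines, which start with ">".
count_fasta <- function(path) {
  sum(startsWith(readLines(path), ">"))
}

# check_neg_set() is a hypothetical helper: it reports how many negative
# sequences are missing relative to the requested xfold * positives.
check_neg_set <- function(pos_fa, neg_fa, xfold = 1) {
  n_pos <- count_fasta(pos_fa)
  n_neg <- count_fasta(neg_fa)
  c(positives = n_pos, negatives = n_neg, shortfall = xfold * n_pos - n_neg)
}

# Only runs when the files from step 1 are present.
if (file.exists("ctcfpos.fa") && file.exists("ctcfneg_1x.fa")) {
  print(check_neg_set("ctcfpos.fa", "ctcfneg_1x.fa", xfold = 1))
}
```

A positive shortfall suggests raising nMaxTrials if the missing sequences matter for your model.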

2. calculate kernel matrix:

> gkmsvm_kernel('ctcfpos.fa','ctcfneg_1x.fa', 'ctcf_1x_kernel.out')

3. perform SVM training with cross-validation:

> gkmsvm_trainCV('ctcf_1x_kernel.out','ctcfpos.fa','ctcfneg_1x.fa',svmfnprfx='ctcf_1x', outputCVpredfn='ctcf_1x_cvpred.out', outputROCfn='ctcf_1x_roc.out')
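
For an independent check of the cross-validation result, AUROC can be computed from scores and labels with the rank-based (Mann-Whitney) formula. This is a generic sketch, not the routine gkmsvm_trainCV uses internally; it assumes you have already read the scores and labels into vectors, with positive labels coded as values greater than zero. The layout of the prediction file is not described here, so inspect it before reading it in.

```r
# Rank-based AUROC: the probability that a randomly chosen positive
# scores higher than a randomly chosen negative (ties count as 1/2).
# Assumption: labels > 0 mark the positive class.
auroc <- function(scores, labels) {
  pos <- labels > 0
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # average ranks for tied scores
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: positives are perfectly separated from negatives, so AUROC is 1.
print(auroc(c(0.9, 0.8, 0.3, 0.1), c(1, 1, -1, -1)))
```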

4. generate 10-mer weights:

> gkmsvm_classify('nr10mers.fa',svmfnprfx='ctcf_1x', 'ctcf_1x_weights.out')

This should give AUROC=0.955 and AUPRC=0.954, with small variations arising from the randomly sampled negative set. You can then select the top weights with:

$ sort -grk 2 ctcf_1x_weights.out | head -12

which should give weights very similar to:

CACCTGGTGG 5.133463
CACCAGGTGG 5.090566
CACCAGGGGG 5.038873
CCACTAGGGG 4.833398
CCACCAGGGG 4.832404
CACCTAGTGG 4.782613
CACCAGAGGG 4.707206
CACTAGGGGG 4.663015
CACTAGAGGG 4.610800
CACTAGGTGG 4.580834
CCACTAGAGG 4.529869
CAGCAGAGGG 4.335304
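
The same selection can be done without leaving R. This sketch assumes the weights file has two whitespace-separated columns, k-mer then weight, as in the listing above; `top_kmers` is a hypothetical helper name.

```r
# top_kmers() is a hypothetical helper: it reads a two-column
# (k-mer, weight) file and returns the n highest-weighted k-mers.
top_kmers <- function(path, n = 12) {
  w <- read.table(path, col.names = c("kmer", "weight"),
                  colClasses = c("character", "numeric"))
  head(w[order(w$weight, decreasing = TRUE), ], n)
}

# Only runs when the weights file from step 4 is present.
if (file.exists("ctcf_1x_weights.out")) {
  print(top_kmers("ctcf_1x_weights.out"))
}
```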

5. To calculate the impact of a variant, in this case on CTCF binding, we use gkmsvm_delta to find the score difference between the two alleles given in FASTA format in 'ref.fa' and 'alt.fa'. This differs only by a scale factor from deltaSVM calculated directly from SVM weights, as described in Lee, Gorkin, Baker, Strober, Asoni, McCallion, and Beer, Nature Genetics (2015).

> gkmsvm_delta('ref.fa','alt.fa',svmfnprfx='ctcf_1x', 'dsvm_ctcf_1x.out')
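
To make the idea of a weight-based score difference concrete, here is a toy sketch in base R. It is not the gkmsvm_delta implementation: the 3-mer weights are made up for illustration, reverse complements are ignored, and the sign is taken here as alt minus ref, which is an assumption to check against your own output. `score_seq` and `delta_score` are hypothetical helper names.

```r
# Score a sequence as the sum of the weights of all its k-mers.
# K-mers absent from the weight table contribute 0.
score_seq <- function(seq, weights, k) {
  starts <- seq_len(max(0, nchar(seq) - k + 1))
  kmers <- substring(seq, starts, starts + k - 1)
  w <- weights[kmers]
  sum(w[!is.na(w)])
}

# Score difference between two alleles (sign convention assumed: alt - ref).
delta_score <- function(ref, alt, weights, k) {
  score_seq(alt, weights, k) - score_seq(ref, weights, k)
}

# Made-up 3-mer weights, for illustration only.
toy_weights <- c(ACG = 1.0, CGT = 0.5, ATG = -0.25, TGT = 0.25)

# ref ACGT scores 1.0 + 0.5; alt ATGT scores -0.25 + 0.25.
print(delta_score("ACGT", "ATGT", toy_weights, k = 3))
```

In the real setting the weights would be the 10-mer weights from step 4, and only the k-mers overlapping the variant differ between the two alleles.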

If you find this tool useful, please cite:

Ghandi, Mohammad-Noori, Ghareghani, Lee, Garraway, and Beer, Bioinformatics (2016); and
Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014).

