Difference between pages "Research Interests" and "Tutorial"

Revision as of 19:19, 5 August 2019

gkmSVM-R Tutorial notes

INSTALLATION for linux or mac (R 3.5 or later)

$ R
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install()
> BiocManager::install(c('GenomicRanges','rtracklayer','BSgenome', 'BSgenome.Hsapiens.UCSC.hg19.masked', 'BSgenome.Hsapiens.UCSC.hg18.masked'))
> install.packages(c('ROCR','kernlab','seqinr'))

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')
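
After either route, it can help to confirm that gkmSVM and its dependencies are actually available. The sketch below is not part of gkmSVM: `missing_packages` is a hypothetical helper built on base R's `requireNamespace`, and the package list simply repeats the names used in the installation steps above.

```r
# Hypothetical helper (not part of gkmSVM): return the packages that cannot be loaded.
missing_packages <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

# Package names copied from the installation steps in this tutorial.
required <- c("gkmSVM", "GenomicRanges", "rtracklayer", "BSgenome",
              "ROCR", "kernlab", "seqinr")

not_installed <- missing_packages(required)
if (length(not_installed) > 0) {
  message("Still missing: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If anything is reported as missing, rerun the corresponding install command before continuing.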

INSTALLATION for linux or mac (R 3.4 or earlier)

$ R
> source("https://bioconductor.org/biocLite.R")
> biocLite('GenomicRanges')
> biocLite('rtracklayer')
> biocLite('BSgenome')
> biocLite('BSgenome.Hsapiens.UCSC.hg19.masked')  # or other genomes
> biocLite('BSgenome.Hsapiens.UCSC.hg18.masked')
> install.packages('ROCR')
> install.packages('kernlab')
> install.packages('seqinr')
> quit()

$ git clone https://github.com/mghandi/gkmSVM.git
$ R CMD INSTALL gkmSVM

--or--

> install.packages('gkmSVM')

Now run gkmSVM-R on the CTCF test set from Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014):

Input files: ctcfpos.bed, nr10mers.fa, ref.fa, alt.fa from www.beerlab.org/gkmsvm
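
The input files can also be fetched from within R. This is a minimal sketch that assumes each file sits directly under http://www.beerlab.org/gkmsvm/ (the server layout may have changed); `input_urls` and `fetch_inputs` are hypothetical helper names, and only base R's `download.file` is used.

```r
# Base URL and file names taken from the tutorial; availability is not guaranteed.
base_url <- "http://www.beerlab.org/gkmsvm"
input_files <- c("ctcfpos.bed", "nr10mers.fa", "ref.fa", "alt.fa")

# Hypothetical helper: build one URL per input file.
input_urls <- function(files, base = base_url) {
  paste(sub("/+$", "", base), files, sep = "/")
}

# Hypothetical helper: download only the files missing from the working directory.
fetch_inputs <- function(files = input_files, base = base_url) {
  todo <- files[!file.exists(files)]
  for (f in todo) {
    download.file(input_urls(f, base), destfile = f, mode = "wb")
  }
  invisible(todo)
}

# fetch_inputs()   # uncomment to download into the current directory
```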

1. Generate a GC-, length-, and repeat-matched negative set and extract the FASTA sequence files ctcfpos.fa and ctcfneg_1x.fa. (Larger negative sets can be generated by increasing xfold, and running time can be reduced by lowering nMaxTrials, at the cost of leaving difficult sequences unmatched. In general, training on larger sequence sets produces more accurate and robust models.)

$ R
> library(gkmSVM)
> genNullSeqs('ctcfpos.bed',nMaxTrials=10,xfold=1,genomeVersion='hg18', outputPosFastaFN='ctcfpos.fa', outputBedFN='ctcfneg_1x.bed', outputNegFastaFN='ctcfneg_1x.fa')
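
To sanity-check that the negative set really is matched, one can compare the length and GC content of the two FASTA files. The following is a rough sketch in base R, not part of gkmSVM; `read_fasta_simple`, `gc_fraction`, and `summarise_set` are hypothetical helpers, and the reader assumes a plain FASTA layout with `>` header lines.

```r
# Minimal FASTA reader (base R only); returns a named character vector of sequences.
read_fasta_simple <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)   # which record each line belongs to
  groups <- split(lines[!is_header],
                  factor(record[!is_header], levels = seq_len(sum(is_header))))
  seqs <- vapply(groups, paste, character(1), collapse = "")
  setNames(unname(seqs), sub("^>", "", lines[is_header]))
}

# Fraction of G/C among unambiguous bases (N and other symbols are ignored).
gc_fraction <- function(seqs) {
  s <- toupper(seqs)
  nchar(gsub("[^GC]", "", s)) / nchar(gsub("[^ACGT]", "", s))
}

# Summary statistics for one FASTA file.
summarise_set <- function(path) {
  seqs <- read_fasta_simple(path)
  c(n = length(seqs),
    mean_length = mean(nchar(seqs)),
    mean_gc = mean(gc_fraction(seqs)))
}

# Example, once step 1 has written both files:
# rbind(pos = summarise_set("ctcfpos.fa"), neg = summarise_set("ctcfneg_1x.fa"))
```

If the matching worked, the two rows should show similar mean length and mean GC content.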

2. calculate kernel matrix:

> gkmsvm_kernel('ctcfpos.fa','ctcfneg_1x.fa', 'ctcf_1x_kernel.out')
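
The kernel matrix holds a similarity score for every pair of training sequences. For intuition only, the sketch below computes a much simpler similarity: a normalized count of shared contiguous k-mers (a spectrum kernel). This is a deliberately simplified stand-in, not the gapped k-mer kernel that gkmSVM computes; it ignores gaps and reverse complements, and the function names and the default k = 3 are assumptions chosen for illustration.

```r
# Count every contiguous k-mer in one sequence.
kmer_counts <- function(seq, k) {
  n <- nchar(seq) - k + 1
  if (n < 1) return(table(character(0)))
  table(substring(seq, 1:n, k:nchar(seq)))
}

# Normalized spectrum kernel: cosine similarity of the two k-mer count vectors.
spectrum_kernel <- function(a, b, k = 3) {
  ca <- kmer_counts(a, k)
  cb <- kmer_counts(b, k)
  shared <- intersect(names(ca), names(cb))
  dot <- sum(as.numeric(ca[shared]) * as.numeric(cb[shared]))
  dot / sqrt(sum(as.numeric(ca)^2) * sum(as.numeric(cb)^2))
}

# Pairwise similarity matrix for a character vector of sequences.
kernel_matrix <- function(seqs, k = 3) {
  n <- length(seqs)
  K <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      K[i, j] <- spectrum_kernel(seqs[[i]], seqs[[j]], k)
    }
  }
  K
}

kernel_matrix(c(s1 = "ACGTACGT", s2 = "ACGTTTTT", s3 = "GGGGCCCC"))
```

An SVM is then trained on such a matrix rather than on the raw sequences, which is why the kernel is computed once and reused in the next step.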

3. perform SVM training with cross-validation:

> gkmsvm_trainCV('ctcf_1x_kernel.out','ctcfpos.fa','ctcfneg_1x.fa',svmfnprfx='ctcf_1x', outputCVpredfn='ctcf_1x_cvpred.out', outputROCfn='ctcf_1x_roc.out')
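
The cross-validation predictions can be summarized as an AUROC. As a minimal sketch, the function below computes AUROC from scores and labels using the rank-sum (Mann-Whitney) identity in base R. `auroc` is a hypothetical helper, and the column layout mentioned in the commented usage line is an assumption to check against the actual file.

```r
# AUROC via the rank-sum identity: the probability that a randomly chosen
# positive scores higher than a randomly chosen negative (ties count as 1/2).
auroc <- function(scores, labels) {
  pos <- labels > 0
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)   # average ranks are used for ties
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: positives are labeled 1, negatives -1.
auroc(c(0.9, 0.1, 0.8, 0.3), c(1, 1, -1, -1))

# With real output (ASSUMED layout: score in column 2, label in column 3):
# cv <- read.table("ctcf_1x_cvpred.out")
# auroc(cv[[2]], cv[[3]])
```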

4. generate 10-mer weights:

> gkmsvm_classify('nr10mers.fa',svmfnprfx='ctcf_1x', 'ctcf_1x_weights.out')

Cross-validation should give AUROC = 0.955 and AUPRC = 0.954, with small variation arising from the randomly sampled negative set. You can then select the top weights with:

$ sort -grk 2 ctcf_1x_weights.out | head -12

which should give weights very similar to:

CACCTGGTGG      5.133463
CACCAGGTGG      5.090566
CACCAGGGGG      5.038873
CCACTAGGGG      4.833398
CCACCAGGGG      4.832404
CACCTAGTGG      4.782613
CACCAGAGGG      4.707206
CACTAGGGGG      4.663015
CACTAGAGGG      4.610800
CACTAGGTGG      4.580834
CCACTAGAGG      4.529869
CAGCAGAGGG      4.335304
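
The same selection can be done without leaving R. This sketch assumes the weights file has two whitespace-separated columns (k-mer, weight), as in the listing above; `top_weights` is a hypothetical helper, not a gkmSVM function.

```r
# Read a two-column weights file and return the n highest-weighted k-mers.
top_weights <- function(path, n = 12) {
  w <- read.table(path, col.names = c("kmer", "weight"),
                  colClasses = c("character", "numeric"))
  head(w[order(w$weight, decreasing = TRUE), ], n)
}

# Example, once step 4 has written the file:
# top_weights("ctcf_1x_weights.out")
```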

5. To calculate the impact of a variant, in this case on CTCF binding, we use gkmsvm_delta to find the score difference between two alleles given in FASTA format in 'ref.fa' and 'alt.fa'. The result differs only by a scale factor from deltaSVM calculated directly from SVM weights, as described in Lee, Gorkin, Baker, Strober, Asoni, McCallion, and Beer, Nature Genetics (2015).

> gkmsvm_delta('ref.fa','alt.fa',svmfnprfx='ctcf_1x', 'dsvm_ctcf_1x.out')
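
For comparison, deltaSVM computed directly from k-mer weights can be sketched as the summed weight of every k-mer in the alternate allele minus the summed weight of every k-mer in the reference allele. The code below is an illustration under stated assumptions, not the package's implementation: `revcomp`, `score_seq`, and `delta_svm` are hypothetical helpers, `weights` is a named numeric vector built from a weights file, a k-mer missing from the table is looked up by its reverse complement (since nr10mers.fa is assumed to list only one strand of each pair), and any k-mer still not found contributes 0.

```r
# Reverse complement of a single DNA string.
revcomp <- function(s) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", s), "")[[1]]), collapse = "")
}

# Sum the weights of all k-mers in a sequence, falling back to the
# reverse complement when the k-mer itself is not in the table.
score_seq <- function(seq, weights, k = 10) {
  n <- nchar(seq) - k + 1
  if (n < 1) return(0)
  kmers <- substring(seq, 1:n, k:nchar(seq))
  rc <- vapply(kmers, revcomp, character(1), USE.NAMES = FALSE)
  w <- unname(weights[kmers])
  miss <- is.na(w)
  w[miss] <- unname(weights[rc[miss]])
  sum(w, na.rm = TRUE)   # k-mers absent from the table contribute 0
}

# deltaSVM sketch: alternate-allele score minus reference-allele score.
delta_svm <- function(ref, alt, weights, k = 10) {
  score_seq(alt, weights, k) - score_seq(ref, weights, k)
}

# Toy example with 2-mers; real use would load the 10-mer weights, e.g.
# w <- read.table("ctcf_1x_weights.out"); weights <- setNames(w[[2]], w[[1]])
toy_weights <- c(AC = 1, CA = 2, AA = 0.5)
delta_svm("AAC", "ACA", toy_weights, k = 2)
```

For a single-nucleotide variant, k-mers that do not overlap the variant appear in both alleles and cancel, so only the k-mers spanning the variant affect the result.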

If you find this tool useful, please cite:

Ghandi, Mohammad-Noori, Ghareghani, Lee, Garraway, and Beer, Bioinformatics (2016); and
Ghandi, Lee, Mohammad-Noori, and Beer, PLOS Computational Biology (2014).

