As an alternative to the tarfile, you may use git clone to download the source:
# git clone https://github.com/rakarnik/motifspec.git
MotifSpec is written in C++. We have compiled it successfully on recent Linux (GCC 4.1.2) and Mac OS X versions (GCC 4.2.1), though it should compile successfully on any Unix with a reasonably recent GCC. Once you have unpacked the source, compile it by typing:
# make
After compilation, the MotifSpec binary (motifspec) is placed in the "bin" directory. You can copy it anywhere on your system and run it from there.
MotifSpec runs as a set of worker processes, which perform the actual motif search, and an archive process, which collects non-redundant motif results. Because each random restart is independent, you can run any number of worker processes to parallelize the motif search. The worker processes and the archive process are linked through the "-o" option, which must be the same for all processes within a run. You can thus have multiple MotifSpec runs within the same directory, as long as each run uses a different "-o" value.
To start a worker process to run on the NRSF ChIP-seq data:
# bin/motifspec -s seq/nrsf.fa -su seq/nrsf.pos -o nrsf -worker 1 -numcols 15 -seed 1 >& nrsf.1.out &
To start the archive process:
# bin/motifspec -s seq/nrsf.fa -su seq/nrsf.pos -o nrsf -numcols 15 -seed 1 >& nrsf.out &
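To run several workers against one archive process, the two commands above can be generated in a loop. The following is a minimal sketch, not part of the MotifSpec distribution: the function name print_run_commands, the MOTIFSPEC variable and the log file names are illustrative, and giving each worker a seed equal to its ID is an assumption rather than a documented requirement. The function only prints the command lines so that you can inspect them before running anything.

```shell
#!/bin/sh
# Sketch: print the command lines for N workers plus one archive process.
# MOTIFSPEC and print_run_commands are hypothetical names; the flags are
# the ones used in the NRSF example.
MOTIFSPEC=${MOTIFSPEC:-bin/motifspec}

print_run_commands() {
    seqfile=$1; posfile=$2; prefix=$3; nworkers=$4; numcols=$5
    i=1
    while [ "$i" -le "$nworkers" ]; do
        # Assumption: each worker gets its own ID and its own seed.
        echo "$MOTIFSPEC -s $seqfile -su $posfile -o $prefix -worker $i -numcols $numcols -seed $i > $prefix.$i.out 2>&1 &"
        i=$((i + 1))
    done
    # The archive process omits -worker but shares the same -o prefix.
    echo "$MOTIFSPEC -s $seqfile -su $posfile -o $prefix -numcols $numcols -seed 1 > $prefix.out 2>&1 &"
}

# Usage (prints only; pipe the output to sh to actually launch the run):
#   print_run_commands seq/nrsf.fa seq/nrsf.pos nrsf 4 15
```

Every printed command carries the same "-o" prefix, which is what ties the workers to the archive process.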
Results are collected in nrsf.ms. To summarize the motifs, run:
# perl scripts/msproc.pl nrsf.ms
This should eventually produce:
ace      mot  tot   ssp   gtseq  hits  score    cons              seqc    sspc  iter
nrsf.ms  1    7251  2417  1442   1445  3516.41  TCAGCACCATGGACAG  0.9790  0.70  1.2
nrsf.ms  2    7251  2417  965    335   731.53   RGRRARRRRRRRRRR   0.9550  0.70  1.1
nrsf.ms  3    7251  2417  501    143   370.55   AARAAAAAAAAAAAAA  0.9880  0.70  1.5
nrsf.ms  4    7251  2417  55     37    161.95   ACCYTG--AARKG-Y   0.9750  0.70  1.3
-s <seqfile> | The input file containing sequences in FASTA format (positive and negative sets combined)
-o <out> | The output prefix. This argument must be the same for worker and archive processes within a run.
-worker <i> | Sets the ID number of the worker process. Omit this argument for the archive process.
You must also specify one (and only one) of the following three options:
-su <sufile> | File containing IDs of sequences that constitute the fixed positive set (one ID per line)
-sc <scfile> | File containing binding scores (tab-delimited, one ID and binding score per line)
-ex <exfile> | File containing expression values (tab-delimited, one ID and multiple expression values per line)
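If your positive sequences are available as a separate FASTA file, a "-su" ID list can be derived from its header lines. The sketch below is only an illustration: the function name fasta_ids and the file names in the usage line are hypothetical, and it assumes that a sequence ID is the first whitespace-delimited token after ">" in the header, which may not match how MotifSpec itself reads IDs.

```shell
#!/bin/sh
# fasta_ids: read FASTA on stdin, print one sequence ID per line.
# Assumption: the ID is the first whitespace-delimited token after ">".
fasta_ids() {
    awk '/^>/ { sub(/^>[ \t]*/, ""); print $1 }'
}

# Usage (file names are placeholders):
#   fasta_ids < positives.fa > positives.pos
```

Check a few IDs in the result against the headers of the combined "-s" file before starting a run.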
The following options are optional:
-numcols <k> | The number of columns in the PWM motif model (default 10)
-order <o> | Order of the background Markov model (default 3)
-simcut <s> | Similarity cutoff (CompareACE-like) for motifs to be considered redundant (default 0.9)
-minpass <m> | The number of iterations that must occur without improvement before a motif search is terminated (default 100)
-seed <s> | Sets the random seed for reproducibility of runs (default uses system time)
Motif <motidx>
<site> <seqidx> <seqpos> <strand>
<site> <seqidx> <seqpos> <strand>
<site> <seqidx> <seqpos> <strand>
. . . .
<site> <seqidx> <seqpos> <strand>
*** **** *
Score: <score>
Sequences above sequence threshold: <s2>
Size of search space: <s1>
Sequence cutoff: <seqcut>
Expression cutoff: <exprcut>
Score cutoff: <sccut>
Iteration found: <workerid>.<restartnum>
Dejavu: <dj>
where the output parts are:
motidx | The index of the motif within the output file
site | Sequence of the motif hit
seqidx | The index of the sequence within the list of input sequences
seqpos | The position of the motif hit within the sequence
strand | The strand on which the motif hit occurs (1 = Watson, 0 = Crick)
*** **** * | Informative columns (* if informative, <space> if not)
score | Score of the motif
s1 | Size of the search space or positive set
s2 | Number of sequences that contain the motif, across positive and negative sets
seqcut | Dynamically learned sequence threshold
sccut | Dynamically learned binding score threshold (only valid in score mode)
exprcut | Dynamically learned expression correlation (cluster width) threshold (only valid in expression mode)
workerid | ID of the worker process that found this motif
restartnum | Number of the random restart that found this motif
dj | Number of times a similar motif was found
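For a quick look at an output file without running the Perl scripts, the fields described above can be pulled out with awk. This is a rough sketch and not a replacement for msproc.pl: the function name ms_scores is hypothetical, and it assumes that each record starts with a line "Motif <motidx>", that "Score: <score>" sits on its own line, and that each site line has exactly four fields ending in a strand value of 0 or 1.

```shell
#!/bin/sh
# ms_scores: print "<motidx> <number of sites> <score>" for each motif.
# Reads the files given as arguments, or stdin if none are given.
# Assumption: the record layout is one field group per line, as in the
# output description ("Motif <motidx>", site lines, "Score: <score>").
ms_scores() {
    awk '
        # Start of a new motif record: remember its index, reset the count.
        $1 == "Motif" && NF == 2 { idx = $2; sites = 0; next }
        # Site lines: <site> <seqidx> <seqpos> <strand>, strand is 0 or 1.
        NF == 4 && ($4 == 0 || $4 == 1) && idx != "" { sites++; next }
        # "Score:" (with colon) is the motif score, not "Score cutoff:".
        $1 == "Score:" { print idx, sites, $2 }
    ' "$@"
}

# Usage:
#   ms_scores nrsf.ms
```

If the counts look wrong for your files, compare the assumptions above against the actual line layout of your output file first.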
msproc.pl | Process MotifSpec output file to get tabular list of motifs |
extract_ms.pl | Extract list of sequences having a motif |
If you use MotifSpec in your work, please reference our paper:
Karnik, R, and Beer, MA. Identification of predictive cis-regulatory
elements using a discriminative objective function and dynamic search
spaces. (submitted)
Please contact Mike Beer at mbeer AT jhu DOT edu with any questions you may have.