fimo - a motif search tool

Usage:

fimo [options] <motifs> <database>

Description:

The name fimo stands for "find individual motif occurrences." The program searches a sequence database for occurrences of known motifs, treating each motif independently. The program uses a dynamic programming algorithm to convert log-odds scores into p-values, assuming a zero-order background model. The p-values for each motif are then converted to q-values following the method of Benjamini and Hochberg (where "q-value" is defined as the minimal false discovery rate at which a given motif occurrence is deemed significant). The program reports all motif occurrences that receive q-values smaller than a specified threshold. If a given motif has the strand feature set to +/- (rather than +), then fimo will search both strands for occurrences.
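
The conversion from scores to p-values described above can be illustrated with a small dynamic program. The following Python sketch is not FIMO's implementation: it assumes the log-odds scores have already been scaled and rounded to integers, and the names score_pvalues, pssm and background are hypothetical. It builds the distribution of total scores one motif column at a time under a zero-order background, then sums the upper tail.

```python
from collections import defaultdict

def score_pvalues(pssm, background):
    """Return {score: P(S >= score)} for integer-valued PSSM scores.

    pssm: list of columns, each a dict mapping letter -> integer score.
    background: dict mapping letter -> probability (zero-order model).
    """
    # dist maps a partial score to its probability over the columns seen so far.
    dist = {0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        for partial, prob in dist.items():
            for letter, s in column.items():
                nxt[partial + s] += prob * background[letter]
        dist = dict(nxt)
    # Accumulate the upper tail from the highest score downward.
    pvalues = {}
    tail = 0.0
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        pvalues[score] = tail
    return pvalues

# Toy two-column motif (assumed values) with a uniform background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = [{"A": 2, "C": -1, "G": -1, "T": -1},
        {"A": 2, "C": -1, "G": -1, "T": -1}]
pv = score_pvalues(pssm, background)
```

The work grows with the motif width times the number of distinct partial scores, which is why scores are discretized in this sketch rather than handled as arbitrary real numbers.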

The most accurate estimation of q-values requires FIMO to retain the p-values for all matches to a motif in memory. This is not feasible for very large sequence databases. The parameter --max-stored-scores sets the maximum number of matches that will be retained for a motif. It defaults to 100,000. If the number of matches reaches the maximum value allowed, FIMO will discard 50% of the least significant matches, and new matches falling below the significance level of the retained matches will also be discarded. If FIMO has to discard matches, it will not be able to use bootstrapping on the complete set of p-values to estimate the parameter pi₀. In this case FIMO will calculate q-values using pi₀ = 1.0.
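
The discarding behaviour described above can be pictured with the following Python sketch. It is an illustration only, not FIMO's code: the class name BoundedMatchStore and its attributes are hypothetical, and it stores bare p-values rather than full match records. It also assumes that "least significant" means "largest p-value".

```python
class BoundedMatchStore:
    """Keep at most max_stored p-values, dropping the least significant half."""

    def __init__(self, max_stored):
        self.max_stored = max_stored
        self.pvalues = []
        self.cutoff = 1.0      # largest p-value still accepted after a purge
        self.dropped = False   # True once any match has been discarded

    def add(self, pvalue):
        # After a purge, matches less significant than the retained ones are skipped.
        if self.dropped and pvalue > self.cutoff:
            return
        self.pvalues.append(pvalue)
        if len(self.pvalues) >= self.max_stored:
            # Keep the most significant half (smallest p-values).
            self.pvalues.sort()
            keep = max(1, len(self.pvalues) // 2)
            self.pvalues = self.pvalues[:keep]
            self.cutoff = self.pvalues[-1]
            self.dropped = True

store = BoundedMatchStore(max_stored=4)
for p in [0.5, 0.01, 0.2, 0.03, 0.1, 0.02]:
    store.add(p)
```

In this sketch the dropped flag marks the situation in which, according to the text above, pi₀ can no longer be estimated from the complete set of p-values and 1.0 is used instead.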

Input:

<motifs> is a list of motifs, in MEME format.
<database> is a collection of sequences in FASTA format.

Output:

FIMO will create a directory, named fimo_out by default. Any existing output files in the directory will be overwritten. The directory will contain:

An XML file named fimo.xml using the CisML schema.
An HTML file named fimo.html
A plain text file named fimo.text
A plain text file in GFF format named fimo.gff

The default output directory can be overridden using the --o or --oc options which are described below.

The --text option will limit output to plain text sent to the standard output.
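
As a minimal sketch of how the output options combine on the command line, the Python helper below assembles an argument list using only options named in this document. The helper names build_fimo_command and run_fimo, the file names motifs.meme and sequences.fasta, and the directory my_fimo_out are all hypothetical. Actually running the command assumes a fimo executable is on the PATH.

```python
import subprocess

def build_fimo_command(motifs, database, out_dir=None, pthresh=None, text=False):
    """Assemble a fimo argument list: fimo [options] <motifs> <database>."""
    cmd = ["fimo"]
    if out_dir is not None:
        cmd += ["--oc", out_dir]            # overwrite an existing output directory
    if pthresh is not None:
        cmd += ["--output-pthresh", str(pthresh)]
    if text:
        cmd.append("--text")                # plain text on standard output only
    cmd += [motifs, database]
    return cmd

def run_fimo(cmd):
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)

cmd = build_fimo_command("motifs.meme", "sequences.fasta",
                         out_dir="my_fimo_out", pthresh=1e-5)
# run_fimo(cmd)  # uncomment on a system where fimo is installed
```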

The HTML and plain text output contain the following columns:

The motif identifier
The sequence identifier
The start position of the motif occurrence
The end position of the motif occurrence. If the start position is larger than the end position, the motif occurrence is on the reverse strand.
The score for the motif occurrence. The score is computed by summing the appropriate entries from each column of the position-dependent scoring matrix that represents the motif.
The p-value of the motif occurrence. The p-value is the probability of a random sequence of the same length as the motif matching that position of the sequence with a score at least as good.
The q-value of the motif occurrence. The q-value is the estimated false discovery rate if the occurrence is accepted as significant. See Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proc. Natl Acad. Sci. USA (2003) 100:9440–9445.
The sequence matched to the motif.
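
Two of the columns above lend themselves to a short illustration: the score, which is a sum of one matrix entry per motif column, and the strand, which can be read off the start and end positions. The Python sketch below uses a made-up two-column scoring matrix; the function names score_site and strand_of are hypothetical and are not part of FIMO.

```python
def score_site(pssm, site):
    """Sum the matrix entry selected by each letter of the site."""
    if len(site) != len(pssm):
        raise ValueError("site length must equal motif width")
    return sum(column[letter] for column, letter in zip(pssm, site))

def strand_of(start, end):
    """A start position larger than the end position marks the reverse strand."""
    return "-" if start > end else "+"

# Toy position-dependent scoring matrix (assumed values), one dict per column.
pssm = [{"A": 2, "C": -1, "G": -1, "T": -1},
        {"A": -1, "C": 3, "G": -1, "T": -1}]
best = score_site(pssm, "AC")
worst = score_site(pssm, "GG")
```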

The HTML and plain text output are sorted by increasing p-value.
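
Because q-values are derived from the ranked p-values, the relationship between the two columns can be sketched in a few lines of Python. This is a generic step-up calculation in the style of Benjamini and Hochberg with an optional pi₀ factor, not FIMO's own code. The function name qvalues and the total parameter are hypothetical, and the sketch assumes the number of tests equals the number of p-values supplied unless total is given.

```python
def qvalues(pvalues, pi0=1.0, total=None):
    """Return q-values in the same order as the input p-values."""
    m = len(pvalues) if total is None else total
    # Indices of the p-values from most to least significant.
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    q = [0.0] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(len(order), 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * pvalues[i] * m / rank)
        q[i] = running_min
    return q

p = [0.01, 0.04, 0.03, 0.5]
q = qvalues(p)
```

With pi0 left at 1.0 this reduces to the plain Benjamini and Hochberg adjustment. A smaller pi₀ estimate scales every q-value down by the same factor.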

Options:

--bgfile <bfile> - Read background frequencies from <bfile>. The file should be in MEME background file format. The default is to use frequencies embedded in the application from the non-redundant database. If the argument is the keyword motif-file, then the frequencies will be taken from the motif file.
--max-seq-length <max> - Set the maximum length allowed for input sequences. By default the maximum allowed length is 250000000.
--max-stored-scores <max> - Set the maximum number of scores that will be stored. Precise calculation of q-values depends on having a complete list of scores. However, keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped, and approximate q-values will be calculated. By default the maximum number of stored matches is 100,000.
--motif <id> - Use only the motif identified by <id>. This option may be repeated.
--motif-pseudo <float> - A pseudocount to be added to each count in the motif matrix, after first multiplying by the corresponding background frequency (default=0.1).
--norc - Do not score the reverse complement DNA strand. Both strands are scored by default.
--o <dir name> - Specifies the output directory. If the directory already exists, the contents will not be overwritten.
--oc <dir name> - Specifies the output directory. If the directory already exists, the contents will be overwritten.
--output-pthresh <float> - The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. Using the --output-pthresh option will set the q-value threshold to 1.0. The default p-value threshold is 1e-4.
--output-qthresh <float> - The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. Using the --output-qthresh option will set the p-value threshold to 1.0. The default q-value threshold is 1.0.
--no-qvalue - Do not compute a q-value for each p-value. The q-value calculation is that of Benjamini and Hochberg (1995). By default, q-values are computed.
--text - Limit output to plain text sent to standard output. For FIMO, the text output is unsorted, and q-values are not reported. This mode allows the program to search an arbitrarily large database, because results are not stored in memory.
--verbosity 1|2|3|4 - Set the verbosity of status reports to standard error. The default level is 2.
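
The effect of --motif-pseudo can be illustrated with a small Python sketch. It follows one reading of the option text above, in which the pseudocount is multiplied by the background frequency of a letter and then added to that letter's count. FIMO's internal handling may differ in detail, and the function name apply_pseudocount and the example counts are hypothetical.

```python
def apply_pseudocount(counts, background, pseudo=0.1):
    """Return per-column frequencies after adding pseudo * background to each count."""
    freqs = []
    for column in counts:
        adjusted = {a: column[a] + pseudo * background[a] for a in background}
        total = sum(adjusted.values())
        freqs.append({a: v / total for a, v in adjusted.items()})
    return freqs

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
counts = [{"A": 10, "C": 0, "G": 0, "T": 0}]   # toy one-column motif
freqs = apply_pseudocount(counts, background)
```

The point of the adjustment is that no letter ends up with a frequency of exactly zero, so a log-odds score can be computed for every letter in every column.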