About SPatt

What is SPatt?

SPatt (Statistic for Patterns) is a suite of C++ programs designed to compute p-values for pattern occurrences in text. Assuming the text is generated by a Markov model, the p-value of an observation is the probability of seeing a count at least as extreme as the one observed. The lower the p-value, the more unlikely the observation. For example, these tools can be used to find patterns with unusual behaviour in DNA or protein sequences.

Typical usage

The DNA motif/pattern GCTGG|CCAGC (meaning GCTGG or CCAGC) occurs 210 times in the complete genome of the bacteriophage lambda (phage_lambda.fasta, 48 kb). How significant is this observation, assuming that the DNA sequence is random with independent and uniformly distributed letters (freq(A)=freq(C)=freq(G)=freq(T)=0.25)?

Here is the command to run ("-S" for the provided sequence, "-p" for the pattern, "-a" for the alphabet descriptor, "-m" for the Markov model order, where "-1" means independent and uniformly distributed letters, and "--over" to compute the over-representation p-value):

spatt -S phage_lambda.fasta -p "GCTGG|CCAGC" -a "ACGT" -m -1 --over

and here is its (truncated) result:

pattern=GCTGG|CCAGC	Nobs=210	P(N>=Nobs)=2.090885e-24

This result indicates that the observation of "at least 210 occurrences of GCTGG|CCAGC in a random DNA sequence of 48 kb" is highly significant with a p-value of 2.1e-24.
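
To get a feel for why 210 occurrences is surprising, the expected count under the uniform model can be worked out by hand. The Python sketch below is only an illustration and is not part of SPatt: the function names are hypothetical, and the length 48000 is a round stand-in for "48 kb", not the exact genome length.

```python
def count_overlapping(sequence, words):
    """Count every start position at which any of the words occurs."""
    return sum(
        sequence.startswith(w, i)
        for i in range(len(sequence))
        for w in words
    )


def expected_count_uniform(length, words, alphabet_size=4):
    """Expected overlapping count in an i.i.d. uniform text.

    A word of k letters can start at (length - k + 1) positions and
    matches at each of them with probability alphabet_size ** -k.
    """
    return sum(
        max(length - len(w) + 1, 0) * alphabet_size ** -len(w)
        for w in words
    )


words = ["GCTGG", "CCAGC"]
# Assumed round length standing in for the 48 kb genome.
print(expected_count_uniform(48000, words))
```

Under these assumptions the expected count is about 94, so the 210 observed occurrences are more than twice what the uniform model predicts. The sketch gives only the mean; the p-value itself requires the full distribution of the count, which is what SPatt computes.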


Computing the distribution of a pattern in random sequences is a challenging and computationally intensive task for which many competing approaches exist. The goal of SPatt is to implement the most relevant ones and make them available in a single easy-to-use package. Here is a list of the features currently implemented in SPatt:

  • arbitrary alphabet (DNA, protein, binary, others);
  • automatic detection of case sensitive alphabets;
  • regex-like syntax allowing for complex patterns;
  • homogeneous Markov model of arbitrary order;
  • exact computations for a single sequence or a set of sequences;
  • Gaussian approximations;
  • overlapping or renewal counting;
  • presence/absence counting when dealing with datasets with several sequences;
  • efficient implementation using optimal Markov chain embedding through deterministic finite automata;
  • output of Scilab source code for the Markov chain embedding parameters (mostly for educational purposes);
  • optional output of dot (graphviz package) files for representing automata.
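
The idea behind Markov chain embedding through a deterministic finite automaton can be sketched in a few lines of Python. This is a simplified illustration and not SPatt's implementation: it builds a plain Aho-Corasick automaton (not a minimal one), handles only the independent uniform model with overlapping counting, and all function names are hypothetical. It is meant for small inputs only.

```python
from collections import deque


def build_automaton(words, alphabet):
    """Aho-Corasick automaton as a complete DFA.

    Returns (delta, hits): delta[s][c] is the state reached from s on
    letter c, and hits[s] is the number of words ending on entering s.
    """
    goto, hits = [{}], [0]
    for w in words:                      # build the trie
        s = 0
        for c in w:
            if c not in goto[s]:
                goto.append({})
                hits.append(0)
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        hits[s] += 1
    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    queue = deque()
    for c in alphabet:                   # transitions out of the root
        delta[0][c] = goto[0].get(c, 0)
        if c in goto[0]:
            queue.append(goto[0][c])
    while queue:                         # breadth-first completion
        s = queue.popleft()
        hits[s] += hits[fail[s]]         # words ending as a proper suffix
        for c in alphabet:
            if c in goto[s]:
                t = goto[s][c]
                fail[t] = delta[fail[s]][c]
                delta[s][c] = t
                queue.append(t)
            else:
                delta[s][c] = delta[fail[s]][c]
    return delta, hits


def tail_probability(words, alphabet, length, n_obs):
    """P(N >= n_obs) for overlapping counts in an i.i.d. uniform text."""
    if n_obs <= 0:
        return 1.0
    delta, hits = build_automaton(words, alphabet)
    p = 1.0 / len(alphabet)
    # dist[s][k]: probability of state s with k occurrences (k capped at n_obs)
    dist = [[0.0] * (n_obs + 1) for _ in delta]
    dist[0][0] = 1.0
    for _ in range(length):
        new = [[0.0] * (n_obs + 1) for _ in delta]
        for s, row in enumerate(dist):
            for k, mass in enumerate(row):
                if mass == 0.0:
                    continue
                for c in alphabet:
                    t = delta[s][c]
                    new[t][min(n_obs, k + hits[t])] += mass * p
        dist = new
    return sum(row[n_obs] for row in dist)


# Pattern "AA" in a uniform text of length 3 over {A, B}:
# AAA, AAB and BAA contain it, so P(N >= 1) = 3/8.
print(tail_probability(["AA"], "AB", 3, 1))
```

The automaton state records just enough of the recent text to know when a word ends, so the pair (state, count) evolves as a Markov chain whose distribution can be propagated letter by letter. Capping the count at n_obs keeps the table small, since only the tail probability is needed.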

More details

Last edited 01/13/2012