Lempel-Ziv complexity (LZc) is a measure associated with sequences of symbols, and is related (but not identical) to Kolmogorov complexity. Note that it does not necessarily measure "complexity" in the colloquial sense; rather, it measures roughly how simple a sequence is to describe, or how "compressible" a sequence is (indeed, Lempel-Ziv algorithms are widely used for compressing data). The more "random" a sequence, the higher it tends to score.
The fLZc (fast Lempel-Ziv complexity) library is written in C, with a Matlab interface. LZc comes in many flavours; fLZc computes the canonical LZ76c 1 and LZ78c 2 versions, with optional normalisation.
You will need the Make build tool installed on your system, a suitable C99 compiler (e.g., GCC or Clang), and the Matlab mex
and makemex
executables on your system execution path. You will also need to set up an environmental variable MATLAB_PATH
which contains the root path of your Matlab installation. 3
To build the required binary code, in a terminal navigate to the <fLZc_root>/C
directory and run
make clean && make
This builds some C standalone executables (for testing and benchmarking) in the <fLZc_root>/C/bin
directory, and the Matlab .mex C interface binaries in the <fLZc_root>/matlab/mex
directory.
For availability of LZc normalisation, download the file fLZc_data.zip, and unzip into the <fLZc_root>/matlab/data
directory.
Start Matlab in the <fLZc_root>/matlab
directory (or make sure at least that <fLZc_root>/matlab/startup.m
is run), and run
clear; LZc_demo
LZ76c is implemented as a slightly amended version of the algorithm given in 4. LZ78c is implemented using the excellent hash-map khashl as a dictionary. Alphabet size is arbitrary, although normalisation (by mean LZc of random sequences) is only currently available up to alphabets of <= 10 symbols. There is also a Matlab utlity for "quantisation" (discretisation by quantiles) of continuous-valued data; in particular, by default binarisation is around the median rather than the mean (this arguably makes more sense anyway). Versions of both LZ76c and LZ78c are available that optionally return the actual dictionary, and/or "running" LZc values along the sequence length of the supplied symbol sequence.
For long sequences, LZ78c is considerably faster; in the worst-case scenario it scales in approximately linear time with sequence length, although the memory overhead may be high for very long sequences. LZ76c scales in approximately quadratic time in the worst-case scenario, but has a negligible memory footprint.
LZ76c is arguably somewhat closer to complexity in the Kolmogorov sense; LZ78c, by contrast, may assign surprisingly large complexity values to sequences containing long subsequences of repeated symbols.
Lionel Barnett, Department of Informatics, University of Sussex, UK.
email: lionelb@sussex.ac.uk
Footnotes
-
A. Lempel and J. Ziv, On the Complexity of Finite Sequences, IEEE Trans. Info Theor. 22(1), 1976. ↩
-
A. Lempel and J. Ziv, Compression of Individual Sequences via Variable-Rate Coding, IEEE Trans. Info Theor. 24(5), 1978. ↩
-
Building targets POSIX environments, and has been tested on Linux and MacOS. Not so sure about Windows (note that Visual Studio for Windows does not support C99; you are probably better off with MinGW). ↩
-
F. Kaspar and H. G. Schuster, Easily calculable measure for the complexity of spatiotemporal patterns, Phys. Rev. A 36(2), 1987 ↩