Population

In addition to generating trees from a specific start rule, Grammarinator also provides support for two evolutionary operators: mutate() and recombine(). These operators require Grammarinator to maintain a set of trees, known as the population. The population can be created by either processing existing sources or by generating trees from scratch.

Grammarinator supports multiple tree serialization formats to represent population members. These formats determine how the population is stored on disk and consumed by various generation or fuzzing tools.

Supported Tree Formats

  1. FlatBuffer-encoded trees (.grtf):

    • Recommended format for both Python and C++ workflows.

    • Compact and fast to read/write.

    • Cross-language compatible (e.g., usable from Python, C++, etc. with FlatBuffer bindings).

    • Supported by both Python and C++ components.

    • Default format when tree codec is not explicitly selected.

  2. JSON-encoded trees (.grtj):

    • Portable and human-readable format.

    • Slower to process than FlatBuffer.

    • Useful for debugging or language-agnostic inspection.

    • Supported by both Python and C++ components.

  3. Pickle-encoded trees (.grtp):

    • Python-specific format based on the pickle module.

    • Not portable across languages or even Python versions.

    • Only usable with grammarinator-generate.

    • Retained primarily for backward compatibility and prototyping.

When creating or using a population, the appropriate format must be specified consistently across tools using the --tree-format flag.
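The relationship between file extensions and --tree-format values can be captured in a small helper, which is handy when a script has to pick the right flag for an existing population directory. This is only a sketch and not part of Grammarinator: the names `TREE_FORMAT_BY_EXTENSION`, `tree_format_for` and `common_tree_format` are hypothetical, while the extensions and the format names (flatbuffers, json, pickle) are the ones used in this document.

```python
from pathlib import Path

# Extension -> --tree-format value, following the format list above.
TREE_FORMAT_BY_EXTENSION = {
    ".grtf": "flatbuffers",
    ".grtj": "json",
    ".grtp": "pickle",
}


def tree_format_for(path):
    """Return the --tree-format value matching a tree file's extension."""
    suffix = Path(path).suffix.lower()
    try:
        return TREE_FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"not a known tree file extension: {suffix!r}") from None


def common_tree_format(paths):
    """Return the single format shared by all paths, or raise if they are mixed."""
    formats = {tree_format_for(p) for p in paths}
    if len(formats) != 1:
        raise ValueError(f"expected exactly one tree format, found: {sorted(formats)}")
    return formats.pop()
```

Calling `common_tree_format` on the files of a population before invoking a tool is one way to catch a mixed-format directory early, instead of discovering it through decoding errors.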

Population Creation From Existing Sources

The grammarinator-parse utility creates an initial set of trees from real tests or from any other input that was not necessarily generated by a fuzzer. This makes it possible to incorporate real-world scenarios or specific test cases into the population, and then to apply evolutionary operators to them to generate variations and explore different test cases.

The CLI of grammarinator-parse
usage: python -m grammarinator.parse [-h] [--glob PATTERN [PATTERN ...]]
                                     -g FILE [FILE ...] [-r NAME] [-t NAME]
                                     [--hidden NAME] [--max-depth MAX_DEPTH]
                                     [--strict] [-o DIR] [--parser-dir DIR]
                                     [--lib DIR] [--tree-format NAME]
                                     [--encoding NAME]
                                     [--encoding-errors NAME]
                                     [--disable-cleanup] [-j NUM]
                                     [--antlr FILE] [--sys-path DIR]
                                     [--sys-recursion-limit NUM]
                                     [--log-level LEVEL] [-v] [-q] [--version]
                                     FILE [FILE ...]

Grammarinator: Parser

positional arguments:
  FILE                  input files or directories to process.

options:
  -h, --help            show this help message and exit
  --glob PATTERN [PATTERN ...]
                        wildcard patterns for input files to process
                        (supported wildcards: ?, *, **, [])
  -g, --grammar FILE [FILE ...]
                        ANTLR grammar files describing the expected format of
                        input to parse.
  -r, --rule NAME       name of the rule to start parsing with (default: first
                        parser rule).
  -t, --transformer NAME
                        reference to a transformer (in package.module.function
                        format) to postprocess the parsed tree.
  --hidden NAME         list of hidden tokens to be built into the parsed
                        tree.
  --max-depth MAX_DEPTH
                        maximum expected tree depth (deeper tests will be
                        discarded (default: inf)).
  --strict              discard tests that contain syntax errors.
  -o, --out DIR         directory to save the trees (default: /home/docs/check
                        outs/readthedocs.org/user_builds/grammarinator/checkou
                        ts/stable/docs).
  --parser-dir DIR      directory to save the parser grammars (default:
                        <OUTDIR>/grammars).
  --lib DIR             alternative location of import grammars.
  --tree-format NAME    format of the saved trees (choices: flatbuffers, json,
                        pickle, default: pickle)
  --encoding NAME       input file encoding (default: utf-8).
  --encoding-errors NAME
                        encoding error handling scheme (default: strict).
  --disable-cleanup     disable the removal of intermediate files.
  -j, --jobs NUM        parallelization level (default: number of cpu cores
                        (2)).
  --antlr FILE          path of the ANTLR v4 tool jar file (default:
                        /home/docs/.antlerinator/antlr-4.13.2-complete.jar)
  --sys-path DIR        add directory to the search path for Python modules
                        (may be specified multiple times)
  --sys-recursion-limit NUM
                        override maximum depth of the Python interpreter stack
                        (default: 1000)
  --log-level LEVEL     verbosity level of diagnostic messages (TRACE, DEBUG,
                        INFO, WARNING, ERROR, CRITICAL, DISABLE; default:
                        INFO)
  -v, --verbose         verbose mode (alias for --log-level DEBUG)
  -q, --quiet           quiet mode (alias for --log-level DISABLE)
  --version             show program's version number and exit

The tool parses files with ANTLR v4 grammars, builds Grammarinator-compatible
tree representations from them and saves them for further reuse.

The usage of the grammarinator-parse utility is generally straightforward. It takes a set of inputs and processes them with the specified grammars (-g). Inputs can be listed as files or directories (FILE), or specified with file patterns (using --glob). The listed directories are traversed recursively. The start rule, which determines the root of every tree in the population, can be defined using the --rule argument. The --tree-format option controls the serialization format of the output trees. If omitted, the default is flatbuffers (producing .grtf files). After an input is parsed and its tree is created, a transformer (--transformer) can be applied to modify the tree before it is saved to the directory given by --out.
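A typical invocation can be assembled programmatically, for example from a test harness. The sketch below only builds the argument list and does not run anything. The helper name `build_parse_command` and all file, rule and directory names are hypothetical; the flags are the ones listed in the help output above. It assumes that --hidden may be repeated once per token name, and it places -o after -g so that the multi-valued -g option is closed before the positional inputs.

```python
import sys


def build_parse_command(grammars, inputs, out_dir, rule=None, hidden=(),
                        max_depth=None, tree_format=None):
    """Assemble an argv list for `python -m grammarinator.parse`."""
    cmd = [sys.executable, "-m", "grammarinator.parse", "-g", *grammars]
    if rule is not None:
        cmd += ["-r", rule]
    for name in hidden:
        # Assumption: one --hidden flag per hidden token name.
        cmd += ["--hidden", name]
    if max_depth is not None:
        cmd += ["--max-depth", str(max_depth)]
    if tree_format is not None:
        cmd += ["--tree-format", tree_format]
    # -o goes last among the options, followed by the positional inputs.
    cmd += ["-o", str(out_dir), *inputs]
    return cmd


# Hypothetical grammar, rule, and directory names, for illustration only.
cmd = build_parse_command(
    grammars=["HTMLLexer.g4", "HTMLParser.g4"],
    inputs=["seed_tests/"],
    out_dir="population/",
    rule="htmlDocument",
    hidden=["WS"],
    max_depth=40,
    tree_format="flatbuffers",
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to execute the parser.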

There are two settings that may require further explanation:

  • --hidden: When ANTLR tokenizes an input, it sorts the tokens into channels. The hidden channel typically contains tokens that are not important for the parser, such as whitespace or comments; to keep the grammar readable, they are not listed explicitly at every position where they are allowed. However, when working with parse trees, including when generating tests, these “hidden” tokens may become important. To ensure that hidden tokens are added to the tree, the names of the corresponding rules need to be listed using the --hidden argument.

  • --max-depth: Controlling the depth of the generated tree, and therefore the size of the serialized test, is important for both generation and execution performance. This argument sets the maximum allowed depth of a tree. Any input whose tree exceeds this limit is discarded. The grammarinator-generate utility has a corresponding setting that guides the generator and prevents it from generating excessively deep trees.
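The effect of a depth limit can be illustrated on a toy tree representation. This is a sketch under stated assumptions, not Grammarinator's implementation: trees are modeled as hypothetical `(name, children)` tuples rather than Grammarinator's own tree classes, and counting a lone leaf as depth 1 is an assumed convention. The traversal is iterative so that deep trees do not run into the Python recursion limit.

```python
def tree_depth(node):
    """Depth of a (name, children) tree; a lone leaf counts as depth 1 (assumed)."""
    deepest = 0
    stack = [(node, 1)]
    while stack:
        (_name, children), depth = stack.pop()
        deepest = max(deepest, depth)
        stack.extend((child, depth + 1) for child in children)
    return deepest


def within_depth(trees, max_depth=float("inf")):
    """Keep only the trees that do not exceed max_depth; deeper ones are discarded."""
    return [tree for tree in trees if tree_depth(tree) <= max_depth]


# Toy trees: a flat expression and a more deeply nested one.
shallow = ("expr", [("NUM", [])])
deep = ("expr", [("expr", [("expr", [("NUM", [])])])])
kept = within_depth([shallow, deep], max_depth=3)
```

With the default of infinity, every tree is kept, which mirrors the `default: inf` shown for --max-depth in the help output.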

Convert Population Trees to Test Sources

The grammarinator-decode utility decodes the trees of a population, whether encoded using pickle, JSON, or FlatBuffers, into test sources, serializing each tree with the chosen serializer.

The CLI of grammarinator-decode
usage: python -m grammarinator.decode [-h] [--glob PATTERN [PATTERN ...]]
                                      [--ext EXT] [-s NAME] [-o DIR]
                                      [--stdout] [--tree-format NAME]
                                      [--encoding NAME]
                                      [--encoding-errors NAME] [-j NUM]
                                      [--sys-path DIR]
                                      [--sys-recursion-limit NUM]
                                      [--log-level LEVEL] [-v] [-q]
                                      [--version]
                                      FILE [FILE ...]

Grammarinator: Decode

positional arguments:
  FILE                  input files or directories to process

options:
  -h, --help            show this help message and exit
  --glob PATTERN [PATTERN ...]
                        wildcard patterns for input files to process
                        (supported wildcards: ?, *, **, [])
  --ext EXT             extension to use when saving decoded trees (default:
                        .txt).
  -s, --serializer NAME
                        reference to a serializer (in package.module.function
                        format) that takes a tree and produces a string from
                        it.
  -o, --out DIR         directory to save the test cases (default: /home/docs/
                        checkouts/readthedocs.org/user_builds/grammarinator/ch
                        eckouts/stable/docs).
  --stdout              print test cases to stdout (alias for --out='').
  --tree-format NAME    format of the saved trees (choices: flatbuffers, json,
                        pickle, default: pickle)
  --encoding NAME       output file encoding (default: utf-8).
  --encoding-errors NAME
                        encoding error handling scheme (default: strict).
  -j, --jobs NUM        parallelization level (default: number of cpu cores
                        (2)).
  --sys-path DIR        add directory to the search path for Python modules
                        (may be specified multiple times)
  --sys-recursion-limit NUM
                        override maximum depth of the Python interpreter stack
                        (default: 1000)
  --log-level LEVEL     verbosity level of diagnostic messages (TRACE, DEBUG,
                        INFO, WARNING, ERROR, CRITICAL, DISABLE; default:
                        INFO)
  -v, --verbose         verbose mode (alias for --log-level DEBUG)
  -q, --quiet           quiet mode (alias for --log-level DISABLE)
  --version             show program's version number and exit

The tool decodes tree files and serializes them to test cases.

grammarinator-decode processes a set of tree inputs and creates a test representation from them. Inputs can be listed as files or directories (FILE), or specified with file patterns (using --glob). The listed directories are traversed recursively. First, the files are converted to trees using the appropriate tree codec specified by --tree-format. The resulting trees are then serialized using the function defined by --serializer (or str by default). The serialized tests are saved into the --out directory with the --ext extension and encoded with --encoding.

Decoders can be built not only in Python but also in C++, using serializers written in C++. To do so, the --decode argument has to be provided to the build script. When converting an output corpus generated by either the libFuzzer integration or the AFL++ integration, it is recommended to use these C++ decoders: when built with the same configuration, they reproduce exactly the same test cases that were observed during fuzzing.