The Dokuwiki Parser

This document explains the details of the dokuwiki parser and is intended for developers who want to modify the parser's behaviour or gain control over the output document, perhaps modifying the generated HTML or implementing different output formats.

Overview

The parser breaks down the process of transforming a raw dokuwiki document to the final output document (normally XHTML) into discrete stages. Each stage is represented by one or more PHP classes.

Broadly these elements are;

  1. Lexer1): scans 2) a DokuWiki document and outputs a sequence of “tokens” 3), corresponding to the syntax in the document
  2. Handler4): receives the tokens from the Lexer and transforms them into a sequence of “instructions” 5). These instructions describe how the document should be rendered, from start to finish, without the Renderer needing to keep track of state.
  3. Parser6): “connects up” the Lexer with the Handler, providing the dokuwiki syntax rules as well as the point of access to the system (the Parser::parse() method)
  4. Renderer7): accepts the instructions from the Handler and “draws” the document ready for viewing (e.g. as XHTML)

No mechanism is provided for connecting up the Handler with the Renderer - this needs coding per specific use case.

A rough diagram of the relationships between these components;

      +-----------+          +-----------+
      |           |  Input   |  Client   |
      |  Parser   |<---------|   Code    |
      |           |  String  |           |
      +-----.-----+          +-----|-----+
    Modes   |                     /|\
      +     |             Renderer |
    Input   |          Instructions|
    String \|/                     |
      +-----'-----+          +-----------+
      |           |          |           |
      |  Lexer    |--------->|  Handler  |
      |           |  Tokens  |           |
      +-----.-----+          +-----------+
            |
            |
       +----+---+
       | Modes  |-+
       +--------+ |-+
         +--------+ |
           +--------+

The “Client Code” (code using the Parser) invokes the Parser, giving it the input string. It receives, in return, the list of “Renderer Instructions”, built by the Handler. These can then be fed to some object which implements the Renderer.

Note: A critical point behind this design is the intent to allow the Renderer to be as “dumb” as possible. It should not need to make further interpretation / modification of the instructions it is given but be purely concerned with rendering some kind of output (e.g. XHTML) - in particular the Renderer should not need to keep track of state. By keeping to this principle, aside from making Renderers fairly easy to implement (the focus being purely on what to output), it will also make it possible for Renderers to be interchangeable (e.g. output PDF as alternative to XHTML). At the same time, the instructions output from the Handler are geared for rendering XHTML and may not be entirely suited for all output formats.

Lexer

Defined in inc/parser/lexer.php

In the most general sense, it provides a tool for managing complex regular expressions, where state is important. The Lexer comes from Simple Test but contains three modifications (read: hacks);

In short, Simple Test’s lexer acts as a tool to make regular expressions easy to manage - rather than giant regexes you write many small / simple ones. The lexer takes care of combining them efficiently then gives you a SAX-like callback API to allow you to write code to respond to matched “events”.

The Lexer as a whole is made of three main classes;

The need for state

The wiki syntax used in dokuwiki contains markup, “inside” of which only certain syntax rules apply. The most obvious example is the <code/> tag, inside of which no other wiki syntax should be recognized by the Lexer. Other syntax, such as the list or table syntax should allow some markup but not others e.g. you can use links in a list context but not tables.

The Lexer provides “state awareness” allowing it to apply the correct syntax rules depending on its current position (the context) in the text it's scanning. If it sees an opening <code> tag, it should switch to a different state within which no other syntax rules apply (i.e. anything that would normally look like wiki syntax should be treated as “dumb” text) until it finds the closing </code> tag.
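
The idea can be illustrated with a minimal sketch that does not use the DokuWiki Lexer at all. The function name sketchScan and the two state names are hypothetical; the sketch only recognises ** as markup in the base state and treats everything between <code> and </code> as plain text.

```php
<?php
// Hypothetical two-state scanner: 'base' recognises markup, 'code' does not.
function sketchScan(string $text): array
{
    $tokens = [];
    $state  = 'base';
    $pos    = 0;
    $len    = strlen($text);

    while ($pos < $len) {
        // The active state decides which patterns are tried.
        $pattern = ($state === 'base')
            ? '/<code>|\*\*/'
            : '/<\/code>/';

        if (!preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE, $pos)) {
            // No more markup in this state: the rest is plain text.
            $tokens[] = [$state, 'text', substr($text, $pos)];
            break;
        }

        [$match, $at] = $m[0];
        if ($at > $pos) {
            // Text between the previous position and the match.
            $tokens[] = [$state, 'text', substr($text, $pos, $at - $pos)];
        }

        if ($match === '<code>') {
            $state = 'code';
        } elseif ($match === '</code>') {
            $state = 'base';
        } else {
            $tokens[] = [$state, 'markup', $match];
        }
        $pos = $at + strlen($match);
    }
    return $tokens;
}

$tokens = sketchScan('a **b** <code>**c**</code>');
```

Inside the code state the ** characters come back as ordinary text, because the only pattern tried there is the closing tag.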

Lexer Modes

The term mode is a label for a particular lexing state8). The code using the Lexer registers one or more regex patterns with a particular named mode. Then, as the Lexer matches those patterns against the text it is scanning, it calls functions on the Handler with the same name as the mode (unless the mapHandler method was used to create an alias - see below).

The Lexer API

A short introduction to the Lexer can be found at Simple Test Lexer Notes. This section provides more detail.

The key methods in the Lexer are;

Constructor

Accepts an object reference to the Handler, a name of the initial mode that the Lexer should start in and (optionally) a boolean flag as to whether pattern matching should be case sensitive.

Example;

// Objects are handles in current PHP, so "= & new" (a parse error since PHP 7) is not needed
$Handler = new MyHandler();
$Lexer = new Doku_Lexer($Handler, 'base', TRUE);

Here the initial mode is called 'base'.

addEntryPattern / addExitPattern

Used to register a pattern for entering and exiting a particular parsing mode. For example;

// arg0: regex to match - note no need to add start/end pattern delimiters
// arg1: name of mode where this entry pattern may be used
// arg2: name of mode to enter
$Lexer->addEntryPattern('<file>','base','file');
 
// arg0: regex to match
// arg1: name of mode to exit
$Lexer->addExitPattern('</file>','file');

The above would allow the <file/> tag to be used from the base mode to enter a new mode (called file). If further modes should be applied while the Lexer is inside the file mode, these would need to be registered with the file mode.

Note: there's no need to use pattern start and end delimiters.

addPattern

Used to trigger additional “tokens” inside an existing mode (no transitions). It accepts a pattern and the name of a mode it should be used inside.

This is best seen by considering the list syntax in the parser. List syntax looks like this in dokuwiki;

Before the list
  * Unordered List Item
  * Unordered List Item
  * Unordered List Item
After the list

Using addPattern it becomes possible to match the complete list at once while still exiting correctly and tokenizing each list item;

// Match the opening list item and change mode
$Lexer->addEntryPattern('\n {2,}[\*]','base','list');
 
// Match new list items but stay in the list mode
$Lexer->addPattern('\n {2,}[\*]','list');
 
// If it's a linefeed that fails to match the above addPattern rule, exit the mode
$Lexer->addExitPattern('\n','list');

addSpecialPattern

Used to enter a new mode just for the match then drop straight back into the “parent” mode. Accepts a pattern, a name of a mode it can be applied inside and the name of the “temporary” mode to enter for the match. Typically this would be used if you want to substitute wiki markup with something else. For example to match a smiley like :-) you might have;

$Lexer->addSpecialPattern(':-)','base','smiley');

mapHandler

Allows a particular named mode to be mapped onto a method with a different name in the Handler. This may be useful when differing syntax should be handled in the same manner, such as the dokuwiki syntax for disabling other syntax inside a particular text block;

$Lexer->addEntryPattern('<nowiki>','base','unformatted');
$Lexer->addEntryPattern('%%','base','unformattedalt');
$Lexer->addExitPattern('</nowiki>','unformatted');
$Lexer->addExitPattern('%%','unformattedalt');
 
// Both syntaxes should be handled the same way...
$Lexer->mapHandler('unformattedalt','unformatted');
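
How such an alias might be resolved can be sketched as follows. The classes SketchHandler and SketchDispatcher are hypothetical and only mimic the behaviour described above; they are not the Lexer's implementation, and the state value 1 is a placeholder.

```php
<?php
// Hypothetical handler that records which method was called.
class SketchHandler
{
    public $calls = [];

    public function unformatted($match, $state, $pos)
    {
        $this->calls[] = ['unformatted', $match];
        return true;
    }
}

// Hypothetical dispatcher: maps a mode name to a handler method name.
class SketchDispatcher
{
    private $handler;
    private $aliases = [];

    public function __construct($handler)
    {
        $this->handler = $handler;
    }

    public function mapHandler($mode, $method)
    {
        $this->aliases[$mode] = $method;
    }

    public function dispatch($mode, $match, $state, $pos)
    {
        // Use the alias if one was registered, otherwise the mode name itself.
        $method = $this->aliases[$mode] ?? $mode;
        return $this->handler->$method($match, $state, $pos);
    }
}

$handler    = new SketchHandler();
$dispatcher = new SketchDispatcher($handler);
$dispatcher->mapHandler('unformattedalt', 'unformatted');

$dispatcher->dispatch('unformatted', '<nowiki>', 1, 0);
$dispatcher->dispatch('unformattedalt', '%%', 1, 20);
```

Both calls end up in the same unformatted method, so the Handler needs only one implementation for the two syntaxes.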

Subpatterns Not Allowed

Because the Lexer itself uses subpatterns (inside the ParallelRegex class), code using the Lexer cannot. This may take some getting used to but, generally, the addPattern method can be applied to solve the types of problems where subpatterns are typically used. It has the advantage of keeping regexes simpler and thereby easier to manage.

Note: If you do use parentheses in your pattern, they will automatically be escaped by the lexer.

Syntax Errors and State

To prevent “badly formed” (in particular a missing closing tag) markup causing the Lexer to enter a state (mode) which it never leaves, it can be useful to use a lookahead pattern to check for the closing markup first9). For example;

// Use lookahead in entry pattern...
$Lexer->addEntryPattern('<file>(?=.*\x3C/file\x3E)','base','file');
$Lexer->addExitPattern('</file>','file');

The entry pattern checks it can find a closing </file> tag before it enters the state.

Note the use of hex characters in the lookahead is required because there's (probably) a bug in the Lexer hack to make lookaheads possible. Needs investigation.
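
The effect of the lookahead can be checked with plain preg_match, outside the Lexer. This sketch assumes the s modifier so that . also matches linefeeds; how the Lexer itself compiles and delimits patterns is not shown here.

```php
<?php
// Entry pattern with a lookahead for the closing tag (plain PCRE, not the Lexer).
$entry = '/<file>(?=.*<\/file>)/s';

$closed   = "before <file>some\ntext</file> after";
$unclosed = "before <file>some\ntext after";

// preg_match returns 1 when the closing tag exists further on, 0 when it is missing.
$matchesClosed   = preg_match($entry, $closed);
$matchesUnclosed = preg_match($entry, $unclosed);
```

With the closing tag missing, the entry pattern never matches, so the text would stay in the surrounding mode.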

Handler

Defined in inc/parser/handler.php

The Handler is a class providing methods which are called by the Lexer as it matches tokens. It then “fine tunes” the tokens into a sequence of instructions ready for a Renderer.

The Handler as a whole contains the following classes:

Handler Token Methods

The Handler must provide methods named corresponding to the modes registered with the Lexer (bearing in mind the Lexer mapHandler() method - see above).

For example if you registered a file mode with the Lexer like;

$Lexer->addEntryPattern('<file>(?=.*\x3C/file\x3E)','base','file');
$Lexer->addExitPattern('</file>','file');

The Handler will need a method like;

class Doku_Handler {
 
    /**
    * @param string match contains the text that was matched
    * @param int state - the type of match made (see below)
    * @param int pos - byte index where match was made
    */
    function file($match, $state, $pos) {
        return TRUE;
    }
}

Note: a Handler method must return TRUE or the Lexer will halt immediately. This behaviour can be useful when dealing with other types of parsing problem but for the DokuWiki parser, all Handler methods will always return TRUE.

The arguments provided to a handler method are;

As a more complex example, in the Parser the following is defined for matching lists;

    function connectTo($mode) {
        $this->Lexer->addEntryPattern('\n {2,}[\-\*]',$mode,'listblock');
        $this->Lexer->addEntryPattern('\n\t{1,}[\-\*]',$mode,'listblock');
 
        $this->Lexer->addPattern('\n {2,}[\-\*]','listblock');
        $this->Lexer->addPattern('\n\t{1,}[\-\*]','listblock');
 
    }
 
    function postConnect() {
        $this->Lexer->addExitPattern('\n','listblock');
    }

The listblock method in the Handler 10), looks like;

    function listblock($match, $state, $pos) {
 
        switch ( $state ) {
 
            // The start of the list...
            case DOKU_LEXER_ENTER:
                // Create the List rewriter, passing in the current CallWriter
                $ReWriter = new Doku_Handler_List($this->CallWriter);
 
                // Replace the current CallWriter with the List rewriter
                // all incoming tokens (even if not list tokens)
                // are now diverted to the list
                $this->CallWriter = & $ReWriter;
 
                $this->__addCall('list_open', array($match), $pos);
            break;
 
            // The end of the list
            case DOKU_LEXER_EXIT:
                $this->__addCall('list_close', array(), $pos);
 
                // Tell the List rewriter to clean up
                $this->CallWriter->process();
 
                // Restore the old CallWriter
                $ReWriter = & $this->CallWriter;
                $this->CallWriter = & $ReWriter->CallWriter;
 
            break;
 
            case DOKU_LEXER_MATCHED:
                $this->__addCall('list_item', array($match), $pos);
            break;
 
            case DOKU_LEXER_UNMATCHED:
                $this->__addCall('cdata', array($match), $pos);
            break;
        }
        return TRUE;
    }

Token Conversion

Part of the fine tuning, performed by the handler, involves inserting / renaming or removing tokens provided by the Lexer.

For example, a list like;

This is not a list
  * This is the opening list item
  * This is the second list item
  * This is the last list item
This is also not a list

Would result in sequence of tokens something like;

  1. base: "This is not a list", DOKU_LEXER_UNMATCHED
  2. listblock: "\n  *", DOKU_LEXER_ENTER
  3. listblock: " This is the opening list item", DOKU_LEXER_UNMATCHED
  4. listblock: "\n  *", DOKU_LEXER_MATCHED
  5. listblock: " This is the second list item", DOKU_LEXER_UNMATCHED
  6. listblock: "\n  *", DOKU_LEXER_MATCHED
  7. listblock: " This is the last list item", DOKU_LEXER_UNMATCHED
  8. listblock: "\n", DOKU_LEXER_EXIT
  9. base: "This is also not a list", DOKU_LEXER_UNMATCHED

But to be useful to the Renderer, this has to be converted to the following instructions;

  1. p_open:
  2. cdata: "This is not a list"
  3. p_close:
  4. listu_open:
  5. listitem_open:
  6. cdata: " This is the opening list item"
  7. listitem_close:
  8. listitem_open:
  9. cdata: " This is the second list item"
  10. listitem_close:
  11. listitem_open:
  12. cdata: " This is the last list item"
  13. listitem_close:
  14. listu_close:
  15. p_open:
  16. cdata: "This is also not a list"
  17. p_close:

In the case of lists, this requires the help of the Doku_Handler_List class, which has its own knowledge of state and captures the incoming tokens, replacing them with the correct instructions for a Renderer.
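
A much simplified sketch of this kind of conversion is shown below. It is not the Doku_Handler_List implementation: the function sketchConvert, the string state names and the flat (non-nested, unordered only) list handling are assumptions made for illustration.

```php
<?php
// Hypothetical converter: turns [mode, match, state] tokens into instruction names.
function sketchConvert(array $tokens): array
{
    $out = [];
    foreach ($tokens as [$mode, $match, $state]) {
        if ($mode === 'base') {
            // Plain text outside the list becomes a paragraph.
            array_push($out, 'p_open', 'cdata', 'p_close');
            continue;
        }
        switch ($state) {
            case 'enter':     // first list item
                array_push($out, 'listu_open', 'listitem_open');
                break;
            case 'matched':   // a further list item
                array_push($out, 'listitem_close', 'listitem_open');
                break;
            case 'unmatched': // text of the current item
                $out[] = 'cdata';
                break;
            case 'exit':      // end of the list
                array_push($out, 'listitem_close', 'listu_close');
                break;
        }
    }
    return $out;
}

$tokens = [
    ['base', 'This is not a list', 'unmatched'],
    ['listblock', "\n  *", 'enter'],
    ['listblock', ' This is the opening list item', 'unmatched'],
    ['listblock', "\n  *", 'matched'],
    ['listblock', ' This is the second list item', 'unmatched'],
    ['listblock', "\n", 'exit'],
    ['base', 'This is also not a list', 'unmatched'],
];
$instructions = sketchConvert($tokens);
```

The point of the sketch is that one incoming token can expand into several instructions, and that the converter, not the Renderer, decides when an item or the list itself has to be closed.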

Parser

The Parser acts as the front end to external code and sets up the Lexer with the patterns and modes describing dokuwiki syntax.

Using the Parser will generally look like;

// Create the parser
$Parser = new Doku_Parser();
 
// Create the handler and store in the parser
$Parser->Handler = new Doku_Handler();
 
// Add required syntax modes to parser
$Parser->addMode('footnote',new Doku_Parser_Mode_Footnote());
$Parser->addMode('hr',new Doku_Parser_Mode_HR());
$Parser->addMode('unformatted',new Doku_Parser_Mode_Unformatted());
# etc.
 
$doc = file_get_contents('wikipage.txt');
$instructions = $Parser->parse($doc);

More detailed examples are below.

As a whole, the Parser also contains classes representing each available syntax mode, the base class for all of these being Doku_Parser_Mode. The behaviour of these modes is best understood by looking at the examples of adding syntax later in this document.

The reason for representing the modes with classes is to avoid repeated calls to the Lexer methods. Without them it would be necessary to hard code each pattern rule for every mode that pattern can be matched in, for example, registering a single pattern rule for the CamelCase link syntax would require something like;

$Lexer->addSpecialPattern('\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b','base','camelcaselink');
$Lexer->addSpecialPattern('\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b','footnote','camelcaselink');
$Lexer->addSpecialPattern('\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b','table','camelcaselink');
$Lexer->addSpecialPattern('\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b','listblock','camelcaselink');
$Lexer->addSpecialPattern('\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b','strong','camelcaselink');
$Lexer->addSpecialPattern('\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b','underline','camelcaselink');
// etc.

Each mode that is allowed to contain CamelCase links must be explicitly named.

Rather than hard coding this, instead it is implemented using a single class like;

class Doku_Parser_Mode_CamelCaseLink extends Doku_Parser_Mode {
 
    function connectTo($mode) {
        $this->Lexer->addSpecialPattern(
                '\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b',$mode,'camelcaselink'
            );
    }
 
}

When setting up the Lexer, the Parser calls the connectTo method on the Doku_Parser_Mode_CamelCaseLink object for every other mode which accepts the CamelCase syntax (some don't such as the <code /> syntax).

At the expense of making the Lexer setup harder to understand, this allows the code to be more flexible when adding new syntax.
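
The wiring can be sketched without the real classes. Everything here is hypothetical: SketchLexer only records registrations, SketchCamelCaseMode stands in for the mode class, and the list of accepting modes is an assumption.

```php
<?php
// Hypothetical lexer stub that records every registered pattern.
class SketchLexer
{
    public $registered = [];

    public function addSpecialPattern($pattern, $mode, $special)
    {
        $this->registered[] = [$mode, $special];
    }
}

// Hypothetical mode object that keeps its pattern in one place.
class SketchCamelCaseMode
{
    public $Lexer;

    public function connectTo($mode)
    {
        $this->Lexer->addSpecialPattern(
            '\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b', $mode, 'camelcaselink'
        );
    }
}

$lexer = new SketchLexer();
$mode  = new SketchCamelCaseMode();
$mode->Lexer = $lexer;

// One connectTo call per mode that may contain CamelCase links.
foreach (['base', 'footnote', 'table', 'listblock'] as $accepting) {
    $mode->connectTo($accepting);
}
```

Adding a new container mode then only means extending the list of accepting modes; the pattern itself is written once.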

Instructions Data Format

The following shows an example of raw wiki text and the corresponding output from the parser;

The raw text (contains a table);

abc
| Row 0 Col 1    | Row 0 Col 2     | Row 0 Col 3        |
| Row 1 Col 1    | Row 1 Col 2     | Row 1 Col 3        |
def

When parsed the following PHP array is returned (described below);

Array
(
    [0] => Array
        (
            [0] => document_start
            [1] => Array
                (
                )

            [2] => 0
        )

    [1] => Array
        (
            [0] => p_open
            [1] => Array
                (
                )

            [2] => 0
        )

    [2] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] => 

abc
                )

            [2] => 0
        )

    [3] => Array
        (
            [0] => p_close
            [1] => Array
                (
                )

            [2] => 5
        )

    [4] => Array
        (
            [0] => table_open
            [1] => Array
                (
                    [0] => 3
                    [1] => 2
                )

            [2] => 5
        )

    [5] => Array
        (
            [0] => tablerow_open
            [1] => Array
                (
                )

            [2] => 5
        )

    [6] => Array
        (
            [0] => tablecell_open
            [1] => Array
                (
                    [0] => 1
                    [1] => left
                )

            [2] => 5
        )

    [7] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] =>  Row 0 Col 1
                )

            [2] => 7
        )

    [8] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] =>     
                )

            [2] => 19
        )

    [9] => Array
        (
            [0] => tablecell_close
            [1] => Array
                (
                )

            [2] => 23
        )

    [10] => Array
        (
            [0] => tablecell_open
            [1] => Array
                (
                    [0] => 1
                    [1] => left
                )

            [2] => 23
        )

    [11] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] =>  Row 0 Col 2
                )

            [2] => 24
        )

    [12] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] =>      
                )

            [2] => 36
        )

    [13] => Array
        (
            [0] => tablecell_close
            [1] => Array
                (
                )

            [2] => 41
        )

    [14] => Array
        (
            [0] => tablecell_open
            [1] => Array
                (
                    [0] => 1
                    [1] => left
                )

            [2] => 41
        )

    [15] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] =>  Row 0 Col 3
                )

            [2] => 42
        )

    [16] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] =>         
                )

            [2] => 54
        )

    [17] => Array
        (
            [0] => tablecell_close
            [1] => Array
                (
                )

            [2] => 62
        )

    [18] => Array
        (
            [0] => tablerow_close
            [1] => Array
                (
                )

            [2] => 63
        )

    [19] => Array
        (
            [0] => tablerow_open
            [1] => Array
                (
                )

            [2] => 63
        )

    [20] => Array
        (
            [0] => tablecell_open
            [1] => Array
                (
                    [0] => 1
                    [1] => left
                )

            [2] => 63
        )

    [21] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] =>  Row 1 Col 1
                )

            [2] => 65
        )

    [22] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] =>     
                )

            [2] => 77
        )

    [23] => Array
        (
            [0] => tablecell_close
            [1] => Array
                (
                )

            [2] => 81
        )

    [24] => Array
        (
            [0] => tablecell_open
            [1] => Array
                (
                    [0] => 1
                    [1] => left
                )

            [2] => 81
        )

    [25] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] =>  Row 1 Col 2
                )

            [2] => 82
        )

    [26] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] =>      
                )

            [2] => 94
        )

    [27] => Array
        (
            [0] => tablecell_close
            [1] => Array
                (
                )

            [2] => 99
        )

    [28] => Array
        (
            [0] => tablecell_open
            [1] => Array
                (
                    [0] => 1
                    [1] => left
                )

            [2] => 99
        )

    [29] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] =>  Row 1 Col 3
                )

            [2] => 100
        )

    [30] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] =>         
                )

            [2] => 112
        )

    [31] => Array
        (
            [0] => tablecell_close
            [1] => Array
                (
                )

            [2] => 120
        )

    [32] => Array
        (
            [0] => tablerow_close
            [1] => Array
                (
                )

            [2] => 121
        )

    [33] => Array
        (
            [0] => table_close
            [1] => Array
                (
                )

            [2] => 121
        )

    [34] => Array
        (
            [0] => p_open
            [1] => Array
                (
                )

            [2] => 121
        )

    [35] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] => def

                )

            [2] => 122
        )

    [36] => Array
        (
            [0] => p_close
            [1] => Array
                (
                )

            [2] => 122
        )

    [37] => Array
        (
            [0] => document_end
            [1] => Array
                (
                )

            [2] => 122
        )

)

The top level array is simply a list. Each of its child elements describes a callback function to be executed against the Renderer (see description of the Renderer below) as well as the byte index in the raw input text where that particular “element” of wiki syntax was found.

A Single Instruction

Considering a single child element (which represents a single instruction) from the above list of instructions;

    [35] => Array
        (
            [0] => cdata
            [1] => Array
                (
                    [0] => def

                )

            [2] => 122
        )

The first element (index 0 ) is the name of a method or function in the Renderer to execute.

The second element (index 1) is itself an array, each of its elements being the arguments for the Renderer method that will be called.

In this case there is a single argument with the value "def\n", so the method call would be like;

$Renderer->cdata("def\n");

The third element (index 2) is the byte index of the first character that “triggered” this instruction in the raw text document. It should be the same as the value returned by PHP's strpos function. This can be used to retrieve sections of the raw wiki text, based on the positions of the instructions generated from it (example later).

Note: The Parser's parse method pads the raw wiki text with a preceding and a following linefeed character, to make sure particular Lexer states exit correctly, so you may need to subtract 1 from the byte index to get the correct location in the original raw wiki text. The Parser also normalizes linefeeds to Unix style (i.e. all \r\n becomes \n) so the document the Lexer sees may be smaller than the one you actually fed it.
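
As a sketch of that adjustment: the helper name sketchOriginalOffset is hypothetical, and the code assumes that exactly one leading linefeed was added and that the raw text before the position contains no \r\n sequences (each of which would shift the offset by one more byte).

```php
<?php
// Hypothetical helper: map a byte index in the padded text back to the raw text.
function sketchOriginalOffset(int $paddedPos): int
{
    // One leading "\n" was added, so positions are shifted by 1.
    return max(0, $paddedPos - 1);
}

$raw    = "abc\ndef\n";
$padded = "\n" . str_replace("\r\n", "\n", $raw) . "\n";

$posInPadded = strpos($padded, 'def');   // position as the Lexer would see it
$posInRaw    = sketchOriginalOffset($posInPadded);
```

An alternative that avoids the arithmetic is to pad and normalize your own copy of the document in the same way and use the byte indexes against that copy.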

An example of the instruction array of the syntax page can be found here

Renderer

The Renderer is a class which you define (a collection of functions can also be used). The interface is defined in inc/parser/renderer.php and looks like;

<?php
class Doku_Renderer {
 
    // snip
 
    function header($text, $level) {}
 
    function section_open($level) {}
 
    function section_close() {}
 
    function cdata($text) {}
 
    function p_open() {}
 
    function p_close() {}
 
    function linebreak() {}
 
    function hr() {}
 
    // snip
}

It is used to document the Renderer although it could also be extended if you wanted to write a Renderer which only captures certain calls.

The basic principle for how the instructions, returned from the parser, are used against a Renderer is similar to the notion of a SAX XML API - the instructions are a list of function / method names and their arguments. Looping through the list of instructions, each instruction can be called against the Renderer (i.e. the methods provided by the Renderer are callbacks). Unlike the SAX API, where only a few, fairly general, callbacks are available (e.g. tag_start, tag_end, cdata etc.), the Renderer defines a more explicit API, where the methods typically correspond one-to-one with the act of generating the output. In the section of the Renderer shown above, the p_open and p_close methods would be used to output the tags <p> and </p> in XHTML, respectively, while the header function takes two arguments - some text to display and the “level” of the header so a call like header('Some Title',1) would be output in XHTML like <h1>Some Title</h1>.

Invoking the Renderer with Instructions

It is left up to the client code using the Parser to execute the list of instructions against a Renderer. Typically this will be done using PHP's call_user_func_array function. For example;

// Get a list of instructions from the parser
$instructions = $Parser->parse($rawDoc);
 
// Create a renderer
$Renderer = new Doku_Renderer_XHTML();
 
// Loop through the instructions
foreach ( $instructions as $instruction ) {
 
    // Execute the callback against the Renderer
    call_user_func_array(array(&$Renderer, $instruction[0]),$instruction[1]);
}
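
The same loop can be tried without DokuWiki installed. In this sketch SketchHeaderCollector is a hypothetical class that does not extend the real Doku_Renderer, and the instruction list is hand-written; the class relies on PHP's __call magic method to accept and ignore every instruction it does not define, which is one way to write a Renderer that only captures certain calls.

```php
<?php
// Hypothetical renderer that only cares about headers.
class SketchHeaderCollector
{
    public $headers = [];

    public function header($text, $level)
    {
        $this->headers[] = str_repeat('#', $level) . ' ' . $text;
    }

    // Any other instruction name is accepted and ignored.
    public function __call($name, $args)
    {
    }
}

// Hand-written instructions in the [method, arguments, byte index] format.
$instructions = [
    ['document_start', [], 0],
    ['header', ['Some Title', 1], 1],
    ['p_open', [], 12],
    ['cdata', ['Some text'], 13],
    ['p_close', [], 22],
    ['header', ['Sub Title', 2], 23],
    ['document_end', [], 40],
];

$renderer = new SketchHeaderCollector();
foreach ($instructions as $instruction) {
    // Execute the callback against the renderer
    call_user_func_array([$renderer, $instruction[0]], $instruction[1]);
}
```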

Renderer Link Methods

The key Renderer methods for handling the different kinds of link are;

Special attention is required for methods which take the $title argument, which represents the visible text of the link, for example;

<a href="http://www.example.com">This is the title</a>

The $title argument can have three possible types of value;

  1. NULL: no title was provided in the wiki document.
  2. string: a plain text string was used as the title
  3. array (hash): an image was used as the title.

If the $title is an array, it will contain associative values describing the image;

$title = array(
    // Could be 'internalmedia' (local image) or 'externalmedia' (offsite image)
    'type'=>'internalmedia',
 
    // The URL to the image (may be a wiki URL or http://static.example.com/img.png)
    'src'=>'wiki:php-powered.png',
 
    // For the alt attribute - a string or NULL
    'title'=>'Powered by PHP',
 
    // 'left', 'right', 'center' or NULL
    'align'=>'right',
 
    // Width in pixels or NULL
    'width'=> 50,
 
    // Height in pixels or NULL
    'height'=>75,
 
    // Whether to cache the image (for external images)
    'cache'=>FALSE,
);
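
A Renderer method might branch on the three cases roughly as follows. The function sketchLinkText and the HTML it produces are hypothetical, and the image src is used as-is rather than being resolved from a wiki ID to a URL.

```php
<?php
// Hypothetical helper: build the visible part of a link from $title.
function sketchLinkText($title, string $url): string
{
    if ($title === null) {
        // No title in the wiki document: fall back to the URL itself.
        return htmlspecialchars($url);
    }
    if (is_array($title)) {
        // An image was used as the title.
        $html = '<img src="' . htmlspecialchars($title['src']) . '"';
        if ($title['title'] !== null) {
            $html .= ' alt="' . htmlspecialchars($title['title']) . '"';
        }
        if ($title['width'] !== null) {
            $html .= ' width="' . (int) $title['width'] . '"';
        }
        if ($title['height'] !== null) {
            $html .= ' height="' . (int) $title['height'] . '"';
        }
        return $html . ' />';
    }
    // Plain text title.
    return htmlspecialchars($title);
}

$plain = sketchLinkText('This is the title', 'http://www.example.com');
$none  = sketchLinkText(null, 'http://www.example.com');
$image = sketchLinkText(
    ['type' => 'internalmedia', 'src' => 'wiki:php-powered.png',
     'title' => 'Powered by PHP', 'align' => 'right',
     'width' => 50, 'height' => 75, 'cache' => false],
    'http://www.example.com'
);
```

Checking for NULL and for an array first leaves the plain string as the default case, so a Renderer that does not support image titles can simply drop the middle branch.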

Examples

The following examples show common tasks that would likely be performed with the parser, as well as raising performance considerations and notes on extending syntax.

Basic Invocation

To invoke the parser with all current modes, and parse the Dokuwiki syntax document;

require_once DOKU_INC . 'parser/parser.php';
 
// Create the parser
$Parser = new Doku_Parser();
 
// Add the Handler
$Parser->Handler = new Doku_Handler();
 
// Load all the modes
$Parser->addMode('listblock',new Doku_Parser_Mode_ListBlock());
$Parser->addMode('preformatted',new Doku_Parser_Mode_Preformatted()); 
$Parser->addMode('notoc',new Doku_Parser_Mode_NoToc());
$Parser->addMode('header',new Doku_Parser_Mode_Header());
$Parser->addMode('table',new Doku_Parser_Mode_Table());
 
$formats = array (
    'strong', 'emphasis', 'underline', 'monospace',
    'subscript', 'superscript', 'deleted',
);
foreach ( $formats as $format ) {
    $Parser->addMode($format,new Doku_Parser_Mode_Formatting($format));
}
 
$Parser->addMode('linebreak',new Doku_Parser_Mode_Linebreak());
$Parser->addMode('footnote',new Doku_Parser_Mode_Footnote());
$Parser->addMode('hr',new Doku_Parser_Mode_HR());
 
$Parser->addMode('unformatted',new Doku_Parser_Mode_Unformatted());
$Parser->addMode('php',new Doku_Parser_Mode_PHP());
$Parser->addMode('html',new Doku_Parser_Mode_HTML());
$Parser->addMode('code',new Doku_Parser_Mode_Code());
$Parser->addMode('file',new Doku_Parser_Mode_File());
$Parser->addMode('quote',new Doku_Parser_Mode_Quote());
 
// These need data files. The get* functions are left to your imagination
$Parser->addMode('acronym',new Doku_Parser_Mode_Acronym(array_keys(getAcronyms())));
$Parser->addMode('wordblock',new Doku_Parser_Mode_Wordblock(array_keys(getBadWords())));
$Parser->addMode('smiley',new Doku_Parser_Mode_Smiley(array_keys(getSmileys())));
$Parser->addMode('entity',new Doku_Parser_Mode_Entity(array_keys(getEntities())));
 
$Parser->addMode('multiplyentity',new Doku_Parser_Mode_MultiplyEntity());
$Parser->addMode('quotes',new Doku_Parser_Mode_Quotes());
 
$Parser->addMode('camelcaselink',new Doku_Parser_Mode_CamelCaseLink());
$Parser->addMode('internallink',new Doku_Parser_Mode_InternalLink());
$Parser->addMode('media',new Doku_Parser_Mode_Media());
$Parser->addMode('externallink',new Doku_Parser_Mode_ExternalLink());
$Parser->addMode('email',new Doku_Parser_Mode_Email());
$Parser->addMode('windowssharelink',new Doku_Parser_Mode_WindowsShareLink());
$Parser->addMode('filelink',new Doku_Parser_Mode_FileLink());
$Parser->addMode('eol',new Doku_Parser_Mode_Eol());
 
// Loads the raw wiki document
$doc = file_get_contents(DOKU_DATA . 'wiki/syntax.txt');
 
// Get a list of instructions
$instructions = $Parser->parse($doc);
 
// Create a renderer
require_once DOKU_INC . 'parser/xhtml.php';
$Renderer = new Doku_Renderer_XHTML();
 
# Load data like smileys into the Renderer here
 
// Loop through the instructions
foreach ( $instructions as $instruction ) {
 
    // Execute the callback against the Renderer
    call_user_func_array(array(&$Renderer, $instruction[0]),$instruction[1]);
}
 
// Display the output
echo $Renderer->doc;

Selecting Text (for sections)

The following shows how to select a range of text from the raw document using instructions from the parser;

// Create the parser
$Parser = new Doku_Parser();
 
// Add the Handler
$Parser->Handler = new Doku_Handler();
 
// Load the header mode to find headers
$Parser->addMode('header',new Doku_Parser_Mode_Header());
 
// Load the modes which could contain markup that might be
// mistaken for a header
$Parser->addMode('listblock',new Doku_Parser_Mode_ListBlock());
$Parser->addMode('preformatted',new Doku_Parser_Mode_Preformatted()); 
$Parser->addMode('table',new Doku_Parser_Mode_Table());
$Parser->addMode('unformatted',new Doku_Parser_Mode_Unformatted());
$Parser->addMode('php',new Doku_Parser_Mode_PHP());
$Parser->addMode('html',new Doku_Parser_Mode_HTML());
$Parser->addMode('code',new Doku_Parser_Mode_Code());
$Parser->addMode('file',new Doku_Parser_Mode_File());
$Parser->addMode('quote',new Doku_Parser_Mode_Quote());
$Parser->addMode('footnote',new Doku_Parser_Mode_Footnote());
$Parser->addMode('internallink',new Doku_Parser_Mode_InternalLink());
$Parser->addMode('media',new Doku_Parser_Mode_Media());
$Parser->addMode('externallink',new Doku_Parser_Mode_ExternalLink());
$Parser->addMode('email',new Doku_Parser_Mode_Email());
$Parser->addMode('windowssharelink',new Doku_Parser_Mode_WindowsShareLink());
$Parser->addMode('filelink',new Doku_Parser_Mode_FileLink());
 
// Loads the raw wiki document
$doc = file_get_contents(DOKU_DATA . 'wiki/syntax.txt');
 
// Get a list of instructions
$instructions = $Parser->parse($doc);
 
// Use this to watch when we're inside the section we want
$inSection = FALSE;
$startPos = 0;
$endPos = 0;
 
// Loop through the instructions
foreach ( $instructions as $instruction ) {
 
    if ( !$inSection ) {
 
        // Look for the header for the "Lists" heading
        if ( $instruction[0] == 'header' &&
                trim($instruction[1][0]) == 'Lists' ) {
 
            $startPos = $instruction[2];
            $inSection = TRUE;
        }
    } else {
 
        // Look for the end of the section
        if ( $instruction[0] == 'section_close' ) {
            $endPos = $instruction[2];
            break;
        }
    }
}
 
// Normalize and pad the document in the same way the parser does
// so that byte indexes will match
$doc = "\n".str_replace("\r\n","\n",$doc)."\n";
 
// Get the text before the section we want
$before = substr($doc, 0, $startPos);
$section = substr($doc, $startPos, ($endPos-$startPos));
$after = substr($doc, $endPos);
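
To see the byte arithmetic on its own, here is a minimal sketch that skips the parser entirely. The $instructions array is hand-written, and strpos() stands in for the positions a real parse would record at index 2. That is an assumption made for illustration; only the normalizing and slicing mirror the example above.

```php
<?php
$raw = "Intro\r\n====== Lists ======\r\n  * item\r\n====== Next ======\r\nrest";

// Normalize and pad as in the example above, so positions refer to this string
$doc = "\n" . str_replace("\r\n", "\n", $raw) . "\n";

// Stand-in for parser output: array(name, arguments, byte position).
// strpos() substitutes for the positions a real parse would record.
$instructions = array(
    array('header', array('Lists', 1), strpos($doc, '====== Lists')),
    array('section_close', array(), strpos($doc, '====== Next')),
);

$startPos = $instructions[0][2];
$endPos   = $instructions[1][2];

$before  = substr($doc, 0, $startPos);
$section = substr($doc, $startPos, $endPos - $startPos);
$after   = substr($doc, $endPos);

// The three pieces reassemble into the normalized document
echo ($before . $section . $after === $doc) ? "ok\n" : "mismatch\n";
echo $section;
```

If the positions were applied to the raw document instead of the normalized one, every "\r\n" before the section and the leading pad would shift the slice, which is why the normalization step matters.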

Managing Data File Input for Patterns

Dokuwiki stores parts of some patterns in external data files (e.g. the smileys). Because the parsing and output of the document are now separate stages, handled by different components, a different approach is required for using this data, compared to earlier parser versions.

For the relevant modes, each accepts a plain list of elements which it builds into a list of patterns for registering with the Lexer.

For example;

// A plain list of smiley tokens...
$smileys = array(
    ':-)',
    ':-(',
    ';-)',
    // etc.
    );
 
// Create the mode
$SmileyMode = new Doku_Parser_Mode_Smiley($smileys);
 
// Add it to the parser
$Parser->addMode('smiley',$SmileyMode);

The parser is not interested in the output format for the smileys.
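
How a plain list might become a pattern can be sketched as follows. This is an illustration only, not the code used by Doku_Parser_Mode_Smiley: buildAlternation() is a hypothetical helper, and it assumes the strings are meant to be matched literally.

```php
<?php
// Hypothetical helper: join literal strings into one alternation pattern.
// This is an illustration, not the code used by Doku_Parser_Mode_Smiley.
function buildAlternation(array $tokens) {
    $escaped = array();
    foreach ($tokens as $token) {
        // Escape regex metacharacters such as ( ) and the delimiter
        $escaped[] = preg_quote($token, '/');
    }
    return '/' . implode('|', $escaped) . '/';
}

$pattern = buildAlternation(array(':-)', ':-(', ';-)'));

preg_match_all($pattern, 'Hello :-) and bye ;-)', $matches);
print_r($matches[0]);
```

Escaping matters here because smiley strings are made almost entirely of characters that have a special meaning in a regex.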

The other modes this applies to are defined by the classes Doku_Parser_Mode_Acronym, Doku_Parser_Mode_Wordblock and Doku_Parser_Mode_Entity.

Each accepts a list of “interesting strings” to its constructor, in the same way as the smileys.

In practice it is probably worth defining functions that retrieve the data from the configuration files and cache the associative arrays in a static variable, e.g.;

function getSmileys() {
 
    static $smileys = NULL;
 
    // is_null() rather than !$smileys, so an empty file is not re-read on every call
    if ( is_null($smileys) ) {
 
        $smileys = array();
 
        $lines = file( DOKU_CONF . 'smileys.conf');
 
        foreach($lines as $line){
 
            //ignore comments
            $line = preg_replace('/#.*$/','',$line);
 
            $line = trim($line);
 
            if(empty($line)) continue;
 
            $smiley = preg_split('/\s+/',$line,2);
 
            // Build the associative array
            $smileys[$smiley[0]] = $smiley[1];
        }
    }
 
    return $smileys;
}

This function can now be used like;

// Load the smiley patterns into the mode
$SmileyMode = new Doku_Parser_Mode_Smiley(array_keys(getSmileys()));
// Load the associative array into a renderer for lookup on output
$Renderer->smileys = getSmileys();
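
On the output side, the lookup could work along these lines. This is a sketch under assumptions: the smileys property name follows the example above, but the smiley() callback, the markup it emits and the icon_smile.gif file name are all made up for illustration.

```php
<?php
// Hypothetical renderer fragment; the property name follows the example above,
// but the smiley() callback and the markup it emits are assumptions.
class Sketch_SmileyRenderer {
    public $doc = '';
    public $smileys = array();

    function smiley($token) {
        if (isset($this->smileys[$token])) {
            // Known token: emit an image, keeping the token as alt text
            $this->doc .= '<img src="' . htmlspecialchars($this->smileys[$token])
                        . '" alt="' . htmlspecialchars($token) . '" />';
        } else {
            // Unknown token: fall back to the plain text
            $this->doc .= htmlspecialchars($token);
        }
    }
}

$Renderer = new Sketch_SmileyRenderer();
// In real use this would come from getSmileys(); the file name is invented
$Renderer->smileys = array(':-)' => 'icon_smile.gif');

$Renderer->smiley(':-)');
$Renderer->smiley(':-?');
echo $Renderer->doc;
```

The parser only ever sees the keys; the values stay with the renderer, which keeps the output format out of the parsing stage.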

Note: Checking for links which should be blocked is handled in a separate manner, as described below.

Testing Links for Spam

Ideally we want to be able to check for links to spam before storing a document (after editing).

This example should be viewed with caution. It makes a useful point of reference, but testing since has shown it to be very slow. It is probably easier to use a simple “syntax blind” function that searches the entire document for links matching the blacklist. This example could still be useful as a basis for building a 'wiki map' or finding 'wanted pages' by examining internal links, and is probably best run as a cron job.
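
The “syntax blind” alternative mentioned above might look something like this. It is a rough sketch only: findSpamLinks() is a hypothetical function, the URL matcher is deliberately loose, and each blacklist entry is assumed to be a regex fragment that does not contain the # delimiter.

```php
<?php
// Hypothetical "syntax blind" check: scan the raw text for anything that
// looks like a URL and test it against a list of blacklist regexes.
function findSpamLinks($text, array $blacklist) {
    $found = array();

    // Deliberately loose URL matcher; an assumption, not DokuWiki's own rule
    preg_match_all('#\b(?:https?|ftp)://[^\s<>"\]|]+#i', $text, $matches);

    foreach ($matches[0] as $url) {
        foreach ($blacklist as $entry) {
            // Each blacklist entry is treated as a regex fragment
            if (preg_match('#' . $entry . '#i', $url)) {
                $found[] = $url;
                break;
            }
        }
    }
    return $found;
}

$text = "See [[http://spam.example.com/pills|this]] and http://www.example.org/ok";
$spamLinks = findSpamLinks($text, array('spam\.example\.com', 'casino'));
print_r($spamLinks);
```

Since it never looks at the syntax, this approach will also flag URLs inside code or unformatted blocks, which may or may not be what is wanted.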

A syntax-aware check could be done by building a special Renderer that examines only the link-related callbacks and checks each URL against a blacklist.

A function is needed to load the spam.conf and bundle it into a single regex;

Note: a recent test of this single-regex approach against the latest blacklist from http://blacklist.chongqed.org/ produced errors about the final regex being too big. The function should probably split the regex into smaller pieces and return them as an array.

function getSpamPattern() {
    static $spamPattern = NULL;
 
    if ( is_null($spamPattern) ) {
 
        $lines = @file(DOKU_CONF . 'spam.conf');
 
        if ( !$lines ) {
 
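            // No blacklist available: an empty pattern would make preg_match()
            // fail with a warning, so fall back to the renderer's default
            $spamPattern = '#^$#';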
 
        } else {
 
            $spamPattern = '#';
            $sep = '';
 
            foreach($lines as $line){
 
                // Strip comments
                $line = preg_replace('/#.*$/','',$line);
 
                // Ignore blank lines
                $line = trim($line);
                if(empty($line)) continue;
 
                $spamPattern.= $sep.$line;
 
                $sep = '|';
            }
 
            $spamPattern .= '#si';
        }
    }
 
    return $spamPattern;
}
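
The split suggested in the note above could be sketched like this. buildSpamPatterns() and matchesAnyPattern() are hypothetical names, the chunk size of 200 is arbitrary, and the lines are passed in as an array so the sketch does not depend on a spam.conf file.

```php
<?php
// Hypothetical variant of getSpamPattern(): build several smaller regexes
// from blacklist lines instead of one large one. The chunk size is arbitrary.
function buildSpamPatterns(array $lines, $chunkSize = 200) {
    $entries = array();

    foreach ($lines as $line) {
        // Strip comments and surrounding whitespace, skip blank lines
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '') continue;
        $entries[] = $line;
    }

    $patterns = array();
    foreach (array_chunk($entries, $chunkSize) as $chunk) {
        $patterns[] = '#' . implode('|', $chunk) . '#si';
    }
    return $patterns;
}

// Returns true as soon as any one of the patterns matches
function matchesAnyPattern(array $patterns, $link) {
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $link)) return true;
    }
    return false;
}

$lines = array("# comment only\n", "spam\\.example\\.com\n", "\n", "casino # note\n", "pills\n");
$patterns = buildSpamPatterns($lines, 2);
print_r($patterns);
```

A blacklist with no usable entries yields an empty array, so nothing is ever matched and no empty or match-everything pattern reaches preg_match().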

Now we need to extend the base Renderer with one that will examine links only;

require_once DOKU_INC . 'parser/renderer.php';
 
class Doku_Renderer_SpamCheck extends Doku_Renderer {
 
    // This should be populated by the code executing the instructions
    var $currentCall;
 
    // An array of instructions that contain spam
    var $spamFound = array();
 
    // pcre pattern for finding spam
    var $spamPattern = '#^$#';
 
    function internallink($link, $title = NULL) {
        $this->__checkTitle($title);
    }
 
    function externallink($link, $title = NULL) {
        $this->__checkLinkForSpam($link);
        $this->__checkTitle($title);
    }
 
    function interwikilink($link, $title = NULL) {
        $this->__checkTitle($title);
    }
 
    function filelink($link, $title = NULL) {
        $this->__checkLinkForSpam($link);
        $this->__checkTitle($title);
    }
 
    function windowssharelink($link, $title = NULL) {
        $this->__checkLinkForSpam($link);
        $this->__checkTitle($title);
    }
 
    function email($address, $title = NULL) {
        $this->__checkLinkForSpam($address);
        $this->__checkTitle($title);
    }
 
    function internalmedialink ($src) {
        $this->__checkLinkForSpam($src);
    }
 
    function externalmedialink($src) {
        $this->__checkLinkForSpam($src);
    }
 
    function __checkTitle($title) {
        if ( is_array($title) && isset($title['src'])) {
            $this->__checkLinkForSpam($title['src']);
        }
    }
 
    // Pattern matching happens here
    function __checkLinkForSpam($link) {
        if( preg_match($this->spamPattern,$link) ) {
            $spam = $this->currentCall;
            $spam[3] = $link;
            $this->spamFound[] = $spam;
        }
    }
}

Note the line $spam[3] = $link; in the __checkLinkForSpam method. This adds an additional element to the list of spam instructions found, making it easy to determine what the bad URLs were (e.g. for logging).
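
As a small illustration of using that extra element, the following sketch turns a list of spam instructions into log lines. The sample data is hand-written, and it assumes index 2 holds the byte position, as in the section-selection example earlier.

```php
<?php
// Hand-written sample in the shape described above:
// array(callback name, arguments, byte position, offending link)
$spamFound = array(
    array('externallink', array('http://spam.example.com/', NULL), 120,
          'http://spam.example.com/'),
    array('email', array('bad@spam.example.com', NULL), 348,
          'bad@spam.example.com'),
);

$log = array();
foreach ($spamFound as $spam) {
    // Index 0 is the callback name, index 2 the position, index 3 the link
    $log[] = sprintf('%s at byte %d: %s', $spam[0], $spam[2], $spam[3]);
}

echo implode("\n", $log), "\n";
```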

Finally we can use this spam checking renderer like;

// Create the parser
$Parser = new Doku_Parser();
 
// Add the Handler
$Parser->Handler = new Doku_Handler();
 
// Load the modes which could contain markup that might be
// mistaken for a link
$Parser->addMode('preformatted',new Doku_Parser_Mode_Preformatted()); 
$Parser->addMode('unformatted',new Doku_Parser_Mode_Unformatted());
$Parser->addMode('php',new Doku_Parser_Mode_PHP());
$Parser->addMode('html',new Doku_Parser_Mode_HTML());
$Parser->addMode('code',new Doku_Parser_Mode_Code());
$Parser->addMode('file',new Doku_Parser_Mode_File());
$Parser->addMode('quote',new Doku_Parser_Mode_Quote());
 
// Load the link modes...
$Parser->addMode('internallink',new Doku_Parser_Mode_InternalLink());
$Parser->addMode('media',new Doku_Parser_Mode_Media());
$Parser->addMode('externallink',new Doku_Parser_Mode_ExternalLink());
$Parser->addMode('email',new Doku_Parser_Mode_Email());
$Parser->addMode('windowssharelink',new Doku_Parser_Mode_WindowsShareLink());
$Parser->addMode('filelink',new Doku_Parser_Mode_FileLink());
 
// Loads the raw wiki document
$doc = file_get_contents(DOKU_DATA . 'wiki/spam.txt');
 
// Get a list of instructions
$instructions = $Parser->parse($doc);
 
// Create a renderer
require_once DOKU_INC . 'parser/spamcheck.php';
$Renderer = new Doku_Renderer_SpamCheck();
 
// Load the spam regex
$Renderer->spamPattern = getSpamPattern();
 
// Loop through the instructions
foreach ( $instructions as $instruction ) {
 
    // Store the current instruction
    $Renderer->currentCall = $instruction;
 
    call_user_func_array(array(&$Renderer, $instruction[0]),$instruction[1