Thoughts on Lexers

Lexers are some of the lowest-level parts of a toolchain. A lexer takes text or raw bytes as input and divides the input into tokens, each of which has a kind.

For something like:

1 + 2

A lexer might give back the following tokens:

("1", Number)
(" ", Space)
("+", PlusSign)
(" ", Space)
("2", Number)

The first element in the tuple is a slice of the original text. The second element is the kind of token.

Honestly, lexers are not terribly complex; they are fairly straightforward, if generally boring, pieces of code.

Still, I have some opinions on the design of lexers.

Design Considerations

No Context

Lexers should not need to keep any kind of contextual state; the input should be all that is required. If state is needed, the code may be more like a parser than a lexer.

Repeated calls given the same inputs must return the same result; tokenization should be a pure, deterministic function.
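
A quick determinism check, as a sketch using the tokenize() signature from the Rust example later in this post:

let input = "1 + 2";
let (mut a, mut b) = (0usize, 0usize);

let first = tokenize(input, &mut a);
let second = tokenize(input, &mut b);

// Same input and starting position, so the same token and the same
// position advance.
assert_eq!(first.map(|t| t.text), second.map(|t| t.text));
assert_eq!(a, b);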

Bad Text/Byte Input Should Still Be a Token

A lexer should be able to handle "bad" input and identify it as such. It should not panic due to bad text or byte input.

The bad input should be identified as a token with an "error" Kind.
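
For example, if @ is not valid in the language, input like:

1 @ 2

could tokenize as:

("1", Number)
(" ", Space)
("@", Error)
(" ", Space)
("2", Number)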

Smallest Terminals / Syntax

The smallest possible tokens should be returned; the lexer should not combine them into larger or more complex token kinds.

Suppose a programming language has >> as a bit-shift operator while also using > as 1) the greater-than operator and 2) the end of a list of generic/template arguments.

The lexer could have two token kinds like:

  • GreaterThan (>)
  • ShiftRight (>>)

However, having only a single token kind should be preferred:

  • GreaterThan (>)

In text like:

Vec<Vec<i32>>

When a parser is reading the lexer's tokens, the parser is probably looking for a single closing > token for Vec<i32. If >> were a token kind, the parser would have to check for >> in addition to >. The parser would also have to split the >> into two > tokens, which increases the complexity of the parser.

If the lexer were "smarter" and knew when to correctly emit >> vs. > tokens, the lexer would require state and an understanding of the language's grammar, which increases the lexer's complexity.

Therefore, the best tokenization is two individual > tokens at the end of the text.
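
With only single-character kinds, the full token stream for the text above might look like the following (kind names such as Ident, LessThan, and GreaterThan are illustrative):

("Vec", Ident)
("<", LessThan)
("Vec", Ident)
("<", LessThan)
("i32", Ident)
(">", GreaterThan)
(">", GreaterThan)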

No End of File Marker Token Kind

There should not be an "End of File" / termination token kind. If an EoF token is needed, it should be added by the calling code, which knows when that condition occurs.
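
As a sketch, the calling code could append an EoF marker itself after the lexer reports the end of the input:

let input = "1 + 2";
let mut pos = 0;
let mut kinds: Vec<TokenKind> = Vec::new();

while let Some(token) = tokenize(input, &mut pos) {
  kinds.push(token.kind);
}

// The caller knows the input is exhausted; `Eof` is a hypothetical
// kind owned by the calling code, not by the lexer.
kinds.push(TokenKind::Eof);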

Iteration

A lexer should produce one token per tokenize call; tokenizing the entire input should not be required to get a single token.

Bytes vs Strings

Depending on the use case, either bytes or strings can be the input to a lexer. To handle all types of arbitrary input, a lexer should operate on bytes, but strings are generally easier to work with.
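
A byte-oriented variant of the API might look like the following sketch (ByteToken and tokenize_bytes are hypothetical names):

struct ByteToken<'a> {
  bytes: &'a [u8],
  kind: TokenKind,
}

fn tokenize_bytes<'a>(bytes: &'a [u8], pos: &mut usize) -> Option<ByteToken<'a>> {
  // Same shape as the string version below, but no UTF-8 validity is
  // assumed; invalid sequences can become Error tokens.
  todo!()
}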

Rust Example

A lexer usually has a function like:

struct Token<'a> {
  text: &'a str,
  kind: TokenKind,
}

fn tokenize<'a>(text: &'a str, pos: &mut usize) -> Option<Token<'a>> {
  if *pos == text.len() {
    return None;
  }
  let input = &text[*pos..];

  // Scan `input` here to determine the token's kind and byte length.
  let kind: TokenKind = /* identified kind */ todo!();
  let end: usize = /* end of the token, relative to `input` */ todo!();

  *pos += end;

  Some(Token {
    text: &input[..end],
    kind,
  })
}

The caller would be something like:

let mut pos = 0;
let input = "1 + 2";

while let Some(token) = tokenize(input, &mut pos) {
  todo!()
}

Mutating In/Out Position

pos is a byte offset into the input at which to begin tokenizing.

Within the function, pos is mutated and updated to the end of the token's text.

In most cases, the entire input is available and always given to the tokenize() function. However, in a few cases, the input might be a window into a streaming buffer. So if, say, the first 123 bytes of the streaming buffer are tokenized, the calling code could remove those first 123 bytes and adjust pos (pos -= 123;) accordingly.
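
A sketch of that adjustment, assuming the caller owns the streaming buffer as a String:

let mut buffer = String::from("1 + 2 and more streamed text");
let mut pos = 5; // the first 5 bytes ("1 + 2") are already tokenized
let consumed = 5;

buffer.drain(..consumed); // drop the tokenized prefix from the buffer
pos -= consumed;          // rebase pos onto the shortened buffer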

pos may be considered an opaque value by the caller, but the lexer must be able to handle tokenizing arbitrary input at any position.

Return Option vs. Result

In the above function example, Option::None should be returned when the pos has reached text.len(), but the function signature could be changed to return a Result and return an error instead.

If pos is invalid (such as out of bounds), the function may panic, so returning a Result with an error instead may be better.

In my experience, pos is only out of bounds when testing the function and more importantly is not usually a recoverable error, so returning an Option is simpler.
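
If a Result is preferred, a variant might look like the following sketch (TokenizeError and try_tokenize are hypothetical names):

enum TokenizeError {
  /// pos was beyond the end of the input.
  OutOfBounds,
}

fn try_tokenize<'a>(text: &'a str, pos: &mut usize) -> Result<Option<Token<'a>>, TokenizeError> {
  if *pos > text.len() {
    return Err(TokenizeError::OutOfBounds);
  }
  // Ok(None) still signals the end of the input without treating it
  // as an error.
  Ok(tokenize(text, pos))
}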

Kinds of Tokens

Tokens are identified with a Kind or type. For Rust, Kind should probably be an enum like:

enum TokenKind {
  /// Identifier
  Ident,
  /// `*` Syntax
  Asterisk,
  Error,
}

The Kind does not have to be a flat enum and can have variants with fields like:

enum Syntax {
  Asterisk,
  Plus,
}

enum TokenKind {
  Ident,
  Syntax(Syntax),
  Error,
}

However, the Kind should not include the token's slice of the input.

// AVOID
enum BadTokenKind {
  Ident(String),
  Syntax(String),
  Error,
}

// ALSO AVOID
enum BadTokenKind<'a> {
  Ident(&'a str),
  Syntax(&'a str),
  Error,
}

In many cases, a Kind value is all that is required for processing, so it should not be a complex or large (in terms of memory size) type.
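
As a sanity check, the flat TokenKind above fits in a single byte:

assert_eq!(std::mem::size_of::<TokenKind>(), 1);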

Returned Token Should Contain Kind and Slice of Input

The Token type should include at least the TokenKind and some way to identify the token's slice of the original input.

struct Token<'a> {
  text: &'a str,
  kind: TokenKind,
}

or

struct Token {
  len: usize,
  kind: TokenKind,
}

The len field can be used to construct the slice of the original text given the original text and pos arguments to the tokenize function.

For Rust, a slice of the original input is preferable to an owned representation (e.g. String). The calling code can always turn a string or byte slice into a String or Vec.
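
With the len-based Token, the caller can rebuild the slice itself. A minimal sketch, assuming a hypothetical tokenize_len() variant that returns the len-based Token:

let input = "1 + 2";
let mut pos = 0;

let start = pos;
if let Some(token) = tokenize_len(input, &mut pos) {
  // tokenize_len() advanced pos, so the token's text can be rebuilt
  // from the starting position and the token's length.
  let text = &input[start..start + token.len];
  assert_eq!(text.len(), token.len);
}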

No alloc

Since an ideal lexer is a pure function with no side-effects, the lexer should not require any allocations (e.g. no new String).

Optional Iterator Trait

A lexer should ideally have both a tokenize() function and an Iterator implementation. An Iterator struct could store the input text and the pos and just call tokenize() in next().
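
A minimal sketch of such a wrapper, built on the tokenize() function above:

struct Tokens<'a> {
  text: &'a str,
  pos: usize,
}

impl<'a> Iterator for Tokens<'a> {
  type Item = Token<'a>;

  fn next(&mut self) -> Option<Token<'a>> {
    tokenize(self.text, &mut self.pos)
  }
}

The caller can then drive the lexer with an ordinary for loop over Tokens { text: input, pos: 0 }.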