Lexers are some of the lowest level parts of toolchains. Lexers take text or raw bytes as input and divide the input into tokens which have a kind.
For something like:
1 + 2
A lexer might give back the following tokens:
("1", Number)
(" ", Space)
("+", PlusSign)
(" ", Space)
("2", Number)
The first element in the tuple is a slice of the original text. The second element is the kind of token.
Honestly, lexers are not terribly complex; they are fairly straightforward, if somewhat boring, pieces of code.
Still, I have some opinions on the design of lexers.
Design Considerations
No Context
Lexers should not need to keep any kind of contextual state. The input should be all that is required. If state is required, the code may be more like a parser than a lexer.
Repeated calls given the same inputs must return the same result. Tokenization should be purely functional and deterministic.
Bad Text/Byte Input Should Still Be a Token
Lexers should be able to handle "bad" input and identify it as such. A lexer should not panic due to bad text/byte input. The bad input should be identified as a token with an "error" Kind.
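For example, if $ is not a valid character in the language (a made-up case for illustration), input like:
1 $ 2
might be tokenized as:
("1", Number)
(" ", Space)
("$", Error)
(" ", Space)
("2", Number)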
Smallest Terminals / Syntax
The smallest possible tokens should be returned, rather than combined or complex token kinds.
Suppose a programming language has >> as a bitshift operator while also using > as 1) the greater than operator and 2) the end of a list of generics/template arguments.
The lexer could have 2 token kinds like:
- GreaterThan (>)
- ShiftRight (>>)
However, having only a single token kind should be preferred:
- GreaterThan (>)
In text like:
Vec<Vec<i32>>
When a parser is reading the lexer's tokens, it is probably looking for a single closing > token for the inner Vec<i32. If >> were a token kind, the parser would have to check for >> in addition to >. The parser would also have to split the >> into two > tokens, which increases the complexity of the parser.
If the lexer were "smarter" and knew when to correctly emit >> vs. > tokens, the lexer would require state and an understanding of the language's grammar, which increases the lexer's complexity.
Therefore, the best tokenization is 2 individual > tokens for the end of the text.
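With only single-character GreaterThan tokens (and assuming illustrative Ident and LessThan kinds), Vec<Vec<i32>> could tokenize as:
("Vec", Ident)
("<", LessThan)
("Vec", Ident)
("<", LessThan)
("i32", Ident)
(">", GreaterThan)
(">", GreaterThan)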
No End of File Marker Token Kind
There should not be an "End of File" / termination token kind. If there is an EoF token, it should be added by the calling code which knows when the condition occurs.
Iteration
Lexers should be able to process one token per tokenize call. Tokenizing the entire input should not be required to get one token.
Bytes vs Strings
Depending on the use case, either bytes or strings can be given as input to a lexer. To work with arbitrary input, lexers should operate on bytes, but it is generally easier to work with strings.
Rust Example
A lexer usually has a function like:
struct Token<'a> {
    text: &'a str,
    kind: TokenKind,
}

fn tokenize<'a>(text: &'a str, pos: &mut usize) -> Option<Token<'a>> {
    if *pos == text.len() {
        return None;
    }

    let input = &text[*pos..];

    // ... scan `input` to find the next token ...
    let kind: TokenKind = /* identified kind */ todo!();
    let end = /* length of the token's text */ todo!();

    *pos += end;

    Some(Token {
        text: &input[..end],
        kind,
    })
}
The caller would be something like:
let mut pos = 0;
let input = "1 + 2";

while let Some(token) = tokenize(input, &mut pos) {
    todo!() // use the token, e.g. hand it to a parser
}
Mutating in/out position
The pos is an offset into the input to begin tokenizing from. The pos is mutated within the function and updated to the end of the token's text.
In most cases, the entire input is available and always given to the tokenize() function. However, in a few cases, the input might be a window into a streaming buffer. So if, say, the first 123 bytes of the streaming buffer are tokenized, some calling code could remove the first 123 bytes and adjust pos accordingly (pos -= 123;).
pos may be considered an opaque value by the caller, but the lexer must be able to handle tokenizing arbitrary input at any position.
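A minimal sketch of that caller-side adjustment, assuming the streaming window is held in a String (this helper is illustrative and lives in the calling code, not the lexer):
// Drop bytes that have already been tokenized from the streaming window
// and shift the lexer position back by the same amount.
fn discard_tokenized_prefix(buffer: &mut String, pos: &mut usize, consumed: usize) {
    buffer.drain(..consumed);
    *pos -= consumed;
}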
Return Option vs. Result
In the above function example, Option::None should be returned when pos has reached text.len(), but the function signature could be changed to return a Result and return an error instead.
If pos is invalid (such as out of bounds), the function may panic, so returning a Result with an error instead may be better.
In my experience, pos is only out of bounds when testing the function and, more importantly, an out-of-bounds pos is not usually a recoverable error, so returning an Option is simpler.
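For reference, a sketch of the Result-returning alternative (the LexError type is made up for illustration):
enum LexError {
    PosOutOfBounds,
}

fn tokenize<'a>(text: &'a str, pos: &mut usize) -> Result<Option<Token<'a>>, LexError> {
    if *pos > text.len() {
        return Err(LexError::PosOutOfBounds);
    }
    if *pos == text.len() {
        return Ok(None);
    }
    // ... same tokenizing logic as the Option-based version ...
    todo!()
}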
Kinds of Tokens
Tokens are identified with a Kind or type. For Rust, Kind should probably be an enum like:
enum TokenKind {
    /// Identifier
    Ident,
    /// `*` Syntax
    Asterisk,
    /// Unrecognized or invalid input
    Error,
}
The Kind does not have to be a flat enum and can have variants with fields like:
enum Syntax {
    Asterisk,
    Plus,
}

enum TokenKind {
    Ident,
    Syntax(Syntax),
    Error,
}
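For example, calling code might match on the nested variants like this (a small sketch using the kinds above):
fn is_plus(kind: &TokenKind) -> bool {
    matches!(kind, TokenKind::Syntax(Syntax::Plus))
}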
However, the Kind should not include the token's slice of the input.
// AVOID
enum BadTokenKind {
    Ident(String),
    Syntax(String),
    Error,
}

// ALSO AVOID
enum BadTokenKind<'a> {
    Ident(&'a str),
    Syntax(&'a str),
    Error,
}
In many cases, a Kind value is all that is required for processing, so it should not be a complex or large (in terms of memory size) type.
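One way to keep that property honest is a small size check in a test (the exact bound is illustrative):
#[test]
fn token_kind_is_small() {
    // The flat and nested enums above both fit comfortably under this bound.
    assert!(std::mem::size_of::<TokenKind>() <= 8);
}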
Returned Token Should Contain Kind and Slice of Input
The Token type should include at least the TokenKind and some way to identify the token's slice of the original input.
struct Token<'a> {
    text: &'a str,
    kind: TokenKind,
}
or
struct Token {
    len: usize,
    kind: TokenKind,
}
The len field can be used to construct the slice of the original text given the original text and pos arguments to the tokenize function.
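A sketch of that reconstruction, assuming the length-based Token above and that the caller saved pos before calling tokenize:
// `start` is the value of `pos` before tokenize() advanced it.
fn token_text<'a>(text: &'a str, start: usize, token: &Token) -> &'a str {
    &text[start..start + token.len]
}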
For Rust, a slice of the original input is preferable to an owned representation (e.g. String). The calling code can always turn a string or byte slice into a String or Vec.
No alloc
Since an ideal lexer is a pure function with no side-effects, the lexer should not require any allocations (e.g. no new String).
Optional Iterator Trait
A lexer should ideally have both a tokenize() function and an Iterator implementation. An Iterator struct implementation could store the input text and the pos and just call tokenize() on next().
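A minimal sketch of such a wrapper, reusing the Token and tokenize from the earlier example:
struct Lexer<'a> {
    text: &'a str,
    pos: usize,
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Token<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        // Delegate to the pure tokenize() function, carrying pos across calls.
        tokenize(self.text, &mut self.pos)
    }
}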