Scanner
As described in the previous chapter, the scanner is the first component in our system. It reads the input file and produces a stream of tokens.
Interface
Finally we get to some code. Here is what the scanner looks like:
class scanner
{
public:
    scanner(const char* source_file);
    void scan();
    token_type get_token_type();
    char* get_token_string();
    ...
};
...
Note that this is incomplete; the full interface for the exercise can be found in the given code template.
Interface explained
The scanner is constructed with the source file. Initially, it points at the beginning of the file and nothing has been read yet.
A caller (e.g. the parser) starts by calling scan(). After that call, the scanner points to the end of the first token, and get_token_type() and get_token_string() return the type and the string of the first token.
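To make the calling convention concrete, here is a minimal sketch of how a caller might drive a scanner with this interface. Everything in it is a stand-in and not the template code: toy_scanner reads from an in-memory string instead of a file, splits on whitespace only, and returns std::string instead of char*. The end_of_input token type is an assumption about how a caller could tell when to stop, and collect_tokens is a hypothetical caller.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token types; the real enum lives in the code template.
enum class token_type { word, end_of_input };

// Stand-in scanner: every whitespace-separated run of characters is one token.
class toy_scanner
{
public:
    explicit toy_scanner(const std::string& source) : m_source(source) {}

    // Advance to the end of the next token and remember its type and text.
    void scan()
    {
        while (m_pos < m_source.size() && is_space(m_source[m_pos]))
            ++m_pos;
        std::size_t start = m_pos;
        while (m_pos < m_source.size() && !is_space(m_source[m_pos]))
            ++m_pos;
        m_text = m_source.substr(start, m_pos - start);
        m_type = m_text.empty() ? token_type::end_of_input : token_type::word;
        m_scanned = true;
    }

    // Nothing has been read before the first scan(), so querying is an error.
    token_type get_token_type() const { assert(m_scanned); return m_type; }
    const std::string& get_token_string() const { assert(m_scanned); return m_text; }

private:
    static bool is_space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }

    std::string m_source;
    std::size_t m_pos = 0;
    bool m_scanned = false;
    token_type m_type = token_type::end_of_input;
    std::string m_text;
};

// What a caller such as the parser might do: scan first, then query.
std::vector<std::string> collect_tokens(const std::string& source)
{
    toy_scanner s(source);
    std::vector<std::string> tokens;
    for (s.scan(); s.get_token_type() != token_type::end_of_input; s.scan())
        tokens.push_back(s.get_token_string());
    return tokens;
}
```

The point to notice is the order of calls: scan() always comes first, and the two getters only report on the token that the most recent scan() stopped at.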
Implementation
For the best learning experience, you, the reader, are supposed to implement this yourself. In this section, I will talk about my implementation, which you can also use as a reference:
My scanner starts by skipping whitespace. It then checks whether the current string has a keyword prefix; if so, it returns the keyword. Otherwise, the token is either a number or an identifier, so the scanner just scans forward to find where it ends.
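The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the template: the keyword list, the token_kind values, the scanned_token struct and the next_token function are all hypothetical names, numbers are assumed to be plain runs of digits, and identifiers are assumed to be a letter followed by letters or digits.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical token kinds and keyword list, for illustration only.
enum class token_kind { keyword, number, identifier, unknown, end_of_input };

struct scanned_token
{
    token_kind kind;
    std::string text;
};

static const char* const keywords[] = { "if", "else", "while" };

static bool is_space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
static bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
static bool is_alpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }

// Scan one token starting at pos and leave pos just past it.
scanned_token next_token(const std::string& input, std::size_t& pos)
{
    assert(pos <= input.size());

    // 1. Skip whitespace.
    while (pos < input.size() && is_space(input[pos]))
        ++pos;
    if (pos == input.size())
        return { token_kind::end_of_input, "" };

    // 2. Keyword prefix check: compare against each keyword in turn.
    for (const char* keyword : keywords)
    {
        std::string k(keyword);
        if (input.compare(pos, k.size(), k) == 0)
        {
            pos += k.size();
            return { token_kind::keyword, k };
        }
    }

    std::size_t start = pos;

    // 3. Number: a run of digits.
    if (is_digit(input[pos]))
    {
        while (pos < input.size() && is_digit(input[pos]))
            ++pos;
        return { token_kind::number, input.substr(start, pos - start) };
    }

    // 4. Identifier: a letter followed by letters or digits.
    if (is_alpha(input[pos]))
    {
        while (pos < input.size() && (is_alpha(input[pos]) || is_digit(input[pos])))
            ++pos;
        return { token_kind::identifier, input.substr(start, pos - start) };
    }

    // Anything else: consume one character so the caller always makes progress.
    ++pos;
    return { token_kind::unknown, input.substr(start, 1) };
}
```

One consequence of a pure prefix check is worth knowing: in this sketch the input iffy scans as the keyword if followed by the identifier fy. If that is not what you want, one option is to accept a keyword only when the character after it cannot continue an identifier.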
Practical issues
A typical compiler book would devote a whole chapter (or even a few) to regular expressions, automata and maybe examples of lexical analyzer generators such as flex. These are good techniques, but compiler writers frequently choose not to use them because generated code is painful to maintain (in particular, to change or debug). Compared with those approaches, our approach compares the string against each keyword in turn, which makes it go through the same characters multiple times. For a large number of keywords, that could hurt performance. If that is truly an issue, we can build a trie of the keywords, so that each input character is examined only once during keyword matching.
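For the curious, here is a minimal sketch of what such a keyword trie could look like. It is not part of the code template; trie_node, keyword_trie, insert and match are hypothetical names, and using std::map for the children is just the simplest choice (a fixed array indexed by character would likely be faster).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// One node per character along a keyword; the root stands for the empty prefix.
struct trie_node
{
    std::map<char, std::unique_ptr<trie_node>> children;
    bool is_keyword = false; // true if the path from the root spells a keyword
};

class keyword_trie
{
public:
    void insert(const std::string& keyword)
    {
        assert(!keyword.empty());
        trie_node* node = &m_root;
        for (char c : keyword)
        {
            std::unique_ptr<trie_node>& child = node->children[c];
            if (!child)
                child.reset(new trie_node());
            node = child.get();
        }
        node->is_keyword = true;
    }

    // Length of the longest keyword that starts at input[pos], or 0 if none.
    // Each input character is looked at once, however many keywords there are.
    std::size_t match(const std::string& input, std::size_t pos) const
    {
        const trie_node* node = &m_root;
        std::size_t longest = 0;
        for (std::size_t i = pos; i < input.size(); ++i)
        {
            auto it = node->children.find(input[i]);
            if (it == node->children.end())
                break;
            node = it->second.get();
            if (node->is_keyword)
                longest = i - pos + 1;
        }
        return longest;
    }

private:
    trie_node m_root;
};
```

A scanner could build the trie once at startup and replace the loop over all keywords with a single call to match at the current position. Returning the longest match also means that, with both int and interface inserted, the input interface is not cut short at int.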