Simple tokenization with Python

Tokenization is the process of splitting a string into smaller (elementary) units that are further processed afterwards. For the task of preprocessing arithmetic expressions such as 1+2*(3-4) there is no need for a sophisticated tokenizer. Instead, we can use re.findall:

The function also already tackles whitespaces and ignores them correctly. Note that the expression only identifies integers (no floating numbers) correctly.

References

  • [1] re module reference

Leave a Reply