Simple tokenization with Python

Tokenization is the process of splitting a string into smaller (elementary) units that are further processed afterwards. For the task of preprocessing arithmetic expressions such as 1+2*(3-4) there is no need for a sophisticated tokenizer. Instead, we can use re.findall:

The function also already tackles whitespaces and ignores them correctly. Note that the expression only identifies integers (no floating numbers) correctly.

References

  • [1] re module reference

Leave a Reply

Your email address will not be published. Required fields are marked *

Please type the characters of this captcha image in the input box

Please type the characters of this captcha image in the input box