It first builds a token list, then passes that list to a function that outputs
the data. The tokeniser is designed to be as accurate as possible: for example,
++ will be read as one operator token (increment), while +- will be read as two
operator tokens (plus then minus). Although this is not strictly required for a
syntax highlighter, it is implemented so that I may later extend this code to
allow crushing/beautifying and/or actual parsing and executing. This level of
accuracy does benefit syntax highlighting, however, as tokens such as } can be
coloured differently depending on whether they delimit a block or an object
literal.
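The longest-match behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the highlighter's actual code: the OPERATORS table and the readOperator and tokeniseOperators names are assumptions made for the example, and the table is deliberately incomplete.

```javascript
// Hypothetical, incomplete operator table. Sorting longest-first means that
// "++" is tried before "+", so the longest possible operator always wins.
const OPERATORS = [
  '+', '-', '*', '/', '%', '=', '<', '>', '!', '&', '|', '^', '~', '?', ':',
  '++', '--', '+=', '-=', '==', '!=', '<=', '>=', '&&', '||', '<<', '>>',
  '===', '!==', '>>>', '<<=', '>>=', '>>>='
].sort((a, b) => b.length - a.length);

// Return the longest operator starting at `pos`, or null if there is none.
function readOperator(source, pos) {
  for (const op of OPERATORS) {
    if (source.startsWith(op, pos)) return op;
  }
  return null;
}

// Split a string containing only operators and whitespace into tokens.
function tokeniseOperators(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    if (/\s/.test(source[pos])) { pos++; continue; }
    const op = readOperator(source, pos);
    if (op === null) throw new SyntaxError('Unexpected character at ' + pos);
    tokens.push(op);
    pos += op.length;
  }
  return tokens;
}
```

With this approach tokeniseOperators('++') yields a single token, while tokeniseOperators('+-') yields two.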
Every open-source syntax highlighter that I have tried fails on at least some valid JS. Common pitfalls include:
- Failure to recognise numbers that start with a period, eg:
- Failure to recognise that the second period in the following is
an operator, and not part of the number:
0.1.method(); // yes this is valid JS
- Failure to handle multiline strings
- Interpreting the following as containing a regular expression:
- Ending regular expression tokens prematurely, eg:
- Recognising some edge case 'divide' operators as the start of a regular expression:
/regexp/
In this example the first line does not have a semicolon, so the first
/ on the next line immediately becomes a division operation, with
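Two of the pitfalls listed above (numbers around periods, and telling a division from a regular expression) can be illustrated with a small sketch. It is a simplified heuristic under stated assumptions, not the method this highlighter uses: the NUMBER pattern, the { type, value } token shape, and the readNumber and slashStartsRegex names are all invented for the example.

```javascript
// Hypothetical number pattern: hex, or decimal with an optional fraction and
// exponent. A leading period is allowed (".5"), and matching stops after the
// first fraction, so in "0.1.method()" the second period is left over.
const NUMBER = /^(?:0[xX][0-9a-fA-F]+|(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)/;

// Return the number literal starting at `pos`, or null if there is none.
function readNumber(source, pos) {
  const match = NUMBER.exec(source.slice(pos));
  return match ? match[0] : null;
}

// Keywords that behave like a finished value, so a following "/" divides.
const VALUE_KEYWORDS = ['this', 'super', 'true', 'false', 'null'];

// Decide whether a "/" opens a regular expression, based only on the previous
// significant token (whitespace, newlines and comments skipped). `prev` is a
// hypothetical { type, value } object, or null at the start of the input.
function slashStartsRegex(prev) {
  if (prev === null) return true;
  switch (prev.type) {
    case 'number':
    case 'string':
    case 'regex':
    case 'identifier':
      return false; // a value came before, so "/" is a division
    case 'keyword':
      return !VALUE_KEYWORDS.includes(prev.value);
    case 'punctuator':
      // ")" and "}" are genuinely ambiguous; this sketch guesses division
      // and will be wrong for code such as: if (x) /re/.test(y)
      return ![')', ']', '}'].includes(prev.value);
    default:
      return true;
  }
}
```

Because the check ignores line breaks, an identifier at the end of one line followed by / on the next is treated as a division, which matches the behaviour described for the last pitfall.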
I know of only one bug in my implementation: if an object literal
is placed in the false branch of a ternary expression then it is highlighted
as a block level token (though most other libraries don't distinguish between these).
Christian Krebbs writes in to report another bug: a regular expression following a variable declaration (without an assignment or semicolon) is interpreted as a division. This one is going to be particularly difficult to fix. If you find another code sequence that fails to highlight correctly, please contact me.
Note: This is not necessarily designed to fail gracefully on invalid input.