2025-05-25 # Syntax Highlighting

Ever since I made this website, I've been thinking of how I would use it. I want to use it as a blog, sharing my interests. I want to use it to host web projects of mine. My last article included a snippet of Zig code. Being a lover of good syntax highlighting, I was annoyed to discover that highlight.js doesn't support Zig out-of-the-box. Of course, there is currently third-party support for Zig syntax highlighting, but I would prefer not to set it up and keep it updated. A bit more digging revealed prism.js, which looked more promising.

After some experimentation, I realized that the default setup of prism.js just wasn't what I wanted. I already had good CSS for displaying blocks of code, and I considered prism.js pretty intrusive. It would place an off-color block in the middle of my webpage. Instead of spending a few minutes reconfiguring prism.js, or learning how to use third party zig syntax highlighting support with highlight.js, I decided to roll my own syntax highlighting.

I didn't want to look through someone else's source code. I wanted to create this from core principles. This meant I needed to figure out exactly what syntax highlighting was, how it worked, etc. For this project, I decided to use JavaScript. I have more experiments down the line I want to try with WebAssembly, yet I felt JavaScript would work just fine for this purpose.

What information will we convey through highlighting?

I don't need anything fancy here. Modern text editors can perform advanced syntax highlighting with the help of a language server. They go beyond simple highlighting of syntax into fancier structure highlighting, static analysis highlighting, runtime analysis highlighting, etc. I do not want to burden myself with the complexity and runtime cost of a language server here. Instead, let's analyze some basic Zig code and pick out the parts we wish to differentiate:

const std = @import("std");

pub fn main() !void {
    // standard output Writer interface
    const stdout = std.io.getStdOut().writer();

    var i: usize = 1;
    while (i <= 16) : (i += 1) {
        if (i % 15 == 0) {
            try stdout.writeAll("ZigZag\n");
        } else if (i % 3 == 0) {
            try stdout.writeAll("Zig\n");
        } else if (i % 5 == 0) {
            try stdout.writeAll("Zag\n");
        } else {
            try stdout.print("{d}\n", .{i});
        }
    }
}

I first see the keywords. These are const, pub, fn, var, while, if, try, and else. Of course zig has more keywords - this is just a sample. When looking at the code, I also see comments, which all start with //. There are additionally builtins like @import, string literals like "Zig\n" and "Zag\n", and number literals like 1, 16, 15, 3, and 5.

There are also primitives like void and usize, variable names like stdout and i, and function names like main(), writeAll(), getStdOut(), writer(), and print(). Not shown here are Types. They are pascal-cased, and represent user-created types.

That seems like a fairly good list. Here's what we have:

Numbers - 0x37, 64, 0b1101, 1.01...
Keywords - const, var, switch, for...
Builtins - @import, @intCast, @splat...
Primitives - usize, i32, bool, void...
Functions - main(), fooBarBazBat(), write()...
Types - StaticStringMap, Compressor, Cat...
Variables - index, num_elephants, status_2319...
Strings - "\"standard\" strings", \\multiline strings...

What colors should the code use?

When I look at colors in normal syntax highlighted code, I notice that the colors have contrast. You will rarely find that the color for a function is barely a shade off from the color for a string literal. Because I have 8 elements of Zig code that I want to distinguish, I've decided to just use the whole rainbow. Some people may find this unpleasant, as it has no overall "theme", and may clash in some way. If you are reading this for inspiration, feel free to do colors your own way.

I initially split a standard HSV rainbow into discrete bands of color. I was not pleased to discover that the visual contrast between certain adjacent colors was much lower than between others. I had a hard time telling apart the yellow-green, pure-green, and light-blue-green colors. The hues would blend together. This is not a good experience.

discretized rainbow - the green hues are hard to distinguish

Our color perception isn't linear. Our eyes register some wavelength deltas as lower contrast than others. After some research, I found that you can estimate the perceptual difference between two colors using CIEDE2000. It is a neat little formula that attempts to match the average visual perception of humans and outputs a perceptual difference metric. In theory, you could use this metric to find a set of colors in which every color has the same minimum contrast with every other color. One well-worded LLM prompt later, I had a rough Python program to generate a more visually uniform color spectrum. Some colors still feel closer to others, but it is a lot better than before. For all I know, I may be slightly color blind anyways.

discretized rainbow - each hue is easily distinguished from another
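The perceptual spacing idea can be sketched in a few lines of JavaScript. This is not the Python program mentioned above, and it makes two simplifying assumptions: it uses the older CIE76 difference (plain Euclidean distance in CIELAB) instead of CIEDE2000, and it spaces hues at equal perceptual distance along the hue circle instead of optimizing the minimum contrast between every pair. All function names here are hypothetical.

```javascript
// Convert an HSV hue in degrees (full saturation and value) to sRGB in [0, 1]
function hue_to_rgb(hue) {
    const h = ((hue % 360) + 360) % 360 / 60;
    const x = 1 - Math.abs((h % 2) - 1);
    const table = [[1, x, 0], [x, 1, 0], [0, 1, x], [0, x, 1], [x, 0, 1], [1, 0, x]];
    return table[Math.floor(h) % 6];
}

// Convert sRGB in [0, 1] to CIELAB using the D65 white point
function rgb_to_lab([r, g, b]) {
    const lin = (c) => (c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4));
    const [R, G, B] = [lin(r), lin(g), lin(b)];
    const X = (0.4124 * R + 0.3576 * G + 0.1805 * B) / 0.95047;
    const Y = 0.2126 * R + 0.7152 * G + 0.0722 * B;
    const Z = (0.0193 * R + 0.1192 * G + 0.9505 * B) / 1.08883;
    const f = (t) => (t > 216 / 24389 ? Math.cbrt(t) : (24389 / 27 * t + 16) / 116);
    return [116 * f(Y) - 16, 500 * (f(X) - f(Y)), 200 * (f(Y) - f(Z))];
}

// CIE76 color difference: Euclidean distance between two CIELAB colors.
// This is a simpler stand-in for CIEDE2000, not an implementation of it.
function delta_e76(a, b) {
    return Math.hypot(a[0] - b[0], a[1] - b[1], a[2] - b[2]);
}

// Pick `count` hues so that the perceptual distance travelled along the
// hue circle between neighbouring picks is roughly equal
function uniform_hues(count, steps = 3600) {
    // cumulative[i] is the distance travelled from hue 0 to hue i * 360 / steps
    const cumulative = [0];
    let previous = rgb_to_lab(hue_to_rgb(0));
    for (let i = 1; i <= steps; i++) {
        const current = rgb_to_lab(hue_to_rgb(i * 360 / steps));
        cumulative.push(cumulative[i - 1] + delta_e76(previous, current));
        previous = current;
    }

    const total = cumulative[steps];
    const hues = [];
    let i = 0;
    for (let k = 0; k < count; k++) {
        // Walk forward until this pick's share of the total distance is covered
        const target = total * k / count;
        while (cumulative[i] < target) i++;
        hues.push(i * 360 / steps);
    }
    return hues;
}

console.log(uniform_hues(8));
```

The hues come back in increasing order starting at 0 degrees. Stretches of the circle where neighbouring hues look alike contribute little distance, so fewer picks land there than with an even split.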

We could work further with this. One idea is to also vary the luminance and saturation of these colors. Another idea is to define a region of colors to select from, ending up with a contrasting set of colors for a certain color theme. In the interest of simplicity, I decided to just go with the full rainbow:

Dark Mode:

Comments:    #808080 (RGB: 128, 128, 128)
Builtins:    #ff7065 (RGB: 255, 112, 101)
Keywords:    #ffbb65 (RGB: 255, 187, 101)
Strings:     #deff65 (RGB: 222, 255, 101)
Numbers:     #65ffc3 (RGB: 101, 255, 195)
Types:       #65dfff (RGB: 101, 223, 255)
functions(): #659cff (RGB: 101, 156, 255)
var_names:   #b565ff (RGB: 181, 101, 255)
Primitives:  #ff65d3 (RGB: 255, 101, 211)
Default:     #ffffff (RGB: 255, 255, 255)

Light Mode:

Comments:    #707070 (RGB: 112, 112, 112)
Builtins:    #6f150f (RGB: 111,  21,  15)
Keywords:    #6f470f (RGB: 111,  71,  15)
Strings:     #606f0f (RGB:  96, 111,  15)
Numbers:     #0f6f4f (RGB:  15, 111,  79)
Types:       #0f5b6f (RGB:  15,  91, 111)
functions(): #0f2f6f (RGB:  15,  47, 111)
var_names:   #4a0f6f (RGB:  74,  15, 111)
Primitives:  #6f0f51 (RGB: 111,  15,  81)
Default:     #000000 (RGB:   0,   0,   0)

How will we encode the color information?

Off the top of my head, I had a few thoughts for how the JavaScript program could encode color. One idea is to encode color with classes: <span class="code_red"> @splat(0) </span>. This would have the advantage that I can control each of these colors in a CSS file, handling dark mode and light mode pretty easily. Of course, this means that I now need to place 10 more rules in my CSS for each of dark and light mode.

Instead of tagging each selected part with a class, we could also just directly style the code using <span style="color: #RRGGBB"> unreachable </span>. The advantage to encoding the color directly here is that I won't need anything extra in my CSS file. It means that I can write JavaScript and things will simply work. Of course, it may be slightly more complicated code because now I have to check for the preferred color scheme in JavaScript.

I decided that I would go for the first option. JavaScript would run through code blocks annotated with a "language-zig" class, and tag the selected code with color classes. While it is a hassle to throw the colors in CSS, it means that I can dedicate my CSS to *styling*, while my JavaScript performs the actual logic.
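To make the trade-off concrete, here is a minimal sketch of both encodings for a single keyword. The helper names and the keyword_colors object are hypothetical; the two colors are the keyword colors from the palettes listed earlier, and zig_keyword is the class name used later in this post.

```javascript
// Keyword colors from the dark and light palettes (hypothetical container)
const keyword_colors = { dark: "#ffbb65", light: "#6f470f" };

// Option 1: tag with a class and let the CSS file decide the color
function tag_with_class(text) {
    return `<span class="zig_keyword">${text}</span>`;
}

// Option 2: bake the color into the markup. The caller has to supply the
// scheme; in a browser it could be derived from
// window.matchMedia("(prefers-color-scheme: light)").matches
function tag_with_style(text, scheme) {
    return `<span style="color: ${keyword_colors[scheme]}">${text}</span>`;
}

console.log(tag_with_class("unreachable"));
console.log(tag_with_style("unreachable", "dark"));
```

With the class version the markup stays the same in both color schemes, so a scheme change needs no JavaScript at all.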

How will we annotate the zig code?

To annotate the code, we first need to determine what needs to be highlighted. We can build our own tokenizer / parser, or we can use regex to match parts of the code. After we determine where the interesting portions of the code are, we need to find a way to individually tag these portions and reconstruct the original structure of the code. We also need to avoid overlapping our annotated code portions.

JavaScript has exceptional regex support. Coming from low level languages, I had mistakenly assumed that rolling my own parser would be faster than using a regexp. Because regex is part of the language specification, JavaScript engines ship heavily optimized regex implementations. V8, for example, compiles patterns down to bytecode or native code.

How will we know what is zig code, and what isn't? Well, fortunately for us, there is a precedent for marking code in a given programming language using a language-xxx class in HTML. Ok, so we look for code marked with class="language-zig", then we use regex to match all the appropriate parts, then we tag all of these parts with an individual class, right? We can use these classes in CSS to color each part.

Well it seems the parsing is easy enough then. No finite state automata for us, we can use basic regex. There are nuances with parsing code that we don't want to care for here. I'd say that if we can match an integer correctly 95% of the time with a 20 byte regexp, it is much better for us than trying to make the regex 100% perfect at the cost of a 150 byte regexp.

Ok, let's construct these regular expressions. One resource I found particularly useful was regex101.com. It is a very lovely website, and I learned a lot from it about how pattern matching works in a regular expression. Here are the expressions:

Strings:
```
/"(?:\\.|[^\n"])*"|\\\\.*/g
```
Comments:
```
/\/\/.*/g
```
Builtins:
```
/@\w+/g
```

Keywords:

/\b(?:addrspace|align|allowzero|and|anyframe|anytype|asm|async|await|break|callconv|catch|comptime|const|continue|defer|else|enum|errdefer|error|export|extern|fn|for|if|inline|noalias|noinline|nosuspend|opaque|or|orelse|packed|pub|resume|return|linksection|struct|suspend|switch|test|threadlocal|try|union|unreachable|usingnamespace|var|volatile|while)\b/g

Primitives:

/\b(?:[uif]\d+|isize|usize|bool|anyopaque|void|noreturn|type|anyerror|comptime_int|comptime_float)\b/g

Numbers:

/(?:-|\b)[0-9][xo\.+\-\wpP]*|\b(?:true|false|null|undefined)\b|'(?:\\.|[^\n'])*'/g

Types:
```
/\b[A-Z]\w*/g
```
Functions:
```
/\b[a-z_]\w*(?=\()/g
```
Variables:
```
/\b\w+/g
```

They look pretty nice, right? The regexp for matching variables is especially simple. News flash - I cheated. We have a few problems that need to be solved here. What happens if we have a string in the middle of a comment? What if we have a comment start in the middle of a string? What happens if keywords appear in variable names? Bad things happen.

We can't *just* do this pattern matching replacement. We need to remove matching portions from the text before we match other regular expressions. Of course this isn't perfect - in this order, strings inside of comments will still be matched. Oh and one more problem - what if, in removing these sections of matched text, some lower-chain regexp matches a span across that gap?

Well we already know that we will need to parse these expressions in the right order. I took advantage of that thought as I wrote the regexp for matching variable names. You can tell that it will match any sequence of "word" bytes (a through z, A through Z, 0 through 9, and underscores). This would previously have matched some numbers, but as we parse numbers before variables, this is a non-issue.
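A quick check shows why the order matters. Run on its own against one line of the sample program, the variable regexp happily claims a keyword and two number literals as well:

```javascript
// The variable regexp from the list above, with no earlier stages applied
const variable_regex = /\b\w+/g;

const line = "while (i <= 16) : (i += 1) {";
const words = line.match(variable_regex);

console.log(words); // [ 'while', 'i', '16', 'i', '1' ]
```

Only the two i matches are actually variables. The rest have to be claimed by the keyword and number stages before this regexp ever sees the text.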

One more thing to note is the particularly fat numbers regexp. This regexp is more complicated than it needs to be. It not only heuristically matches all integer and floating point literals, but also true, false, null, undefined, and all character literals. Under the hood of the compiler, character literals are integers, and so are true and false. I threw in null and undefined for completeness, even though those don't quite line up semantically.
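Here is a small sanity check of the numbers regexp against a handful of literals. The matches_whole helper is hypothetical and only exists for this check. The last line shows one place where the heuristic over-matches.

```javascript
const number_regex = /(?:-|\b)[0-9][xo\.+\-\wpP]*|\b(?:true|false|null|undefined)\b|'(?:\\.|[^\n'])*'/g;

// True when the regexp matches the sample in one piece, from start to end
function matches_whole(sample) {
    const hits = sample.match(number_regex);
    return hits !== null && hits.length === 1 && hits[0] === sample;
}

const literals = ["0x37", "64", "0b1101", "1.01", "1e-5", "'a'", "'\\n'", "true", "null"];
for (const literal of literals) {
    console.log(literal, matches_whole(literal)); // true for every literal
}

// "+" is in the character class, so an expression written without spaces
// is swallowed as a single "number"
console.log("1+2".match(number_regex)); // [ '1+2' ]
```

That kind of miss is the price of keeping the regexp short, and it goes away as soon as the operators are surrounded by spaces.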

So we have two problems. The first problem: some regexp may match portions of the text that another regexp may match. The second problem: when trying to solve the first problem by removing matches from the text, some regexp may match the span across the gap of that text.
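Both problems are easy to reproduce with the regexps from the list above. The inputs below are made up for illustration, and the second one is deliberately contrived:

```javascript
const string_regex = /"(?:\\.|[^\n"])*"|\\\\.*/g;
const comment_regex = /\/\/.*/g;
const variable_regex = /\b\w+/g;

// Problem 1: applied independently, two regexps claim overlapping text
const commented = 'x = 1; // prints "hi"';
const comment_hits = commented.match(comment_regex);
const string_hits = commented.match(string_regex);
console.log(comment_hits); // [ '// prints "hi"' ]
console.log(string_hits);  // [ '"hi"' ]

// Problem 2: deleting a match lets a later regexp match across the gap
const contrived = 'foo"bar"baz';
const masked = contrived.replace(string_regex, "");
const variable_hits = masked.match(variable_regex);
console.log(masked);        // foobaz
console.log(variable_hits); // [ 'foobaz' ]
```

In the second case foo and baz were separate words in the source, yet the variable regexp sees a single word once the string between them is gone.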

How can we solve this? We can write a separate function for each stage of tagging the parts of our code. The first stage will call the second stage on unmatched portions, otherwise it will annotate that matched portion of the input text. It will concatenate these matched and unmatched strings together to form the original structure.

This should kill two bugs with one rock - we not only avoid duplicate matching portions, but also avoid matching spans across gaps that may be introduced with a subpar masking approach. In the end, it ends up being simple as well!

The Code:

For those of you who skipped right to the code, I would like to remind you that comprehension is optional. If you would like to learn more about a process, shortcuts aren't the answer. Until you know the low level, you can't fully appreciate the high level. At last, here's the JavaScript:

// This function uses the regex to differentiate different parts of the source.
// Matching portions are passed through tag_fn, and unmatching portions are
// passed through nomatch_fn. The result is concatenated in the original order.
function matcher(regex, tag_fn, nomatch_fn) {
    return (source) => {
        let last_index = 0;
        let annotated = "";

        for (const match of source.matchAll(regex)) {
            // Pass any unmatched spans down the chain
            if (match.index > last_index) {
                const span = source.slice(last_index, match.index);
                annotated += nomatch_fn(span);
            }
            // Annotate & append the matched span
            annotated += tag_fn(match[0]);
            last_index = match.index + match[0].length;
        }

        // Pass along any last unmatched trailing span
        if (last_index < source.length) {
            const span = source.slice(last_index);
            annotated += nomatch_fn(span);
        }

        return annotated;
    };
}

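// Escape the characters that are special in HTML. The result is assigned to
// innerHTML, so source text like "a<b" or "&x" would otherwise be parsed as
// markup instead of being displayed.
function escape_html(text) {
    return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// This function tags the match with the class of name "class_name"
function tagger(class_name) {
    return (match) => `<span class="${class_name}">${escape_html(match)}</span>`;
}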

// These regexes will match the sections of zig code which we are looking to
// highlight. They can be improved, but are a good heuristic for now.
const string_regex = /"(?:\\.|[^\n"])*"|\\\\.*/g;
const comment_regex = /\/\/.*/g;
const builtin_regex = /@\w+/g;
const keyword_regex = /\b(?:addrspace|align|allowzero|and|anyframe|anytype|asm|async|await|break|callconv|catch|comptime|const|continue|defer|else|enum|errdefer|error|export|extern|fn|for|if|inline|noalias|noinline|nosuspend|opaque|or|orelse|packed|pub|resume|return|linksection|struct|suspend|switch|test|threadlocal|try|union|unreachable|usingnamespace|var|volatile|while)\b/g;
const primitive_regex = /\b(?:[uif]\d+|isize|usize|bool|anyopaque|void|noreturn|type|anyerror|comptime_int|comptime_float)\b/g;
const number_regex = /(?:-|\b)[0-9][xo\.+\-\wpP]*|\b(?:true|false|null|undefined)\b|'(?:\\.|[^\n'])*'/g;
const type_regex = /\b[A-Z]\w*/g;
const function_regex = /\b[a-z_]\w*(?=\()/g;
const variable_regex = /\b\w+/g;

// These functions define a chain of precedence between highlighting items.
// The lower numbered matching functions have higher precedence, allowing
// the higher numbered functions' regular expressions to be more relaxed.
const match_9 = tagger("zig_other");
const match_8 = matcher(variable_regex, tagger("zig_variable"), match_9);
const match_7 = matcher(function_regex, tagger("zig_function"), match_8);
const match_6 = matcher(type_regex, tagger("zig_type"), match_7);
const match_5 = matcher(number_regex, tagger("zig_number"), match_6);
const match_4 = matcher(primitive_regex, tagger("zig_primitive"), match_5);
const match_3 = matcher(keyword_regex, tagger("zig_keyword"), match_4);
const match_2 = matcher(builtin_regex, tagger("zig_builtin"), match_3);
const match_1 = matcher(comment_regex, tagger("zig_comment"), match_2);
const match_0 = matcher(string_regex, tagger("zig_string"), match_1);

// Annotate all code blocks marked by the "language-zig" class
const zig_blocks = document.getElementsByClassName("language-zig");
for (const block of [...zig_blocks]) {
    block.innerHTML = match_0(block.textContent);
}

And don't forget that we need to actually do something with the tags as well! Here's the CSS for the code: (Note: not shown is the CSS for the block that the code is in. The CSS here does not color the background of the code.)

/* Zig syntax highlighting */

.zig_comment   { color: #808080; }
.zig_builtin   { color: #ff7065; }
.zig_keyword   { color: #ffbb65; }
.zig_string    { color: #deff65; }
.zig_number    { color: #65ffc3; }
.zig_type      { color: #65dfff; }
.zig_function  { color: #659cff; }
.zig_variable  { color: #b565ff; }
.zig_primitive { color: #ff65d3; }
.zig_other     { color: #ffffff; }

@media (prefers-color-scheme: light) {
    .zig_comment   { color: #707070; }
    .zig_builtin   { color: #6f150f; }
    .zig_keyword   { color: #6f470f; }
    .zig_string    { color: #606f0f; }
    .zig_number    { color: #0f6f4f; }
    .zig_type      { color: #0f5b6f; }
    .zig_function  { color: #0f2f6f; }
    .zig_variable  { color: #4a0f6f; }
    .zig_primitive { color: #6f0f51; }
    .zig_other     { color: #000000; }
}

Final Results:

I admit, I am very proud of this code. It is the first non-trivial JavaScript code I've written. I'm eager to try it out on some regular zig code though! So let's take it for a spin:

const std = @import("std");

pub fn main() !void {
    // standard output Writer interface
    const stdout = std.io.getStdOut().writer();

    var i: usize = 1;
    while (i <= 16) : (i += 1) {
        if (i % 15 == 0) {
            try stdout.writeAll("ZigZag\n");
        } else if (i % 3 == 0) {
            try stdout.writeAll("Zig\n");
        } else if (i % 5 == 0) {
            try stdout.writeAll("Zag\n");
        } else {
            try stdout.print("{d}\n", .{i});
        }
    }
}

The results speak for themselves! Thanks for reading.