Nibble

Github Source

I recently gained some curiosity on LLVM and attempted a recursive descent parser that (this time around) emits LLVM directly with printf. I set some limitations for myself while writing Nibble, namely no-malloc, no-free (and thus no AST), and no reliance on external libraries. The goal was to demystify compiler design for others, and have the compiler produce real tangible software like this simulation of a noisy chicken coop:

While I won’t cover the internals emitting LLVM, I will add that parsing yielded the most rewarding discovery. Two drivers, read_ltor or read_rtol, meaning read left-to-right, or right-to-left, respectively, take function pointer value_t with() that return value_t:

value_t read_ltor(value_t with(), precedence_t precedence)
{
    auto operators = get_operators(precedence);
    auto left = with();
    while(str_in(peek_operator(), operators))
    {
        auto operator = read_operator();
        auto rite = with();
        left = operate(left, rite, operator);
    }
    return left;
}

value_t read_rtol(value_t with(), precedence_t precedence)
{
    auto operators = get_operators(precedence);
    auto left = with();
    if(str_in(peek_operator(), operators))
    {
        auto operator = read_operator();
        auto rite = read_rtol(with, precedence);
        left = operate(left, rite, operator);
    }
    return to_rvalue(left);
}

By defining the lowest precedence read_p0 as either a direct value load of character, string, numeric character, or identifier, a chain of increasing precedence functions read_px can chain and pass values down a correct precedence hierarchy:

value_t read_p0()
{
    auto peek = next_char();
    if(is_character_load(peek)) return load_character();
    if(is_string_load(peek)) return load_string();
    if(is_direct_load(peek)) return load_direct();
    if(is_identifier_load(peek_alnum())) return read_identifier();
    return read_postfix(read_prefix());
}

value_t read_p1() { return read_ltor(read_p0, g_precedence_arithmetic_0); }
value_t read_p2() { return read_ltor(read_p1, g_precedence_arithmetic_1); }
value_t read_p3() { return read_ltor(read_p2, g_precedence_shift);        }
value_t read_p4() { return read_ltor(read_p3, g_precedence_relational_0); }
value_t read_p5() { return read_ltor(read_p4, g_precedence_relational_1); }
value_t read_p6() { return read_ltor(read_p5, g_precedence_bitwise_and);  }
value_t read_p7() { return read_ltor(read_p6, g_precedence_bitwise_xor);  }
value_t read_p8() { return read_ltor(read_p7, g_precedence_bitwise_or);   }
value_t read_p9() { return read_rtol(read_p8, g_precedence_assignment);   }

value_t read_expression()
{
    return read_p9();
}

Note that lower precedence here means stronger-binding-power. Some define that as the other direction.

With get_operators() simply returning any of the following operators by precedence level, read_expression() can freely parse any C-like expressions that support the following binary operators:

static char* g_operators_by_precedence[g_precedence_count][8] = {
    [ g_precedence_arithmetic_0 ]  = { g_multiply, g_divide, g_mod                            },
    [ g_precedence_arithmetic_1 ]  = { g_add, g_subtract                                      },
    [ g_precedence_shift        ]  = { g_shift_left, g_shift_rite                             },
    [ g_precedence_relational_0 ]  = { g_less, g_less_equal_to, g_greater, g_greater_equal_to },
    [ g_precedence_relational_1 ]  = { g_equal_to, g_not_equal_to                             },
    [ g_precedence_bitwise_and  ]  = { g_bitwise_and                                          },
    [ g_precedence_bitwise_xor  ]  = { g_bitwise_xor                                          },
    [ g_precedence_bitwise_or   ]  = { g_bitwise_or                                           },
    [ g_precedence_assignment   ]  = { g_equals                                               },
};

The remainder of Nibble carries on with prefix, postfix, LLVM IR, branching, and looping, in about ~3000 lines of a single main.c. I reckon the core parse and emit engine is roughly 1000 lines, and might be worth studying for those inclined.