Engine Simulator 4, or ensim4, is a rewrite of ensim3 with an emphasis on cache locality. It now includes a SIMD-friendly, branchless 1D CFD (Computational Fluid Dynamics) Lax–Friedrichs (LF) solver for audio generation.

Seen here is an inline 8 engine with modeled fuel injection, combustion, isentropic mass flow, and two exhaust plenums, each of which plots displacement vs. amplitude in its wave_0_pa or wave_1_pa widget on the right:

Having explored both HLLC and HLLE Riemann solvers (which promised better audio fidelity due to their better contact surface restoration characteristics), I found that the audio from a branchless, Riemann-like Lax–Friedrichs solver is almost indistinguishable from that of the Riemann powerhouses, especially for straight pipes where flow is almost guaranteed to be subsonic.

static struct wave_flux_s
calc_solver_flux(struct wave_prim_s ql, struct wave_prim_s qr)
{
    struct wave_cons_s ul = prim_to_cons(ql);
    struct wave_cons_s ur = prim_to_cons(qr);
    double cl = sqrt(g_wave_gamma * ql.p / ql.r);
    double cr = sqrt(g_wave_gamma * qr.p / qr.r);
    struct wave_flux_s fl = { .r = ul.m, .m = ul.m * ql.u + ql.p, .e = (ul.e + ql.p) * ql.u };
    struct wave_flux_s fr = { .r = ur.m, .m = ur.m * qr.u + qr.p, .e = (ur.e + qr.p) * qr.u };
    double a = fmax(fabs(ql.u) + cl, fabs(qr.u) + cr);
    /*
     *       1               1
     * FC = --- (FL + FR) - --- a * (UR - UL)
     *       2               2
     */
    return (struct wave_flux_s) {
        .r = 0.5 * (fl.r + fr.r) - 0.5 * a * (ur.r - ul.r),
        .m = 0.5 * (fl.m + fr.m) - 0.5 * a * (ur.m - ul.m),
        .e = 0.5 * (fl.e + fr.e) - 0.5 * a * (ur.e - ul.e),
    };
}

Being branchless, the solver is SIMD friendly and auto-vectorizes readily, at least with clang.

For reference, consider the inline 8 from ensim2, which did not model exhaust pipe CFD and simply sampled the exhaust plenum’s chamber pressure:

Performance

Whole-program performance across 3 threads averages ~1.7 instructions per cycle (IPC) for a 10-second run:

perf stat metrics on a laptop from 2019 using an Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz (CPU max MHz: 4800.0000):

perf stat -e cpu-cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,branch-instructions,branch-misses,dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-loads,iTLB-load-misses ./ensim4

    45,549,973,120      cpu-cycles:u                                                            (28.80%)
    77,653,751,545      instructions:u                   #    1.70  insn per cycle              (35.93%)
    18,466,717,215      L1-dcache-loads:u                                                       (35.86%)
     1,315,232,046      L1-dcache-load-misses:u          #    7.12% of all L1-dcache accesses   (35.94%)
        53,871,616      LLC-loads:u                                                             (35.86%)
         8,979,688      LLC-load-misses:u                #   16.67% of all LL-cache accesses    (35.77%)
     4,822,238,256      branch-instructions:u                                                   (28.67%)
        34,444,526      branch-misses:u                  #    0.71% of all branches             (28.47%)
    18,657,094,013      dTLB-loads:u                                                            (28.48%)
           559,976      dTLB-load-misses:u               #    0.00% of all dTLB cache accesses  (28.33%)
     8,942,910,928      dTLB-stores:u                                                           (28.26%)
           422,652      dTLB-store-misses:u                                                     (28.45%)
           762,642      iTLB-loads:u                                                            (28.50%)
           414,639      iTLB-load-misses:u               #   54.37% of all iTLB cache accesses  (28.61%)
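
The percentages in the right-hand comments are plain ratios of the raw counters, so they can be recomputed as a sanity check. The sketch below does that for the figures above; the helper names (ipc, percent, print_derived) are made up for illustration. The trailing parenthesized percentages on each line show how long each event was actually being counted, since perf multiplexes events when more are requested than the hardware has counters for, so the raw counts are scaled estimates.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helpers for illustration; not part of ensim4 or perf. */

/* Instructions retired per CPU cycle. */
static double ipc(double instructions, double cycles)
{
    assert(cycles > 0.0);
    return instructions / cycles;
}

/* Ratio of two counters expressed as a percentage. */
static double percent(double part, double whole)
{
    assert(whole > 0.0);
    return 100.0 * part / whole;
}

/* Recompute the derived metrics from the counters listed above. */
static void print_derived(void)
{
    printf("IPC            %.2f\n", ipc(77653751545.0, 45549973120.0));
    printf("L1d miss rate  %.2f%%\n", percent(1315232046.0, 18466717215.0));
    printf("LLC miss rate  %.2f%%\n", percent(8979688.0, 53871616.0));
    printf("branch misses  %.2f%%\n", percent(34444526.0, 4822238256.0));
    printf("iTLB misses    %.2f%%\n", percent(414639.0, 762642.0));
}
```

Calling print_derived() from a main function should reproduce the 1.70, 7.12%, 16.67%, 0.71%, and 54.37% figures that perf printed.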

A compiler’s ability to communicate missed vectorization optimizations is certainly useful, as with clang’s -Rpass-missed=loop-vectorize flag:

clang -std=c23 -Rpass-missed=loop-vectorize ... | grep "loop not vectorized" | sort | uniq

Addressing these missed auto-vectorization opportunities is the next step:

src/engine_s.h:189:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:206:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:219:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:274:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:315:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:328:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:356:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:67:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:76:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:92:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/node_s.h:102:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/node_s.h:131:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/node_s.h:138:17: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/node_s.h:72:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:147:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:285:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:325:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:353:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:375:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:454:13: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:485:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:611:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:650:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:758:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:782:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/wave_s.h:155:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/wave_s.h:246:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/wave_s.h:258:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/wave_s.h:280:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
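
The remarks above do not say why each loop was skipped. Clang's -Rpass-analysis=loop-vectorize flag usually adds the reason. As a generic sketch (not taken from ensim4's sources, and with made-up function names), the two loops below show a shape that vectorizers typically accept and one they typically reject. Actual results depend on compiler version and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical examples for illustration; not from ensim4. */

/* Independent per-element work. The restrict qualifiers promise the
 * compiler that dst, a, and b do not overlap, which removes the need
 * for runtime alias checks. */
static void scale_add(double *restrict dst, const double *restrict a,
                      const double *restrict b, double k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + k * b[i];
    }
}

/* Loop-carried dependence: iteration i reads the value written by
 * iteration i - 1, which usually prevents plain loop vectorization. */
static void running_sum(double *x, size_t n)
{
    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        x[i] += x[i - 1];
    }
}
```

When a remark points at a loop like the second one, the fix is often algorithmic (restructuring the data flow) rather than a matter of adding qualifiers or pragmas.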