Ensim4
Engine Simulator 4, or ensim4, is a rewrite of ensim3 with an emphasis on cache locality. It now includes a SIMD-based, branchless 1D CFD (Computational Fluid Dynamics) solver using the Lax–Friedrichs (LF) scheme for audio generation.
Seen here, an inline 8 engine with modeled fuel injection, combustion, isentropic mass flow, and two exhaust plenums, whose outputs are plotted as displacement vs. amplitude in the wave_0_pa and wave_1_pa widgets to the right:
Having explored both HLLC and HLLE Riemann solvers (which promised better audio fidelity due to their better contact surface restoration characteristics), I found that the audio output of a branchless, Riemann-like Lax–Friedrichs solver is almost indistinguishable from that of the Riemann powerhouses, especially for straight pipes where flow is almost guaranteed to be subsonic.
static struct wave_flux_s
calc_solver_flux(struct wave_prim_s ql, struct wave_prim_s qr)
{
    struct wave_cons_s ul = prim_to_cons(ql);
    struct wave_cons_s ur = prim_to_cons(qr);
    double cl = sqrt(g_wave_gamma * ql.p / ql.r);
    double cr = sqrt(g_wave_gamma * qr.p / qr.r);
    struct wave_flux_s fl = { .r = ul.m, .m = ul.m * ql.u + ql.p, .e = (ul.e + ql.p) * ql.u };
    struct wave_flux_s fr = { .r = ur.m, .m = ur.m * qr.u + qr.p, .e = (ur.e + qr.p) * qr.u };
    double a = fmax(fabs(ql.u) + cl, fabs(qr.u) + cr);
    /*
     *       1                1
     * FC = --- (FL + FR) - --- a * (UR - UL)
     *       2                2
     */
    return (struct wave_flux_s) {
        .r = 0.5 * (fl.r + fr.r) - 0.5 * a * (ur.r - ul.r),
        .m = 0.5 * (fl.m + fr.m) - 0.5 * a * (ur.m - ul.m),
        .e = 0.5 * (fl.e + fr.e) - 0.5 * a * (ur.e - ul.e),
    };
}
The branchless nature is SIMD friendly and readily auto-vectorizes, at least with clang.
For reference, consider the inline 8 from ensim2, which did not model exhaust pipe CFD and instead simply sampled the exhaust plenum’s chamber pressure:
Performance
Whole-program performance across 3 threads averages ~1.7 instructions per cycle (IPC) for a 10 second run:
perf stat metrics on a laptop from 2019 using an Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz (CPU max MHz: 4800.0000):
perf stat -e cpu-cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,branch-instructions,branch-misses,dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-loads,iTLB-load-misses ./ensim4
45,549,973,120 cpu-cycles:u (28.80%)
77,653,751,545 instructions:u # 1.70 insn per cycle (35.93%)
18,466,717,215 L1-dcache-loads:u (35.86%)
1,315,232,046 L1-dcache-load-misses:u # 7.12% of all L1-dcache accesses (35.94%)
53,871,616 LLC-loads:u (35.86%)
8,979,688 LLC-load-misses:u # 16.67% of all LL-cache accesses (35.77%)
4,822,238,256 branch-instructions:u (28.67%)
34,444,526 branch-misses:u # 0.71% of all branches (28.47%)
18,657,094,013 dTLB-loads:u (28.48%)
559,976 dTLB-load-misses:u # 0.00% of all dTLB cache accesses (28.33%)
8,942,910,928 dTLB-stores:u (28.26%)
422,652 dTLB-store-misses:u (28.45%)
762,642 iTLB-loads:u (28.50%)
414,639 iTLB-load-misses:u # 54.37% of all iTLB cache accesses (28.61%)
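The derived figures in the listing are plain ratios of the raw counters, and a short sketch makes that arithmetic explicit. The struct perf_counts_s and the helpers ratio and print_derived are hypothetical names used only for this illustration; the counter values are copied from the listing above. Note that the trailing percentages in parentheses show how much of the run each event was actually counted, so the counts are scaled estimates rather than exact totals.

```c
#include <assert.h>
#include <stdio.h>

/* Raw counters, copied from the perf stat listing above. */
struct perf_counts_s {
    double cycles, instructions;
    double l1d_loads, l1d_load_misses;
    double llc_loads, llc_load_misses;
    double branches, branch_misses;
    double itlb_loads, itlb_load_misses;
};

static const struct perf_counts_s k_run = {
    .cycles = 45549973120.0,    .instructions = 77653751545.0,
    .l1d_loads = 18466717215.0, .l1d_load_misses = 1315232046.0,
    .llc_loads = 53871616.0,    .llc_load_misses = 8979688.0,
    .branches = 4822238256.0,   .branch_misses = 34444526.0,
    .itlb_loads = 762642.0,     .itlb_load_misses = 414639.0,
};

/* Plain quotient; the denominator must be a positive count. */
static double
ratio(double num, double den)
{
    assert(den > 0.0);
    return num / den;
}

/* Print the same derived metrics that perf annotates after the '#'. */
static void
print_derived(const struct perf_counts_s *c)
{
    printf("IPC             %.2f\n", ratio(c->instructions, c->cycles));
    printf("L1d miss rate   %.2f%%\n", 100.0 * ratio(c->l1d_load_misses, c->l1d_loads));
    printf("LLC miss rate   %.2f%%\n", 100.0 * ratio(c->llc_load_misses, c->llc_loads));
    printf("branch misses   %.2f%%\n", 100.0 * ratio(c->branch_misses, c->branches));
    printf("iTLB miss rate  %.2f%%\n", 100.0 * ratio(c->itlb_load_misses, c->itlb_loads));
}
```

Calling print_derived(&k_run) reproduces the annotated values. The high iTLB miss percentage is a ratio over a very small number of iTLB loads (762,642), so it carries far less weight than the L1 data cache figure.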
A compiler’s ability to communicate missed vectorization optimizations is certainly useful, as with clang’s -Rpass-missed=loop-vectorize flag:
clang -std=c23 -Rpass-missed=loop-vectorize ... 2>&1 | grep "loop not vectorized" | sort | uniq
Addressing these missed auto-vectorizations is the logical next step:
src/engine_s.h:189:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:206:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:219:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:274:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:315:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:328:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:356:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:67:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:76:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/engine_s.h:92:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/node_s.h:102:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/node_s.h:131:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/node_s.h:138:17: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/node_s.h:72:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:147:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:285:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:325:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:353:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:375:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:454:13: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:485:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:611:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:650:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:758:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/sdl.h:782:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/wave_s.h:155:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/wave_s.h:246:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/wave_s.h:258:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
src/wave_s.h:280:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
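The remarks above do not say why each loop was rejected; clang's -Rpass-analysis=loop-vectorize flag reports the reason. One common cause, shown here purely as an illustration and not as a diagnosis of the listed lines, is a floating-point reduction: unless floating-point semantics are relaxed, the compiler has to preserve the summation order. The function names sum_serial and sum_lanes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Strict left-to-right sum: every addition depends on the previous one,
 * so the order of operations is fixed by the source. */
static double
sum_serial(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Four independent accumulators: the reordering is written out explicitly,
 * which leaves the compiler free to map each lane to a SIMD element. */
static double
sum_lanes(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double acc[4] = { 0.0, 0.0, 0.0, 0.0 };
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; k++)
            acc[k] += x[i + k];
    double tail = 0.0; /* leftover elements when n is not a multiple of 4 */
    for (; i < n; i++)
        tail += x[i];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail;
}
```

Splitting the sum changes the rounding order, so the two functions can differ in the last bits for general inputs. Whether clang actually vectorizes the second form depends on the compiler version and flags, and is best confirmed with -Rpass=loop-vectorize.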