
Case Study


Fornax

C++ Python CMake AVX-512 SIMD RAPL

Overview

Fornax is a benchmarking tool I built to investigate an interesting quirk of modern Intel CPUs: running AVX-512 instructions causes the processor to reduce its clock speed. The tool explores whether pulsing the workload—running AVX hard, then backing off—could maintain higher average frequencies than running continuously. It operates entirely in userspace, requiring no BIOS access or kernel modifications.

For a full narrative writeup of the experience, check out my blog post.


The Problem

When executing wide SIMD instructions, Intel CPUs apply frequency “offsets” to prevent voltage instability. Heavy AVX-512 FMA operations can cause a 300–500 MHz frequency penalty on high-end desktop silicon. Beyond the frequency drop, each voltage state transition costs 10–20 microseconds of stabilization delay.

The conventional solution—disabling AVX offsets in BIOS—doesn’t hold up in practice. The CPU’s Power Control Unit enforces thermal limits regardless of BIOS settings, cloud deployments rarely expose BIOS access, and microcode-level frequency caps cannot be overridden. Static BIOS tuning also doesn’t adapt to changing workloads.

I wanted to see if a software-only, adaptive approach could do better.


Technical Approach

Architecture

Fornax uses a two-thread architecture pinned to separate cores:

  • Monitor thread (Core 0): Reads power consumption from Intel RAPL and current frequency from sysfs. Implements three control modes—Schmitt trigger (hysteresis-based), manual duty cycle (fixed work/pause ratio), and adaptive gradient (online throughput optimization).
  • Worker thread (Core 1): Runs configurable SIMD workloads including synthetic FMA stress tests and financial kernels (Black-Scholes pricing, Monte Carlo simulation, covariance matrix computation).
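
Of the three control modes, the Schmitt trigger is the simplest to illustrate: a comparator with two thresholds that keeps its previous decision while the input sits between them. The sketch below is a minimal illustration, not Fornax's actual controller. The `SchmittTrigger` class, the use of package power in watts as the input, and any threshold values are assumptions.

```cpp
#include <cassert>

// Hypothetical hysteresis controller: start throttling when the input rises
// above `high`, and release only once it falls below `low`. Between the two
// thresholds the previous decision is kept, which avoids rapid toggling.
class SchmittTrigger {
public:
    SchmittTrigger(double low, double high) : low_(low), high_(high) {
        assert(low < high);
    }

    // Feed one sample (assumed here to be package power in watts) and
    // return whether the worker should currently be throttled.
    bool update(double sample) {
        if (!throttled_ && sample > high_) {
            throttled_ = true;
        } else if (throttled_ && sample < low_) {
            throttled_ = false;
        }
        return throttled_;
    }

private:
    double low_;
    double high_;
    bool throttled_ = false;
};
```

With thresholds of 90 and 110, for example, a sample of 100 leaves the state unchanged whichever way it was last set, so noise around a single set point does not flip the throttle signal on every reading.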

Key Design Decisions

False sharing prevention. Each shared atomic variable is aligned to its own 64-byte cache line. Without this padding, throughput dropped ~30% due to cache line ping-pong between cores.

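    #include <atomic>
    #include <cstdint>

    // Each atomic occupies its own 64-byte cache line.
    struct alignas(64) SharedState {
        std::atomic<bool> throttle_signal;       // 1 byte
        char padding1[63];                       // pad out to byte 64
        std::atomic<uint64_t> iteration_count;   // 8 bytes, starts at byte 64
        char padding2[56];                       // pad out to byte 128
    };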
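
Hand-counted padding is easy to get wrong when a member changes size. One alternative, sketched here under assumptions rather than taken from the Fornax source, is to put `alignas` on each member and let the compiler insert the padding, with assertions guarding the layout. `AlignedState`, `kCacheLine`, `member_distance`, and `check_layout` are illustrative names, and the 64-byte line size is an assumption that does not hold on every CPU.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

struct AlignedState {
    alignas(kCacheLine) std::atomic<bool> throttle_signal{false};
    alignas(kCacheLine) std::atomic<std::uint64_t> iteration_count{0};
};

// Each member starts its own line, so the struct spans exactly two lines.
static_assert(alignof(AlignedState) == kCacheLine, "struct must be line-aligned");
static_assert(sizeof(AlignedState) == 2 * kCacheLine, "one line per member");

// Byte distance between the two members of a given instance.
inline std::ptrdiff_t member_distance(const AlignedState& s) {
    return reinterpret_cast<const char*>(&s.iteration_count) -
           reinterpret_cast<const char*>(&s.throttle_signal);
}

// Runtime check that complements the static_asserts: the two members of a
// live instance sit exactly one cache line apart.
inline void check_layout(const AlignedState& s) {
    assert(member_distance(s) == static_cast<std::ptrdiff_t>(kCacheLine));
}
```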

Relaxed memory ordering. The hot path uses memory_order_relaxed for atomics. We don’t need the flag to order any other memory accesses, only for its new value to become visible to the other core. In practice the signal propagates within nanoseconds, which is sufficient for the control loop’s granularity.
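
A minimal sketch of what a relaxed flag check in a worker loop might look like. The names `g_throttle`, `g_stop`, `g_iterations`, `worker_step`, and `worker_loop` are hypothetical, and the arithmetic is a placeholder for the real SIMD kernels.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

std::atomic<bool> g_throttle{false};          // set by the monitor thread
std::atomic<bool> g_stop{false};              // ends the run
std::atomic<std::uint64_t> g_iterations{0};   // progress counter

// One pass of the hot path. Returns true if a unit of work was done.
bool worker_step(double& acc) {
    if (g_throttle.load(std::memory_order_relaxed)) {
        std::this_thread::yield();            // back off while throttled
        return false;
    }
    acc = acc * 1.0000001 + 1e-9;             // placeholder for the SIMD kernel
    g_iterations.fetch_add(1, std::memory_order_relaxed);
    return true;
}

// Worker body: spin on the hot path until asked to stop.
void worker_loop() {
    // A lock-based atomic would defeat the purpose of a relaxed hot path.
    assert(g_iterations.is_lock_free());
    double acc = 1.0;
    while (!g_stop.load(std::memory_order_relaxed)) {
        worker_step(acc);
    }
}
```

The monitor side would call `g_throttle.store(value, std::memory_order_relaxed)` from its own thread. Relaxed ordering is only appropriate here because nothing else is published through the flag; if the worker had to read data written before the store, a release/acquire pair would be needed instead.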

Cross-platform abstraction. The code compiles on both x86_64 (AVX-512/AVX2, RDTSC, PAUSE) and ARM64 (NEON, CNTVCT, YIELD). ARM doesn’t exhibit AVX-style offsets, but the abstraction layer was useful for development on Apple Silicon and forced cleaner architecture decisions.
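
As a rough idea of what such an abstraction layer could look like, here is a sketch assuming GCC or Clang. The `platform` namespace and the `cpu_relax` and `read_ticks` names are illustrative and not taken from the Fornax source; the portable fallback branch is also an addition for completeness.

```cpp
#include <cstdint>

#if defined(__x86_64__)
#include <x86intrin.h>   // _mm_pause, __rdtsc (GCC/Clang)
#elif !defined(__aarch64__)
#include <chrono>
#include <thread>
#endif

namespace platform {

// Spin-wait hint: PAUSE on x86_64, YIELD on ARM64.
inline void cpu_relax() {
#if defined(__x86_64__)
    _mm_pause();
#elif defined(__aarch64__)
    asm volatile("yield" ::: "memory");
#else
    std::this_thread::yield();  // portable fallback
#endif
}

// Raw tick counter: RDTSC on x86_64, CNTVCT_EL0 on ARM64.
// Tick units differ per platform, so only compare values from the same machine.
inline std::uint64_t read_ticks() {
#if defined(__x86_64__)
    return __rdtsc();
#elif defined(__aarch64__)
    std::uint64_t ticks;
    asm volatile("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

}  // namespace platform
```

Keeping the `#if` branches inside two small functions means the rest of the code can call `platform::cpu_relax()` and `platform::read_ticks()` without knowing which ISA it is running on.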


Results & Impact

ARM (M1 — Control Group)

No frequency scaling effects, as expected. The duty cycle mechanism introduces only ~1.6% overhead from throttle-checking logic:

Duty Cycle    Iterations/sec    Overhead
0%            103,228           Baseline
50%           101,573           1.6%
100%          101,596           1.6%

x86 (Intel i9-10900K)

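Frequency scaling effects are clearly visible. Duty cycle here is the share of time the worker spends paused, so 0% means continuous execution. The trade-off between frequency and throughput is non-linear: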

Duty Cycle         Avg Frequency    Throughput          Effective Work
0% (continuous)    4.75 GHz         2,847,294 iter/s    100% (baseline)
50%                5.15 GHz         1,423,647 iter/s    50%
70%                5.05 GHz         854,188 iter/s      30%

For raw throughput, continuous execution wins. The frequency penalty is smaller than the time lost during pause phases. For latency-sensitive applications, pulsing provides higher instantaneous burst frequencies and more predictable performance. For power-constrained environments, 50% duty cycle cuts power consumption ~48% while retaining responsive compute.


What I Learned

The AVX offset is the result of careful engineering by people much smarter than me. The CPU is probably making a good decision when it throttles. But the exercise proved the frequency behavior is deterministic, controllable from software, and measurable without special privileges.

The project forced me to learn systems programming the hard way—cache line alignment, memory ordering, platform-specific ISA intrinsics, and power management interfaces. It was my first serious C++ project and my first time reading Intel architecture manuals. The uncomfortable feeling of not knowing what I was doing was the whole point.