Finding a bug in Apple's Metal Shader compiler
How the bug was found, step by step
Origin: A Commented-Out Test in tinygrad
The trail starts at test/null/test_uops.py:164:
# @unittest.skipIf(Device.DEFAULT == "METAL", "compiler bug")
This commented-out skip decorator sits above the TestFastIdiv class, which tests tinygrad’s fast integer division optimization — replacing x / D and x % D with the classic multiply-shift trick: (int)(((long)(x) * magic) >> shift). The comment tells the whole story: Metal was producing wrong results, and a compiler bug was suspected.
The actual triggering workload was SHA3/Keccak: Tensor(bytes(range(9))).keccak("sha3_256"). Keccak’s permutation operates on a 5x5 state, so it heavily uses % 5 to wrap indices — which tinygrad compiles into fast_idiv with magic constant 1717986919 and shift 33.
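As a sanity check that these constants really implement division by 5, here is a standalone Python snippet (not tinygrad code) exercising the multiply-shift formula exactly as it appears in the generated kernel:

```python
# Standalone check of the fast_idiv constants from the generated kernel:
# (x * 1717986919) >> 33 is the textbook magic-number division by 5 for
# non-negative 32-bit x, so the derived remainder matches x % 5.
MAGIC, SHIFT, D = 1717986919, 33, 5

for x in range(1 << 16):
    q = (x * MAGIC) >> SHIFT   # fast_idiv quotient
    r = x - D * q              # fast_idiv remainder
    assert q == x // D and r == x % D
print("fast_idiv quotient/remainder match // and % for all tested x")
```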
Phase 1: metalbug_minimal.py — Finding the Input Size Boundary
The first investigation was at the tensor level, not the shader level. SHA3 was run on inputs of sizes 1 through 10 bytes (metalbug_minimal.py:56):
for size in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
    data = bytes(range(size))  # assumed input, matching the bytes(range(9)) repro
    result_hex = bytes(Tensor(data).keccak("sha3_256").tolist()).hex()
    expected_hex = hashlib.sha3_256(data).digest().hex()
8 bytes: PASS. 9 bytes: FAIL.
The reason 9 bytes is the critical boundary is visible in the generated kernel. tinygrad names the kernel E_25n3 and its data buffer data2_9 — 9 unsigned chars. The kernel conditionally loads from this buffer using the remainder as an index:
unsigned char val1 = (cond) ? *(data2_9 + (remainder << 3)) : (unsigned char)(0u);
When remainder == 1, the load address is 1 << 3 = 8, i.e., the 9th byte (data2_9[8]). With a 9-byte input like bytes(range(9)), that byte exists and has value 8. The bug causes the load to return 0 instead. With 8 bytes of input, the buffer is only 8 bytes long — data[8] doesn’t exist, so the kernel shape changes and the buggy pattern doesn’t arise.
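The boundary arithmetic is easy to check directly (a quick standalone illustration, not from the investigation scripts):

```python
# A remainder of 1 becomes byte offset 1 << 3 = 8, so the kernel reads the
# 9th byte of the data buffer. With bytes(range(9)) that byte exists and
# holds 8 (the value the buggy kernel fails to load).
data = bytes(range(9))
offset = 1 << 3                  # remainder == 1
assert offset == 8 and data[offset] == 8
```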
The earlier scratch file metalbug.py also shows the initial investigation path: first testing basic division and modulo operations (which passed), then testing at the SHA3-256 rate boundary of 136 bytes with sha3_224 (which also failed), before narrowing down to the exact byte count.
Phase 2: metalbug_e25n3.py — The Exact Generated Kernel
The exact kernel tinygrad generates for Tensor(bytes(range(9))).keccak("sha3_256") was captured — the E_25n3 kernel (25 threads). This ~130-line kernel does:
- Load a permuted index (`reorder_indexes` from the Keccak pi step)
- Compute `(alu0+1) % 5` and `(alu0+4) % 5` using fast_idiv — twice each, once via long arithmetic (for the condition) and once via int arithmetic (for the address)
- Conditionally load bytes using the int-path remainder as address
- Assemble 8 bytes into `unsigned long` values via XOR chains
- Apply rotation and mix operations
The generated code has two separate computations of (alu0+1) % 5:
// Long-path: for the condition
long alu2 = (cast0 + 1ll);
int cast1 = ((int)((alu2 - (5ll * (((alu2 * 1717986919ll) >> 33ll) + alu3)))));
bool cond = (cast1 < 2) & ...;
// Int-path: for the buffer address (inside the ternary)
unsigned char val1 = (cond) ? *(data2_9 + ((alu1 - (5 * (((int)(((long)(alu1)) * 1717986919ll >> 33ll)) + alu4))) << 3)) : 0;
Both paths compute the same mathematical value ((alu0+1) % 5), but using different intermediate types. This is the pattern that triggers the optimizer to make different decisions about each one.
A diagnostic variant showed that the fast_idiv remainders themselves were correct when output to a buffer. The bug was in how the remainder was used as a buffer load address inside a conditional.
Phase 3: metalbug_paredown.py — Systematic Reduction
This was the key breakthrough file. The E_25n3 kernel was stripped layer by layer:
| Version | What it tests | Result | Significance |
|---|---|---|---|
| A | Just the first conditional load (`val1`) — no XOR chains, no rotation | FAIL | Bug doesn't need full Keccak complexity |
| B | Just the second conditional load (`val2`, using +4 instead of +1) | FAIL | Both modulo variants affected |
| C | Both loads + XOR assembly | FAIL | Expected (superset of A/B) |
| D | Int-path fast_idiv for both condition and index (no long-path) | PASS | Critical: bug needs different variables for condition vs index |
| E | Direct buffer load using fast_idiv, no conditional | PASS | The conditional branch is necessary |
| F | Full kernel but only ONE fast_idiv (`val1`), rest use native `%` | FAIL | One fast_idiv suffices |
Key insight: The bug triggers when the condition is derived from a different computation (long-path modulo) than the address (int-path fast_idiv). Using the same variable for both avoids the bug because the optimizer makes structurally different decisions about the IR (see Phase 6).
Phase 4: metalbug_minimal_repro.py — Hypothesis Testing
With the minimal failing kernel identified (Version A: ~15 lines of Metal), a systematic hypothesis battery was run:
| Test | Setup | Result | Conclusion |
|---|---|---|---|
| H1 | Native `%` for condition, fast_idiv for index | FAIL | Condition doesn't need to be fast_idiv — any separate variable triggers it |
| H2 | Fast_idiv for condition, native `%` for index | PASS | Bug requires fast_idiv specifically in the address/index |
| H3 | Both long-path fast_idiv | PASS | Same-type computation avoids bug |
| H4 | Both int-path fast_idiv | PASS | Same variable for cond+addr avoids bug |
| H5 | Different div method for condition + fast_idiv index | FAIL | Confirmed: any separate condition + fast_idiv index |
| H6 | `volatile` to block CSE/optimizer fusion | PASS | Smoking gun: preventing optimizer fusion avoids the bug |
| H7 | Store int-path to variable first | FAIL | Storing to variable alone doesn't help (optimizer sees through it) |
H6 was the smoking gun: volatile prevents the optimizer from fusing the multiply-by-(-5) with the shift, and the bug disappears.
Phase 5: metalbug_deep.py — Pinpointing the Mechanism
To understand whether the cause was wrong address computation, wrong branch elimination, or wrong load:
- H8/H10: Unconditional loads with long-path present but unused in condition — remainders correct, loads correct. The fast_idiv itself is computed correctly in isolation.
- H9: Always-true condition (`cast1 < 100`) with the buggy pattern — still FAIL. Even though the condition is provably always true for inputs 0-24, the optimizer still emits a conditional branch and moves the address computation inside it, triggering the bug.
- H11: Output the condition, remainder, address, AND loaded value simultaneously — revealed that the condition was true (correct), the remainder was correct, the address was correct, but the load returned 0 for certain threads. The GPU backend was misexecuting the load from a correctly computed address.
Phase 6: metalbug_ir.py + metalbug_asm.py — LLVM IR Analysis
The buggy and passing kernels were compiled to LLVM IR text (xcrun metal -O2 -S -emit-llvm). The generated .ll files are in test/air-llvm/. Comparing them reveals exactly what the optimizer does differently.
How -O2 transforms the buggy kernel
At -O0 (air-llvm/bug_O0.ll), the buggy kernel’s address computation is straightforward and correct. Inside the true-branch (label %80, line 106), the address is computed using the original formula with mul nsw i32 5, %89 (multiply by 5), then subtracted:
; bug_O0.ll lines 108-121 — inside the true-branch at -O0
%84 = sext i32 %83 to i64
%85 = mul nsw i64 %84, 1717986919 ; fast-div: alu1 * magic
%86 = ashr i64 %85, 33 ; >> 33 = quotient
%87 = trunc i64 %86 to i32
%89 = add nsw i32 %87, %88 ; q + neg_correction
%90 = mul nsw i32 5, %89 ; 5 * (q + neg)
%91 = sub nsw i32 %82, %90 ; alu1 - 5*q = remainder
%92 = shl i32 %91, 3 ; remainder << 3 = byte offset
%93 = sext i32 %92 to i64
%94 = getelementptr inbounds i8, ... %93
%95 = load i8, ... ; load data[remainder * 8]
At -O2 (air-llvm/bug_O2.ll), the optimizer moves the address computation inside the conditional branch and fuses mul by -5 with shl by 3. The entire true-branch (label %31, lines 36-51) becomes:
; bug_O2.ll lines 36-50 — inside the true-branch at -O2
31:
%32 = zext i1 %14 to i32
%33 = add nsw i32 %11, 1 ; alu1 = alu0 + 1
%34 = sext i32 %33 to i64
%35 = mul nsw i64 %34, 1717986919 ; fast-div: alu1 * magic
%36 = lshr i64 %35, 33 ; >> 33 = quotient
%37 = trunc i64 %36 to i32
%38 = add nuw i32 %37, %32 ; q + neg_correction
%39 = mul i32 %38, 536870907 ; <<< THE FUSED CONSTANT
%40 = add i32 %39, %33 ; + alu1
%41 = shl i32 %40, 3 ; << 3
%42 = sext i32 %41 to i64 ; sign-extend for address
%43 = getelementptr inbounds i8, i8 addrspace(1)* %2, i64 %42
%44 = load i8, i8 addrspace(1)* %43 ; MISCOMPILED LOAD
%45 = zext i8 %44 to i32
The `%39 = mul i32 %38, 536870907` line is the critical transformation. Instead of `alu1 - 5 * q` then `<< 3`, the optimizer algebraically fused the operations into `q * 536870907 + alu1` then `<< 3`. This is valid because in 32-bit modular arithmetic: `(alu1 - 5*q) << 3 == (q * 536870907 + alu1) << 3`, since 536870907 * 8 = 4,294,967,256 = 2^32 - 40 ≡ -40 (mod 2^32) and -5 * 8 = -40.
The LLVM IR is mathematically correct. The GPU backend miscompiles it.
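The fusion can be checked directly; here is a quick randomized sketch in plain Python, simulating i32 wraparound with a mask:

```python
# Checks the -O2 transformation: (alu1 - 5*q) << 3 and
# (q * 536870907 + alu1) << 3 agree in 32-bit modular arithmetic,
# because 536870907 * 8 = 2^32 - 40 and (-5) * 8 = -40.
import random

M32 = (1 << 32) - 1
assert (536870907 * 8) & M32 == (-40) & M32

for _ in range(100_000):
    alu1 = random.randrange(-2**31, 2**31)
    q = random.randrange(-2**31, 2**31)
    assert ((alu1 - 5 * q) << 3) & M32 == ((q * 536870907 + alu1) << 3) & M32
```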
How the passing kernel avoids the bug
In air-llvm/pass_O2.ll, the remainder is needed before the branch (for the condition), so the optimizer computes it pre-branch using mul i32 %19, -5:
; pass_O2.ll lines 19-34 — BEFORE the branch
%16 = mul nsw i64 %15, 1717986919 ; fast-div: alu1 * magic
%17 = ashr i64 %16, 33 ; >> 33 = quotient
%18 = trunc i64 %17 to i32
%19 = add nsw i32 %18, %14 ; q + neg_correction
%20 = mul i32 %19, -5 ; q * (-5) ← SAFE CONSTANT
%21 = add i32 %20, %12 ; alu1 + (-5)*q = remainder
%22 = icmp slt i32 %21, 2 ; condition: remainder < 2
...
br i1 %30, label %37, label %31 ; branch on condition
31: ; true-branch: just shift and load
%32 = shl i32 %21, 3 ; remainder << 3 (reuses %21 from above)
%33 = sext i32 %32 to i64
%34 = getelementptr inbounds i8, i8 addrspace(1)* %2, i64 %33
%35 = load i8, i8 addrspace(1)* %34 ; CORRECT LOAD
The true-branch is trivial — just shl, sext, gep, load. No mul i32 X, 536870907, no i32 overflow.
The H9 kernel: even an always-true condition gets the fused constant
In air-llvm/h9_O2.ll, the condition is cast1 < 100 (always true for inputs 0-24). But the optimizer still emits a conditional branch (it can’t statically prove always-true for arbitrary runtime inputs), and the true-branch at label %23 still contains the fused constant:
; h9_O2.ll lines 28-41 — inside the true-branch
23:
...
%31 = mul i32 %30, 536870907 ; same fused constant
%32 = add i32 %31, %25
%33 = shl i32 %32, 3
...
%36 = load i8, i8 addrspace(1)* %35 ; MISCOMPILED LOAD
This confirms the bug isn’t about condition evaluation — it’s purely about what IR pattern ends up inside the branch.
The T6 kernel: same variable forces -5
In air-llvm/t6_O2.ll, using the same remainder variable for both condition and address produces identical IR structure to pass_O2.ll:
; t6_O2.ll lines 23-24 — BEFORE the branch
%20 = mul i32 %19, -5 ; q * (-5) ← SAFE
%21 = add i32 %20, %12 ; remainder
; t6_O2.ll lines 36-40 — inside the true-branch
31:
%32 = shl i32 %21, 3 ; just shift and load
No 536870907, no bug.
Summary of the IR evidence
| File | `536870907` in branch? | `mul i32 X, -5` pre-branch? | Runtime result |
|---|---|---|---|
| `bug_O2.ll` | Yes (line 44) | No | FAIL |
| `h9_O2.ll` | Yes (line 36) | No | FAIL |
| `t1_O2.ll` | Yes | No | FAIL |
| `t2_O2.ll` | Yes | No | FAIL |
| `pass_O2.ll` | No | Yes (line 23) | PASS |
| `t6_O2.ll` | No | Yes (line 23) | PASS |
| `reproducer_O2.ll` `@buggy` | Yes (line 39) | No | FAIL |
| `reproducer_O2.ll` `@working` | No | Yes (line 72) | PASS |
100% correlation across all test variants.
Phase 7: metalbug_constant.py — Confirming the Constant
Six targeted tests varying the constant and the context:
| Test | Setup | Result | IR contains 536870907? |
|---|---|---|---|
| T1 | Original bug pattern | FAIL | Yes |
| T2 | Explicit `536870907` written in source | FAIL | Yes |
| T3 | Explicit `-5`, remainder as variable | PASS | No |
| T4 | `536870907` but unconditional (no branch) | PASS | Yes (but no branch) |
| T5 | `volatile` prevents fusion | PASS | No |
| T6 | Same variable for cond+addr | PASS | No (uses -5) |
T4 is especially revealing: 536870907 in the source without a conditional branch PASSES. The bug requires both ingredients: the fused constant 536870907 AND a conditional branch containing the address computation.
Phase 8: The Final Reproducers
Three standalone reproducers were created for the Apple bug report:
- `metalbug_reproducer.m` — Objective-C, self-contained, compiles with `clang -framework Metal -framework Foundation`
- `metalbug_reproducer.swift` + `metalbug_cpu.cpp` — Swift+C++ interop, also runs the same logic on CPU to prove the algorithm is correct and the GPU backend is at fault
- `metalbug_reproducer.metal` — Clean two-kernel Metal source (buggy vs working)
And the formal write-up: metalbug_apple_report.md.
Root Cause Summary
The Metal GPU backend (the AGX code generator that runs after LLVM IR → AIR optimization) miscompiles the following IR pattern when it appears inside a conditional branch:
; From reproducer_O2.ll @buggy, lines 39-44 (inside label %25, the true-branch)
%34 = mul i32 %33, 536870907 ; quotient * 536870907 (overflows i32)
%35 = add i32 %34, %28 ; + alu1
%36 = shl i32 %35, 3 ; << 3
%37 = sext i32 %36 to i64 ; sign-extend for 64-bit address
%38 = getelementptr inbounds i8, i8 addrspace(1)* %2, i64 %37
%39 = load i8, i8 addrspace(1)* %38 ; ← returns 0 instead of correct value
The load returns 0 instead of the correct value for inputs where %33 (the quotient) is > 0: the mul/add/shl chain then wraps in 32-bit arithmetic (the shl alone overflows i32 for every quotient ≥ 1). This happens for alu0 ∈ {5, 10, 15, 20} (quotient = 1, 2, 3, 4). The thread with alu0 = 0 (quotient = 0) passes because 0 * 536870907 = 0, so nothing wraps.
The equivalent safe pattern, produced when the remainder is computed pre-branch:
; From reproducer_O2.ll @working, lines 72-73 (BEFORE the branch)
%20 = mul i32 %19, -5 ; quotient * (-5) — no problematic overflow
%21 = add i32 %20, %12 ; + alu1 = remainder (small value 0-4)
Both patterns are algebraically identical in 32-bit modular arithmetic. The LLVM IR is provably correct in both cases. The bug is in the GPU backend’s code generation for the first pattern.
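To make the difference concrete, here is a small Python simulation of the two address computations, with i32 wraparound modeled by a mask. One observation (about the arithmetic, not a confirmed statement of what the backend does): the fused form's value before 32-bit truncation diverges exactly on the threads that fail.

```python
# Simulates both IR address computations for every thread's alu0 in 0..24.
# "safe" is the @working pattern (mul -5 pre-branch); "fused" is the @buggy
# pattern (mul 536870907 inside the branch). They agree mod 2^32; the
# untruncated fused value diverges exactly when the quotient > 0.
M32 = (1 << 32) - 1

for alu0 in range(25):
    alu1 = alu0 + 1
    q = (alu1 * 1717986919) >> 33                 # fast-div quotient == alu1 // 5

    safe  = ((q * -5 + alu1) << 3) & M32          # remainder << 3, always small
    fused = ((q * 536870907 + alu1) << 3) & M32   # identical after i32 wrap
    assert safe == fused

    untruncated = (q * 536870907 + alu1) << 3     # without the i32 wrap
    assert (untruncated == fused) == (q == 0)     # diverges iff quotient > 0
```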
Workaround
Ensure the fast-division remainder is used for both the branch condition and the load address. This forces the LLVM optimizer to compute the remainder before the branch using mul i32 X, -5 (no fusion with the shift), which the GPU backend handles correctly.
Reproducing the LLVM IR
Prerequisites
Install the Metal Toolchain (required for xcrun metal with -S -emit-llvm):
xcodebuild -downloadComponent MetalToolchain
Compiling Metal to human-readable LLVM IR
Each .metal file is compiled at two optimization levels. The -S -emit-llvm flags tell the Metal compiler to emit LLVM IR as text instead of binary AIR bitcode.
# From the tinygrad root directory
mkdir -p test/air-llvm
# Compile all kernels in test/metalbug_ir/ at -O0 and -O2
for f in test/metalbug_ir/*.metal; do
name=$(basename "$f" .metal)
xcrun metal -O0 -fno-fast-math -std=metal3.1 -S -emit-llvm -o "test/air-llvm/${name}_O0.ll" "$f"
xcrun metal -O2 -fno-fast-math -std=metal3.1 -S -emit-llvm -o "test/air-llvm/${name}_O2.ll" "$f"
done
# Compile the standalone reproducer (in test/, not test/metalbug_ir/)
xcrun metal -O0 -fno-fast-math -std=metal3.1 -S -emit-llvm -o test/air-llvm/reproducer_O0.ll test/metalbug_reproducer.metal
xcrun metal -O2 -fno-fast-math -std=metal3.1 -S -emit-llvm -o test/air-llvm/reproducer_O2.ll test/metalbug_reproducer.metal
Compiling Metal to binary AIR (intermediate step)
If you want the binary AIR bitcode files (e.g. to link into a .metallib or inspect with metal-objdump):
# Compile to .air (binary LLVM bitcode in Apple's AIR format)
xcrun metal -O2 -fno-fast-math -std=metal3.1 -c -o test/metalbug_ir/bug.air test/metalbug_ir/bug.metal
# Link into a .metallib (Metal library archive)
xcrun metallib -o test/metalbug_ir/bug.metallib test/metalbug_ir/bug.air
# Disassemble the .metallib to see AIR metadata
xcrun metal-objdump -d test/metalbug_ir/bug.metallib
Building the standalone reproducer binary
cd test
swiftc -g -cxx-interoperability-mode=default \
-import-objc-header metalbug_cpu.h \
metalbug_cpu.cpp metalbug_reproducer.swift \
-framework Metal -framework Foundation \
-o metalbug_reproducer
./metalbug_reproducer
Quick verification: grep for the fused constant
# Should appear in all FAIL kernels, absent from all PASS kernels
grep -l '536870907' test/air-llvm/*_O2.ll
File Index
| File | Role |
|---|---|
| `test/null/test_uops.py:164` | The commented-out skip that started it all |
| `test/metalbug.py` | Initial scratch: keccak impl, early modulo/division tests |
| `test/metalbug_minimal.py` | Phase 1: tensor-level SHA3 tests, 8-byte vs 9-byte boundary |
| `test/metalbug_e25n3.py` | Phase 2: exact generated E_25n3 kernel reproduction |
| `test/metalbug_paredown.py` | Phase 3: systematic kernel reduction (Versions A-F) |
| `test/metalbug_minimal_repro.py` | Phase 4: hypothesis tests (H1-H7) on minimal kernel |
| `test/metalbug_deep.py` | Phase 5: deep diagnosis (H8-H11), pinpointing the load |
| `test/metalbug_ir.py` | Phase 6: LLVM IR compilation and comparison |
| `test/metalbug_asm.py` | Phase 6: broader Metal toolchain exploration |
| `test/metalbug_constant.py` | Phase 7: targeted constant tests (T1-T6) |
| `test/metalbug_reproducer.metal` | Final: clean two-kernel Metal source |
| `test/metalbug_reproducer.m` | Final: Objective-C standalone reproducer |
| `test/metalbug_reproducer.swift` | Final: Swift standalone reproducer |
| `test/metalbug_cpu.cpp` / `.h` | Final: CPU reference implementation proving algorithm correctness |
| `test/metalbug_apple_report.md` | Final: formal bug report for Apple |
| `test/air-llvm/*.ll` | Human-readable LLVM IR from all Metal kernels at -O0 and -O2 |
Metal GPU Compiler Bug: Incorrect Code Generation for mul i32 X, 536870907 Inside Conditional Branch
Summary
The Metal GPU compiler (the runtime JIT that converts AIR to GPU machine code) produces incorrect results for a specific LLVM IR pattern: when the LLVM optimizer fuses a multiply-by-(-5) with a shift-left-by-3 into multiply-by-536870907 then shift-left-by-3, this computation occurs inside a conditional branch, and its result is used as a buffer load address, the load returns 0 instead of the correct value for certain inputs.
The LLVM IR produced by the Metal frontend is provably correct. The bug is in the GPU backend.
Environment
- Apple M3 Pro, macOS 15.x
- Metal compiler: `metalfe-32023.622`
- Metal language: Metal 3.1
- Xcode toolchain: XcodeDefault
Minimal Reproducer
Two Metal kernels perform mathematically equivalent computations but produce different results. Both kernels:
- Read an index value from a buffer
- Compute `(index + 1) % 5` using the standard multiply-shift fast division pattern (`x * 1717986919 >> 33`)
- Use the remainder as a byte offset (`remainder * 8`) to conditionally load from a data buffer
The ONLY difference: in the buggy kernel, the condition and the address use different intermediate variables (causing the LLVM optimizer to move the address computation inside the branch and fuse it with the shift, producing the constant 536870907). In the working kernel, the condition and address use the same intermediate (so the optimizer computes the remainder before the branch and keeps it as a standalone value, using the constant -5).
Kernel A: BUGGY (produces wrong results)
#include <metal_stdlib>
using namespace metal;
kernel void buggy(device int* output, device int* indices, device unsigned char* data,
uint3 gid [[threadgroup_position_in_grid]]) {
int gidx0 = gid.x;
int val0 = indices[gidx0];
int alu0 = (val0 < 0) ? (val0 + 25) : val0;
int alu1 = alu0 + 1;
// Condition: compute (alu0+1) % 5 via LONG arithmetic (native modulo)
long alu2 = (long)(alu0) + 1ll;
int cast1 = (int)(alu2 % 5ll);
bool cond = (cast1 < 2) & (alu0 >= 0) & (alu0 < 25);
// Address: compute (alu1) % 5 via INT fast-division (multiply-shift)
// This is mathematically identical to cast1 since alu1 == (int)alu2
int q = (int)(((long)(alu1) * 1717986919ll) >> 33ll);
int alu4 = (alu1 < 0) ? 1 : 0;
// remainder = alu1 - 5 * (q + alu4), then address = remainder * 8
// The optimizer fuses "- 5 * X" with "<< 3" into "X * 536870907 ... << 3"
unsigned char val1 = cond
? *(data + ((alu1 - 5 * (q + alu4)) << 3))
: (unsigned char)(0u);
output[gidx0] = (int)val1;
}
Kernel B: WORKING (produces correct results)
#include <metal_stdlib>
using namespace metal;
kernel void working(device int* output, device int* indices, device unsigned char* data,
uint3 gid [[threadgroup_position_in_grid]]) {
int gidx0 = gid.x;
int val0 = indices[gidx0];
int alu0 = (val0 < 0) ? (val0 + 25) : val0;
int alu1 = alu0 + 1;
// Compute (alu1) % 5 via INT fast-division (same formula as Kernel A's address)
int q = (int)(((long)(alu1) * 1717986919ll) >> 33ll);
int alu4 = (alu1 < 0) ? 1 : 0;
int remainder = alu1 - 5 * (q + alu4);
// Use the SAME remainder for BOTH condition and address
bool cond = (remainder < 2) & (alu0 >= 0) & (alu0 < 25);
unsigned char val1 = cond
? *(data + (remainder << 3))
: (unsigned char)(0u);
output[gidx0] = (int)val1;
}
Why these kernels are equivalent
Both kernels compute (alu0 + 1) % 5 and use it the same way:
- Condition: `remainder < 2` (and bounds checks on alu0)
- Address: `remainder * 8` (i.e., `remainder << 3`)
In Kernel A, the condition uses `cast1 = (int)(((long)(alu0) + 1) % 5)` and the address uses `alu1 - 5 * quotient` where `alu1 = alu0 + 1`. Since `(long)(alu0) + 1 == (long)(alu0 + 1) == (long)(alu1)`, both `cast1` and the address remainder compute the same value: `(alu0 + 1) % 5`.
In Kernel B, a single remainder variable is used for both.
The mathematical equivalence is straightforward and can be verified for all 25 input values (alu0 = 0 through 24).
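A minimal sketch of that verification in Python (the `alu4` sign correction is zero here because `alu1` is always positive):

```python
# Verifies the equivalence claim for all 25 inputs: Kernel A's condition
# value (native % on a long) and the fast-division remainder used for the
# address (and for everything in Kernel B) are the same number.
for alu0 in range(25):
    alu1 = alu0 + 1
    cast1 = (alu0 + 1) % 5                    # Kernel A condition path
    q = (alu1 * 1717986919) >> 33             # fast-div quotient
    remainder = alu1 - 5 * q                  # Kernel A address / Kernel B path
    assert cast1 == remainder
print("condition and address remainders agree for alu0 = 0..24")
```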
Test setup
- `indices` buffer: 25 ints containing a permutation of 0-24: `[0, 6, 12, 18, 24, 3, 9, 10, 16, 22, 1, 7, 13, 19, 20, 4, 5, 11, 17, 23, 2, 8, 14, 15, 21]`
- `data` buffer: 9 bytes containing `[0, 1, 2, 3, 4, 5, 6, 7, 8]`
- Launch 25 threads (one per index)
Expected results (both kernels)
For each thread, (alu0 + 1) % 5 determines whether to load:
- If remainder < 2 (i.e., remainder is 0 or 1): load `data[remainder * 8]`
  - remainder 0 → `data[0]` = 0
  - remainder 1 → `data[8]` = 8
- Otherwise: output 0
Threads where alu0 ∈ {0, 5, 10, 15, 20} have remainder = 1, so they should output 8.
Actual results
| Thread (gidx0) | alu0 | Expected | Kernel B (working) | Kernel A (buggy) |
|---|---|---|---|---|
| 0 | 0 | 8 | 8 | 8 |
| 7 | 10 | 8 | 8 | 0 |
| 14 | 20 | 8 | 8 | 0 |
| 16 | 5 | 8 | 8 | 0 |
| 23 | 15 | 8 | 8 | 0 |
Kernel A returns 0 (the else-branch value) for 4 out of 5 threads where remainder = 1. The thread with alu0 = 0 passes; the threads with alu0 = 5, 10, 15, 20 fail. All failing threads have fast-division quotient > 0.
Root Cause Analysis (LLVM IR)
Compiling both kernels with xcrun metal -O2 -S -emit-llvm reveals the cause. The LLVM IR is correct for both kernels, but they have different structure due to valid optimizer transformations.
Kernel B (working) — IR structure
The optimizer computes the remainder before the branch because it’s needed for both the condition and the address:
; Before branch: compute remainder using mul i32 %quotient, -5
%20 = mul i32 %19, -5 ; quotient * (-5)
%21 = add i32 %20, %12 ; alu1 + (-5)*quotient = remainder
%22 = icmp slt i32 %21, 2 ; condition: remainder < 2
...
br i1 %30, label %37, label %31 ; branch
31: ; true branch: just shift + load
%32 = shl i32 %21, 3 ; remainder << 3
...load...
Kernel A (buggy) — IR structure
The optimizer computes the remainder inside the branch (since it’s only needed for the address, not the condition), and fuses *(-5) with <<3:
; Before branch: compute condition using srem
%14 = srem i64 %13, 5 ; native modulo for condition
...
br i1 %24, label %41, label %25 ; branch
25: ; true branch: full recomputation
%28 = mul nsw i64 %27, 1717986919 ; fast-div multiply
%29 = lshr i64 %28, 33 ; shift for quotient
...
%34 = mul i32 %33, 536870907 ; FUSED: quotient * 536870907
%35 = add i32 %34, %26 ; + alu1
%36 = shl i32 %35, 3 ; << 3
%37 = sext i32 %36 to i64 ; sign-extend for address
...load...
The fused constant 536870907
The optimizer transforms (alu1 - 5*q) << 3 into (q * 536870907 + alu1) << 3. This is valid because 536870907 * 8 ≡ -40 (mod 2^32) and (-5) * 8 = -40, so after the shift, both expressions produce the same 32-bit result. The constant 536870907 is the unique value in [0, 2^29) satisfying this equivalence.
This transformation is algebraically correct — the intermediate i32 value overflows but the final result after the shift and truncation is identical. The LLVM IR is valid.
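The uniqueness claim follows from dividing the congruence by gcd(8, 2^32) = 8; a quick Python check of the closed form:

```python
# c * 8 ≡ -40 (mod 2^32) iff c ≡ -5 (mod 2^29), so the only solution in
# [0, 2^29) is 2^29 - 5 = 536870907; other solutions differ by 2^29.
M32 = 1 << 32
c = (1 << 29) - 5
assert c == 536870907
assert (c * 8) % M32 == (-40) % M32
assert all(((c + k * (1 << 29)) * 8) % M32 == (-40) % M32 for k in range(8))
```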
Correlation is perfect
Across 10+ test variants, every kernel whose IR contains mul i32 X, 536870907 fails, and every kernel without it passes. The correlation is 100%.
Conclusion
The Metal GPU compiler’s runtime JIT (the stage that converts AIR/LLVM bitcode to AGX machine instructions) miscompiles the pattern:
%a = mul i32 %x, 536870907 ; inside conditional branch
%b = add i32 %a, %y
%c = shl i32 %b, 3
%d = sext i32 %c to i64
%e = getelementptr inbounds i8, i8 addrspace(1)* %ptr, i64 %d
%f = load i8, i8 addrspace(1)* %e
The load returns 0 instead of the correct value for inputs where the quotient is > 0, i.e. whenever the intermediate mul/add/shl chain wraps in 32-bit arithmetic. The LLVM IR is provably correct; the bug is in the backend code generation.
Workaround
Ensure the fast-division remainder is used for both the branch condition and the load address. This forces the LLVM optimizer to compute the remainder before the branch using mul i32 X, -5 (no fusion with the shift), which the GPU backend handles correctly.
Attachments
- `buggy.metal` — Kernel A source (reproduces the bug)
- `working.metal` — Kernel B source (correct behavior)
- `reproducer.m` — Self-contained Objective-C command-line program that compiles both kernels, runs them, and shows the discrepancy