🐛 Describe the bug
When a CoreML model is cached to disk, subsequent runs can produce corrupted outputs. The corrupted output can be fixed by clearing the model cache manually and re-compiling.
The bug appears to be model-specific: so far I have only observed it with stories110M, and I do not see it on Llama1B (same architecture). The corruption manifests differently depending on the compute units:
- CPU_ONLY: Produces <unk> tokens (completely invalid logits)
- CPU_AND_NE: Produces nonsensical but consistent text (partially corrupted)
Guess at root cause: a CoreML framework bug where certain models loaded from the cache have corrupted output buffers.
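If this is a load-time issue rather than on-disk corruption, the cached artifacts should be byte-identical between the good first run and the bad second run. A rough, hypothetical check (cache path taken from the repro below; shasum assumed available on macOS):
# Hash the cached CoreML artifacts after the first (good) run
CACHE_DIR=~/Library/Caches/executorchcoreml/models
find "$CACHE_DIR" -type f -exec shasum -a 256 {} + | sort > /tmp/cache_after_run1.txt
# ...run the model a second time (see Step 3 below), then hash again and compare
find "$CACHE_DIR" -type f -exec shasum -a 256 {} + | sort > /tmp/cache_after_run2.txt
diff /tmp/cache_after_run1.txt /tmp/cache_after_run2.txt && echo "cache contents unchanged between runs"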
Repro
Environment
- macOS 26.2 (Apple Silicon)
- ExecuTorch with the CoreML backend
Prerequisites
- Check out the PR "Add C++ static runner for CoreML" (#16463), ensure you're in the executorch repo root, and set up ExecuTorch:
cd /path/to/executorch
python install_executorch.py --editable
Step 1: Build the Runner
# Clean and build
rm -rf cmake-out
cmake -S . -B cmake-out \
-DCMAKE_BUILD_TYPE=Release \
-DEXECUTORCH_ENABLE_LOGGING=ON \
-DEXECUTORCH_BUILD_EXTENSION_LLM=ON \
-DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON \
-DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
-DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
-DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
-DEXECUTORCH_BUILD_COREML=ON \
-G Ninja
cmake --build cmake-out -j --target run_static_llm_coreml
Step 2: Download Model Artifacts and Export
cd examples/apple/coreml/llama
# Download stories110M model artifacts
curl -Ls https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt --output stories110M.pt
curl -Ls https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model --output tokenizer.model
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
# Export with CPU_ONLY (triggers severe bug - <unk> tokens)
python export_static_llm_coreml.py --checkpoint stories110M.pt --params params.json --output model_cpu.pte --cpu_only
# Export with CPU_AND_NE (triggers mild bug - nonsensical text)
python export_static_llm_coreml.py --checkpoint stories110M.pt --params params.json --output model_ane.pte
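Optionally, confirm both exports produced a .pte file before moving on (filenames as passed to --output above):
# Sanity check: both exported models should now exist in this directory
ls -lh model_cpu.pte model_ane.pte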
cd ../../../..  # back to repo root
Step 3: Reproduce the CPU_ONLY Bug
# Clear the CoreML cache
rm -rf ~/Library/Caches/executorchcoreml/models/*
# First run - WORKS (compiles and caches the model)
./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml \
--model examples/apple/coreml/llama/model_cpu.pte \
--params examples/apple/coreml/llama/params.json \
--tokenizer examples/apple/coreml/llama/tokenizer.model \
--prompt "Once upon a time," \
--max_new_tokens 20
# Second run - FAILS (loads from cache, produces <unk> tokens)
./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml \
--model examples/apple/coreml/llama/model_cpu.pte \
--params examples/apple/coreml/llama/params.json \
--tokenizer examples/apple/coreml/llama/tokenizer.model \
--prompt "Once upon a time," \
--max_new_tokens 20
Actual Output (CPU_ONLY)
First Run (cache miss):
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine and
Second Run (cache hit):
Once upon a time,<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
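For convenience, Step 3 can be scripted so the cold-cache and cache-hit outputs are captured and compared automatically. A rough sketch, assuming the runner writes the generated text to stdout (log lines may add noise to the diff):
set -euo pipefail
RUNNER=./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml
ARGS=(--model examples/apple/coreml/llama/model_cpu.pte
      --params examples/apple/coreml/llama/params.json
      --tokenizer examples/apple/coreml/llama/tokenizer.model
      --prompt "Once upon a time," --max_new_tokens 20)
rm -rf ~/Library/Caches/executorchcoreml/models/*   # start from a cold cache
"$RUNNER" "${ARGS[@]}" > /tmp/run1.txt               # first run: compiles and caches the model
"$RUNNER" "${ARGS[@]}" > /tmp/run2.txt               # second run: loads the cached model
diff /tmp/run1.txt /tmp/run2.txt || echo "Outputs differ: the cache-hit run is corrupted"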
Step 4: Reproduce the CPU_AND_NE Bug
# Clear the CoreML cache
rm -rf ~/Library/Caches/executorchcoreml/models/*
# First run - WORKS
./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml \
--model examples/apple/coreml/llama/model_ane.pte \
--params examples/apple/coreml/llama/params.json \
--tokenizer examples/apple/coreml/llama/tokenizer.model \
--prompt "Once upon a time," \
--max_new_tokens 20
# Second run - BUGGY (nonsensical but consistent output)
./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml \
--model examples/apple/coreml/llama/model_ane.pte \
--params examples/apple/coreml/llama/params.json \
--tokenizer examples/apple/coreml/llama/tokenizer.model \
--prompt "Once upon a time," \
--max_new_tokens 20
# Third run - Same buggy output (deterministic corruption)
./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml \
--model examples/apple/coreml/llama/model_ane.pte \
--params examples/apple/coreml/llama/params.json \
--tokenizer examples/apple/coreml/llama/tokenizer.model \
--prompt "Once upon a time," \
--max_new_tokens 20
Actual Output (CPU_AND_NE)
First Run (cache miss):
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine and
Second Run (cache hit):
Once upon a time, heal, heal, hiss name, named Timmy, named Samantha toad
Third Run (cache hit):
Once upon a time, heal, heal, hiss name, named Timmy, named Samantha toad
Note: The corrupted output is deterministic/consistent across subsequent cache hits.
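The determinism can be double-checked by repeating the cache-hit run a few times and hashing the output; identical hashes mean the corrupted text is stable across runs. A small sketch (paths as in Step 4, generated text assumed on stdout):
# Repeat the cache-hit run three times; identical hashes confirm deterministic corruption
for i in 1 2 3; do
  ./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml \
    --model examples/apple/coreml/llama/model_ane.pte \
    --params examples/apple/coreml/llama/params.json \
    --tokenizer examples/apple/coreml/llama/tokenizer.model \
    --prompt "Once upon a time," \
    --max_new_tokens 20 | shasum -a 256
done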
Models Tested
| Model | Compute Units | First Run | Cache Hit Runs |
|---|---|---|---|
| stories110M | CPU_ONLY | ✅ Valid | ❌ <unk> tokens |
| stories110M | CPU_AND_NE | ✅ Valid | ❌ Nonsensical (but consistent) text |
| llama1b | CPU_AND_NE | ✅ Valid | ✅ Valid |
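As a temporary workaround (per the description above), clearing the CoreML cache before each run forces a fresh compile and restores valid output, at the cost of recompiling every time:
# Workaround: force recompilation by clearing the CoreML cache before every run
rm -rf ~/Library/Caches/executorchcoreml/models/*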
Versions
Collecting environment information...
PyTorch version: 2.11.0.dev20251222
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 26.2 (arm64)
GCC version: Could not collect
Clang version: 17.0.0 (clang-1700.3.19.1)
CMake version: version 3.31.6
Libc version: N/A
Python version: 3.10.19 (main, Oct 21 2025, 16:37:10) [Clang 20.1.8 ] (64-bit runtime)
Python platform: macOS-26.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A
CPU:
Apple M1 Pro
Versions of relevant libraries:
[pip3] executorch==1.1.0a0+aa58075
[pip3] flake8==6.1.0
[pip3] flake8-breakpoint==1.1.0
[pip3] flake8-bugbear==24.4.26
[pip3] flake8-comprehensions==3.14.0
[pip3] flake8-plugin-utils==1.3.3
[pip3] flake8-pyi==23.5.0
[pip3] mypy==1.14.1
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] pytorch_tokenizers==1.0.1
[pip3] torch==2.11.0.dev20251222
[pip3] torchao==0.16.0+git08e5e203f
[pip3] torchaudio==2.10.0.dev20251222
[pip3] torchdata==0.11.0
[pip3] torchsr==1.0.4
[pip3] torchtune==0.6.1
[pip3] torchvision==0.25.0.dev20251222
[conda] executorch 1.1.0a0+aa58075 pypi_0 pypi
[conda] numpy 2.2.6 pypi_0 pypi
[conda] pytorch-tokenizers 1.0.1 pypi_0 pypi
[conda] torch 2.11.0.dev20251222 pypi_0 pypi
[conda] torchao 0.16.0+git08e5e203f pypi_0 pypi
[conda] torchaudio 2.10.0.dev20251222 pypi_0 pypi
[conda] torchdata 0.11.0 pypi_0 pypi
[conda] torchfix 0.6.0 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchtune 0.6.1 pypi_0 pypi
[conda] torchvision 0.25.0.dev20251222 pypi_0 pypi