🐛 Describe the bug
When a CoreML model is cached to disk, subsequent runs can produce corrupted outputs. The corrupted output can be fixed by clearing the model cache manually and re-compiling.
The bug appears to be model-specific: so far I have only observed it with stories110M, and I do not see it on Llama1B (same architecture). The corruption manifests differently depending on the compute units:
- CPU_ONLY: Produces <unk> tokens (completely invalid logits)
- CPU_AND_NE: Produces nonsensical but consistent text (partially corrupted)
Guess at root cause: a CoreML framework bug where certain models loaded from the cache have corrupted output buffers.
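If this is a load-time issue rather than on-disk corruption, the cached artifacts should be byte-identical between the good first run and the bad second run. A rough, hypothetical check (cache path taken from the repro below; shasum assumed available on macOS):
# Hash the cached CoreML artifacts after the first (good) run
CACHE_DIR=~/Library/Caches/executorchcoreml/models
find "$CACHE_DIR" -type f -exec shasum -a 256 {} + | sort > /tmp/cache_after_run1.txt
# ...run the model a second time (see Step 3 below), then hash again and compare
find "$CACHE_DIR" -type f -exec shasum -a 256 {} + | sort > /tmp/cache_after_run2.txt
diff /tmp/cache_after_run1.txt /tmp/cache_after_run2.txt && echo "cache contents unchanged between runs"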
Repro
Environment
- macOS 26.2 (Apple Silicon)
- ExecuTorch with the CoreML backend
Prerequisites
- Check out the PR "Add C++ static runner for CoreML" (#16463), ensure you're in the executorch repo root, and set up ExecuTorch:
cd /path/to/executorch
python install_executorch.py --editable
Step 1: Build the Runner
# Clean and build
rm -rf cmake-out
cmake -S . -B cmake-out \
-DCMAKE_BUILD_TYPE=Release \
-DEXECUTORCH_ENABLE_LOGGING=ON \
-DEXECUTORCH_BUILD_EXTENSION_LLM=ON \
-DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON \
-DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
-DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
-DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
-DEXECUTORCH_BUILD_COREML=ON \
-G Ninja
cmake --build cmake-out -j --target run_static_llm_coreml
Step 2: Download Model Artifacts and Export
cd examples/apple/coreml/llama
# Download stories110M model artifacts
curl -Ls https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt --output stories110M.pt
curl -Ls https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model --output tokenizer.model
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
# Export with CPU_ONLY (triggers severe bug - <unk> tokens)
python export_static_llm_coreml.py --checkpoint stories110M.pt --params params.json --output model_cpu.pte --cpu_only
# Export with CPU_AND_NE (triggers mild bug - nonsensical text)
python export_static_llm_coreml.py --checkpoint stories110M.pt --params params.json --output model_ane.pte
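Optionally, confirm both exports produced a .pte file before moving on (filenames as passed to --output above):
# Sanity check: both exported models should now exist in this directory
ls -lh model_cpu.pte model_ane.pte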
cd ../../../..  # back to repo root
Step 3: Reproduce the CPU_ONLY Bug
# Clear the CoreML cache
rm -rf ~/Library/Caches/executorchcoreml/models/*
# First run - WORKS (compiles and caches the model)
./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml \
--model examples/apple/coreml/llama/model_cpu.pte \
--params examples/apple/coreml/llama/params.json \
--tokenizer examples/apple/coreml/llama/tokenizer.model \
--prompt "Once upon a time," \
--max_new_tokens 20
# Second run - FAILS (loads from cache, produces <unk> tokens)
./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml \
--model examples/apple/coreml/llama/model_cpu.pte \
--params examples/apple/coreml/llama/params.json \
--tokenizer examples/apple/coreml/llama/tokenizer.model \
--prompt "Once upon a time," \
--max_new_tokens 20
Actual Output (CPU_ONLY)
First Run (cache miss):
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine and
Second Run (cache hit):
Once upon a time,<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
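For convenience, Step 3 can be scripted so the cold-cache and cache-hit outputs are captured and compared automatically. A rough sketch, assuming the runner writes the generated text to stdout (log lines may add noise to the diff):
set -euo pipefail
RUNNER=./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml
ARGS=(--model examples/apple/coreml/llama/model_cpu.pte
      --params examples/apple/coreml/llama/params.json
      --tokenizer examples/apple/coreml/llama/tokenizer.model
      --prompt "Once upon a time," --max_new_tokens 20)
rm -rf ~/Library/Caches/executorchcoreml/models/*   # start from a cold cache
"$RUNNER" "${ARGS[@]}" > /tmp/run1.txt               # first run: compiles and caches the model
"$RUNNER" "${ARGS[@]}" > /tmp/run2.txt               # second run: loads the cached model
diff /tmp/run1.txt /tmp/run2.txt || echo "Outputs differ: the cache-hit run is corrupted"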
Step 4: Reproduce the CPU_AND_NE Bug
# Clear the CoreML cache
rm -rf ~/Library/Caches/executorchcoreml/models/*
# First run - WORKS
./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml \
--model examples/apple/coreml/llama/model_ane.pte \
--params examples/apple/coreml/llama/params.json \
--tokenizer examples/apple/coreml/llama/tokenizer.model \
--prompt "Once upon a time," \
--max_new_tokens 20
# Second run - BUGGY (nonsensical but consistent output)
./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml \
--model examples/apple/coreml/llama/model_ane.pte \
--params examples/apple/coreml/llama/params.json \
--tokenizer examples/apple/coreml/llama/tokenizer.model \
--prompt "Once upon a time," \
--max_new_tokens 20
# Third run - Same buggy output (deterministic corruption)
./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml \
--model examples/apple/coreml/llama/model_ane.pte \
--params examples/apple/coreml/llama/params.json \
--tokenizer examples/apple/coreml/llama/tokenizer.model \
--prompt "Once upon a time," \
--max_new_tokens 20
Actual Output (CPU_AND_NE)
First Run (cache miss):
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine and
Second Run (cache hit):
Once upon a time, heal, heal, hiss name, named Timmy, named Samantha toad
Third Run (cache hit):
Once upon a time, heal, heal, hiss name, named Timmy, named Samantha toad
Note: The corrupted output is deterministic/consistent across subsequent cache hits.
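The determinism can be double-checked by repeating the cache-hit run a few times and hashing the output; identical hashes mean the corrupted text is stable across runs. A small sketch (paths as in Step 4, generated text assumed on stdout):
# Repeat the cache-hit run three times; identical hashes confirm deterministic corruption
for i in 1 2 3; do
  ./cmake-out/examples/apple/coreml/llama/runner/run_static_llm_coreml \
    --model examples/apple/coreml/llama/model_ane.pte \
    --params examples/apple/coreml/llama/params.json \
    --tokenizer examples/apple/coreml/llama/tokenizer.model \
    --prompt "Once upon a time," \
    --max_new_tokens 20 | shasum -a 256
done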
Models Tested
| Model | Compute Units | First Run | Cache Hit Runs |
|---|---|---|---|
| stories110M | CPU_ONLY | ✅ Valid | ❌ <unk> tokens |
| stories110M | CPU_AND_NE | ✅ Valid | ❌ Nonsensical (but consistent) text |
| llama1b | CPU_AND_NE | ✅ Valid | ✅ Valid |
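As a temporary workaround (per the description above), clearing the CoreML cache before each run forces a fresh compile and restores valid output, at the cost of recompiling every time:
# Workaround: force recompilation by clearing the CoreML cache before every run
rm -rf ~/Library/Caches/executorchcoreml/models/*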
Versions
Collecting environment information...
PyTorch version: 2.11.0.dev20251222
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 26.2 (arm64)
GCC version: Could not collect
Clang version: 17.0.0 (clang-1700.3.19.1)
CMake version: version 3.31.6
Libc version: N/A
Python version: 3.10.19 (main, Oct 21 2025, 16:37:10) [Clang 20.1.8 ] (64-bit runtime)
Python platform: macOS-26.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A
CPU:
Apple M1 Pro
Versions of relevant libraries:
[pip3] executorch==1.1.0a0+aa58075
[pip3] flake8==6.1.0
[pip3] flake8-breakpoint==1.1.0
[pip3] flake8-bugbear==24.4.26
[pip3] flake8-comprehensions==3.14.0
[pip3] flake8-plugin-utils==1.3.3
[pip3] flake8-pyi==23.5.0
[pip3] mypy==1.14.1
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] pytorch_tokenizers==1.0.1
[pip3] torch==2.11.0.dev20251222
[pip3] torchao==0.16.0+git08e5e203f
[pip3] torchaudio==2.10.0.dev20251222
[pip3] torchdata==0.11.0
[pip3] torchsr==1.0.4
[pip3] torchtune==0.6.1
[pip3] torchvision==0.25.0.dev20251222
[conda] executorch 1.1.0a0+aa58075 pypi_0 pypi
[conda] numpy 2.2.6 pypi_0 pypi
[conda] pytorch-tokenizers 1.0.1 pypi_0 pypi
[conda] torch 2.11.0.dev20251222 pypi_0 pypi
[conda] torchao 0.16.0+git08e5e203f pypi_0 pypi
[conda] torchaudio 2.10.0.dev20251222 pypi_0 pypi
[conda] torchdata 0.11.0 pypi_0 pypi
[conda] torchfix 0.6.0 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchtune 0.6.1 pypi_0 pypi
[conda] torchvision 0.25.0.dev20251222 pypi_0 pypi