[do not land] multifunction experiments #16514
# CoreML Multifunction Model Experiment
This PR adds tooling to create and benchmark CoreML multifunction models that combine prefill and decode functions into a single model package.
## Overview

CoreML multifunction models allow multiple functions (e.g., prefill and decode) to share weights within a single model package. This experiment evaluates the memory and runtime behavior of such combined packages.
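As a toy illustration of the weight-sharing idea described above (plain Python, not CoreML; all names are illustrative): two "functions" with different sequence lengths operate over one shared weight matrix, just as a multifunction package exposes prefill and decode over the same weights.

```python
# One shared weight matrix (a 4x4 identity, as toy weights).
SHARED_W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def matvec(w, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

def prefill(tokens):
    # Processes a whole prompt (sequence length > 1) in one call.
    return [matvec(SHARED_W, t) for t in tokens]

def decode(token):
    # Processes a single token (sequence length 1) per call.
    return matvec(SHARED_W, token)

prompt = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
states = prefill(prompt)          # one call over the whole prompt
next_state = decode(states[-1])   # then one token at a time
print(len(states), next_state)    # -> 2 [0.0, 1.0, 0.0, 0.0]
```

Both functions close over the same `SHARED_W`, so its storage is paid for once; in a multifunction `.mlpackage` the analogous sharing happens at the weight-file level.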
## Step 1: Export Static Models

First, export two PTE files with different sequence lengths using `export_static_llm_coreml.py`: one with the prefill sequence length (`model_32.pte` below) and one with sequence length 1 for decode (`model_1.pte` below).

## Step 2: Create Multifunction Models
Use `create_multifunctions.py` to combine the prefill and decode models:

```shell
python create_multifunctions.py \
  --prefill_model $HOME/Desktop/model_32.pte \
  --decode_model $HOME/Desktop/model_1.pte \
  --output_dir $HOME/Desktop/mods
```

This will produce three model packages: `mod1.mlpackage`, `mod2.mlpackage`, and `mod3.mlpackage`.

### Optional: Pre-compile Models
Add the `--compile` flag to pre-compile the models to `.mlmodelc` format:

```shell
python create_multifunctions.py \
  --prefill_model $HOME/Desktop/model_32.pte \
  --decode_model $HOME/Desktop/model_1.pte \
  --output_dir $HOME/Desktop/mods \
  --compile
```

This outputs `mod1.mlmodelc`, `mod2.mlmodelc`, and `mod3.mlmodelc` instead. Pre-compiled models skip the compilation step at runtime.

## Step 3: Benchmark with CoreML Test
Copy the exported `.mlpackage` or `.mlmodelc` files into the `Resources` folder of the benchmark app at `extension/benchmark/apple/Benchmark`.

### Configuring the Benchmark
Edit `CoreMLTests.mm` to configure the benchmark behavior.

### Benchmark Output
The benchmark runs the prefill function and, when `kEnableDecode = YES`, the decode function.
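The shape of the measurement loop can be sketched in plain Python (the model calls are stubs, and `ENABLE_DECODE` mirrors the `kEnableDecode` toggle; everything else here is illustrative, not the actual `CoreMLTests.mm` code):

```python
import time

ENABLE_DECODE = True  # mirrors kEnableDecode in CoreMLTests.mm

def run_prefill():
    time.sleep(0.001)   # stand-in for one CoreML prediction over the full prompt

def run_decode():
    time.sleep(0.0005)  # stand-in for one single-token CoreML prediction

def benchmark(num_decode_tokens=8):
    # Time prefill once.
    t0 = time.perf_counter()
    run_prefill()
    prefill_ms = (time.perf_counter() - t0) * 1000.0

    # Time decode per token, if enabled.
    decode_ms = 0.0
    if ENABLE_DECODE:
        t0 = time.perf_counter()
        for _ in range(num_decode_tokens):
            run_decode()
        decode_ms = (time.perf_counter() - t0) * 1000.0 / num_decode_tokens
    return prefill_ms, decode_ms

prefill_ms, decode_ms = benchmark()
print(f"prefill: {prefill_ms:.2f} ms, decode: {decode_ms:.2f} ms/token")
```

Prefill is timed as a single call while decode is averaged per token, since a real generation run pays the decode cost once per generated token.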
## Observations

### Memory Usage
Multifunction models do not appear to use significantly more memory than individual models. The weights are shared between the prefill and decode functions, so memory overhead is minimal.
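A back-of-envelope sketch of why sharing keeps the overhead minimal (all sizes here are hypothetical, for intuition only):

```python
# Hypothetical sizes: a model whose weights occupy 1 GiB, plus a small
# per-function overhead (compiled plan, activation buffers, etc.).
WEIGHTS_MIB = 1024
PER_FUNCTION_MIB = 16

# Two standalone models duplicate the weights...
separate = 2 * (WEIGHTS_MIB + PER_FUNCTION_MIB)
# ...while a multifunction package pays for them once.
multifunction = WEIGHTS_MIB + 2 * PER_FUNCTION_MIB

print(separate, multifunction)  # -> 2080 1056
```

With weights dominating, the multifunction package costs little more than a single model, which matches the observation above.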
### Model Piece Memory

The embedding piece (`mod1`) uses significantly more memory than the other pieces. This can be observed by toggling `kEnableMod1 = NO` and comparing memory usage. This suggests the embedding table is a major contributor to the overall memory footprint.
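For intuition on why an embedding piece can dominate: its size is roughly `vocab_size * hidden_dim * bytes_per_weight`. The numbers below are illustrative (a Llama-style configuration), not measurements from this PR:

```python
vocab_size = 32_000     # illustrative vocabulary size
hidden_dim = 4096       # illustrative hidden dimension
bytes_per_weight = 2    # fp16

embedding_bytes = vocab_size * hidden_dim * bytes_per_weight
print(embedding_bytes / (1024 ** 2), "MiB")  # -> 250.0 MiB
```

A quarter gigabyte for a single table, before any transformer layers, is consistent with the embedding piece standing out in per-piece memory comparisons.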