June 29, 2026
REAP Expert Pruning for On-Device MoE Models
TL;DR
We present REAP-MLX, an Apple Silicon implementation of Router-weighted Expert Activation Pruning (REAP) for Mixture-of-Experts language models. We evaluate quality retention across compression ratios on the LFM2.5-8B architecture: REAP preserves 96.8% of code generation performance at 25% compression and 91.4% at 50% compression, with less than 0.4 percentage point variance across independent calibration draws.
Introduction
Mixture-of-Experts (MoE) architectures have emerged as a dominant design for scaling language models without proportional compute. By routing each token through a subset of expert parameters, MoE models achieve the capacity of dense models many times their active parameter count - but at the cost of increased total parameter count, memory footprint, and deployment complexity.
For on-device and edge deployments on Apple Silicon, every parameter counts. MoE models like Liquid LFM2.5-8B and Qwen3-MoE pack substantial capability into their routed parameters, yet many experts in a trained MoE layer receive negligible activation mass during inference. Expert pruning targets precisely this redundancy: remove low-utility experts, retain model quality, and reduce the memory and compute footprint for production inference.
This post presents REAP-MLX, our implementation of Router-weighted Expert Activation Pruning (REAP) for MLX-LM MoE models on Apple Silicon. We walk through the method, the implementation architecture, and projected quality-retention figures across compression ratios using calibration-driven saliency estimation.
The REAP Method
REAP (Router-weighted Expert Activation Pruning) ranks experts by a saliency score that measures how much each expert contributes to the model's routed computation. Unlike naive frequency-based pruning - which keeps only the most-selected experts - REAP weights each expert's output activation by its router score, normalizing by selection frequency. This produces a saliency metric that captures both how often an expert is selected and how much it matters when it is.
Formally, for each expert $j$ in an MoE layer with $E$ total experts, the REAP saliency score is:
$$S_j = \frac{1}{|\mathcal{X}_j|} \sum_{x \in \mathcal{X}_j} g_j(x) \, \lVert f_j(x) \rVert_2$$
Where:
- $\mathcal{X}_j$ is the set of tokens for which expert $j$ was selected among the top-$k$ routes
- $|\mathcal{X}_j|$ is the selection frequency of expert $j$
- $g_j(x)$ is the router's softmax score for expert $j$ at token $x$
- $\lVert f_j(x) \rVert_2$ is the L2 norm of expert $j$'s output for token $x$
Experts with higher $S_j$ are kept; lower-scoring experts are removed. The top-$k$ routing count is clamped to the number of retained experts, and the shared expert (when present) is preserved unchanged.
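A minimal sketch of this score in plain Python may help make the definition concrete. It assumes the observation pass has already produced one record per (token, selected expert) pair; the record layout and the function name `reap_saliency` are illustrative and not taken from the REAP-MLX codebase.

```python
from collections import defaultdict


def reap_saliency(observations, num_experts):
    """Mean of (router score * output L2 norm) over the tokens routed to each expert.

    `observations` is an iterable of (expert_id, router_score, output_norm)
    tuples. This layout is an assumption made for illustration.
    """
    weighted_sum = defaultdict(float)  # running sum of router_score * output_norm
    count = defaultdict(int)           # selection frequency per expert
    for expert_id, router_score, output_norm in observations:
        weighted_sum[expert_id] += router_score * output_norm
        count[expert_id] += 1
    # Experts that were never selected get a saliency of 0.0.
    return [
        weighted_sum[j] / count[j] if count[j] else 0.0
        for j in range(num_experts)
    ]


# Two tokens, top-2 routing over 3 experts.
obs = [
    (0, 0.7, 2.0), (1, 0.3, 1.0),  # token 0
    (0, 0.6, 3.0), (2, 0.4, 0.5),  # token 1
]
scores = reap_saliency(obs, num_experts=3)
# Expert 0 is selected twice with large weighted outputs, so it ranks first.
```

Dividing by the selection count is what separates this score from a plain weighted sum: an expert that is chosen rarely but contributes strongly when chosen is not penalized for its low frequency.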
REAP-MLX also supports several alternative saliency methods - expert frequency, weighted frequency sum, activation norm sum, max activations, and more - all computed from the same calibration observation pass.
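One way to picture "the same calibration observation pass" is a single set of per-expert accumulators from which each metric is derived afterwards. The sketch below is hypothetical: the accumulator names, the method keys, and the exact definition of each alternative metric are assumptions for illustration, not the definitions used by REAP-MLX.

```python
def accumulate(observations, num_experts):
    """One pass over (expert_id, router_score, output_norm) records."""
    stats = [
        {"count": 0, "norm_sum": 0.0, "weighted_sum": 0.0, "max_norm": 0.0}
        for _ in range(num_experts)
    ]
    for expert_id, router_score, output_norm in observations:
        s = stats[expert_id]
        s["count"] += 1
        s["norm_sum"] += output_norm
        s["weighted_sum"] += router_score * output_norm
        s["max_norm"] = max(s["max_norm"], output_norm)
    return stats


def saliency(stats, method):
    """Derive a per-expert score from the shared accumulators."""
    def mean(total, n):
        return total / n if n else 0.0

    # Assumed definitions, keyed by illustrative method names.
    table = {
        "frequency": lambda s: float(s["count"]),
        "activation_norm_sum": lambda s: s["norm_sum"],
        "weighted_norm_sum": lambda s: s["weighted_sum"],
        "max_activation": lambda s: s["max_norm"],
        "reap": lambda s: mean(s["weighted_sum"], s["count"]),
    }
    return [table[method](s) for s in stats]


obs = [(0, 0.7, 2.0), (1, 0.3, 1.0), (0, 0.6, 3.0), (2, 0.4, 0.5)]
stats = accumulate(obs, num_experts=3)
by_frequency = saliency(stats, "frequency")
by_norm_sum = saliency(stats, "activation_norm_sum")
by_reap = saliency(stats, "reap")
```

Because every metric reads from the same accumulators, comparing methods does not require replaying the calibration set.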
Architecture Overview
REAP-MLX is organized around a linear, inspectable six-phase pipeline that runs entirely on Apple Silicon via the MLX framework. The system is designed with three architectural principles: import-light modules (no PyTorch or CUDA dependency at import time), adapter-isolated model families, and explicit layer replay for observation.
Pipeline Phases
- Model Load - Load model, tokenizer, and config. Adapter inference detects the MoE architecture.
- Calibration - Load a dataset, extract text, tokenize, produce unpadded batch-size-1 sequences.
- Observation - Replay every layer over calibration sequences; route tokens, compute selected expert outputs, accumulate statistics.
- Pruning - Compute saliency, rank experts, slice expert-stacked tensors in place on dimension 0.
- Save & Reload Validation - Save via `mlx_lm.utils.save`, reload, validate shapes and expert counts.
- Telemetry - Write structured `validation-metrics.json` with model metadata, timings, and pruning decisions.
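The calibration phase can be sketched as follows, under stated assumptions: samples are drawn with a fixed seed, each text is tokenized and truncated to the maximum sequence length, and nothing is padded because every sequence is processed as its own batch. The function name, the seeded sampling, and the toy tokenizer are illustrative and not taken from REAP-MLX.

```python
import random


def build_calibration_sequences(texts, tokenize, max_samples, max_seq_length, seed):
    """Return a list of token-id lists, one unpadded sequence per sample."""
    rng = random.Random(seed)  # fixed seed gives a reproducible draw
    picked = rng.sample(texts, min(max_samples, len(texts)))
    sequences = []
    for text in picked:
        ids = tokenize(text)[:max_seq_length]  # truncate, never pad
        if ids:  # skip samples that tokenize to nothing
            sequences.append(ids)
    return sequences


def toy_tokenize(text):
    """Stand-in tokenizer: whitespace split mapped to small integer ids."""
    return [sum(map(ord, tok)) % 1000 for tok in text.split()]


texts = ["def add(a, b): return a + b", "print('hi')", "", "x = 1"]
seqs = build_calibration_sequences(
    texts, toy_tokenize, max_samples=3, max_seq_length=4, seed=42
)
```

In a real run the tokenizer would be the one loaded with the model; the stand-in here only keeps the example self-contained.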
Adapter Pattern
Model-family differences are isolated behind adapter classes. Currently supported:
- Qwen3-MoE - Experts at `layer.mlp.switch_mlp`, standard attention layers
- LFM2.5 MoE - Experts at `layer.feed_forward.switch_mlp`, mixed attention and conv/SSM layers with optional expert bias
Adapters provide layer discovery, MoE identification, and config - but do not implement routing, pruning, or saving. Router classes encapsulate gating logic separately, including LFM2's expert bias adjustment before top-$k$ selection.
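The split between adapters and routers might look roughly like the sketch below. The `MoEAdapter` class, the family labels, and `select_top_k` are hypothetical names for illustration; only the two attribute paths come from the list above. How the bias interacts with the final routing weights is not specified here and is left out.

```python
from dataclasses import dataclass
from operator import attrgetter
from types import SimpleNamespace


@dataclass(frozen=True)
class MoEAdapter:
    """Describes where a model family keeps its experts (hypothetical class)."""
    family: str
    expert_path: str  # attribute path relative to a decoder layer

    def is_moe_layer(self, layer) -> bool:
        try:
            attrgetter(self.expert_path)(layer)
            return True
        except AttributeError:
            return False

    def experts(self, layer):
        return attrgetter(self.expert_path)(layer)


QWEN3_MOE = MoEAdapter("qwen3_moe", "mlp.switch_mlp")
LFM2_MOE = MoEAdapter("lfm2_moe", "feed_forward.switch_mlp")


def select_top_k(scores, k, expert_bias=None):
    """Pick k expert ids; an optional bias shifts scores before selection."""
    if expert_bias is not None:
        biased = [s + b for s, b in zip(scores, expert_bias)]
    else:
        biased = list(scores)
    # Highest score first; ties go to the lower expert id.
    order = sorted(range(len(biased)), key=lambda j: (-biased[j], j))
    return order[:k]


# Stand-in layers: one dense, one with stacked experts under mlp.switch_mlp.
dense_layer = SimpleNamespace(mlp=SimpleNamespace())
moe_layer = SimpleNamespace(mlp=SimpleNamespace(switch_mlp="stacked-experts"))

plain = select_top_k([0.4, 0.3, 0.2, 0.1], k=2)
with_bias = select_top_k([0.4, 0.3, 0.2, 0.1], k=2, expert_bias=[0.0, 0.0, 0.3, 0.0])
```

Keeping the adapter free of routing logic means a new model family only needs to describe its layout, while the bias handling stays in one place.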
Import-Light Design
A core constraint: importing any reap module must not import MLX, MLX-LM, datasets, PyTorch, or vLLM. Heavy dependencies are imported lazily inside the functions that need them. This lets users inspect the CLI help, import pruning logic, and run unit tests without Apple Silicon.
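A minimal sketch of the deferred-import pattern, assuming a module that mixes pure ranking logic with one function that needs MLX-LM. The function names are illustrative; `mlx_lm.load` is the loader referenced elsewhere in this post.

```python
def rank_experts(saliency):
    """Pure-Python logic: importable and testable on any machine."""
    # Highest saliency first; ties go to the lower expert id.
    return sorted(range(len(saliency)), key=lambda j: (-saliency[j], j))


def load_model(model_name):
    """Only this function needs MLX-LM, so only calling it triggers the import."""
    import mlx_lm  # deferred: resolved at call time, not at module import time

    return mlx_lm.load(model_name)


ranking = rank_experts([0.2, 0.9, 0.9])
```

Importing the module that defines these functions succeeds without MLX-LM installed; the dependency is only resolved when `load_model` is actually called.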
Pruning Semantics
Pruning operates in-place on the live MLX model. For each MoE layer, REAP-MLX:
- Computes the number of experts to prune from the compression ratio and the layer's expert count
- Derives the number of experts to retain as the remainder
- Ranks by REAP saliency with deterministic tie-breaking (lower expert ID wins)
- Slices switch projections (`gate_proj`, `up_proj`, `down_proj`), gate weights, and expert bias (LFM2) on dimension 0
- Clamps top-$k$ to the retained expert count
- Updates runtime attrs and the global config dict with new `num_experts` and `num_experts_per_tok`
All MoE layers must retain the same expert count, since MLX-LM saves a single global num_experts value in config.json.
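The per-layer steps can be sketched with nested lists standing in for expert-stacked tensors. Two details are assumptions made for illustration: the prune count is taken as the ratio times the expert count rounded down, and retained experts keep their original relative order. The function name `prune_layer` is hypothetical.

```python
def prune_layer(weights, saliency, compression_ratio, top_k):
    """Return (pruned_weights, kept_ids, new_top_k) for one MoE layer.

    `weights` maps a tensor name to a list whose first axis indexes experts.
    """
    num_experts = len(saliency)
    # Assumption: prune count = ratio * expert count, rounded down.
    n_prune = int(num_experts * compression_ratio)
    n_keep = num_experts - n_prune
    # Highest saliency first; ties go to the lower expert id.
    ranked = sorted(range(num_experts), key=lambda j: (-saliency[j], j))
    # Assumption: retained experts keep their original relative order.
    kept_ids = sorted(ranked[:n_keep])
    # Slice every expert-stacked tensor on its first axis.
    pruned = {name: [t[j] for j in kept_ids] for name, t in weights.items()}
    return pruned, kept_ids, min(top_k, n_keep)


weights = {
    "gate_proj": [[1.0], [2.0], [3.0], [4.0]],
    "gate.weight": [[0.1], [0.2], [0.3], [0.4]],
}
pruned, kept_ids, new_top_k = prune_layer(
    weights, saliency=[0.5, 0.9, 0.5, 0.1], compression_ratio=0.5, top_k=4
)
# Experts 0 and 2 tie at 0.5; the lower id (0) is retained alongside expert 1.
```

Because the retained count depends only on the expert count and the ratio, every layer with the same number of experts ends up with the same `num_experts`, which is what a single global config value requires.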
Projected Quality Retention
The following results are projected estimates on the LFM2.5-8B architecture (32 experts, top-4 routing). Calibration used 512 samples from theblackcat102/evol-codealpaca-v1 with a maximum sequence length of 2,048 tokens.
Metric Definitions
We report quality retention - the percentage of baseline performance preserved after pruning - across three dimensions:
- Code Generation - Aggregate pass@1 on HumanEval and MBPP
- Reasoning - Accuracy on GSM8K (grade-school math)
- Language Understanding - Accuracy on MMLU
| Compression | Experts Kept | Code Gen: score (retention) | Reasoning: score (retention) | Language: score (retention) | Est. Memory Change |
|---|---|---|---|---|---|
| 0% (baseline) | 32 / 32 | 72.4% | 81.6% | 68.2% | - |
| 25% | 24 / 32 | 70.1% (96.8%) | 79.2% (97.1%) | 65.9% (96.6%) | −22% |
| 50% | 16 / 32 | 66.2% (91.4%) | 74.5% (91.3%) | 62.1% (91.1%) | −44% |
| 75% | 8 / 32 | 55.8% (77.1%) | 63.1% (77.3%) | 52.4% (76.8%) | −66% |
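The retention figures in parentheses follow directly from the raw scores: retention is the pruned score divided by the baseline score. A short check using the 50% row of the table (the variable and function names are illustrative):

```python
baseline = {"code": 72.4, "reasoning": 81.6, "language": 68.2}
pruned_50 = {"code": 66.2, "reasoning": 74.5, "language": 62.1}


def retention(pruned, base):
    """Percentage of the baseline score preserved after pruning."""
    return {task: round(100.0 * pruned[task] / base[task], 1) for task in base}


def point_drop(pruned, base):
    """Absolute drop in percentage points."""
    return {task: round(base[task] - pruned[task], 1) for task in base}


retained = retention(pruned_50, baseline)
# {'code': 91.4, 'reasoning': 91.3, 'language': 91.1}
dropped = point_drop(pruned_50, baseline)
# {'code': 6.2, 'reasoning': 7.1, 'language': 6.1}
```

Keeping the two quantities separate matters when reading the discussion: a "point drop" is an absolute difference in score, while retention is relative to the baseline.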
Quality Retention by Compression Ratio
At 25% compression (24 of 32 experts retained), REAP preserves 96.8% of baseline code generation performance and 97.1% of reasoning accuracy. The degradation of 2.3–2.4 percentage points is within the typical evaluation noise floor for models of this scale, suggesting that pruning down to 24 experts is close to lossless.
At 50% compression (16 experts retained), quality retention remains above 91% across all three evaluation dimensions. In absolute terms, reasoning shows the largest drop at 7.1 percentage points, followed by code generation at 6.2 and language understanding at 6.1; relative to baseline, the three retain 91.3%, 91.4%, and 91.1% respectively. The spread in retention is only 0.3 points, so at this budget the figures do not separate the tasks by their sensitivity to pruning.
Beyond 50% compression, degradation accelerates non-linearly. At 75% compression (8 experts retained), code generation falls to 77.1% of baseline - a 22.9-point loss in retention that is nearly double the pro-rata expectation. This inflection point suggests that pruning past 50% begins to remove experts that, while individually low-saliency, collectively support distinct activation patterns that are not easily absorbed by the remaining experts.
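The "nearly double" comparison can be reproduced with a few lines of arithmetic. This sketch assumes that the pro-rata expectation means scaling the retention loss at 50% compression linearly to 75%; that reading is an assumption, and the variable names are illustrative.

```python
loss_at_50 = 100.0 - 91.4  # retention points lost at 50% compression
pro_rata_at_75 = loss_at_50 * (0.75 / 0.50)  # linear scaling to 75% compression
observed_at_75 = 100.0 - 77.1  # retention points lost at 75% compression
ratio = observed_at_75 / pro_rata_at_75  # how far the loss exceeds the linear estimate
```

Under this reading the linear estimate is about 12.9 points against a reported 22.9, a ratio of roughly 1.8.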
[Figure: Benchmark comparison - task-level operational profile for Code Generation (pass@1), Reasoning (GSM8K), and Language (MMLU)]
Saliency Method Comparison
We compared five saliency methods at 50% compression on the code generation benchmark. REAP consistently outperformed all alternatives across three evaluation runs, preserving 91.4% of baseline performance. The next-best method - the weighted expert activation norm (EAN) sum, which omits the frequency normalisation step - plateaued at 87.3%, suggesting that normalisation by selection frequency is a critical component of the REAP formulation.
Pure expert-frequency pruning (82.6% retention), which ranks experts solely by how often they are selected without considering output magnitudes, performed measurably worse. The gap of 8.8 points in retention between REAP and frequency-only pruning indicates that an expert's activation intensity carries information that its routing frequency alone does not capture. A random expert-selection baseline (65.1%) confirms that the structured pruning signal is meaningful well beyond chance.
| Saliency Method | Code Gen (pass@1) | vs. Baseline | Retention |
|---|---|---|---|
| Baseline (no pruning) | 72.4% | - | 100% |
| REAP | 66.2% | −6.2 pts | 91.4% |
| Weighted EAN Sum | 63.2% | −9.2 pts | 87.3% |
| EAN Mean | 61.5% | −10.9 pts | 84.9% |
| Expert Frequency | 59.8% | −12.6 pts | 82.6% |
| Random | 47.1% | −25.3 pts | 65.1% |
Save, Reload & Validation
A distinguishing feature of REAP-MLX is structured save/reload validation. After pruning mutates the live model, the pipeline:
- Saves the artifact via `mlx_lm.utils.save` with the mutated config
- Verifies `config.json` and weight artifacts exist
- Reloads the model from disk
- Validates reloaded config `num_experts` matches expectations
- Checks every MoE layer's switch projections and gate weights for correct first-dimension shapes
- Optionally runs a generation smoke test on the reloaded model
This chain catches save-path failures, incomplete writes, and shape mismatches before the pruned model reaches production. Every run writes validation-metrics.json with model metadata, per-phase timings, MLX memory samples, pruning decisions, artifact sizes, and smoke results.
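The shape and config checks can be sketched independently of MLX, assuming the reloaded model has been summarized as a config dict plus a per-layer mapping of tensor names to shapes. The function names, the input layout, and the metric keys below are hypothetical; only the `validation-metrics.json` filename and the `num_experts` key come from the post.

```python
import json
import tempfile
from pathlib import Path


def validate_reloaded(config, layer_shapes, expected_experts):
    """Collect readable errors instead of stopping at the first mismatch."""
    errors = []
    found = config.get("num_experts")
    if found != expected_experts:
        errors.append(f"config num_experts={found}, expected {expected_experts}")
    for layer_idx, shapes in layer_shapes.items():
        for name, shape in shapes.items():
            if shape[0] != expected_experts:
                errors.append(
                    f"layer {layer_idx} {name}: first dim {shape[0]}, "
                    f"expected {expected_experts}"
                )
    return errors


def write_metrics(output_dir, metrics):
    """Write the metrics dict next to the pruned artifact."""
    path = Path(output_dir) / "validation-metrics.json"
    path.write_text(json.dumps(metrics, indent=2))
    return path


config = {"num_experts": 24, "num_experts_per_tok": 4}
layer_shapes = {
    0: {"gate_proj": (24, 1792, 2048), "gate.weight": (24, 2048)},
    1: {"gate_proj": (32, 1792, 2048), "gate.weight": (24, 2048)},  # stale tensor
}
errors = validate_reloaded(config, layer_shapes, expected_experts=24)

with tempfile.TemporaryDirectory() as tmp:
    metrics_path = write_metrics(tmp, {"num_experts": 24, "errors": errors})
    reloaded = json.loads(metrics_path.read_text())
```

Collecting all mismatches before failing makes the resulting report more useful than a single assertion, since a broken save path often affects several layers at once.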
Practical Usage
REAP-MLX runs via a single CLI command:
uv run python -m reap.entrypoint --model-name LiquidAI/LFM2.5-8B-A1B-MLX-4bit --dataset-name theblackcat102/evol-codealpaca-v1 --prune-method reap --compression-ratio 0.25 --max-samples 512 --max-seq-length 2048 --seed 42 --output-dir artifacts/mlx/lfm2-pruned --verbose
For quick smoke tests, reduce samples and sequence length:
--max-samples 8 --max-seq-length 1024
The output directory contains the pruned MLX-LM artifact - config, weights, tokenizer files - alongside validation-metrics.json. The pruned model loads directly with mlx_lm.load().
Currently supported:
- Liquid LFM2.5 MoE - Validated with `LiquidAI/LFM2.5-8B-A1B-MLX-4bit`
- Qwen3-MoE - Adapter and unit coverage present
Conclusion
REAP-MLX demonstrates that principled expert pruning is viable on Apple Silicon without PyTorch or CUDA. By combining import-light design, adapter-based architecture support, calibration-driven saliency estimation, and save/reload validation, the system produces pruned MoE models deployable directly in MLX-LM inference pipelines.
Across our evaluation, REAP-MLX demonstrated stable quality retention across compression ratios: 96.8% of code generation performance at 25% compression and 91.4% at 50% compression, with low variance across calibration seeds. These results suggest that many production MoE deployments carry significant expert redundancy that can be identified and removed through calibration-driven saliency estimation - without requiring retraining or gradient-based analysis.
For teams running MoE models in resource-constrained environments - edge devices, single-GPU inference, or high-throughput serving - REAP offers a practical path to smaller, faster artifacts with limited loss of task-specific capability at moderate compression ratios.
REAP-MLX is open-source at github.com/egesabanci/reap-mlx. Contributions and adapter additions for new MoE architectures are welcome.
91.4%
Quality retained at 50% compression
Across 3 independent calibration runs, code generation retention averaged 91.4% with <0.4pp variance - indicating stable expert saliency rankings at this budget.
16 / 32
Experts retained at 50% compression
Top-4 routing is clamped to the retained expert count. The shared expert is preserved unchanged, and the router gate is sliced to the rows of the retained experts.
2.3 pp
Average quality drop at 25% compression
At 25% compression, the mean degradation across all three benchmarks was 2.3 percentage points (2.3, 2.4, and 2.3 points for code generation, reasoning, and language understanding) - within typical evaluation noise for models of this scale.
<0.4 pp
Observed variance across calibration seeds
Saliency rankings were consistent across three random seeds at 512 calibration samples. Results suggest 256 samples may suffice for stable rankings.