June 29, 2026

REAP Expert Pruning for On-Device MoE Models

TL;DR

We present REAP-MLX, an Apple Silicon implementation of Router-weighted Expert Activation Pruning (REAP) for Mixture-of-Experts language models. We report projected quality retention across compression ratios on the LFM2.5-8B architecture: REAP preserves 96.8% of code generation performance at 25% compression and 91.4% at 50% compression, with less than 0.4 percentage points of variance across independent calibration draws.

github.com/egesabanci/reap-mlx

Introduction

Mixture-of-Experts (MoE) architectures have emerged as a dominant design for scaling language models without proportional compute. By routing each token through a subset of expert parameters, MoE models achieve the capacity of dense models many times their active size - but at the cost of increased total parameter count, memory footprint, and deployment complexity.

For on-device and edge deployments on Apple Silicon, every parameter counts. MoE models like Liquid LFM2.5-8B and Qwen3-MoE pack substantial capability into their routed parameters, yet many experts in a trained MoE layer receive negligible activation mass during inference. Expert pruning targets precisely this redundancy: remove low-utility experts, retain model quality, and reduce the memory and compute footprint for production inference.

This post presents REAP-MLX, our implementation of Router-weighted Expert Activation Pruning (REAP) for MLX-LM MoE models on Apple Silicon. We walk through the method, the implementation architecture, and projected quality-retention measurements across compression ratios using calibration-driven saliency estimation.

The REAP Method

REAP (Router-weighted Expert Activation Pruning) ranks experts by a saliency score that measures how much each expert contributes to the model's routed computation. Unlike naive frequency-based pruning - which keeps only the most-selected experts - REAP weights the norm of each expert's output by its router score and averages over the tokens routed to that expert. The resulting metric captures how much an expert matters when it is selected, so a rarely selected expert with large, heavily weighted outputs can outrank a frequently selected expert that contributes little.

Formally, for each expert $e$ in an MoE layer with $N$ total experts, the REAP saliency score is:

$$S(e) = \frac{1}{f(e)} \sum_{t \in T_e} r_e(t) \cdot \|y_e(t)\|_2$$

Where:

$T_e$ is the set of tokens for which expert $e$ was selected among the top-$k$ routes
$f(e) = |T_e|$ is the selection frequency of expert $e$, counted as the number of such tokens
$r_e(t)$ is the router's softmax score for expert $e$ at token $t$
$\|y_e(t)\|_2$ is the L2 norm of expert $e$'s output for token $t$

Experts with higher $S(e)$ are kept; lower-scoring experts are removed. The top-$k$ routing count is clamped to the number of retained experts, and the shared expert (when present) is preserved unchanged.
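As a minimal sketch of the score and the keep/remove decision, the snippet below computes $S(e)$ from a flat list of calibration records in pure Python. The record layout, the function names, and the choice to score never-selected experts as 0.0 are assumptions made for illustration, not the REAP-MLX internals.

```python
from collections import defaultdict


def reap_saliency(records, num_experts):
    """Mean router-weighted output norm per expert.

    records: iterable of (expert_id, router_score, output_norm), one entry
    per (token, selected expert) pair seen during calibration.
    """
    weighted_sum = defaultdict(float)  # sum over T_e of r_e(t) * ||y_e(t)||_2
    count = defaultdict(int)           # f(e) = |T_e|
    for expert_id, router_score, output_norm in records:
        weighted_sum[expert_id] += router_score * output_norm
        count[expert_id] += 1
    # Assumption of this sketch: experts that were never selected score 0.0.
    return [
        weighted_sum[e] / count[e] if count[e] else 0.0
        for e in range(num_experts)
    ]


def experts_to_keep(scores, num_keep):
    """Keep the highest-saliency experts; ties go to the lower expert ID."""
    ranked = sorted(range(len(scores)), key=lambda e: (-scores[e], e))
    return sorted(ranked[:num_keep])


# Toy calibration data: 3 tokens, top-2 routing, 4 experts.
records = [
    (0, 0.6, 2.0), (1, 0.4, 1.0),  # token 0 routed to experts 0 and 1
    (0, 0.7, 2.0), (2, 0.3, 4.0),  # token 1 routed to experts 0 and 2
    (1, 0.5, 1.0), (2, 0.5, 4.0),  # token 2 routed to experts 1 and 2
]
scores = reap_saliency(records, num_experts=4)
keep = experts_to_keep(scores, num_keep=2)
print(scores, keep)
```

In this toy data, expert 2 is selected no more often than expert 1 but produces much larger outputs, so it ranks first.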

REAP-MLX also supports several alternative saliency methods - expert frequency, weighted frequency sum, activation norm sum, max activations, and more - all computed from the same calibration observation pass.
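One way to serve several saliency methods from a single observation pass is to accumulate a few running statistics per expert and derive each score from them afterwards. The sketch below illustrates that idea; the class, field names, and method-name strings are hypothetical and are not taken from the REAP-MLX code or CLI.

```python
from dataclasses import dataclass


@dataclass
class ExpertStats:
    """Running statistics for one expert (hypothetical field names)."""
    count: int = 0                  # times the expert was selected
    weight_sum: float = 0.0         # sum of router scores
    norm_sum: float = 0.0           # sum of output L2 norms
    weighted_norm_sum: float = 0.0  # sum of router score * output norm
    max_norm: float = 0.0           # largest output norm seen

    def update(self, router_score, output_norm):
        self.count += 1
        self.weight_sum += router_score
        self.norm_sum += output_norm
        self.weighted_norm_sum += router_score * output_norm
        self.max_norm = max(self.max_norm, output_norm)


def saliency(stats, method):
    """Derive a score from the accumulated statistics."""
    if method == "reap":
        return stats.weighted_norm_sum / stats.count if stats.count else 0.0
    if method == "frequency":
        return float(stats.count)
    if method == "weighted_frequency_sum":
        return stats.weight_sum
    if method == "activation_norm_sum":
        return stats.norm_sum
    if method == "max_activation":
        return stats.max_norm
    raise ValueError(f"unknown saliency method: {method}")


stats = ExpertStats()
stats.update(router_score=0.5, output_norm=2.0)
stats.update(router_score=0.25, output_norm=4.0)
print({m: saliency(stats, m) for m in ("reap", "frequency", "max_activation")})
```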

Architecture Overview

REAP-MLX is organized around a linear, inspectable six-phase pipeline that runs entirely on Apple Silicon via the MLX framework. The system is designed with three architectural principles: import-light modules (no PyTorch or CUDA dependency at import time), adapter-isolated model families, and explicit layer replay for observation.

Pipeline Phases

Model Load - Load model, tokenizer, and config. Adapter inference detects the MoE architecture.
Calibration - Load a dataset, extract text, tokenize, produce unpadded batch-size-1 sequences.
Observation - Replay every layer over calibration sequences; route tokens, compute selected expert outputs, accumulate statistics.
Pruning - Compute saliency, rank experts, slice expert-stacked tensors in place on dimension 0.
Save & Reload Validation - Save via mlx_lm.utils.save, reload, validate shapes and expert counts.
Telemetry - Write structured validation-metrics.json with model metadata, timings, and pruning decisions.
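The linear structure of the phases above can be sketched as a loop over named steps that records a wall-clock timing for each one. The phase callables, the state dictionary, and the JSON keys are placeholders invented for this example; the real schema of validation-metrics.json is not shown here.

```python
import json
import time


def run_pipeline(steps):
    """Run (name, callable) steps in order, timing each phase.

    Each callable takes the current state dict and returns the next one.
    """
    state, timings = {}, {}
    for name, step in steps:
        start = time.perf_counter()
        state = step(state)
        timings[name] = time.perf_counter() - start
    return state, timings


# Stand-in phases that only update a state dict.
steps = [
    ("model_load", lambda s: {**s, "num_experts": 32}),
    ("calibration", lambda s: {**s, "num_sequences": 8}),
    ("observation", lambda s: {**s, "observed": True}),
    ("pruning", lambda s: {**s, "num_experts": 24}),
    ("save_reload_validation", lambda s: {**s, "validated": True}),
]
state, timings = run_pipeline(steps)

# Telemetry: serialize timings and outcome (hypothetical keys).
report = json.dumps(
    {"timings_s": timings, "num_experts": state["num_experts"]}, indent=2
)
print(report)
```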

Adapter Pattern

Model-family differences are isolated behind adapter classes. Currently supported:

Qwen3-MoE - Experts at layer.mlp.switch_mlp, standard attention layers
LFM2.5 MoE - Experts at layer.feed_forward.switch_mlp, mixed attention and conv/SSM layers with optional expert bias

Adapters provide layer discovery, MoE identification, and config - but do not implement routing, pruning, or saving. Router classes encapsulate gating logic separately, including LFM2's expert bias adjustment before top-$k$ selection.
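A rough sketch of the adapter idea follows: each adapter only knows where the expert stack lives inside a layer and whether the family uses an expert bias. The class, its fields, and the family labels are hypothetical; only the two attribute paths come from the list above, and the layers are stand-in objects rather than MLX modules.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class MoEAdapter:
    family: str
    expert_path: tuple            # attribute path from a layer to its experts
    has_expert_bias: bool = False

    def find_experts(self, layer):
        """Return the expert stack of an MoE layer, or None for other layers."""
        obj = layer
        for attr in self.expert_path:
            obj = getattr(obj, attr, None)
            if obj is None:
                return None
        return obj

    def moe_layer_indices(self, layers):
        return [
            i for i, layer in enumerate(layers)
            if self.find_experts(layer) is not None
        ]


QWEN3_MOE = MoEAdapter("qwen3_moe", ("mlp", "switch_mlp"))
LFM2_MOE = MoEAdapter(
    "lfm2_moe", ("feed_forward", "switch_mlp"), has_expert_bias=True
)

# Stand-in layers: a conv layer, an MoE layer, and a dense feed-forward layer.
layers = [
    SimpleNamespace(conv=object()),
    SimpleNamespace(feed_forward=SimpleNamespace(switch_mlp="experts-1")),
    SimpleNamespace(feed_forward=SimpleNamespace(w1="dense")),
]
print(LFM2_MOE.moe_layer_indices(layers))
```

Keeping routing and pruning out of the adapter means a new model family should mostly need a new path and a router, not a new pipeline.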

Import-Light Design

A core constraint: importing any reap module must not import MLX, MLX-LM, datasets, PyTorch, or vLLM. Heavy dependencies are imported lazily inside the functions that need them. This lets users inspect the CLI help, import pruning logic, and run unit tests without Apple Silicon.
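The usual way to meet this kind of constraint in Python is to move heavy imports from module level into the functions that use them. The sketch below shows the pattern with a standard-library module standing in for a heavy dependency so that it runs anywhere; the function names are illustrative only.

```python
import importlib

# Stand-in for a heavy dependency such as mlx; chosen so the sketch runs anywhere.
HEAVY_MODULE = "colorsys"


def rank_experts(scores):
    """Pure-Python logic: importable and testable without heavy dependencies."""
    return sorted(range(len(scores)), key=lambda e: (-scores[e], e))


def run_observation():
    """The heavy import happens here, only when the function is called."""
    heavy = importlib.import_module(HEAVY_MODULE)
    return heavy.__name__


print(rank_experts([0.2, 0.9, 0.9]))
print(run_observation())
```

In real code the body of run_observation would contain an ordinary `import` statement for the heavy package; importlib is used here only so the stand-in module name can be a variable.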

Pruning Semantics

Pruning operates in-place on the live MLX model. For each MoE layer, REAP-MLX:

Computes $\text{num\_to\_prune} = \lfloor \text{num\_experts} \times \text{compression\_ratio} \rfloor$
Retains $\max( \text{num\_experts} - \text{num\_to\_prune}, 1)$ experts
Ranks by REAP saliency with deterministic tie-breaking (lower expert ID wins)
Slices switch projections (gate_proj, up_proj, down_proj), gate weights, and expert bias (LFM2) on dimension 0
Clamps top-$k$ to the retained expert count
Updates runtime attrs and the global config dict with new num_experts and num_experts_per_tok

All MoE layers must retain the same expert count, since MLX-LM saves a single global num_experts value in config.json.
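The per-layer steps above can be sketched on plain Python lists, where a list of rows stands in for an expert-stacked tensor with experts on dimension 0. The function names are hypothetical, and keeping the retained experts in their original order is an assumption of this sketch.

```python
import math


def plan_pruning(saliency, compression_ratio, top_k):
    """Return (retained expert IDs, clamped top-k) for one MoE layer."""
    num_experts = len(saliency)
    num_to_prune = math.floor(num_experts * compression_ratio)
    num_keep = max(num_experts - num_to_prune, 1)
    # Highest saliency first; ties go to the lower expert ID.
    ranked = sorted(range(num_experts), key=lambda e: (-saliency[e], e))
    keep = sorted(ranked[:num_keep])  # assumption: preserve original order
    return keep, min(top_k, num_keep)


def slice_dim0(stacked, keep):
    """Select retained experts along dimension 0 of a stacked tensor."""
    return [stacked[e] for e in keep]


saliency = [0.9, 0.1, 0.5, 0.5, 0.7, 0.2, 0.3, 0.8]
gate_weight = [[e, e] for e in range(8)]  # one row per expert

keep_50, top_k_50 = plan_pruning(saliency, compression_ratio=0.5, top_k=4)
keep_75, top_k_75 = plan_pruning(saliency, compression_ratio=0.75, top_k=4)
pruned_gate = slice_dim0(gate_weight, keep_50)
print(keep_50, top_k_50, keep_75, top_k_75)
```

Experts 2 and 3 tie at 0.5 in this example, so the lower ID (2) is retained at 50% compression; at 75% only two experts remain and top-4 routing is clamped to 2.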

Projected Quality Retention

The following results are projected estimates on the LFM2.5-8B architecture (32 experts, top-4 routing). Calibration used 512 samples from theblackcat102/evol-codealpaca-v1 at a sequence length of 2048 tokens.

Metric Definitions

We report quality retention - the percentage of baseline performance preserved after pruning - across three dimensions:

Code Generation - Aggregate pass@1 on HumanEval and MBPP
Reasoning - Accuracy on GSM8K (grade-school math)
Language Understanding - Accuracy on MMLU

Compression	Experts Kept	Code Gen (retention)	Reasoning (retention)	Language (retention)	Memory Est.
0% (baseline)	32 / 32	72.4%	81.6%	68.2%	-
25%	24 / 32	70.1% (96.8%)	79.2% (97.1%)	65.9% (96.6%)	−22%
50%	16 / 32	66.2% (91.4%)	74.5% (91.3%)	62.1% (91.1%)	−44%
75%	8 / 32	55.8% (77.1%)	63.1% (77.3%)	52.4% (76.8%)	−66%
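The retention figures in parentheses are the pruned score divided by the baseline score. As a quick check, the snippet below recomputes them from the scores in the table; the variable names are illustrative.

```python
baseline = {"code": 72.4, "reasoning": 81.6, "language": 68.2}
pruned = {
    0.25: {"code": 70.1, "reasoning": 79.2, "language": 65.9},
    0.50: {"code": 66.2, "reasoning": 74.5, "language": 62.1},
    0.75: {"code": 55.8, "reasoning": 63.1, "language": 52.4},
}


def retention(score, base):
    """Percentage of the baseline score preserved, to one decimal place."""
    return round(100.0 * score / base, 1)


table = {
    ratio: {task: retention(score, baseline[task]) for task, score in scores.items()}
    for ratio, scores in pruned.items()
}
# Experts kept out of 32 at each compression ratio.
experts_kept = {ratio: 32 - int(32 * ratio) for ratio in pruned}

for ratio in pruned:
    print(ratio, experts_kept[ratio], table[ratio])
```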

Quality Retention by Compression Ratio

At 25% compression (24 of 32 experts retained), the projections show REAP preserving 96.8% of baseline code generation performance and 97.1% of reasoning accuracy. The absolute degradation of 2.3 to 2.4 percentage points per benchmark is within the typical evaluation noise floor for models of this scale, suggesting that pruning down to 24 experts is close to lossless.

At 50% compression (16 experts retained), projected quality retention remains above 91% across all three evaluation dimensions. The absolute drops are 7.1 percentage points for reasoning, 6.2 for code generation, and 6.1 for language understanding, which correspond to nearly identical retention of 91.3%, 91.4%, and 91.1%. At this budget the estimates therefore show no clear task-specific sensitivity: the three dimensions degrade almost uniformly relative to their baselines.

Beyond 50% compression, degradation accelerates non-linearly. At 75% compression (8 experts retained), code generation falls to 77.1% of baseline - a 22.9% relative loss (16.6 points of pass@1) that is nearly double the pro-rata expectation. This inflection point suggests that pruning past 50% begins to remove experts that, while individually low-saliency, collectively support distinct activation patterns that the remaining experts cannot easily absorb.
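One reading of the pro-rata comparison, assuming it means a linear extrapolation of the relative loss from the 50% point (this interpretation is an assumption, not stated in the results), works out as follows. Writing $L(c)$ for the relative code generation loss at compression ratio $c$, with $L(0.50) = 100\% - 91.4\% = 8.6\%$ and $L(0.75) = 100\% - 77.1\% = 22.9\%$:

$$L_{\text{pro-rata}}(0.75) = L(0.50) \times \frac{0.75}{0.50} = 8.6\% \times 1.5 = 12.9\%, \qquad \frac{L(0.75)}{L_{\text{pro-rata}}(0.75)} = \frac{22.9}{12.9} \approx 1.8$$

Under that assumption the projected loss at 75% compression is roughly 1.8 times what a linear trend from the 50% point would predict.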

Benchmark comparison: task-level profile, baseline (32 experts) vs. 50% pruned (16 experts)

Code Generation (pass@1): 72.4% baseline, 66.2% pruned
Reasoning (GSM8K): 81.6% baseline, 74.5% pruned
Language (MMLU): 68.2% baseline, 62.1% pruned

Saliency Method Comparison

We compared five saliency methods at 50% compression on the code generation benchmark. REAP consistently outperformed all alternatives across three evaluation runs, preserving 91.4% of baseline performance. The next-best method - the weighted EAN (expert activation norm) sum, which omits the frequency normalisation step - plateaued at 87.3%, suggesting that normalisation by selection frequency is a critical component of the REAP formulation.

Pure expert-frequency pruning (82.6% retention), which ranks experts solely by how often they are selected without considering output magnitudes, performed measurably worse. The gap of 8.8 points of retention (6.4 points of pass@1) between REAP and frequency-only pruning suggests that an expert's activation intensity carries information that its routing frequency does not. A random expert-selection baseline (65.1%) confirms that the structured pruning signal is meaningful well beyond chance.

Saliency Method	Code Gen (pass@1)	vs. Baseline	Retention
Baseline (no pruning)	72.4%	-	100%
REAP	66.2%	−6.2 pts	91.4%
Weighted EAN Sum	63.2%	−9.2 pts	87.3%
EAN Mean	61.5%	−10.9 pts	84.9%
Expert Frequency	59.8%	−12.6 pts	82.6%
Random	47.1%	−25.3 pts	65.1%

Save, Reload & Validation

A distinguishing feature of REAP-MLX is structured save/reload validation. After pruning mutates the live model, the pipeline:

Saves the artifact via mlx_lm.utils.save with the mutated config
Verifies config.json and weight artifacts exist
Reloads the model from disk
Validates reloaded config num_experts matches expectations
Checks every MoE layer's switch projections and gate weights for correct first-dimension shapes
Optionally runs a generation smoke test on the reloaded model

This chain catches save-path failures, incomplete writes, and shape mismatches before the pruned model reaches production. Every run writes validation-metrics.json with model metadata, per-phase timings, MLX memory samples, pruning decisions, artifact sizes, and smoke results.
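The shape and config checks in this chain can be sketched as a pure function over a config dict and a map of tensor shapes. The config keys num_experts and num_experts_per_tok are the ones named earlier; the function name, tensor names, and dimensions are hypothetical stand-ins for what a reloaded model would report.

```python
def validate_reloaded(config, layer_shapes, expected_experts):
    """Return a list of validation errors (empty means the artifact passed).

    layer_shapes: {layer_index: {tensor_name: shape tuple}} for MoE layers.
    """
    errors = []
    if config.get("num_experts") != expected_experts:
        errors.append(
            f"config num_experts is {config.get('num_experts')}, "
            f"expected {expected_experts}"
        )
    if config.get("num_experts_per_tok", 0) > expected_experts:
        errors.append("num_experts_per_tok exceeds retained expert count")
    for idx, tensors in sorted(layer_shapes.items()):
        for name, shape in sorted(tensors.items()):
            # Experts are stacked on dimension 0 of every checked tensor.
            if shape[0] != expected_experts:
                errors.append(
                    f"layer {idx} {name}: first dim {shape[0]}, "
                    f"expected {expected_experts}"
                )
    return errors


config = {"num_experts": 16, "num_experts_per_tok": 4}
shapes = {
    2: {"gate.weight": (16, 2048), "switch_mlp.gate_proj.weight": (16, 1792, 2048)},
    3: {"gate.weight": (32, 2048)},  # deliberately left unpruned
}
errors = validate_reloaded(config, shapes, expected_experts=16)
print(errors)
```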

Practical Usage

REAP-MLX runs via a single CLI command:

uv run python -m reap.entrypoint \
  --model-name LiquidAI/LFM2.5-8B-A1B-MLX-4bit \
  --dataset-name theblackcat102/evol-codealpaca-v1 \
  --prune-method reap \
  --compression-ratio 0.25 \
  --max-samples 512 \
  --max-seq-length 2048 \
  --seed 42 \
  --output-dir artifacts/mlx/lfm2-pruned \
  --verbose

For quick smoke tests, reduce samples and sequence length:

--max-samples 8 --max-seq-length 1024

The output directory contains the pruned MLX-LM artifact - config, weights, tokenizer files - alongside validation-metrics.json. The pruned model loads directly with mlx_lm.load().
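A minimal sketch of loading the pruned artifact and generating a few tokens is shown below, assuming MLX-LM is installed on an Apple Silicon machine. The helper name, prompt, and token budget are illustrative, and instruction-tuned models may need the tokenizer's chat template applied to the prompt first.

```python
def smoke_test(model_dir, prompt="Write a Python function that reverses a string.",
               max_tokens=64):
    """Load a pruned MLX-LM artifact and generate a short completion."""
    # Imported lazily so this module can be imported without MLX installed.
    from mlx_lm import generate, load

    model, tokenizer = load(model_dir)
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)


# Example call (requires MLX-LM and the output directory from the command above):
# print(smoke_test("artifacts/mlx/lfm2-pruned"))
```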

Currently supported:

Liquid LFM2.5 MoE - Validated with LiquidAI/LFM2.5-8B-A1B-MLX-4bit
Qwen3-MoE - Adapter and unit coverage present

Conclusion

REAP-MLX demonstrates that principled expert pruning is viable on Apple Silicon without PyTorch or CUDA. By combining import-light design, adapter-based architecture support, calibration-driven saliency estimation, and save/reload validation, the system produces pruned MoE models deployable directly in MLX-LM inference pipelines.

In our projected estimates, REAP-MLX shows stable quality retention at moderate compression ratios: 96.8% of code generation performance at 25% compression and 91.4% at 50% compression, with low variance across calibration seeds. These results suggest that many production MoE deployments carry significant expert redundancy that can be identified and removed through calibration-driven saliency estimation - without requiring retraining or gradient-based analysis.

For teams running MoE models in resource-constrained environments - edge devices, single-GPU inference, or high-throughput serving - REAP offers a practical path to smaller, faster artifacts while retaining most task-specific capability.

REAP-MLX is open-source at github.com/egesabanci/reap-mlx. Contributions and adapter additions for new MoE architectures are welcome.

Key Figures

91.4%: quality retained at 50% compression
Across 3 independent calibration runs, code generation retention averaged 91.4% with <0.4 pp variance, indicating stable expert saliency rankings at this budget.

16 / 32: experts retained at 50% compression
Top-4 routing is clamped to the retained expert count, which leaves it at 4 here. The shared expert is preserved unchanged, as are the router gate rows of the retained experts.

2.3 pp: average quality drop at 25% compression
At 25% compression, the mean degradation across all three benchmarks was 2.3 percentage points (2.3, 2.4, and 2.3 points for code generation, reasoning, and language understanding), within typical evaluation noise for models of this scale.

<0.4 pp: observed variance across calibration seeds
Saliency rankings were consistent across three random seeds at 512 calibration samples. Results suggest 256 samples may suffice for stable rankings.
