SyMerge: From Non-Interference to
Synergistic Merging via Single-Layer Adaptation

Sungkyunkwan University, NAVER AI Lab
Figure: SyMerge overview.

Abstract

Model merging offers an efficient alternative to multi-task learning by combining independently fine-tuned models, but most prior approaches focus mainly on avoiding task interference. We argue instead that the real potential of merging lies in achieving synergy, where tasks enhance one another. Our intuition comes from a pilot study showing that when a classifier trained on one task is paired with the encoder of another, the resulting cross-task performance strongly predicts merge quality. Moreover, adapting even a single task-specific layer can substantially improve this compatibility, suggesting a simple yet powerful lever for synergy. Building on this insight, we introduce SyMerge, a lightweight framework that jointly optimizes one task-specific layer and merging coefficients. To ensure stability without labels, SyMerge employs a robust self-labeling strategy guided by expert model predictions, avoiding the pitfalls of entropy-based adaptation. This minimalist yet principled design achieves state-of-the-art results across vision, dense prediction, and NLP benchmarks, while also producing adapted layers that transfer effectively to other merging methods.

Motivation

Rethinking Cross-task Performance

We hypothesize that a model's cross-task performance closely relates to its merging performance. In a preliminary study on 20 vision tasks with ViT-B/32, we observe a strong positive correlation (r = 0.863, p < 0.001) between average cross-task performance and average merge performance.

Concretely, given a model pair (A, B), we evaluate: (1) cross-task performance by attaching A's encoder to B's classifier and evaluating on B's task, and (2) merging performance by weight-averaging the two encoders and evaluating the merged encoder on B's task.
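
For concreteness, a minimal sketch of the two evaluations is given below. It assumes CLIP-style image encoders with per-task linear classifiers; all names (accuracy, weight_average, encoder_a, classifier_b, loader_b) are illustrative rather than taken from any released code.

```python
# Sketch of the two pilot-study evaluations (assumed CLIP-style components).
import copy
import torch

@torch.no_grad()
def accuracy(encoder, classifier, loader, device="cpu"):
    """Top-1 accuracy of an encoder/classifier pair on a labeled data loader."""
    encoder.eval(); classifier.eval()
    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        logits = classifier(encoder(images))
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.numel()
    return correct / total

def weight_average(encoder_a, encoder_b):
    """Uniform weight averaging of two encoders with identical architecture."""
    merged = copy.deepcopy(encoder_a)
    sd_a, sd_b = encoder_a.state_dict(), encoder_b.state_dict()
    merged.load_state_dict({k: 0.5 * (sd_a[k] + sd_b[k]) for k in sd_a})
    return merged

# (1) Cross-task performance: task A's encoder with task B's classifier on B's data.
# cross_ab = accuracy(encoder_a, classifier_b, loader_b)
# (2) Merging performance: the weight-averaged encoder with B's classifier on B's data.
# merge_ab = accuracy(weight_average(encoder_a, encoder_b), classifier_b, loader_b)
```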

Correlation between cross-task and merging performance

Classifier Training Unlocks Task Synergy

We compare two settings: a baseline that uses each task's original classifier (the one trained with its native encoder), and a variant that replaces it with a classifier adapted on a fixed merged encoder. In both cases the classifier is then evaluated alongside another task's encoder. This two-step protocol (adapt, then evaluate cross-task) provides a direct measure of functional alignment across tasks.

This protocol consistently improves cross-task performance over the baseline, showing that modest adaptation can substantially boost cross-task compatibility, a reliable signal for merge success. Because the protocol still depends on labels, we design SyMerge to obtain the same effect without supervision.
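
The adaptation step itself is small. A hedged sketch is shown below, assuming a frozen merged encoder and a labeled loader for task B; names such as adapt_classifier and labeled_loader_b are ours, not the paper's code.

```python
# Pilot protocol, step 1: fine-tune only task B's classifier on a frozen merged
# encoder (this step uses labels; SyMerge later removes that requirement).
import torch
import torch.nn.functional as F

def adapt_classifier(merged_encoder, classifier_b, labeled_loader_b,
                     epochs=1, lr=1e-3, device="cpu"):
    merged_encoder.eval()                          # merged encoder stays fixed
    for p in merged_encoder.parameters():
        p.requires_grad_(False)
    optimizer = torch.optim.Adam(classifier_b.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in labeled_loader_b:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = merged_encoder(images)     # frozen features
            loss = F.cross_entropy(classifier_b(feats), labels)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
    return classifier_b

# Step 2 then evaluates the adapted classifier alongside another task's encoder,
# e.g. accuracy(encoder_a, adapted_classifier_b, loader_b) with the helper above.
```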

Pilot study illustrating functional alignment protocol

Method

SyMerge tackles the label scarcity challenge in model merging by turning individually fine-tuned models into expert teachers. Instead of unstable entropy minimization, we adopt a self-labeling strategy that distills each teacher’s most confident predictions into the merged model.

For each task k, we define a task-specific loss (cross-entropy for classification, L1/L2 for regression, etc.) between the merged model’s outputs and the teacher-provided pseudo-labels on the unlabeled target data. This keeps the merged encoder aligned with expert behavior without relying on ground-truth annotations.
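
For the classification case, one illustrative way to write this objective (notation ours) is the following, where f_{θ_m,k} is the merged model read out through task k's head, f_{θ_k} is the frozen expert for task k, and D_k is task k's unlabeled target data:

```latex
\mathcal{L}_k
  = \mathbb{E}_{x \sim \mathcal{D}_k}
    \Big[ \mathrm{CE}\big( f_{\theta_m,k}(x),\; \hat{y}_k(x) \big) \Big],
\qquad
\hat{y}_k(x) = \arg\max_{c} \big[ f_{\theta_k}(x) \big]_c .
```

For regression tasks, the cross-entropy term would simply be replaced by an L1/L2 distance to the teacher's prediction, as noted above.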

The only trainable parameters are:

  • Merging coefficients, which balance contributions from each model.
  • A single task-specific layer, adapted to the merged encoder.
Jointly updating these two components encourages alignment across tasks while keeping the approach lightweight; a sketch of the joint update follows.
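
The sketch below assumes a task-arithmetic-style parameterization of the merged encoder, theta_merged = theta_pre + sum_k lambda_k * (theta_k - theta_pre), with one coefficient per task, classification heads as the single task-specific layers, and pseudo-labels from the frozen experts. All function and variable names are illustrative, not the released implementation; layer-wise coefficients would work the same way.

```python
# Joint update of merging coefficients and per-task heads (sketch, PyTorch >= 2.0).
import torch
import torch.nn.functional as F

def merge_state_dict(pretrained_sd, task_sds, coeffs):
    """theta_merged = theta_pre + sum_k coeffs[k] * (theta_k - theta_pre)."""
    return {
        name: w0 + sum(c * (sd[name] - w0) for c, sd in zip(coeffs, task_sds))
        for name, w0 in pretrained_sd.items()
    }

def symerge_step(encoder, pretrained_sd, task_sds, coeffs, heads, experts,
                 batches, optimizer):
    """One step: distill each frozen expert's predictions into the merged model."""
    optimizer.zero_grad()
    merged_sd = merge_state_dict(pretrained_sd, task_sds, coeffs)  # differentiable in coeffs
    total_loss = 0.0
    for head_k, expert_k, x_k in zip(heads, experts, batches):
        with torch.no_grad():
            pseudo = expert_k(x_k).argmax(dim=-1)          # teacher pseudo-labels
        # Run the shared encoder architecture with the merged weights.
        feats = torch.func.functional_call(encoder, merged_sd, (x_k,))
        total_loss = total_loss + F.cross_entropy(head_k(feats), pseudo)
    total_loss.backward()
    optimizer.step()                                       # updates coeffs and heads only
    return float(total_loss)

# Only the merging coefficients and the per-task heads are trainable, e.g.:
# coeffs = torch.nn.Parameter(torch.full((num_tasks,), 0.3))
# params = [coeffs] + [p for h in heads for p in h.parameters()]
# optimizer = torch.optim.Adam(params, lr=1e-3)
```

Here the experts are the frozen fine-tuned models (encoder plus head) used only to generate pseudo-labels; the pretrained and task-specific encoder weights are never updated.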

Experimental Results

Multiple Vision Tasks

Method                 ViT-B/32                        ViT-L/14
                       8 tasks   14 tasks  20 tasks    8 tasks   14 tasks  20 tasks
Pretrained             48.0      57.1      56.0        64.5      68.1      65.1
Individual             90.5      89.5      90.4        94.2      93.2      94.0
Task Arithmetic        69.1      65.4      60.8        84.5      78.4      73.9
Ties Merging           72.9      65.2      63.1        86.0      80.5      73.0
PCB Merging            75.6      63.8      52.7        87.5      81.3      73.4
LiNeS w/ TA            74.1      68.0      63.7        86.9      80.4      75.7
Consensus TA           74.9      70.3      65.0        86.6      82.3      78.9
TSV-M                  84.0      80.1      76.6        91.5      88.2      87.2
ISO-CTS                84.2      80.6      76.9        93.0      89.7      89.2
EMR-Merging            88.7      86.1      86.6        93.7      91.4      92.0
AdaMerging             80.1      76.7      69.6        90.8      85.2      82.1
Surgery w/ Ada.        87.5      84.6      84.5        92.3      90.4      89.5
WEMoE                  89.4      83.0      78.9        93.6      88.4      78.8
ProbSurgery w/ Ada.    87.4      84.9      84.5        92.7      90.7      90.2
SyMerge                90.1±0.1  88.7±0.1  88.6±0.4    94.1±0.0  92.8±0.1  93.2±0.1

Mean ± standard deviation over 5 runs.

Dense Prediction (NYUv2)

Method               Seg mIoU ↑  Seg Pix Acc ↑  Depth Abs Err ↓  Depth Rel Err ↓  Normal Mean ↓
Individual           52.0        74.2           41.5             17.3             24.2
Weight Averaging     36.6        64.0           55.0             23.2             30.0
Task Arithmetic      31.6        60.3           56.7             24.0             30.6
Ties-Merging         39.9        62.7           61.3             27.3             36.2
MagMax               24.7        54.7           60.3             23.9             30.3
LiNeS w/ TA          36.2        64.0           54.2             22.4             29.1
EMR-Merging          41.5        67.2           48.6             19.4             26.5
Surgery w/ TA        43.3        67.4           55.3             24.7             34.7
ProbSurgery w/ TA    43.6        67.6           52.6             22.3             36.7
SyMerge              49.8±0.3    73.1±0.2       45.3±0.6         18.8±0.5         26.2±0.1

NLP (GLUE, 8 tasks)

Method              CoLA      SST2      MRPC      STSB      QQP       MNLI      QNLI      RTE       Avg.
Individual          60.2      94.0      89.2      90.6      91.4      87.2      92.7      79.1      85.6
Weight Averaging    14.0      64.1      69.4      31.8      75.4      42.2      58.7      55.2      51.3
Task Arithmetic     18.8      85.9      79.9      74.0      83.8      59.1      69.7      62.1      66.7
MagMax              17.3      76.0      70.8      71.3      85.8      70.4      59.5      45.1      62.0
Ties-Merging        20.5      84.4      81.1      58.2      85.7      64.7      74.8      43.0      64.0
LiNeS               26.1      86.4      78.9      72.9      83.3      56.0      75.9      59.9      67.4
EMR-Merging         40.0      93.4      86.3      82.8      89.7      85.5      89.6      74.4      80.2
Surgery w/ TA       46.8      93.6      89.7      88.7      85.1      78.0      85.3      76.5      80.5
ProbSurgery w/ TA   54.7      93.6      89.7      89.4      85.2      77.6      85.3      76.9      81.6
SyMerge             60.0±0.1  93.8±0.3  89.2±0.0  90.4±0.2  85.2±1.1  84.0±0.4  89.2±0.2  79.1±0.0  83.9±0.2

Key Observations

Cross-task transferability check

Encoder  Classifier   Merged         Cross-task
TA       Zero-shot    69.1           49.8
TA       Ours         79.6 (+10.5)   54.8 (+5.0)
Ada      Zero-shot    80.1           50.1
Ada      Ours         87.7 (+7.6)    54.1 (+4.0)

Replacing zero-shot classifiers with our adapted classifier improves both merged and cross-task evaluations without additional target-task training.

Effect of adjusting different layers

Impact of training different task-specific layers: adapting a single layer (the classifier) together with the merging coefficients works best, whereas training multiple layers underperforms.