SyMerge: From Non-Interference to
Synergistic Merging via Single-Layer Adaptation

ICML 2026

SyMerge reframes model merging from avoiding interference to inducing synergy through single-layer adaptation and expert-guided self-labeling.

¹Sungkyunkwan University, ²NAVER AI Lab

Abstract

Model merging combines independently trained models into a single multi-task model. However, most existing approaches focus primarily on avoiding task interference. We argue that the greater potential of merging lies in enabling task synergy, where tasks actively improve one another. We identify cross-task performance, defined by the compatibility between encoders and predictors across tasks, as a key indicator of merge quality. We demonstrate that adapting only a single task-specific layer is sufficient to induce such synergy. We propose SyMerge, a lightweight framework that jointly optimizes merging coefficients and a single task-specific layer. We adopt an expert-guided self-labeling objective, which provides more stable supervision than entropy minimization. Intriguingly, we further show that SyMerge successfully merges models trained from different initializations, a regime where standard methods break down. Our minimalist yet principled method achieves state-of-the-art results across vision, dense prediction, and NLP benchmarks.

Goal

Move model merging beyond non-interference and explicitly optimize for task synergy.

Mechanism

Jointly train merging coefficients and one task-specific layer with expert-guided self-labels.

Outcome

Strong results across multiple domains, better shift robustness, and successful merging across disjoint basins.

Why SyMerge?

SyMerge is motivated by three observations: training-free merging is fragile under distribution shift, cross-task compatibility predicts merge quality, and a small amount of adaptation can already unlock synergy.

Training-free merging is brittle under corruption

We first focus on robustness. When several task datasets are intentionally corrupted, training-free merging methods degrade sharply, exposing how brittle purely data-agnostic merging can be under realistic distribution shift.

This motivates test-time adaptation on unlabeled target data. SyMerge keeps that adaptive flavor, but replaces unstable entropy minimization with a more reliable expert-guided self-labeling objective.

Performance of training-free and adaptive merging methods under corruption

Cross-task performance predicts merge quality

We hypothesize that a model's cross-task performance closely relates to its merging performance. In a preliminary study on 20 vision tasks with ViT-B/32, we observe a strong positive correlation (r = 0.863, p < 0.001) between average cross-task performance and average merge performance.

Concretely, given a model pair (A, B), we evaluate cross-task performance by attaching A's encoder to B's classifier and merge quality by evaluating the merged encoder on B's task. The strong correlation suggests that better functional alignment is a reliable signal for better merging.
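The pairing described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the toy encoders, classifier, dataset, and score lists are all hypothetical, and `cross_task_accuracy` and `pearson_r` are names introduced here for clarity.

```python
import math

def accuracy(encoder, classifier, dataset):
    """Fraction of (x, y) pairs for which classifier(encoder(x)) == y."""
    correct = sum(1 for x, y in dataset if classifier(encoder(x)) == y)
    return correct / len(dataset)

def cross_task_accuracy(encoder_a, classifier_b, dataset_b):
    """Cross-task score for the pair (A, B): A's encoder feeding B's classifier on B's data."""
    return accuracy(encoder_a, classifier_b, dataset_b)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy models: B's classifier expects features on B's scale.
encoder_a = lambda x: x
encoder_b = lambda x: 2 * x
classifier_b = lambda z: int(z > 4)
dataset_b = [(1, 0), (2, 0), (3, 1), (4, 1)]

native_acc = accuracy(encoder_b, classifier_b, dataset_b)            # B's own encoder
cross_acc = cross_task_accuracy(encoder_a, classifier_b, dataset_b)  # A's encoder

# Hypothetical per-pair averages, only to show how the correlation is computed.
cross_scores = [0.40, 0.55, 0.60, 0.75]
merge_scores = [0.50, 0.62, 0.70, 0.81]
r = pearson_r(cross_scores, merge_scores)
```

In the toy setup the mismatched encoder halves the accuracy of B's classifier, which is the kind of functional misalignment the cross-task score is meant to expose.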

Correlation between cross-task and merging performance

A small amount of adaptation is enough

Our pilot study compares two settings. The baseline pairs each task's original classifier with its native encoder. The variant retrains the classifier on a fixed merged encoder and then evaluates it with another task's encoder.

This simple two-stage protocol consistently improves cross-task performance, showing that modest adaptation can substantially improve functional alignment. SyMerge is built to obtain the same effect without labels by adapting just one task-specific layer together with the merging coefficients.

Pilot study illustrating functional alignment protocol

Method

SyMerge tackles the label scarcity challenge in model merging by turning individually fine-tuned models into expert teachers. Instead of relying on entropy minimization, we use expert-guided self-labeling to provide a stable training signal on unlabeled target data.

What is optimized?

Merging coefficients determine how each expert contributes to the shared encoder, while one task-specific layer is adapted to the merged representation.
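As a rough illustration of how coefficients can weight each expert's contribution, the sketch below assumes a task-arithmetic-style merge, where each expert adds its difference from a shared pretrained model scaled by one coefficient. The page does not spell out the exact merge rule or the granularity of the coefficients, so both the formula and the name `merge_with_coefficients` are assumptions.

```python
def merge_with_coefficients(pretrained, experts, coefficients):
    """Assumed merge rule: theta = theta_0 + sum_k lambda_k * (theta_k - theta_0).

    pretrained:   dict mapping parameter name -> list of floats (theta_0)
    experts:      list of dicts with the same keys and shapes (theta_k)
    coefficients: one float per expert (lambda_k); learnable in an adaptive method
    """
    merged = {}
    for name, base in pretrained.items():
        merged[name] = [
            b + sum(lam * (expert[name][i] - b)
                    for lam, expert in zip(coefficients, experts))
            for i, b in enumerate(base)
        ]
    return merged

# Hypothetical two-expert example with a single two-element parameter.
pretrained = {"layer": [0.0, 1.0]}
experts = [{"layer": [1.0, 1.0]}, {"layer": [0.0, 3.0]}]

balanced = merge_with_coefficients(pretrained, experts, [0.5, 0.5])
only_first = merge_with_coefficients(pretrained, experts, [1.0, 0.0])
```

Setting one coefficient to 1 and the rest to 0 recovers that expert's encoder, so the coefficients interpolate between the pretrained model and the individual experts.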

Why is it stable?

For each task, the merged model is trained to match confident predictions from the corresponding expert model, making the objective applicable beyond classification and more reliable than entropy minimization.
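For a classification task, matching confident expert predictions could look like the sketch below. It is only one plausible reading of the objective: the confidence threshold of 0.9, the hard-label cross-entropy form, and the function name `self_labeling_loss` are assumptions, and non-classification tasks would need a different matching loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_labeling_loss(merged_logits, expert_logits, threshold=0.9):
    """Mean cross-entropy of the merged model against the expert's pseudo-labels.

    Only samples on which the expert's top probability reaches `threshold`
    (an assumed hyperparameter) contribute. Returns 0.0 if none qualify.
    """
    losses = []
    for m_logits, e_logits in zip(merged_logits, expert_logits):
        e_probs = softmax(e_logits)
        confidence = max(e_probs)
        if confidence < threshold:
            continue  # skip samples where the expert is not confident
        label = e_probs.index(confidence)  # expert's prediction as pseudo-label
        m_probs = softmax(m_logits)
        losses.append(-math.log(m_probs[label]))
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical batch: the expert is confident on sample 0 and unsure on sample 1.
expert_logits = [[5.0, 0.0], [0.1, 0.0]]
undecided = self_labeling_loss([[0.0, 0.0], [0.0, 0.0]], expert_logits)
agreeing = self_labeling_loss([[5.0, 0.0], [0.0, 0.0]], expert_logits)
```

Unlike entropy minimization, which rewards any confident prediction, this loss only decreases when the merged model moves toward the label the expert chose.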

This minimalist design keeps the method lightweight while still realigning the shared encoder and the task-specific layer together. We show that this joint optimization is what makes SyMerge scalable as the number of tasks grows.

Experimental Results

SyMerge consistently improves over prior training-free and test-time adaptive baselines across classification, dense prediction, and NLP, while staying close to the individual expert upper bound.

Multiple Vision Tasks

Performance stays strong from 8 to 20 tasks on both ViT-B/32 and ViT-L/14, whereas competing methods degrade much more sharply as the task count increases.

Backbone             ViT-B/32                            ViT-L/14
Method               8 tasks     14 tasks    20 tasks    8 tasks     14 tasks    20 tasks
Individual           90.5        89.5        90.4        94.2        93.2        94.0
Task Arithmetic      69.1        65.4        60.8        84.5        78.4        73.9
Ties-Merging         72.9        65.2        63.1        86.0        80.5        73.0
PCB Merging          75.6        63.8        52.7        87.5        81.3        73.4
LiNeS w/ TA          74.1        68.0        63.7        86.9        80.4        75.7
Consensus TA         74.9        70.3        65.0        86.6        82.3        78.9
TSV-M                84.0        80.1        76.6        91.5        88.2        87.2
ISO-Merging          84.2        80.6        76.9        93.0        89.7        89.2
EMR-Merging          88.7        86.1        86.6        93.7        91.4        92.0
LOTMerging           82.7        77.3        73.1        90.5        86.2        84.1
AdaMerging           80.1        76.7        69.6        90.8        85.2        82.1
Surgery w/ Ada.      87.5        84.6        84.5        92.3        90.4        89.5
WEMoE                89.4        83.0        78.9        93.6        88.4        78.8
ProbSurgery w/ Ada.  87.4        84.9        84.5        92.7        90.7        90.2
SyMerge              90.1 ± 0.1  88.7 ± 0.1  88.6 ± 0.4  94.1 ± 0.0  92.8 ± 0.1  93.2 ± 0.1

Mean ± standard deviation over 5 runs.

Dense Prediction (NYUv2)

Dense prediction is a harsher regime with heterogeneous objectives. SyMerge remains competitive with individual models while other methods suffer large drops on depth and normal estimation.

Method             Seg mIoU ↑       Seg Pix Acc ↑    Depth Abs Err ↓  Depth Rel Err ↓  Normal Mean ↓
Individual         52.0             74.2             41.5             17.3             24.2
Weight Averaging   36.6             64.0             55.0             23.2             30.0
Task Arithmetic    31.6             60.3             56.7             24.0             30.6
Ties-Merging       39.9             62.7             61.3             27.3             36.2
MagMax             24.7             54.7             60.3             23.9             30.3
LiNeS w/ TA        36.2             64.0             54.2             22.4             29.1
EMR-Merging        41.5             67.2             48.6             19.4             26.5
Surgery w/ TA      43.3             67.4             55.3             24.7             34.7
ProbSurgery w/ TA  43.6             67.6             52.6             22.3             36.7
SyMerge            49.8 ± 0.3       73.1 ± 0.2       45.3 ± 0.6       18.8 ± 0.5       26.2 ± 0.1

NLP (GLUE, 8 tasks)

On GLUE, SyMerge achieves the best average while avoiding the severe collapse that other approaches show on harder outlier tasks such as CoLA.

Method             CoLA        SST2        MRPC        STSB        QQP         MNLI        QNLI        RTE         Avg.
Individual         60.2        94.0        89.2        90.6        91.4        87.2        92.7        79.1        85.6
Weight Averaging   14.0        64.1        69.4        31.8        75.4        42.2        58.7        55.2        51.3
Task Arithmetic    18.8        85.9        79.9        74.0        83.8        59.1        69.7        62.1        66.7
MagMax             17.3        76.0        70.8        71.3        85.8        70.4        59.5        45.1        62.0
Ties-Merging       20.5        84.4        81.1        58.2        85.7        64.7        74.8        43.0        64.0
LiNeS              26.1        86.4        78.9        72.9        83.3        56.0        75.9        59.9        67.4
TSV-M              28.7        88.4        83.1        76.8        85.8        59.9        75.2        47.7        68.2
ISO-Merging        27.3        86.2        79.2        73.7        85.9        71.7        81.1        70.4        71.9
EMR-Merging        40.0        93.4        86.3        82.8        89.7        85.5        89.6        74.4        80.2
Surgery w/ TA      46.8        93.6        89.7        88.7        85.1        78.0        85.3        76.5        80.5
ProbSurgery w/ TA  54.7        93.6        89.7        89.4        85.2        77.6        85.3        76.9        81.6
SyMerge            60.0 ± 0.1  93.8 ± 0.3  89.2 ± 0.0  90.4 ± 0.2  85.2 ± 1.1  84.0 ± 0.4  89.2 ± 0.5  79.1 ± 0.0  83.9 ± 0.2

Additional Findings

Transferable adapted layer

Encoder  Classifier  Merged        Cross
TA       Zero-shot   69.1          49.8
TA       Ours        79.6 (+10.5)  54.8 (+5.0)
Ada      Zero-shot   80.1          50.1
Ada      Ours        87.7 (+7.6)   54.1 (+4.0)

Replacing zero-shot classifiers with the layer trained by SyMerge improves both merged and cross-task evaluations without any further target-task training, showing that the learned adaptation transfers beyond the original merged encoder.

Merging across disjoint basins

Method                EuroSAT  DTD   Avg.
Individual (EuroSAT)  98.1     3.6   50.8
Individual (DTD)      2.0      79.4  40.7
Weight Averaging      11.6     2.2   6.9
AdaMerging            8.6      2.1   5.4
Surgery               38.1     13.1  25.6
SyMerge               96.2     62.0  79.1

We also study models trained from different initializations. Standard baselines collapse, while SyMerge still recovers strong performance by realigning representations across disjoint basins.

Why single-layer adaptation is enough

Impact of training task-specific layers for synergistic merging

Training one layer with the merging coefficients is already enough to capture most of the benefit; updating multiple layers tends to hurt performance by disturbing task-agnostic knowledge in the shared backbone.