SyMerge (ICML 2026): From Non-Interference to Synergistic Merging via Single-Layer Adaptation

Abstract

Model merging combines independently trained models into a single multi-task model. Most existing approaches, however, focus on avoiding task interference. We argue that the greater potential of merging lies in enabling task synergy, where tasks actively improve one another. We identify cross-task performance, defined by the compatibility between encoders and predictors across tasks, as a key indicator of merge quality, and we demonstrate that adapting only a single task-specific layer is sufficient to induce such synergy. We propose SyMerge, a lightweight framework that jointly optimizes merging coefficients and a single task-specific layer. We adopt an expert-guided self-labeling objective, which provides more stable supervision than entropy minimization. We further show that SyMerge successfully merges models trained from different initializations, a regime where standard methods break down. Our minimalist yet principled method achieves state-of-the-art results across vision, dense prediction, and NLP benchmarks.

Goal

Move model merging beyond non-interference and explicitly optimize for task synergy.

Mechanism

Jointly train merging coefficients and one task-specific layer with expert-guided self-labels.

Outcome

Strong results across multiple domains, better shift robustness, and successful merging across disjoint basins.

Why SyMerge?

Three observations motivate SyMerge: training-free merging is fragile under distribution shift, cross-task compatibility predicts merge quality, and a small amount of adaptation can already unlock synergy.

Training-free merging is brittle under corruption

We first focus on robustness. When several task datasets are intentionally corrupted, training-free merging methods degrade sharply, exposing how brittle purely data-agnostic merging can be under realistic distribution shift.

This motivates test-time adaptation on unlabeled target data. SyMerge keeps that adaptive flavor, but replaces unstable entropy minimization with a more reliable expert-guided self-labeling objective.

Performance of training-free and adaptive merging methods under corruption

Cross-task performance predicts merge quality

We hypothesize that a model's cross-task performance closely relates to its merging performance. In a preliminary study on 20 vision tasks with ViT-B/32, we observe a strong positive correlation (r = 0.863, p < 0.001) between average cross-task performance and average merge performance.
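
For reference, and assuming the reported r is Pearson's correlation coefficient (the usual meaning of r, although the text does not name it), the statistic is computed from paired scores as shown below. The symbols are our own notation: c_i is the average cross-task performance and m_i the average merge performance of the i-th evaluated point, and n is the number of points.

```latex
r = \frac{\sum_{i=1}^{n} (c_i - \bar{c})(m_i - \bar{m})}
         {\sqrt{\sum_{i=1}^{n} (c_i - \bar{c})^2}\;\sqrt{\sum_{i=1}^{n} (m_i - \bar{m})^2}},
\qquad
\bar{c} = \frac{1}{n}\sum_{i=1}^{n} c_i,
\quad
\bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i .
```

A value near 1 means that points with higher cross-task performance also tend to have higher merge performance, in an approximately linear relationship.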

Concretely, given a model pair (A, B), we evaluate cross-task performance by attaching A's encoder to B's classifier and merge quality by evaluating the merged encoder on B's task. The strong correlation suggests that better functional alignment is a reliable signal for better merging.
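
The two quantities described above can be sketched on a toy problem. This is a minimal illustration, not the evaluation code behind the reported numbers: the linear encoders, the random data, the weight-averaging merge rule, and all names (`enc_a`, `enc_b`, `head_b`, `accuracy`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, c, n = 8, 6, 3, 200  # toy sizes: input dim, feature dim, classes, samples

def accuracy(encoder, head, X, y):
    """Top-1 accuracy of a linear encoder followed by a linear classifier head."""
    logits = X @ encoder @ head
    return float((logits.argmax(axis=1) == y).mean())

# Hypothetical stand-ins for two fine-tuned models derived from one pretrained encoder.
enc_pre = rng.normal(size=(d, h)) / np.sqrt(d)
enc_a = enc_pre + 0.3 * rng.normal(size=(d, h)) / np.sqrt(d)
enc_b = enc_pre + 0.3 * rng.normal(size=(d, h)) / np.sqrt(d)
head_b = rng.normal(size=(h, c))

# Task B data, labelled by expert B itself so the expert scores 100% by construction.
X_b = rng.normal(size=(n, d))
y_b = (X_b @ enc_b @ head_b).argmax(axis=1)

enc_merged = 0.5 * (enc_a + enc_b)  # plain weight averaging, used only as an example merge

expert_acc = accuracy(enc_b, head_b, X_b, y_b)      # B's encoder + B's classifier
cross_acc = accuracy(enc_a, head_b, X_b, y_b)       # A's encoder + B's classifier
merge_acc = accuracy(enc_merged, head_b, X_b, y_b)  # merged encoder on B's task

print(f"expert={expert_acc:.3f} cross-task={cross_acc:.3f} merged={merge_acc:.3f}")
```

Repeating this over many model pairs and averaging gives one cross-task score and one merge score per point, which is the kind of paired data a correlation analysis needs.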

Correlation between cross-task and merging performance

A small amount of adaptation is enough

Our pilot study compares two settings. The baseline pairs each task's original classifier with its native encoder. The variant retrains the classifier on a fixed merged encoder and then evaluates it with another task's encoder.

This simple two-stage protocol consistently improves cross-task performance, showing that modest adaptation can substantially improve functional alignment. SyMerge is built to obtain the same effect without labels by adapting just one task-specific layer together with the merging coefficients.

Pilot study illustrating functional alignment protocol

Method

SyMerge tackles the label scarcity challenge in model merging by turning individually fine-tuned models into expert teachers. Instead of relying on entropy minimization, we use expert-guided self-labeling to provide a stable training signal on unlabeled target data.

What is optimized?

Merging coefficients determine how each expert contributes to the shared encoder, while one task-specific layer is adapted to the merged representation.

Why is it stable?

For each task, the merged model is trained to match confident predictions from the corresponding expert model, making the objective applicable beyond classification and more reliable than entropy minimization.
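
To make the contrast with entropy minimization concrete, the sketch below compares the two losses on a single example. It is an illustration under our own assumptions, not the authors' implementation: the function names, the hard-label cross-entropy form, and the confidence threshold of 0.8 are hypothetical choices.

```python
import numpy as np

def entropy_loss(probs):
    """Mean Shannon entropy of the merged model's predictions (no teacher involved)."""
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=1).mean())

def self_label_loss(probs, expert_probs, threshold=0.8):
    """Cross-entropy against the expert's hard labels, on confident expert samples only."""
    labels = expert_probs.argmax(axis=1)
    idx = np.flatnonzero(expert_probs.max(axis=1) >= threshold)
    if idx.size == 0:
        return 0.0  # no confident expert prediction in this batch
    picked = probs[idx, labels[idx]]
    return float(-np.log(picked + 1e-12).mean())

expert = np.array([[0.90, 0.05, 0.05]])        # expert is confident in class 0
merged_wrong = np.array([[0.05, 0.90, 0.05]])  # merged model is confident in class 1
merged_right = np.array([[0.90, 0.05, 0.05]])  # merged model agrees with the expert

for name, p in [("wrong", merged_wrong), ("right", merged_right)]:
    print(name, round(entropy_loss(p), 3), round(self_label_loss(p, expert), 3))
```

Entropy assigns the same loss to both merged predictions because it only measures confidence, so a confidently wrong model is not penalized. The expert-guided loss separates the two cases, which is one way to read the stability claim above.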

This minimalist design keeps the method lightweight while still realigning the shared encoder and the task-specific layer together. We show that this joint optimization is what makes SyMerge scalable as the number of tasks grows.
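
The joint optimization can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration and is not taken from the paper: linear encoders, a task-arithmetic-style merge (pretrained weights plus coefficient-weighted task offsets), heads initialized from the experts' heads, a confidence threshold of 0.5, and plain gradient descent with hand-written gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, c, n, K = 8, 6, 3, 200, 2  # toy sizes: input, feature, classes, samples, tasks

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical experts: a shared pretrained linear encoder plus per-task offsets.
enc_pre = rng.normal(size=(d, h)) / np.sqrt(d)
task_vecs = [0.5 * rng.normal(size=(d, h)) / np.sqrt(d) for _ in range(K)]
expert_heads = [rng.normal(size=(h, c)) for _ in range(K)]
data = [rng.normal(size=(n, d)) for _ in range(K)]  # unlabeled inputs per task

# Expert-guided self-labels: hard labels from each expert, kept only when confident.
labels, masks = [], []
for k in range(K):
    p = softmax(data[k] @ (enc_pre + task_vecs[k]) @ expert_heads[k])
    labels.append(p.argmax(axis=1))
    masks.append(p.max(axis=1) >= 0.5)  # threshold value is an assumption

lam = np.full(K, 0.3)                     # merging coefficients (trainable)
heads = [H.copy() for H in expert_heads]  # one task-specific layer per task (trainable)

def loss_and_grads(lam, heads):
    """Masked cross-entropy to the expert labels, with gradients for lam and heads."""
    enc = enc_pre + sum(l * t for l, t in zip(lam, task_vecs))
    total, g_lam, g_heads = 0.0, np.zeros(K), []
    for k in range(K):
        X, y, m = data[k], labels[k], masks[k]
        feats = X @ enc
        p = softmax(feats @ heads[k])
        n_conf = max(int(m.sum()), 1)
        total += -(np.log(p[np.arange(n), y] + 1e-12) * m).sum() / n_conf
        G = p.copy()
        G[np.arange(n), y] -= 1.0   # gradient of cross-entropy w.r.t. logits
        G *= m[:, None] / n_conf    # zero out low-confidence samples
        g_heads.append(feats.T @ G)
        g_enc = X.T @ G @ heads[k].T
        g_lam += np.array([(g_enc * t).sum() for t in task_vecs])
    return total, g_lam, g_heads

initial_loss, _, _ = loss_and_grads(lam, heads)
for _ in range(300):
    _, g_lam, g_heads = loss_and_grads(lam, heads)
    lam -= 0.05 * g_lam
    heads = [H - 0.05 * g for H, g in zip(heads, g_heads)]
final_loss, _, _ = loss_and_grads(lam, heads)
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, coefficients {np.round(lam, 2)}")
```

The point of the sketch is the shape of the update: the encoder weights themselves are never trained, only the K coefficients and one small layer per task, and both receive gradients from the same expert-labelled loss.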

Experimental Results

SyMerge consistently improves over prior training-free and test-time adaptive baselines across classification, dense prediction, and NLP, while staying close to the individual expert upper bound.

Multiple Vision Tasks

Performance stays strong from 8 to 20 tasks on both ViT-B/32 and ViT-L/14, whereas most competing methods degrade much more sharply as the task count increases.

Method	ViT-B/32, 8 tasks	ViT-B/32, 14 tasks	ViT-B/32, 20 tasks	ViT-L/14, 8 tasks	ViT-L/14, 14 tasks	ViT-L/14, 20 tasks
Individual	90.5	89.5	90.4	94.2	93.2	94.0
Task Arithmetic	69.1	65.4	60.8	84.5	78.4	73.9
Ties-Merging	72.9	65.2	63.1	86.0	80.5	73.0
PCB Merging	75.6	63.8	52.7	87.5	81.3	73.4
LiNeS w/ TA	74.1	68.0	63.7	86.9	80.4	75.7
Consensus TA	74.9	70.3	65.0	86.6	82.3	78.9
TSV-M	84.0	80.1	76.6	91.5	88.2	87.2
ISO-Merging	84.2	80.6	76.9	93.0	89.7	89.2
EMR-Merging	88.7	86.1	86.6	93.7	91.4	92.0
LOTMerging	82.7	77.3	73.1	90.5	86.2	84.1
AdaMerging	80.1	76.7	69.6	90.8	85.2	82.1
Surgery w/ Ada.	87.5	84.6	84.5	92.3	90.4	89.5
WEMoE	89.4	83.0	78.9	93.6	88.4	78.8
ProbSurgery w/ Ada.	87.4	84.9	84.5	92.7	90.7	90.2
SyMerge	90.1 ± 0.1	88.7 ± 0.1	88.6 ± 0.4	94.1 ± 0.0	92.8 ± 0.1	93.2 ± 0.1

Mean ± standard deviation over 5 runs.
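
The scaling behavior can be read directly off the table. The short script below does the arithmetic for a handful of ViT-B/32 rows; the selection of rows is ours, and the SyMerge entries are the reported means.

```python
# ViT-B/32 average accuracy (%) at 8 and 20 tasks, copied from the table above.
scores = {
    "Individual": (90.5, 90.4),
    "Task Arithmetic": (69.1, 60.8),
    "PCB Merging": (75.6, 52.7),
    "EMR-Merging": (88.7, 86.6),
    "WEMoE": (89.4, 78.9),
    "SyMerge": (90.1, 88.6),  # means over 5 runs
}

individual_20 = scores["Individual"][1]

# Accuracy lost when going from 8 to 20 tasks, and distance to the expert bound at 20.
drops = {m: round(a8 - a20, 1) for m, (a8, a20) in scores.items()}
gaps = {m: round(individual_20 - a20, 1) for m, (_, a20) in scores.items()}

for method in scores:
    print(f"{method:16s} drop 8->20: {drops[method]:5.1f}   "
          f"gap to Individual at 20: {gaps[method]:5.1f}")
```

On these rows SyMerge loses 1.5 points from 8 to 20 tasks, compared with 2.1 for EMR-Merging, 8.3 for Task Arithmetic, 10.5 for WEMoE, and 22.9 for PCB Merging, and it ends 1.8 points below the individual experts at 20 tasks.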

Dense Prediction (NYUv2)

Dense prediction is a harsher regime with heterogeneous objectives. SyMerge remains competitive with individual models while other methods suffer large drops on depth and normal estimation.

Method	Seg mIoU ↑	Seg Pix Acc ↑	Depth Abs Err ↓	Depth Rel Err ↓	Normal Mean ↓
Individual	52.0	74.2	41.5	17.3	24.2
Weight Averaging	36.6	64.0	55.0	23.2	30.0
Task Arithmetic	31.6	60.3	56.7	24.0	30.6
Ties-Merging	39.9	62.7	61.3	27.3	36.2
MagMax	24.7	54.7	60.3	23.9	30.3
LiNeS w/ TA	36.2	64.0	54.2	22.4	29.1
EMR-Merging	41.5	67.2	48.6	19.4	26.5
Surgery w/ TA	43.3	67.4	55.3	24.7	34.7
ProbSurgery w/ TA	43.6	67.6	52.6	22.3	36.7
SyMerge	49.8 ± 0.3	73.1 ± 0.2	45.3 ± 0.6	18.8 ± 0.5	26.2 ± 0.1

NLP (GLUE, 8 tasks)

On GLUE, SyMerge achieves the best average while avoiding the severe collapse that other approaches show on harder outlier tasks such as CoLA.

Method	CoLA	SST2	MRPC	STSB	QQP	MNLI	QNLI	RTE	Avg.
Individual	60.2	94.0	89.2	90.6	91.4	87.2	92.7	79.1	85.6
Weight Averaging	14.0	64.1	69.4	31.8	75.4	42.2	58.7	55.2	51.3
Task Arithmetic	18.8	85.9	79.9	74.0	83.8	59.1	69.7	62.1	66.7
MagMax	17.3	76.0	70.8	71.3	85.8	70.4	59.5	45.1	62.0
Ties-Merging	20.5	84.4	81.1	58.2	85.7	64.7	74.8	43.0	64.0
LiNeS	26.1	86.4	78.9	72.9	83.3	56.0	75.9	59.9	67.4
TSV-M	28.7	88.4	83.1	76.8	85.8	59.9	75.2	47.7	68.2
ISO-Merging	27.3	86.2	79.2	73.7	85.9	71.7	81.1	70.4	71.9
EMR-Merging	40.0	93.4	86.3	82.8	89.7	85.5	89.6	74.4	80.2
Surgery w/ TA	46.8	93.6	89.7	88.7	85.1	78.0	85.3	76.5	80.5
ProbSurgery w/ TA	54.7	93.6	89.7	89.4	85.2	77.6	85.3	76.9	81.6
SyMerge	60.0 ± 0.1	93.8 ± 0.3	89.2 ± 0.0	90.4 ± 0.2	85.2 ± 1.1	84.0 ± 0.4	89.2 ± 0.5	79.1 ± 0.0	83.9 ± 0.2

Additional Findings

Transferable adapted layer

Encoder	Classifier	Merged	Cross
TA	Zero-shot	69.1	49.8
TA	Ours	79.6 (+10.5)	54.8 (+5.0)
Ada	Zero-shot	80.1	50.1
Ada	Ours	87.7 (+7.6)	54.1 (+4.0)

Replacing zero-shot classifiers with the layer trained by SyMerge improves both merged and cross-task evaluations without any further target-task training, showing that the learned adaptation transfers beyond the original merged encoder.

Merging across disjoint basins

Method	EuroSAT	DTD	Avg.
Individual (EuroSAT)	98.1	3.6	50.8
Individual (DTD)	2.0	79.4	40.7
Weight Averaging	11.6	2.2	6.9
AdaMerging	8.6	2.1	5.4
Surgery	38.1	13.1	25.6
SyMerge	96.2	62.0	79.1

We also study models from different initializations. Standard baselines collapse, while SyMerge still recovers strong performance by realigning representations across disjoint basins.

Why single-layer adaptation is enough

Training one layer together with the merging coefficients is already enough to capture most of the benefit; updating multiple layers tends to hurt performance by disturbing task-agnostic knowledge in the shared backbone.

Impact of training task-specific layers for synergistic merging
