SyMerge: From Non-Interference to
Synergistic Merging via Single-Layer Adaptation

Sungkyunkwan University, NAVER AI Lab
Figure: SyMerge overview.

Abstract

Model merging offers an efficient alternative to multi-task learning by combining independently fine-tuned models, but most prior approaches focus mainly on avoiding task interference. We argue instead that the real potential of merging lies in achieving synergy, where tasks enhance one another. Our intuition comes from a pilot study showing that when a classifier trained on one task is paired with the encoder of another, the resulting cross-task performance strongly predicts merge quality. Moreover, adapting even a single task-specific layer can substantially improve this compatibility, suggesting a simple yet powerful lever for synergy. Building on this insight, we introduce SyMerge, a lightweight framework that jointly optimizes one task-specific layer and merging coefficients. To ensure stability without labels, SyMerge employs a robust self-labeling strategy guided by expert model predictions, avoiding the pitfalls of entropy-based adaptation. This minimalist yet principled design achieves state-of-the-art results across vision, dense prediction, and NLP benchmarks, while also producing adapted layers that transfer effectively to other merging methods.

Motivation

Rethinking Cross-task Performance

We hypothesize that a model's cross-task performance closely relates to its merging performance. In a preliminary study on 20 vision tasks with ViT-B/32, we observe a strong positive correlation (r = 0.863, p < 0.001) between average cross-task performance and average merge performance.

Concretely, given a model pair (A, B), we evaluate: (1) cross-task performance by attaching A's encoder to B's classifier and evaluating on B's task, and (2) merging performance by weight-averaging the two encoders and evaluating the merged encoder on B's task.
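
For concreteness, a minimal sketch of the two evaluations is given below. It assumes CLIP-style image encoders with per-task linear classifiers; all names (accuracy, weight_average, encoder_a, classifier_b, loader_b) are illustrative rather than taken from any released code.

```python
# Sketch of the two pilot-study evaluations (assumed CLIP-style components).
import copy
import torch

@torch.no_grad()
def accuracy(encoder, classifier, loader, device="cpu"):
    """Top-1 accuracy of an encoder/classifier pair on a labeled data loader."""
    encoder.eval(); classifier.eval()
    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        logits = classifier(encoder(images))
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.numel()
    return correct / total

def weight_average(encoder_a, encoder_b):
    """Uniform weight averaging of two encoders with identical architecture."""
    merged = copy.deepcopy(encoder_a)
    sd_a, sd_b = encoder_a.state_dict(), encoder_b.state_dict()
    merged.load_state_dict({k: 0.5 * (sd_a[k] + sd_b[k]) for k in sd_a})
    return merged

# (1) Cross-task performance: task A's encoder with task B's classifier on B's data.
# cross_ab = accuracy(encoder_a, classifier_b, loader_b)
# (2) Merging performance: the weight-averaged encoder with B's classifier on B's data.
# merge_ab = accuracy(weight_average(encoder_a, encoder_b), classifier_b, loader_b)
```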

Correlation between cross-task and merging performance

Classifier Training Unlocks Task Synergy

We compare two settings: a baseline that uses each task's original classifier (the one trained with its native encoder), and a variant that replaces it with a classifier adapted on a fixed merged encoder. In both cases the classifier is then evaluated alongside another task's encoder. This two-step protocol (adapt, then evaluate cross-task) provides a direct measure of functional alignment across tasks.

This protocol consistently improves cross-task performance over the baseline, showing that modest adaptation can substantially boost cross-task compatibility, a reliable signal for merge success. Because the protocol still depends on labels, we design SyMerge to obtain the same effect without supervision.
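
The adaptation step itself is small. A hedged sketch is shown below, assuming a frozen merged encoder and a labeled loader for task B; names such as adapt_classifier and labeled_loader_b are ours, not the paper's code.

```python
# Pilot protocol, step 1: fine-tune only task B's classifier on a frozen merged
# encoder (this step uses labels; SyMerge later removes that requirement).
import torch
import torch.nn.functional as F

def adapt_classifier(merged_encoder, classifier_b, labeled_loader_b,
                     epochs=1, lr=1e-3, device="cpu"):
    merged_encoder.eval()                          # merged encoder stays fixed
    for p in merged_encoder.parameters():
        p.requires_grad_(False)
    optimizer = torch.optim.Adam(classifier_b.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in labeled_loader_b:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = merged_encoder(images)     # frozen features
            loss = F.cross_entropy(classifier_b(feats), labels)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
    return classifier_b

# Step 2 then evaluates the adapted classifier alongside another task's encoder,
# e.g. accuracy(encoder_a, adapted_classifier_b, loader_b) with the helper above.
```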

Pilot study illustrating functional alignment protocol

Method

SyMerge tackles the label scarcity challenge in model merging by turning individually fine-tuned models into expert teachers. Instead of unstable entropy minimization, we adopt a self-labeling strategy that distills each teacher’s most confident predictions into the merged model.

For each task k, we define a task-specific loss (cross-entropy for classification, L1/L2 for regression, etc.) between the merged model’s outputs and the teacher-provided pseudo-labels on the unlabeled target data. This keeps the merged encoder aligned with expert behavior without relying on ground-truth annotations.
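
For the classification case, one illustrative way to write this objective (notation ours) is the following, where f_{θ_m,k} is the merged model read out through task k's head, f_{θ_k} is the frozen expert for task k, and D_k is task k's unlabeled target data:

```latex
\mathcal{L}_k
  = \mathbb{E}_{x \sim \mathcal{D}_k}
    \Big[ \mathrm{CE}\big( f_{\theta_m,k}(x),\; \hat{y}_k(x) \big) \Big],
\qquad
\hat{y}_k(x) = \arg\max_{c} \big[ f_{\theta_k}(x) \big]_c .
```

For regression tasks, the cross-entropy term would simply be replaced by an L1/L2 distance to the teacher's prediction, as noted above.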

The only trainable parameters are:

  • Merging coefficients, which balance contributions from each model.
  • A single task-specific layer, adapted to the merged encoder.
Jointly updating these two components encourages alignment across tasks while keeping the approach lightweight; a sketch of the joint update follows.
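
The sketch below assumes a task-arithmetic-style parameterization of the merged encoder, theta_merged = theta_pre + sum_k lambda_k * (theta_k - theta_pre), with one coefficient per task, classification heads as the single task-specific layers, and pseudo-labels from the frozen experts. All function and variable names are illustrative, not the released implementation; layer-wise coefficients would work the same way.

```python
# Joint update of merging coefficients and per-task heads (sketch, PyTorch >= 2.0).
import torch
import torch.nn.functional as F

def merge_state_dict(pretrained_sd, task_sds, coeffs):
    """theta_merged = theta_pre + sum_k coeffs[k] * (theta_k - theta_pre)."""
    return {
        name: w0 + sum(c * (sd[name] - w0) for c, sd in zip(coeffs, task_sds))
        for name, w0 in pretrained_sd.items()
    }

def symerge_step(encoder, pretrained_sd, task_sds, coeffs, heads, experts,
                 batches, optimizer):
    """One step: distill each frozen expert's predictions into the merged model."""
    optimizer.zero_grad()
    merged_sd = merge_state_dict(pretrained_sd, task_sds, coeffs)  # differentiable in coeffs
    total_loss = 0.0
    for head_k, expert_k, x_k in zip(heads, experts, batches):
        with torch.no_grad():
            pseudo = expert_k(x_k).argmax(dim=-1)          # teacher pseudo-labels
        # Run the shared encoder architecture with the merged weights.
        feats = torch.func.functional_call(encoder, merged_sd, (x_k,))
        total_loss = total_loss + F.cross_entropy(head_k(feats), pseudo)
    total_loss.backward()
    optimizer.step()                                       # updates coeffs and heads only
    return float(total_loss)

# Only the merging coefficients and the per-task heads are trainable, e.g.:
# coeffs = torch.nn.Parameter(torch.full((num_tasks,), 0.3))
# params = [coeffs] + [p for h in heads for p in h.parameters()]
# optimizer = torch.optim.Adam(params, lr=1e-3)
```

Here the experts are the frozen fine-tuned models (encoder plus head) used only to generate pseudo-labels; the pretrained and task-specific encoder weights are never updated.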

Experimental Results

Multiple Vision Tasks

Method                 ViT-B/32                        ViT-L/14
                       8 tasks   14 tasks  20 tasks    8 tasks   14 tasks  20 tasks
Pretrained             48.0      57.1      56.0        64.5      68.1      65.1
Individual             90.5      89.5      90.4        94.2      93.2      94.0
Task Arithmetic        69.1      65.4      60.8        84.5      78.4      73.9
Ties Merging           72.9      65.2      63.1        86.0      80.5      73.0
PCB Merging            75.6      63.8      52.7        87.5      81.3      73.4
LiNeS w/ TA            74.1      68.0      63.7        86.9      80.4      75.7
Consensus TA           74.9      70.3      65.0        86.6      82.3      78.9
TSV-M                  84.0      80.1      76.6        91.5      88.2      87.2
ISO-CTS                84.2      80.6      76.9        93.0      89.7      89.2
EMR-Merging            88.7      86.1      86.6        93.7      91.4      92.0
AdaMerging             80.1      76.7      69.6        90.8      85.2      82.1
Surgery w/ Ada.        87.5      84.6      84.5        92.3      90.4      89.5
WEMoE                  89.4      83.0      78.9        93.6      88.4      78.8
ProbSurgery w/ Ada.    87.4      84.9      84.5        92.7      90.7      90.2
SyMerge                90.1±0.1  88.7±0.1  88.6±0.4    94.1±0.0  92.8±0.1  93.2±0.1

Mean ± standard deviation over 5 runs.

Dense Prediction (NYUv2)

Method               Seg mIoU ↑  Seg Pix Acc ↑  Depth Abs Err ↓  Depth Rel Err ↓  Normal Mean ↓
Individual           52.0        74.2           41.5             17.3             24.2
Weight Averaging     36.6        64.0           55.0             23.2             30.0
Task Arithmetic      31.6        60.3           56.7             24.0             30.6
Ties-Merging         39.9        62.7           61.3             27.3             36.2
MagMax               24.7        54.7           60.3             23.9             30.3
LiNeS w/ TA          36.2        64.0           54.2             22.4             29.1
EMR-Merging          41.5        67.2           48.6             19.4             26.5
Surgery w/ TA        43.3        67.4           55.3             24.7             34.7
ProbSurgery w/ TA    43.6        67.6           52.6             22.3             36.7
SyMerge              49.8±0.3    73.1±0.2       45.3±0.6         18.8±0.5         26.2±0.1

NLP (GLUE, 8 tasks)

Method              CoLA      SST2      MRPC      STSB      QQP       MNLI      QNLI      RTE       Avg.
Individual          60.2      94.0      89.2      90.6      91.4      87.2      92.7      79.1      85.6
Weight Averaging    14.0      64.1      69.4      31.8      75.4      42.2      58.7      55.2      51.3
Task Arithmetic     18.8      85.9      79.9      74.0      83.8      59.1      69.7      62.1      66.7
MagMax              17.3      76.0      70.8      71.3      85.8      70.4      59.5      45.1      62.0
Ties-Merging        20.5      84.4      81.1      58.2      85.7      64.7      74.8      43.0      64.0
LiNeS               26.1      86.4      78.9      72.9      83.3      56.0      75.9      59.9      67.4
EMR-Merging         40.0      93.4      86.3      82.8      89.7      85.5      89.6      74.4      80.2
Surgery w/ TA       46.8      93.6      89.7      88.7      85.1      78.0      85.3      76.5      80.5
ProbSurgery w/ TA   54.7      93.6      89.7      89.4      85.2      77.6      85.3      76.9      81.6
SyMerge             60.0±0.1  93.8±0.3  89.2±0.0  90.4±0.2  85.2±1.1  84.0±0.4  89.2±0.2  79.1±0.0  83.9±0.2

Key Observations

Cross-task transferability check

Encoder  Classifier   Merged         Cross-task
TA       Zero-shot    69.1           49.8
TA       Ours         79.6 (+10.5)   54.8 (+5.0)
Ada      Zero-shot    80.1           50.1
Ada      Ours         87.7 (+7.6)    54.1 (+4.0)

Replacing zero-shot classifiers with our adapted classifier improves both merged and cross-task evaluations without additional target-task training.

Effect of adjusting different layers

Impact of training different task-specific layers: adapting a single layer (the classifier) together with the merging coefficients works best, whereas training multiple layers underperforms.