1. Summary and Contribution Classification
The paper addresses a well-known pain point in Generalized Category Discovery (GCD): jointly optimizing a supervised cross-entropy term on labeled known-class data and an unsupervised clustering or contrastive term on unlabeled data containing both known and novel categories [Vaze et al. 2022]. The authors conduct a quantitative analysis that, according to the abstract, reveals a phenomenon they name *gradient entanglement*. Two failure signatures are attributed to this phenomenon. First, the supervised gradient is distorted in a manner that weakens discrimination among known classes. Second, the representation subspaces of known and novel categories bleed into one another, eroding the separability of novel clusters. The proposed remedy is the Energy-Aware Gradient Coordinator (EAGC), described as a plug-and-play module that intervenes at the gradient level and exploits an energy-based signal to modulate how the supervised and unsupervised components combine.
Restating the technical posture: let $g_s$ and $g_u$ denote the two per-step gradient contributions on a shared backbone $\theta$. The standard GCD objective amounts to stepping along $g = g_s + \lambda g_u$ for some fixed $\lambda$. The paper's diagnostic frame implies that when $\|g_u\| \gg \|g_s\|$, or when $\langle g_s, g_u \rangle < 0$ along known-class directions, supervised learning is effectively overwhelmed or redirected. EAGC appears to introduce an energy functional $E_\theta$ whose gradient statistics are used to reweight, project, or otherwise coordinate $g_s$ and $g_u$ before the parameter update. Because the abstract does not complete this description, the exact operator is inferred.
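The conflict geometry described here can be probed with a small diagnostic. Writing $g_s$ and $g_u$ for the supervised and unsupervised gradient contributions, a minimal numpy sketch (all names illustrative, not from the paper):

```python
import numpy as np

def conflict_stats(g_s: np.ndarray, g_u: np.ndarray, lam: float = 1.0):
    """Diagnostics for the composite update g = g_s + lam * g_u."""
    cos = float(g_s @ g_u / (np.linalg.norm(g_s) * np.linalg.norm(g_u)))
    norm_ratio = float(lam * np.linalg.norm(g_u) / np.linalg.norm(g_s))
    # Component of the composite update along the supervised direction;
    # a negative value means the supervised signal is fully overwhelmed.
    g = g_s + lam * g_u
    along_sup = float(g @ g_s / np.linalg.norm(g_s))
    return cos, norm_ratio, along_sup

# Toy example: an unsupervised gradient that opposes the supervised one.
g_s = np.array([1.0, 0.0])
g_u = np.array([-2.0, 1.0])
cos, ratio, along = conflict_stats(g_s, g_u, lam=1.0)
# cos < 0 (conflict), ratio > 1 (norm dominance), along < 0 (sign flip)
```

Both failure signatures the paper names, norm dominance and directional conflict, are visible in this two-dimensional toy already.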
Contribution classification: the work is predominantly (b) a new algorithm, packaged with (c) empirical claims about GCD benchmarks, and motivated by what the authors present as (c) a diagnostic empirical finding, namely gradient entanglement as a measurable pathology. It is not a new theoretical result. Reduction to familiar problems reveals something fundamental here: the method sits squarely within the multi-task gradient-manipulation lineage, descended specifically from PCGrad [Yu et al. 2020], GradNorm [Chen et al. 2018], MGDA/Pareto formulations [Désidéri, 2012; Sener & Koltun, 2018], and IMTL [Liu et al. 2021], reframed for the GCD setting and flavored with energy-based modeling in the tradition of [LeCun et al. 2006] and [Grathwohl et al. 2020].
2. Significance and Novelty Assessment
Rating: moderate, leaning incremental. My reasoning is structural rather than dismissive. GCD is an active and legitimately difficult problem because the unsupervised half must simultaneously cluster known-class samples correctly and discover novel clusters without supervision, while the supervised half shapes the metric on the shared embedding space. The tension between these objectives is not a new observation. [Vaze et al. 2022] already remarked on the fragility of joint training, and follow-ups such as SimGCD [Wen et al. 2023] and PromptCAL [Zhang et al. 2023] adjusted the unsupervised objective (parametric heads, prompt-based contrastive objectives) largely because naïve joint optimization underperforms. What this paper adds, as best as can be inferred from the abstract, is twofold: (i) a named diagnostic, *gradient entanglement*, accompanied by quantitative measurements, and (ii) a gradient coordinator driven by an energy signal.
Set against the multi-task optimization literature, the novelty budget shrinks further. PCGrad [Yu et al. 2020] projects conflicting gradients onto each other's normal plane whenever $\langle g_i, g_j \rangle < 0$. GradNorm [Chen et al. 2018] rescales task losses to equalize gradient norms across tasks. CAGrad [Liu et al. 2021b] finds an update direction within a trust region of the average gradient that maximizes minimum task progress. MGDA-UB [Sener & Koltun, 2018] returns a Pareto-stationary direction. EAGC, to the extent the abstract describes it, is plausibly a new point in this design space, specialized to GCD by using energies rather than raw task losses as the coordination signal. Specialization is not a minor contribution if the energy signal captures something that the norm or inner-product signal misses. But the paper must carry that argument, and the abstract does not yet do so.
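For reference, PCGrad's pairwise rule is only a few lines. This is a sketch of the published projection [Yu et al. 2020], not of EAGC:

```python
import numpy as np

def pcgrad_pair(g_i: np.ndarray, g_j: np.ndarray) -> np.ndarray:
    """PCGrad's pairwise rule: if g_i conflicts with g_j (negative inner
    product), remove g_i's component along g_j."""
    dot = float(g_i @ g_j)
    if dot < 0:
        g_i = g_i - (dot / float(g_j @ g_j)) * g_j
    return g_i

g_s = np.array([1.0, 0.0])
g_u = np.array([-1.0, 1.0])          # conflicts with g_s
g_u_proj = pcgrad_pair(g_u, g_s)     # -> orthogonal to g_s
```

Any EAGC gain must be measured against this near-trivial baseline run in the same pipeline.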
What is genuinely new, if the experiments hold, is the *empirical claim* that gradient entanglement causes representation-subspace overlap between known and novel classes. That is a falsifiable, GCD-specific diagnostic and, if the measurement protocol is sound, a useful one. The remainder, a gradient-level plug-in that reweights or reprojects task gradients, is a well-trodden template.
3. Technical Analysis
Let me lay out the setup formally so we can locate the assumptions. Let $\mathcal{D}_l = \{(x_i, y_i)\}$ be the labeled set with $y_i \in \mathcal{Y}_{\text{known}}$, and $\mathcal{D}_u$ the unlabeled set with latent labels in $\mathcal{Y}_{\text{known}} \cup \mathcal{Y}_{\text{novel}}$. A shared encoder $f_\theta$ produces features, and a classifier or prototype set operates on top. The training objective is
$$\mathcal{L}(\theta) = \mathcal{L}_s(\mathcal{D}_l; \theta) + \lambda\, \mathcal{L}_u(\mathcal{D}_l \cup \mathcal{D}_u; \theta),$$
with $\mathcal{L}_u$ typically a self-distillation or contrastive loss [Caron et al. 2021; Chen et al. 2020]. The key lemma implicit in the paper is that, at some iterates $\theta_t$, the projection of $g_u = \nabla_\theta \mathcal{L}_u$ onto $g_s = \nabla_\theta \mathcal{L}_s$ is negative, with magnitude large enough to flip the sign or substantially alter the direction of the composite update along known-class discriminative directions. This is the same conflict geometry that PCGrad targets, and it is intuitive: an unsupervised clustering loss is blind to which clusters correspond to labeled classes, so it happily pulls known-class samples toward novel-class centroids whenever doing so lowers its own objective.
The *energy-aware* component is where I want to see rigor the abstract does not provide. One standard construction uses $E_\theta(x) = -\log \sum_k \exp(z_k(x))$, the negative LogSumExp of the classifier logits $z_k$, which recovers the EBM-classifier equivalence of [Grathwohl et al. 2020]. Low energy on a sample means the classifier is confident; high energy flags out-of-distribution or novel-class behavior. If EAGC uses energy to distinguish gradients from known-class-confident versus novel-class-ambiguous points and reweights accordingly, that is a coherent design. The concern is circularity: energy is itself a function of $\theta$, so the coordinator conditions on the very representation it is trying to reshape. Without a fixed-point analysis, we cannot rule out that EAGC converges to a trivial rescaling of the loss weight $\lambda$ that any well-tuned schedule could produce.
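For concreteness, a minimal sketch of the LogSumExp energy, assuming the paper follows the [Grathwohl et al. 2020] construction (the abstract does not actually specify the functional):

```python
import math

def energy(logits):
    """Negative LogSumExp energy: E(x) = -log sum_k exp(z_k(x)).
    Lower energy corresponds to a more confident classifier."""
    m = max(logits)  # max-shift for numerical stability
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))

confident = [9.0, 0.1, 0.2]   # one dominant logit -> low energy
ambiguous = [0.3, 0.2, 0.1]   # flat logits -> higher energy
# energy(confident) < energy(ambiguous)
```

Note that energy differs from max-softmax confidence in that it also tracks the overall logit scale, which is exactly what the ablation in Section 4 should isolate.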
Here is the sharper version of that critique. Suppose EAGC collapses empirically to a data-dependent per-sample scalar $w(x_i)$ such that the effective update is
$$g = g_s + \sum_i w(x_i)\, g_u^{(i)},$$
where $g_u^{(i)}$ is sample $i$'s contribution to the unsupervised gradient.
Then EAGC is observationally equivalent to a learned curriculum over samples, which could equally be achieved by confidence-weighted pseudo-labeling or by a FixMatch-style mask [Sohn et al. 2020]. The burden falls on the authors to show that the coordinator does something PCGrad-style projection or confidence masking *cannot* replicate at the same compute.
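To make the observational-equivalence worry concrete: a hypothetical coordinator that gates each sample's unsupervised gradient by a sigmoid of its energy reproduces exactly what per-sample loss reweighting would produce. Every name below is illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def coordinator_update(g_s, per_sample_g_u, energies, tau=1.0):
    """Hypothetical energy-gated coordinator: low energy (confident
    known-class sample) -> weight near 1; high energy -> weight near 0."""
    w = 1.0 / (1.0 + np.exp(np.asarray(energies) / tau))
    return g_s + sum(w_i * g_i for w_i, g_i in zip(w, per_sample_g_u)), w

g_s = rng.normal(size=4)
per_sample_g_u = [rng.normal(size=4) for _ in range(3)]
energies = [-5.0, -1.0, 2.0]
update, w = coordinator_update(g_s, per_sample_g_u, energies)

# Scaling sample i's unsupervised loss by w[i] yields gradient w[i]*g_u[i],
# so this coordinator is observationally a learned curriculum over samples.
equivalent = g_s + sum(w_i * g_i for w_i, g_i in zip(w, per_sample_g_u))
```

A counterexample would require EAGC to mix gradient *directions* across samples, e.g. via projection, which a scalar reweighting cannot express.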
The implicit assumptions I would surface:
1. Gradient entanglement is a cause, not a symptom. The paper identifies a correlation between gradient conflict and downstream degradation. Causal claims require intervention. Did the authors run a controlled experiment in which gradient entanglement is artificially induced or removed on a toy task, holding representation capacity fixed?
2. Energy is a reliable known-vs-novel signal throughout training. Early in training, representations are near-random and energies are uninformative. The coordinator presumably does nothing useful during the earliest training steps. Is there a schedule? If so, the method carries a hidden warmup hyperparameter.
3. The decomposition $g = g_s + \lambda g_u$ captures the relevant interaction. Modern GCD pipelines with self-distillation [Wen et al. 2023] involve EMA teachers and stop-gradients, which complicate the gradient decomposition. The abstract does not indicate whether EAGC accounts for these.
On theoretical rigor: the abstract does not promise a convergence proof, regret bound, or generalization guarantee. That is honest, but it also means the paper must carry its weight empirically.
4. Experimental Assessment
Since the abstract does not disclose numerical results, my assessment targets the experimental design that would be *necessary* to support the claims, and flags which elements are likely present versus likely absent.
Baselines. The minimum viable baseline set for GCD circa 2026 is: the original GCD [Vaze et al. 2022]; SimGCD [Wen et al. 2023]; PromptCAL [Zhang et al. 2023]; a DINO or iBOT [Caron et al. 2021; Zhou et al. 2022] backbone paired with k-means as a floor; and at least one recent entry. For the gradient-coordinator claim specifically, the baselines must include PCGrad [Yu et al. 2020] and GradNorm [Chen et al. 2018] applied to the same GCD pipeline. Absent those, any improvement could be explained by generic multi-task gradient manipulation rather than the energy-aware aspect.
Ablations. The critical ablations needed to isolate the contribution are:
- Energy signal vs. confidence signal. Replace the energy with the max-softmax probability and rerun. If performance is unchanged, the coordinator is a confidence-weighted scheme, not an energy-aware one.
- Coordinator vs. loss reweighting. Replace the gradient-level intervention with an equivalent per-sample loss weight and rerun. If performance is unchanged, the gradient-level framing is cosmetic.
- Coordinator on random signal. Feed random noise in place of the energy to quantify how much of the gain is stabilization from any non-trivial perturbation.
- Schedule ablation. Vary when EAGC engages (from step 0 vs. after warmup). This isolates the implicit curriculum.
The missing ablation I would flag regardless of what the paper reports: a direct measurement of gradient conflict as a function of training step, with and without EAGC. If the authors claim gradient entanglement is the mechanism, this curve should change measurably under their intervention. Without it, the causal story is unsubstantiated.
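That measurement is cheap to log. A sketch of the conflict curve on simulated gradients (in a real pipeline, $g_s$ and $g_u$ would come from two separate backward passes over the same batch; the drift term here merely mimics growing entanglement):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
conflict_curve = []
for step in range(100):
    g_s = rng.normal(size=16)
    # Simulated unsupervised gradient that grows increasingly
    # anti-aligned with g_s as training proceeds.
    g_u = rng.normal(size=16) - 0.05 * step * g_s / np.linalg.norm(g_s)
    conflict_curve.append(cosine(g_s, g_u))

# The mechanism claim predicts this curve shifts toward zero (or positive)
# under EAGC; plotting it with and without the intervention is the test.
```

If the paper's curve does not move under EAGC while accuracy does, the gradient-entanglement story is decoration, not mechanism.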
Datasets. GCD benchmarks include CUB-200 [Wah et al. 2011], Stanford Cars [Krause et al. 2013], FGVC-Aircraft [Maji et al. 2013], Herbarium-19 [Tan et al. 2019], ImageNet-100, and CIFAR-10/100. Results should span both generic (CIFAR, ImageNet) and fine-grained (CUB, SCars, Aircraft) datasets, because fine-grained separability is where GCD methods typically break.
Statistical significance. Three or more seeds with reported standard deviation is the minimum. GCD numbers are notoriously noisy due to the Hungarian matching step on novel-cluster labels, which can swing accuracy by several points for the same embedding. If the paper reports single-seed numbers, treat gains under two points as noise.
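The matching step responsible for that noise is easy to reproduce. A brute-force stand-in for the Hungarian assignment (adequate for small class counts; illustrative, not the benchmark implementation):

```python
from itertools import permutations

def cluster_accuracy(y_true, y_pred, num_classes):
    """Best accuracy over all one-to-one cluster -> class relabelings.
    Brute force over permutations; fine for small num_classes."""
    best = 0.0
    for perm in permutations(range(num_classes)):
        hits = sum(int(perm[p] == t) for t, p in zip(y_true, y_pred))
        best = max(best, hits / len(y_true))
    return best

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]   # same partition, permuted labels -> 1.0
```

Because the relabeling is chosen per embedding, a small change in cluster boundaries can flip the optimal assignment and move reported accuracy by several points, which is exactly why single-seed GCD numbers are untrustworthy.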
Reproducibility flag. Without specifying the energy functional, the reweighting or projection rule, the learning-rate coupling, and the EMA decay for any teacher network, reproducing this result would require substantial guesswork. The abstract does not mention a code release.
5. Limitations and Failure Modes
Beyond whatever limitations the authors list, the following failure modes deserve scrutiny.
Extreme novel-class fraction. When $|\mathcal{Y}_{\text{novel}}| \gg |\mathcal{Y}_{\text{known}}|$, the energy signal estimated from known-class prototypes becomes unreliable, because most high-energy points are legitimately novel rather than noise. EAGC may then systematically downweight the unsupervised gradient precisely where it is most informative. A concrete failure scenario: Herbarium-19 with the standard 50/50 split is tractable, but a 10/90 known/novel split would likely break the coordinator.
Class imbalance within the labeled set. If supervised data is long-tailed, the supervised gradient is already dominated by head classes. EAGC may amplify this bias, because high-energy tail-class samples look, to the coordinator, like novel-class candidates. This is a silent failure mode that would not be detectable from aggregate accuracy.
Distribution shift between labeled and unlabeled sets. The standard GCD setup assumes both sets are drawn from the same covariate distribution. In realistic deployments, unlabeled data is collected later and under different conditions. The energy calibration learned on $\mathcal{D}_l$ no longer transfers, and the coordinator ships miscalibrated weights downstream. No ablation in the current GCD literature tests this cleanly, and I doubt this paper does either.
Backbone dependence. GCD performance is extremely sensitive to initialization, typically a DINO-pretrained ViT-B/16. Methods that show large gains over GCD [Vaze et al. 2022] sometimes shrink considerably when paired with a stronger backbone, because the backbone already does most of the work. I would like to see EAGC evaluated with both a DINO ViT-B/16 baseline and a DINOv2 [Oquab et al. 2023] backbone, to check whether the gain survives representational headroom. If a stronger pretrained encoder closes the gap, we learn that EAGC is a correction for weak representations rather than a novel optimization principle.
Computational cost. A gradient-level coordinator implies per-sample or per-batch gradient decomposition. Naïvely, this requires computing $g_s$ and $g_u$ separately, doubling the backward pass. For a ViT-B/16 on ImageNet-100, that is a real cost. I would expect a training-time comparison, not just accuracy.
6. Questions for Authors
1. Causal test for gradient entanglement. On a synthetic GCD task where you control the overlap between known- and novel-class manifolds, do the entanglement metric and the downstream error vary together under interventions that change one but not the other? Without this, the diagnostic is correlational.
2. Energy versus confidence. If you replace the energy with the max-softmax confidence $\max_k \operatorname{softmax}(z(x))_k$ or with an ensemble disagreement score, does EAGC's advantage persist? What specifically does the LogSumExp-style energy capture that softmax confidence does not?
3. Comparison to PCGrad and CAGrad on the identical GCD pipeline. Using the same backbone, hyperparameters, and schedule, what are the numbers for PCGrad [Yu et al. 2020] and CAGrad [Liu et al. 2021b]? If EAGC beats them by less than two points with overlapping error bars, the energy-aware framing is unjustified.
4. Asymmetric known/novel splits. How does EAGC behave as the known/novel split moves from 90/10 toward 10/90? I expect degradation at the high novel fraction and would like to see the curve.
5. Effective loss equivalence. Can you provide a counterexample where EAGC's update cannot be expressed as a per-sample reweighted sum $g_s + \sum_i w_i\, g_u^{(i)}$? If not, the method reduces to a learned curriculum over samples.
7. Verdict
Borderline, leaning weak accept at a top venue, conditional on the experimental design being strong. The paper identifies a named pathology and proposes a targeted intervention in a domain (GCD) that has not yet absorbed the multi-task optimization literature cleanly. That is a legitimate contribution if, and only if, the experiments isolate the contribution from adjacent baselines (PCGrad, GradNorm, confidence-weighted losses) and the energy signal is shown to do real work beyond rescaling.
My predicted outcomes:
| Scenario | Likelihood | Venue-appropriate decision |
|---|---|---|
| EAGC beats GCD baselines but not PCGrad/GradNorm | moderate | Weak reject |
| EAGC beats all gradient baselines on fine-grained and coarse GCD with 3+ seeds | plausible | Weak accept |
| EAGC's gains evaporate under a stronger backbone (DINOv2) | plausible | Weak reject |
| Energy signal ablates to confidence signal with no loss | likely unaddressed | Major revision request |
At ICLR or NeurIPS, I would push for the specific ablations in Section 4 before committing to an accept. The diagnostic of gradient entanglement is the more durable contribution; the coordinator is the load-bearing question.
The open conjecture worth pursuing: is there a principled characterization of when joint supervised-unsupervised training is provably sub-Pareto-optimal for GCD? A proof along the lines of [Sener & Koltun, 2018] that explicitly exploits the partial-label structure of GCD would justify gradient-level intervention as necessary rather than merely convenient. Such a result would make this line of work permanent rather than incremental. This is really about structure, not scale.
8. Reproducibility and Sources
Primary paper. *The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery.* arXiv:2604.14176, cs.LG, 2026.
Code repository. Not referenced in the abstract. No official code release confirmed at the time of this review.
Datasets (standard in the GCD literature, likely used here).
| Dataset | Access |
|---|---|
| CIFAR-10 / CIFAR-100 [Krizhevsky, 2009] | Public, via torchvision |
| ImageNet-100 (subset of ImageNet-1k [Deng et al. 2009]) | Public with registration |
| CUB-200-2011 [Wah et al. 2011] | Public |
| Stanford Cars [Krause et al. 2013] | Public |
| FGVC-Aircraft [Maji et al. 2013] | Public |
| Herbarium-19 [Tan et al. 2019] | Public via Kaggle |
Reproducibility rating.
| Dimension | Score (1-5) | Note |
|---|---|---|
| Code availability | 1 | No release indicated in abstract |
| Data availability | 5 | Standard GCD benchmarks are public |
| Experimental detail | unknown | Abstract alone is insufficient; depends on the full manuscript |
Key prior work referenced in this review.
- Vaze, S., Han, K., Vedaldi, A., Zisserman, A. *Generalized Category Discovery.* CVPR 2022. arXiv:2201.02609.
- Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., Finn, C. *Gradient Surgery for Multi-Task Learning.* NeurIPS 2020.
- Chen, Z., Badrinarayanan, V., Lee, C.-Y., Rabinovich, A. *GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks.* ICML 2018.
- Sener, O., Koltun, V. *Multi-Task Learning as Multi-Objective Optimization.* NeurIPS 2018.
- Désidéri, J.-A. *Multiple-Gradient Descent Algorithm (MGDA) for Multiobjective Optimization.* Comptes Rendus Mathématique, 2012.
- Liu, B., Liu, X., Jin, X., Stone, P., Liu, Q. *Conflict-Averse Gradient Descent for Multi-task Learning (CAGrad).* NeurIPS 2021.
- Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D., Norouzi, M., Swersky, K. *Your Classifier Is Secretly an Energy-Based Model and You Should Treat It Like One.* ICLR 2020.
- LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., Huang, F. *A Tutorial on Energy-Based Learning.* 2006.
- Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A. *Emerging Properties in Self-Supervised Vision Transformers (DINO).* ICCV 2021.
- Wen, X., Zhao, B., Qi, X. *Parametric Classification for Generalized Category Discovery: A Baseline Study (SimGCD).* ICCV 2023.
- Zhang, S., Khan, S., Shen, Z., Naseer, M., Chen, G., Khan, F. *PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery.* CVPR 2023.
- Sohn, K., et al. *FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence.* NeurIPS 2020.
- Oquab, M., et al. *DINOv2: Learning Robust Visual Features without Supervision.* 2023.
The right abstraction makes the problem trivial, and finding it is the hard part. For GCD, that abstraction may be a partial-label multi-objective formulation with provable Pareto structure. EAGC is a reasonable step in that direction. Whether it is the decisive one is an empirical question that the full paper must answer.
