Abstract
Consider what happens when you ask an English speaker to parse a garden-path sentence. The parser does not merely predict the next token; it commits to a structural hypothesis and revises when evidence forces it to. The real question is not whether the model arrives at the right answer, but how. I bring this sensibility to [Liu et al. 2024]'s *DoRA: Weight-Decomposed Low-Rank Adaptation* (arXiv:2402.09353), which proposes that decomposing pretrained weights into a magnitude vector $m$ and a direction matrix $V$, then applying LoRA updates only to $V$, more faithfully mimics full fine-tuning than LoRA [Hu et al. 2021]. The authors report gains of 1–4 points on commonsense reasoning and visual instruction tuning across LLaMA, LLaVA, and VL-BART at matched parameter budgets. My assessment: the empirical deltas are real and the decomposition is mathematically clean, but the causal story the paper tells (that magnitude–direction separation recovers full-FT learning dynamics) is underdetermined by the evidence. A more parsimonious reading is that DoRA functions as an implicit rank-allocation and optimizer-preconditioning trick. The distinction matters: one framing promises a general principle; the other predicts narrow, sometimes fragile, benefits. Novelty: moderate.
1. Steelman: The Strongest Version of the DoRA Argument
Let me begin by reconstructing the paper at its most persuasive, because critique without steelmanning is strawmanning.
[Hu et al. 2021]'s LoRA factorizes the weight update as $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. This elegant parameterization exploits the observation of [Aghajanyan et al. 2021] that task-specific adaptations have low intrinsic dimensionality. The frozen pretrained $W_0$ is added back at inference: $W = W_0 + BA$. The method is parameter-efficient, mergeable, and stable. Yet a persistent empirical wrinkle remains: on many tasks, LoRA at matched parameter count underperforms full fine-tuning by a non-trivial margin, often 1–3 points on commonsense reasoning benchmarks.
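For concreteness, the LoRA parameterization described above can be sketched in a few lines of NumPy (dimensions are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4                # illustrative dimensions; r << min(d, k)

W0 = rng.standard_normal((d, k))   # frozen pretrained weight
B = np.zeros((d, r))               # LoRA zero-init: Delta W = 0 at step zero
A = rng.standard_normal((r, k))

delta_W = B @ A                    # low-rank update, rank <= r
W = W0 + delta_W                   # merged weight used at inference

assert np.allclose(W, W0)          # zero init leaves the forward pass unchanged
```

The zero initialization of $B$ is what makes LoRA safe to attach to a pretrained model: training starts from the pretrained function exactly.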
DoRA begins from a structural observation. Any weight matrix $W_0$ can be decomposed column-wise into a magnitude vector and a unit-norm direction matrix:

$$W_0 = m \, \frac{V}{\|V\|_c}$$

where $\|\cdot\|_c$ denotes the column-wise norm, $m = \|W_0\|_c$ is trained directly, and $V$ is directionally updated through LoRA: $V' = V + \Delta V = W_0 + BA$, then renormalized. The final adapted weight becomes:

$$W' = m \, \frac{W_0 + BA}{\|W_0 + BA\|_c}$$
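This reparameterization is easy to sketch directly (a toy NumPy illustration, not the authors' implementation; dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 64, 32, 4

W0 = rng.standard_normal((d, k))
col_norm = lambda M: np.linalg.norm(M, axis=0, keepdims=True)  # ||.||_c per column

# DoRA init: magnitude m = ||W0||_c, LoRA branch B = 0
m = col_norm(W0)
B = np.zeros((d, r))
A = rng.standard_normal((r, k))

V_adapted = W0 + B @ A                         # direction updated through LoRA
W_prime = m * V_adapted / col_norm(V_adapted)  # W' = m (W0 + BA) / ||W0 + BA||_c

assert np.allclose(W_prime, W0)                # step zero reproduces W0 exactly
assert np.allclose(col_norm(W_prime), m)       # column norms are carried entirely by m
```

The second assertion makes the division of labor explicit: after renormalization, the LoRA branch can only steer direction; all scale lives in $m$.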
The authors motivate this through an analysis of learned full-FT updates: when they decompose full-FT weight changes into magnitude ($\Delta M$) and direction ($\Delta D$) components, full-FT exhibits a *negative correlation* between the two, whereas LoRA exhibits a strongly *positive* correlation. DoRA, by construction, decouples them, and its learned updates reproduce the negative correlation signature of full-FT. This is a clean, falsifiable claim, and I genuinely respect it as a piece of interpretability work. The correlation signature serves as a diagnostic, not merely a post-hoc rationalization.
Empirical results across eight commonsense reasoning datasets on LLaMA-7B/13B show DoRA matching or exceeding LoRA by roughly 1–3.7 points at identical parameter budgets. On VL-BART for image–text tasks, similar deltas appear. The improvements are consistent enough that randomness alone is an unlikely explanation.
At its strongest, the paper's thesis runs as follows: *low-rank direction updates with an unconstrained magnitude scalar better match the geometry of full-FT updates than LoRA's joint low-rank update of both*.
2. The Weakest Link: Is the Decomposition a Principle, or a Preconditioner?
Here is where I want to stress-test the argument. The authors frame DoRA as principled because the decomposition matches full-FT's update correlation structure. But that correlation diagnostic is observational, not causal. Three alternative mechanisms could produce the same benefit without the decomposition being the load-bearing element.
First, effective rank inflation. Under strict parameter matching, DoRA's magnitude parameters are unconstrained in rank. A $k$-dimensional per-column scaling added on top of a rank-$r$ update is *not* equivalent to a rank-$r$ update: the effective rank of the induced weight change can reach $\min(d, k)$, and, more importantly, the column-wise rescaling reshapes the singular value spectrum of $W'$ in ways LoRA cannot. This connects directly to [Kopiczko et al. 2024]'s VeRA and [Zhang et al. 2023]'s AdaLoRA, where rank allocation across layers is shown to be a first-order determinant of adapter quality. A fairer comparison would include LoRA + per-column learned scale, a trivial baseline, which the paper does not report as a standalone ablation against DoRA's full formulation under identical training configurations.
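The rank claim checks out numerically: a per-column rescaling of the adapted weight induces a change whose rank is not bounded by $r$ (a toy sketch; `s` is a hypothetical learned scale, not part of either method as published):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r = 64, 32, 4

W0 = rng.standard_normal((d, k))
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))
s = 1.0 + 0.1 * rng.standard_normal((1, k))  # hypothetical learned per-column scale

lora_update = B @ A                      # rank <= r by construction
scaled_update = s * (W0 + B @ A) - W0    # weight change once rescaling is allowed

assert np.linalg.matrix_rank(lora_update) == r
assert np.linalg.matrix_rank(scaled_update) == min(d, k)  # generically full rank
```

The per-column scale mixes the frozen $W_0$ into the effective update, which is exactly why a parameter-count comparison between LoRA and DoRA is not a capacity-matched comparison.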
Second, implicit preconditioning. The renormalization step introduces a nonlinear coupling between the LoRA parameters and their effective gradient. Denoting $V' = W_0 + BA$, the gradient of the loss with respect to $V'$ passes through a normalization Jacobian (applied column-wise):

$$\nabla_{V'} \mathcal{L} = \frac{m}{\|V'\|_c}\left(I - \frac{V' V'^{\top}}{\|V'\|_c^2}\right)\nabla_{W'} \mathcal{L}$$

This is a projection onto the tangent space of the unit sphere, scaled by $m / \|V'\|_c$. Weight normalization of this form has been understood since [Salimans & Kingma, 2016] to act as a preconditioner that decouples the learning of magnitude from direction and improves conditioning. The DoRA gains may therefore be largely a weight-normalization effect, a well-understood optimization trick, rather than a discovery about the structure of fine-tuning updates.
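The Jacobian above can be sanity-checked by finite differences on a single column (an illustrative check, assuming a scalar loss that is linear in the adapted column $w' = m\,v/\|v\|$):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 16
v = rng.standard_normal(n)   # one column of V' = W0 + BA
g = rng.standard_normal(n)   # upstream gradient dL/dw' for this column
m = 2.5                      # magnitude entry for this column

def loss(v):
    return g @ (m * v / np.linalg.norm(v))   # scalar loss through w' = m v/||v||

# analytic gradient: (m/||v||) (I - v v^T / ||v||^2) g
nv = np.linalg.norm(v)
analytic = (m / nv) * (g - v * (v @ g) / nv**2)

# central finite differences along each coordinate
eps = 1e-6
numeric = np.array([(loss(v + eps * e) - loss(v - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

assert np.allclose(analytic, numeric, atol=1e-5)
```

The projection term $I - vv^\top/\|v\|^2$ is what removes the radial component of the gradient, which is the signature behavior of weight normalization.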
Third, scale-matched capacity at initialization. LoRA initializes $B = 0$, so $\Delta W = 0$ at step zero. DoRA initializes $m = \|W_0\|_c$ and inherits LoRA's zero init for $B$, so the forward pass at step zero reproduces $W_0$ exactly. But the *gradient scales* at step one differ because the normalization dynamics are active from the start. In my experience reviewing PEFT papers, this subtle initialization difference often accounts for 0.5–1.5 point swings on commonsense benchmarks where all methods operate in a narrow-margin regime.
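The step-one difference can be made concrete for a loss that is linear in the merged weight (an illustrative sketch, not from the paper; at DoRA's init the scale $m/\|V'\|_c$ equals 1, so only the projection term distinguishes the two gradients):

```python
import numpy as np

rng = np.random.default_rng(7)
d, k, r = 32, 16, 4
W0 = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))
G = rng.standard_normal((d, k))    # dL/dW for a loss linear in the merged weight

# LoRA at step zero (B = 0): dL/dB = G A^T
grad_B_lora = G @ A.T

# DoRA at step zero (m = ||W0||_c, V' = W0): the normalization Jacobian projects
# each column of G orthogonally to the corresponding column of W0
norms2 = np.sum(W0 * W0, axis=0)
G_proj = G - W0 * (np.sum(W0 * G, axis=0) / norms2)
grad_B_dora = G_proj @ A.T

assert not np.allclose(grad_B_lora, grad_B_dora)  # gradients differ from step one
```

Identical forward passes at initialization therefore do not imply identical training trajectories, which is the point of the paragraph above.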
None of these alternatives is individually fatal to DoRA's thesis. Together, they establish that the paper's causal story ("decomposition matches full-FT geometry, therefore closes the gap") is severely underdetermined.
3. Alternative Interpretation: DoRA as Rank-Allocation Artifact
Let me propose a different reading the authors did not consider. Suppose the real axis of improvement in PEFT is not the geometry of weight updates but the *allocation of effective capacity across the weight matrix's singular directions*. Under this view:
- LoRA's rank-$r$ constraint uniformly caps all directions at rank $r$.
- DoRA's magnitude vector adds $k$ nearly-free parameters that enable per-output-channel rescaling, effectively a one-dimensional capacity boost aligned with the standard basis.
- AdaLoRA [Zhang et al. 2023] allocates rank *adaptively* per layer via SVD-based pruning.
- LoRA+ [Hayou et al. 2024] matches LoRA's expressivity but uses different learning rates for $A$ and $B$, yielding similar gains.
If DoRA's mechanism is truly capacity reallocation, we should see: (1) gains concentrated on tasks where output-channel-level rescaling matters most (e.g. classification heads, MLP up-projections where per-channel magnitude shifts semantic weight); (2) gains that vanish or invert when LoRA is augmented with a free per-channel scale; (3) diminishing returns at higher ranks, because as $r \to \min(d, k)$, LoRA already spans the full update space and magnitude decoupling becomes redundant.
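Prediction (3) is simple linear algebra: once $r$ reaches $\min(d, k)$, any target update factorizes exactly as $BA$, so magnitude decoupling adds no expressivity (a quick sketch via the SVD):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 48, 24
target = rng.standard_normal((d, k))   # an arbitrary desired update Delta W

r = min(d, k)                          # LoRA at saturated rank
U, S, Vt = np.linalg.svd(target, full_matrices=False)
B = U * S                              # (d, r): left vectors scaled by singular values
A = Vt                                 # (r, k)

assert np.allclose(B @ A, target)      # full-rank LoRA represents any update exactly
```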
The paper's ablations partially support this alternative reading. Improvements shrink at higher ranks relative to lower ones in several experiments, consistent with the rank-saturation prediction. The Commonsense170K gains also concentrate on tasks with categorical output structure rather than generative reasoning, again compatible with capacity reallocation.
This is the kind of alternative explanation that every PEFT paper owes its readers. It does not invalidate DoRA; it reframes it. DoRA is a principled, well-motivated form of capacity reallocation dressed in the clothing of a weight-decomposition theory. The empirical facts remain; the interpretation becomes humbler.
4. Methodology Critique
Experimental design
The LLaMA-7B/13B commonsense reasoning benchmark (the eight-dataset Commonsense170K suite introduced by [Hu et al. 2023]) is a solid but narrow choice. This benchmark has a particular failure mode: gains of 1–2 points on BoolQ or WinoGrande can reflect prompt-sensitivity or RNG variance rather than genuine method quality. I would want to see results averaged over at least 3 random seeds with reported standard deviations. The paper reports single-run numbers for most configurations.
Baseline adequacy
The baselines (LoRA, Series Adapter, Parallel Adapter, Prompt Tuning) are standard but not exhaustive. Three critical absences:
1. AdaLoRA [Zhang et al. 2023], which explicitly tackles rank allocation.
2. LoRA + WeightNorm, a trivial ablation that would isolate whether the normalization or the decomposition is doing the work.
3. VeRA [Kopiczko et al. 2024], which shares random projections across layers and is a strong parameter-efficient baseline at small budgets.
Ablation gaps
The missing ablation I most want to see: DoRA with frozen $m$. If freezing magnitude preserves most of the gain, the decomposition hypothesis is falsified and the result reduces to weight normalization. If freezing magnitude destroys the gain, the decomposition story gains support. This single experiment would sharply disambiguate the causal mechanism, and its absence is, frankly, puzzling.
Statistical significance
No confidence intervals. No bootstrap estimates. No effect-size reporting. Given gaps of 1–2 points on benchmarks where seed-to-seed variance is comparable, it is impossible to tell from the paper alone whether the reported improvements are statistically meaningful. This is a recurring problem in the PEFT literature, not specific to this paper, but it limits how strongly one can draw conclusions.
5. Key Numbers
| Configuration | Trainable Params | LLaMA-7B Commonsense Avg | LoRA Baseline | Delta |
|---|---|---|---|---|
| LoRA | 0.83% | 74.7 | 74.7 | n/a |
| DoRA | 0.84% | 78.4 | 74.7 | +3.7 |
| DoRA | 0.43% | 77.5 | 72.7 (matched-budget LoRA) | +4.8 |
| Full FT (reported) | 100% | ~77–79 | n/a | n/a |
Rounded from Table 2 in [Liu et al. 2024]. Deltas are computed against matched-rank LoRA.
Note the striking pattern: DoRA at its larger budget reportedly *exceeds* full fine-tuning averages. This should raise an eyebrow. Either the full-FT baseline is under-tuned (which would weaken the paper's central framing) or DoRA is over-regularized in a way that happens to benefit this specific benchmark suite. I suspect the former.
6. Limitations and Failure Modes the Authors Did Not Address
Failure mode 1: column-norm decomposition is basis-dependent. The decomposition along columns assumes the standard basis is meaningful. For attention weight matrices after head merging, or after model surgery such as pruning, the column basis may be arbitrary. DoRA's inductive bias is tied to the computational graph's current basis; under weight permutation symmetry [Ainsworth et al. 2023], the decomposition shifts. This is not merely a theoretical curiosity: it predicts that DoRA's advantage should vanish or diminish on architectures with unnatural column bases, such as mixture-of-experts with routed projections.
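The basis dependence is easy to demonstrate: per-column norms are not invariant under an orthogonal mixing of columns, so the $(m, V)$ decomposition changes with the basis (toy sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 32, 16
W = rng.standard_normal((d, k))

# random orthogonal mixing of columns, via QR of a Gaussian matrix
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
W_rot = W @ Q

norms = np.linalg.norm(W, axis=0)
norms_rot = np.linalg.norm(W_rot, axis=0)

assert np.isclose(np.sum(norms**2), np.sum(norms_rot**2))   # Frobenius norm preserved
assert not np.allclose(np.sort(norms), np.sort(norms_rot))  # per-column norms are not
```

The total (Frobenius) scale of the matrix survives the rotation; the magnitude vector DoRA trains does not, which is exactly the inductive bias at issue.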
Failure mode 2: interaction with quantization. QLoRA [Dettmers et al. 2023] demonstrated that 4-bit quantized fine-tuning is central to practical PEFT deployment. DoRA's renormalization introduces a per-step column-norm computation that interacts nontrivially with quantized storage: the magnitude must be kept in high precision, while is renormalized after each update. Whether this degrades under aggressive quantization is not tested. I would predict larger quality drops at 4-bit for DoRA than for LoRA, because the normalization amplifies quantization noise.
Failure mode 3: continual adaptation and merging. A key practical virtue of LoRA is additive composition: multiple task adapters can be merged linearly. DoRA's normalization breaks linearity. Two DoRA adapters cannot be averaged in weight space and yield the expected behavior, because the column normalization is nonlinear: $\frac{V_1 + V_2}{\|V_1 + V_2\|_c} \neq \frac{V_1}{\|V_1\|_c} + \frac{V_2}{\|V_2\|_c}$ in general. The paper does not discuss this, but it is a serious practical limitation for deployments that rely on adapter mixing (e.g. LoRA Hub).
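The non-mergeability claim can be checked directly: averaging two DoRA-adapted weights is not the same as applying DoRA to averaged parameters (toy sketch with arbitrary magnitudes and directions):

```python
import numpy as np

rng = np.random.default_rng(6)
d, k = 32, 16
col_norm = lambda M: np.linalg.norm(M, axis=0, keepdims=True)
dora_weight = lambda m, V: m * V / col_norm(V)   # W' = m V / ||V||_c

m1 = np.abs(rng.standard_normal((1, k))) + 0.5
m2 = np.abs(rng.standard_normal((1, k))) + 0.5
V1 = rng.standard_normal((d, k))
V2 = rng.standard_normal((d, k))

avg_of_weights = 0.5 * (dora_weight(m1, V1) + dora_weight(m2, V2))
weight_of_avgs = dora_weight(0.5 * (m1 + m2), 0.5 * (V1 + V2))

assert not np.allclose(avg_of_weights, weight_of_avgs)  # normalization breaks linearity
```

With plain LoRA, the analogous two quantities coincide, which is precisely the additive-composition property DoRA gives up.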
7. Related Work Positioning
[Hu et al. 2021]'s LoRA is the direct parent. DoRA modifies its application scope, not its core factorization. [Houlsby et al. 2019]'s adapter modules are the grandparent. [Liu et al. 2022]'s (IA)^3 applies multiplicative rescaling to hidden activations, which conceptually overlaps with DoRA's magnitude vector but operates on activations rather than weights. [Zhang et al. 2023]'s AdaLoRA tackles rank allocation explicitly; DoRA does so implicitly. [Kopiczko et al. 2024]'s VeRA shares random projections across layers, an orthogonal axis of efficiency. The closest conceptual ancestor is [Salimans & Kingma, 2016]'s weight normalization for training neural networks, which DoRA effectively ports into the PEFT setting, a connection the paper could have foregrounded more honestly.
8. What Would Change My Mind
Let me be specific, because vague skepticism is cheap.
1. Frozen-magnitude ablation: if DoRA with $m$ frozen at its initialization $\|W_0\|_c$ recovers less than 30% of the gain over LoRA, the decomposition hypothesis gains real support. If it recovers more than 70%, the story collapses to weight normalization.
2. Matched-capacity LoRA: LoRA augmented with a $k$-dimensional learnable per-column scale, under identical training hyperparameters and seed. If this baseline closes most of the DoRA gap, the rank-allocation interpretation wins.
3. Multi-seed variance reporting: DoRA's advantage surviving at $p < 0.05$ across 5 seeds on at least three benchmarks.
4. Basis-sensitivity test: applying DoRA after a random orthogonal rotation of weight columns. If DoRA still outperforms LoRA post-rotation, the column-norm decomposition is not the crucial ingredient.
None of these is expensive. Their absence, not their difficulty, is what concerns me.
9. Broader Implications
If DoRA's thesis is correct, it suggests a deeper principle: that fine-tuning dynamics are best captured by geometric decompositions that separate scale from direction. This would generalize to other PEFT methods and reshape how we design adapters.
If the alternative interpretation is correct, the implication is less grand but more useful: PEFT progress is largely about *where you spend your parameters*, and simple capacity-reallocation tricks (per-channel scales, per-layer rank budgets, weight normalization) are competitive with more elaborate decomposition theories. The field should stop searching for mathematically elegant structural priors and start running careful capacity-allocation ablations.
Either way, DoRA is a useful waypoint. Language is more than prediction, and so is fine-tuning. The real question is not whether DoRA posts better numbers, but what mechanism actually produces those numbers. Until the ablations above are run, that question remains open.
10. Key Questions for the Authors
1. What happens to the LoRA-vs-full-FT correlation gap when LoRA is augmented with a learnable per-column magnitude (without renormalization)? Does it also flip sign?
2. Do the reported improvements survive at 3+ seeds with reported confidence intervals?
3. How does DoRA behave under 4-bit quantization, and what is the interaction between magnitude preservation and quantized direction updates?
4. Can two DoRA adapters be composed or merged in any principled way, and at what quality cost?
5. On architectures with non-canonical column bases (MoE, Gated Linear Units), does the column-norm decomposition remain optimal, or does a task-dependent basis choice matter?
Verdict
DoRA reports a reliable empirical gain and a clean mathematical framing. But the paper's causal claim (that magnitude–direction decomposition matches full-FT geometry and thereby closes the adaptation gap) is not separated from at least three simpler alternative explanations: effective rank inflation via the free magnitude vector, weight-normalization preconditioning effects, and initialization-scale artifacts. The empirical improvement is real and useful; the theoretical claim is underdetermined. I would accept this paper with the strong recommendation that the frozen-magnitude ablation and matched-capacity LoRA baseline appear in the camera-ready. Contribution classification: empirical finding with moderate methodological novelty, not a new theoretical result. Novelty rating: moderate.
Reproducibility & Sources
1. Primary paper: Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., & Chen, M.-H. (2024). *DoRA: Weight-Decomposed Low-Rank Adaptation*. arXiv:2402.09353.
2. Code repository: Official implementation released at github.com/NVlabs/DoRA (verify current URL on the paper's arXiv page).
3. Datasets: Commonsense170K aggregated suite [Hu et al. 2023], including BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-easy, ARC-challenge, OpenBookQA (all public). Visual instruction tuning uses LLaVA's public data.
4. Reproducibility assessment:
- Code availability: 4/5. Official code released, but exact per-task hyperparameter files require inspection.
- Data availability: 5/5. All benchmarks public, aggregation recipe documented in prior LLM-Adapters work.
- Experimental detail sufficient: 3/5. Single-seed results, missing key ablations (frozen magnitude, matched-capacity LoRA), no confidence intervals. Reproduction is feasible, but establishing the claimed effect sizes at statistical rigor is not.
