Problem Setup
The central claim of Foret et al. (2020, arXiv:2010.01412) is that generalization in deep networks correlates with the *flatness* of the loss landscape around the final parameters, and that one can exploit this correlation by optimizing a surrogate objective that explicitly penalizes sharpness. Formally, given a training loss $L_S(w)$, Sharpness-Aware Minimization (SAM) replaces the standard ERM objective $\min_w L_S(w)$ with

$$\min_w \; \max_{\|\epsilon\|_2 \le \rho} \; L_S(w + \epsilon)$$
The inner maximization is a worst-case perturbation within a radius-$\rho$ Euclidean ball around $w$. The authors approximate the maximizer via a single first-order step, yielding $\hat{\epsilon}(w) = \rho \, \nabla L_S(w) / \|\nabla L_S(w)\|_2$, and then descend along the gradient evaluated at the perturbed point, $\nabla L_S(w)\big|_{w + \hat{\epsilon}(w)}$. The procedure requires two forward-backward passes per step and introduces a single hyperparameter, $\rho$.
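The two-pass procedure is easy to state in code. Below is a minimal sketch of one SAM step on a toy quadratic loss, assuming plain gradient descent as the base optimizer; the function names are illustrative, not from the paper's codebase:

```python
import numpy as np

TARGET = np.array([1.0, -2.0])

def loss(w):
    # Toy quadratic loss with its minimum at TARGET
    return 0.5 * np.sum((w - TARGET) ** 2)

def grad(w):
    return w - TARGET

def sam_step(w, lr=0.1, rho=0.05, eps=1e-12):
    g = grad(w)                              # first forward-backward pass, at w
    e = rho * g / (np.linalg.norm(g) + eps)  # first-order ascent to the rho-sphere
    g_adv = grad(w + e)                      # second pass, at the perturbed point
    return w - lr * g_adv                    # descend along the perturbed gradient

w = np.array([4.0, 3.0])
for _ in range(500):
    w = sam_step(w)
```

Even in this toy setting one quirk is visible: near the optimum the perturbed gradient has norm close to $\rho$ rather than zero, so the iterate settles into a small neighborhood of the minimizer rather than converging exactly.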
The empirical story is compelling: consistent gains on CIFAR-10/100, ImageNet, and transfer benchmarks, alongside robustness to label noise. The theoretical story, as we shall see, is considerably less settled. The paper has been heavily cited and has spawned an entire cottage industry of SAM variants (ASAM, GSAM, LookSAM, Fisher-SAM). Before joining the enthusiasm, let us be precise about what SAM actually *is* mathematically, and about which claims the paper's machinery can and cannot support.
Contribution classification. This is primarily an *algorithmic* contribution with *empirical* support and *weak-to-moderate theoretical framing*. The PAC-Bayes sketch in Section 2 motivates the sharpness objective but does not rigorously bound the *SAM estimator's* generalization gap. The novelty lies in the specific min-max reformulation and its cheap first-order approximation, not in the idea that flatness matters.
Novelty and Positioning Against Prior Work
The flatness hypothesis is old. Hochreiter and Schmidhuber (1997) introduced *flat minima* as a minimum-description-length principle, arguing that low-precision parameter encoding at flat optima yields better generalization. Keskar et al. (2017) empirically linked large-batch training to sharper minima and degraded generalization. Neyshabur et al. (2017) derived PAC-Bayes bounds that scale with a notion of sharpness measured via the Hessian spectrum. Dziugaite and Roy (2017) showed that PAC-Bayes could yield non-vacuous bounds even for overparameterized networks.
What, then, is new in Foret et al.? Three things. First, the explicit reformulation of sharpness as an $\ell_2$-ball worst-case perturbation, which furnishes a *differentiable-almost-everywhere* surrogate. Second, the first-order ascent approximation, which reduces an intractable inner maximization to a single gradient step and keeps the method within a constant factor of SGD's cost. Third, systematic empirical validation across vision benchmarks at scale.
The elephant in the room: Dinh et al. (2017), *Sharp Minima Can Generalize for Deep Nets*, proved that standard sharpness measures are not reparameterization-invariant and can be made arbitrarily large or small by rescaling layers without altering the function computed. This is a *formal* objection to any sharpness-based generalization argument that does not explicitly account for such invariances. Foret et al. acknowledge this and add weight decay, yet their $\ell_2$-ball formulation is not scale-invariant and inherits precisely the pathology Dinh et al. identified. Kwon et al. (2021) subsequently proposed ASAM (*Adaptive SAM*) to address this issue directly, replacing the isotropic ball with an adaptive one normalized by parameter magnitudes. That this fix was needed, and that it materially shifts performance, suggests the original formulation rested on shakier foundations than the paper's empirics imply.
Novelty rating: moderate. Strong engineering and empirical contribution; modest theoretical novelty; open questions about what SAM actually optimizes.
The Key Insight Is Algebraic, Not Geometric
Here is where we must be precise about the claim. Consider the first-order Taylor expansion of the inner problem:

$$L_S(w + \epsilon) \approx L_S(w) + \epsilon^\top \nabla L_S(w)$$
Maximizing over $\|\epsilon\|_2 \le \rho$ yields $\hat{\epsilon} = \rho \, \nabla L_S(w) / \|\nabla L_S(w)\|_2$, and substituting back gives:

$$\max_{\|\epsilon\|_2 \le \rho} L_S(w + \epsilon) \approx L_S(w) + \rho \, \|\nabla L_S(w)\|_2$$
To *first order*, SAM is gradient-norm regularization: not flatness-seeking in the Hessian sense, but gradient-magnitude penalization. This point was made explicit by Zhuang et al. (2022, GSAM) and Andriushchenko and Flammarion (2022), who showed that SAM's dynamics can be understood as implicit regularization of $\|\nabla L_S(w)\|_2$. The second-order correction does involve the Hessian, but the dominant first-order effect is not what the paper's intuition-pumping suggests.
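The algebra is easy to check numerically. A small sketch on a quadratic loss, verifying that the one-step SAM surrogate and the gradient-norm-penalized objective agree up to $O(\rho^2)$ (all names illustrative):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])  # symmetric positive-definite Hessian of a toy loss

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w = np.array([1.0, -0.5])
rho = 1e-3

g = grad(w)
e_hat = rho * g / np.linalg.norm(g)                # first-order inner maximizer
sam_surrogate = loss(w + e_hat)                    # what SAM actually evaluates
grad_norm_obj = loss(w) + rho * np.linalg.norm(g)  # first-order equivalent

# For a quadratic the gap is exactly the curvature term 0.5 * e_hat' A e_hat,
# which is O(rho^2) and vanishes relative to the rho * ||grad|| penalty
gap = sam_surrogate - grad_norm_obj
```

The `gap` variable isolates the second-order correction: the Hessian does enter, but only at $O(\rho^2)$, while the gradient-norm penalty enters at $O(\rho)$.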
This reframing has teeth. Gradient-norm regularization is well studied in statistics and optimization: it is the dominant term in *Tikhonov-style* stability penalties (Bousquet and Elisseeff, 2002) and appears in *manifold tangent regularization* (Rifai et al. 2011). Viewing SAM through this lens makes its success less mysterious and more continuous with a decades-old line of work on stable predictors. It also predicts where SAM should fail: in regimes where gradient norm and sharpness decouple, such as near saddles or along flat directions of rank-deficient Hessians.
Cross-Disciplinary Bridges
Robust Optimization
The SAM objective is structurally identical to *robust optimization* in the sense of Ben-Tal, El Ghaoui, and Nemirovski (2009). The inner max over a Euclidean uncertainty set is a textbook robust formulation. What deep learning calls sharpness, robust optimization calls *worst-case sensitivity to parameter uncertainty*. The key insight from that literature: under linearization, robustness over an uncertainty set of radius $\rho$ is *equivalent* to adding a penalty of $\rho$ times the dual norm of the gradient, and this equivalence becomes exact only when the problem is convex and the uncertainty set is small. Deep networks violate both conditions, yet the first-order equivalence is what SAM effectively exploits.
A researcher trained in robust optimization would immediately ask: why $\ell_2$? An $\ell_\infty$ ball yields coordinate-wise perturbations and sign-aligned gradients (structurally closer to FGSM from Goodfellow et al. 2015); a Mahalanobis ball defined by the Fisher information would yield the natural-gradient analog. The choice of $\ell_2$ is not principled in the paper; it is convenient.
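The geometry of the ball determines the regularizer: under linearization, the worst-case gain over a norm ball of radius $\rho$ is $\rho$ times the *dual* norm of the gradient. A sketch contrasting the $\ell_2$ and $\ell_\infty$ choices (illustrative values):

```python
import numpy as np

g = np.array([3.0, -0.1, 0.5])  # a gradient with uneven coordinate scales
rho = 0.1

# l2 ball: worst-case perturbation is aligned with g (SAM's choice)
e_l2 = rho * g / np.linalg.norm(g)

# l-inf ball: every coordinate saturates at +/- rho, sign-aligned (FGSM-style)
e_linf = rho * np.sign(g)

# Each perturbation maximizes e @ g over its own ball, and the maximal value
# is rho times the dual norm: l2 is self-dual, the dual of l-inf is l1
gain_l2 = e_l2 @ g
gain_l1 = e_linf @ g
```

On gradients with uneven coordinate scales the two geometries produce very different perturbations and very different effective penalties, which is exactly the degree of freedom ASAM later exploits.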
Adversarial Training
SAM is the Madry et al. (2018) adversarial training objective applied to *weight space* rather than *input space*. This parallel is not merely cosmetic. Both use an inner max with a single gradient ascent step. Both rely on the implicit assumption that one ascent step approximates the true worst case, an assumption known to be loose when the loss landscape is non-convex or curvature is misaligned with the gradient direction (Wang et al. 2019). The adversarial training literature has developed a rich toolbox (PGD attacks, TRADES, curvature-aware defenses) that could inform SAM variants but has been under-exploited in the flatness-seeking literature.
Control Theory
$H_\infty$ control addresses the same mathematical structure: minimize worst-case performance under bounded disturbance. The parallel is close enough that one can port intuitions directly. In particular, control theorists have long known that worst-case robustness and average-case performance trade off in ways that depend sharply on the *disturbance set geometry*. The ASAM modification (Kwon et al. 2021) is precisely a change of disturbance set, scaled by parameter magnitudes, and its empirical superiority echoes control-theoretic findings that uniform bounds are rarely optimal when the underlying system spans multiple scales.
Statistics: Influence Functions and Stability
The PAC-Bayes framing in Foret et al. connects to *algorithmic stability* (Bousquet and Elisseeff, 2002), which bounds generalization in terms of how much a learner's output changes under training set perturbations. A duality emerges here: SAM perturbs *parameters* and penalizes loss sensitivity; stability analyses perturb *data* and penalize output sensitivity. Cook's distance, influence functions, and the jackknife all furnish quantitative tools for sensitivity that the sharpness literature rarely invokes. A statistician would note that $\|\nabla L_S(w)\|$ is, up to constants, the empirical norm of the score function, and its magnitude at an optimum has direct interpretations in terms of Fisher information and asymptotic variance.
Neuroscience (Speculative)
There is a tenuous but intriguing parallel to *synaptic consolidation* and the stabilization of learned representations via reduced plasticity at critical synapses (Kirkpatrick et al. 2017, EWC). Both frameworks prefer solutions where small parameter perturbations do not catastrophically disrupt function. This parallel is worth flagging but should not be oversold.
The Implicit Learning Rate Story
Here is the uncomfortable finding: several follow-up works argue that a non-trivial fraction of SAM's gains can be reproduced by adjusting the learning rate schedule or by adding gradient-norm regularization without the inner max. Wen et al. (2023) show that SAM's effective dynamics, in a linearized regime, resemble those of SGD with a modified learning rate depending on gradient alignment with the Hessian's top eigenvector. Andriushchenko and Flammarion (2022) demonstrate that many reported SAM improvements shrink or vanish when baselines are properly tuned, particularly when the SGD baseline uses longer schedules or tuned momentum.
This is the core of the audit. If SAM's gains are substantially an implicit learning-rate effect, then the "flatness" story is epistemically misleading even where the empirical gains are real. A carefully tuned SGD with cosine schedule and gradient clipping may capture much of what SAM delivers.
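On an isotropic quadratic the learning-rate reading is exact: one SAM step coincides with one SGD step at an inflated, gradient-norm-dependent learning rate. A toy check in the spirit of the linearized-regime argument (illustrative construction, not Wen et al.'s):

```python
import numpy as np

c, lr, rho = 2.0, 0.1, 0.05  # curvature, learning rate, SAM radius

def grad(w):
    return c * w             # gradient of the isotropic quadratic 0.5 * c * ||w||^2

w = np.array([3.0, -4.0])
g = grad(w)

# One SAM step: ascend rho along the normalized gradient, then descend
e = rho * g / np.linalg.norm(g)
sam_next = w - lr * grad(w + e)

# The identical step, written as plain SGD with a modified effective learning
# rate that grows as the gradient norm shrinks
eff_lr = lr * (1.0 + rho * c / np.linalg.norm(g))
sgd_next = w - eff_lr * g
```

The effective rate $\mathrm{lr}\,(1 + \rho c / \|g\|)$ inflates as $\|g\|$ shrinks, which is why a well-chosen SGD schedule can mimic part of SAM's behavior in this regime.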
Experimental Assessment
The paper reports strong numbers on CIFAR-10/100, ImageNet, and transfer tasks. Representative claims, summarized:
| Benchmark | Model | Baseline Reported | SAM Reported | Relative Gain |
|---|---|---|---|---|
| CIFAR-10 | WRN-28-10 | ~3.5% err | ~2.7% err | ~23% rel. |
| CIFAR-100 | WRN-28-10 | ~18.3% err | ~16.5% err | ~10% rel. |
| ImageNet | ResNet-152 | ~22.0% err | ~20.3% err | ~8% rel. |
| Label noise (40%) | ResNet-32 | substantial drop | recovers | large |
*Values approximate from paper tables; see arXiv:2010.01412 for exact figures.*
Three concerns arise. First, baseline tuning is the perennial trap. The baselines appear reasonably tuned but do not include the strongest competing regularizers (e.g. stochastic weight averaging from Izmailov et al. 2018, or AugMix with tuned schedules). Second, the $\rho$ sweep is shown to matter considerably, with an optimal value that depends on both model and dataset, yet no guidance is offered for selecting $\rho$ without a validation set. This is a hidden cost. Third, no error bars or formal significance tests are reported for most top-line numbers; for a method claiming 1-2% absolute improvements on CIFAR, between-seed variance is non-trivial.
Missing ablations. The critical ablation, isolating whether SAM's gains come from gradient-norm regularization alone, is absent. A clean comparison would train with $L_S(w) + \rho \, \|\nabla L_S(w)\|_2$ as the objective (computed via double backpropagation or finite differences) and measure whether the gap to full SAM is small. Subsequent work has begun to address this, but the original paper did not.
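The ablation objective is cheap to prototype. A sketch of the gradient-norm-penalized objective with its gradient taken by central finite differences on a toy quadratic (double backpropagation would compute the same term exactly; all names illustrative):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])  # Hessian of a toy quadratic loss

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def penalized(w, rho):
    # The ablation objective: L(w) + rho * ||grad L(w)||
    return loss(w) + rho * np.linalg.norm(grad(w))

def penalized_grad_fd(w, rho, h=1e-5):
    # Central finite differences, one coordinate at a time
    out = np.zeros_like(w)
    for i in range(len(w)):
        d = np.zeros_like(w)
        d[i] = h
        out[i] = (penalized(w + d, rho) - penalized(w - d, rho)) / (2 * h)
    return out

w = np.array([1.0, -0.5])
rho = 0.1
fd = penalized_grad_fd(w, rho)

# Analytic check for the quadratic: the penalty's gradient is
# rho * A' (A w) / ||A w||, i.e. rho * A @ (g / ||g||) for symmetric A
g = grad(w)
analytic = g + rho * A @ (g / np.linalg.norm(g))
```

Training against this objective and comparing test error to full SAM is precisely the experiment the original paper omits.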
Limitations and Failure Modes
Beyond the stated limitations (roughly 2x compute overhead, sensitivity to $\rho$), three concrete failure modes deserve attention.
Batch norm interaction. SAM's inner ascent perturbs $w$, which in turn shifts batch statistics in any layer using batch normalization. The gradient at $w + \hat{\epsilon}$ is computed under different running statistics than at $w$, and the interaction between this shift and the batch-norm moving average is not analyzed. On very small batches or with group normalization, the effective perturbation geometry differs substantially.
Distribution shift. The PAC-Bayes argument the authors invoke assumes iid train/test. Under covariate shift or subpopulation shift, flatness at the training distribution does not translate to flatness at the test distribution. Work by Cha et al. (2021, SWAD) and others has shown that flatness-based arguments for domain generalization require distribution-specific notions of sharpness that plain SAM does not provide.
Scale of model. Reported results are dominated by CIFAR-scale and ImageNet-scale ResNets and WideResNets. The behavior of SAM on modern transformer LLMs at the billion-parameter scale is substantially more contested. Bahri et al. (2022) reported SAM gains at T5 scale, but follow-up attempts have been mixed, and the compute overhead at pretraining scale is prohibitive absent efficient approximations. Whether the flatness story holds at scale, and whether it survives fused optimizers, ZeRO sharding, and low-precision training (Kumar et al. 2024), is an open question the original paper could not address.
Mathematical Aside: When Is the Bound Tight?
The first-order approximation is exact only when the loss is linear within the $\rho$-ball. For a twice-differentiable loss with Hessian $H$, the true maximizer satisfies the fixed-point condition

$$\epsilon^* = \rho \, \frac{\nabla L_S(w) + H \epsilon^*}{\|\nabla L_S(w) + H \epsilon^*\|_2}$$
The first-order approximation is tight only when $\rho \, \|H\| \ll \|\nabla L_S(w)\|$. Near convergence, $\|\nabla L_S(w)\|$ vanishes while $\|H\|$ does not, so the approximation degrades precisely where the sharpness story matters most. This is not a minor technicality; it means SAM is most accurate during early training (when it arguably matters least) and least accurate near optima (where the claimed benefit should be maximal).
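The degradation can be observed directly on a two-dimensional quadratic by comparing the first-order surrogate with a brute-force maximization over the $\rho$-sphere (illustrative construction):

```python
import numpy as np

A = np.diag([10.0, 0.1])  # ill-conditioned Hessian: one sharp, one flat direction
rho = 0.1

def loss(w):
    return 0.5 * w @ A @ w

def first_order_max(w):
    # SAM's surrogate: ascend rho along the normalized gradient
    g = A @ w
    return loss(w + rho * g / np.linalg.norm(g))

def true_max(w, n=200_000):
    # Brute-force the inner max over the rho-sphere (2D only)
    th = np.linspace(0.0, 2.0 * np.pi, n)
    pts = w[:, None] + rho * np.stack([np.cos(th), np.sin(th)])
    return np.max(0.5 * np.sum(pts * (A @ pts), axis=0))

# Gradient large and aligned with the sharp direction: approximation is tight
w_far = np.array([1.0, 0.0])
gap_far = true_max(w_far) - first_order_max(w_far)

# Gradient small relative to rho * ||H|| and mostly along the flat direction:
# the surrogate badly undershoots the true worst case
w_near = np.array([0.001, 1.0])
gap_near = true_max(w_near) - first_order_max(w_near)
```

The gap is negligible in the first regime and a large fraction of the surrogate's value in the second, matching the $\rho \|H\| \ll \|\nabla L\|$ tightness condition.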
Impact Trajectory
SAM has generated considerable follow-up work, which tells us something about its productive role even where the original theory is shaky. ASAM, GSAM, LookSAM, Fisher-SAM, PGN, and efficient-SAM all trace back to this formulation. The method has become a de facto regularizer in several domains. But citation count is a lagging indicator of *quality*; it is a leading indicator of *influence*. These are not the same.
The sustainability of this impact will depend on whether the community converges on a *correct* story for why SAM works. If the gradient-norm regularization interpretation holds up, SAM may be superseded by cheaper variants that skip the inner ascent. If the flatness story survives rigorous testing under reparameterization-invariant sharpness measures (Petzka et al. 2021), SAM or ASAM becomes a durable technique. Current evidence leans toward the former, but the question remains live.
What Adjacent Fields Can Offer
Robust optimization offers principled methods for selecting uncertainty sets beyond $\ell_2$ balls, potentially yielding sharper bounds and better performance under known perturbation structure. Stochastic analysis offers tools (e.g. Kramers' rate theory, adapted to neural network settings by Chaudhari and Soatto, 2018) that reframe flatness as escape-rate asymmetry in a continuous-time limit. Information geometry, following Amari's work on natural gradient, suggests that any reparameterization-invariant notion of sharpness must be expressed via the Fisher-information metric rather than the Euclidean metric SAM employs. None of these perspectives is incompatible with SAM, but each would strengthen its foundations.
Questions for the Authors
1. When baselines are tuned to match SAM's total compute budget (double the epochs, tuned LR schedule, gradient clipping), how much of the reported gain persists? A formal compute-matched comparison would be decisive.
2. How do you reconcile the Dinh et al. (2017) reparameterization pathology with your sharpness claim? Specifically, does any of your reported gain survive layer-wise rescaling of trained networks?
3. What is the gap between full SAM and $L_S(w) + \rho \, \|\nabla L_S(w)\|_2$ optimized directly? If the gap is small, the min-max framing is unnecessary.
4. Under $\rho$-tuning, are improvements statistically significant across seeds, or within the noise floor of CIFAR training?
5. At what model scale does SAM's 2x compute overhead become prohibitive, and do efficient approximations (sparse perturbations, stochastic SAM) preserve the claimed benefits at that scale?
Assessment
SAM is a genuinely useful algorithmic idea wrapped in a theoretically loose motivation. Its empirical performance is real but more modest than first impressions suggest once baselines are carefully tuned. Its geometric intuition, that we seek flat minima, is partially right and partially a post-hoc rationalization for what is largely gradient-norm regularization with a particular inner-max implementation. The paper deserves its influence for launching a productive research program; it does not deserve uncritical acceptance of its theoretical framing. The most valuable contribution of this work may ultimately be the follow-up literature it provoked, which is slowly untangling what flatness, gradient norm, and learning-rate schedules each contribute to generalization. That disentanglement is where the real science lies.
