1. Summary & Contribution Classification
The paper revisits Hinton's Forward-Forward (FF) algorithm [Hinton, 2022; arXiv:2212.13345] and argues that the choice of *goodness function*, the local scalar each layer maximizes on positive data and minimizes on negative data, has been under-examined. Since Hinton's original proposal, nearly every FF derivative has defaulted to the sum-of-squares (SoS) of post-activations, $G(\mathbf{y}) = \sum_j y_j^2$, treating it as a fixed design decision rather than a tunable hyperparameter of the learning rule. The authors decompose goodness along two orthogonal axes: *which* activations to measure, and *how* to aggregate them. They then propose two variants: a hard top-k goodness, $G_k(\mathbf{y}) = \sum_{j \in \mathrm{top}\text{-}k(\mathbf{y})} y_j^2$, and a softer entmax-weighted energy that replaces the combinatorial top-k with a learnable sparse weighting [Peters et al. 2019]. The headline claim is a 22.6 percentage-point improvement on Fashion-MNIST over the SoS baseline.
The central conceptual move is to reinterpret goodness not as an energy but as a *selective* measurement over a subset of neurons, rendering the goodness function itself a site of inductive bias. The entmax variant smooths the non-differentiable top-k into a learnable sparsity regularizer with a closed-form Jacobian, which matters because FF's layer-local gradient estimator cannot afford the cost of straight-through approximations bleeding across layers.
I classify this as primarily (b) a new algorithm with substantial (c) empirical findings, bordering on an (d) engineering refinement of an existing framework. It is not a new theoretical result: the paper does not, as abstracted, offer a convergence proof, a sample-complexity bound, or a formal characterization of when sparse goodness dominates dense goodness. That gap is the principal weakness, and I will return to it.
2. Novelty & Significance
I rate the novelty as moderate. The ingredients, top-k selection, entmax parameterization, and the FF substrate, are each established. What is genuinely new is their composition and the empirical demonstration that the choice of goodness aggregator, long treated as incidental, exerts a first-order effect on FF's learnability.
The relevant prior-art cone comprises at least five threads. First, Forward-Forward itself [Hinton, 2022] and its successors, including layer-wise contrastive variants and CaFo-style cascades, all of which inherit SoS. Second, k-sparse autoencoders [Makhzani & Frey, 2014; arXiv:1312.5663], where hard top-k activation was shown to yield better unsupervised representations than $L_1$ penalties; the present work effectively imports that design choice into the goodness rather than the activation, which is a nontrivial and interesting transposition. Third, the sparsemax / entmax line [Martins & Astudillo, 2016; Peters, Niculae & Martins, 2019], which established that $\alpha$-entmax interpolates between softmax ($\alpha = 1$) and sparsemax ($\alpha = 2$) with exact zeros and a tractable Jacobian; the entmax-weighted goodness is a direct application. Fourth, sparse coding and divisive normalization [Olshausen & Field, 1996; Heeger, 1992], which have long argued that selective, competitive activations produce more informative codes, a claim the FF literature had essentially ignored. Fifth, local learning rules as backprop alternatives, including target propagation [Bengio, 2014; Lee et al. 2015], feedback alignment [Lillicrap et al. 2016], and equilibrium propagation [Scellier & Bengio, 2017], which furnish the comparative landscape against which any new FF variant must be evaluated.
What is *new*: the observation that FF's goodness function admits a rich design space and that hard selective measurement is not a minor tweak but a 22pp regime-shifter on a standard benchmark. What is *known*: that sparsity aids representations, that top-k is a viable sparsifier, and that entmax provides a differentiable relaxation. The contribution is thus a well-chosen recombination rather than a foundational advance, and its significance hinges almost entirely on whether the reported gains are real, general, and mechanistically explained.
3. Technical Analysis
Problem Setup
In FF, each layer receives an input $\mathbf{x}$ and produces an activation $\mathbf{y} = f(W\mathbf{x})$. The local objective is typically
$$\mathcal{L} = \log\!\left(1 + e^{-(G(\mathbf{y}^+) - \theta)}\right) + \log\!\left(1 + e^{\,G(\mathbf{y}^-) - \theta}\right),$$
where $\mathbf{y}^+$ and $\mathbf{y}^-$ are the activations for positive and negative samples, $\theta$ is a threshold, and $G$ is the goodness. With $G(\mathbf{y}) = \sum_j y_j^2$, gradients scale as $\partial G / \partial y_j = 2 y_j$ for every neuron, yielding a dense, uniform learning signal.
Replacing $G$ with $G_k(\mathbf{y}) = \sum_{j \in \mathrm{top}\text{-}k(\mathbf{y})} y_j^2$ restricts the gradient to the top-k active units:
$$\frac{\partial G_k}{\partial y_j} = 2 y_j \,\mathbb{1}\!\left[j \in \mathrm{top}\text{-}k(\mathbf{y})\right].$$
Here is where the interesting theory resides. The indicator induces a piecewise-defined loss surface with $\binom{n}{k}$ smooth pieces for a width-$n$ layer, and the boundaries between pieces correspond to activation ties. Generically, on any open set away from ties, $G_k$ is smooth, and the gradient update is identical to an SoS update restricted to a subspace. The entmax variant smooths these boundaries through a probability simplex projection, converting the combinatorial geometry into a convex one.
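A minimal NumPy sketch of the two aggregators as defined above; the function names are mine, not the paper's:

```python
import numpy as np

def sos_goodness(y):
    # Dense sum-of-squares goodness: every unit contributes 2*y_j
    # to the local gradient.
    return float(np.sum(y ** 2))

def topk_goodness(y, k):
    # Selective goodness: only the k largest activations contribute;
    # all other units receive zero gradient through G_k.
    return float(np.sum(np.partition(y, -k)[-k:] ** 2))

y = np.array([3.0, 0.1, 2.0, 0.2, 1.0])
g_dense = sos_goodness(y)        # 14.05
g_sparse = topk_goodness(y, 2)   # 3^2 + 2^2 = 13.0
```

Note that $G_k \le G$ always, which is why the threshold calibration issue raised below is not cosmetic.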
Implicit Assumptions
I flag three assumptions the abstract does not surface.
Assumption A1 (stability of the top-k set). The algorithm presumes that the identity of the top-k neurons is reasonably stable across mini-batches for a given class; otherwise the layer receives contradictory credit-assignment signals across batches. Under mode collapse or high-variance representations early in training, this can fail catastrophically: the same input may route gradient to disjoint neuron sets across epochs, producing diffusion rather than learning.
Assumption A2 (calibration of the threshold $\theta$). SoS goodness grows as $O(n)$ with layer width $n$, so $\theta$ is typically tuned per width. Top-k goodness grows as $O(k)$ in the top quantile, which decouples the threshold from width only if $k$ is chosen proportional to some fixed fraction of $n$. If the authors held $\theta$ constant across layers of differing width, the effective comparison to SoS is not scale-matched, and the 22.6pp gap partially reflects a better-calibrated threshold rather than a better aggregator.
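A quick numerical check of the scale mismatch A2 alleges, assuming i.i.d. ReLU-of-Gaussian activations (my simplification, not the paper's setup): SoS goodness roughly quadruples when width quadruples, while top-10 goodness grows far more slowly, so one $\theta$ cannot be calibrated for both.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_goodness(width, k, trials=500):
    # Average SoS and top-k goodness over random ReLU activations.
    y = np.maximum(rng.standard_normal((trials, width)), 0.0)
    sos = float(np.mean(np.sum(y ** 2, axis=1)))
    topk = float(np.mean(np.sum(np.partition(y, -k, axis=1)[:, -k:] ** 2, axis=1)))
    return sos, topk

sos_128, topk_128 = mean_goodness(128, k=10)
sos_512, topk_512 = mean_goodness(512, k=10)

ratio_sos = sos_512 / sos_128     # close to 4: tracks width
ratio_topk = topk_512 / topk_128  # much closer to 1
```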
Assumption A3 (the negative-data distribution is informative under sparsity). FF's negative data are usually hybrid images or label-mismatched inputs. SoS measures total activation energy, which is well defined regardless of the structure of negatives. Top-k measures peak activation energy, which is meaningful only if negatives produce *differently shaped* top-k sets, not merely lower-energy ones. The paper does not, as abstracted, analyze what the negative-data marginal over top-k indices looks like.
Complexity and Optimization
Top-k can be computed in $O(n)$ via quickselect or $O(n \log k)$ with a heap, so the per-step cost is unchanged from SoS. The entmax projection requires solving a 1D root-finding problem for the threshold parameter; Peters et al. give an exact $O(n \log n)$ algorithm via sorting, which is dominated by the matrix multiply. Neither variant, then, adds meaningful wall-clock cost.
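For concreteness, a sketch of the sorting-based simplex projection for sparsemax, the $\alpha = 2$ member of the entmax family (the general-$\alpha$ case needs the root-finding step mentioned above; this is the closed-form special case):

```python
import numpy as np

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex
    # (Martins & Astudillo, 2016), O(n log n) from the sort.
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    # Largest k with z_sorted[k-1] * k > cumsum[k-1] - 1 defines the support.
    support = z_sorted * ks > (cumsum - 1.0)
    k_star = int(ks[support][-1])
    tau = (cumsum[k_star - 1] - 1.0) / k_star
    return np.maximum(z - tau, 0.0)

p = sparsemax(np.array([2.0, 1.2, 1.1, -0.5]))  # exact zeros in the tail
```

The sparse weights can then define an entmax-style goodness such as $\sum_j p_j y_j^2$, whose Jacobian is supported only on the active set.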
Convergence is the harder question. FF with SoS has no known global convergence guarantee; it is a coordinate-ascent-style scheme on a non-convex local objective. Restricting to top-k renders the per-step update a *subgradient* step when the active set changes, and subgradient descent on non-smooth, non-convex objectives has, at best, convergence to approximate stationary points under strong conditions. The entmax smoothing recovers differentiability almost everywhere and should inherit the same weak guarantees as softmax-parameterized objectives. No proof is offered, and none would be easy to obtain; such an analysis would reveal something fundamental, but the paper does not attempt it.
4. Experimental Assessment
Headline Metric
| Metric | Reported | Concern |
|---|---|---|
| Fashion-MNIST accuracy gain (top-k vs SoS) | +22.6 pp | Baseline calibration, single dataset |
| Datasets evaluated (from abstract) | Fashion-MNIST only | No MNIST, CIFAR-10, or tabular controls surfaced |
| Architectures | Not specified | Width, depth, and $k$ all unclear |
| Seeds / error bars | Not specified | Statistical significance unverifiable |
A 22.6 percentage-point gap on Fashion-MNIST is enormous. For context, the difference between a well-tuned MLP and a well-tuned CNN on Fashion-MNIST is roughly four to six points, and the gap between FF with SoS and standard backprop in Hinton's original paper is about 1.5 points on MNIST. A 22pp swing from changing an aggregator alone is plausible *only* if the SoS baseline is poorly calibrated or structurally pathological in ways the sparse variant incidentally repairs. A fairer baseline would be SoS with per-layer $\theta$ tuned via validation, SoS with layer-wise normalization, and SoS with activation regularization, each of which can absorb some of the benefit top-k appears to confer.
Missing Ablations
The critical ablation, absent from what the abstract reports, is the sensitivity of accuracy to $k$. If accuracy is flat across a broad range of $k$, the story is about selection in general; if it peaks sharply at a single value, the story is about a specific inductive bias. Equally important is a width ablation: does the gain persist as $n \to \infty$ at fixed $k$, fixed $k/n$, or neither? Without this, we cannot distinguish a genuine sparsity effect from implicit regularization via reduced effective capacity.
A second missing control is the dense-with-temperature baseline: $G_\tau(\mathbf{y}) = \sum_j \operatorname{softmax}(\mathbf{y}/\tau)_j \, y_j^2$, which recovers SoS (up to a $1/n$ factor) as $\tau \to \infty$ and something close to top-1 as $\tau \to 0$. If this continuous interpolation tracks the entmax result, the contribution collapses into a specific choice on a one-parameter curve, and the entmax framing becomes decorative.
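The proposed control is a few lines to implement; a sketch, with `temp_goodness` as my illustrative name:

```python
import numpy as np

def temp_goodness(y, tau):
    # Softmax-weighted energy. As tau -> inf the weights become uniform
    # (SoS up to a 1/n factor); as tau -> 0 they collapse onto the argmax.
    w = np.exp(y / tau - np.max(y / tau))
    w /= w.sum()
    return float(np.sum(w * y ** 2))

y = np.array([3.0, 1.0, 0.5])
g_cold = temp_goodness(y, tau=1e-3)  # ~ max(y)^2 = 9.0
g_hot = temp_goodness(y, tau=1e3)    # ~ mean(y^2)
```

Sweeping `tau` on a log grid gives exactly the one-parameter curve against which the entmax result should be compared.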
Reproducibility Flags
The abstract does not commit to released code, specific architectures, or hyperparameter schedules. FF is notoriously sensitive to layer-wise normalization, negative-data construction, and threshold tuning. Absent those details, reproducing a 22.6pp gap would require a search over a joint space larger than the reported effect itself. I could not reproduce this without substantially more specification.
5. Limitations & Failure Modes
Failure mode 1: Natural images with distributed codes. Fashion-MNIST admits highly class-discriminative, spatially concentrated activation patterns at low resolution. On CIFAR-10 or ImageNet, where natural image statistics encourage distributed population codes, top-k goodness may discard most of the relevant signal. Concretely, if class information is spread over 100 units in a 512-unit layer, top-10 goodness sheds 90% of the discriminative variance. The paper needs at least one dataset where class identity is *known* to require distributed representations in order to falsify this failure mode.
Failure mode 2: Class imbalance and tail classes. Top-k sharpens the competition among neurons. Under heavy class imbalance, the winning neurons will be dominated by majority classes, and minority-class gradient signals may be starved entirely. This is structurally analogous to the long-tail failure of hard-attention mechanisms [Jang et al. 2017; arXiv:1611.01144]. Fashion-MNIST is perfectly balanced, so this mode is invisible in the reported evaluation.
Failure mode 3: Adversarial and distribution-shift robustness. Selective aggregation concentrates the decision boundary on a small neuron subset; small perturbations that flip the top-k set produce large goodness swings. I would conjecture, without testing, that top-k FF is *more* adversarially fragile than SoS FF, not less, and that the gap widens with smaller $k$. Entmax partially mitigates this because the weights are continuous in $\mathbf{y}$, but only partially.
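A toy illustration of the conjectured fragility (my construction, not an experiment from the paper): when two units are nearly tied at the $k$-th rank, a perturbation of magnitude 0.002 swaps which one enters the top-k set, rerouting that slot's entire gradient to a different neuron.

```python
import numpy as np

def topk_set(y, k):
    # Indices of the k largest activations: the units that receive gradient.
    return set(np.argpartition(y, -k)[-k:].tolist())

y = np.array([5.0, 3.001, 3.000, 0.1])     # units 1 and 2 nearly tied
eps = np.array([0.0, -0.002, 0.002, 0.0])  # tiny perturbation

s_before = topk_set(y, 2)        # {0, 1}
s_after = topk_set(y + eps, 2)   # {0, 2}
```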
Failure mode 4: Biological plausibility is overstated. FF is often defended on grounds of biological plausibility, yet a global top-k operation requires each neuron to compare its activation to the -th largest across the layer, which is a non-local computation unless implemented via lateral inhibition with carefully tuned time constants. The paper does not, as abstracted, engage with whether sparse goodness preserves or undermines the original motivation.
Failure mode 5: Interaction with negative data. FF's negative-data construction has always been fragile. Under top-k goodness, negatives that happen to activate a *different* top-k set than positives will be suppressed more effectively, but negatives that partially overlap in the top-k may be pushed in inconsistent directions. A diagnostic experiment measuring the Jaccard overlap of top-k sets between positives and their paired negatives would directly probe this.
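That diagnostic is cheap to run; a sketch, with all names and toy values mine:

```python
import numpy as np

def topk_indices(y, k):
    return set(np.argpartition(y, -k)[-k:].tolist())

def jaccard(a, b):
    # Overlap of two top-k index sets: 1.0 = identical, 0.0 = disjoint.
    return len(a & b) / len(a | b)

# A positive activation vector and its paired negative (toy values).
y_pos = np.array([4.0, 3.0, 0.2, 0.1, 0.0])
y_neg = np.array([0.1, 3.5, 2.0, 0.2, 0.1])

overlap = jaccard(topk_indices(y_pos, 2), topk_indices(y_neg, 2))
# {0, 1} vs {1, 2} -> 1/3: the partial-overlap regime where gradient
# pressure on shared units is inconsistent
```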
6. Questions for Authors
1. How was the SoS baseline tuned? Specifically, were $\theta$, the learning rate, and any normalization re-tuned for SoS with the same budget granted to top-k, or were SoS settings inherited from prior work? A 22.6pp gap warrants explicit evidence of baseline parity.
2. What is the accuracy as a function of $k$ on a log-spaced sweep? Does the improvement persist at $k = n$ (which should recover SoS up to a constant), and if so, why?
3. Does the method transfer to CIFAR-10 or a dataset with known distributed population codes? If not, how do you distinguish *general sparsity benefits* from *Fashion-MNIST-specific inductive alignment*?
4. For the entmax-weighted energy, what $\alpha$ was used, and how sensitive is accuracy to it? Does learning $\alpha$ jointly with the layer parameters converge, and to what value?
5. Under negative-data construction that overlaps heavily with positives in activation support, does the top-k variant still separate positive and negative goodness, or does it collapse? A controlled study with synthetically constructed near-positive negatives would be informative.
7. Verdict
At a venue with NeurIPS-level selectivity, my recommendation would be weak reject, encouraging revision. The core idea is well motivated and the empirical signal is striking, but the contribution as abstracted has three gaps a senior reviewer would insist on closing before acceptance. First, the dataset coverage is too narrow: Fashion-MNIST alone cannot carry a claim about goodness-function design. Second, the 22.6pp gap is implausibly large absent a baseline-calibration audit, and the paper does not, as abstracted, demonstrate that audit. Third, the entmax construction is presented alongside top-k but without the comparative analysis that would establish when each should be preferred. The theoretical grounding, why selective measurement should help in FF specifically, is under-developed; an analysis of when sparse goodness is provably necessary would elevate this from a refinement to a contribution.
A lower bound tells us what is fundamentally impossible, and that is liberating. Here, the missing lower bound runs in the opposite direction: a characterization of the regime in which SoS is provably suboptimal, or a sample-complexity separation between the two aggregators. Without it, the paper reports a benchmark win on a small dataset with an aggregator change whose mechanism is plausibly but not rigorously understood. With a proper baseline audit, a cross-dataset replication, and a sensitivity analysis over $k$ and $\alpha$, this becomes a solid paper. Without them, it joins the class of appealing empirical findings that do not survive replication, of which the FF literature already harbors too many.
The right abstraction, selection as a first-class axis in goodness design, is the genuine insight. Finding it is indeed the hard part, and the authors deserve credit for naming the axis. The proof that it matters, in generality and not merely on Fashion-MNIST, is still owed.
8. Reproducibility & Sources
Primary paper. *Sparse Goodness: How Selective Measurement Transforms Forward-Forward Learning*, arXiv:2604.13081v1 [cs.LG], April 2026.
Code repository. Not indicated in the abstract. Status: no official code released, to my knowledge.
Datasets. Fashion-MNIST [Xiao, Rasul & Vollgraf, 2017; arXiv:1708.07747], publicly available via standard ML dataset loaders. No other datasets surfaced in the abstract.
Reproducibility ratings (1 = worst, 5 = best).
| Dimension | Rating | Justification |
|---|---|---|
| Code availability | 1 | No repository indicated in abstract |
| Data availability | 5 | Fashion-MNIST is a standard public dataset |
| Experimental detail | 2 | Abstract omits architecture, schedule, $\theta$, seeds, and baseline-tuning protocol |
Key prior works referenced in this review. Hinton (2022, arXiv:2212.13345) on Forward-Forward; Makhzani & Frey (2014, arXiv:1312.5663) on k-sparse autoencoders; Martins & Astudillo (2016) on sparsemax; Peters, Niculae & Martins (2019) on entmax; Olshausen & Field (1996) on sparse coding; Bengio (2014) and Lee et al. (2015) on target propagation; Lillicrap et al. (2016) on feedback alignment; Scellier & Bengio (2017) on equilibrium propagation; Jang et al. (2017, arXiv:1611.01144) on Gumbel-softmax; Xiao, Rasul & Vollgraf (2017, arXiv:1708.07747) on Fashion-MNIST.
