Kolmogorov proved in 1957 that every continuous function $f \colon [0,1]^n \to \mathbb{R}$ admits an exact representation as a finite superposition of continuous univariate functions and addition:

$$f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right).$$

For nearly seven decades this theorem has sat uneasily alongside the universal approximation results of Cybenko (1989) and Hornik et al. (1989): the Kolmogorov-Arnold decomposition is exact, yet the inner functions are typically pathological, non-Lipschitz, and fractal in character, a point Girosi and Poggio (1989) raised as a decisive objection to treating the theorem as a blueprint for neural architectures. Liu et al. (arXiv:2404.19756) revisit this objection and argue, in essence, that Girosi and Poggio were right about the 2-layer, width-$(2n+1)$ case but wrong about the general principle. Their proposal, Kolmogorov-Arnold Networks (KANs), replaces the fixed nonlinearities on nodes in an MLP with learnable univariate functions, parameterized as B-splines and placed on edges, and stacks these edge-function layers to arbitrary depth and width.
The contribution warrants careful scrutiny precisely because the framing is so seductive. Theoretical computer scientists have a weakness for representation theorems that promise structural advantages; lower bounds tell us what is fundamentally impossible, and that is liberating, but upper-bound existence theorems like Kolmogorov-Arnold's are notoriously tricky because the existence is non-constructive and the realizing functions are often inadmissible as learning targets. The review below interrogates whether KAN offers a genuine theoretical and practical departure from MLPs, or whether it is, at core, a particular inductive bias imposed through basis-function parameterization, a move already well-trodden in the approximation theory literature.
1. Summary and Contribution Classification
The paper advances four intertwined claims. First, that replacing scalar activations on nodes with learnable univariate functions on edges yields a model class expressively at least as powerful as MLPs with fixed activations, and empirically more parameter-efficient on function fitting, PDE solving, and symbolic regression tasks. Second, that the resulting networks exhibit favorable scaling laws in the sense of Barron (1993) and Yarotsky (2017) style approximation bounds, specifically a neural scaling exponent the authors report as $\alpha \approx 4$ for test loss versus parameter count, substantially steeper than typical MLP exponents. Third, that KANs are more interpretable: the learned univariate edge functions can be visualized, pruned, and symbolically regressed into closed-form expressions. Fourth, as a case study, that KANs rediscover symbolic laws in toy physics and mathematics problems where MLPs fail to extract comparable structure.
The classification of this contribution is multi-modal and deserves explicit disaggregation. It is partly (b) a new algorithm, since the edge-function parameterization with grid-extended B-splines and an associated pruning procedure constitutes a new training recipe. It is partly (c) an empirical finding, in that the scaling-law claims and symbolic regression successes are empirical regularities absent a formal guarantee. It is only marginally (a) a new theoretical result: the invocation of the Kolmogorov-Arnold theorem is motivational rather than load-bearing, since the authors do not prove that their B-spline parameterization captures the Kolmogorov-Arnold construction in any formal sense. And it is partly (d) engineering refinement, since splines-on-edges as a neural primitive trace back at least to Lane et al. (1991) on spline networks and to the deep literature on radial basis function networks (Poggio and Girosi, 1990). I would rate the contribution as primarily empirical and algorithmic, with the theoretical framing carrying less weight than the authors' abstract suggests.
2. Novelty and Significance Assessment
I rate the novelty as moderate. The individual components are each familiar: learnable basis-function nonlinearities appear in spline networks (Lane, Flax, and Handelman, 1991), in neural ODE parameterizations with learned activations, in the adaptive activation functions of Jagtap et al. (2020), and in Fourier feature networks (Tancik et al. 2020). The Kolmogorov-Arnold motivation itself has been revisited multiple times, notably by Sprecher (1996) and by Kurkova (1991), who gave approximate Kolmogorov-Arnold representations with smoother inner functions. What Liu et al. contribute that is genuinely new is the combination: (i) placing B-spline parameterized univariate functions on every edge rather than every node, (ii) extending depth beyond the classical 2-layer Kolmogorov-Arnold form, and (iii) coupling this parameterization with a pruning-and-symbolic-regression pipeline that elevates visualization to a primary design goal.
Let me be precise about what is and is not new. The Kolmogorov-Arnold theorem gives existence of a depth-2 representation with specific widths. Liu et al. do not implement that representation; they implement a parametric family that contains, in principle, approximations to it, but is strictly broader. Their architecture is therefore better described as *a deep spline network motivated by Kolmogorov-Arnold*, rather than an instantiation of it. This is similar in spirit to how residual networks were motivated by but did not formally instantiate dynamical systems before Chen et al. (2018) made that connection rigorous. A Kolmogorov-Arnold analog of the Neural ODE formalization would be a genuinely transformative result; this paper does not provide it.
Compared to three or four of the most relevant prior works, I see the positioning as follows. Against Yarotsky (2017), who gave tight ReLU approximation rates for smooth function classes, KAN's parameter-efficiency claims are in some sense predictable: piecewise polynomials of degree $k$ should beat piecewise-linear approximants on sufficiently smooth targets, and this has been known in approximation theory since Birkhoff and de Boor. Against Barron (1993), whose bounds for sigmoidal networks give error $O(N^{-1/2})$ on Barron-class functions, KAN's claimed $N^{-4}$ scaling law is strictly better only if the target class is smoother than Barron, which is precisely when splines are expected to shine. Against PINNs (Raissi et al. 2019) on PDE problems, the claimed improvements are plausible but rest on which benchmark PDE is chosen, since stiff and high-dimensional PDEs tend to expose different failure modes. And against recent work on adaptive activations such as Jagtap et al. (2020), KAN is a significantly more flexible parameterization, but at commensurate parameter and compute cost per edge.
The reduction reveals something fundamental: KAN is essentially a high-capacity tensor-product spline model with depth, trained end-to-end. That is neither a trivial nor a revolutionary object. It is a serious engineering artifact whose significance will be determined by how well it scales beyond small scientific-ML problems into the regimes where MLPs have been battle-tested.
3. Technical Analysis
The core computation at a KAN layer takes the form

$$x_{l+1,j} = \sum_{i=1}^{n_l} \phi_{l,j,i}(x_{l,i}),$$

where each edge function $\phi_{l,j,i}$ is parameterized as a linear combination of B-spline basis functions with learnable coefficients, plus a residual SiLU basis. Written in matrix form, layer $l$ is a map $\Phi_l \colon \mathbb{R}^{n_l} \to \mathbb{R}^{n_{l+1}}$ whose entries are univariate functions rather than scalar weights. The total parameter count per layer is $O(n_l \, n_{l+1} (G + k))$, where $G$ is the grid resolution and $k$ the spline order, against $O(n_l \, n_{l+1})$ for a dense MLP layer. At matched width and depth, KAN is therefore roughly $(G + k)$ times more parameter-dense than an MLP.
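To make the edge-function layer concrete, here is a minimal numpy sketch of the computation just described. This is my own illustrative reconstruction, not the authors' `pykan` implementation; `bspline_basis`, `KANLayer`, and the hyperparameter defaults are mine, and the grid handling is deliberately simplified (uniform knots on $[-1,1]$, no grid extension). Note that the basis count per edge comes out to $G + k$, matching the parameter accounting above.

```python
import numpy as np

def bspline_basis(x, grid, k):
    """All order-k B-spline basis functions at points x (Cox-de Boor recursion).

    grid: 1-D knot array (uniform here, extended k knots past each end).
    Returns shape (len(x), len(grid) - k - 1).
    """
    x = x[:, None]
    # degree-0 bases: indicators of the half-open knot intervals
    B = ((x >= grid[:-1]) & (x < grid[1:])).astype(float)
    for d in range(1, k + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)])
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

class KANLayer:
    """One KAN-style layer: a learnable univariate function on every edge.

    Output j is the sum over inputs i of phi_{j,i}(x_i), where each phi is
    spline coefficients times a shared basis, plus a SiLU residual term.
    Each edge carries G + k coefficients, matching the count in the text.
    """
    def __init__(self, n_in, n_out, G=5, k=3, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        self.k = k
        h = 2.0 / G                                   # knot spacing on [-1, 1]
        self.grid = np.arange(-k, G + k + 1) * h - 1.0
        n_basis = len(self.grid) - k - 1              # = G + k
        self.coef = rng.normal(0.0, 0.1, (n_out, n_in, n_basis))
        self.w_res = rng.normal(0.0, 0.1, (n_out, n_in))

    def __call__(self, x):                            # x: (batch, n_in)
        B = np.stack([bspline_basis(x[:, i], self.grid, self.k)
                      for i in range(x.shape[1])], axis=1)  # (batch, n_in, n_basis)
        spline = np.einsum('bip,oip->bo', B, self.coef)
        resid = (x / (1.0 + np.exp(-x))) @ self.w_res.T     # SiLU residual
        return spline + resid
```

A forward pass on a batch in $[-1,1)^{n_{\mathrm{in}}}$ then exercises every edge function at once; the `einsum` contraction is where the per-edge coefficient dot products happen.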
The authors' pitch is that one can compensate by reducing width. This is plausible when the target function has low intrinsic univariate complexity on each edge, but it is not a free lunch. The relevant approximation-theoretic object is the *nonlinear $n$-width* of the target function class under the KAN parameterization. For Sobolev-class targets of smoothness $s$ in dimension $d$, tensor-product splines achieve error $O(N^{-s/d})$ in the number of basis functions $N$, precisely the curse of dimensionality that DeVore et al. (1989) characterized. KAN's depth and composition structure may allow escape from this curse for compositional targets, as Poggio et al. (2017) argued for deep MLPs, but the paper does not prove a depth-separation theorem for KANs analogous to Telgarsky (2016) or Eldan and Shamir (2016).
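The $N^{-s/d}$ rate is worth making numerical, since it is the quantitative content of the curse-of-dimensionality worry (function name mine):

```python
def basis_count_for_error(eps, s, d):
    """Basis functions needed for error ~ eps under the rate error ~ N**(-s/d):
    inverting gives N ~ eps**(-d/s)."""
    return eps ** (-d / s)

# At smoothness s = 2 and 1% target error: ~1e2 basis functions in 2-D,
# ~1e100 in 100-D -- the regime where only compositional structure can help.
```

Any claimed escape from this arithmetic must therefore come from depth and composition, which is exactly the part the paper leaves unproven.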
Assumption audit. I identify three implicit assumptions that are not explicitly stated. First, that the target function is well-approximated by a shallow superposition of smooth univariate functions; this holds for the symbolic and physics benchmarks but can fail spectacularly on high-frequency or discontinuous targets, where the spline grid must be refined aggressively. Second, that the grid of each B-spline can be adapted during training without destabilizing the optimization; the paper's grid-extension procedure is claimed to handle this, but the convergence theory for stochastic gradient descent on piecewise-polynomial bases with changing knots is non-trivial and essentially absent from the analysis. Third, that the training dynamics on KAN are qualitatively similar to those of MLPs, the assumption most exposed to failure, since the loss landscape of spline networks has well-documented pathologies near knot boundaries (Unser, 2019), and the effective learning rate per edge function likely requires basis-specific preconditioning.
Complexity and bounds. At inference, the forward pass costs $O(n_{\mathrm{in}} \, n_{\mathrm{out}} (G + k))$ floating-point operations per layer, which for accuracy comparable to an MLP is not always lower. The authors' claim of superior *parameter efficiency* should not be confused with superior *FLOPs efficiency*; for deployment on memory-bound hardware the distinction matters. A more honest headline would be: KAN trades parameter count for per-edge nonlinear computation, a favorable trade when memory bandwidth dominates but unfavorable when compute does, as on most modern accelerators.
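The parameter-versus-FLOPs accounting reduces to a two-function sketch (function names and the $O(G+k)$-per-edge FLOP proxy are my own simplifications):

```python
def kan_layer_cost(n_in, n_out, G, k):
    """Parameters and rough FLOPs for one KAN layer: (G + k) spline
    coefficients per edge, and O(G + k) work per edge as a FLOP proxy."""
    params = n_in * n_out * (G + k)
    flops = n_in * n_out * (G + k)   # same order as params: the trade is not free
    return params, flops

def mlp_layer_cost(n_in, n_out):
    """A dense MLP layer: one weight and one multiply-add per edge."""
    return n_in * n_out, n_in * n_out
```

At the paper's typical $G = 5$, $k = 3$, a KAN layer carries 8x the parameters and roughly 8x the per-layer compute of a same-width dense layer, so matched-parameter comparisons require the KAN to be correspondingly narrower.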
Scaling law claim. The reported neural scaling exponent of $\alpha \approx 4$ is striking but deserves forensic scrutiny. Chinchilla-style scaling work (Hoffmann et al. 2022) has shown that scaling exponents are sensitive to data, compute, and the choice of fitting window. If the exponent is fit over a narrow range of model sizes (say, 100 to 10,000 parameters), as the problem setting suggests, extrapolation to millions or billions of parameters is unwarranted. The correct way to assess a scaling claim is to fit over at least three orders of magnitude and to bootstrap confidence intervals on the exponent; from the abstract alone, I cannot tell whether this was done. The issue here is structure, not scale, and the paper's strongest empirical claim rests on a regime its experiments do not actually probe.
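The fitting protocol being asked for here, a log-log least-squares fit with a bootstrapped confidence interval on the exponent, is mechanical; a sketch (function name mine):

```python
import numpy as np

def fit_scaling_exponent(n_params, losses, n_boot=2000, seed=0):
    """Fit loss ~ C * N**(-alpha) by least squares in log-log space, and
    bootstrap a 95% CI on alpha by resampling (N, loss) pairs with
    replacement."""
    logN, logL = np.log(n_params), np.log(losses)
    alpha = -np.polyfit(logN, logL, 1)[0]
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(logN), size=(n_boot, len(logN)))
    boots = np.array([-np.polyfit(logN[i], logL[i], 1)[0] for i in idx])
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return alpha, (lo, hi)
```

Run over a wide window the CI tightens around the true exponent; run over a narrow window it widens dramatically, which is exactly the diagnostic the paper should report.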
4. Experimental Assessment
The experimental regime, as reported, spans low-dimensional function fitting, symbolic regression on physics-derived expressions, PDE solving on canonical benchmarks, and small-scale classification. This is the natural testbed for the paper's claims, but also the regime most hospitable to spline-based methods. A disciplined review must therefore distinguish *in-distribution* evidence, where KAN is expected to excel, from *out-of-distribution* evidence, which would constitute a broader endorsement.
Baseline adequacy. The relevant baselines include: (i) a tuned MLP of matched parameter or FLOP count, (ii) PINNs with adaptive activations (Jagtap et al. 2020), (iii) SIREN (Sitzmann et al. 2020) for implicit representation tasks, (iv) symbolic regression systems such as Eureqa or PySR (Cranmer, 2023) on the symbolic benchmarks, and (v) Gaussian processes on the low-dimensional fitting tasks. Whether the paper covers all of these fairly, with comparable hyperparameter search budgets, is a critical question. Anecdotal reporting in the post-publication discussion of this paper suggests that MLP baselines were not always tuned to parity, and in particular that depth-width tradeoffs, learning rate schedules, and weight initialization were not always matched. Without matched tuning, claims of a 10-100x parameter reduction are compromised.
Ablation gap. The missing ablations I would demand before endorsing the paper's claims are: (1) an ablation on grid resolution holding parameter count constant, to disentangle the role of basis capacity from the role of architecture; (2) an ablation replacing B-splines with alternative bases (Fourier, Chebyshev, piecewise-linear), to isolate whether B-splines specifically matter or whether any smooth learnable univariate function suffices; (3) an ablation on the residual SiLU component, since without it KAN may simply fail to train, in which case KAN is really MLP + spline correction, not a replacement; and (4) depth-width ablations at fixed parameter count, to locate the efficient frontier. Without (2), the paper cannot distinguish a *Kolmogorov-Arnold* effect from a *spline parameterization* effect. Without (3), the causal attribution of any gain to edge-function learning rather than to a well-initialized baseline nonlinearity is ambiguous.
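Ablation (2) is cheap to set up, because only the univariate basis changes while the edge-function interface stays fixed; a sketch of two drop-in alternatives (function names mine):

```python
import numpy as np

def chebyshev_basis(x, n_basis):
    """Chebyshev polynomials T_0 .. T_{n_basis-1} evaluated on [-1, 1]."""
    return np.polynomial.chebyshev.chebvander(x, n_basis - 1)

def fourier_basis(x, n_basis):
    """1, cos(pi x), sin(pi x), cos(2 pi x), ... truncated at n_basis columns."""
    cols, k = [np.ones_like(x)], 1
    while len(cols) < n_basis:
        cols.append(np.cos(k * np.pi * x))
        if len(cols) < n_basis:
            cols.append(np.sin(k * np.pi * x))
        k += 1
    return np.stack(cols, axis=-1)
```

An edge function is then `basis(x) @ coef` in every case, so any KAN-over-MLP gain that survives the basis swap cannot be attributed to B-splines, or to the Kolmogorov-Arnold framing, specifically.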
Statistical rigor. Reporting on seeds, error bars, and significance tests is the standard Area Chair checkpoint. Claims of 10x parameter reduction based on single-seed runs are, in my experience as an Area Chair, routinely overstated by 2-5x once multiple seeds are run. Unless the paper reports at least 5 seeds with 95% confidence intervals on the test losses, the magnitudes should be treated as point estimates with substantial uncertainty.
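The minimum reporting standard argued for here is a one-liner (function name mine; normal approximation assumed, adequate at 5+ seeds):

```python
import numpy as np

def seed_ci(test_losses, z=1.96):
    """Mean test loss across seeds with a normal-approximation 95% CI."""
    a = np.asarray(test_losses, dtype=float)
    m = a.mean()
    half = z * a.std(ddof=1) / np.sqrt(a.size)
    return m, (m - half, m + half)
```

Two methods whose CIs overlap at this level should not be ranked, let alone headlined as a 10x improvement.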
Headline empirical claims (as interpreted from the paper's framing).
| Claim | Evidence strength | Caveats |
|---|---|---|
| Fewer parameters than MLP at equal accuracy on function fitting | Moderate | Depends on target smoothness; in-distribution favorable |
| Favorable neural scaling exponent ($\alpha \approx 4$) | Weak | Fit window narrow; extrapolation to large scale unverified |
| Rediscovers symbolic laws in physics benchmarks | Moderate | Benchmarks favor symbolic structure; baselines like PySR not always matched |
| Superior interpretability via edge-function visualization | Moderate | Interpretability is qualitative; no quantitative interpretability metric reported |
| Improved PDE solving vs PINNs | Moderate | Specific to low-dimensional smooth PDEs; stiff and high-dimensional cases unexplored |
5. Limitations and Failure Modes
The paper acknowledges compute overhead and relatively slow training. Beyond these, I identify concrete failure modes the authors likely did not address.
First, high-dimensional input regimes. KAN's edge count scales as $n_{\mathrm{in}} \times n_{\mathrm{out}}$ per layer, and each edge carries a full spline. For an input dimension of 1000, even a modest hidden width blows up the parameter budget, negating the efficiency claim. On CIFAR-scale image data, the natural KAN formulation would require convolutional adaptations with shared edge functions, which is non-trivial and not, to my knowledge, thoroughly studied in the main paper. The approach would likely fail when input dimension exceeds a few hundred and the target has no compositional low-rank structure.
Second, non-smooth and discontinuous targets. B-splines of order $k$ are $C^{k-1}$-smooth by construction. Targets with jumps, such as indicator functions or adversarial decision boundaries, require the spline grid to refine locally near discontinuities, a behavior that gradient descent with standard coefficient regularization on uniform-knot B-splines does not naturally produce. This will manifest as Gibbs-like oscillations and degraded generalization, the very pathology that ReLU networks were celebrated for avoiding.
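The Gibbs-type pathology is easy to exhibit. As a stand-in for a fixed-knot spline basis, the sketch below least-squares-fits a truncated Fourier basis (another smooth global basis with no local refinement) to a step function; the fit overshoots the target's range near the jump rather than sharpening it. The construction and all names are mine:

```python
import numpy as np

# Target: a unit step. Basis: truncated odd Fourier series, standing in for
# any smooth global basis that cannot refine locally at a discontinuity.
x = np.linspace(-1.0, 1.0, 2001)
target = np.sign(x)
basis = np.stack([np.sin((2 * j + 1) * np.pi * x) for j in range(15)], axis=1)
coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
fit = basis @ coef
# the fit exceeds the target's maximum of 1 near the jump (Gibbs overshoot)
overshoot = fit.max() - target.max()
```

Adding more basis terms narrows the oscillation but does not eliminate the overshoot; only local knot refinement does, which is exactly the mechanism the training procedure does not provide.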
Third, adversarial robustness. MLPs with ReLU activations have been extensively studied for adversarial vulnerability (Szegedy et al. 2014; Goodfellow et al. 2015). The high-curvature regions of learned B-splines create sharp local Lipschitz constants that may amplify adversarial perturbations. A formal analysis of KAN's Lipschitz constant per edge, or an empirical adversarial robustness benchmark against PGD (Madry et al. 2018), is conspicuously absent.
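The empirical side of the missing Lipschitz analysis is straightforward to carry out; a crude finite-difference sketch (function name mine):

```python
import numpy as np

def edge_lipschitz(phi, lo=-1.0, hi=1.0, n=10001):
    """Empirical Lipschitz constant of a univariate edge function phi:
    the largest absolute finite-difference slope over a dense grid."""
    x = np.linspace(lo, hi, n)
    y = phi(x)
    return float(np.max(np.abs(np.diff(y) / np.diff(x))))
```

Applied to trained edge splines across grid resolutions $G$, this would quantify the curvature-amplification concern directly; a layer-level bound follows by summing per-edge constants into each output.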
Fourth, optimization pathologies. Spline coefficients are not scale-invariant under input shifts, and the grid-extension procedure introduces non-differentiable transitions in the optimization trajectory. This approach would likely fail when trained with Adam at default hyperparameters on non-stationary data distributions, since the moment estimates cannot adapt quickly to knot additions. A rigorous convergence analysis, even for convex surrogates, is missing.
Fifth, reproducibility under distribution shift. The symbolic regression successes depend on the ground-truth formula lying in the closure of representable functions. Shift the data-generating process off-manifold, for example, by injecting structured noise or by drawing PDE coefficients from a different distribution than the training regime, and symbolic recovery should degrade. The paper does not, to my knowledge, stress-test symbolic recovery under such shifts.
6. Questions for Authors
1. In the scaling-law experiments, over how many orders of magnitude in parameter count was the exponent fit, and what is the bootstrap confidence interval on $\alpha$? How does the exponent change when the fitting window is shifted?
2. How does KAN compare to a matched-parameter MLP with adaptive activations (Jagtap et al. 2020) or to a Fourier-feature MLP (Tancik et al. 2020) on the same benchmarks? Specifically, can you attribute the gains to the Kolmogorov-Arnold structure rather than to basis-function capacity?
3. What is the empirical Lipschitz constant of a trained KAN layer, and how does it scale with grid resolution $G$? How does this affect adversarial robustness under PGD attack?
4. Can you provide a depth-separation result, analogous to Telgarsky (2016), exhibiting a function class where depth-2 KANs fail but depth-3 KANs succeed? Or, conversely, a width-separation result?
5. On a truly high-dimensional task, say MNIST at 784 input dimensions without convolutional structure, how does KAN compare to a matched-parameter MLP in terms of final accuracy, training wall-clock, and parameter count?
7. Verdict
My assessment is that this paper constitutes a borderline accept at a top-tier venue, leaning toward accept as an empirical and algorithmic contribution, with reservations about the theoretical framing. The algorithmic artifact is genuinely interesting and opens productive engineering directions. The interpretability angle is likely to inspire follow-up work in scientific machine learning, where recovery of symbolic structure is a first-class objective. The theoretical framing via Kolmogorov-Arnold is evocative but not load-bearing, and I would encourage the authors to rewrite the theoretical sections with sharper claims, either proving a formal approximation-theoretic separation from MLPs or acknowledging that the theorem is motivational only.
The community should absorb this work neither as a replacement for MLPs at transformer-scale, which the paper does not demonstrate, nor as a mere curiosity, which would understate the genuine algorithmic novelty. KAN is best understood as a *scientific-ML primitive*: well-suited to low-dimensional, smooth, compositional problems where interpretability and parameter frugality outweigh raw throughput. Whether the primitive scales to general deep learning remains an open empirical question, and the answer is not foreordained by the Kolmogorov-Arnold theorem. The right abstraction makes the problem trivial, and finding it is the hard part; the jury is still out on whether KAN is the right abstraction for deep learning or merely an elegant reparameterization of spline regression with residual connections.
An accepted version should, at minimum, include matched-tuning MLP baselines, multi-seed error bars, a basis-function ablation separating Kolmogorov-Arnold effects from spline-parameterization effects, a Lipschitz and adversarial robustness analysis, and at least one experiment at input dimension of several hundred or more without convolutional assistance. With these additions, the paper transitions from a promising proposal to a durable contribution.
8. Reproducibility and Sources
Primary paper. Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T. Y., and Tegmark, M. *KAN: Kolmogorov-Arnold Networks*. arXiv:2404.19756, 2024.
Code repository. The authors released an official implementation under the pykan package on the primary author's GitHub, referenced in the paper. I do not reproduce URLs here; the repository is discoverable via the arXiv landing page.
Datasets. Benchmarks include synthetic function-fitting problems with analytic ground truth, symbolic regression targets drawn from the Feynman Lectures corpus (Udrescu and Tegmark, 2020, *AI Feynman*), canonical PDE benchmarks (Poisson, Burgers, and heat equations with known solutions), and small-scale classification data. All are either synthetic or openly available through standard scientific-ML benchmark suites.
Reproducibility ratings (1-5).
| Axis | Rating | Justification |
|---|---|---|
| Code availability | 4 | Official repository released and actively maintained |
| Data availability | 5 | Synthetic or open benchmarks |
| Experimental detail | 3 | Hyperparameters and seed counts not uniformly reported across experiments; grid-extension schedule requires careful reading |
Selected prior works referenced. Cybenko (1989); Hornik, Stinchcombe, and White (1989); Kolmogorov (1957); Arnold (1957); Girosi and Poggio (1989); Barron (1993); Yarotsky (2017); Telgarsky (2016); Eldan and Shamir (2016); Poggio et al. (2017); Sprecher (1996); Kurkova (1991); Lane, Flax, and Handelman (1991); Poggio and Girosi (1990); Jagtap, Kawaguchi, and Karniadakis (2020); Tancik et al. (2020); Sitzmann et al. (2020); Raissi, Perdikaris, and Karniadakis (2019); DeVore, Howard, and Micchelli (1989); Unser (2019); Chen et al. (2018); Hoffmann et al. (2022); Udrescu and Tegmark (2020); Cranmer (2023); Szegedy et al. (2014); Goodfellow, Shlens, and Szegedy (2015); Madry et al. (2018).
