KAN

Kolmogorov-Arnold Representation Theorem (KART)

KART states that any multivariate continuous function \(f(x_1, \ldots, x_N)\) can be represented as a finite composition of continuous univariate functions and addition:

\[f(\boldsymbol{x}) = \sum_{q=1}^{2N+1} \Phi_q \left( \sum_{p=1}^{N} \phi_{q,p}(x_p) \right),\]

where \(\phi_{q,p} : [0,1] \rightarrow \mathbb{R}\) and \(\Phi_q : \mathbb{R} \rightarrow \mathbb{R}\) are continuous functions. Although KART guarantees that such an exact representation exists, the inner and outer functions it provides can be non-smooth or even pathological, and therefore hard to learn in practice. The practical question is how to approximate them with smooth, learnable functions.
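
For example, with \(N = 2\) the representation uses at most \(2N + 1 = 5\) outer functions:

\[f(x_1, x_2) = \sum_{q=1}^{5} \Phi_q\left(\phi_{q,1}(x_1) + \phi_{q,2}(x_2)\right),\]

so a continuous function of two variables is built from \(5 \times 2 = 10\) inner and \(5\) outer univariate functions, combined only by addition.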

Kolmogorov-Arnold Network (KAN)

Liu et al. introduced KANs as a practical realization of KART, generalizing it to deep and wide architectures. Each variational activation function in a KAN is a learnable univariate function parameterized by B-splines, piecewise polynomials that can approximate any continuous function on their grid to arbitrary precision as the grid is refined.
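
As a rough, self-contained sketch of this idea (not the QKAN implementation; the names bspline_basis and SplineActivation below are made up for illustration), a single learnable activation can be written as a trainable linear combination of fixed B-spline basis functions evaluated with the Cox-de Boor recursion:

import torch


def bspline_basis(x, grid, k):
    # Cox-de Boor recursion: degree-k B-spline bases on the knot vector `grid`.
    # x: (n,) inputs; grid: (m,) knots; returns (n, m - k - 1) basis values.
    x = x.unsqueeze(-1)                                    # (n, 1)
    bases = ((x >= grid[:-1]) & (x < grid[1:])).float()    # degree-0 indicators
    for d in range(1, k + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)]) * bases[:, :-1]
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d]) * bases[:, 1:]
        bases = left + right
    return bases


class SplineActivation(torch.nn.Module):
    """Toy learnable phi(x): a trainable linear combination of cubic B-spline bases."""

    def __init__(self, grid_size=10, k=3, x_min=0.0, x_max=1.0):
        super().__init__()
        h = (x_max - x_min) / grid_size
        # extend the uniform grid by k knots on each side so every x in [x_min, x_max]
        # lies in the support of k + 1 bases
        knots = x_min + h * torch.arange(-k, grid_size + k + 1).float()
        self.register_buffer("knots", knots)
        self.k = k
        self.coef = torch.nn.Parameter(0.1 * torch.randn(grid_size + k))  # trainable

    def forward(self, x):  # x: (n,)
        return bspline_basis(x, self.knots, self.k) @ self.coef


phi = SplineActivation()
y = phi(torch.rand(8))  # 8 scalar inputs -> 8 scalar outputs

Only the coefficient vector is trained; the basis itself is fixed by the grid, which keeps the optimization well behaved.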

Formally, a KAN layer maps the output of the \(\ell\)-th layer to the \((\ell+1)\)-th layer via:

\[x_{\ell+1,j} = \sum_{i=1}^{n_\ell} \phi_{\ell,j,i}(x_{\ell,i}),\]

where \(\phi_{\ell,j,i}\) is the learnable univariate variational activation function connecting input node \(i\) to output node \(j\). This can be expressed in matrix notation as:

\[\begin{split}\begin{align} \boldsymbol{x}_{\ell+1} & = \Phi_\ell(\boldsymbol{x}_\ell), \\ \Phi_\ell & = \begin{pmatrix} \phi_{\ell,1,1}(\cdot) & \phi_{\ell,1,2}(\cdot) & \cdots & \phi_{\ell,1,n_\ell}(\cdot) \\ \phi_{\ell,2,1}(\cdot) & \phi_{\ell,2,2}(\cdot) & \cdots & \phi_{\ell,2,n_\ell}(\cdot) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{\ell,n_{\ell+1},1}(\cdot) & \phi_{\ell,n_{\ell+1},2}(\cdot) & \cdots & \phi_{\ell,n_{\ell+1},n_\ell}(\cdot) \end{pmatrix}. \end{align}\end{split}\]
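
Concretely, a layer with \(n_\ell\) inputs and \(n_{\ell+1}\) outputs holds one univariate function per edge and sums them per output node. A minimal toy sketch (with cubic polynomials standing in for the spline parameterization; ToyKANLayer is a hypothetical name used only for illustration):

import torch


class ToyKANLayer(torch.nn.Module):
    """One univariate function per (output j, input i) edge; node j outputs the
    sum over its incoming edges."""

    def __init__(self, n_in, n_out, degree=3):
        super().__init__()
        # coef[j, i, :] are the polynomial coefficients of phi_{j, i}
        self.coef = torch.nn.Parameter(0.1 * torch.randn(n_out, n_in, degree + 1))

    def forward(self, x):  # x: (batch, n_in)
        powers = torch.stack([x**d for d in range(self.coef.shape[-1])], dim=-1)  # (batch, n_in, D)
        edge_out = torch.einsum("bid,jid->bji", powers, self.coef)  # phi_{j,i}(x_i)
        return edge_out.sum(dim=-1)  # (batch, n_out): sum over inputs i


layer = ToyKANLayer(n_in=3, n_out=2)
y = layer(torch.rand(5, 3))  # shape (5, 2)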

A KAN with \(L\) layers is the composition of these maps: given an input \(\boldsymbol{x}\), the network output is:

\[\text{KAN}(\boldsymbol{x}) = \Phi_{L-1}\circ\Phi_{L-2}\circ\cdots\circ\Phi_1\circ\Phi_0(\boldsymbol{x}).\]
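
Stacking layers thus amounts to composing these maps; for instance, reusing the ToyKANLayer sketch above, a [2, 5, 1] network would be:

kan = torch.nn.Sequential(ToyKANLayer(2, 5), ToyKANLayer(5, 1))
out = kan(torch.rand(16, 2))  # shape (16, 1)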

By contrast, a multilayer perceptron (MLP) alternates fixed nonlinear activation functions \(\sigma\) with learnable linear layers \(W_\ell\):

\[\text{MLP}(\boldsymbol{x}) = (W_{L-1}\circ\sigma\circ W_{L-2}\circ\sigma\circ\cdots\circ W_1\circ\sigma\circ W_0)(\boldsymbol{x}).\]
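
For comparison, an MLP of the same [2, 5, 1] shape keeps a fixed nonlinearity on the nodes and learns only the linear maps between them, e.g.:

import torch

# [2, 5, 1] MLP: learnable linear maps W, fixed SiLU nonlinearity between them
mlp = torch.nn.Sequential(
    torch.nn.Linear(2, 5),
    torch.nn.SiLU(),
    torch.nn.Linear(5, 1),
)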

The following figure illustrates the KAN architecture: each edge carries a learnable variational activation function \(\phi_{\ell,j,i}\) connecting node \(i\) of one layer to node \(j\) of the next, and each node simply sums the outputs of its incoming edges.

Training KAN

Here is a simple example of how to train a KAN using the QKAN package:

[1]:
import torch
from tqdm import tqdm

from qkan import KAN, create_dataset

# run on GPU when available
device = "cuda" if torch.cuda.is_available() else "cpu"

# target function: sin(20x) / (20x), a sinc-style curve on (0, 1]
f = lambda x: torch.sin(20 * x) / x / 20
dataset = create_dataset(
    f, n_var=1, ranges=[0, 1], device=device, train_num=1000, test_num=1000, seed=0
)

# a KAN with one input, one output, and grid_size=10
model = KAN(
    [1, 1],
    grid_size=10,
    device=device,
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-1)
loss_fn = torch.nn.MSELoss()
[2]:
steps = 100
pbar = tqdm(range(steps), ncols=100)

# standard full-batch training loop
model.train()
for _ in pbar:
    optimizer.zero_grad()
    pred = model(dataset["train_input"])
    loss = loss_fn(pred, dataset["train_label"])
    loss.backward()
    optimizer.step()

    pbar.set_description(f"loss: {loss.item():.4f}")

# evaluate on the held-out test split
model.eval()
with torch.no_grad():
    pred = model(dataset["test_input"])
    loss = loss_fn(pred, dataset["test_label"])
    print(f"Test loss: {loss.item():.4f}")
loss: 0.0031: 100%|██████████████████████████████████████████████| 100/100 [00:00<00:00, 735.03it/s]
Test loss: 0.0032
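
As a quick sanity check (a minimal sketch, assuming the trained model accepts an \((n, 1)\) input tensor exactly as in the cells above), the fit can be compared against the target on a fresh grid:

x = torch.linspace(0.05, 1.0, 200, device=device).unsqueeze(1)  # avoid x = 0
with torch.no_grad():
    max_err = (model(x) - f(x)).abs().max().item()
print(f"Max absolute error on the grid: {max_err:.4f}")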

Further Reading

To understand KANs in more detail, the reader can refer to the original paper: “KAN: Kolmogorov-Arnold Networks”.

The pykan documentation is another useful reference: https://kindxiaoming.github.io/pykan/index.html