
How It Works

The Network

Imagine a chain of voters. 784 pixels each cast a weighted vote to 16 "feature detectors" in the first hidden layer. Those 16 neurons vote again to another 16, which finally vote on 10 output digits. The network learns by adjusting how much each vote counts.

Each connection has a weight — a number that can be positive ("this pixel matters for this feature") or negative ("this pixel argues against it"). Training is the process of tweaking all ~13,000 weights until the votes produce the right answer.

784 → 16 → 16 → 10 — 8 nodes shown for clarity; the real input has 784 (one per pixel)
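
The voting chain can be sketched as a forward pass in a few lines of NumPy. This is an illustrative sketch, not the app's code — the weight values, variable names, and ReLU choice here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical random weights for the 784 -> 16 -> 16 -> 10 chain.
# Total weights: 784*16 + 16*16 + 16*10 = 12,960 — the "~13,000" above.
W1, b1 = rng.normal(0, 0.05, (16, 784)), np.zeros(16)
W2, b2 = rng.normal(0, 0.05, (16, 16)), np.zeros(16)
W3, b3 = rng.normal(0, 0.05, (10, 16)), np.zeros(10)

def relu(z):
    return np.maximum(0.0, z)

def forward(x):
    """x: 784 pixel values in [0, 1]. Returns 10 digit scores."""
    h1 = relu(W1 @ x + b1)   # 16 first-layer "feature detectors" vote
    h2 = relu(W2 @ h1 + b2)  # 16 second-layer neurons vote again
    return W3 @ h2 + b3      # final votes, one score per digit

x = rng.random(784)          # a stand-in "image"
scores = forward(x)
print(int(scores.argmax()))  # the digit with the most votes wins
```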

Activation Functions

After a neuron adds up all its weighted inputs, it passes the sum through an activation function — a filter that decides how strongly the neuron fires.

Sigmoid is like a dimmer switch: it smoothly squashes any input into a value between 0 and 1. Every neuron always glows at least a little. The result is soft, blended weight patterns.

ReLU is like an on/off switch with a volume knob: negative inputs produce exactly zero (the neuron is "dead"), but positive inputs pass through at full strength. This creates sharper features and clearly dead neurons — dark tiles that never fire.

Because Sigmoid caps its gradients at 0.25 while ReLU passes them at full strength, the same learning rate hits very differently. LR=0.30 works well for Sigmoid but is catastrophic for ReLU — a single round of updates can push every neuron negative, killing the entire network with no way to recover. The app auto-adjusts the learning rate when you switch (0.30 for Sigmoid, 0.05 for ReLU).

How each function transforms the neuron's raw sum into its output
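
Both functions — and the gradient gap behind the different learning rates — fit in a few lines. A minimal NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    # Dimmer switch: squashes any input into (0, 1); never exactly off.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # On/off switch: negatives become exactly 0, positives pass at full strength.
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # roughly [0.119, 0.5, 0.881]
print(relu(z))      # [0. 0. 2.]

# The gradients explain the learning-rate gap:
sig_grad = sigmoid(z) * (1.0 - sigmoid(z))  # never exceeds 0.25 (peak at z = 0)
relu_grad = (z > 0).astype(float)           # exactly 1 for any positive input
```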

Weight Initialization

Before training starts, all weights are set to random numbers. How you pick those random numbers matters more than you'd think — it's like choosing the starting position of runners in a race.

He initialization uses larger random values (wider spread). It's designed for ReLU, which kills half the signal — so you start louder to compensate.

Xavier initialization (fan-in variant) uses smaller random values (narrower spread). It's designed for Sigmoid, which already squashes everything — starting too loud would saturate every neuron at 0 or 1.

Uniform initialization picks values evenly across a range. No bell curve, no structure — a naive baseline.

Distribution of initial weight values — how spread out the starting point is
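
The three schemes differ only in how the random draw is scaled. A sketch using the standard formulas (the exact ranges the app uses may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 784  # inputs per neuron in the first hidden layer

# He: Gaussian with std = sqrt(2 / fan_in) — louder, to survive ReLU
he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(16, fan_in))

# Xavier (fan-in variant): std = sqrt(1 / fan_in) — quieter, to keep Sigmoid unsaturated
xavier = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(16, fan_in))

# Uniform: flat spread with no variance scaling — the naive baseline
uniform = rng.uniform(-0.05, 0.05, size=(16, fan_in))

print(he.std(), xavier.std(), uniform.std())  # He has the widest spread
```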

Why Pairings Matter

Activation and initialization are designed as a pair. The right combination keeps signals flowing at a healthy volume through the network. The wrong combination either kills the signal or blows it up.

ReLU + He: Designed pair — signals stay healthy, sharp features emerge
ReLU + Xavier: Too quiet to start — most neurons die immediately and never recover
ReLU + Uniform: Unpredictable — some neurons die, others work fine
Sigmoid + He: Too loud — neurons start saturated near 0 or 1, learning is sluggish
Sigmoid + Xavier: Designed pair — smooth gradients, all neurons contribute
Sigmoid + Uniform: Crude but functional — works OK for simple problems

Live weight distribution from the network above — updates as you train
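
The "too quiet" failure is easy to reproduce numerically: push a signal through a stack of ReLU layers and watch its spread. A sketch (the layer width and depth here are arbitrary, not the app's 16-neuron layers):

```python
import numpy as np

rng = np.random.default_rng(1)
fan = 256  # arbitrary layer width for this experiment

def relu(z):
    return np.maximum(0.0, z)

def spread_after(depth, init_std):
    """Push a random signal through `depth` ReLU layers; return its final std."""
    x = rng.normal(0.0, 1.0, fan)
    for _ in range(depth):
        W = rng.normal(0.0, init_std, (fan, fan))
        x = relu(W @ x)
    return x.std()

he_std = np.sqrt(2.0 / fan)      # the designed pair with ReLU
xavier_std = np.sqrt(1.0 / fan)  # designed for Sigmoid instead

healthy = spread_after(10, he_std)
starved = spread_after(10, xavier_std)
print(healthy)  # stays at a healthy volume, around 1
print(starved)  # shrinks by ~1/sqrt(2) per layer — the signal starves
```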

Reading the Heatmaps

Each small square in Hidden Layer 1 shows a neuron's 784 weights arranged as a 28×28 grid. Think of it as a "template" — the pattern this neuron is looking for in the input image.

Colors show weight polarity: excitatory ("I want bright pixels here") and inhibitory ("I want dark pixels here"). The glow around each neuron shows its activation strength.

Hover a Hidden 1 neuron to see what it's looking at — only positive contributions (pixel × weight) light up, so you see exactly which parts of the input activate that neuron. Hover a Hidden 2 neuron to see which Layer 1 neurons it relies on. (This shows the linear contribution before bias and activation are applied.)

The diverging color scale — weight magnitude and direction
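
What the hover overlay computes is just an element-wise product. A sketch with made-up values (names are illustrative, not the app's):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.random(784)            # flattened 28x28 input image
w = rng.normal(0, 0.05, 784)   # one Hidden-1 neuron's weight template

contrib = x * w                       # per-pixel linear contribution
positive = np.clip(contrib, 0, None)  # keep only positive contributions
overlay = positive.reshape(28, 28)    # back to image shape for display

# This is the linear contribution only — bias and activation are applied
# after the sum, as noted above.
print(overlay.shape)  # (28, 28)
```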

Training & Learning Rate

Each time you feed a digit, three things happen: a forward pass (compute prediction), backpropagation (trace how each weight contributed to the error), and a gradient step (nudge each weight to reduce error). This one-sample-at-a-time process is called stochastic gradient descent (SGD) — "stochastic" because each update uses a single random sample rather than averaging over a batch.

The learning rate controls how big each nudge is. Too high and the network overshoots the sweet spot, bouncing around erratically. Too low and it barely moves. The default (0.30) is a reasonable middle ground for this architecture.

Schematic — effect of learning rate on convergence
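
The three steps above, end to end, for a single-layer version — the demo backpropagates through all three layers, but the shape of the update is the same. Softmax + cross-entropy here is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(0, 0.05, (10, 784))
b = np.zeros(10)
lr = 0.05  # the learning rate: how big each nudge is

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.random(784)  # one training sample
label = 3            # its true digit

losses = []
for step in range(5):
    p = softmax(W @ x + b)        # 1. forward pass: compute prediction
    losses.append(-np.log(p[label]))
    err = p.copy()
    err[label] -= 1.0             # 2. backprop: dLoss/dz for softmax + cross-entropy
    W -= lr * np.outer(err, x)    # 3. gradient step: nudge every weight
    b -= lr * err

print([round(l, 3) for l in losses])  # loss falls as the votes are re-weighted
```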

Test Accuracy & Draw Mode

Test Accuracy — 90 held-out samples (9 per digit) that are never used for training. Red badges on wrong samples show what the network guessed instead (e.g. →3). The sparklines track accuracy over training time. With only 9 test samples per digit, accuracy is coarse — a single sample swings it by ~11%. In practice, test sets contain thousands of samples.

Draw mode (D) — Sketch your own digit on the input canvas and watch the network classify it live. Press C to clear.

Beyond This Demo

This visualization covers the fundamentals, but real-world neural networks use many techniques not shown here. Concepts to explore next:

  • Mini-batch SGD — averaging gradients over groups of samples (32, 64, 128) for smoother, faster training
  • Momentum & Adam — optimizers that accumulate velocity from past gradients, avoiding local minima and speeding convergence
  • Dropout & regularization — techniques to prevent overfitting by randomly disabling neurons or penalizing large weights during training
  • Convolutional networks (CNNs) — instead of connecting every pixel to every neuron, learn small reusable filters that detect edges, curves, and shapes — the standard for image tasks
  • Batch normalization — stabilizing activations between layers for faster, more reliable training
  • Deeper architectures — modern networks have hundreds or thousands of layers, not just two
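
As a taste of the first item, mini-batch SGD just averages gradients over a batch before stepping. A sketch for a single layer (shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(0, 0.05, (10, 784))
lr = 0.05

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

X = rng.random((32, 784))             # a mini-batch of 32 samples
labels = rng.integers(0, 10, 32)      # their true digits

P = softmax_rows(X @ W.T)             # (32, 10): predictions for all 32 at once
Err = P.copy()
Err[np.arange(32), labels] -= 1.0     # per-sample output error
grad = Err.T @ X / 32                 # average gradient across the batch
W -= lr * grad                        # one smoother step instead of 32 noisy ones
print(grad.shape)  # (10, 784)
```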

For deeper dives: 3Blue1Brown's neural network series, Karpathy's micrograd, and fast.ai.