Feed it handwritten digits and watch patterns emerge from noise.
Imagine a chain of voters. 784 pixels each cast a weighted vote to 16 "feature detectors" in the first hidden layer. Those 16 neurons vote again to another 16, which finally vote on 10 output digits. The network learns by adjusting how much each vote counts.
Each connection has a weight — a number that can be positive ("this pixel matters for this feature") or negative ("this pixel argues against it"). Training is the process of tweaking all ~13,000 weights until the votes produce the right answer.
784 → 16 → 16 → 10 — 8 nodes shown for clarity; the real input has 784 (one per pixel)
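The voting chain can be sketched in a few lines of NumPy. This is a minimal illustration, not the app's actual code; the weight scale (0.05) here is an arbitrary placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the text: 784 pixels -> 16 -> 16 -> 10 digits.
# 784*16 + 16*16 + 16*10 weights plus 42 biases = 13,002 parameters (~13,000).
W1 = rng.normal(0, 0.05, (16, 784)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.05, (16, 16));  b2 = np.zeros(16)
W3 = rng.normal(0, 0.05, (10, 16));  b3 = np.zeros(10)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    # Each layer is a round of weighted voting followed by an activation
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    return sigmoid(W3 @ h2 + b3)    # 10 scores, one per output digit

x = rng.random(784)                 # stand-in for a flattened 28x28 image
scores = forward(x)
prediction = int(np.argmax(scores)) # the digit with the strongest vote
```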
After a neuron adds up all its weighted inputs, it passes the sum through an activation function — a filter that decides how strongly the neuron fires.
Sigmoid is like a dimmer switch: it smoothly squashes any input into a value between 0 and 1. Every neuron always glows at least a little. The result is soft, blended weight patterns.
ReLU is like an on/off switch with a volume knob: negative inputs produce exactly zero (the neuron is "dead"), but positive inputs pass through at full strength. This creates sharper features and clearly dead neurons — dark tiles that never fire.
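Both filters are one-liners; a sketch of the two activations described above:

```python
import numpy as np

def sigmoid(z):
    # Dimmer switch: squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # On/off switch: zero for negative inputs, identity for positive
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
sigmoid(z)   # ~[0.12, 0.5, 0.88] — everything glows at least a little
relu(z)      # [0.0, 0.0, 2.0] — negatives are exactly zero
```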
Because Sigmoid caps its gradients at 0.25 while ReLU passes them at full strength, the same learning rate hits very differently. LR=0.30 works well for Sigmoid but is catastrophic for ReLU — a single round of updates can push every neuron negative, killing the entire network with no way to recover. The app auto-adjusts the learning rate when you switch (0.30 for Sigmoid, 0.05 for ReLU).
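The gap is easy to verify: sigmoid's derivative is s·(1 − s), which peaks at 0.25, while ReLU's derivative is exactly 1 for any positive input. A small illustrative sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # never exceeds 0.25 (reached at z = 0)

def relu_grad(z):
    return np.where(z > 0, 1.0, 0.0)  # full-strength 1.0 for any positive input

sigmoid_grad(0.0)   # 0.25 — the gradient cap mentioned above
```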
How each function transforms the neuron's raw sum into its output
Before training starts, all weights are set to random numbers. How you pick those random numbers matters more than you'd think — it's like choosing the starting position of runners in a race.
He initialization uses larger random values (wider spread). It's designed for ReLU, which kills half the signal — so you start louder to compensate.
Xavier initialization (fan-in variant) uses smaller random values (narrower spread). It's designed for Sigmoid, which already squashes everything — starting too loud would saturate every neuron at 0 or 1.
Uniform initialization picks values evenly across a range. No bell curve, no structure — a naive baseline.
Distribution of initial weight values — how spread out the starting point is
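The three schemes differ only in how spread out the starting weights are. A sketch using the standard formulas (He: std √(2/fan_in); Xavier fan-in variant: std √(1/fan_in)); the app's exact ranges may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 784   # number of inputs feeding each Hidden 1 neuron

# He (for ReLU): normal with std sqrt(2 / fan_in) — wider spread
w_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), fan_in)

# Xavier, fan-in variant (for Sigmoid): std sqrt(1 / fan_in) — narrower
w_xavier = rng.normal(0.0, np.sqrt(1.0 / fan_in), fan_in)

# Uniform baseline: evenly spread over a fixed range, no bell curve
w_uniform = rng.uniform(-0.05, 0.05, fan_in)
```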
Activation and initialization are designed as a pair. The right combination keeps signals flowing at a healthy volume through the network. The wrong combination either kills the signal or blows it up.
Live weight distribution from the network above — updates as you train
Each small square in Hidden Layer 1 shows a neuron's 784 weights arranged as a 28×28 grid. Think of it as a "template" — the pattern this neuron is looking for in the input image.
Colors show weight polarity: excitatory ("I want bright pixels here") and inhibitory ("I want dark pixels here"). The glow around each neuron shows its activation strength.
Hover a Hidden 1 neuron to see what it's looking at — only positive contributions (pixel × weight) light up, so you see exactly which parts of the input activate that neuron. Hover a Hidden 2 neuron to see which Layer 1 neurons it relies on. (This shows the linear contribution before bias and activation are applied.)
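The hover overlay amounts to an element-wise product with the negatives clipped off. A hypothetical sketch, with random data standing in for a real image and a trained neuron:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)           # input image pixels, flattened
w = rng.normal(0, 0.05, 784)  # one Hidden 1 neuron's 784 weights

# Per-pixel linear contribution, positives only — computed before
# bias and activation are applied, as the text notes.
contrib = x * w
overlay = np.maximum(contrib, 0.0)  # which pixels actively excite this neuron
```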
The diverging color scale — weight magnitude and direction
Each time you feed a digit, three things happen: a forward pass (compute prediction), backpropagation (trace how each weight contributed to the error), and a gradient step (nudge each weight to reduce error). This one-sample-at-a-time process is called stochastic gradient descent (SGD) — "stochastic" because each update uses a single random sample rather than averaging over a batch.
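One SGD update can be sketched on a single toy layer. This illustrates the three steps rather than the app's implementation; a one-layer network with squared error stands in for the real architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(0, 0.1, (3, 4))  # toy layer: 4 inputs -> 3 outputs
b = np.zeros(3)
lr = 0.30                       # learning rate: size of each nudge

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.random(4)               # one sample ("stochastic": one at a time)
target = np.array([1.0, 0.0, 0.0])

# 1. Forward pass: compute the prediction
y = sigmoid(W @ x + b)

# 2. Backpropagation: how each weight contributed to the squared error
dz = (y - target) * y * (1 - y)  # chain rule through the sigmoid
dW = np.outer(dz, x)
db = dz

# 3. Gradient step: nudge each weight to reduce the error
W -= lr * dW
b -= lr * db
```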
The learning rate controls how big each nudge is. Too high and the network overshoots the sweet spot, bouncing around erratically. Too low and it barely moves. The default (0.30) is a reasonable middle ground for this architecture.
Schematic — effect of learning rate on convergence
Test Accuracy — 90 held-out samples (9 per digit) that are never used for training. Red badges on wrong samples show what the network guessed instead (e.g. →3). The sparklines track accuracy over training time. With only 9 test samples per digit, accuracy is coarse — a single sample swings a digit's accuracy by ~11%. In practice, test sets contain thousands of samples.
Draw mode (D) — Sketch your own digit on the input canvas and watch the network classify it live. Press C to clear.
This visualization covers the fundamentals, but real-world neural networks use many techniques not shown here. Concepts to explore next:
For deeper dives: 3Blue1Brown's neural network series, Karpathy's micrograd, and fast.ai.