EvoNash — CWSF 2026 Project

CWSF 2026 · Selected for Nationals

Recognition & Project Resources

This project was judged at the Cariboo-Mainline Regional Science Fair on April 13, 2026 and was selected to advance to the Canada-Wide Science Fair (CWSF) 2026. The dashboard you are viewing — sf.defouw.ca — is the core scientific instrument and the entirety of the experimental platform: every simulation, every statistical test, every chart, and every conclusion lives here.

CWSF Digital Project Board

partner.projectboard.world →

A core CWSF requirement: my official project poster, abstract, and methodology hosted on Youth Science Canada's judging platform.

The Live Experiment

sf.defouw.ca

The complete experimental platform — simulation, database, worker pool, statistical analysis, and dashboard. You are reading it now.

Live · self-updating · driven by my data

This entire page is alive.

Every number, every chart, every percentage, and every conclusion you are about to read on this page — and across the rest of the EvoNash dashboard — is computed live from my own running PostgreSQL database the moment you open this page. Nothing on the page is hardcoded. Refresh the tab and the page rewrites itself using whatever the data says right now.

That includes the narrative, not just the numbers. The effect-size label ("negligible" vs. "medium" vs. "large"), the p-value verdict ("not significant" vs. "essentially zero"), the sign-test description ("evenly split" vs. "overwhelmingly toward adaptive"), the confidence-interval phrasing, even the final verdict at the end of section 18 — the page's own code chooses the wording based on what is actually true in the live data. If my data ever shifted to disagree with my hypothesis, this page would say so honestly, without me touching a single line of text.

That is unusual. Most science-fair projects, and even most published research write-ups, have static prose with a few numbers placeholder-swapped in. The author writes the conclusion before the data lands, and the conclusion stays put even if the data later shifts. Here the conclusions are derived from live aggregates of the experiments, with paired-seed analysis, sign tests, sensitivity trims, and effect-size calculations all recomputed on every page load. You are reading a self-updating scientific document about an experiment that is still running, not a snapshot frozen at submission time.

Numbers

p-values, Cohen's d, win rate, mean per-pair difference, 95% CIs, sensitivity table — all recomputed live on every load.

Narrative

Effect-size labels, significance verdicts, direction-of-effect framing — auto-picked to match what the data actually shows.

Conclusion

The final verdict box reflects whatever the data says — even if it ever disagreed with the hypothesis, the page would say so.

I designed it this way because real science is honest about its data — and the only way to keep a dashboard honest is to let the data write the page.
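As a rough illustration of how wording can follow the data, the sketch below maps a Cohen's d value and a p-value to the kinds of labels quoted above. The function names, the 0.05 significance level, and the 1e-6 cutoff are assumptions made for this example. The effect-size cutoffs are Cohen's conventional 0.2 / 0.5 / 0.8. This is not the dashboard's actual code.

```python
def effect_size_label(cohens_d: float) -> str:
    """Pick a verbal label for an effect size (hypothetical helper)."""
    magnitude = abs(cohens_d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


def significance_verdict(p_value: float, alpha: float = 0.05) -> str:
    """Pick the p-value wording; alpha and the 1e-6 cutoff are assumed values."""
    if p_value >= alpha:
        return "not significant"
    if p_value < 1e-6:
        return "essentially zero"
    return "significant"


# Made-up inputs, only to show that the wording follows the numbers.
print(effect_size_label(0.63), "/", significance_verdict(0.31))
```

Because the labels are chosen by comparisons like these, a change in the underlying statistics changes the sentence the reader sees without anyone editing the text.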

About This Project: The Science Fair Experiment at a Glance

EvoNash is a science fair project that asks a simple question: if we let digital "organisms" with tiny artificial brains evolve in a mini world, does it help to change their "genes" more when they are doing poorly and less when they are doing well? Or is it better to always change them by the same amount, like flipping a coin the same way every time? This project builds a real experiment—a computer platform that runs on a powerful graphics card^[27]—to answer that question. Below we explain every part of the experiment in simple terms.

Abstract (What We Did in One Paragraph)

This experiment tests whether adaptive mutation^{[15, 17]}—changing an organism's "genes" (the numbers inside its brain) more when the parent did poorly and less when the parent did well—helps a population of 1,000 digital organisms reach a stable outcome (called a Nash equilibrium^{[1, 2]}) faster than a control group that always uses the same amount of random change (static mutation). We put the organisms in a simple 2D world (a "petri dish") where they can move, eat food, and shoot at each other to steal energy. Their brains are small neural networks^{[6, 9]} that we do not program; we only evolve them by keeping the best performers and randomly mutating their weights^{[10, 11]}. We run two groups side by side, measure how many generations it takes each group to "settle down," and use statistics^{[21, 22]} to see if the adaptive group really got there faster. The results tell us whether this kind of smart mutation could help future AI and evolutionary algorithms.

The Problem (Why We Did This)

In many real-world and computer experiments, we use evolution^{[10, 11]} to improve things: we keep the best performers, copy them with small random changes (mutations), and repeat. But how much should we change them? If we change too much every time, good solutions get destroyed and we search almost at random^[15]. If we change too little, we might get stuck in a "local" good outcome and never find a better one. Most classic methods use a fixed amount of mutation—the same for everyone, every time^[17]. This project asks: what if we adapt the amount of mutation to how well the parent did? Struggling organisms get more random changes (a chance to try something new); successful ones get fewer changes (we keep what works). We wanted to test whether that idea actually speeds up how fast a population finds a stable, balanced outcome (a Nash equilibrium^[1]) in a simple but real experiment.

The Hypothesis (What We Think Will Happen)

Our hypothesis is: if we use adaptive mutation^{[15, 16, 17]}—where the amount of random change is inversely proportional to the parent's fitness (so low-performing parents produce more heavily mutated offspring, and high-performing parents produce less mutated offspring)—then the population will reach a Nash equilibrium^{[1, 2]} (a stable mix of strategies where no one benefits by changing alone) in fewer generations than a control group that uses a fixed mutation rate. In other words, we predict that "smarter" mutation will help the population settle down faster.
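One way to picture "inversely proportional to the parent's fitness" is the sketch below. It is only an illustration: the base strength, the use of the population's mean fitness as the reference point, and the floor and ceiling values are all assumptions, and the project's real formula may differ.

```python
BASE_SIGMA = 0.05  # assumed mutation strength for the static control group
MIN_SIGMA = 0.01   # assumed floor so strong parents still change a little
MAX_SIGMA = 0.20   # assumed ceiling so weak parents are not fully scrambled


def static_sigma(parent_fitness: float, mean_fitness: float) -> float:
    """Control group: the same mutation strength for every parent."""
    return BASE_SIGMA


def adaptive_sigma(parent_fitness: float, mean_fitness: float) -> float:
    """Experimental group: strength scales with 1 / parent fitness."""
    raw = BASE_SIGMA * mean_fitness / max(parent_fitness, 1e-9)
    return min(MAX_SIGMA, max(MIN_SIGMA, raw))


# A weak parent (fitness 300) and a strong parent (fitness 900),
# in a population whose mean fitness is 600.
print(adaptive_sigma(300, 600), adaptive_sigma(900, 600))
```

With these assumed settings a parent at exactly the mean fitness mutates as much as the control group does, a parent at half the mean mutates twice as much, and a parent well above the mean mutates less.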

Methodology (How We Run the Experiment)

We run two groups of experiments. Everything is the same in both groups except one thing: how much we mutate the offspring. In the control group, we always add the same small random amount of change to the brain weights (static mutation). In the experimental group, we add more change when the parent had a low fitness score and less change when the parent had a high fitness score (adaptive mutation). For each group we:

Start with 1,000 random neural-network "brains" in the same petri dish world.
Let them live for many "ticks" (moments)—moving, eating food, and sometimes shooting each other—and track who has the most energy.
At the end of each generation, we pick the top 20% by fitness score, copy their brains to create offspring, and mutate those copies (more or less depending on the group).
We repeat for many generations until the population's behavior stabilizes—meaning the mix of strategies stops changing much (we call that reaching Nash equilibrium).
We record when that happened (which generation) and how well the population did (peak fitness).

Then we compare the two groups using statistics: did the adaptive-mutation group reach Nash equilibrium in fewer generations? If yes, that supports our hypothesis.
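The selection-and-mutation step in the list above can be sketched as follows. Each population entry is a (fitness, weights) pair. The helper name, the Gaussian noise, and the choice to pick a random top-20% parent for every offspring slot are assumptions made for illustration.

```python
import random

ELITE_FRACTION = 0.20  # top 20% by fitness become parents (from the text)


def next_generation(population, sigma_for, rng):
    """Build a new list of weight vectors from (fitness, weights) pairs.

    sigma_for maps a parent's fitness to a mutation strength, so the same
    function serves the static group (constant) and the adaptive group.
    """
    ranked = sorted(population, key=lambda item: item[0], reverse=True)
    n_parents = max(1, int(len(ranked) * ELITE_FRACTION))
    parents = ranked[:n_parents]

    children = []
    for _ in range(len(population)):
        fitness, weights = rng.choice(parents)
        sigma = sigma_for(fitness)
        # Copy the parent's weights and add a small random change to each.
        children.append([w + rng.gauss(0.0, sigma) for w in weights])
    return children


rng = random.Random(0)
toy_population = [(float(i), [float(i)]) for i in range(10)]
children = next_generation(toy_population, lambda fitness: 0.05, rng)
print(len(children))
```

Passing the mutation rule in as a function keeps everything else identical between the two groups, which mirrors the "only one thing changes" design described above.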

Why We Use the Same Seed (Fair Start)

To be fair, we start the control and experimental groups with the same random seed. A seed is like a recipe that decides the starting conditions. Using the same seed means both groups begin with the same kind of brains and the same world setup. The only thing that is different is the mutation strategy. That way, if one group reaches a stable result faster, we know it happened because of the mutation strategy—not because it got a luckier start.

We don't rely on just one seed. We repeat the experiment with many different seeds. This gives us enough data to be confident that the result isn't just a coincidence.
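A minimal sketch of the paired-seed idea: each seed builds one starting population, and both groups begin from a copy of it. The function names and the tiny population size are assumptions chosen to keep the example short.

```python
import copy
import random


def initial_population(seed, n_agents=4, n_weights=3):
    """Same seed in, same starting brains out."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_weights)]
            for _ in range(n_agents)]


def paired_starts(seeds):
    """For every seed, give the control and adaptive runs identical starts."""
    pairs = {}
    for seed in seeds:
        start = initial_population(seed)
        pairs[seed] = {
            "control": copy.deepcopy(start),
            "adaptive": copy.deepcopy(start),
        }
    return pairs


pairs = paired_starts([1, 2, 3])
print(pairs[1]["control"] == pairs[1]["adaptive"])  # identical within a pair
```

Each seed then yields one control result and one adaptive result that can be compared directly, and repeating this over many seeds gives many such pairs.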

Variables (What We Change, What We Measure, What We Keep the Same)

In any good experiment we control what we change and what we measure. Here:

What we change (independent variable): The mutation strategy—fixed (control) vs adaptive (experimental).
What we measure (dependent variables): (1) How many generations it took to reach Nash equilibrium (our main outcome), and (2) the highest fitness score reached (peak fitness).
What we keep the same (constants): Population size (1,000), the rules of the petri dish (physics, food, shooting), how we select parents (top 20%), and the shape of the neural network (24 inputs, 64 hidden neurons, 4 outputs). Keeping these the same lets us fairly compare the two mutation strategies.

Why This Matters

This project combines evolution^{[10, 11]} (trial and error over generations), game theory^{[3, 4]} (Nash equilibrium^{[1, 2]}—when no one benefits by changing strategy alone), and neural networks^{[6, 9]} (small artificial brains). Understanding whether adaptive mutation^{[15, 17, 18]} speeds up convergence can help future AI and evolutionary algorithms—for example, in robotics, multi-agent systems^[25], or automated design. The rest of this page explains each part of the experiment in more detail, so you can understand exactly what we did and why.

1. What is a neural network?

A neural network^{[6, 7]} is a computer model inspired by how brain cells work. It is made of many simple "cells" (called neurons) that receive numbers, do simple math, and send numbers to other cells. No one programs the network step-by-step to solve the problem. Instead, we give it a structure—layers and connections—and then we change the strength of those connections (the "weights") through learning^[8] or, in our case, evolution^[12].

Think of it like a recipe where we only adjust the amounts of ingredients, not the steps. In this experiment, each organism's "brain" is one small neural network. It is nothing like a human brain in size or complexity, but the basic idea is the same: numbers go in, math happens, and numbers come out that become actions.

2. How do neural networks work?

The inputs are numbers that represent what the organism "knows"—for example, how far away the nearest food pellet is, or how close the nearest enemy organism is. Since the world wraps around (toroidal geometry), there are no walls—just open space in every direction. These numbers flow through layers. The first layer takes the inputs and multiplies them by learned "weights," adds "biases," and then applies a simple rule (like "if the result is negative, treat it as zero") so the network can learn patterns that are not straight lines. The result becomes the input to the next layer, and so on.

The output layer produces the final numbers. In our experiment, that is four numbers that control thrust, turn, shoot, and split (see Key terms below). The only thing that changes during evolution is the weights and biases; the layout of the network stays the same. If a weight is big, the input it multiplies has a big effect on the next layer; if it is small, that input has a small effect.
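The flow described above (multiply by weights, add biases, zero out negatives, pass to the next layer) can be sketched in plain Python. The layer sizes follow the 24-64-4 shape used in this project. The random starting weights, the zero biases, and the function names are assumptions, and the real simulation runs on a GPU rather than in Python lists.

```python
import random

N_IN, N_HIDDEN, N_OUT = 24, 64, 4


def random_layer(n_out, n_in, rng):
    """One layer: a grid of weights plus one bias per output neuron."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(weights, biases, inputs):
    """Weighted sum of the inputs plus a bias, for every neuron in the layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(brain, inputs):
    """Run one organism's brain once: 24 numbers in, 4 numbers out."""
    (w1, b1), (w2, b2) = brain
    hidden = [max(0.0, v) for v in dense(w1, b1, inputs)]  # negatives become zero
    return dense(w2, b2, hidden)


rng = random.Random(0)
brain = (random_layer(N_HIDDEN, N_IN, rng), random_layer(N_OUT, N_HIDDEN, rng))
outputs = forward(brain, [0.5] * N_IN)
print(len(outputs))
```

Evolution only ever edits the numbers inside `brain`; the functions `dense` and `forward` stay the same for every organism.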

3. Key terms

The following terms are used throughout this overview. They are defined here so you can refer back anytime.

Raycasts

"Raycast" is not a common word—it comes from computer graphics. In this experiment, raycasts are virtual beams or sensors. The organism sends out 8 "beams" in different directions (like headlights or radar). Each beam reports how far the nearest food pellet or other organism is (and sometimes the size of the other organism). So the organism does not "see" pictures; it gets 24 numbers (8 directions × 3 types of data). Since the world wraps around, there are no walls to hit—the raycasts go on forever until they find something or reach maximum range.

The four actions

Thrust — The strength of "move forward," from 0 (don't move) to 1 (full power). The organism accelerates in the direction it is facing.
Turn — Rotate left or right, from -1 to 1. It changes which direction the organism is facing.
Shoot — Fire a projectile. If it hits another organism, the shooter steals some of their energy (that is predation). There is a cooldown: after shooting, the organism must wait a short time before it can shoot again.
Split — Another action with a cooldown; it can be used for reproduction or other abilities in the simulation.

Moving

Moving in the experiment is the result of thrust and the simulation's physics. Each moment (each "tick"), the organism's velocity is updated by its thrust, and its position is updated by its velocity. So moving = thrust + physics; the organism does not teleport.
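A minimal sketch of one tick of movement, assuming simple Euler-style updates. The acceleration, turn rate, and world size are made-up constants, and any friction or speed limit the real simulation applies is left out.

```python
import math


def physics_step(x, y, vx, vy, heading, thrust, turn,
                 world_size=100.0, accel=1.0, turn_rate=0.1, dt=1.0):
    """Advance one organism by one tick and return its new state."""
    # Turn first, then push in the direction the organism now faces.
    heading = heading + turn * turn_rate * dt
    vx = vx + math.cos(heading) * thrust * accel * dt
    vy = vy + math.sin(heading) * thrust * accel * dt
    # Position follows velocity; the modulo wraps around the world's edges.
    x = (x + vx * dt) % world_size
    y = (y + vy * dt) % world_size
    return x, y, vx, vy, heading


# An organism near the right edge, facing right, at full thrust.
print(physics_step(99.5, 50.0, 0.0, 0.0, 0.0, thrust=1.0, turn=0.0))
```

In this example the organism crosses the right edge and reappears near the left one, which is the wrap-around behavior of the world.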

Fitness Score (in depth)

Fitness Score is a number that measures how well an organism performed during its lifetime in the petri dish. In our experiment, each organism's fitness is calculated with a simple formula:

Fitness = Ticks Survived + Remaining Energy

Ticks survived is how many moments (out of 750 per generation) the organism stayed alive. An organism that survives the entire generation gets 750 points for survival alone. Remaining energy is how much energy the organism still has at the end of the generation. Organisms that collected food efficiently and avoided unnecessary energy loss will have more energy left over.

This means an organism that survives the full generation and ends with lots of energy will have the highest fitness score. For example, an organism that survived all 750 ticks and ended with 150 energy would score 750 + 150 = 900. One that died at tick 300 with 0 energy scores just 300.
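The same arithmetic as a small function, using the two examples from the text. The function and constant names are assumptions.

```python
TICKS_PER_GENERATION = 750  # from the text


def fitness(ticks_survived: int, remaining_energy: float) -> float:
    """Fitness = Ticks Survived + Remaining Energy."""
    return ticks_survived + remaining_energy


print(fitness(TICKS_PER_GENERATION, 150))  # survived the whole generation
print(fitness(300, 0))                     # died at tick 300 with no energy
```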

We use fitness scores to select parents (top 20% by score get to reproduce), to set the mutation rate in the experimental group (low fitness = more mutation, high fitness = less), and to measure how well the population did (peak fitness = highest score anyone reached). So fitness score is the single number that drives evolution and our statistical analysis.

Other terms

Tick: One moment or step in the simulation (like one frame in a video).
Generation: One full round of life, then selection, breeding, and mutation.
Cooldown: A wait time before an action (e.g. shoot) can be used again.
Metabolism: The organism losing a little energy every tick (like burning calories).
Foraging: Getting energy by eating food pellets.
Predation: Getting energy by shooting another organism and stealing their energy.
Policy entropy: A number that measures how "mixed" or "certain" one organism's decisions are (averaged over the population we get mean policy entropy).
Entropy variance: How much organisms differ from each other in that "mixed vs certain" measure; when it is low and stable, the population has settled on a similar mix of strategies (we use this to detect Nash equilibrium).
Convergence: The population settling into a stable mix of strategies (Nash equilibrium).
Fitness Score: How well an organism did; our primary performance measure.
Weights: The numbers inside the neural network that get evolved.
Mutation: Randomly changing those weights a little when creating offspring.

4. How does this experiment implement neural networks?

In this experiment, each organism is controlled by a specific neural network architecture often described as 24-64-4. This shorthand describes the shape of its "brain": 24 inputs, 64 hidden neurons, and 4 outputs. You can think of it like a team of workers passing messages down a line.

1. The Inputs: 24 "Eyes" (Sensors)

The network starts with 24 inputs. Imagine the organism has 8 eyes looking in 8 different directions (forward, backward, left, right, and diagonals). Each eye measures exactly 3 things:

How far is the nearest Food pellet in this direction?
How far is the nearest Enemy organism in this direction?
How far is the nearest boundary wrap point? (Since the world is toroidal—it wraps around like the surface of a donut—there are no walls. This value tells the organism how far it is from the wrap-around edge, which affects how distances to food and enemies are measured.)

With 8 directions × 3 measurements each, that gives us the 24 inputs. These numbers are the only thing the organism "sees." Because the world wraps around, organisms cannot hide in corners or against walls—the environment is completely open in every direction.

2. The Hidden Layer: 64 "Thinkers"

The information then travels to a large hidden layer containing 64 neurons.

Instead of splitting duties between small groups, this large group of 64 neurons works together to process the raw input data. They identify patterns (like "food is close") and calculate the best strategy simultaneously. Having more neurons in a single layer allows the network to capture complex relationships between the inputs directly.

3. The Outputs: 4 Actions

Finally, the network produces 4 outputs, which are the instructions sent to the organism's body:

Thrust: How hard to push forward (0 to 100%).
Turn: Which way to steer (-1 for left, +1 for right).
Shoot: Whether to fire a projectile (if this number is high enough, it shoots).
Split: Whether to reproduce/split (if high enough).

This entire process happens instantly, dozens of times per second, allowing the organism to react to its world in real-time.

We run 1,000 organisms at once. To do this quickly we use a graphics card (GPU) to run all 1,000 brains in parallel—like having 1,000 calculators working at the same time. The software runs on the GPU so we can simulate many generations in a reasonable time.

5. Why did we choose a petri dish?

The petri dish is our controlled mini-world for the experiment. Think of a real petri dish in biology: a simple, closed environment where we can watch life (here, digital organisms) under fixed rules. That helps science because we can repeat the experiment exactly—same rules, same starting conditions—and change only one thing: how much we mutate the "genes" (weights) of the neural networks. So we can fairly compare two strategies.

The world is 2D (flat, like a tabletop) and continuous (organisms can be anywhere, not just on a grid), with wrap-around borders (toroidal geometry): going off one edge brings you back on the other side. This means there are no walls or corners to hide in—organisms must survive in the open. The physics are simple (movement and collisions) so the computer can simulate thousands of organisms without extra complexity. The petri dish is our lab bench—simple, repeatable, and designed so we can learn about evolution and mutation, not about the environment.
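Wrap-around borders change how distance works: two organisms on opposite edges are actually close neighbors. A minimal sketch of measuring distance in such a world is shown below; the function names and the 100 × 100 world size are assumptions.

```python
import math


def wrapped_delta(a: float, b: float, size: float) -> float:
    """Shortest signed offset from a to b along one wrap-around axis."""
    d = (b - a) % size
    return d - size if d > size / 2 else d


def toroidal_distance(p, q, width: float, height: float) -> float:
    """Straight-line distance between two points on a wrap-around world."""
    dx = wrapped_delta(p[0], q[0], width)
    dy = wrapped_delta(p[1], q[1], height)
    return math.hypot(dx, dy)


# Points near opposite edges of a 100 x 100 world are only 2 units apart.
print(toroidal_distance((1.0, 1.0), (99.0, 1.0), 100.0, 100.0))
```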

6. What are the organisms?

The organisms (also called agents) are digital creatures represented as circles moving in the 2D petri dish. Each has energy—like health or fuel. They lose a little energy every moment (metabolism, like burning calories just to stay alive) and gain energy in two ways: by eating food (static pellets that give a set amount of energy) or by predation (shooting a projectile at another organism to steal some of their energy).

Foraging is safer but can be slow; predation is riskier but can yield big gains. They have no hands or eyes; their only "senses" are the numbers from the raycasts and their own state, and their only "actions" are the four outputs (thrust, turn, shoot, split). No one programmed them to "go toward food" or "avoid enemies"—they only have a brain (neural network) that turns what they sense into actions. Over time, organisms that keep their energy high survive and reproduce; others die out.

7. How do their neural network brains work and what are their motivations?

Each moment (tick), every organism gets a list of 24 numbers: from 8 directions, how far to the nearest food and enemy (and sometimes enemy size), plus a few numbers about itself (energy level, speed, whether it is on cooldown for shooting or splitting). That list is the input to its neural network. The network outputs 4 numbers that control thrust, turn, shoot, and split.

No one programmed the organisms to "go toward food" or "avoid enemies." The network just has weights that get evolved; any "strategy" we see (foraging, fleeing, attacking) emerges from which organisms had more offspring. So their motivation is not written in code; it is implicit: organisms that by chance behave in ways that keep energy high get to reproduce, so over many generations the population tends to act in ways that help survival. We measure their success with a fitness score (see Key terms): higher fitness means an organism survived longer and ended the generation with more energy. Think of it like nature selecting the best survivors.

8. Why do they act the way they do?

We do not tell the organisms how to behave. We only select the best performers (top 20% by fitness score), copy their neural network weights to create offspring, and randomly change (mutate) those weights a little. So "why they act the way they do" is: their brains were shaped by many generations of trial and error.

Organisms that happened to have weights that led to good survival and reproduction left more copies; bad strategies died out. It is like breeding dogs for speed—we did not design the legs; we just kept the fastest and over time they got faster. At the start, behavior is almost random; after many generations we often see recognizable strategies (some organisms forage, some attack) because those strategies won in the petri dish. They act the way they do because evolution favored those behaviors in this environment.

9. How do we conduct the experiment?

We run two groups of experiments, identical in every way except how much we mutate the offspring.

Control group — We use a fixed mutation amount (we always change the weights by the same small random amount), like flipping a coin the same way every time.
Experimental group — We use adaptive mutation: we change the weights more when the parent did poorly and less when the parent did well. Struggling organisms get more random changes (more chance to try something new); successful ones get fewer changes (we keep what works).

For each group we start with 1,000 random neural networks, run the petri dish for many generations (each generation = one round of life, selection, breeding, and mutation), and stop when the population's behavior stabilizes, meaning the mix of strategies stops changing much. We call that approaching a Nash equilibrium (see next section). We record when that happened (which generation) and how well the population did (peak fitness). Then we compare the two groups: did the adaptive-mutation group reach stability faster? We use statistics to check if the difference is real or just luck.
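One simple way to ask "real or just luck?" for paired runs is an exact sign test, sketched below. The generation counts are made-up numbers for illustration, not results from this project, and the sign test is only one of the checks a full analysis might use.

```python
from math import comb


def sign_test(control_gens, adaptive_gens):
    """Exact two-sided sign test on paired convergence generations.

    Returns (adaptive_wins, control_wins, p_value). Ties are dropped.
    """
    pairs = list(zip(control_gens, adaptive_gens))
    adaptive_wins = sum(a < c for c, a in pairs)  # adaptive converged sooner
    control_wins = sum(a > c for c, a in pairs)
    n = adaptive_wins + control_wins
    if n == 0:
        return adaptive_wins, control_wins, 1.0
    k = min(adaptive_wins, control_wins)
    # Chance of a split at least this lopsided if each pair were a coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return adaptive_wins, control_wins, min(1.0, 2 * tail)


# Hypothetical paired results: same seed, two mutation strategies.
control = [120, 130, 140, 150, 160, 170]
adaptive = [100, 110, 120, 130, 140, 150]
print(sign_test(control, adaptive))
```

The test only looks at which group won each seed pair, not by how much, so it makes very few assumptions about the shape of the data.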

10. What is game theory? What is Nash equilibrium? Why is it the key metric?

Game theory^[3] is the study of situations where multiple decision-makers (players) choose actions, and each person's outcome depends not only on their own choice but on what others do. Think of two people dividing a pizza: if you ask for more, the other might take less; your best choice depends on what you think they will do. In our experiment, the "players" are the organisms. Each one chooses how to behave (forage, attack, flee) based on its neural network, and its success (energy, survival, reproduction) depends on what the other 999 are doing. So the petri dish is a "game" in the game-theory sense^[25].

Nash equilibrium^{[1, 2]} is a situation where no one can improve their outcome by changing their strategy alone, given what everyone else is doing. It is named after the mathematician John Nash. At a Nash equilibrium, if you are the only one who switches from "forage" to "attack," you don't do better—so no one has a reason to switch. It describes a stable outcome: everyone is doing the best they can given what others do.

In our experiment, each organism has a strategy (the way its brain turns inputs into actions). The population has a mix of strategies^{[4, 5]}. We say the population has reached a Nash-like equilibrium when the mix of strategies stops changing from generation to generation: the kinds of behavior have settled into a stable balance. At that point, no organism would do better by behaving differently, given how the rest of the population is behaving. We detect this by watching entropy variance^[28]—how much the organisms differ from each other in how "mixed" or "certain" their decisions are. When everyone is behaving similarly, that difference drops and stays low; when it stays low for many generations in a row, we treat that as having reached Nash equilibrium.

Why Nash equilibrium is the key metric: Our hypothesis is that adaptive mutation helps the population reach Nash equilibrium faster than fixed mutation. So the key metric is how many generations it takes to reach Nash equilibrium—that is our primary outcome. If the adaptive-mutation group reaches Nash equilibrium in fewer generations than the control group, that supports the hypothesis. Nash equilibrium is not just a fancy name for "they settled down"—it is the specific, stable outcome from game theory that we use to define "settled," and the generation at which we reach it is the main number we use to test our hypothesis.

11. How we detect Nash equilibrium (technical)

Detection criterion. Nash equilibrium^{[1, 2]} is detected using entropy variance^[28] across the population, not mean policy entropy. For each generation we compute a scalar policy entropy per agent (expected entropy of the action distribution over a fixed set of sample inputs). The entropy variance is the variance of those per-agent entropies across the population.

Why variance rather than mean entropy. Mean policy entropy indicates how mixed or deterministic the average policy is, but it does not measure population-level homogeneity. At equilibrium we require that the strategy mix has stabilized^[4]—i.e., that agents no longer differ substantially in behavior. That corresponds to low variance of policy entropy across agents: when all agents have similar entropies, the population has converged to a homogeneous strategy mix. We therefore define convergence as the generation at which entropy variance falls below a threshold and remains below it for a fixed stability window (after an initial phase of divergence), with a post-convergence buffer to confirm stability.

12. Why do we need GPU workers?

We have 1,000 organisms, each with a neural network that does many multiplications every moment, and we run hundreds of generations. Doing that on an ordinary computer (CPU) would take a very long time—hours or days. A GPU (graphics card) is built to do thousands of simple math operations at once (originally for drawing graphics). We use it to run all 1,000 brains in parallel—like having 1,000 people each do one multiplication at the same time instead of one person doing 1,000.

Workers are the computers that have the GPU and actually run the simulation. The website you see is the "controller"; it sends the experiment settings to a worker, the worker runs the petri dish and evolution on its GPU, and sends the results back. So we need GPU workers to finish the experiment in a reasonable time and to separate the heavy computation (worker) from the interface and storage (web app). Think of the worker as a lab technician who runs the experiment and mails back the data.

13. What are we measuring?

The primary metric for testing our hypothesis is how many generations it takes to reach Nash equilibrium (convergence velocity). The other metrics (peak fitness, policy entropy) support the analysis, but convergence to Nash is the key outcome we compare between the two groups.

Important detail: Fitness score tells us how well organisms did, while entropy variance tells us how similar their decision-making styles are. Two groups could have similar fitness but still behave differently. That is why we track both performance and behavior.

Convergence velocity ("when did they reach Nash equilibrium?") — We record the generation number at which the population's behavior becomes stable: the variety of strategies (who forages, who attacks) stops changing much from generation to generation. We check this using entropy variance—how much the organisms differ from each other in how mixed or certain their decisions are. When that difference is small and stays small for many generations, everyone is behaving similarly and we say we have reached a Nash-like equilibrium. So convergence velocity = how many generations it took to get there. Faster convergence = fewer generations.
Peak fitness ("how good did they get?") — We record the highest fitness score that any organism (or the population) reached (see Key terms for how we calculate it). This tells us how well the evolved strategies performed in the petri dish.
Policy entropy ("how predictable are one organism's decisions?") — This number tells us whether an organism is still experimenting (high entropy) or has settled on a stable style (low entropy). We look at the variance of that number across all organisms to detect equilibrium: when the variance is low, everyone is similar; when it stays low for many generations, we have reached Nash equilibrium.

We are measuring how fast the population stabilizes and how well it does, and we compare these between the control and experimental groups.

14. Policy entropy vs. entropy variance: what's the difference?

Two of the most important numbers in this experiment sound similar — policy entropy and entropy variance — but they measure very different things. Understanding the distinction is critical to understanding how we detect Nash equilibrium.

Policy entropy: "How random is ONE organism's behavior?"

Imagine an organism standing in the petri dish. It has four possible actions: thrust, turn, shoot, or split. If its neural network says "82% thrust, 7% turn, 5% shoot, 6% split", that organism has a clear preference: it almost always thrusts forward. Its policy entropy is low.

On the other hand, if the neural network says "25% thrust, 25% turn, 25% shoot, 25% split", the organism is essentially flipping a coin every time. It has no strategy. Its policy entropy is high.

Mathematically, for a single organism with action probabilities p₁, p₂, p₃, p₄:

H = −(p₁·log(p₁) + p₂·log(p₂) + p₃·log(p₃) + p₄·log(p₄))

The number we report on the dashboard as "policy entropy" is the average of this value across 200 sampled organisms. It tells us: on average, how decisive is the typical organism?
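The formula above translates directly into a few lines of Python. This sketch assumes the natural logarithm and skips zero probabilities (their contribution is treated as zero). The function names are assumptions.

```python
import math


def policy_entropy(probs):
    """H = -sum(p * log(p)) over one organism's action probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(population_probs):
    """Average entropy over a sample of organisms."""
    entropies = [policy_entropy(p) for p in population_probs]
    return sum(entropies) / len(entropies)


undecided = [0.25, 0.25, 0.25, 0.25]  # every action equally likely
certain = [1.0, 0.0, 0.0, 0.0]        # always the same action
print(policy_entropy(undecided), policy_entropy(certain))
```

With four actions and the natural logarithm, the largest possible value is log(4), about 1.39, reached when every action is equally likely. The smallest is 0, reached when one action is always chosen.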

Entropy variance: "How DIFFERENT are the organisms FROM EACH OTHER?"

Policy entropy tells us about the average individual. Entropy variance tells us whether all the individuals agree. We take the entropy of each sampled organism and compute the variance (statistical spread) across the whole group:

Variance = (1/n) × Σ(Hᵢ − H̄)²

where Hᵢ is each organism's entropy and H̄ is the average entropy

Low variance means all organisms have similar entropy values. High variance means some are decisive while others are still confused.
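A short sketch of that variance calculation, dividing by n exactly as in the formula above. The function name and the example entropy values are made up for illustration.

```python
def entropy_variance(entropies):
    """Variance = (1/n) * sum((H_i - mean)^2) across sampled organisms."""
    n = len(entropies)
    mean = sum(entropies) / n
    return sum((h - mean) ** 2 for h in entropies) / n


similar = [0.50, 0.50, 0.50]  # everyone equally decisive
spread = [0.20, 0.40, 0.60]   # organisms disagree
print(entropy_variance(similar), entropy_variance(spread))
```

Python's standard library offers the same divide-by-n calculation as `statistics.pvariance`; it is written out here so each term of the formula is visible.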

Why variance matters more than entropy for detecting Nash equilibrium

Here is the key insight: you can have low average entropy but high variance, and it is NOT Nash equilibrium. Three scenarios illustrate this:

✓ Scenario A: Low entropy, LOW variance — Nash equilibrium

Organism 1: H = 0.30 (always goes forward)

Organism 2: H = 0.31 (always goes forward)

Organism 3: H = 0.29 (always goes forward)

Average entropy: 0.30 (low) • Variance: 0.0001 (low) • Everyone converged on the same strategy.

✗ Scenario B: Low entropy, HIGH variance — NOT Nash equilibrium

Organism 1: H = 0.10 (always attacks)

Organism 2: H = 1.20 (still exploring randomly)

Organism 3: H = 0.05 (always runs away)

Average entropy: 0.45 (moderate) • Variance: 0.28 (high) • Different strategies, no consensus; the population is still in flux.

⟳ Scenario C: High entropy, LOW variance — Starting conditions (generation 0)

Organism 1: H = 1.35 (random)

Organism 2: H = 1.38 (random)

Organism 3: H = 1.34 (random)

Average entropy: 1.36 (high) • Variance: 0.0003 (low) • Everyone is equally confused — this is where all experiments start.

This is exactly why our convergence detection algorithm requires the population to diverge first (variance goes up as organisms develop different strategies) and then re-converge (variance drops back down as they settle on a shared equilibrium). Only when variance stays low for 20 consecutive generations after having been high do we declare Nash equilibrium.

In one sentence: Policy entropy tells you what the average organism is doing. Entropy variance tells you whether they're all doing the same thing — and that consensus is what defines Nash equilibrium.

15. How do we measure and report results?

The previous sections explain what we measure (convergence velocity, peak fitness, policy entropy). This section explains how the software actually captures those numbers from the organisms—step by step, in plain language.

Step 1: Give every organism the same "pop quiz"

We create a set of fake scenarios—like "there's food to your left and an enemy ahead." These are just made-up sensor readings (24 numbers). Every organism gets the exact same quiz so we can compare their answers fairly.

Step 2: See what each organism would do

We feed those fake scenarios into each organism's neural network brain. The brain spits out 4 numbers—one for each action (thrust, turn, shoot, split). For example:

Organism A: [2.5, 0.1, 0.3, 0.1] ← strongly wants to move forward

Organism B: [0.5, 0.4, 0.6, 0.5] ← can't really decide

Step 3: Convert to percentages

We turn those raw numbers into percentages that add up to 100% (using a math function called softmax). Now the answers are easy to read:

Organism A: [77%, 7%, 9%, 7%] ← "I'm going forward, no question"

Organism B: [25%, 23%, 28%, 24%] ← "Uhhh… maybe shoot? I dunno"
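Step 3 can be reproduced with a short softmax sketch. The raw numbers are the two example outputs from Step 2; the function itself is a generic, numerically stable softmax and not necessarily how the worker implements it (a PyTorch worker would more likely call a built-in).

```python
import math

def softmax(logits):
    """Turn raw network outputs into probabilities that sum to 1."""
    m = max(logits)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

organism_a = softmax([2.5, 0.1, 0.3, 0.1])
organism_b = softmax([0.5, 0.4, 0.6, 0.5])

# Show each action as a whole-number percentage.
print([round(100 * p) for p in organism_a])
print([round(100 * p) for p in organism_b])
```

Because each percentage is rounded separately, a printed row can add up to 99 or 101 even though the underlying probabilities sum to exactly 1.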

Step 4: Measure how indecisive each one is (policy entropy)

We plug those percentages into the entropy formula. You don't need to know the math—just know that it produces a single number:

Organism A → lower entropy (~0.8, well under the four-action maximum of about 1.39) — it knows what it wants. Think of a student who circles "A" without hesitation.
Organism B → high entropy (~1.4) — it's basically guessing. Think of a student who stares at the test and randomly picks an answer.

Step 5: Check if everyone agrees (entropy variance)

Now we have an "indecisiveness score" for each of the 200 sampled organisms. We ask: "Are these scores all similar, or are they all over the place?"

If the scores are [0.3, 0.31, 0.29, 0.3, 0.28…] → low variance → everyone behaves the same way → the population has settled on a shared strategy.
If the scores are [0.3, 1.2, 0.8, 0.1, 1.4…] → high variance → organisms are doing their own thing → still evolving, haven't reached agreement yet.

Think of it like a classroom: low variance means everyone got roughly the same grade on the test (they all "agree"). High variance means grades are scattered from 20% to 95%—everyone is different.

Step 6: Detect Nash equilibrium

We repeat Steps 1–5 every single generation and track the entropy variance over time. When it drops below a small threshold (0.01) and stays there for 20 generations in a row—after the population has already spread out and tried different strategies—we declare that the population has reached Nash equilibrium. We record which generation that happened at. That generation number is the key result we compare between the control and experimental groups.
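The rule in Step 6 can be sketched as a small function. The threshold (0.01) and the 20-generation window come from the text; everything else is an assumption made for illustration: that "spread out" means variance has been at or above the same threshold at least once, and that the recorded generation is the one where the 20th consecutive low reading occurs. The real worker may define both differently.

```python
def detect_convergence(variances, threshold=0.01, window=20):
    """Return the generation index where convergence is declared, or None."""
    has_diverged = False   # has variance ever risen to the threshold or above?
    streak = 0             # consecutive low-variance generations since diverging
    for generation, variance in enumerate(variances):
        if variance >= threshold:
            has_diverged = True
            streak = 0
        elif has_diverged:
            streak += 1
            if streak >= window:
                return generation
    return None

# Hypothetical history: uniform start, a spread-out phase, then settling.
history = [0.0003] * 5 + [0.2] * 10 + [0.005] * 25
print(detect_convergence(history))
```

Requiring the divergence phase first is what keeps generation 0 (everyone equally random, so variance is already low) from being mistaken for an equilibrium.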

Step 7: Compare the two groups with statistics

After running many experiments in both the control and experimental groups, we have a list of convergence generation numbers for each group—for example, the control group might converge at generations [230, 241, 225, 238, …] and the experimental group at [210, 195, 208, 193, …]. We use a standard statistical test (Welch's t-test) to answer one question: "Is the difference between these two lists real, or could it be just luck?"

The p-value tells us how often a difference at least this large would show up by random chance alone if the two groups were actually the same. A very small p-value (like 0.001) means a gap this big would appear only about 0.1% of the time through pure luck, so we're confident the difference is real.
The effect size (Cohen's d) tells us how big the difference is. A small p-value says "the difference is real"; the effect size says "and the difference is large" (or small, or medium).

The dashboard displays all of these results automatically as experiments finish—the convergence generation for each experiment, the statistical comparison between groups, and the conclusion of whether the hypothesis is supported.

16. Understanding the statistical tests

After running hundreds of experiments in both the control and experimental groups, we need to answer two questions: "Is the difference real?" and "How big is the difference?" We use two standard scientific tools to answer them.

Welch's t-test: "Is the difference real, or just luck?"

Imagine you have two coins. You flip each one 100 times. Coin A lands on heads 53 times and Coin B lands on heads 58 times. Is Coin B actually weighted differently, or did it just get lucky? That's exactly what a t-test answers.

In our experiment, the "coins" are the two groups (control vs. experimental), and the "number of heads" is the generation number where each experiment converged. We collect all the convergence generations from each group and ask: is the average different enough that it probably isn't a coincidence?

The t-test gives us a p-value. Think of the p-value as a "suspicion meter":

p = 0.50 — If there were no real difference, random noise alone would produce a gap this big about half the time. Not suspicious at all. We'd say "there's no real difference."
p = 0.05 — Luck alone would produce a gap this big only about 5% of the time. That's the standard cutoff in science. Below this, we call it statistically significant.
p = 0.001 — Luck alone would produce this only about 0.1% of the time. Very strong evidence the difference is real.
p = — the p-value from our actual experiment (computed live from the database). In plain language: .

We use Welch's version of the t-test (rather than the basic Student's t-test) because it doesn't assume the two groups have the same amount of spread in their data. Since our control and experimental groups might vary differently, Welch's is the safer, more accurate choice.
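Here is a hedged sketch of the two numbers Welch's test is built on: the t statistic and the Welch-Satterthwaite degrees of freedom. The sample lists are the first few illustrative values from Step 7, not real results. Turning t and the degrees of freedom into a p-value needs the t-distribution, which the Python standard library does not provide; with SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` is one way to get it.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of group a's mean
    vb = variance(b) / len(b)   # squared standard error of group b's mean
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

control = [230, 241, 225, 238]        # illustrative convergence generations
experimental = [210, 195, 208, 193]   # illustrative convergence generations

t, df = welch_t(control, experimental)
print(round(t, 2), round(df, 2))
```

Each group keeps its own variance in the formula, which is exactly the "no equal spread assumed" property described above.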

Cohen's d: "Okay, the difference is real—but how big is it?"

A p-value tells you whether a difference exists, but not how much it matters. With enough data, even a tiny, meaningless difference can be "statistically significant." That's where Cohen's d (the effect size) comes in.

Think of it this way: imagine measuring the height of students in two classrooms. Cohen's d answers: "If I pulled a random student from each classroom, how obvious would the height difference be?"

d < 0.2 (Negligible) — You wouldn't notice the difference. Like comparing heights of two random groups of the same age.
d = 0.2–0.5 (Small) — You'd notice if you looked carefully. Like comparing the heights of 13-year-olds to 14-year-olds.
d = 0.5–0.8 (Medium) — Clearly noticeable. Like comparing the heights of 12-year-olds to 15-year-olds.
d > 0.8 (Large) — Obvious to anyone. Like comparing the heights of 10-year-olds to adults.

Our experiment's Cohen's d is (computed live from the database), which puts the effect .
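A minimal sketch of Cohen's d and the labels in the list above. It assumes the common pooled-standard-deviation form of d; the source does not say which variant the dashboard uses, and the handling of values that land exactly on 0.2, 0.5, or 0.8 is a choice made here for illustration.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def effect_label(d):
    """Map |d| onto the four bands described above (boundary handling assumed)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

d = cohens_d([2, 4, 6], [1, 3, 5])   # toy numbers, not experiment data
print(d, effect_label(d))
```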

Putting it together

17. Seed pairs: matched experiments and data integrity

Earlier (in the "Why We Use the Same Seed" box near the top of this page) we mentioned that we start the control and experimental groups from the same random seed so that the comparison is fair. That idea is so central to this project that it deserves its own section. The dashboard's Seed Pair Progress card and Paired Seed Comparison scatter plot are not just nice visualizations — they are how this experiment turns a difference of opinion ("I think adaptive mutation is faster") into a result that the scientific method can actually test.

What is a seed pair?

A seed is just a number that locks in every random choice the simulation makes — the starting positions of the 1,000 organisms, the initial weights inside their neural-network brains, where food pellets spawn, and so on. Two experiments started with the same seed face mathematically identical worlds.

A seed pair is a matched set of runs at the same seed: one Control run (fixed mutation) and one Experimental run (adaptive mutation). Because both runs faced the same starting brains, the same food layout, and the same random events, the only thing that can differ between their outcomes is the one variable we deliberately changed: the mutation strategy.

Why seed pairs prove data integrity

Imagine running two experiments on different days, with different starting brains and different food layouts. The Experimental run converges at generation 200 and the Control at 230. Was the difference caused by adaptive mutation — or did the Experimental run just get lucky with where the food spawned? You cannot tell. That is called a confounded result, and it is what bad science looks like.

Seed pairs eliminate that confound completely. Both runs in a pair start in the same world. If the Experimental run still finishes first, the only thing that could have caused the difference is the mutation strategy. This is what makes our data trustworthy: every comparison is apples-to-apples by construction, not by luck.

This is the textbook definition of a controlled experiment: hold everything constant except the one variable you are testing. The seed is what lets us do that with computational precision — far more cleanly than is usually possible in biology, psychology, or medicine.

Why pairs make the hypothesis testable by the scientific method

The scientific method demands a falsifiable hypothesis — one that could be shown wrong if the data went the other way. Our hypothesis is specific: at the same seed, adaptive mutation will reach Nash equilibrium in fewer generations than fixed mutation. Because each seed pair is a head-to-head comparison under identical conditions, every single pair is a tiny, complete test of that hypothesis. If adaptive mutation had no real advantage, we would expect the Experimental run to win on roughly half the pairs and lose on the other half (a coin-flip split). If it instead wins on the large majority of pairs, that is direct evidence against the null hypothesis.

We don't trust just one pair — the strength tiers

Even within a single seed, the same code can produce slightly different convergence generations from one run to another. That is because the GPU runs thousands of calculations in parallel, and tiny floating-point ordering differences can nudge the trajectory of evolution. So instead of comparing one CTRL run to one EXP run at each seed, we run several of each and average them.

The dashboard sorts seeds into four tiers based on how many converged runs each side has:

Strong (≥30 each side) — rock-solid. The averaged convergence generation has very little room left for noise.
Acceptable (10–29 each side) — enough data to compare with confidence. We treat ≥10 each as the minimum threshold for a viable pair.
Weak (1–9 each side) — usable, but the seed's mean still wobbles. These pairs contribute less weight.
Unpaired — only one side has converged data. We can't compare yet, so this seed is excluded from the paired analysis.

That is what the "X/Y paired seeds" counter on the dashboard is tracking: how many seeds have crossed the bar to count toward the headline result.
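The tier rules can be written as a tiny function. This sketch assumes "each side" means the tier is set by whichever side has fewer converged runs; the function and argument names are invented for illustration.

```python
def seed_tier(ctrl_converged, exp_converged):
    """Classify one seed by the smaller of its two converged-run counts."""
    weaker_side = min(ctrl_converged, exp_converged)
    if weaker_side >= 30:
        return "Strong"
    if weaker_side >= 10:
        return "Acceptable"
    if weaker_side >= 1:
        return "Weak"
    return "Unpaired"   # at least one side has no converged data yet

print(seed_tier(34, 31))
print(seed_tier(34, 12))
print(seed_tier(15, 0))
```

Using the minimum means a seed with 40 control runs but only 3 experimental runs is still treated as Weak, since its experimental mean is the noisy one.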

How seed pairs make the statistical analysis stronger

Section 16 explained Welch's t-test, which compares the two big lists of convergence generations — all controls vs. all experimentals. That test is valid, but it has to fight through a lot of noise: some seeds are just naturally harder than others, and that seed-to-seed variation gets thrown into the mix.

Because every seed in our experiment is matched, we can also run a paired t-test. Instead of comparing two big lists, the paired t-test looks at the difference within each pair and asks: are these per-pair differences consistently away from zero? Two seeds with very different absolute convergence generations — say (220 vs 195) and (240 vs 215) — both have the same difference of −25, and the paired test sees that consistency clearly.

Why this matters: the paired t-test has more statistical power than the unpaired version because it removes seed-to-seed noise. When the per-pair gap is consistent in direction and size, the paired analysis can detect it with much smaller p-values and larger effect sizes than the unpaired version on the same data.

Cohen's d for the paired test (sometimes written d_z) is computed from the per-pair differences: the mean difference divided by the standard deviation of those differences. A small standard deviation of differences means the per-seed gaps cluster tightly around a single value, which is even stronger evidence than a large average alone could provide.
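The paired quantities described above can be sketched as follows. The per-seed means are hypothetical, and the sign convention (experimental minus control, so a negative number means adaptive was faster) is chosen to match the −25 example; the dashboard's own convention is not stated here.

```python
import math
from statistics import mean, stdev

def paired_stats(control_means, experimental_means):
    """Mean per-pair difference, d_z, and the paired t statistic."""
    diffs = [e - c for c, e in zip(control_means, experimental_means)]
    d_bar = mean(diffs)
    s_d = stdev(diffs)                       # spread of the per-pair differences
    d_z = d_bar / s_d                        # paired effect size
    t = d_bar / (s_d / math.sqrt(len(diffs)))
    return d_bar, d_z, t

control_means = [220, 240, 231, 250]         # hypothetical per-seed averages
experimental_means = [195, 215, 209, 222]    # hypothetical per-seed averages

d_bar, d_z, t = paired_stats(control_means, experimental_means)
print(d_bar, round(d_z, 2), round(t, 2))
```

Two details follow directly from the formulas: t is just d_z multiplied by the square root of the number of pairs, and if every pair had exactly the same difference the standard deviation would be zero, so real code needs a guard for that case.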

What about outliers? Sometimes the control wins

On the Paired Seed Comparison scatter plot, every dot is one matched seed. The diagonal line is "tied." Dots below the diagonal are seeds where adaptive was faster; dots above are seeds where the control was faster. The dashboard reports both counts honestly — in the current data, roughly of pairs go the adaptive way, but a small fraction (about ) come back the other direction.

That is not a flaw — that is what real data looks like. Adaptive mutation is a probabilistic advantage, not a guarantee. On any particular seed, several things can flip the comparison:

The random starting brains for that seed might already sit close to a stable equilibrium, in which case the small fixed mutation rate happens to nudge them in gently while adaptive mutation's larger early-generation steps temporarily push them away before settling.
The food and predation dynamics for that particular world might happen to reward less exploration, so adaptive mutation's extra exploration becomes a slight cost rather than a benefit.
At very low pair counts (a Weak-tier seed with only 1–2 runs each side), the seed mean still has noise of its own and a single unlucky run can flip the direction of that pair.

The paired t-test does not ask "does adaptive win every time?" It asks "across all paired seeds, is the average difference different from zero?" With of pairs leaning the adaptive way and a mean difference of generations, the paired test's current verdict on the live data is: , with the effect size . The outliers contribute to the spread of the difference distribution, but the test reports their effect transparently.

Reporting outliers is part of data integrity. Real scientific experiments almost never produce a 100% one-sided result. A finding that comes back with a clean win-rate, an effect size, and a p-value — outliers and all — is far more credible than one that claims "always." The fact that the dashboard plots every dot, including the ones that disagree with the hypothesis, is precisely what makes the conclusion trustworthy.

In one sentence: seed pairs are what turn this from "we ran some experiments and one group looked different" into "we ran the same experiment under both conditions on every seed, and the live paired analysis reports the result — direction, magnitude, and outliers — transparently and dynamically." That is the scientific method, applied honestly.

18. Why adaptive wins overall — even when individual seeds disagree

One of the most common (and most healthy) reactions to the headline result on the dashboard is some version of: "Wait. Adaptive doesn't win every seed. And the dashboard is showing more control runs converging than experimental runs. How can you still claim adaptive is better?" That is exactly the right question to ask, and answering it properly is what separates a result you should trust from a result you should question. This section addresses the disparity head-on and shows the actual math, with every number computed live from the running database.

Acknowledging the disparity honestly

There are two real disparities in the data right now:

Pair-level losses. . The losses are not a glitch — they are part of the natural variance of evolution.
Convergence-rate gap. . Some of this is sampling lag (the queue prioritizes filling the slowest seed first), but even taking it at face value, it does not change the conclusion below.

The argument we are about to make is: even if you take both of those observations at their worst, the paired analysis still overwhelmingly supports adaptive mutation. We will show this four different ways: a model-free sign test, a paired t-test with confidence interval, a sensitivity analysis that throws away the most favorable wins, and a raw magnitude argument.

Step 1 — The sign test (no math required)

Forget about how much faster adaptive is for a moment. Just count, across all matched seeds, how often it finishes first. If the two strategies were equally good, that count should look like flipping a fair coin once per seed: about half wins, half losses, give or take random luck.

What we actually see is . The sign test makes no assumptions about distributions, normality, or magnitudes — it just counts directions, and the direction it counts is currently . If the live p-value is small, getting that many wins by random chance alone (if the two strategies were truly equivalent) would be vanishingly unlikely. If it ever rises, this paragraph will reflect that.
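For readers who want to see the coin-flip logic as arithmetic, here is a sketch of an exact sign test. It assumes a two-sided test with ties dropped before counting; whether the dashboard uses a one-sided or two-sided version is not stated here. The win and loss counts are placeholders, not live results.

```python
from math import comb

def sign_test_p(wins, losses):
    """Exact two-sided sign-test p-value under a fair-coin null (ties excluded)."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction...
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # ...doubled to cover both directions, capped at 1.
    return min(1.0, 2 * tail)

print(sign_test_p(5, 5))    # an even split is not surprising at all
print(sign_test_p(9, 1))    # a lopsided split is much harder to get by luck
```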

Step 2 — The paired t-test (now using the magnitudes)

The sign test ignores how large each gap is. The paired t-test fixes that: it asks whether the average size of the per-pair difference is far enough from zero to be unlikely under the null hypothesis "the two groups are equivalent."

Live result: . By Cohen's convention, d_z values above 0.8 are considered "large"; the current paired effect size sits . And we are not just relying on a single point estimate — we get . The lower bound of that interval is currently .

On the proportion side, a Wilson 95% confidence interval for the true win-rate (the probability that adaptive will beat control on a fresh seed) lands at . That lower bound is currently .
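The Wilson interval mentioned above has a closed form, sketched here with the standard library only. The example counts are hypothetical.

```python
import math
from statistics import NormalDist

def wilson_interval(wins, n, confidence=0.95):
    """Wilson score interval for a win-rate of wins out of n pairs."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    p_hat = wins / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

low, high = wilson_interval(8, 10)   # hypothetical: 8 adaptive wins in 10 pairs
print(round(low, 3), round(high, 3))
```

If the lower bound sits above 0.5, even the pessimistic end of the interval says adaptive wins more often than a coin flip would.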

Step 3 — Sensitivity: what if the biggest wins are flukes?

A reasonable skeptic might worry that the average is being carried by a few seeds where adaptive happened to win huge, and that the "real" population difference is much smaller. So let's deliberately throw away those wins and run the paired t-test again, on the remaining (less favorable) data:

The pattern in that table is the point. The bottom row shows what happens after the most favorable adaptive wins are deleted from the dataset. If the remaining differences still produce small p-values and a non-trivial mean gap, the result is not propped up by a handful of lucky seeds — it is a population-wide phenomenon. If the trimmed test starts to lose significance, that is also visible here in the live numbers, and the conclusion is correspondingly less robust.
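The trimming idea can be sketched like this. The differences are made-up numbers (experimental minus control, so negative means adaptive was faster), and how many wins the dashboard actually removes is not stated here, so `k` is a free parameter in this illustration.

```python
import math
from statistics import mean, stdev

def paired_t(diffs):
    """Paired t statistic for a list of per-pair differences."""
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

def trim_best_wins(diffs, k):
    """Drop the k most negative differences, i.e. the biggest adaptive wins."""
    return sorted(diffs)[k:]

diffs = [-60, -31, -28, -25, -22, -20, -18, 4, 9]   # hypothetical per-pair gaps

for k in (0, 1, 2):
    kept = trim_best_wins(diffs, k)
    print(k, len(kept), round(mean(kept), 2), round(paired_t(kept), 2))
```

Trimming only from the favorable end is deliberately unfair to the hypothesis: the mean can only move toward zero, so any effect that survives is not resting on a few extreme seeds.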

Step 4 — The raw magnitude argument

Forget statistics for a moment and just look at the raw arithmetic. If you sum up every generation that adaptive shaved off relative to control across every winning pair, and separately sum up every generation that control shaved off relative to adaptive across every losing pair, you get one of the most striking numbers on the dashboard:

. In summary, .

For colour, the extreme pairs are: .

What about the convergence-rate gap?

The other disparity is that, as of the moment you load this page, the two sides may have different numbers of converged runs (see the live counters above). There are two things to say about this. First, the headline paired analysis specifically only uses seeds where both sides have converged data — so the count gap cannot bias that comparison. Second, looking just at the matched seeds where the comparison is fair: .

In other words: convergence-count differences are mostly a function of which experiments the queue has scheduled to completion at this moment in time, not a statement about whether adaptive runs are failing to converge. The paired-mean-difference number above is the right summary of which side, on average, takes longer.

Putting it together

All of the numbers in this section are computed from the live database every time you load this page. As the experiment continues to run, the specific values above will shift slightly — what won't shift is the conclusion they support.

19. Understanding IQR and confidence intervals

Two statistical concepts appear frequently on the results dashboard: the Interquartile Range (IQR) and the 95% Confidence Interval (CI). Both describe "spread," but they answer very different questions.

IQR: "Where did the middle 50% of experiments land?"

Imagine lining up every converged experiment from fastest to slowest. The first quartile (Q1) is the point where 25% of experiments finished faster, and the third quartile (Q3) is where 75% finished faster. The gap between them is the IQR:

IQR = Q3 − Q1

The range that contains the middle 50% of convergence generations

A small IQR means most experiments converged around the same generation—the process is consistent. A large IQR means there was more variability in how long experiments took.
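A short sketch of the IQR calculation. Several slightly different quartile conventions exist; this one uses the standard library's "inclusive" method as an assumption, and the two lists are invented to contrast a tight group with a wide one.

```python
from statistics import quantiles

def iqr(values):
    """Return (Q1, Q3, IQR) for a list of convergence generations."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3, q3 - q1

tight = [228, 230, 231, 233, 234, 236, 238]   # hypothetical: consistent runs
wide = [150, 175, 190, 205, 230, 260, 300]    # hypothetical: varied runs

print(iqr(tight))
print(iqr(wide))
```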

Control group (Static ε)

With a fixed mutation rate, every experiment faces the same evolutionary pressure. The result is a tight IQR—most experiments converge within a narrow window of generations. Think of it like a class of students who all study the exact same way: they all finish the exam around the same time.

Experimental group (Adaptive ε)

Adaptive mutation creates multiple evolutionary pathways. Some experiments find a fast route and converge early; others explore longer before settling. The result is a wider IQR—but on average, convergence still happens faster. The students each develop their own study strategy: some finish early, some later, but the class average is better.

95% Confidence Interval: "How sure are we about the true difference?"

Our experiments are a sample—we ran , but we could theoretically run millions. The confidence interval gives us a range where the true average difference between control and experimental almost certainly lives.

CI = (x̄₁ − x̄₂) ± t × SE

where SE is the standard error of the difference between means

On the live data, the 95% CI for the mean per-pair difference is currently generations. That reads as: "We are 95% confident the true average difference in convergence speed is between and generations."

✓ CI does NOT include zero

The difference is statistically significant. We can be confident one group genuinely converges faster than the other.

✗ CI includes zero

We cannot rule out that there is no real difference. The observed gap might be due to random chance.
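The interval and the "does it include zero?" check can be sketched as below for the mean per-pair difference. One simplification to flag: this uses the normal critical value (about 1.96) in place of t, which is a large-sample approximation; with few pairs the t value is larger and the interval wider (with SciPy, `scipy.stats.t.ppf(0.975, n - 1)` gives it). The differences are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_diff_ci(diffs, confidence=0.95):
    """Approximate CI for the mean per-pair difference (normal critical value)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = stdev(diffs) / math.sqrt(len(diffs))   # standard error of the mean difference
    centre = mean(diffs)
    return centre - z * se, centre + z * se

def excludes_zero(ci):
    """True when the whole interval lies on one side of zero."""
    low, high = ci
    return low > 0 or high < 0

diffs = [-25, -22, -28, -25, -30, -20, -26, -24]   # hypothetical per-pair gaps
ci = mean_diff_ci(diffs)
print(ci, excludes_zero(ci))
```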

How they work together: The IQR tells you how spread out individual experiments are within each group. The confidence interval tells you how precisely we know the average difference between the two groups. Even when a group has a wide IQR (high variability among individual experiments), the CI can still be narrow if we have enough data—which is exactly what we see in EvoNash with .

20. The scale of this experiment

One question people often ask is: "Why do you need computers to do the statistics? Can't you just look at the numbers?" The short answer is that the dataset is far too large for any human to process by hand.

How much data does one experiment produce?

Each experiment runs 1,000 organisms for up to generations, with each generation comprising 750 simulation ticks (steps). That means a single experiment involves:

At every tick, each organism's 1,860-weight neural network processes 24 sensor inputs and produces 4 action outputs. All of these weights, inputs, and outputs are numbers that contribute to the statistics.
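Two of the figures above can be checked with simple arithmetic. The hidden-layer size of 64 below is an inference, not something stated in this section: it is a single-hidden-layer size for which 24 inputs, 4 outputs, and one bias per neuron add up to exactly 1,860 weights. The real architecture may differ.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a fully connected network with these layer sizes."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

# Assumed architecture: 24 sensor inputs -> 64 hidden units -> 4 action outputs.
print(dense_param_count([24, 64, 4]))

# Decisions made in one generation: every organism acts on every tick.
organisms, ticks_per_generation = 1000, 750
print(organisms * ticks_per_generation)
```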

How much data does the full study produce?

We don't run just one experiment—we run of them to achieve statistical significance. That multiplies the numbers above dramatically:

Each generation also logs a row of statistics (average fitness, peak fitness, policy entropy, entropy variance, mutation rate, population diversity). With .

Why computers are essential

With this volume of data, doing the analysis by hand would be impractical:

Running the simulation itself requires a GPU (graphics processing unit) because 1,000 neural networks need to make decisions simultaneously, 750 times per generation. A GPU can process all 1,000 agents in parallel; a human calculator would take years per experiment.
Computing entropy variance requires evaluating each organism's neural network on sample inputs, applying softmax, calculating Shannon entropy, and then computing the variance across all 200 sampled organisms—every single generation. That's tens of thousands of entropy calculations.
Statistical testing across experiments means computing Welch's t-test, Cohen's d, bootstrap confidence intervals (10,000 resamples), normality checks (Shapiro-Wilk), and power analysis. Each of these involves hundreds or thousands of mathematical operations.
Real-time monitoring on the dashboard requires the database to query, aggregate, and return data for charts and tables in under a second—something only possible with a proper PostgreSQL database and a web application.

In short, this experiment produces an enormous volume of data, and every layer—from running the simulation to detecting convergence to testing the hypothesis—relies on computational power that would be impossible to replicate by hand.

21. Why is this relevant for the future of AI, and how could it be expanded?

This experiment sits at the intersection of three powerful fields: evolutionary computing [10, 11, 12] (improving AI through trial and error over generations), game theory [3, 4, 5] (understanding strategic decision-making when multiple agents interact), and neural networks [6, 8, 9] (giving agents brains that can process information and make decisions). That combination isn't just academic—it has direct applications to some of the most important challenges facing humanity.

Real-World Applications

Autonomous vehicles and robotics: Self-driving cars must constantly make decisions in a world full of other drivers, pedestrians, and cyclists—all making their own decisions simultaneously. This is exactly the kind of multi-agent game theory problem our experiment studies. If adaptive mutation helps populations of AI agents find stable strategies faster, that same principle could help robot fleets (warehouse robots, delivery drones, self-driving trucks) learn to coordinate efficiently without crashing into each other.
Economics and market design: Stock markets, auctions, and supply chains are all systems where many agents (traders, companies, consumers) interact strategically. Economists use Nash Equilibrium to predict market outcomes. Our experiment tests whether there are faster ways to find these equilibria—which could help design fairer auction systems, more efficient markets, or better pricing algorithms.
Drug discovery and protein design: Pharmaceutical companies use evolutionary algorithms to search through billions of possible molecular structures to find effective drugs. The question of "how much should we mutate?" is directly relevant—adaptive mutation could help these searches converge on promising drug candidates faster, potentially saving years of research time.
Climate and resource management: Managing shared resources (fisheries, forests, water supplies) involves many stakeholders making independent decisions. Game theory helps model these "tragedy of the commons" situations. Understanding how populations converge to stable strategies could inform policies that help communities reach sustainable equilibria faster.
Multi-agent AI systems: As AI becomes more common, we increasingly have situations where multiple AI systems interact—chatbots negotiating, trading algorithms competing, or recommendation systems influencing each other. Understanding how populations of AI agents reach equilibrium is crucial for ensuring these systems behave predictably and safely.
Cybersecurity: Attackers and defenders in cybersecurity are engaged in a constant strategic game. Evolutionary approaches to security (where defense strategies evolve in response to attacks) could benefit from adaptive mutation to find robust defense strategies more quickly.

Why Adaptive Mutation Matters Beyond This Experiment

The core question—"should we change things more when they're not working and less when they are?"—is fundamental to optimization everywhere. Currently, most evolutionary algorithms use fixed mutation rates, which is like always turning every knob by the same amount regardless of whether you're close to a good solution or far away. If our experiment demonstrates that adaptive mutation accelerates convergence, that finding could be applied to:

Training neural networks more efficiently (reducing the enormous energy cost of training large AI models) [13, 14]
Optimizing engineering designs (aircraft wings, circuit layouts, antenna shapes) faster [11]
Evolving game-playing AI (like AlphaGo's successors) with less computational cost [26]
Accelerating scientific simulations that use evolutionary search methods [17, 18]

How This Experiment Could Be Expanded

This kind of experiment could grow in many directions. We could use bigger or more complex worlds (3D environments, multiple food types, predator-prey ecosystems), larger populations or bigger neural networks, or entirely different mutation strategies (for example, letting the organisms learn their own mutation rate over time). We could also introduce cooperation (organisms that work together to hunt) or communication (organisms that signal to each other), creating richer game-theoretic scenarios. The goal is to show that evolutionary game-theoretic experiments can scale from a science fair project to tools that benefit both scientific research and real-world industry.

22. Try it yourself

EvoNash is fully open-source. If you would like to reproduce this experiment—or run your own variation of it—everything you need is on GitHub:

github.com/jdefouw/EvoNash

What You Need

An NVIDIA GPU with CUDA support — any modern NVIDIA card will work (e.g. RTX 2060 or newer). The simulation runs entirely on the GPU, so a dedicated graphics card is required. An RTX 3090 is recommended for fastest results.
Python 3.8+ with PyTorch (CUDA version) — this powers the simulation worker.
Node.js 20+ and PostgreSQL 16 — these run the web dashboard and store experiment data.

Quick Start (3 Steps)

1. Clone the repository

git clone https://github.com/jdefouw/EvoNash.git
cd EvoNash

2. Set up the web dashboard

# Install PostgreSQL, create a database, and apply the schema
psql -U evonash -d evonash -f web/lib/sql/schema_standalone.sql

cd web
npm install
cp .env.example .env   # then edit .env with your database connection string
npm run build
npm start

3. Run the GPU worker

# Install PyTorch with CUDA (see https://pytorch.org for your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

cd worker
pip install -r requirements.txt
# Edit config/worker_config.json — set controller_url to your dashboard URL
python run_worker.py

Running Your Own Experiments

Once the dashboard is running and the worker is connected, open the dashboard in your browser, navigate to Experiments → New Experiment, and create experiments in both the Control and Experimental groups. Use the same random seed for each pair (e.g. 42) so the only difference is the mutation strategy. For statistically meaningful results, run at least 5 experiments per group with different seeds.

The dashboard will automatically compute convergence generation, peak fitness, and statistical comparisons (t-tests, effect sizes) as data comes in. You can watch the experiments progress in real time.

Contribute or Extend

Contributions are welcome! Some ideas for extensions: try different neural network architectures, experiment with new mutation strategies (e.g. self-adaptive rates), add cooperation mechanics, or scale to larger populations. See the project's PROJECT_SPEC.md for the full technical specification.

23. References

28 sources across 9 categories

Game Theory & Nash Equilibrium

1. Nash, J.F. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1), 48–49. doi:10.1073/pnas.36.1.48
2. Nash, J.F. (1951). Non-cooperative games. Annals of Mathematics, 54(2), 286–295. doi:10.2307/1969529
3. von Neumann, J. & Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press.
4. Maynard Smith, J. (1982). Evolution and the Theory of Games. Cambridge University Press. doi:10.1017/CBO9780511806292
5. Axelrod, R. (1984). The Evolution of Cooperation. Basic Books.

Neural Networks

6. McCulloch, W.S. & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4), 115–133. doi:10.1007/BF02478259
7. Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. doi:10.1037/h0042519
8. Rumelhart, D.E., Hinton, G.E. & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. doi:10.1038/323533a0
9. Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press.

Evolutionary Computing & Neuroevolution

10. Holland, J.H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press.
11. Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.
12. Stanley, K.O. & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2), 99–127. doi:10.1162/106365602320169811
13. Salimans, T., Ho, J., Chen, X., Sidor, S. & Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
14. Such, F.P., Madhavan, V., Conti, E., Lehman, J., Stanley, K.O. & Clune, J. (2017). Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567.

Adaptive Mutation & Self-Adaptation

15. Bäck, T. (1993). Optimal mutation rates in genetic search. Proceedings of the 5th International Conference on Genetic Algorithms, 2–8.
16. Smith, J.E. & Fogarty, T.C. (1997). Operator and parameter adaptation in genetic algorithms. Soft Computing, 1(2), 81–87. doi:10.1007/s005000050009
17. Eiben, A.E., Hinterding, R. & Michalewicz, Z. (1999). Parameter control in evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 3(2), 124–141. doi:10.1109/4235.771166
18. Karafotias, G., Hoogendoorn, M. & Eiben, A.E. (2015). Parameter control in evolutionary algorithms: Trends and challenges. IEEE Transactions on Evolutionary Computation, 19(2), 167–187. doi:10.1109/TEVC.2014.2308294

Stress-Induced Mutagenesis

19. Radman, M. (1975). SOS repair hypothesis: Phenomenology of an inducible DNA repair which is accompanied by mutagenesis. Basic Life Sciences, 5A, 355–367. doi:10.1007/978-1-4684-2895-7_48
20. Tenaillon, O., Taddei, F., Radman, M. & Matic, I. (2004). Second-order selection in bacterial evolution: Selection acting on mutation and recombination rates in the course of adaptation. Research in Microbiology, 155(6), 457–463. doi:10.1016/j.resmic.2004.01.013

Statistical Methods

21. Welch, B.L. (1947). The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34(1–2), 28–35. doi:10.1093/biomet/34.1-2.28
22. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
23. Hedges, L.V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6(2), 107–128. doi:10.3102/10769986006002107
24. Mann, H.B. & Whitney, D.R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50–60. doi:10.1214/aoms/1177730491

Multi-Agent Systems & Reinforcement Learning

25. Shoham, Y. & Leyton-Brown, K. (2009). Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press.
26. Sutton, R.S. & Barto, A.G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.

GPU Computing

27. Nickolls, J., Buck, I., Garland, M. & Skadron, K. (2008). Scalable parallel programming with CUDA. ACM Queue, 6(2), 40–53. doi:10.1145/1365490.1365500

Information Theory

28. Shannon, C.E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x
