Can you build AlphaZero from first principles?
Not by importing libraries—but by constructing the complete mathematical foundations: σ-algebras → Banach spaces → Markov chains → MDPs → stochastic approximation → deep RL theory.
This repository documents a 48-week odyssey from Folland and Brezis to self-play game AI, proving every theorem needed along the way.
After many years in industry, building systems and shipping products, I felt a deepening urge to return to something I'd set aside: the eternal beauty of pure mathematics. Not as an escape from applicability—but as a way to understand why the tools we use actually work.
This journey reconnects two worlds I've lived in: the rigorous abstractions that captivated me during my PhD, and the practical algorithms that power modern AI. It's a bridge built in both directions—honoring the elegance of mathematical structure while never forgetting that these theorems exist because real problems demanded them.
Target audience: Anyone who feels this same pull—researchers with strong mathematical background (physics, pure math, engineering) who want to truly understand reinforcement learning, not just run it. Those who've spent years in applied work but hunger for the deeper "why."
```
┌─────────────────────────────────────────────────────────────────────────┐
│  PHASE I: MEASURE THEORY (Weeks 1-6)                                    │
│  σ-algebras → Integration → Lᵖ Spaces → Conditional Expectation         │
└─────────────────────────────┬───────────────────────────────────────────┘
                              │
              ┌───────────────┴────────────────┐
              │                                │
              ▼                                ▼
┌──────────────────────────┐    ┌────────────────────────────────┐
│ PHASE II: MARKOV CHAINS  │    │ PHASE III: FUNCTIONAL ANALYSIS │
│ (Weeks 7-12)             │    │ (Weeks 13-18)                  │
│ Chains → Ergodic Theory  │    │ Banach → Operators → Fixed Pt  │
│ MCMC → General Spaces    │    │ Spectral → Semigroups          │
└────────────┬─────────────┘    └───────────┬────────────────────┘
             │                              │
             └──────────────┬───────────────┘
                            │
                            ▼
             ┌─────────────────────────────┐
             │  PHASE IV: SOBOLEV & PDEs   │
             │  (Weeks 19-24)              │
             │  Weak Derivatives → HJB     │
             │  Lax-Milgram → Viscosity    │
             └──────────┬──────────────────┘
                        │
           ┌────────────┴─────────────┐
           │                          │
           ▼                          ▼
┌───────────────────────┐    ┌──────────────────────────┐
│ PHASE V: MDPs         │    │ PHASE VI: BANDITS        │
│ (Weeks 25-28)         │    │ (Weeks 29-33)            │
│ Bellman → Value Iter  │    │ Regret → UCB             │
│ Policy Iter → Avg Rwd │    │ Thompson → Contextual    │
└──────────┬────────────┘    └────────┬─────────────────┘
           │                          │
           └──────────┬───────────────┘
                      │
                      ▼
┌─────────────────────────────────────┐
│  PHASE VII: STOCHASTIC APPROX       │
│  (Weeks 34-39)                      │
│  Robbins-Monro → ODE Method         │
│  TD Learning → Q-learning → PG      │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│  PHASE VIII: ADVANCED TOPICS        │
│  (Weeks 40-43)                      │
│  Continuous MDPs → Mean-Field       │
│  Deep RL Theory → Synthesis         │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│  🎯 PHASE IX: ALPHAZERO CAPSTONE    │
│  (Weeks 44-48)                      │
│  MCTS → Neural Nets → Self-Play     │
│  Implementation → Analysis          │
└─────────────────────────────────────┘
```
Key Connections:
- Measure Theory → Probability foundations for MDPs
- Markov Chains → Policy evaluation, exploration
- Functional Analysis → Value function spaces, Bellman operators
- Sobolev/PDEs → Continuous control, HJB equations
- MDPs → Formal RL framework
- Bandits → Exploration-exploitation
- Stochastic Approximation → All RL algorithms (TD, Q-learning, PG)
- Deep RL Theory → Neural function approximation
- AlphaZero → Theory meets practice
A textbook in the making, written in real time as daily study notes are transformed into publication-quality exposition.
Three commitments:
- Rigor without compromise: Every theorem proven, every definition precise, counterexamples for necessity
- RL as the North Star: Every abstraction justified by its necessity in reinforcement learning
- Realistic constraints: 90-minute daily target (up to 2.5 hours for dense material), weekends completely off
Not another tutorial. A bridge between:
- Bourbaki's architectural vision (build from the ground up)
- Brezis's pedagogical clarity (rigorous but readable)
- DeepMind's algorithmic practice (code that runs at frontier labs)
Unlike traditional RL courses that separate theory from practice, this journey offers four distinctive features that bridge the gap:
Every implementation includes interactive visualization of AI decision-making:
Terminal:

```
python play_terminal.py --show-thinking
```

```
🤔 AI CANDIDATE EVALUATIONS
Position   Outcome    Assessment           Best
────────────────────────────────────────────────────────────
a1 (0)     ⚖️ Draw    Balanced position    ⭐ YES
b2 (4)     ⚖️ Draw    Balanced position    ⭐ YES
...
```

Web: Toggle "Show Candidate Moves" to see color-coded value overlays (green = strong, red = weak).
Impact: You don't just learn "$\pi^*(s) = \arg\max_a Q^*(s,a)$" — you watch the algorithm evaluate every move and choose the best one.
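To make that concrete, here is a toy sketch of the selection step itself; the Q-values and move labels below are invented for illustration and are not the repository's actual interface:

```python
# Toy illustration of greedy action selection, pi*(s) = argmax_a Q*(s, a).
# The Q-values here are hypothetical, not output from the repository's solver.
q_values = {"a1": 0.0, "b2": 0.0, "c3": -0.5}  # assumed candidate evaluations

def greedy_action(q: dict[str, float]) -> str:
    """Return the move with the highest estimated value."""
    return max(q, key=q.get)

print(greedy_action(q_values))  # 'a1' (ties broken by insertion order)
```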
All games in this course use algebraic notation (a1, b2, c3), the universal language of board games:
- Same coordinate system as chess, Go, Connect Four
- Works for any board size (3×3 → 15×15 Gomoku)
- Industry-standard presentation (portfolio-ready demos)
Why it matters: "Play a1" vs. "Play position 0" — one is professional, the other is a toy example.
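For illustration, one possible conversion between algebraic squares and flat array indices (a hypothetical helper; the repository's own utilities may differ):

```python
# Hypothetical helpers for algebraic <-> index conversion on an n x n board.

def algebraic_to_index(square: str, board_size: int = 3) -> int:
    """'a1' -> 0, 'b2' -> 4, ... (letter = column, number = row, both from a1)."""
    col = ord(square[0]) - ord("a")
    row = int(square[1:]) - 1
    return row * board_size + col

def index_to_algebraic(index: int, board_size: int = 3) -> str:
    row, col = divmod(index, board_size)
    return f"{chr(ord('a') + col)}{row + 1}"

assert algebraic_to_index("a1") == 0
assert index_to_algebraic(4) == "b2"
```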
When practice diverges from theory, we show you both and explain the gap:
Example:
- Theory says: TD(0) converges to $V^\pi$ (tabular case, Robbins-Monro step-size conditions)
- Practice needs: Target networks + experience replay (neural networks break guarantees)
- We show you: Baird's counterexample (when it fails), DQN tricks (why it still works)
No hand-waving. No "it just works." If theory and practice diverge, you'll know exactly why.
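To pin down the "theory says" side, here is a minimal tabular TD(0) sketch with Robbins-Monro step sizes; the `env` interface (`reset`, `step`, `n_states`) is an assumed toy API, not the course's actual code:

```python
import numpy as np

def td0(env, policy, gamma=0.99, n_steps=100_000):
    """Tabular TD(0) policy evaluation; converges to V^pi under RM conditions."""
    V = np.zeros(env.n_states)
    visits = np.zeros(env.n_states)
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        visits[s] += 1
        alpha = 1.0 / visits[s]  # sum(alpha) = inf, sum(alpha^2) < inf per state
        V[s] += alpha * (r + gamma * V[s_next] * (not done) - V[s])
        s = env.reset() if done else s_next
    return V
```

Swap the table `V` for a neural network and these guarantees evaporate, which is precisely where target networks and replay buffers enter.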
Each major implementation ships with a Lab Session Appendix (4-7 sessions, 60-90 min each):
- Theory recaps connecting to course material
- Step-by-step code (complete, runnable, explained)
- Time-calibrated tasks (know exactly how long each part takes)
- Ready for self-study (no instructor needed)
Total: ~8,000 lines of tutorial content across game projects
Example: Tic-Tac-Toe Lab Appendix (1600 lines)
- Lab 1: Game environment (90 min)
- Lab 2: Minimax solver (90 min)
- Lab 3: Terminal interface (60 min)
- Lab 4: Web interface (90 min)
Traditional RL course: Read theory → Run library code → Never really understand
This course: Prove theory → Build from scratch → See it work → Play against it
Concrete example:
You won't just implement minimax. You'll:
- Prove it computes V* correctly (Week 24 theory)
- Build it with α-β pruning and memoization (production code)
- Visualize it with `--show-thinking` (watch it evaluate all moves)
- Play against it in the browser (try to beat the optimal AI — spoiler: you can't!)
Result: Deep understanding you can't get from lectures or library code alone.
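For a flavor of what Lab 2 builds, here is a compact alpha-beta sketch under an assumed game API (`is_terminal`, `utility`, `legal_moves`, `apply`); the course version adds memoization:

```python
def minimax(state, game, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    """Exact game value with alpha-beta pruning (utility from maximizer's view)."""
    if game.is_terminal(state):
        return game.utility(state)
    best = float("-inf") if maximizing else float("inf")
    for move in game.legal_moves(state):
        value = minimax(game.apply(state, move), game, alpha, beta, not maximizing)
        if maximizing:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if beta <= alpha:  # the opponent already has a better option: prune
            break
    return best
```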
Why RL needs $\sigma$-algebras
- Measure spaces, $\sigma$-algebras, Carathéodory extension
- Integration: monotone/dominated convergence, Fubini/Tonelli
- Lᵖ spaces: completeness, duality, density
- Conditional expectation (RL: transition kernels, expectations)
Key RL Connection: Probability measures on state-action spaces, observability
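The object this machinery licenses is the transition kernel, which is how "probability measures on state-action spaces" cashes out formally (the standard definition, stated here as a preview):

$$
P \colon (S \times A) \times \mathcal{B}(S) \to [0,1], \qquad
\mathbb{E}\big[f(S_{t+1}) \mid S_t = s,\ A_t = a\big] = \int_S f(s')\, P(ds' \mid s, a),
$$

where $P(\cdot \mid s,a)$ is a probability measure for each $(s,a)$ and $(s,a) \mapsto P(B \mid s,a)$ is measurable for each measurable set $B$.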
From random walks to mixing times
- Finite/countable chains: classification, stationary distributions
- Convergence theorems: coupling, total variation
- MCMC: Metropolis-Hastings, Gibbs sampling
- General state spaces: drift conditions, small sets
Key RL Connection: Exploration in MDPs, policy evaluation, on-policy/off-policy learning
The structure of value function spaces
- Banach/Hilbert spaces: dual spaces, weak convergence
- Compact operators, spectral theory
- Contraction mappings: Banach fixed-point theorem
- Semigroups and generators
Key RL Connection: Bellman operators, value iteration, policy iteration, contractive properties
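A preview of the payoff: the Bellman optimality operator is a $\gamma$-contraction in the sup norm, so the Banach fixed-point theorem delivers existence, uniqueness, and a geometric convergence rate for value iteration all at once:

$$
(T^*V)(s) = \max_{a \in A}\Big[r(s,a) + \gamma \int_S V(s')\, P(ds' \mid s,a)\Big],
\qquad
\|T^*V - T^*W\|_\infty \le \gamma\, \|V - W\|_\infty .
$$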
From HJB equations to viscosity solutions
- Weak derivatives, Sobolev embeddings
- Lax-Milgram theorem, variational formulations
- Hamilton-Jacobi-Bellman equations
- Viscosity solutions for optimal control
Key RL Connection: Continuous control, actor-critic methods, maximum principle
The formal language of sequential decision-making
- MDP formalism: state/action spaces, transition kernels, rewards
- Bellman equations: optimality, uniqueness
- Value iteration, policy iteration: convergence theory
- Average reward MDPs, multichain case
Key RL Connection: Foundation for all RL algorithms
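As a sketch of where this phase lands, value iteration for a finite MDP is just repeated application of the Bellman backup (the array shapes here are illustrative assumptions):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P: (A, S, S) transition probabilities; R: (A, S) expected rewards."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * P @ V                 # one Bellman backup per action
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:   # sup-norm contraction => geometric rate
            return V_new, Q.argmax(axis=0)    # value function and a greedy policy
        V = V_new
```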
Exploration-exploitation from first principles
- Multi-armed bandits: regret framework, lower bounds
- UCB family: analysis and variants
- Thompson Sampling: Bayesian perspective
- Contextual bandits: LinUCB, neural bandits
Key RL Connection: Exploration in RL, credit assignment, reward shaping
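For orientation, UCB1 fits in a dozen lines; the `pull` callback standing in for the unknown reward distributions is an assumption of this sketch:

```python
import numpy as np

def ucb1(pull, n_arms, horizon):
    """UCB1: play each arm once, then pick the arm with the best optimistic bound."""
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for t in range(horizon):
        if t < n_arms:
            arm = t                                      # initialization round
        else:
            bonus = np.sqrt(2 * np.log(t + 1) / counts)  # confidence radius
            arm = int(np.argmax(means + bonus))          # optimism under uncertainty
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means, counts
```

The course's interest is less the code than the analysis: why this confidence radius, and how it meets the regret lower bounds.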
The ODE method and convergence theory
- Robbins-Monro: step sizes, martingale analysis
- ODE method: Borkar's framework
- TD learning: TD(0), TD(λ), convergence proofs
- Q-learning: off-policy learning, deadly triad
- Policy gradients: REINFORCE, actor-critic, natural gradients
Key RL Connection: Every modern RL algorithm is stochastic approximation
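Seen through this lens, tabular Q-learning is a single stochastic-approximation recursion (same hypothetical `env` API as the TD(0) sketch above):

```python
import numpy as np

def q_learning(env, gamma=0.99, eps=0.1, n_steps=100_000, seed=0):
    """Off-policy control: the target maxes over actions regardless of behavior."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))
    s = env.reset()
    for t in range(n_steps):
        a = rng.integers(env.n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, done = env.step(a)
        alpha = (t + 1) ** -0.6  # Robbins-Monro: sum(alpha)=inf, sum(alpha^2)<inf
        target = r + gamma * (0.0 if done else Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```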
Frontiers and synthesis
- Continuous-time MDPs, jump processes
- Mean-field games and multi-agent RL
- Deep RL theory: NTK perspective, approximation error
- Integration: what we've built and why it matters
Key RL Connection: Research frontier, open problems
Theory meets practice
- Monte Carlo Tree Search: UCT, PUCT
- Neural network policy-value approximation
- Self-play training loop
- Implementation, debugging, performance analysis
- Final reflection: 48 weeks in retrospect
Deliverable: Working AlphaZero-lite implementation with complete mathematical provenance
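As a taste of the capstone's core loop, the PUCT rule that steers each MCTS descent might look as follows; the node layout (`N` visit counts, `W` accumulated value, `P` network prior) follows standard AlphaZero bookkeeping, but this class shape is an assumption:

```python
import math

def puct_select(node, c_puct=1.5):
    """Pick the child maximizing Q(s,a) + c * P(s,a) * sqrt(N(s)) / (1 + N(s,a))."""
    parent_visits = sum(child.N for child in node.children.values())

    def score(child):
        q = child.W / child.N if child.N > 0 else 0.0   # mean value estimate
        u = c_puct * child.P * math.sqrt(parent_visits) / (1 + child.N)
        return q + u

    move, child = max(node.children.items(), key=lambda kv: score(kv[1]))
    return move, child
```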
```
Study/
├── README.md                  # This file
├── Syllabus.md                # Complete 48-week plan (canonical source)
├── Week 1/                    # Weekly study materials
│   ├── Day 1.md               # Polished textbook sections
│   ├── Day 2.md
│   ├── Day 3.md
│   ├── Day 1 exercises.md     # Exercise solutions
│   └── ...
├── Week 2/
└── ...
```
"Rigorous does not mean impenetrable." — Haïm Brezis
- State theorems in full generality
- Prove in illuminating special cases when pedagogy demands
- Every abstraction is motivated by a concrete RL challenge
Every theorem answers: Why do we need this for RL?
Examples:
- σ-algebras → observability in MDPs, defining transition kernels
- Banach fixed-point → value iteration convergence
- Dominated convergence → policy gradient interchange (previewed below)
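The last connection has a one-line payoff worth previewing: differentiating under the integral sign, justified by dominated convergence, gives the score-function identity behind REINFORCE:

$$
\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
= \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\big].
$$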
Every Friday: computational experiments verifying theory
- JAX for autodiff, PyTorch for neural nets
- Production-grade considerations (stability, complexity)
- References to frontier implementations (CleanRL, Stable-Baselines3)
- Postponed Generalizations Log: When time constraints force simplifications, we document what's postponed and why
- Time Constraints: Target 90 minutes/day, acceptable up to 2.5 hours—over that, we split content
- Gap Acknowledgment: Where theory and practice diverge, we say so explicitly
- Folland, Real Analysis
- Brezis, Functional Analysis, Sobolev Spaces and PDEs
- Puterman, Markov Decision Processes
- Lattimore & Szepesvári, Bandit Algorithms
- Durrett, Probability: Theory and Examples
- Levin, Peres & Wilmer, Markov Chains and Mixing Times
- Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint
- Bertsekas, Reinforcement Learning and Optimal Control
- Meyn & Tweedie, Markov Chains and Stochastic Stability
- Yong & Zhou, Stochastic Controls
- Bardi & Capuzzo-Dolcetta, Optimal Control and Viscosity Solutions
- Sutton & Barto, Reinforcement Learning (narrative guide)
- Start with Syllabus.md for the complete 48-week plan
- Read Week 1/Day 1.md to see textbook quality
- Fork this repository
- Modify Syllabus.md with your schedule/topics
- Follow the review workflow: create → review → revise → validate
- Open an issue for mathematical errors, typos, or pedagogical suggestions
- Discussions on RL connections, alternative proofs, or frontier references welcome
- This is a learning journey—honest feedback improves the final textbook
- Week 1: σ-algebras, measures, Carathéodory extension
- Week 1: Integration theory, convergence theorems
- Week 2: $L^p$ spaces, completeness, duality
- Week 3: Product measures, Fubini, Tonelli
- Week 4: Signed measures, Radon-Nikodym, conditional expectation
- ...
- Week 48: AlphaZero-lite for Reversi complete
Progress tracking: See Syllabus.md for detailed week-by-week status
Monday–Wednesday (40+40+10 min):
- 40 min: Reading from primary sources
- 40 min: Proof work or exercises
- 10 min: Reflection and RL connection
Thursday (30+60 min):
- 30 min: Reading
- 60 min: Extended proof (key theorem of the week)
Friday (20+30+40 min):
- 20 min: Reading
- 30 min: Proof review
- 40 min: Code synthesis (numerical experiments)
- End with: Weekly reflection (Mathematical Insight, RL Connection, Open Questions)
Weekends: Completely off (sustainability over speed)
By Week 48, this repository will contain:
- A complete textbook bridging pure analysis and RL (potentially publishable)
- ~240 polished sections (5 days × 48 weeks) with proofs and code
- A working AlphaZero implementation built from scratch with full mathematical provenance
- A library of RL algorithms (TD, Q-learning, policy gradients, bandits) with convergence proofs
- A model for self-directed mathematical study combining rigor and realistic time constraints
The Legacy: A resource cited in PhD theses, used by practitioners who want to understand why RL works, not just how to run it.
MIT License - See LICENSE for details
Citation:
```bibtex
@misc{measure-to-alphazero-2025,
  author    = {Vladyslav Prytula},
  title     = {From Measure Theory to AlphaZero: A 48-Week Journey},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/VladPrytula/funcan_rl}
}
```

Inspired by:
- Nicolas Bourbaki's Éléments de mathématique (architectural vision)
- Haïm Brezis's Functional Analysis (pedagogical clarity)
- Jacques-Louis Lions's Optimal Control (applied depth)
- The frontier RL community (DeepMind, OpenAI, Berkeley, FAIR)
- All my friends, colleagues, and senseis during my university, PhD, and postdoc years—from Ukraine, Spain, France, Norway, Portugal, and all over the world—whose passion for rigorous mathematics and honest intellectual pursuit shaped this journey
Built with:
- Obsidian for markdown-based knowledge management
- Claude Code for AI-assisted code generation and review
- Issues: Mathematical errors, typos, suggestions
- Discussions: RL connections, alternative proofs, pedagogy
- Pull Requests: Corrections welcome (especially for mathematical errors)
Remember: This is a learning journey in public. Mistakes are expected—corrections are celebrated.
The content will be published as a GitHub Pages site for easier reading:
- Textbook format with full-text search across all weeks
- Better LaTeX rendering (optimized for web)
- Navigation by phase/week/day
- Downloadable PDF versions
Preview URL (once live): https://github.com/VladPrytula/funcan_rl.git
Want to adapt this workflow for your own study plan? Here's how:
```bash
git clone https://github.com/VladPrytula/funcan_rl.git
cd funcan_rl
```

Edit Syllabus.md with your topics, timeline, and reading assignments.
Status: Week 2 completed | Next milestone: Week 6 (Measure Theory complete)
Last updated: 2025-01-08