The problem
This is my bachelor's thesis. Reinforcement-learning agents are usually shown the whole board; the real world hands them a keyhole. The question isn't the lazy "PPO vs Markov": the Markov property describes the state representation, not an algorithm, so the two can't be compared. It splits into two cleaner questions: how much does performance drop when the observation is partial instead of near-Markovian, and which method best recovers the missing state?
Approach & tradeoffs
A modular, reproducible MiniGrid harness where observability is the independent variable. Three baselines:
- PPO — the robust baseline.
- A2C — a lighter actor-critic, faster per update.
- RecurrentPPO — explicit memory, the natural baseline for a POMDP.
They run across two environments — FourRooms for navigation and exploration,
MemoryS13Random for memory under partial observability — while
FullyObsWrapper and frame-stacking approximate a more Markovian state. That
design isolates the effect of state representation from the choice of
algorithm, which is what makes the comparison defensible.
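
To make the frame-stacking idea concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the thesis code: the class name `FrameStack`, the parameter `k`, and the `reset`/`step` methods are hypothetical, and observations are treated as opaque values. The agent sees the last `k` partial views at once, which restores some of the history a single keyhole view throws away.

```python
from collections import deque
from typing import Deque, Generic, Tuple, TypeVar

Obs = TypeVar("Obs")


class FrameStack(Generic[Obs]):
    """Hypothetical helper: expose the last k observations as one stacked observation."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        # A bounded deque drops the oldest frame automatically once it is full.
        self._frames: Deque[Obs] = deque(maxlen=k)

    def reset(self, first_obs: Obs) -> Tuple[Obs, ...]:
        # At episode start there is no history yet, so pad with copies of the first frame.
        self._frames.clear()
        for _ in range(self.k):
            self._frames.append(first_obs)
        return tuple(self._frames)

    def step(self, obs: Obs) -> Tuple[Obs, ...]:
        # Push the newest partial view; the result is ordered oldest to newest.
        self._frames.append(obs)
        return tuple(self._frames)
```

With `k=1` the stack reduces to the plain partial view, so one code path can cover both the partial condition and the frame-stacked one. The fully observable condition is a separate case handled by FullyObsWrapper.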
Results
The framework runs config-driven experiments with versioned runs and a paired
evaluation/compare pipeline, so each (algorithm × observability) cell is
reproducible. The deliberate split between the observability question and the
algorithm question is the methodological core of the thesis.
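
One way to picture the config-driven grid is the sketch below. Everything in it is an assumption for illustration: the environment IDs follow MiniGrid's usual `MiniGrid-<Name>-v0` naming, while the observability labels, the seed values, the `Cell` class, and the run-ID format are hypothetical placeholders rather than the repository's actual configuration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Sequence, Tuple

ALGORITHMS = ("PPO", "A2C", "RecurrentPPO")
# Assumed environment IDs, following MiniGrid's usual naming scheme.
ENV_IDS = ("MiniGrid-FourRooms-v0", "MiniGrid-MemoryS13Random-v0")
# Hypothetical labels for the observability conditions.
OBSERVABILITY = ("partial", "frame_stack", "full")
# Placeholder seeds; the real sweep may use different values.
SEEDS = (0, 1, 2)


@dataclass(frozen=True)
class Cell:
    """One run in the sweep: an (algorithm x observability) cell for a given env and seed."""

    algorithm: str
    env_id: str
    observability: str
    seed: int

    @property
    def run_id(self) -> str:
        # Deterministic name, so re-running a config maps to the same run.
        return f"{self.algorithm}__{self.env_id}__{self.observability}__seed{self.seed}"

    @property
    def pair_key(self) -> Tuple[str, str, int]:
        # Cells sharing this key differ only in observability, which is
        # what a paired comparison needs.
        return (self.algorithm, self.env_id, self.seed)


def experiment_grid(
    algorithms: Sequence[str] = ALGORITHMS,
    env_ids: Sequence[str] = ENV_IDS,
    observability: Sequence[str] = OBSERVABILITY,
    seeds: Sequence[int] = SEEDS,
) -> Iterator[Cell]:
    """Yield every combination of the four factors as one Cell."""
    for algo, env_id, obs, seed in product(algorithms, env_ids, observability, seeds):
        yield Cell(algo, env_id, obs, seed)
```

Grouping cells by `pair_key` lines up runs that share algorithm, environment, and seed, so any difference within a group can be attributed to the observability condition alone.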
Results are in progress: the experimental sweep (multiple seeds, the memory-vs-navigation contrast) forms the body of the thesis (TFG, the Spanish bachelor's thesis). The repository is private while the thesis is in development.