Partial observability in deep RL — Lucas González Fiz

The problem

This is my bachelor's thesis. Reinforcement-learning agents are usually shown the whole board; the real world hands them a keyhole. The question is not the lazy "PPO vs Markov": Markov is a property of the state representation, not an algorithm. It splits into two cleaner questions: how much does performance drop when the observation is partial rather than near-Markovian, and which method best recovers the missing state?
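One way to make the first question concrete is a per-seed paired difference in evaluation return between the two observation conditions. The sketch below assumes that metric; the function name `paired_drop` is hypothetical, and the numbers are made-up placeholders, not thesis results.

```python
from statistics import mean


def paired_drop(full_returns, partial_returns):
    """Mean per-seed drop in return when moving from full to partial observation.

    Both arguments map seed -> mean evaluation return. Only seeds present in
    both conditions are compared, so every difference is a matched pair.
    """
    shared = sorted(set(full_returns) & set(partial_returns))
    if not shared:
        raise ValueError("no seeds in common between the two conditions")
    diffs = [full_returns[s] - partial_returns[s] for s in shared]
    return mean(diffs), diffs


# Made-up placeholder numbers for illustration only, NOT thesis results.
full = {0: 1.0, 1: 0.75, 2: 0.5}
partial = {0: 0.5, 1: 0.5, 2: 0.5}

avg_drop, per_seed = paired_drop(full, partial)
print(avg_drop, per_seed)
```

Pairing by seed keeps seed-to-seed variance out of the comparison, which matters when only a handful of seeds are run.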

Approach & tradeoffs

A modular, reproducible MiniGrid harness where observability is the independent variable. Three baselines:

PPO — the robust baseline.
A2C — a lighter actor-critic, faster per update.
RecurrentPPO — explicit memory, the natural baseline for a POMDP.
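The three baselines above could be selected by name from a config file. The sketch below assumes they come from Stable-Baselines3 (PPO, A2C) and SB3-Contrib (RecurrentPPO); the write-up does not name the library, so the module paths, the `ALGORITHMS` table, and `load_algorithm` are illustrative assumptions.

```python
import importlib

# Assumed mapping from config name to (module, class). The libraries are a
# guess based on the class name RecurrentPPO; the source does not state them.
ALGORITHMS = {
    "ppo": ("stable_baselines3", "PPO"),
    "a2c": ("stable_baselines3", "A2C"),
    "recurrent_ppo": ("sb3_contrib", "RecurrentPPO"),
}


def load_algorithm(name):
    """Import and return the class registered under `name`.

    The import happens only when the algorithm is requested, so the table
    itself can be inspected without the RL libraries installed.
    """
    if name not in ALGORITHMS:
        raise ValueError(
            f"unknown algorithm {name!r}; choose from {sorted(ALGORITHMS)}"
        )
    module_name, class_name = ALGORITHMS[name]
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

Calling `load_algorithm("recurrent_ppo")` would require sb3-contrib to be installed. Keeping the algorithm a string in the config is what lets the same run script cover every cell of the comparison.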

They run across two environments: FourRooms for navigation and exploration, and MemoryS13Random for memory under partial observability. FullyObsWrapper, which exposes the whole grid, and frame-stacking, which feeds the agent a short window of recent observations, each approximate a more Markovian state. Varying observability independently of the algorithm isolates the effect of state representation from the choice of algorithm, which is what makes the comparison defensible.
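The three observation conditions can be illustrated on a toy grid without any RL library. This is a simplified sketch, not the MiniGrid implementation: MiniGrid's partial view is egocentric and direction-dependent, whereas `partial_view` here is a plain square window centred on the agent, and `FrameStack` is a hypothetical stand-in for a frame-stacking wrapper.

```python
from collections import deque


def full_view(grid):
    """Return a copy of the whole grid (the near-Markovian condition)."""
    return [row[:] for row in grid]


def partial_view(grid, agent_row, agent_col, radius=1, fill="?"):
    """Return the square window of side 2*radius+1 centred on the agent.

    Cells that fall outside the grid are padded with `fill`.
    """
    window = []
    for r in range(agent_row - radius, agent_row + radius + 1):
        row = []
        for c in range(agent_col - radius, agent_col + radius + 1):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            row.append(grid[r][c] if inside else fill)
        window.append(row)
    return window


class FrameStack:
    """Keep the last k observations; on reset, repeat the first one k times."""

    def __init__(self, k):
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(obs)
        return list(self.frames)

    def step(self, obs):
        self.frames.append(obs)  # the oldest frame is dropped automatically
        return list(self.frames)


grid = [
    list("#####"),
    list("#A..#"),
    list("#..G#"),
    list("#####"),
]

window = partial_view(grid, agent_row=1, agent_col=1)  # goal G is out of view
stack = FrameStack(k=2)
first = stack.reset(window)
# Window centred one cell to the right; the toy grid itself is not updated.
second = stack.step(partial_view(grid, agent_row=1, agent_col=2))
```

The stacked observation carries only the last `k` frames, so it helps with short-range ambiguity but cannot recover a cue seen more than `k` steps ago, which is the case a recurrent policy is meant to handle.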

Results

The framework runs config-driven experiments with versioned runs and a paired evaluation/compare pipeline, so each (algorithm × observability) cell is reproducible. The deliberate split between the observability question and the algorithm question is the methodological core of the thesis.
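A config-driven sweep of this shape can be sketched as a Cartesian product over the experimental factors, with a deterministic identifier per run. Everything named here is an assumption for illustration: the `RunConfig` fields, the run-id format, the observability labels, the seed list, and the exact environment ID strings are not taken from the private repository.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    """One cell of the sweep plus its seed (hypothetical schema)."""

    env_id: str
    algorithm: str
    observability: str
    seed: int

    @property
    def run_id(self):
        # Deterministic name, so reruns of the same cell land in the same place.
        return (
            f"{self.env_id}__{self.algorithm}"
            f"__{self.observability}__seed{self.seed}"
        )


# Assumed factor levels; the seed count in particular is a placeholder.
ENVS = ["MiniGrid-FourRooms-v0", "MiniGrid-MemoryS13Random-v0"]
ALGOS = ["ppo", "a2c", "recurrent_ppo"]
OBSERVABILITY = ["partial", "full", "framestack"]
SEEDS = [0, 1, 2]


def expand_grid(envs, algos, observability, seeds):
    """Return one RunConfig per (env, algorithm, observability, seed)."""
    return [
        RunConfig(env, algo, obs, seed)
        for env, algo, obs, seed in product(envs, algos, observability, seeds)
    ]


runs = expand_grid(ENVS, ALGOS, OBSERVABILITY, SEEDS)
print(len(runs), runs[0].run_id)
```

Because observability and algorithm are separate fields, holding one fixed and varying the other is a filter over `runs` rather than a separate script.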

Results in progress: the experimental sweep (multiple seeds, the memory-vs-navigation contrast) is the body of the TFG (Trabajo de Fin de Grado, the bachelor's thesis). The repository is private while the thesis is in development.