A tabular Q-Learning agent that masters pickup & drop-off routing in the Gymnasium Taxi-v3 environment through iterative Bellman updates.
500 discrete states encoding the taxi position (5×5 grid), passenger location (4 spots + in-taxi), and destination (4 spots).
s = row×100 + col×20 + pass×4 + dest
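The encoding formula above can be sketched as a pair of helpers. The function names `encode` and `decode` are illustrative and not taken from the project. The component ranges are assumed to be row and col in 0..4, pass in 0..4 (4 meaning in-taxi), and dest in 0..3.

```python
def encode(row, col, passenger, dest):
    """Pack the four state components into one index in [0, 500)."""
    # Same formula as above: ((row*5 + col)*5 + passenger)*4 + dest.
    return row * 100 + col * 20 + passenger * 4 + dest


def decode(state):
    """Invert encode() by peeling off components, least significant first."""
    rest, dest = divmod(state, 4)        # destination: 4 values
    rest, passenger = divmod(rest, 5)    # passenger: 4 spots + in-taxi
    row, col = divmod(rest, 5)           # 5x5 grid position
    return row, col, passenger, dest


state = encode(3, 1, 2, 0)
print(state, decode(state))
```

Because each component is multiplied by the product of the ranges to its right, every valid tuple maps to a unique index and `decode(encode(...))` returns the original tuple.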
The Bellman update iteratively refines state-action values, balancing the immediate reward against the discounted value of the best next action.
Q(s,a) ← Q(s,a) + α[r + γ·max_a′ Q(s′,a′) − Q(s,a)]
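A minimal sketch of one such update on a table stored as a list of per-state rows. The function name `q_update`, the default values of `alpha` and `gamma`, and the `terminated` flag are assumptions for illustration. The flag zeroes the bootstrap term at episode end, a common convention that the formula above does not spell out.

```python
def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.99, terminated=False):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Terminal transitions have no future value to bootstrap from (assumed convention).
    best_next = 0.0 if terminated else max(q_table[next_state])
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return q_table[state][action]


# Tiny worked example: 2 states x 2 actions.
q = [[0.0, 0.0], [1.0, 3.0]]
new_value = q_update(q, state=0, action=1, reward=-1, next_state=1,
                     alpha=0.5, gamma=0.9)
print(new_value)  # target is -1 + 0.9*3 = 1.7, so the entry moves halfway to it
```

With α = 0.5 the entry moves half of the way from its old value toward the target r + γ·max Q(s′,a′); smaller α gives slower, smoother updates.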
Balances exploration (random actions) with exploitation (greedy Q-table lookup). Epsilon decays exponentially.
ε ← max(ε_min, ε × ε_decay)
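The exploration rule and decay schedule might look like the sketch below. The helper names and the default values for `epsilon_min` and `epsilon_decay` are placeholders, not the project's actual settings, and how often the decay is applied (per step or per episode) is left to the caller.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy: a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                   # explore
    return max(range(len(q_row)), key=q_row.__getitem__)   # exploit (first max on ties)


def decay_epsilon(epsilon, epsilon_min=0.01, epsilon_decay=0.995):
    """Multiplicative decay, clamped so exploration never drops below epsilon_min."""
    return max(epsilon_min, epsilon * epsilon_decay)


epsilon = 1.0
for _ in range(3):
    epsilon = decay_epsilon(epsilon)
print(epsilon)
```

Repeated application gives ε = ε₀ · ε_decayⁿ until the floor is reached, which is why the schedule is exponential in the number of decay steps.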
Reward signal: +20 for a successful drop-off, −10 for an illegal pickup or drop-off, and −1 for every other step (time pressure).
r ∈ {+20, −10, −1}
| # | Total Reward | Steps | Success |
|---|---|---|---|
Decode any of the 500 Taxi-v3 states and inspect the Q-values for each action.
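One way such an inspector could work is sketched here. The action order (south, north, east, west, pickup, dropoff) and the location names follow the Gymnasium Taxi-v3 documentation. The function name `describe_state` and the list-of-rows Q-table layout are assumptions.

```python
ACTIONS = ("south", "north", "east", "west", "pickup", "dropoff")
LOCATIONS = ("Red", "Green", "Yellow", "Blue", "in taxi")


def describe_state(state, q_table):
    """Decode a state index and pair each action name with its Q-value."""
    rest, dest = divmod(state, 4)
    rest, passenger = divmod(rest, 5)
    row, col = divmod(rest, 5)
    q_values = dict(zip(ACTIONS, q_table[state]))
    return {
        "taxi": (row, col),
        "passenger": LOCATIONS[passenger],
        "destination": LOCATIONS[dest],
        "q_values": q_values,
        "greedy_action": max(q_values, key=q_values.get),
    }


# Hypothetical table: all zeros except one entry, purely for demonstration.
q_table = [[0.0] * 6 for _ in range(500)]
q_table[328][3] = 2.5
print(describe_state(328, q_table))
```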
Step through a test episode and watch the agent navigate the grid.
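A greedy test rollout could be sketched as follows, assuming an environment that follows the Gymnasium API (`reset()` returning `(observation, info)` and `step()` returning a five-tuple). The function name `run_greedy_episode` and the 200-step cap are illustrative. Treating `terminated` as success relies on Taxi-v3 ending an episode only on a successful drop-off.

```python
def run_greedy_episode(env, q_table, max_steps=200):
    """Follow the greedy policy for one episode and record every transition."""
    state, _info = env.reset()
    total_reward, trajectory, terminated = 0, [], False
    for _ in range(max_steps):
        q_row = q_table[state]
        action = max(range(len(q_row)), key=q_row.__getitem__)  # no exploration at test time
        next_state, reward, terminated, truncated, _info = env.step(action)
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
        if terminated or truncated:
            break
    # total reward, step count, success flag, and the full (state, action, reward) trace
    return total_reward, len(trajectory), terminated, trajectory
```

With Gymnasium installed, `env = gymnasium.make("Taxi-v3")` would supply a compatible environment. The first three return values line up with the Total Reward, Steps, and Success columns, and the trajectory can be replayed one entry at a time to step through the episode.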