Reinforcement Learning
MDP Modeling · Dynamic Programming · Monte Carlo · Temporal Difference · Policy Evaluation · Policy Optimization

Modeled reinforcement learning environments as MDPs and implemented Dynamic Programming, Monte Carlo, and Temporal Difference algorithms to simulate sequential decision-making, evaluate policies, and optimize decision strategies.

Worked through grid-world and continuous-control tasks such as Cat-vs-Monsters and Mountain Car, formulating state, action, transition, and reward dynamics and using Bellman equations to reason about optimal value functions and policies.
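
To make that formulation step concrete, here is a minimal sketch of encoding a small gridworld as explicit transition and reward tensors. The grid size matches the 5×5 layout, but the slip probability, step cost, and goal reward are illustrative placeholders rather than the actual Cat-vs-Monsters specification.

```python
import numpy as np

# Toy 5x5 gridworld encoded as an explicit tabular MDP (illustrative numbers,
# not the actual Cat-vs-Monsters dynamics).
N, GAMMA, SLIP = 5, 0.9, 0.1
STATES = [(r, c) for r in range(N) for c in range(N)]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right
GOAL = (4, 4)

def move(s, a):
    """Deterministic move, clipped to the grid."""
    r, c = s[0] + a[0], s[1] + a[1]
    return (min(max(r, 0), N - 1), min(max(c, 0), N - 1))

nS, nA = len(STATES), len(ACTIONS)
P = np.zeros((nS, nA, nS))                            # P[s, a, s'] transition probabilities
R = np.full((nS, nA), -0.1)                           # small step cost everywhere
for si, s in enumerate(STATES):
    for ai, a in enumerate(ACTIONS):
        intended = STATES.index(move(s, a))
        P[si, ai, intended] += 1.0 - SLIP             # intended move succeeds
        P[si, ai, si] += SLIP                         # or the agent slips and stays put
        if STATES[intended] == GOAL:
            R[si, ai] = 1.0                           # reward for reaching the goal
```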

Implemented value iteration and policy iteration for stochastic MDPs, Monte Carlo policy evaluation and ε-soft control, and TD-style updates to compare sample-based and dynamic programming methods in terms of convergence speed, stability, and sample efficiency.
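
The TD side of that comparison reduces to a per-step bootstrapped update. Below is a minimal tabular TD(0) policy-evaluation sketch; it assumes a Gymnasium-style environment with reset()/step() and a callable policy, both placeholders rather than the project's actual setup.

```python
from collections import defaultdict

def td0_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): after each step, nudge V(s) toward the bootstrapped target r + gamma * V(s')."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            target = reward + gamma * V[next_state] * (not terminated)
            V[state] += alpha * (target - V[state])   # move V(s) toward the TD target
            state = next_state
    return V
```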

Dynamic Programming · Monte Carlo · TD Control
Example RL projects & algorithms:
  • Cat-vs-Monsters 5×5 grid MDP: derived and implemented value iteration to compute v★ and an optimal deterministic policy under stochastic transitions and shaped rewards (first sketch below).
  • First-visit & every-visit Monte Carlo evaluation: estimated vπ for the optimal policy and analyzed sample complexity using max-norm and MSE against the true value function (second sketch below).
  • Monte Carlo control with ε-soft policies: learned near-optimal q★ and π★ while balancing exploration and exploitation through fixed and decaying ε schedules (third sketch below).
  • Mountain Car continuous control: applied evolution strategies (black-box policy search) with a neural network policy to optimize returns over thousands of episodes (final sketch below).
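
First, a minimal value-iteration sketch over a tabular MDP in the P/R form shown earlier; the stopping tolerance and default discount are assumed values, not the assignment's exact settings.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Sweep Bellman optimality backups until the value function stops changing,
    then read off a greedy deterministic policy."""
    nS, nA, _ = P.shape
    v = np.zeros(nS)
    while True:
        q = R + gamma * P @ v                 # q[s, a] = R(s, a) + gamma * E[v(s')]
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:   # max-norm stopping criterion
            return v_new, q.argmax(axis=1)    # v*, greedy policy pi*(s) = argmax_a q(s, a)
        v = v_new

# v_star, pi_star = value_iteration(P, R, GAMMA)
```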
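Second, a first-visit Monte Carlo evaluation sketch. It assumes integer-indexed states and the same Gymnasium-style interface as above; the commented error metrics show how the estimate can be compared against an exact v★ (e.g. from value iteration) under max-norm and MSE.

```python
from collections import defaultdict

def first_visit_mc(env, policy, num_episodes=5000, gamma=0.9):
    """First-visit Monte Carlo: average the return observed on the first visit to each state."""
    returns_sum, returns_count = defaultdict(float), defaultdict(int)
    for _ in range(num_episodes):
        # Roll out one full episode under the fixed policy.
        episode, done = [], False
        state, _ = env.reset()
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, reward))
            done = terminated or truncated
            state = next_state
        # Record the first time step at which each state appears, then accumulate returns backwards.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:            # only the first visit contributes
                returns_sum[s] += G
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Comparison against an exact v* (e.g. from value iteration), as used in the error analysis:
# errors = [abs(v_star[s] - v_mc.get(s, 0.0)) for s in range(len(v_star))]
# max_norm, mse = max(errors), sum(e * e for e in errors) / len(errors)
```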
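Third, an ε-soft Monte Carlo control sketch with a simple decaying ε schedule (an every-visit, incremental-mean variant for brevity); the decay constants and episode count are assumptions.

```python
import random
from collections import defaultdict

def mc_control_epsilon_soft(env, n_actions, num_episodes=20000, gamma=0.9,
                            eps_start=1.0, eps_min=0.05, eps_decay=0.9995):
    """On-policy Monte Carlo control with an epsilon-soft behaviour policy."""
    Q = defaultdict(lambda: [0.0] * n_actions)
    counts = defaultdict(lambda: [0] * n_actions)
    eps = eps_start
    for _ in range(num_episodes):
        episode, done = [], False
        state, _ = env.reset()
        while not done:
            # Epsilon-soft action selection: mostly greedy, occasionally uniform random.
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            done = terminated or truncated
            state = next_state
        # Incremental-mean update of Q along the episode.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            counts[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / counts[s][a]
        eps = max(eps_min, eps * eps_decay)    # decaying epsilon schedule
    greedy = {s: max(range(n_actions), key=lambda a: Q[s][a]) for s in Q}
    return Q, greedy
```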
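Finally, a stripped-down evolution-strategies loop around a tiny tanh MLP policy on Gymnasium's MountainCarContinuous-v0. Population size, noise scale, learning rate, and network width are assumptions; the actual project presumably uses its own hyperparameters and many more episodes.

```python
import numpy as np
import gymnasium as gym

def make_policy(obs_dim, act_dim, hidden=16):
    """Tiny tanh MLP policy parameterized by a single flat vector theta."""
    n_params = obs_dim * hidden + hidden + hidden * act_dim + act_dim
    def act(theta, obs):
        i = 0
        W1 = theta[i:i + obs_dim * hidden].reshape(obs_dim, hidden); i += obs_dim * hidden
        b1 = theta[i:i + hidden]; i += hidden
        W2 = theta[i:i + hidden * act_dim].reshape(hidden, act_dim); i += hidden * act_dim
        b2 = theta[i:i + act_dim]
        return np.tanh(np.tanh(obs @ W1 + b1) @ W2 + b2)    # action in [-1, 1]
    return n_params, act

def episode_return(env, theta, act):
    """Total (undiscounted) return of one episode under the given parameters."""
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(act(theta, obs))
        total += reward
        done = terminated or truncated
    return total

def evolution_strategies(generations=200, pop=50, sigma=0.1, lr=0.02, seed=0):
    """OpenAI-style ES: perturb the parameters with Gaussian noise, score each
    perturbation by its episodic return, and step along the reward-weighted noise."""
    env = gym.make("MountainCarContinuous-v0")
    n_params, act = make_policy(env.observation_space.shape[0], env.action_space.shape[0])
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_params)
    for g in range(generations):
        noise = rng.standard_normal((pop, n_params))
        rewards = np.array([episode_return(env, theta + sigma * eps, act) for eps in noise])
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalize fitness
        theta += lr / (pop * sigma) * noise.T @ advantages                 # gradient estimate step
        print(f"gen {g:3d}  mean return {rewards.mean():7.2f}")
    return theta
```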