---
title: Actor-Critic Methods
sidebar_label: Actor-Critic
description: "Combining value-based and policy-based methods for stable and efficient reinforcement learning."
tags: [machine-learning, reinforcement-learning, actor-critic, a2c, a3c]
---

**Actor-Critic** methods are a hybrid architecture in Reinforcement Learning that combine the best of both worlds: **Policy Gradients** and **Value-Based** learning.

In this setup, we use two neural networks:
1. **The Actor:** Learns the strategy (Policy). It decides which action to take.
2. **The Critic:** Learns to evaluate the action. It tells the Actor how "good" the action was by estimating the Value function.
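
As a concrete sketch, both networks can be small multilayer perceptrons built with PyTorch. The class names, hidden size, and the assumption of a discrete action space below are illustrative choices, not part of any fixed recipe:

```python
import torch.nn as nn

class ActorNet(nn.Module):
    """Maps a state to a probability distribution over discrete actions (the policy)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)  # action probabilities

class CriticNet(nn.Module):
    """Maps a state to a single scalar estimate of its value V(s)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)  # state-value estimate V(s)
```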

## 1. Why use Actor-Critic?

* **Policy Gradients (Actor only):** Have high variance and can be slow to converge because they rely on full episode returns.
* **Q-Learning (Critic only):** Can be biased and struggles with continuous action spaces.
* **Actor-Critic:** Uses the Critic to reduce the variance of the Actor, leading to faster and more stable learning.

## 2. How it Works: The Advantage

The Critic doesn't just predict the immediate reward; its value estimates are used to compute the **Advantage** ($A$). The Advantage tells us whether an action was better than the average action expected from that state.

$$
A(s, a) = Q(s, a) - V(s)
$$

Where:

* **$Q(s, a)$:** The expected return of taking a specific action $a$ in state $s$.
* **$V(s)$:** The expected return of the state itself (the baseline).

If $A > 0$, the Actor is encouraged to take that action more often. If $A < 0$, the Actor is discouraged.
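
For example, with illustrative numbers: suppose the Critic estimates $V(s) = 5$ and the chosen action has an estimated $Q(s, a) = 7$. Then

$$
A(s, a) = 7 - 5 = 2 > 0
$$

so the Actor increases the probability of selecting that action in state $s$.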

## 3. The Learning Loop

```mermaid
graph TD
S[State] --> Actor(Actor: Policy)
S --> Critic(Critic: Value)
Actor --> A[Action]
A --> E[Environment]
E --> R[Reward]
E --> NS[Next State]
R --> TD[TD Error / Advantage]
NS --> TD
TD -->|Feedback| Actor
TD -->|Feedback| Critic

style Actor fill:#e1f5fe,stroke:#01579b,color:#333
style Critic fill:#fff3e0,stroke:#ef6c00,color:#333
style TD fill:#fce4ec,stroke:#d81b60,color:#333

```

## 4. Popular Variations

### A2C (Advantage Actor-Critic)

A synchronous version where multiple agents run in parallel environments. The "Master" agent waits for all workers to finish their steps before updating the global network.

### A3C (Asynchronous Advantage Actor-Critic)

Introduced by DeepMind, this version is asynchronous. Each worker updates the global network independently without waiting for the others, which makes efficient use of many CPU cores and speeds up wall-clock training.

### PPO (Proximal Policy Optimization)

A modern, widely used Actor-Critic algorithm introduced by OpenAI. It clips the policy update so that no single step changes the policy "too much," preventing the model from collapsing during training.
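
A minimal sketch of PPO's clipped policy loss, assuming the tensors `advantage`, `new_log_prob`, and `old_log_prob` have already been computed for a batch of actions (the variable names and the clip range `eps = 0.2` are illustrative):

```python
import torch

eps = 0.2  # clip range; 0.2 is a common default, used here purely for illustration

# Probability ratio between the new and old policy for the sampled actions
ratio = torch.exp(new_log_prob - old_log_prob)

# Clipped surrogate objective: take the pessimistic (minimum) of the two terms
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
policy_loss = -torch.min(unclipped, clipped).mean()
```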

## 5. Implementation Logic (Pseudo-code)

```python
import torch
from torch.distributions import Categorical

# Assumes `actor`, `critic`, `env`, `optimizer`, and `gamma` are already defined.

# 1. Get an action from the Actor
probs = actor(state)                     # action probabilities for the current state
dist = Categorical(probs)
action = dist.sample()

# 2. Interact with the Environment (classic Gym API)
next_state, reward, done, info = env.step(action.item())

# 3. Get value estimates from the Critic
value = critic(state)
next_value = critic(next_state)

# 4. Calculate the Advantage (one-step TD error)
# Advantage = (r + gamma * V(s')) - V(s); the bootstrap term is zeroed at episode end
advantage = reward + gamma * next_value * (1 - done) - value

# 5. Compute the losses and backpropagate
actor_loss = -dist.log_prob(action) * advantage.detach()  # policy gradient term
critic_loss = advantage.pow(2)                            # squared TD error

optimizer.zero_grad()
(actor_loss + critic_loss).backward()
optimizer.step()
```

## 6. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Lower Variance:** Much more stable than pure Policy Gradients. | **Complexity:** Harder to tune because you are training two networks at once. |
| **Online Learning:** Can update after every step (doesn't need to wait for the end of an episode). | **Sample Inefficient:** Can still require millions of interactions for complex games. |
| **Continuous Actions:** Handles continuous movement smoothly. | **Sensitive to Hyperparameters:** Learning rates for Actor and Critic must be balanced. |

## References

* **DeepMind's A3C Paper:** "Asynchronous Methods for Deep Reinforcement Learning."
* **OpenAI Spinning Up:** Documentation on PPO and Actor-Critic variants.
* **Reinforcement Learning with David Silver:** Lecture 7 (Policy Gradient and Actor-Critic).
* **Sutton & Barto's "Reinforcement Learning: An Introduction":** Chapter on Actor-Critic Methods.
---
title: "Deep Q-Networks (DQN)"
sidebar_label: Deep Q-Networks
description: "Scaling Reinforcement Learning with Deep Learning using Experience Replay and Target Networks."
tags: [machine-learning, reinforcement-learning, dqn, deep-learning, neural-networks]
---

**Deep Q-Networks (DQN)** represent the fusion of Reinforcement Learning and Deep Neural Networks. While standard [Q-Learning](/tutorial/machine-learning/machine-learning-core/reinforcement-learning/q-learning) uses a table to store values, DQN uses a **Neural Network** to approximate the Q-value function.

This advancement allowed RL agents to handle environments with high-dimensional state spaces, such as raw pixels from a video game screen.

## 1. Why Deep Learning for Q-Learning?

In a complex environment, the number of possible states is astronomical.
* **Atari 2600:** A $210 \times 160$ pixel screen with 128 colors has more possible states than there are atoms in the universe.
* **The Solution:** Instead of a table, we use a Neural Network ($Q_\theta$) that takes a **State** as input and outputs the predicted **Q-values** for all possible actions.
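
In code, that network is a single forward pass returning one Q-value per action; picking an action (here with $\varepsilon$-greedy exploration, as in the workflow diagram below) is then a simple `argmax`. The names `q_net`, `state`, and `epsilon` are assumptions for illustration:

```python
import random
import torch

def select_action(q_net, state, epsilon, n_actions):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)        # explore
    with torch.no_grad():
        q_values = q_net(state)                   # one Q-value per possible action
        return int(q_values.argmax().item())      # exploit
```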

## 2. The Two "Secret Ingredients" of DQN

Standard neural networks struggle with RL because the data is highly correlated (sequential frames in a game are nearly identical). To fix this, DQN introduced two revolutionary concepts:

### A. Experience Replay
Instead of learning from the current experience immediately, the agent saves its experiences $(s, a, r, s')$ in a **Replay Buffer**. During training, we sample a **random batch** of these experiences.
* **Benefit:** It breaks the correlation between consecutive samples and allows the model to "re-learn" from past successes and failures multiple times.
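
A minimal replay buffer can be a bounded deque with uniform random sampling; the sketch below is illustrative rather than the exact structure used in the original paper:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)  # random batch breaks temporal correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```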

### B. Target Networks
In standard Q-Learning, the "target" we are chasing changes every time we update the weights. This is like a dog chasing its own tail.
* **The Fix:** We maintain two networks:
1. **Policy Network:** The one we are constantly training.
2. **Target Network:** A frozen copy of the Policy Network used to calculate the "target" value. We only update this copy every few thousand steps.
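
In PyTorch, refreshing the frozen copy is a single call executed every few thousand steps; the `step` counter and `target_update_interval` names below are illustrative:

```python
# Hard update: copy the Policy Network's weights into the Target Network every C steps
if step % target_update_interval == 0:
    target_net.load_state_dict(policy_net.state_dict())
```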

## 3. The DQN Mathematical Objective

The loss function for DQN is the squared difference between the **Target Q-value** and the **Predicted Q-value**:

$$
L(\theta) = E \left[ \left( \underbrace{r + \gamma \max_{a'} Q_{\theta^{-}}(s', a')}_{\text{Target (Target Network)}} - \underbrace{Q_{\theta}(s, a)}_{\text{Prediction (Policy Network)}} \right)^2 \right]
$$

Where:

* **$\theta$**: Weights of the Policy Network.
* **$\theta^{-}$**: Weights of the Target Network (frozen).
* **$r$**: Reward received after taking action $a$ in state $s$.
* **$\gamma$**: Discount factor for future rewards.

## 4. The DQN Workflow

```mermaid
graph LR
ENV["$$\text{Environment}$$"]

ENV --> S["$$s_t$$<br/>$$\text{Current State}$$"]

S --> NET["$$Q(s,a;\theta)$$<br/>$$\text{Online Q-Network}$$"]

NET --> ACT["$$\varepsilon\text{-greedy Policy}$$<br/>$$a_t=\begin{cases} \text{random action} & \varepsilon \\ \arg\max_a Q(s_t,a;\theta) & 1-\varepsilon \end{cases}$$"]

ACT --> ENV

ENV --> R["$$r_t,\ s_{t+1}$$"]

R --> MEM["$$\text{Replay Buffer } \mathcal{D}$$"]

MEM --> SAMPLE["$$\text{Sample Mini-batch}$$"]

SAMPLE --> TARGET["$$y_t = r_t + \gamma \max_a Q(s_{t+1},a;\theta^-)$$"]

TARGET --> LOSS["$$\mathcal{L}(\theta) = \mathbb{E}\left[(y_t - Q(s_t,a_t;\theta))^2\right]$$"]

LOSS --> GRAD["$$\nabla_\theta \mathcal{L}$$"]

GRAD --> UPDATE["$$\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$$"]

UPDATE --> NET

NET -.->|"$$\text{Periodically Copy}$$"| TNET["$$\theta^-$$<br/>$$\text{Target Network}$$"]


```

## 5. Implementation Logic (PyTorch-style)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The DQN Model: maps a state to one Q-value per action
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(DQN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.net(x)

# Training Step
# Assumes `buffer`, `policy_net`, `target_net`, `optimizer`, `gamma`, and `batch_size` exist.
def train_step():
    # 1. Sample a random batch from the replay buffer
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)

    # 2. Get current Q-values from the Policy Network for the actions taken
    current_q = policy_net(states).gather(1, actions)  # `actions` has shape [batch, 1]

    # 3. Get maximum Q-values for the next states from the (frozen) Target Network
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]
        target_q = rewards + (gamma * next_q * (1 - dones))  # no bootstrap on terminal states

    # 4. Minimize the Loss
    loss = F.mse_loss(current_q, target_q.unsqueeze(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

## 6. Beyond DQN

While DQN was a massive breakthrough, it has been improved by:

* **Double DQN:** Reduces the tendency to overestimate Q-values (see the sketch after this list).
* **Dueling DQN:** Separates the calculation of state value and action advantage.
* **Prioritized Experience Replay:** Samples "important" experiences (those with high error) more frequently.
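
As a concrete illustration of the first variant, Double DQN only changes how the next-state action is chosen in the target: the Policy Network selects it, while the Target Network evaluates it. The sketch below reuses the `policy_net` / `target_net` names from Section 5:

```python
import torch

# Double DQN target: select the action with the Policy Network,
# but evaluate it with the (frozen) Target Network to curb overestimation
with torch.no_grad():
    best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)   # action selection
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # action evaluation
    target_q = rewards + gamma * next_q * (1 - dones)
```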

```mermaid
graph LR
ENV["$$\text{Atari Environment}$$"]

ENV --> S["$$s_t$$<br/>$$\text{Game State}$$"]

%% Standard DQN
S --> DQN["Standard DQN"]

DQN --> Q1["$$Q(s,a;\theta)$$"]
Q1 --> T1["$$y = r + \gamma \max_a Q(s',a;\theta^-)$$"]
T1 --> O1["$$\text{Overestimation Bias}$$"]
O1 --> P1["$$\text{Unstable Learning}$$"]

%% Double DQN
S --> DDQN["Double DQN"]

DDQN --> Q2["$$Q(s,a;\theta)$$"]
Q2 --> T2["$$y = r + \gamma Q(s', \arg\max_a Q(s',a;\theta);\theta^-)$$"]
T2 --> O2["$$\text{Reduced Overestimation}$$"]
O2 --> P2["$$\text{More Stable Q-Values}$$"]

%% Dueling DQN
S --> DUEL["Dueling DQN"]

DUEL --> V["$$V(s;\theta_v)$$<br/>$$\text{State Value}$$"]
DUEL --> A["$$A(s,a;\theta_a)$$<br/>$$\text{Action Advantage}$$"]

V --> Q3["$$Q(s,a)=V(s)+A(s,a)-\frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')$$"]
A --> Q3

Q3 --> P3["$$\text{Better State Representation}$$"]
P3 --> G3["$$\text{Faster Learning on Atari}$$"]

%% Experience Replay Enhancement
ENV --> MEM["$$\text{Replay Buffer}$$"]

MEM --> PER["$$\text{Prioritized Experience Replay}$$"]
PER --> ERR["$$p_i \propto |\delta_i|$$<br/>$$\text{TD Error-Based Sampling}$$"]
ERR --> UPD["$$\text{Faster Convergence}$$"]

%% Comparison Links
P1 -.->|"$$\text{Beyond DQN}$$"| O2
O2 -.->|"$$\text{Combined}$$"| G3
UPD -.->|"$$\text{Boosts All}$$"| G3

```

## References

* **Mnih et al. (2015):** "Human-level control through deep reinforcement learning" (The original Nature paper).
* **DeepLizard RL Series:** Excellent visual tutorials on DQN mechanics.

---

**DQN is great for discrete actions (like buttons on a controller). But how do we handle continuous actions, like the pressure applied to a gas pedal?**