diff --git a/docs/machine-learning/machine-learning-core/reinforcement-learning/actor-critic.mdx b/docs/machine-learning/machine-learning-core/reinforcement-learning/actor-critic.mdx index e69de29..e6ebc5d 100644 --- a/docs/machine-learning/machine-learning-core/reinforcement-learning/actor-critic.mdx +++ b/docs/machine-learning/machine-learning-core/reinforcement-learning/actor-critic.mdx @@ -0,0 +1,109 @@ +--- +title: Actor-Critic Methods +sidebar_label: Actor-Critic +description: "Combining value-based and policy-based methods for stable and efficient reinforcement learning." +tags: [machine-learning, reinforcement-learning, actor-critic, a2c, a3c] +--- + +**Actor-Critic** methods are a hybrid architecture in Reinforcement Learning that combine the best of both worlds: **Policy Gradients** and **Value-Based** learning. + +In this setup, we use two neural networks: +1. **The Actor:** Learns the strategy (Policy). It decides which action to take. +2. **The Critic:** Learns to evaluate the action. It tells the Actor how "good" the action was by estimating the Value function. + +## 1. Why use Actor-Critic? + +* **Policy Gradients (Actor only):** Have high variance and can be slow to converge because they rely on full episode returns. +* **Q-Learning (Critic only):** Can be biased and struggles with continuous action spaces. +* **Actor-Critic:** Uses the Critic to reduce the variance of the Actor, leading to faster and more stable learning. + +## 2. How it Works: The Advantage + +The Critic doesn't just predict the reward; it predicts the **Advantage** ($A$). The Advantage tells us if an action was better than the average action expected from that state. + +$$ +A(s, a) = Q(s, a) - V(s) +$$ + +Where: + +* **$Q(s, a)$:** The value of taking a specific action. +* **$V(s)$:** The average value of the state (The Baseline). + +If $A > 0$, the Actor is encouraged to take that action more often. If $A < 0$, the Actor is discouraged. + +## 3. The Learning Loop + +```mermaid +graph TD + S[State] --> Actor(Actor: Policy) + S --> Critic(Critic: Value) + Actor --> A[Action] + A --> E[Environment] + E --> R[Reward] + E --> NS[Next State] + R --> TD[TD Error / Advantage] + NS --> TD + TD -->|Feedback| Actor + TD -->|Feedback| Critic + + style Actor fill:#e1f5fe,stroke:#01579b,color:#333 + style Critic fill:#fff3e0,stroke:#ef6c00,color:#333 + style TD fill:#fce4ec,stroke:#d81b60,color:#333 + +``` + +## 4. Popular Variations + +### A2C (Advantage Actor-Critic) + +A synchronous version where multiple agents run in parallel environments. The "Master" agent waits for all workers to finish their steps before updating the global network. + +### A3C (Asynchronous Advantage Actor-Critic) + +Introduced by DeepMind, this version is asynchronous. Each worker updates the global network independently without waiting for others, making it extremely fast. + +### PPO (Proximal Policy Optimization) + +A modern, state-of-the-art Actor-Critic algorithm used by OpenAI. It ensures that updates to the policy aren't "too large," preventing the model from collapsing during training. + +## 5. Implementation Logic (Pseudo-code) + +```python +# 1. Get action from Actor +probs = actor(state) +action = sample(probs) + +# 2. Interact with Environment +next_state, reward = env.step(action) + +# 3. Get values from Critic +value = critic(state) +next_value = critic(next_state) + +# 4. Calculate Advantage (TD Error) +# Advantage = (r + gamma * next_v) - v +advantage = reward + gamma * next_value - value + +# 5. 
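# Note: this is pseudo-code. In a PyTorch-style version (an assumption, not shown above),
# log_prob(action) would come from torch.distributions.Categorical(probs).log_prob(action),
# and advantage.detach() in the lines below keeps the Critic's estimate from being
# updated through the Actor's loss.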
Backpropagate +actor_loss = -log_prob(action) * advantage.detach() +critic_loss = advantage.pow(2) + +(actor_loss + critic_loss).backward() + +``` + +## 6. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Lower Variance:** Much more stable than pure Policy Gradients. | **Complexity:** Harder to tune because you are training two networks at once. | +| **Online Learning:** Can update after every step (doesn't need to wait for the end of an episode). | **Sample Inefficient:** Can still require millions of interactions for complex games. | +| **Continuous Actions:** Handles continuous movement smoothly. | **Sensitive to Hyperparameters:** Learning rates for Actor and Critic must be balanced. | + +## References + +* **DeepMind's A3C Paper:** "Asynchronous Methods for Deep Reinforcement Learning." +* **OpenAI Spinning Up:** Documentation on PPO and Actor-Critic variants. +* **Reinforcement Learning with David Silver:** Lecture 7 (Policy Gradient and Actor-Critic). +* **Sutton & Barto's "Reinforcement Learning: An Introduction":** Chapter on Actor-Critic Methods. \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/reinforcement-learning/deep-q-networks.mdx b/docs/machine-learning/machine-learning-core/reinforcement-learning/deep-q-networks.mdx index e69de29..da1e178 100644 --- a/docs/machine-learning/machine-learning-core/reinforcement-learning/deep-q-networks.mdx +++ b/docs/machine-learning/machine-learning-core/reinforcement-learning/deep-q-networks.mdx @@ -0,0 +1,182 @@ +--- +title: "Deep Q-Networks (DQN)" +sidebar_label: Deep Q-Networks +description: "Scaling Reinforcement Learning with Deep Learning using Experience Replay and Target Networks." +tags: [machine-learning, reinforcement-learning, dqn, deep-learning, neural-networks] +--- + +**Deep Q-Networks (DQN)** represent the fusion of Reinforcement Learning and Deep Neural Networks. While standard [Q-Learning](/tutorial/machine-learning/machine-learning-core/reinforcement-learning/q-learning) uses a table to store values, DQN uses a **Neural Network** to approximate the Q-value function. + +This advancement allowed RL agents to handle environments with high-dimensional state spaces, such as raw pixels from a video game screen. + +## 1. Why Deep Learning for Q-Learning? + +In a complex environment, the number of possible states is astronomical. +* **Atari 2600:** A $210 \times 160$ pixel screen with 128 colors has more possible states than there are atoms in the universe. +* **The Solution:** Instead of a table, we use a Neural Network ($Q_\theta$) that takes a **State** as input and outputs the predicted **Q-values** for all possible actions. + +## 2. The Two "Secret Ingredients" of DQN + +Standard neural networks struggle with RL because the data is highly correlated (sequential frames in a game are nearly identical). To fix this, DQN introduced two revolutionary concepts: + +### A. Experience Replay +Instead of learning from the current experience immediately, the agent saves its experiences $(s, a, r, s')$ in a **Replay Buffer**. During training, we sample a **random batch** of these experiences. +* **Benefit:** It breaks the correlation between consecutive samples and allows the model to "re-learn" from past successes and failures multiple times. + +### B. Target Networks +In standard Q-Learning, the "target" we are chasing changes every time we update the weights. This is like a dog chasing its own tail. +* **The Fix:** We maintain two networks: + 1. 
**Policy Network:** The one we are constantly training. + 2. **Target Network:** A frozen copy of the Policy Network used to calculate the "target" value. We only update this copy every few thousand steps. + +## 3. The DQN Mathematical Objective + +The loss function for DQN is the squared difference between the **Target Q-value** and the **Predicted Q-value**: + +$$ +L(\theta) = E \left[ \left( \underbrace{r + \gamma \max_{a'} Q_{\theta^{-}}(s', a')}_{\text{Target (Target Network)}} - \underbrace{Q_{\theta}(s, a)}_{\text{Prediction (Policy Network)}} \right)^2 \right] +$$ + +Where: + +* **$\theta$**: Weights of the Policy Network. +* **$\theta^{-}$**: Weights of the Target Network (frozen). +* **$r$**: Reward received after taking action $a$ in state $s$. +* **$\gamma$**: Discount factor for future rewards. + +## 4. The DQN Workflow + +```mermaid +graph LR + ENV["$$\text{Environment}$$"] + + ENV --> S["$$s_t$$
$$\text{Current State}$$"] + + S --> NET["$$Q(s,a;\theta)$$
$$\text{Online Q-Network}$$"] + + NET --> ACT["$$\varepsilon\text{-greedy Policy}a_t=\begin{cases} \text{random action} & \varepsilon \\ \arg\max_a Q(s_t,a;\theta) & 1-\varepsilon \end{cases}$$"] + + ACT --> ENV + + ENV --> R["$$r_t,\ s_{t+1}$$"] + + R --> MEM["$$\text{Replay Buffer } \mathcal{D}$$"] + + MEM --> SAMPLE["$$\text{Sample Mini-batch}$$"] + + SAMPLE --> TARGET["$$y_t = r_t + \gamma \max_a Q(s_{t+1},a;\theta^-)$$"] + + TARGET --> LOSS["$$\mathcal{L}(\theta) = \mathbb{E}\left[(y_t - Q(s_t,a_t;\theta))^2\right]$$"] + + LOSS --> GRAD["$$\nabla_\theta \mathcal{L}$$"] + + GRAD --> UPDATE["$$\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$$"] + + UPDATE --> NET + + NET -.->|"$$\text{Periodically Copy}$$"| TNET["$$\theta^-$$
$$\text{Target Network}$$"] + + +``` + +## 5. Implementation logic (PyTorch-style) + +```python +# The DQN Model +class DQN(nn.Module): + def __init__(self, state_dim, action_dim): + super(DQN, self).__init__() + self.net = nn.Sequential( + nn.Linear(state_dim, 128), + nn.ReLU(), + nn.Linear(128, action_dim) + ) + + def forward(self, x): + return self.net(x) + +# Training Step +def train_step(): + # 1. Sample random batch from replay buffer + states, actions, rewards, next_states, dones = buffer.sample(batch_size) + + # 2. Get current Q-values from Policy Network + current_q = policy_net(states).gather(1, actions) + + # 3. Get maximum Q-values for next states from Target Network + with torch.no_grad(): + next_q = target_net(next_states).max(1)[0] + target_q = rewards + (gamma * next_q * (1 - dones)) + + # 4. Minimize the Loss + loss = F.mse_loss(current_q, target_q.unsqueeze(1)) + optimizer.zero_grad() + loss.backward() + optimizer.step() + +``` + +## 6. Beyond DQN + +While DQN was a massive breakthrough, it has been improved by: + +* **Double DQN:** Reduces the tendency to overestimate Q-values. +* **Dueling DQN:** Separates the calculation of state value and action advantage. +* **Prioritized Experience Replay:** Samples "important" experiences (those with high error) more frequently. + +```mermaid +graph LR + ENV["$$\text{Atari Environment}$$"] + + ENV --> S["$$s_t$$
$$\text{Game State}$$"] + + %% Standard DQN + S --> DQN["Standard DQN"] + + DQN --> Q1["$$Q(s,a;\theta)$$"] + Q1 --> T1["$$y = r + \gamma \max_a Q(s',a;\theta^-)$$"] + T1 --> O1["$$\text{Overestimation Bias}$$"] + O1 --> P1["$$\text{Unstable Learning}$$"] + + %% Double DQN + S --> DDQN["Double DQN"] + + DDQN --> Q2["$$Q(s,a;\theta)$$"] + Q2 --> T2["$$y = r + \gamma Q(s', \arg\max_a Q(s',a;\theta);\theta^-)$$"] + T2 --> O2["$$\text{Reduced Overestimation}$$"] + O2 --> P2["$$\text{More Stable Q-Values}$$"] + + %% Dueling DQN + S --> DUEL["Dueling DQN"] + + DUEL --> V["$$V(s;\theta_v)$$
$$\text{State Value}$$"] + DUEL --> A["$$A(s,a;\theta_a)$$
$$\text{Action Advantage}$$"] + + V --> Q3["$$Q(s,a)=V(s)+A(s,a)-\frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')$$"] + A --> Q3 + + Q3 --> P3["$$\text{Better State Representation}$$"] + P3 --> G3["$$\text{Faster Learning on Atari}$$"] + + %% Experience Replay Enhancement + ENV --> MEM["$$\text{Replay Buffer}$$"] + + MEM --> PER["$$\text{Prioritized Experience Replay}$$"] + PER --> ERR["$$p_i \propto |\delta_i|$$
$$\text{TD Error-Based Sampling}$$"] + ERR --> UPD["$$\text{Faster Convergence}$$"] + + %% Comparison Links + P1 -.->|"$$\text{Beyond DQN}$$"| O2 + O2 -.->|"$$\text{Combined}$$"| G3 + UPD -.->|"$$\text{Boosts All}$$"| G3 + +``` + +## References + +* **Mnih et al. (2015):** "Human-level control through deep reinforcement learning" (The original Nature paper). +* **DeepLizard RL Series:** Excellent visual tutorials on DQN mechanics. + +--- + +**DQN is great for discrete actions (like buttons on a controller). But how do we handle continuous actions, like the pressure applied to a gas pedal?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/reinforcement-learning/policy-gradients.mdx b/docs/machine-learning/machine-learning-core/reinforcement-learning/policy-gradients.mdx index e69de29..745e902 100644 --- a/docs/machine-learning/machine-learning-core/reinforcement-learning/policy-gradients.mdx +++ b/docs/machine-learning/machine-learning-core/reinforcement-learning/policy-gradients.mdx @@ -0,0 +1,128 @@ +--- +title: Policy Gradients +sidebar_label: Policy Gradients +description: "Optimizing the policy directly: understanding the REINFORCE algorithm, stochastic policies, and the Policy Gradient Theorem." +tags: [machine-learning, reinforcement-learning, policy-gradients, reinforce] +--- + +**Policy Gradient** methods are a class of reinforcement learning algorithms that optimize the policy ($\pi$) directly. Unlike [Q-Learning](./q-learning), which learns the value of being in a state, Policy Gradients learn the probability distribution of actions. + +## 1. Why Choose Policy Gradients? + +While Q-Learning is powerful, it struggles with: +1. **Continuous Action Spaces:** It's hard to find the maximum Q-value if there are infinite possible actions (e.g., the exact degree to turn a steering wheel). +2. **Stochastic Policies:** In some games (like Rock-Paper-Scissors), the best strategy is to be random. Q-Learning is inherently deterministic. +3. **High Variance:** Value functions can be unstable. + +## 2. The Core Concept + +We represent the policy using a parameterized function (usually a Neural Network) $\pi_\theta(a|s)$. This function outputs the probability of taking action $a$ given state $s$. + +```mermaid +graph LR + S["$$s_t \in \mathcal{S}$$
$$\text{State}$$"] + + S --> NN["$$\pi_\theta$$
$$\text{Neural Network Policy}$$"] + + NN --> Z["$$z = f_\theta(s_t)$$
$$\text{Latent Representation}$$"] + + Z --> LOG["$$\text{Action Logits}$$"] + + LOG --> SOFT["$$\text{Softmax}$$
$$\pi_\theta(a|s_t)=\frac{e^{z_a}}{\sum_{a'} e^{z_{a'}}}$$"] + + SOFT --> A1["$$P(a_1|s_t)$$"] + SOFT --> A2["$$P(a_2|s_t)$$"] + SOFT --> Ak["$$P(a_k|s_t)$$"] + + A1 --> SAMPLE["$$a_t \sim \pi_\theta(\cdot|s_t)$$"] + A2 --> SAMPLE + Ak --> SAMPLE + + SAMPLE --> ENV["$$\text{Environment}$$"] + ENV --> R["$$r_t,\ s_{t+1}$$"] + + R --> LOSS["$$\mathcal{L}(\theta) = -\mathbb{E}[\log \pi_\theta(a_t|s_t)\, G_t]$$"] + LOSS --> UPDATE["$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$"] + + UPDATE --> NN + +``` + +### The Policy Gradient Theorem + +The goal is to adjust the weights to maximize the total expected reward $J(\theta)$. We use gradient ascent to update the parameters: + +$$ +\nabla_\theta J(\theta) \approx E_{\pi_\theta} [\nabla_\theta \log \pi_\theta(a|s) G_t] +$$ + +Where: + +* **$\nabla_\theta \log \pi_\theta(a|s)$**: The direction that increases the probability of the action taken. +* **$G_t$**: The total return (cumulative reward). If $G_t$ is high, we push the probability up; if $G_t$ is low (or negative), we push the probability down. + +## 3. The REINFORCE Algorithm (Monte Carlo Policy Gradient) + +REINFORCE is the most fundamental policy gradient algorithm. It follows these steps: + +1. **Act:** Run the policy to complete an entire episode and record $(s_t, a_t, r_t)$. +2. **Calculate Returns:** For each step, calculate the total future reward $G_t$. +3. **Update:** Update the weights using the gradient. + +## 4. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Action Flexibility:** Works naturally with continuous and high-dimensional action spaces. | **High Variance:** Updates can be very noisy because one "lucky" or "unlucky" episode can heavily bias the gradient. | +| **Simplicity:** Optimizes the performance measure directly. | **Sample Inefficient:** Often requires thousands of episodes to learn simple tasks. | +| **Convergence:** Generally has better convergence properties than value-based methods. | **Local Optima:** Can get stuck in sub-optimal strategies easily. | + +## 5. Improving Policy Gradients: Baselines + +To reduce the high variance of the gradient, we often subtract a **Baseline** $b(s)$ (usually the average reward expected from that state). This ensures we only push the probability up if the reward was *better than average*. + +$$ +\nabla_\theta J(\theta) = E_{\pi_\theta} [\nabla_\theta \log \pi_\theta(a|s) (G_t - b(s))] +$$ + +## 6. Implementation Sketch (PyTorch) + +```python +import torch +import torch.nn as nn +import torch.optim as optim + +# 1. Define the Policy Network +class Policy(nn.Module): + def __init__(self): + super(Policy, self).__init__() + self.affine1 = nn.Linear(4, 128) + self.affine2 = nn.Linear(128, 2) # 2 possible actions + + def forward(self, x): + x = torch.relu(self.affine1(x)) + action_scores = self.affine2(x) + return torch.softmax(action_scores, dim=1) + +# 2. Select Action based on probabilities +probs = policy(state) +m = torch.distributions.Categorical(probs) +action = m.sample() + +# 3. Update Policy (after episode) +# loss = -log_prob * reward +loss = -m.log_prob(action) * cumulative_reward +optimizer.zero_grad() +loss.backward() +optimizer.step() + +``` + +## References + +* **Andrej Karpathy's "Deep Reinforcement Learning: Pong from Pixels":** The best blog post for understanding the intuition of Policy Gradients. +* **Spinning Up in Deep RL (OpenAI):** A comprehensive educational resource for policy-based methods. + +--- + +**Policy Gradients are great for actions, but they are noisy. 
Q-Learning is stable but biased. What if we combined them?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/reinforcement-learning/q-learning.mdx b/docs/machine-learning/machine-learning-core/reinforcement-learning/q-learning.mdx index e69de29..cc78bcd 100644 --- a/docs/machine-learning/machine-learning-core/reinforcement-learning/q-learning.mdx +++ b/docs/machine-learning/machine-learning-core/reinforcement-learning/q-learning.mdx @@ -0,0 +1,162 @@ +--- +title: "Q-Learning: Learning Through Rewards and Penalties" +sidebar_label: Q-Learning +description: "Mastering the Bellman Equation, Temporal Difference learning, and the Exploration-Exploitation trade-off." +tags: [machine-learning, reinforcement-learning, q-learning, bellman-equation] +--- + +**Q-Learning** is a model-free, off-policy reinforcement learning algorithm. It aims to learn a **Policy**, which tells an agent what action to take under what circumstances to maximize the total reward over time. + +Unlike Supervised Learning, there is no "correct label." The agent learns by interacting with an environment, receiving feedback (rewards or penalties), and updating its internal "knowledge" (the Q-Table). + +## 1. The RL Framework: Agent & Environment + +In any Q-Learning problem, we have: +* **Agent:** The learner/decision-maker. +* **State ($s$):** The current situation of the agent (e.g., coordinates on a grid). +* **Action ($a$):** What the agent can do (e.g., move Up, Down, Left, Right). +* **Reward ($r$):** The feedback received from the environment. + +## 2. The Q-Table + +The "Q" in Q-Learning stands for **Quality**. The Q-Table is a lookup table where rows represent **States** and columns represent **Actions**. Each cell $Q(s, a)$ contains a value representing the expected future reward for taking action $a$ in state $s$. + +```mermaid +graph TD + subgraph Q-Table + T[States / Actions] --> A1[Action: Left] + T --> A2[Action: Right] + S1[State 1] --> V1[0.5] + S1 --> V2[1.2] + S2[State 2] --> V3[-0.1] + S2 --> V4[0.8] + end + + style V2 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#333 + +``` + +## 3. The Bellman Equation (The Heart of Q-Learning) + +The agent updates its Q-values using the **Bellman Equation**. This formula allows the agent to learn the value of a state based on the rewards it expects to get in the *future*. + +$$ +Q(s, a) \leftarrow Q(s, a) + \alpha [R + \gamma \max_{a'} Q(s', a') - Q(s, a)] +$$ + +**Breaking down the math:** + +* **$\alpha$ (Learning Rate):** How much new information overrides old information (0 to 1). +* **$R$:** The immediate reward received. +* **$\gamma$ (Discount Factor):** How much we care about future rewards vs. immediate ones (0 = short-sighted, 1 = long-term vision). +* **$\max_{a'} Q(s', a')$:** The maximum predicted reward for the *next* state. +* **$[ \dots ]$ (Temporal Difference):** The difference between the "Target" (what we found) and the "Estimate" (what we previously thought). + +## 4. Exploration vs. Exploitation ($\epsilon$-greedy) + +An agent faces a dilemma: should it try new things or stick to what it knows works? + +* **Exploration:** Choosing a random action to discover more about the environment. +* **Exploitation:** Choosing the action with the highest Q-value. + +We use the **Epsilon-Greedy Strategy**: + +1. Generate a random number between 0 and 1. +2. If number $< \epsilon$, **Explore**. +3. Otherwise, **Exploit**. *(Usually, $\epsilon$ decays over time as the agent becomes more confident.)* + +## 5. 
Visualizing the Q-Learning Process + +```mermaid +graph LR + ENV["$$\text{Environment}$$
$$K\text{-Armed Bandit}$$"] + + ENV --> A1["$$a_1$$"] + ENV --> A2["$$a_2$$"] + ENV --> A3["$$a_3$$"] + + A1 --> R1["$$r \sim \mathcal{D}_1(\mu_1)$$"] + A2 --> R2["$$r \sim \mathcal{D}_2(\mu_2)$$"] + A3 --> R3["$$r \sim \mathcal{D}_3(\mu_3)$$"] + + R1 --> EST["$$\hat{\mu}_a,\ N_a$$"] + R2 --> EST + R3 --> EST + + EST --> POLICY["$$\pi(a)$$
$$\text{Action Selection Policy}$$"] + + POLICY -->|"$$1-\varepsilon$$"| EXPLOIT["$$\arg\max_a \hat{\mu}_a$$
$$\text{Exploitation}$$"] + POLICY -->|"$$\varepsilon$$"| EXPLORE["$$\text{Sample Non-Greedy Arm}$$
$$\text{Exploration}$$"] + + EXPLOIT --> UPDATE["$$\hat{\mu}_a \leftarrow \hat{\mu}_a + \alpha(r-\hat{\mu}_a)$$"] + EXPLORE --> UPDATE + + UPDATE --> UNC["$$\text{Uncertainty Shrinks as } N_a \uparrow$$"] + UNC --> REG["$$\text{Cumulative Regret}$$"] + REG --> POLICY + + %% Advanced Note + POLICY -.->|"$$\text{UCB / Thompson Sampling}$$"| ADV["$$\hat{\mu}_a + c\sqrt{\frac{\ln t}{N_a}}$$
$$\text{Optimism / Posterior Sampling}$$"] + +``` + +**In this diagram:** + +* The agent interacts with a K-Armed Bandit environment. +* It selects actions based on an $\epsilon$-greedy policy. +* It updates its estimates of action values based on received rewards. + +## 6. Basic Implementation (Python) + +```python +import numpy as np + +# 1. Initialize Q-Table with zeros +q_table = np.zeros([state_space_size, action_space_size]) + +# 2. Hyperparameters +alpha = 0.1 # Learning rate +gamma = 0.95 # Discount factor +epsilon = 0.1 # Exploration rate + +# 3. Training Loop +for episode in range(1000): + state = env.reset() + done = False + + while not done: + # Action Selection (Epsilon-Greedy) + if np.random.uniform(0, 1) < epsilon: + action = env.action_space.sample() # Explore + else: + action = np.argmax(q_table[state]) # Exploit + + # Perform action + next_state, reward, done, _ = env.step(action) + + # Update Q-Value (Bellman Equation) + old_value = q_table[state, action] + next_max = np.max(q_table[next_state]) + + new_value = old_value + alpha * (reward + gamma * next_max - old_value) + q_table[state, action] = new_value + + state = next_state + +``` + +## 7. Limitations of Tabular Q-Learning + +While powerful, standard Q-Learning fails when the **State Space** is too large. + +* **Example:** In Chess, there are more than $10^{40}$ possible board states. A Q-Table that large cannot fit in any computer's RAM. +* **Solution:** Use a Neural Network to *approximate* the Q-values instead of storing them in a table. This is called **Deep Q-Learning (DQN)**. + +## References + +* **Reinforcement Learning: An Introduction (Sutton & Barto):** The definitive textbook on the subject. + +* **DeepMind's Introduction to RL:** Great for understanding the transition to Deep Learning. + +--- + +**You've seen how agents learn through trial and error. But how do we scale this to complex games like Atari or Go?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/dbscan.mdx b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/dbscan.mdx index e69de29..e2aac60 100644 --- a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/dbscan.mdx +++ b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/dbscan.mdx @@ -0,0 +1,99 @@ +--- +title: "DBSCAN: Density-Based Clustering and Outlier Detection" +sidebar_label: DBSCAN +description: "Discovering clusters of arbitrary shapes and identifying outliers using density-based spatial clustering." +tags: [machine-learning, unsupervised-learning, clustering, dbscan, outliers] +--- + +**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) is a powerful clustering algorithm that views clusters as high-density regions separated by low-density regions. Unlike [K-Means](/tutorial/machine-learning/machine-learning-core/unsupervised-learning/clustering/kmeans), DBSCAN does not require you to specify the number of clusters in advance and can find clusters of **arbitrary shapes**. + +## 1. How it Works: Core, Border, and Noise + +DBSCAN classifies every data point into one of three categories based on its local neighborhood: + +1. **Core Points:** A point is a "Core Point" if it has at least `min_samples` points (including itself) within a distance of `eps` (epsilon) around it. +2. **Border Points:** A point that has fewer than `min_samples` within `eps`, but is reachable from a Core Point. +3. **Noise (Outliers):** Any point that is neither a Core Point nor a Border Point. 
These are ignored by the clustering process. + +## 2. Key Hyperparameters + +DBSCAN's performance depends almost entirely on two parameters: + +* **`eps` (Epsilon):** The maximum distance between two samples for one to be considered as in the neighborhood of the other. + * *Too small:* Most data will be labeled as noise. + * *Too large:* Clusters will merge into one giant blob. +* **`min_samples`:** The number of samples in a neighborhood for a point to be considered a core point. + * Higher values are better for noisy datasets. + +## 3. Handling Arbitrary Shapes + +While K-Means and Hierarchical clustering struggle with non-spherical data, DBSCAN excels at finding "natural" shapes like rings, crescents, or nested structures. + +```mermaid +graph LR + X["$$X = \{x_1, x_2, \dots, x_n\}$$
$$\text{Non-Spherical Data (Moons / Circles)}$$"] + + X --> KM["K-Means"] + + KM --> K1["$$\text{Assumes Spherical Clusters}$$"] + K1 --> K2["$$\min \sum_{i=1}^{k} \sum_{x \in C_i} \|x-\mu_i\|^2$$"] + K2 --> K3["$$\text{Distance to Centroid}$$"] + K3 --> K4["$$\text{Fails on Moons / Circles}$$"] + K4 --> K5["$$\text{Forces Incorrect Boundaries}$$"] + + X --> DB["DBSCAN"] + + DB --> D1["$$\varepsilon\text{-Neighborhood}$$"] + D1 --> D2["$$\text{MinPts Density Criterion}$$"] + D2 --> D3["$$\text{Density-Based Clustering}$$"] + D3 --> D4["$$\text{Finds Arbitrary Shapes}$$"] + D4 --> D5["$$\text{Handles Noise + Outliers}$$"] + D5 --> D6["$$\text{Works Well on Moons / Circles}$$"] + + K5 -.->|"$$\text{Comparison}$$"| D6 + +``` + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.cluster import DBSCAN +from sklearn.preprocessing import StandardScaler + +# 1. DBSCAN is distance-based; scaling is CRITICAL +scaler = StandardScaler() +X_scaled = scaler.fit_transform(X) + +# 2. Initialize and Fit +# eps is the radius, min_samples is the density threshold +dbscan = DBSCAN(eps=0.5, min_samples=5) +labels = dbscan.fit_predict(X_scaled) + +# 3. Identifying Outliers +# In Scikit-Learn, noise points are assigned the label -1 +n_outliers = list(labels).count(-1) +print(f"Number of outliers detected: {n_outliers}") + +``` + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **No K needed:** Automatically detects the number of clusters. | **Varying Densities:** Struggles if clusters have vastly different densities. | +| **Outlier Detection:** Naturally identifies noise; doesn't force outliers into clusters. | **Sensitive to eps:** Choosing the right epsilon can be difficult and data-dependent. | +| **Shape Flexible:** Can find clusters of any shape (even "clusters within clusters"). | **Distance Metric:** Effectiveness drops in very high-dimensional data (Curse of Dimensionality). | + +## 6. Determining Epsilon: The K-Distance Plot + +To find the "optimal" epsilon, engineers often use a **K-Distance Plot**. You calculate the distance to the nearest neighbor for every point, sort them, and look for the "elbow." The distance at the elbow is usually a good starting point for `eps`. + +## References for More Details + +* **[Visualizing DBSCAN (Interactive)](https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/):** Seeing exactly how the density-reachable logic unfolds point-by-point. + +* **[Scikit-Learn DBSCAN Guide](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html):** Understanding how to use different distance metrics (Manhattan, Cosine). + +--- + +**Clustering reveals groups, but often we have too many dimensions to visualize them. How do we compress our data into 2D or 3D without losing the structure?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/gaussian-mixtures.mdx b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/gaussian-mixtures.mdx index e69de29..8ec3584 100644 --- a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/gaussian-mixtures.mdx +++ b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/gaussian-mixtures.mdx @@ -0,0 +1,82 @@ +--- +title: "Gaussian Mixture Models (GMM)" +sidebar_label: Gaussian Mixtures +description: "Probabilistic clustering using Expectation-Maximization and the Normal distribution." 
+tags: [machine-learning, unsupervised-learning, clustering, gmm, probability] +--- + +**Gaussian Mixture Models (GMM)** are a sophisticated type of clustering that assumes all data points are generated from a mixture of a finite number of **Gaussian (Normal) Distributions** with unknown parameters. + +Think of GMM as a "generalized" version of [K-Means](./kmeans). While K-Means creates circular clusters, GMM can handle **elliptical** shapes and provides the **probability** of a point belonging to a cluster. + +## 1. Hard vs. Soft Clustering + +Most clustering algorithms provide "Hard" assignments. GMM provides "Soft" assignments. + +* **Hard Clustering (K-Means):** "This point belongs to Cluster A. Period." +* **Soft Clustering (GMM):** "There is a 70% chance this point is in Cluster A, and a 30% chance it is in Cluster B." + +## 2. How it Works: Expectation-Maximization (EM) + +GMM uses a clever two-step iterative process to find the best-fitting Gaussians: + +1. **Expectation (E-step):** For each data point, calculate the probability that it belongs to each cluster based on current Gaussian parameters (mean, variance). +2. **Maximization (M-step):** Update the Gaussian parameters (moving the center and stretching the shape) to better fit the points assigned to them. + + +## 3. The Power of Covariance Shapes + +The "shape" of a Gaussian distribution is determined by its **Covariance**. In Scikit-Learn, you can control the flexibility of these shapes: + +* **Spherical:** Clusters must be circular (like K-Means). +* **Diag:** Clusters can be ellipses, but only aligned with the axes. +* **Tied:** All clusters must share the same shape. +* **Full:** Each cluster can be any oriented ellipse. **(Most Flexible)** + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.mixture import GaussianMixture + +# 1. Initialize the model +# n_components is the number of clusters +gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42) + +# 2. Fit the model +gmm.fit(X) + +# 3. Predict 'Soft' probabilities +# Returns an array of shape (n_samples, n_clusters) +probs = gmm.predict_proba(X) + +# 4. Predict 'Hard' labels (picks the highest probability) +labels = gmm.predict(X) + +``` + +## 5. Choosing the number of clusters: BIC and AIC + +Since GMM is a probabilistic model, we don't use the "Elbow Method." Instead, we use information criteria: + +* **BIC (Bayesian Information Criterion)** +* **AIC (Akaike Information Criterion)** + +We look for the number of clusters that **minimizes** these scores. They reward a good fit but penalize the model for becoming too complex (having too many clusters). + +## 6. GMM vs. K-Means + +| Feature | K-Means | GMM | +| --- | --- | --- | +| **Cluster Shape** | Strictly Circular (Spherical) | Flexible Ellipses | +| **Assignment** | Hard (0 or 1) | Soft (Probabilities) | +| **Math** | Distance-based | Density-based (Statistical) | +| **Flexibility** | Low | High | + + +## References for More Details + +* **[Scikit-Learn GMM Guide](https://scikit-learn.org/stable/modules/mixture.html):** Understanding advanced variants like Bayesian Gaussian Mixture Models. + +--- + +**You have now covered all the major Clustering techniques! However, sometimes the problem isn't the groups, but the number of features. 
Let's learn how to simplify massive datasets.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/hierarchical.mdx b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/hierarchical.mdx index e69de29..978d990 100644 --- a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/hierarchical.mdx +++ b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/hierarchical.mdx @@ -0,0 +1,84 @@ +--- +title: Hierarchical Clustering +sidebar_label: Hierarchical +description: "Understanding Agglomerative clustering, Dendrograms, and linkage criteria." +tags: [machine-learning, unsupervised-learning, clustering, hierarchical, dendrogram] +--- + +**Hierarchical Clustering** is an unsupervised learning algorithm that builds a hierarchy of clusters. Unlike [K-Means](./kmeans), which partitions data into a flat set of $K$ groups, Hierarchical clustering produces a tree-based representation of the data. + +## 1. Types of Hierarchical Clustering + +There are two main strategies for building the hierarchy: + +1. **Agglomerative (Bottom-Up):** + * Starts with each data point as its own single cluster. + * Successively merges the two closest clusters until only one cluster (containing all points) remains. + * **This is the most common approach used in Scikit-Learn.** + +2. **Divisive (Top-Down):** + * Starts with one giant cluster containing all points. + * Successively splits the cluster into smaller ones until each point is its own cluster. + +## 2. The Dendrogram: Visualizing the Tree + +A **Dendrogram** is a type of diagram that records the sequences of merges or splits. It is the most powerful tool in hierarchical clustering because it shows how every point is related. + +* **Vertical Axis:** Represents the distance (dissimilarity) between clusters. +* **Horizontal Axis:** Represents individual data points. +* **Choosing Clusters:** You can "cut" the dendrogram at a specific height. Any vertical line intersected by your horizontal cut represents a distinct cluster. + +## 3. Linkage Criteria + +In step 2 of the algorithm, we must decide what "closest" means when comparing two clusters (instead of just two points). This is called the **Linkage Criterion**. + +| Linkage Type | Description | Result | +| :--- | :--- | :--- | +| **Ward** | Minimizes the variance within clusters. | Creates clusters of relatively equal size (Default in Sklearn). | +| **Complete** | Uses the maximum distance between points in two clusters. | Tends to create compact, tightly bound clusters. | +| **Average** | Uses the average distance between all points in two clusters. | A balanced approach between Single and Complete. | +| **Single** | Uses the minimum distance between points in two clusters. | Can create long, "chain-like" clusters. | + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.cluster import AgglomerativeClustering +import scipy.cluster.hierarchy as sch +import matplotlib.pyplot as plt + +# 1. Visualize the Dendrogram (using Scipy) +plt.figure(figsize=(10, 7)) +dendrogram = sch.dendrogram(sch.linkage(X, method='ward')) +plt.show() + +# 2. Fit the Model +# n_clusters=None + distance_threshold=0 allows you to compute the full tree +model = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='ward') +labels = model.fit_predict(X) + +``` + +## 5. 
Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **No fixed K:** You don't need to know the number of clusters beforehand. | **Computational Cost:** Very slow on large datasets ($O(n^3)$ or $O(n^2)$). | +| **Intuitive Visualization:** Dendrograms provide great insight into data relationships. | **Irreversible:** Once a merge is made, it cannot be undone in later steps. | +| **Stable:** Produces the same results every time (no random initialization). | **Sensitive to Noise:** Outliers can cause branches to merge incorrectly. | + +## 6. Comparison with K-Means + +| Feature | K-Means | Hierarchical | +| --- | --- | --- | +| **Number of Clusters** | Must be specified (K) | Flexible (determined by cut) | +| **Efficiency** | High (Good for big data) | Low (Good for small/medium data) | +| **Shape** | Assumes spherical clusters | Can handle various shapes | +| **Underlying Math** | Centroid-based | Connectivity-based | + +## References for More Details + +* **[Scikit-Learn Documentation](https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering):** Understanding connectivity constraints to speed up the algorithm. + +--- + +**Clustering helps us find groups, but sometimes we have too many variables to visualize effectively. How do we compress 100 features down to 2 without losing the "essence" of the data?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/kmeans.mdx b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/kmeans.mdx index e69de29..d88db6c 100644 --- a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/kmeans.mdx +++ b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/kmeans.mdx @@ -0,0 +1,102 @@ +--- +title: K-Means Clustering +sidebar_label: K-Means +description: "Grouping data into K clusters by minimizing within-cluster variance." +tags: [machine-learning, unsupervised-learning, clustering, kmeans, centroids] +--- + +**K-Means** is an unsupervised learning algorithm that groups data points into $K$ distinct, non-overlapping clusters. Unlike Supervised Learning, there are no "correct answers" or labels; the algorithm finds structure based purely on the features of the data. + +## 1. How the Algorithm Works + +K-Means is an iterative process that follows these steps: + +1. **Initialization:** Choose $K$ (the number of clusters) and randomly place $K$ points in the feature space. these are the **Centroids**. +2. **Assignment:** Assign each data point to the nearest centroid (usually using Euclidean distance). +3. **Update:** Calculate the mean of all points assigned to each centroid. Move the centroid to this new mean position. +4. **Repeat:** Keep repeating the Assignment and Update steps until the centroids stop moving or a maximum number of iterations is reached. + +## 2. Choosing the Optimal 'K': The Elbow Method + +One of the biggest challenges in K-Means is knowing how many clusters to use. If you have too few, you miss patterns; too many, and you over-segment the data. + +We use **Inertia** (the sum of squared distances of samples to their closest cluster center). As $K$ increases, inertia always decreases. We look for the "Elbow"—the point where the rate of decrease shifts significantly. + +## 3. Implementation with Scikit-Learn + +```python +from sklearn.cluster import KMeans +import matplotlib.pyplot as plt + +# 1. 
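# Note: X is assumed to be an (n_samples, n_features) array that has already been
# scaled (see "Scaling is Mandatory" below), e.g. with sklearn's StandardScaler.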
Initialize the model +# n_clusters is the 'K' +# n_init='auto' handles the number of times the algorithm runs with different seeds +kmeans = KMeans(n_clusters=3, n_init='auto', random_state=42) + +# 2. Fit the model (Notice: we only pass X, not y!) +kmeans.fit(X) + +# 3. Get cluster assignments and centroid locations +labels = kmeans.labels_ +centroids = kmeans.cluster_centers_ + +# 4. Predict the cluster for a new point +new_point_cluster = kmeans.predict([[5.1, 3.5]]) + +``` + +## 4. Important Considerations + +### Scaling is Mandatory + +Since K-Means relies on distance (Euclidean), features with larger ranges will dominate the clustering. Always use `StandardScaler` before running K-Means. + +### Sensitivity to Initialization + +Randomly picking centroids can sometimes lead to poor results (local minima). Scikit-Learn uses **K-Means++** by default, a smart initialization technique that spreads out initial centroids to ensure better convergence. + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Scalable:** Very fast and efficient for large datasets. | **Manual K:** You must specify the number of clusters upfront. | +| **Simple:** Easy to interpret and implement. | **Spherical Bias:** Struggles with clusters of irregular shapes (like crescents). | +| **Guaranteed Convergence:** Will always find a solution. | **Sensitive to Outliers:** Outliers can pull centroids away from the true center. | + +```mermaid +graph LR + subgraph SPH["Spherical Clusters (K-Means Works Well)"] + A1["$$X$$ (Data Points)"] --> B1["$$k$$ Random Centroids"] + B1 --> C1["$$\\text{Distance to Centroid}$$"] + C1 --> D1["$$\\text{Voronoi Partitions}$$"] + D1 --> E1["$$\\text{Compact, Round Clusters}$$"] + E1 --> F1["$$\\text{Low Inertia}$$"] + F1 --> G1["$$\\text{Correct Clustering}$$"] + end + + subgraph MOON["Non-Spherical Clusters (Moons)"] + A2["$$X$$ (Moons Data)"] --> B2["$$k$$ Random Centroids"] + B2 --> C2["$$\\text{Euclidean Distance}$$"] + C2 --> D2["$$\\text{Linear Boundaries}$$"] + D2 --> E2["$$\\text{Forced Spherical Split}$$"] + E2 --> F2["$$\\text{High Inertia}$$"] + F2 --> G2["$$\\text{Incorrect Clustering}$$"] + end + + G1 -.->|"$$\\text{Assumption Holds}$$"| G2["$$\\text{Assumption Violated}$$"] + +``` + +## 6. Real-World Use Cases + +* **Customer Segmentation:** Grouping customers by purchasing behavior for targeted marketing. +* **Image Compression:** Reducing the number of colors in an image by clustering similar pixel values. +* **Anomaly Detection:** Identifying data points that are very far from any cluster centroid. + +## References for More Details + +* **[Scikit-Learn Clustering Guide](https://scikit-learn.org/stable/modules/clustering.html#k-means):** Technical details on K-Means++ and the ELKAN algorithm. + +--- + +**K-Means assumes clusters are circular. 
What if your data is organized in hierarchies or nested structures?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/autoencoders.mdx b/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/autoencoders.mdx index e69de29..33d2293 100644 --- a/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/autoencoders.mdx +++ b/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/autoencoders.mdx @@ -0,0 +1,139 @@ +--- +title: Autoencoders +sidebar_label: Autoencoders +description: "Neural network-based dimensionality reduction: Encoder-Decoder architecture and bottleneck representations." +tags: [machine-learning, unsupervised-learning, deep-learning, autoencoders, dimensionality-reduction] +--- + +An **Autoencoder** is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim is to learn a compressed representation (encoding) for a set of data, typically for dimensionality reduction or feature learning. + +## 1. The Architecture: "The Hourglass" + +An autoencoder consists of two main parts connected by a "bottleneck": + +1. **The Encoder:** This part of the network compresses the input into a latent-space representation. It reduces the input dimensions layer by layer. +2. **The Code (Bottleneck):** This is the hidden layer that contains the compressed representation of the input data. It is the "knowledge" extracted from the input. +3. **The Decoder:** This part of the network tries to reconstruct the original input from the compressed code. + +### The Learning Flow +```mermaid +graph LR + Input[Original Input] --> Encoder(Encoder) + Encoder --> Latent[Latent Space / Code] + Latent --> Decoder(Decoder) + Decoder --> Output[Reconstructed Input] + + style Latent fill:#fff3e0,stroke:#ef6c00,stroke-width:4px,color:#333 + style Input fill:#e1f5fe,stroke:#01579b,color:#333 + style Output fill:#e8f5e9,stroke:#2e7d32,color:#333 + +``` + +## 2. The Loss Function: Reconstruction Loss + +The autoencoder is trained to minimize the difference between the **Input** and the **Reconstruction**. Since we want the output to be as close to the input as possible, we use a loss function like **Mean Squared Error (MSE)**. + +$$ +L(x, \hat{x}) = ||x - \hat{x}||^2 +$$ + +Where: + +* $x$ is the original input. +* $\hat{x}$ is the reconstructed output from the decoder. + +The network is forced to prioritize the most important features of the data because the "bottleneck" (Code) doesn't have enough capacity to store everything. + +## 3. Autoencoders vs. PCA + +| Feature | PCA | Autoencoder | +| --- | --- | --- | +| **Mapping** | Linear | Non-Linear (via activation functions) | +| **Complexity** | Simple / Fast | Complex / Resource Intensive | +| **Features** | Principal Components (Orthogonal) | Latent Variables (Flexible) | +| **Use Case** | Tabular data / Simple compression | Image, Audio, and Complex patterns | + +### 3.1 Visual Comparison + +```mermaid +graph LR + X["$$X \in \mathbb{R}^{n \times d}$$
$$\text{Input Data}$$"] + + %% PCA Path + X --> PCA["PCA"] + + PCA --> P1["$$Z = XV_k$$
$$\text{Linear Projection}$$"] + P1 --> P2["$$\hat{X} = ZV_k^\top$$
$$\text{Linear Reconstruction}$$"] + P2 --> P3["$$\min \|X - \hat{X}\|_2^2$$"] + P3 --> P4["$$\text{Captures Linear Variance}$$"] + P4 --> P5["$$\text{Fails on Non-Linear Manifolds}$$"] + + %% Autoencoder Path + X --> AE["Autoencoder"] + + AE --> E["$$Z = f_\theta(X)$$
$$\text{Non-Linear Encoder}$$"] + E --> D["$$\hat{X} = g_\phi(Z)$$
$$\text{Non-Linear Decoder}$$"] + D --> L["$$\min \|X - \hat{X}\|_2^2$$"] + L --> A1["$$\text{Learns Complex Manifolds}$$"] + A1 --> A2["$$\text{Better Reconstruction for Non-Linear Data}$$"] + + P5 -.->|"$$\text{Comparison}$$"| A2 + +``` + +**In this diagram:** + +* PCA uses linear transformations to reduce and reconstruct data, which works well for linearly correlated features. +* Autoencoders use non-linear functions (neural networks) to capture complex patterns, making them more powerful for intricate datasets. + +## 4. Common Types of Autoencoders + +* **Denoising Autoencoder:** Trained to ignore "noise" by receiving a corrupted input and trying to reconstruct the clean version. +* **Sparse Autoencoder:** Uses a penalty in the loss function to ensure only a few neurons in the bottleneck are "active" at once. +* **Variational Autoencoder (VAE):** Instead of learning a fixed code, it learns a probability distribution of the latent space. (Great for generating new data!) + +## 5. Practical Implementation (Keras/TensorFlow) + +```python +import tensorflow as tf +from tensorflow.keras import layers, losses + +# 1. Define the Encoder +encoder = tf.keras.Sequential([ + layers.Input(shape=(784,)), + layers.Dense(128, activation='relu'), + layers.Dense(64, activation='relu'), + layers.Dense(32, activation='relu'), # The Bottleneck +]) + +# 2. Define the Decoder +decoder = tf.keras.Sequential([ + layers.Dense(64, activation='relu'), + layers.Dense(128, activation='relu'), + layers.Dense(784, activation='sigmoid'), # Reconstruct to original size +]) + +# 3. Create the Autoencoder +autoencoder = tf.keras.Model(inputs=encoder.input, outputs=decoder(encoder.output)) +autoencoder.compile(optimizer='adam', loss=losses.MeanSquaredError()) + +# 4. Train (Notice: x_train is both input and target!) +autoencoder.fit(x_train, x_train, epochs=10, shuffle=True) + +``` + +## 6. Real-World Applications + +1. **Anomaly Detection:** If an autoencoder is trained on "normal" data, it will fail to reconstruct "anomalous" data correctly. A high reconstruction error indicates an anomaly. +2. **Image Denoising:** Removing grain or artifacts from photos. +3. **Dimensionality Reduction:** For visualization or speeding up other ML models. + +## References for More Details + +* **TensorFlow Tutorial: Intro to Autoencoders:** +* [Link](https://www.tensorflow.org/tutorials/generative/autoencoder) +* *Best for:* Step-by-step code examples for denoising and anomaly detection. + +--- + +**Autoencoders are the bridge between Unsupervised Learning and Deep Learning. Ready to see how we evaluate all these different models?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/pca.mdx b/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/pca.mdx index e69de29..4ad0daf 100644 --- a/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/pca.mdx +++ b/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/pca.mdx @@ -0,0 +1,136 @@ +--- +title: "Principal Component Analysis (PCA)" +sidebar_label: PCA +description: "Mastering feature extraction, variance preservation, and the math behind Eigenvalues and Eigenvectors." +tags: [machine-learning, unsupervised-learning, dimensionality-reduction, pca, linear-algebra] +--- +**Principal Component Analysis (PCA)** is a statistical technique used to simplify complex datasets. 
It transforms a large set of variables into a smaller one that still contains most of the information (variance) from the original set. + +Think of PCA as taking a 3D object and finding the perfect angle to take a 2D photo of it so that you can still tell exactly what the object is. + +## 1. How PCA Works (The Intuition) + +PCA finds new "axes" for your data called **Principal Components (PCs)**. +* **PC1:** The direction in space along which the data varies the most. +* **PC2:** The direction orthogonal (perpendicular) to PC1 that captures the next highest amount of variation. + +### The Step-by-Step Logic + +1. **Standardize the Data:** PCA is sensitive to the scale of the data, so we standardize features to have a mean of 0 and a standard deviation of 1. +2. **Compute the Covariance Matrix:** This matrix shows how features vary together. +3. **Calculate Eigenvalues and Eigenvectors:** These help identify the directions (eigenvectors) where the data varies the most (eigenvalues). +4. **Sort and Select Principal Components:** We sort the eigenvalues in descending order and select the top `k` eigenvectors to form a new feature space. + +**Now let's visualize this process:** + +```mermaid +graph LR + X["$$X \in \mathbb{R}^{n \times d}$$
$$\text{High-Dimensional Data}$$"] + + X --> S["$$\text{Standardize Data}$$"] + S --> C["$$\Sigma = \frac{1}{n}X^TX$$
$$\text{Covariance Matrix}$$"] + + C --> E["$$\Sigma v = \lambda v$$
$$\text{Eigen Decomposition}$$"] + + E --> PC1["$$\text{PC}_1$$
$$\max \ \text{Var}(Xv_1)$$"] + E --> PC2["$$\text{PC}_2$$
$$\max \ \text{Var}(Xv_2)$$"] + E --> PCk["$$\text{PC}_k$$
$$v_i^\top v_j = 0$$"] + + PC1 --> P1["$$\text{Direction of Maximum Variance}$$"] + PC2 --> P2["$$\text{Orthogonal to PC}_1$$"] + PCk --> P3["$$\text{Explains Remaining Variance}$$"] + + P1 --> R["$$Z = XV_k$$
$$\text{Reduced Representation}$$"] + P2 --> R + P3 --> R + + R --> G["$$\text{Lower-Dimensional Space}$$"] + G --> B["$$\text{Less Noise, Faster Models}$$"] + +``` + +**In this diagram:** + +* We start with high-dimensional data $X$. +* We standardize it and compute the covariance matrix $\Sigma$. +* We perform eigen decomposition to find eigenvalues and eigenvectors. +* We identify the principal components $(PC_1, PC_2, ..., PC_k)$. +* Finally, we project the data onto the new lower-dimensional space $Z$. + +## 2. The Mathematical Foundation + +To perform PCA, we solve for the **Eigenvectors** of the covariance matrix. + +### Step 1: Covariance Matrix ($\Sigma$) + +If we have a standardized matrix $X$, the covariance matrix is calculated as: + +$$ +\Sigma = \frac{1}{n-1} X^T X +$$ +Where: + +* **$\Sigma$**: Covariance Matrix (measures how features vary together). +* **$X^T$**: Transpose of the standardized data matrix. +* **$n$**: Number of samples. + +### Step 2: Eigenvalue Decomposition + +We find the Eigenvectors ($v$) and Eigenvalues ($\lambda$) such that: + +$$ +\Sigma v = \lambda v +$$ + +Where: + +* **Eigenvectors ($v$):** Define the direction of the new axes (Principal Components). +* **Eigenvalues ($\lambda$):** Define the magnitude (how much variance) is captured in that direction. + +## 3. The Explained Variance Ratio + +When you reduce dimensions, you lose some information. We measure this using the **Explained Variance Ratio**. If PC1 explains 70% of the variance and PC2 explains 20%, using both allows you to represent 90% of the original data complexity in just two variables. + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.decomposition import PCA +from sklearn.preprocessing import StandardScaler + +# 1. PCA is extremely sensitive to scale! +scaler = StandardScaler() +X_scaled = scaler.fit_transform(X) + +# 2. Initialize PCA +# n_components can be an integer (2) or a percentage (0.95) +pca = PCA(n_components=2) + +# 3. Fit and Transform the data +X_pca = pca.fit_transform(X_scaled) + +# 4. Check how much information was kept +print(f"Explained Variance: {pca.explained_variance_ratio_}") + +``` + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Removes Noise:** By dropping low-variance components, you remove random fluctuations. | **Loss of Interpretability:** The new "PCs" are combinations of features; they no longer have "real world" names. | +| **Visualization:** Turns 100+ dimensions into a 2D plot you can actually see. | **Linearity:** PCA assumes relationships are linear. It fails on curved structures. | +| **Efficiency:** Speeds up training for other algorithms by reducing feature count. | **Scaling Sensitive:** If you don't scale your data, PCA will focus on the features with the largest units. | + +## 6. When to use PCA? + +1. **High Dimensionality:** When you have too many features and your model is overfitting. +2. **Multicollinearity:** When your features are highly correlated with each other. +3. **Visualization:** When you need to plot high-dimensional clusters on a graph. + +## References for More Details + +* **[Scikit-Learn PCA Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html):** Learning about `IncrementalPCA` for datasets too large for memory. + +--- + +**PCA is amazing for linear structures. But what if your data is twisted or curved? For visualizing complex, non-linear patterns (like the "Swiss Roll"), we use a different tool.** \ No newline at end of file