diff --git a/docs/machine-learning/machine-learning-core/reinforcement-learning/actor-critic.mdx b/docs/machine-learning/machine-learning-core/reinforcement-learning/actor-critic.mdx index e69de29..e6ebc5d 100644 --- a/docs/machine-learning/machine-learning-core/reinforcement-learning/actor-critic.mdx +++ b/docs/machine-learning/machine-learning-core/reinforcement-learning/actor-critic.mdx @@ -0,0 +1,109 @@ +--- +title: Actor-Critic Methods +sidebar_label: Actor-Critic +description: "Combining value-based and policy-based methods for stable and efficient reinforcement learning." +tags: [machine-learning, reinforcement-learning, actor-critic, a2c, a3c] +--- + +**Actor-Critic** methods are a hybrid architecture in Reinforcement Learning that combine the best of both worlds: **Policy Gradients** and **Value-Based** learning. + +In this setup, we use two neural networks: +1. **The Actor:** Learns the strategy (Policy). It decides which action to take. +2. **The Critic:** Learns to evaluate the action. It tells the Actor how "good" the action was by estimating the Value function. + +## 1. Why use Actor-Critic? + +* **Policy Gradients (Actor only):** Have high variance and can be slow to converge because they rely on full episode returns. +* **Q-Learning (Critic only):** Can be biased and struggles with continuous action spaces. +* **Actor-Critic:** Uses the Critic to reduce the variance of the Actor, leading to faster and more stable learning. + +## 2. How it Works: The Advantage + +The Critic doesn't just predict the reward; it predicts the **Advantage** ($A$). The Advantage tells us if an action was better than the average action expected from that state. + +$$ +A(s, a) = Q(s, a) - V(s) +$$ + +Where: + +* **$Q(s, a)$:** The value of taking a specific action. +* **$V(s)$:** The average value of the state (The Baseline). + +If $A > 0$, the Actor is encouraged to take that action more often. If $A < 0$, the Actor is discouraged. + +## 3. The Learning Loop + +```mermaid +graph TD + S[State] --> Actor(Actor: Policy) + S --> Critic(Critic: Value) + Actor --> A[Action] + A --> E[Environment] + E --> R[Reward] + E --> NS[Next State] + R --> TD[TD Error / Advantage] + NS --> TD + TD -->|Feedback| Actor + TD -->|Feedback| Critic + + style Actor fill:#e1f5fe,stroke:#01579b,color:#333 + style Critic fill:#fff3e0,stroke:#ef6c00,color:#333 + style TD fill:#fce4ec,stroke:#d81b60,color:#333 + +``` + +## 4. Popular Variations + +### A2C (Advantage Actor-Critic) + +A synchronous version where multiple agents run in parallel environments. The "Master" agent waits for all workers to finish their steps before updating the global network. + +### A3C (Asynchronous Advantage Actor-Critic) + +Introduced by DeepMind, this version is asynchronous. Each worker updates the global network independently without waiting for others, making it extremely fast. + +### PPO (Proximal Policy Optimization) + +A modern, state-of-the-art Actor-Critic algorithm used by OpenAI. It ensures that updates to the policy aren't "too large," preventing the model from collapsing during training. + +## 5. Implementation Logic (Pseudo-code) + +```python +# 1. Get action from Actor +probs = actor(state) +action = sample(probs) + +# 2. Interact with Environment +next_state, reward = env.step(action) + +# 3. Get values from Critic +value = critic(state) +next_value = critic(next_state) + +# 4. Calculate Advantage (TD Error) +# Advantage = (r + gamma * next_v) - v +advantage = reward + gamma * next_value - value + +# 5. 
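# Note: this is pseudo-code. In a PyTorch-style version (an assumption, not shown above),
# log_prob(action) would come from torch.distributions.Categorical(probs).log_prob(action),
# and advantage.detach() in the lines below keeps the Critic's estimate from being
# updated through the Actor's loss.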
Backpropagate +actor_loss = -log_prob(action) * advantage.detach() +critic_loss = advantage.pow(2) + +(actor_loss + critic_loss).backward() + +``` + +## 6. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Lower Variance:** Much more stable than pure Policy Gradients. | **Complexity:** Harder to tune because you are training two networks at once. | +| **Online Learning:** Can update after every step (doesn't need to wait for the end of an episode). | **Sample Inefficient:** Can still require millions of interactions for complex games. | +| **Continuous Actions:** Handles continuous movement smoothly. | **Sensitive to Hyperparameters:** Learning rates for Actor and Critic must be balanced. | + +## References + +* **DeepMind's A3C Paper:** "Asynchronous Methods for Deep Reinforcement Learning." +* **OpenAI Spinning Up:** Documentation on PPO and Actor-Critic variants. +* **Reinforcement Learning with David Silver:** Lecture 7 (Policy Gradient and Actor-Critic). +* **Sutton & Barto's "Reinforcement Learning: An Introduction":** Chapter on Actor-Critic Methods. \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/reinforcement-learning/deep-q-networks.mdx b/docs/machine-learning/machine-learning-core/reinforcement-learning/deep-q-networks.mdx index e69de29..da1e178 100644 --- a/docs/machine-learning/machine-learning-core/reinforcement-learning/deep-q-networks.mdx +++ b/docs/machine-learning/machine-learning-core/reinforcement-learning/deep-q-networks.mdx @@ -0,0 +1,182 @@ +--- +title: "Deep Q-Networks (DQN)" +sidebar_label: Deep Q-Networks +description: "Scaling Reinforcement Learning with Deep Learning using Experience Replay and Target Networks." +tags: [machine-learning, reinforcement-learning, dqn, deep-learning, neural-networks] +--- + +**Deep Q-Networks (DQN)** represent the fusion of Reinforcement Learning and Deep Neural Networks. While standard [Q-Learning](/tutorial/machine-learning/machine-learning-core/reinforcement-learning/q-learning) uses a table to store values, DQN uses a **Neural Network** to approximate the Q-value function. + +This advancement allowed RL agents to handle environments with high-dimensional state spaces, such as raw pixels from a video game screen. + +## 1. Why Deep Learning for Q-Learning? + +In a complex environment, the number of possible states is astronomical. +* **Atari 2600:** A $210 \times 160$ pixel screen with 128 colors has more possible states than there are atoms in the universe. +* **The Solution:** Instead of a table, we use a Neural Network ($Q_\theta$) that takes a **State** as input and outputs the predicted **Q-values** for all possible actions. + +## 2. The Two "Secret Ingredients" of DQN + +Standard neural networks struggle with RL because the data is highly correlated (sequential frames in a game are nearly identical). To fix this, DQN introduced two revolutionary concepts: + +### A. Experience Replay +Instead of learning from the current experience immediately, the agent saves its experiences $(s, a, r, s')$ in a **Replay Buffer**. During training, we sample a **random batch** of these experiences. +* **Benefit:** It breaks the correlation between consecutive samples and allows the model to "re-learn" from past successes and failures multiple times. + +### B. Target Networks +In standard Q-Learning, the "target" we are chasing changes every time we update the weights. This is like a dog chasing its own tail. +* **The Fix:** We maintain two networks: + 1. 
**Policy Network:** The one we are constantly training. + 2. **Target Network:** A frozen copy of the Policy Network used to calculate the "target" value. We only update this copy every few thousand steps. + +## 3. The DQN Mathematical Objective + +The loss function for DQN is the squared difference between the **Target Q-value** and the **Predicted Q-value**: + +$$ +L(\theta) = E \left[ \left( \underbrace{r + \gamma \max_{a'} Q_{\theta^{-}}(s', a')}_{\text{Target (Target Network)}} - \underbrace{Q_{\theta}(s, a)}_{\text{Prediction (Policy Network)}} \right)^2 \right] +$$ + +Where: + +* **$\theta$**: Weights of the Policy Network. +* **$\theta^{-}$**: Weights of the Target Network (frozen). +* **$r$**: Reward received after taking action $a$ in state $s$. +* **$\gamma$**: Discount factor for future rewards. + +## 4. The DQN Workflow + +```mermaid +graph LR + ENV["$$\text{Environment}$$"] + + ENV --> S["$$s_t$$
$$\text{Current State}$$"] + + S --> NET["$$Q(s,a;\theta)$$
$$\text{Online Q-Network}$$"] + + NET --> ACT["$$\varepsilon\text{-greedy Policy}a_t=\begin{cases} \text{random action} & \varepsilon \\ \arg\max_a Q(s_t,a;\theta) & 1-\varepsilon \end{cases}$$"] + + ACT --> ENV + + ENV --> R["$$r_t,\ s_{t+1}$$"] + + R --> MEM["$$\text{Replay Buffer } \mathcal{D}$$"] + + MEM --> SAMPLE["$$\text{Sample Mini-batch}$$"] + + SAMPLE --> TARGET["$$y_t = r_t + \gamma \max_a Q(s_{t+1},a;\theta^-)$$"] + + TARGET --> LOSS["$$\mathcal{L}(\theta) = \mathbb{E}\left[(y_t - Q(s_t,a_t;\theta))^2\right]$$"] + + LOSS --> GRAD["$$\nabla_\theta \mathcal{L}$$"] + + GRAD --> UPDATE["$$\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$$"] + + UPDATE --> NET + + NET -.->|"$$\text{Periodically Copy}$$"| TNET["$$\theta^-$$
$$\text{Target Network}$$"] + + +``` + +## 5. Implementation logic (PyTorch-style) + +```python +# The DQN Model +class DQN(nn.Module): + def __init__(self, state_dim, action_dim): + super(DQN, self).__init__() + self.net = nn.Sequential( + nn.Linear(state_dim, 128), + nn.ReLU(), + nn.Linear(128, action_dim) + ) + + def forward(self, x): + return self.net(x) + +# Training Step +def train_step(): + # 1. Sample random batch from replay buffer + states, actions, rewards, next_states, dones = buffer.sample(batch_size) + + # 2. Get current Q-values from Policy Network + current_q = policy_net(states).gather(1, actions) + + # 3. Get maximum Q-values for next states from Target Network + with torch.no_grad(): + next_q = target_net(next_states).max(1)[0] + target_q = rewards + (gamma * next_q * (1 - dones)) + + # 4. Minimize the Loss + loss = F.mse_loss(current_q, target_q.unsqueeze(1)) + optimizer.zero_grad() + loss.backward() + optimizer.step() + +``` + +## 6. Beyond DQN + +While DQN was a massive breakthrough, it has been improved by: + +* **Double DQN:** Reduces the tendency to overestimate Q-values. +* **Dueling DQN:** Separates the calculation of state value and action advantage. +* **Prioritized Experience Replay:** Samples "important" experiences (those with high error) more frequently. + +```mermaid +graph LR + ENV["$$\text{Atari Environment}$$"] + + ENV --> S["$$s_t$$
$$\text{Game State}$$"] + + %% Standard DQN + S --> DQN["Standard DQN"] + + DQN --> Q1["$$Q(s,a;\theta)$$"] + Q1 --> T1["$$y = r + \gamma \max_a Q(s',a;\theta^-)$$"] + T1 --> O1["$$\text{Overestimation Bias}$$"] + O1 --> P1["$$\text{Unstable Learning}$$"] + + %% Double DQN + S --> DDQN["Double DQN"] + + DDQN --> Q2["$$Q(s,a;\theta)$$"] + Q2 --> T2["$$y = r + \gamma Q(s', \arg\max_a Q(s',a;\theta);\theta^-)$$"] + T2 --> O2["$$\text{Reduced Overestimation}$$"] + O2 --> P2["$$\text{More Stable Q-Values}$$"] + + %% Dueling DQN + S --> DUEL["Dueling DQN"] + + DUEL --> V["$$V(s;\theta_v)$$
$$\text{State Value}$$"] + DUEL --> A["$$A(s,a;\theta_a)$$
$$\text{Action Advantage}$$"] + + V --> Q3["$$Q(s,a)=V(s)+A(s,a)-\frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')$$"] + A --> Q3 + + Q3 --> P3["$$\text{Better State Representation}$$"] + P3 --> G3["$$\text{Faster Learning on Atari}$$"] + + %% Experience Replay Enhancement + ENV --> MEM["$$\text{Replay Buffer}$$"] + + MEM --> PER["$$\text{Prioritized Experience Replay}$$"] + PER --> ERR["$$p_i \propto |\delta_i|$$
$$\text{TD Error-Based Sampling}$$"] + ERR --> UPD["$$\text{Faster Convergence}$$"] + + %% Comparison Links + P1 -.->|"$$\text{Beyond DQN}$$"| O2 + O2 -.->|"$$\text{Combined}$$"| G3 + UPD -.->|"$$\text{Boosts All}$$"| G3 + +``` + +## References + +* **Mnih et al. (2015):** "Human-level control through deep reinforcement learning" (The original Nature paper). +* **DeepLizard RL Series:** Excellent visual tutorials on DQN mechanics. + +--- + +**DQN is great for discrete actions (like buttons on a controller). But how do we handle continuous actions, like the pressure applied to a gas pedal?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/reinforcement-learning/policy-gradients.mdx b/docs/machine-learning/machine-learning-core/reinforcement-learning/policy-gradients.mdx index e69de29..745e902 100644 --- a/docs/machine-learning/machine-learning-core/reinforcement-learning/policy-gradients.mdx +++ b/docs/machine-learning/machine-learning-core/reinforcement-learning/policy-gradients.mdx @@ -0,0 +1,128 @@ +--- +title: Policy Gradients +sidebar_label: Policy Gradients +description: "Optimizing the policy directly: understanding the REINFORCE algorithm, stochastic policies, and the Policy Gradient Theorem." +tags: [machine-learning, reinforcement-learning, policy-gradients, reinforce] +--- + +**Policy Gradient** methods are a class of reinforcement learning algorithms that optimize the policy ($\pi$) directly. Unlike [Q-Learning](./q-learning), which learns the value of being in a state, Policy Gradients learn the probability distribution of actions. + +## 1. Why Choose Policy Gradients? + +While Q-Learning is powerful, it struggles with: +1. **Continuous Action Spaces:** It's hard to find the maximum Q-value if there are infinite possible actions (e.g., the exact degree to turn a steering wheel). +2. **Stochastic Policies:** In some games (like Rock-Paper-Scissors), the best strategy is to be random. Q-Learning is inherently deterministic. +3. **High Variance:** Value functions can be unstable. + +## 2. The Core Concept + +We represent the policy using a parameterized function (usually a Neural Network) $\pi_\theta(a|s)$. This function outputs the probability of taking action $a$ given state $s$. + +```mermaid +graph LR + S["$$s_t \in \mathcal{S}$$
$$\text{State}$$"] + + S --> NN["$$\pi_\theta$$
$$\text{Neural Network Policy}$$"] + + NN --> Z["$$z = f_\theta(s_t)$$
$$\text{Latent Representation}$$"] + + Z --> LOG["$$\text{Action Logits}$$"] + + LOG --> SOFT["$$\text{Softmax}$$
$$\pi_\theta(a|s_t)=\frac{e^{z_a}}{\sum_{a'} e^{z_{a'}}}$$"] + + SOFT --> A1["$$P(a_1|s_t)$$"] + SOFT --> A2["$$P(a_2|s_t)$$"] + SOFT --> Ak["$$P(a_k|s_t)$$"] + + A1 --> SAMPLE["$$a_t \sim \pi_\theta(\cdot|s_t)$$"] + A2 --> SAMPLE + Ak --> SAMPLE + + SAMPLE --> ENV["$$\text{Environment}$$"] + ENV --> R["$$r_t,\ s_{t+1}$$"] + + R --> LOSS["$$\mathcal{L}(\theta) = -\mathbb{E}[\log \pi_\theta(a_t|s_t)\, G_t]$$"] + LOSS --> UPDATE["$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$"] + + UPDATE --> NN + +``` + +### The Policy Gradient Theorem + +The goal is to adjust the weights to maximize the total expected reward $J(\theta)$. We use gradient ascent to update the parameters: + +$$ +\nabla_\theta J(\theta) \approx E_{\pi_\theta} [\nabla_\theta \log \pi_\theta(a|s) G_t] +$$ + +Where: + +* **$\nabla_\theta \log \pi_\theta(a|s)$**: The direction that increases the probability of the action taken. +* **$G_t$**: The total return (cumulative reward). If $G_t$ is high, we push the probability up; if $G_t$ is low (or negative), we push the probability down. + +## 3. The REINFORCE Algorithm (Monte Carlo Policy Gradient) + +REINFORCE is the most fundamental policy gradient algorithm. It follows these steps: + +1. **Act:** Run the policy to complete an entire episode and record $(s_t, a_t, r_t)$. +2. **Calculate Returns:** For each step, calculate the total future reward $G_t$. +3. **Update:** Update the weights using the gradient. + +## 4. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Action Flexibility:** Works naturally with continuous and high-dimensional action spaces. | **High Variance:** Updates can be very noisy because one "lucky" or "unlucky" episode can heavily bias the gradient. | +| **Simplicity:** Optimizes the performance measure directly. | **Sample Inefficient:** Often requires thousands of episodes to learn simple tasks. | +| **Convergence:** Generally has better convergence properties than value-based methods. | **Local Optima:** Can get stuck in sub-optimal strategies easily. | + +## 5. Improving Policy Gradients: Baselines + +To reduce the high variance of the gradient, we often subtract a **Baseline** $b(s)$ (usually the average reward expected from that state). This ensures we only push the probability up if the reward was *better than average*. + +$$ +\nabla_\theta J(\theta) = E_{\pi_\theta} [\nabla_\theta \log \pi_\theta(a|s) (G_t - b(s))] +$$ + +## 6. Implementation Sketch (PyTorch) + +```python +import torch +import torch.nn as nn +import torch.optim as optim + +# 1. Define the Policy Network +class Policy(nn.Module): + def __init__(self): + super(Policy, self).__init__() + self.affine1 = nn.Linear(4, 128) + self.affine2 = nn.Linear(128, 2) # 2 possible actions + + def forward(self, x): + x = torch.relu(self.affine1(x)) + action_scores = self.affine2(x) + return torch.softmax(action_scores, dim=1) + +# 2. Select Action based on probabilities +probs = policy(state) +m = torch.distributions.Categorical(probs) +action = m.sample() + +# 3. Update Policy (after episode) +# loss = -log_prob * reward +loss = -m.log_prob(action) * cumulative_reward +optimizer.zero_grad() +loss.backward() +optimizer.step() + +``` + +## References + +* **Andrej Karpathy's "Deep Reinforcement Learning: Pong from Pixels":** The best blog post for understanding the intuition of Policy Gradients. +* **Spinning Up in Deep RL (OpenAI):** A comprehensive educational resource for policy-based methods. + +--- + +**Policy Gradients are great for actions, but they are noisy. 
Q-Learning is stable but biased. What if we combined them?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/reinforcement-learning/q-learning.mdx b/docs/machine-learning/machine-learning-core/reinforcement-learning/q-learning.mdx index e69de29..cc78bcd 100644 --- a/docs/machine-learning/machine-learning-core/reinforcement-learning/q-learning.mdx +++ b/docs/machine-learning/machine-learning-core/reinforcement-learning/q-learning.mdx @@ -0,0 +1,162 @@ +--- +title: "Q-Learning: Learning Through Rewards and Penalties" +sidebar_label: Q-Learning +description: "Mastering the Bellman Equation, Temporal Difference learning, and the Exploration-Exploitation trade-off." +tags: [machine-learning, reinforcement-learning, q-learning, bellman-equation] +--- + +**Q-Learning** is a model-free, off-policy reinforcement learning algorithm. It aims to learn a **Policy**, which tells an agent what action to take under what circumstances to maximize the total reward over time. + +Unlike Supervised Learning, there is no "correct label." The agent learns by interacting with an environment, receiving feedback (rewards or penalties), and updating its internal "knowledge" (the Q-Table). + +## 1. The RL Framework: Agent & Environment + +In any Q-Learning problem, we have: +* **Agent:** The learner/decision-maker. +* **State ($s$):** The current situation of the agent (e.g., coordinates on a grid). +* **Action ($a$):** What the agent can do (e.g., move Up, Down, Left, Right). +* **Reward ($r$):** The feedback received from the environment. + +## 2. The Q-Table + +The "Q" in Q-Learning stands for **Quality**. The Q-Table is a lookup table where rows represent **States** and columns represent **Actions**. Each cell $Q(s, a)$ contains a value representing the expected future reward for taking action $a$ in state $s$. + +```mermaid +graph TD + subgraph Q-Table + T[States / Actions] --> A1[Action: Left] + T --> A2[Action: Right] + S1[State 1] --> V1[0.5] + S1 --> V2[1.2] + S2[State 2] --> V3[-0.1] + S2 --> V4[0.8] + end + + style V2 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#333 + +``` + +## 3. The Bellman Equation (The Heart of Q-Learning) + +The agent updates its Q-values using the **Bellman Equation**. This formula allows the agent to learn the value of a state based on the rewards it expects to get in the *future*. + +$$ +Q(s, a) \leftarrow Q(s, a) + \alpha [R + \gamma \max_{a'} Q(s', a') - Q(s, a)] +$$ + +**Breaking down the math:** + +* **$\alpha$ (Learning Rate):** How much new information overrides old information (0 to 1). +* **$R$:** The immediate reward received. +* **$\gamma$ (Discount Factor):** How much we care about future rewards vs. immediate ones (0 = short-sighted, 1 = long-term vision). +* **$\max_{a'} Q(s', a')$:** The maximum predicted reward for the *next* state. +* **$[ \dots ]$ (Temporal Difference):** The difference between the "Target" (what we found) and the "Estimate" (what we previously thought). + +## 4. Exploration vs. Exploitation ($\epsilon$-greedy) + +An agent faces a dilemma: should it try new things or stick to what it knows works? + +* **Exploration:** Choosing a random action to discover more about the environment. +* **Exploitation:** Choosing the action with the highest Q-value. + +We use the **Epsilon-Greedy Strategy**: + +1. Generate a random number between 0 and 1. +2. If number $< \epsilon$, **Explore**. +3. Otherwise, **Exploit**. *(Usually, $\epsilon$ decays over time as the agent becomes more confident.)* + +## 5. 
Visualizing the Q-Learning Process + +```mermaid +graph LR + ENV["$$\text{Environment}$$
$$K\text{-Armed Bandit}$$"] + + ENV --> A1["$$a_1$$"] + ENV --> A2["$$a_2$$"] + ENV --> A3["$$a_3$$"] + + A1 --> R1["$$r \sim \mathcal{D}_1(\mu_1)$$"] + A2 --> R2["$$r \sim \mathcal{D}_2(\mu_2)$$"] + A3 --> R3["$$r \sim \mathcal{D}_3(\mu_3)$$"] + + R1 --> EST["$$\hat{\mu}_a,\ N_a$$"] + R2 --> EST + R3 --> EST + + EST --> POLICY["$$\pi(a)$$
$$\text{Action Selection Policy}$$"] + + POLICY -->|"$$1-\varepsilon$$"| EXPLOIT["$$\arg\max_a \hat{\mu}_a$$
$$\text{Exploitation}$$"] + POLICY -->|"$$\varepsilon$$"| EXPLORE["$$\text{Sample Non-Greedy Arm}$$
$$\text{Exploration}$$"] + + EXPLOIT --> UPDATE["$$\hat{\mu}_a \leftarrow \hat{\mu}_a + \alpha(r-\hat{\mu}_a)$$"] + EXPLORE --> UPDATE + + UPDATE --> UNC["$$\text{Uncertainty Shrinks as } N_a \uparrow$$"] + UNC --> REG["$$\text{Cumulative Regret}$$"] + REG --> POLICY + + %% Advanced Note + POLICY -.->|"$$\text{UCB / Thompson Sampling}$$"| ADV["$$\hat{\mu}_a + c\sqrt{\frac{\ln t}{N_a}}$$
$$\text{Optimism / Posterior Sampling}$$"] + +``` + +**In this diagram:** + +* The agent interacts with a K-Armed Bandit environment. +* It selects actions based on an $\epsilon$-greedy policy. +* It updates its estimates of action values based on received rewards. + +## 6. Basic Implementation (Python) + +```python +import numpy as np + +# 1. Initialize Q-Table with zeros +q_table = np.zeros([state_space_size, action_space_size]) + +# 2. Hyperparameters +alpha = 0.1 # Learning rate +gamma = 0.95 # Discount factor +epsilon = 0.1 # Exploration rate + +# 3. Training Loop +for episode in range(1000): + state = env.reset() + done = False + + while not done: + # Action Selection (Epsilon-Greedy) + if np.random.uniform(0, 1) < epsilon: + action = env.action_space.sample() # Explore + else: + action = np.argmax(q_table[state]) # Exploit + + # Perform action + next_state, reward, done, _ = env.step(action) + + # Update Q-Value (Bellman Equation) + old_value = q_table[state, action] + next_max = np.max(q_table[next_state]) + + new_value = old_value + alpha * (reward + gamma * next_max - old_value) + q_table[state, action] = new_value + + state = next_state + +``` + +## 7. Limitations of Tabular Q-Learning + +While powerful, standard Q-Learning fails when the **State Space** is too large. + +* **Example:** In Chess, there are more than $10^{40}$ possible board states. A Q-Table that large cannot fit in any computer's RAM. +* **Solution:** Use a Neural Network to *approximate* the Q-values instead of storing them in a table. This is called **Deep Q-Learning (DQN)**. + +## References + +* **Reinforcement Learning: An Introduction (Sutton & Barto):** The definitive textbook on the subject. + +* **DeepMind's Introduction to RL:** Great for understanding the transition to Deep Learning. + +--- + +**You've seen how agents learn through trial and error. But how do we scale this to complex games like Atari or Go?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/dbscan.mdx b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/dbscan.mdx index e69de29..e2aac60 100644 --- a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/dbscan.mdx +++ b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/dbscan.mdx @@ -0,0 +1,99 @@ +--- +title: "DBSCAN: Density-Based Clustering and Outlier Detection" +sidebar_label: DBSCAN +description: "Discovering clusters of arbitrary shapes and identifying outliers using density-based spatial clustering." +tags: [machine-learning, unsupervised-learning, clustering, dbscan, outliers] +--- + +**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) is a powerful clustering algorithm that views clusters as high-density regions separated by low-density regions. Unlike [K-Means](/tutorial/machine-learning/machine-learning-core/unsupervised-learning/clustering/kmeans), DBSCAN does not require you to specify the number of clusters in advance and can find clusters of **arbitrary shapes**. + +## 1. How it Works: Core, Border, and Noise + +DBSCAN classifies every data point into one of three categories based on its local neighborhood: + +1. **Core Points:** A point is a "Core Point" if it has at least `min_samples` points (including itself) within a distance of `eps` (epsilon) around it. +2. **Border Points:** A point that has fewer than `min_samples` within `eps`, but is reachable from a Core Point. +3. **Noise (Outliers):** Any point that is neither a Core Point nor a Border Point. 
These are ignored by the clustering process. + +## 2. Key Hyperparameters + +DBSCAN's performance depends almost entirely on two parameters: + +* **`eps` (Epsilon):** The maximum distance between two samples for one to be considered as in the neighborhood of the other. + * *Too small:* Most data will be labeled as noise. + * *Too large:* Clusters will merge into one giant blob. +* **`min_samples`:** The number of samples in a neighborhood for a point to be considered a core point. + * Higher values are better for noisy datasets. + +## 3. Handling Arbitrary Shapes + +While K-Means and Hierarchical clustering struggle with non-spherical data, DBSCAN excels at finding "natural" shapes like rings, crescents, or nested structures. + +```mermaid +graph LR + X["$$X = \{x_1, x_2, \dots, x_n\}$$
$$\text{Non-Spherical Data (Moons / Circles)}$$"] + + X --> KM["K-Means"] + + KM --> K1["$$\text{Assumes Spherical Clusters}$$"] + K1 --> K2["$$\min \sum_{i=1}^{k} \sum_{x \in C_i} \|x-\mu_i\|^2$$"] + K2 --> K3["$$\text{Distance to Centroid}$$"] + K3 --> K4["$$\text{Fails on Moons / Circles}$$"] + K4 --> K5["$$\text{Forces Incorrect Boundaries}$$"] + + X --> DB["DBSCAN"] + + DB --> D1["$$\varepsilon\text{-Neighborhood}$$"] + D1 --> D2["$$\text{MinPts Density Criterion}$$"] + D2 --> D3["$$\text{Density-Based Clustering}$$"] + D3 --> D4["$$\text{Finds Arbitrary Shapes}$$"] + D4 --> D5["$$\text{Handles Noise + Outliers}$$"] + D5 --> D6["$$\text{Works Well on Moons / Circles}$$"] + + K5 -.->|"$$\text{Comparison}$$"| D6 + +``` + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.cluster import DBSCAN +from sklearn.preprocessing import StandardScaler + +# 1. DBSCAN is distance-based; scaling is CRITICAL +scaler = StandardScaler() +X_scaled = scaler.fit_transform(X) + +# 2. Initialize and Fit +# eps is the radius, min_samples is the density threshold +dbscan = DBSCAN(eps=0.5, min_samples=5) +labels = dbscan.fit_predict(X_scaled) + +# 3. Identifying Outliers +# In Scikit-Learn, noise points are assigned the label -1 +n_outliers = list(labels).count(-1) +print(f"Number of outliers detected: {n_outliers}") + +``` + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **No K needed:** Automatically detects the number of clusters. | **Varying Densities:** Struggles if clusters have vastly different densities. | +| **Outlier Detection:** Naturally identifies noise; doesn't force outliers into clusters. | **Sensitive to eps:** Choosing the right epsilon can be difficult and data-dependent. | +| **Shape Flexible:** Can find clusters of any shape (even "clusters within clusters"). | **Distance Metric:** Effectiveness drops in very high-dimensional data (Curse of Dimensionality). | + +## 6. Determining Epsilon: The K-Distance Plot + +To find the "optimal" epsilon, engineers often use a **K-Distance Plot**. You calculate the distance to the nearest neighbor for every point, sort them, and look for the "elbow." The distance at the elbow is usually a good starting point for `eps`. + +## References for More Details + +* **[Visualizing DBSCAN (Interactive)](https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/):** Seeing exactly how the density-reachable logic unfolds point-by-point. + +* **[Scikit-Learn DBSCAN Guide](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html):** Understanding how to use different distance metrics (Manhattan, Cosine). + +--- + +**Clustering reveals groups, but often we have too many dimensions to visualize them. How do we compress our data into 2D or 3D without losing the structure?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/gaussian-mixtures.mdx b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/gaussian-mixtures.mdx index e69de29..8ec3584 100644 --- a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/gaussian-mixtures.mdx +++ b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/gaussian-mixtures.mdx @@ -0,0 +1,82 @@ +--- +title: "Gaussian Mixture Models (GMM)" +sidebar_label: Gaussian Mixtures +description: "Probabilistic clustering using Expectation-Maximization and the Normal distribution." 
+tags: [machine-learning, unsupervised-learning, clustering, gmm, probability] +--- + +**Gaussian Mixture Models (GMM)** are a sophisticated type of clustering that assumes all data points are generated from a mixture of a finite number of **Gaussian (Normal) Distributions** with unknown parameters. + +Think of GMM as a "generalized" version of [K-Means](./kmeans). While K-Means creates circular clusters, GMM can handle **elliptical** shapes and provides the **probability** of a point belonging to a cluster. + +## 1. Hard vs. Soft Clustering + +Most clustering algorithms provide "Hard" assignments. GMM provides "Soft" assignments. + +* **Hard Clustering (K-Means):** "This point belongs to Cluster A. Period." +* **Soft Clustering (GMM):** "There is a 70% chance this point is in Cluster A, and a 30% chance it is in Cluster B." + +## 2. How it Works: Expectation-Maximization (EM) + +GMM uses a clever two-step iterative process to find the best-fitting Gaussians: + +1. **Expectation (E-step):** For each data point, calculate the probability that it belongs to each cluster based on current Gaussian parameters (mean, variance). +2. **Maximization (M-step):** Update the Gaussian parameters (moving the center and stretching the shape) to better fit the points assigned to them. + + +## 3. The Power of Covariance Shapes + +The "shape" of a Gaussian distribution is determined by its **Covariance**. In Scikit-Learn, you can control the flexibility of these shapes: + +* **Spherical:** Clusters must be circular (like K-Means). +* **Diag:** Clusters can be ellipses, but only aligned with the axes. +* **Tied:** All clusters must share the same shape. +* **Full:** Each cluster can be any oriented ellipse. **(Most Flexible)** + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.mixture import GaussianMixture + +# 1. Initialize the model +# n_components is the number of clusters +gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42) + +# 2. Fit the model +gmm.fit(X) + +# 3. Predict 'Soft' probabilities +# Returns an array of shape (n_samples, n_clusters) +probs = gmm.predict_proba(X) + +# 4. Predict 'Hard' labels (picks the highest probability) +labels = gmm.predict(X) + +``` + +## 5. Choosing the number of clusters: BIC and AIC + +Since GMM is a probabilistic model, we don't use the "Elbow Method." Instead, we use information criteria: + +* **BIC (Bayesian Information Criterion)** +* **AIC (Akaike Information Criterion)** + +We look for the number of clusters that **minimizes** these scores. They reward a good fit but penalize the model for becoming too complex (having too many clusters). + +## 6. GMM vs. K-Means + +| Feature | K-Means | GMM | +| --- | --- | --- | +| **Cluster Shape** | Strictly Circular (Spherical) | Flexible Ellipses | +| **Assignment** | Hard (0 or 1) | Soft (Probabilities) | +| **Math** | Distance-based | Density-based (Statistical) | +| **Flexibility** | Low | High | + + +## References for More Details + +* **[Scikit-Learn GMM Guide](https://scikit-learn.org/stable/modules/mixture.html):** Understanding advanced variants like Bayesian Gaussian Mixture Models. + +--- + +**You have now covered all the major Clustering techniques! However, sometimes the problem isn't the groups, but the number of features. 
Let's learn how to simplify massive datasets.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/hierarchical.mdx b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/hierarchical.mdx index e69de29..978d990 100644 --- a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/hierarchical.mdx +++ b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/hierarchical.mdx @@ -0,0 +1,84 @@ +--- +title: Hierarchical Clustering +sidebar_label: Hierarchical +description: "Understanding Agglomerative clustering, Dendrograms, and linkage criteria." +tags: [machine-learning, unsupervised-learning, clustering, hierarchical, dendrogram] +--- + +**Hierarchical Clustering** is an unsupervised learning algorithm that builds a hierarchy of clusters. Unlike [K-Means](./kmeans), which partitions data into a flat set of $K$ groups, Hierarchical clustering produces a tree-based representation of the data. + +## 1. Types of Hierarchical Clustering + +There are two main strategies for building the hierarchy: + +1. **Agglomerative (Bottom-Up):** + * Starts with each data point as its own single cluster. + * Successively merges the two closest clusters until only one cluster (containing all points) remains. + * **This is the most common approach used in Scikit-Learn.** + +2. **Divisive (Top-Down):** + * Starts with one giant cluster containing all points. + * Successively splits the cluster into smaller ones until each point is its own cluster. + +## 2. The Dendrogram: Visualizing the Tree + +A **Dendrogram** is a type of diagram that records the sequences of merges or splits. It is the most powerful tool in hierarchical clustering because it shows how every point is related. + +* **Vertical Axis:** Represents the distance (dissimilarity) between clusters. +* **Horizontal Axis:** Represents individual data points. +* **Choosing Clusters:** You can "cut" the dendrogram at a specific height. Any vertical line intersected by your horizontal cut represents a distinct cluster. + +## 3. Linkage Criteria + +In step 2 of the algorithm, we must decide what "closest" means when comparing two clusters (instead of just two points). This is called the **Linkage Criterion**. + +| Linkage Type | Description | Result | +| :--- | :--- | :--- | +| **Ward** | Minimizes the variance within clusters. | Creates clusters of relatively equal size (Default in Sklearn). | +| **Complete** | Uses the maximum distance between points in two clusters. | Tends to create compact, tightly bound clusters. | +| **Average** | Uses the average distance between all points in two clusters. | A balanced approach between Single and Complete. | +| **Single** | Uses the minimum distance between points in two clusters. | Can create long, "chain-like" clusters. | + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.cluster import AgglomerativeClustering +import scipy.cluster.hierarchy as sch +import matplotlib.pyplot as plt + +# 1. Visualize the Dendrogram (using Scipy) +plt.figure(figsize=(10, 7)) +dendrogram = sch.dendrogram(sch.linkage(X, method='ward')) +plt.show() + +# 2. Fit the Model +# n_clusters=None + distance_threshold=0 allows you to compute the full tree +model = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='ward') +labels = model.fit_predict(X) + +``` + +## 5. 
Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **No fixed K:** You don't need to know the number of clusters beforehand. | **Computational Cost:** Very slow on large datasets ($O(n^3)$ or $O(n^2)$). | +| **Intuitive Visualization:** Dendrograms provide great insight into data relationships. | **Irreversible:** Once a merge is made, it cannot be undone in later steps. | +| **Stable:** Produces the same results every time (no random initialization). | **Sensitive to Noise:** Outliers can cause branches to merge incorrectly. | + +## 6. Comparison with K-Means + +| Feature | K-Means | Hierarchical | +| --- | --- | --- | +| **Number of Clusters** | Must be specified (K) | Flexible (determined by cut) | +| **Efficiency** | High (Good for big data) | Low (Good for small/medium data) | +| **Shape** | Assumes spherical clusters | Can handle various shapes | +| **Underlying Math** | Centroid-based | Connectivity-based | + +## References for More Details + +* **[Scikit-Learn Documentation](https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering):** Understanding connectivity constraints to speed up the algorithm. + +--- + +**Clustering helps us find groups, but sometimes we have too many variables to visualize effectively. How do we compress 100 features down to 2 without losing the "essence" of the data?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/kmeans.mdx b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/kmeans.mdx index e69de29..d88db6c 100644 --- a/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/kmeans.mdx +++ b/docs/machine-learning/machine-learning-core/unsupervised-learning/clustering/kmeans.mdx @@ -0,0 +1,102 @@ +--- +title: K-Means Clustering +sidebar_label: K-Means +description: "Grouping data into K clusters by minimizing within-cluster variance." +tags: [machine-learning, unsupervised-learning, clustering, kmeans, centroids] +--- + +**K-Means** is an unsupervised learning algorithm that groups data points into $K$ distinct, non-overlapping clusters. Unlike Supervised Learning, there are no "correct answers" or labels; the algorithm finds structure based purely on the features of the data. + +## 1. How the Algorithm Works + +K-Means is an iterative process that follows these steps: + +1. **Initialization:** Choose $K$ (the number of clusters) and randomly place $K$ points in the feature space. these are the **Centroids**. +2. **Assignment:** Assign each data point to the nearest centroid (usually using Euclidean distance). +3. **Update:** Calculate the mean of all points assigned to each centroid. Move the centroid to this new mean position. +4. **Repeat:** Keep repeating the Assignment and Update steps until the centroids stop moving or a maximum number of iterations is reached. + +## 2. Choosing the Optimal 'K': The Elbow Method + +One of the biggest challenges in K-Means is knowing how many clusters to use. If you have too few, you miss patterns; too many, and you over-segment the data. + +We use **Inertia** (the sum of squared distances of samples to their closest cluster center). As $K$ increases, inertia always decreases. We look for the "Elbow"—the point where the rate of decrease shifts significantly. + +## 3. Implementation with Scikit-Learn + +```python +from sklearn.cluster import KMeans +import matplotlib.pyplot as plt + +# 1. 
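# Note: X is assumed to be an (n_samples, n_features) array that has already been
# scaled (see "Scaling is Mandatory" below), e.g. with sklearn's StandardScaler.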
Initialize the model +# n_clusters is the 'K' +# n_init='auto' handles the number of times the algorithm runs with different seeds +kmeans = KMeans(n_clusters=3, n_init='auto', random_state=42) + +# 2. Fit the model (Notice: we only pass X, not y!) +kmeans.fit(X) + +# 3. Get cluster assignments and centroid locations +labels = kmeans.labels_ +centroids = kmeans.cluster_centers_ + +# 4. Predict the cluster for a new point +new_point_cluster = kmeans.predict([[5.1, 3.5]]) + +``` + +## 4. Important Considerations + +### Scaling is Mandatory + +Since K-Means relies on distance (Euclidean), features with larger ranges will dominate the clustering. Always use `StandardScaler` before running K-Means. + +### Sensitivity to Initialization + +Randomly picking centroids can sometimes lead to poor results (local minima). Scikit-Learn uses **K-Means++** by default, a smart initialization technique that spreads out initial centroids to ensure better convergence. + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Scalable:** Very fast and efficient for large datasets. | **Manual K:** You must specify the number of clusters upfront. | +| **Simple:** Easy to interpret and implement. | **Spherical Bias:** Struggles with clusters of irregular shapes (like crescents). | +| **Guaranteed Convergence:** Will always find a solution. | **Sensitive to Outliers:** Outliers can pull centroids away from the true center. | + +```mermaid +graph LR + subgraph SPH["Spherical Clusters (K-Means Works Well)"] + A1["$$X$$ (Data Points)"] --> B1["$$k$$ Random Centroids"] + B1 --> C1["$$\\text{Distance to Centroid}$$"] + C1 --> D1["$$\\text{Voronoi Partitions}$$"] + D1 --> E1["$$\\text{Compact, Round Clusters}$$"] + E1 --> F1["$$\\text{Low Inertia}$$"] + F1 --> G1["$$\\text{Correct Clustering}$$"] + end + + subgraph MOON["Non-Spherical Clusters (Moons)"] + A2["$$X$$ (Moons Data)"] --> B2["$$k$$ Random Centroids"] + B2 --> C2["$$\\text{Euclidean Distance}$$"] + C2 --> D2["$$\\text{Linear Boundaries}$$"] + D2 --> E2["$$\\text{Forced Spherical Split}$$"] + E2 --> F2["$$\\text{High Inertia}$$"] + F2 --> G2["$$\\text{Incorrect Clustering}$$"] + end + + G1 -.->|"$$\\text{Assumption Holds}$$"| G2["$$\\text{Assumption Violated}$$"] + +``` + +## 6. Real-World Use Cases + +* **Customer Segmentation:** Grouping customers by purchasing behavior for targeted marketing. +* **Image Compression:** Reducing the number of colors in an image by clustering similar pixel values. +* **Anomaly Detection:** Identifying data points that are very far from any cluster centroid. + +## References for More Details + +* **[Scikit-Learn Clustering Guide](https://scikit-learn.org/stable/modules/clustering.html#k-means):** Technical details on K-Means++ and the ELKAN algorithm. + +--- + +**K-Means assumes clusters are circular. 
What if your data is organized in hierarchies or nested structures?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/autoencoders.mdx b/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/autoencoders.mdx index e69de29..33d2293 100644 --- a/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/autoencoders.mdx +++ b/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/autoencoders.mdx @@ -0,0 +1,139 @@ +--- +title: Autoencoders +sidebar_label: Autoencoders +description: "Neural network-based dimensionality reduction: Encoder-Decoder architecture and bottleneck representations." +tags: [machine-learning, unsupervised-learning, deep-learning, autoencoders, dimensionality-reduction] +--- + +An **Autoencoder** is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim is to learn a compressed representation (encoding) for a set of data, typically for dimensionality reduction or feature learning. + +## 1. The Architecture: "The Hourglass" + +An autoencoder consists of two main parts connected by a "bottleneck": + +1. **The Encoder:** This part of the network compresses the input into a latent-space representation. It reduces the input dimensions layer by layer. +2. **The Code (Bottleneck):** This is the hidden layer that contains the compressed representation of the input data. It is the "knowledge" extracted from the input. +3. **The Decoder:** This part of the network tries to reconstruct the original input from the compressed code. + +### The Learning Flow +```mermaid +graph LR + Input[Original Input] --> Encoder(Encoder) + Encoder --> Latent[Latent Space / Code] + Latent --> Decoder(Decoder) + Decoder --> Output[Reconstructed Input] + + style Latent fill:#fff3e0,stroke:#ef6c00,stroke-width:4px,color:#333 + style Input fill:#e1f5fe,stroke:#01579b,color:#333 + style Output fill:#e8f5e9,stroke:#2e7d32,color:#333 + +``` + +## 2. The Loss Function: Reconstruction Loss + +The autoencoder is trained to minimize the difference between the **Input** and the **Reconstruction**. Since we want the output to be as close to the input as possible, we use a loss function like **Mean Squared Error (MSE)**. + +$$ +L(x, \hat{x}) = ||x - \hat{x}||^2 +$$ + +Where: + +* $x$ is the original input. +* $\hat{x}$ is the reconstructed output from the decoder. + +The network is forced to prioritize the most important features of the data because the "bottleneck" (Code) doesn't have enough capacity to store everything. + +## 3. Autoencoders vs. PCA + +| Feature | PCA | Autoencoder | +| --- | --- | --- | +| **Mapping** | Linear | Non-Linear (via activation functions) | +| **Complexity** | Simple / Fast | Complex / Resource Intensive | +| **Features** | Principal Components (Orthogonal) | Latent Variables (Flexible) | +| **Use Case** | Tabular data / Simple compression | Image, Audio, and Complex patterns | + +### 3.1 Visual Comparison + +```mermaid +graph LR + X["$$X \in \mathbb{R}^{n \times d}$$
$$\text{Input Data}$$"] + + %% PCA Path + X --> PCA["PCA"] + + PCA --> P1["$$Z = XV_k$$
$$\text{Linear Projection}$$"] + P1 --> P2["$$\hat{X} = ZV_k^\top$$
$$\text{Linear Reconstruction}$$"] + P2 --> P3["$$\min \|X - \hat{X}\|_2^2$$"] + P3 --> P4["$$\text{Captures Linear Variance}$$"] + P4 --> P5["$$\text{Fails on Non-Linear Manifolds}$$"] + + %% Autoencoder Path + X --> AE["Autoencoder"] + + AE --> E["$$Z = f_\theta(X)$$
$$\text{Non-Linear Encoder}$$"] + E --> D["$$\hat{X} = g_\phi(Z)$$
$$\text{Non-Linear Decoder}$$"] + D --> L["$$\min \|X - \hat{X}\|_2^2$$"] + L --> A1["$$\text{Learns Complex Manifolds}$$"] + A1 --> A2["$$\text{Better Reconstruction for Non-Linear Data}$$"] + + P5 -.->|"$$\text{Comparison}$$"| A2 + +``` + +**In this diagram:** + +* PCA uses linear transformations to reduce and reconstruct data, which works well for linearly correlated features. +* Autoencoders use non-linear functions (neural networks) to capture complex patterns, making them more powerful for intricate datasets. + +## 4. Common Types of Autoencoders + +* **Denoising Autoencoder:** Trained to ignore "noise" by receiving a corrupted input and trying to reconstruct the clean version. +* **Sparse Autoencoder:** Uses a penalty in the loss function to ensure only a few neurons in the bottleneck are "active" at once. +* **Variational Autoencoder (VAE):** Instead of learning a fixed code, it learns a probability distribution of the latent space. (Great for generating new data!) + +## 5. Practical Implementation (Keras/TensorFlow) + +```python +import tensorflow as tf +from tensorflow.keras import layers, losses + +# 1. Define the Encoder +encoder = tf.keras.Sequential([ + layers.Input(shape=(784,)), + layers.Dense(128, activation='relu'), + layers.Dense(64, activation='relu'), + layers.Dense(32, activation='relu'), # The Bottleneck +]) + +# 2. Define the Decoder +decoder = tf.keras.Sequential([ + layers.Dense(64, activation='relu'), + layers.Dense(128, activation='relu'), + layers.Dense(784, activation='sigmoid'), # Reconstruct to original size +]) + +# 3. Create the Autoencoder +autoencoder = tf.keras.Model(inputs=encoder.input, outputs=decoder(encoder.output)) +autoencoder.compile(optimizer='adam', loss=losses.MeanSquaredError()) + +# 4. Train (Notice: x_train is both input and target!) +autoencoder.fit(x_train, x_train, epochs=10, shuffle=True) + +``` + +## 6. Real-World Applications + +1. **Anomaly Detection:** If an autoencoder is trained on "normal" data, it will fail to reconstruct "anomalous" data correctly. A high reconstruction error indicates an anomaly. +2. **Image Denoising:** Removing grain or artifacts from photos. +3. **Dimensionality Reduction:** For visualization or speeding up other ML models. + +## References for More Details + +* **TensorFlow Tutorial: Intro to Autoencoders:** +* [Link](https://www.tensorflow.org/tutorials/generative/autoencoder) +* *Best for:* Step-by-step code examples for denoising and anomaly detection. + +--- + +**Autoencoders are the bridge between Unsupervised Learning and Deep Learning. Ready to see how we evaluate all these different models?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/pca.mdx b/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/pca.mdx index e69de29..4ad0daf 100644 --- a/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/pca.mdx +++ b/docs/machine-learning/machine-learning-core/unsupervised-learning/dimensionality-reduction/pca.mdx @@ -0,0 +1,136 @@ +--- +title: "Principal Component Analysis (PCA)" +sidebar_label: PCA +description: "Mastering feature extraction, variance preservation, and the math behind Eigenvalues and Eigenvectors." +tags: [machine-learning, unsupervised-learning, dimensionality-reduction, pca, linear-algebra] +--- +**Principal Component Analysis (PCA)** is a statistical technique used to simplify complex datasets. 
It transforms a large set of variables into a smaller one that still contains most of the information (variance) from the original set. + +Think of PCA as taking a 3D object and finding the perfect angle to take a 2D photo of it so that you can still tell exactly what the object is. + +## 1. How PCA Works (The Intuition) + +PCA finds new "axes" for your data called **Principal Components (PCs)**. +* **PC1:** The direction in space along which the data varies the most. +* **PC2:** The direction orthogonal (perpendicular) to PC1 that captures the next highest amount of variation. + +### The Step-by-Step Logic + +1. **Standardize the Data:** PCA is sensitive to the scale of the data, so we standardize features to have a mean of 0 and a standard deviation of 1. +2. **Compute the Covariance Matrix:** This matrix shows how features vary together. +3. **Calculate Eigenvalues and Eigenvectors:** These help identify the directions (eigenvectors) where the data varies the most (eigenvalues). +4. **Sort and Select Principal Components:** We sort the eigenvalues in descending order and select the top `k` eigenvectors to form a new feature space. + +**Now let's visualize this process:** + +```mermaid +graph LR + X["$$X \in \mathbb{R}^{n \times d}$$
$$\text{High-Dimensional Data}$$"] + + X --> S["$$\text{Standardize Data}$$"] + S --> C["$$\Sigma = \frac{1}{n}X^TX$$
$$\text{Covariance Matrix}$$"] + + C --> E["$$\Sigma v = \lambda v$$
$$\text{Eigen Decomposition}$$"] + + E --> PC1["$$\text{PC}_1$$
$$\max \ \text{Var}(Xv_1)$$"] + E --> PC2["$$\text{PC}_2$$
$$\max \ \text{Var}(Xv_2)$$"] + E --> PCk["$$\text{PC}_k$$
$$v_i^\top v_j = 0$$"] + + PC1 --> P1["$$\text{Direction of Maximum Variance}$$"] + PC2 --> P2["$$\text{Orthogonal to PC}_1$$"] + PCk --> P3["$$\text{Explains Remaining Variance}$$"] + + P1 --> R["$$Z = XV_k$$
$$\text{Reduced Representation}$$"] + P2 --> R + P3 --> R + + R --> G["$$\text{Lower-Dimensional Space}$$"] + G --> B["$$\text{Less Noise, Faster Models}$$"] + +``` + +**In this diagram:** + +* We start with high-dimensional data $X$. +* We standardize it and compute the covariance matrix $\Sigma$. +* We perform eigen decomposition to find eigenvalues and eigenvectors. +* We identify the principal components $(PC_1, PC_2, ..., PC_k)$. +* Finally, we project the data onto the new lower-dimensional space $Z$. + +## 2. The Mathematical Foundation + +To perform PCA, we solve for the **Eigenvectors** of the covariance matrix. + +### Step 1: Covariance Matrix ($\Sigma$) + +If we have a standardized matrix $X$, the covariance matrix is calculated as: + +$$ +\Sigma = \frac{1}{n-1} X^T X +$$ +Where: + +* **$\Sigma$**: Covariance Matrix (measures how features vary together). +* **$X^T$**: Transpose of the standardized data matrix. +* **$n$**: Number of samples. + +### Step 2: Eigenvalue Decomposition + +We find the Eigenvectors ($v$) and Eigenvalues ($\lambda$) such that: + +$$ +\Sigma v = \lambda v +$$ + +Where: + +* **Eigenvectors ($v$):** Define the direction of the new axes (Principal Components). +* **Eigenvalues ($\lambda$):** Define the magnitude (how much variance) is captured in that direction. + +## 3. The Explained Variance Ratio + +When you reduce dimensions, you lose some information. We measure this using the **Explained Variance Ratio**. If PC1 explains 70% of the variance and PC2 explains 20%, using both allows you to represent 90% of the original data complexity in just two variables. + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.decomposition import PCA +from sklearn.preprocessing import StandardScaler + +# 1. PCA is extremely sensitive to scale! +scaler = StandardScaler() +X_scaled = scaler.fit_transform(X) + +# 2. Initialize PCA +# n_components can be an integer (2) or a percentage (0.95) +pca = PCA(n_components=2) + +# 3. Fit and Transform the data +X_pca = pca.fit_transform(X_scaled) + +# 4. Check how much information was kept +print(f"Explained Variance: {pca.explained_variance_ratio_}") + +``` + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Removes Noise:** By dropping low-variance components, you remove random fluctuations. | **Loss of Interpretability:** The new "PCs" are combinations of features; they no longer have "real world" names. | +| **Visualization:** Turns 100+ dimensions into a 2D plot you can actually see. | **Linearity:** PCA assumes relationships are linear. It fails on curved structures. | +| **Efficiency:** Speeds up training for other algorithms by reducing feature count. | **Scaling Sensitive:** If you don't scale your data, PCA will focus on the features with the largest units. | + +## 6. When to use PCA? + +1. **High Dimensionality:** When you have too many features and your model is overfitting. +2. **Multicollinearity:** When your features are highly correlated with each other. +3. **Visualization:** When you need to plot high-dimensional clusters on a graph. + +## References for More Details + +* **[Scikit-Learn PCA Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html):** Learning about `IncrementalPCA` for datasets too large for memory. + +--- + +**PCA is amazing for linear structures. But what if your data is twisted or curved? For visualizing complex, non-linear patterns (like the "Swiss Roll"), we use a different tool.** \ No newline at end of file