---
title: Elastic Net Regression
sidebar_label: Elastic Net
description: "Combining L1 and L2 regularization for the ultimate balance in feature selection and model stability."
tags: [machine-learning, supervised-learning, regression, elastic-net, regularization]
---

**Elastic Net** is a regularized regression method that linearly combines the $L1$ and $L2$ penalties of the [Lasso](./lasso) and [Ridge](./ridge) methods.

It was developed to overcome the limitations of Lasso, particularly when dealing with highly correlated features or situations where the number of features exceeds the number of samples.

## 1. The Mathematical Objective

Elastic Net adds both penalties to the loss function. It uses a ratio to determine how much of each penalty to apply.

The cost function is:

$$
Cost = \text{MSE} + \alpha \cdot \rho \sum_{j=1}^{p} |\beta_j| + \frac{\alpha \cdot (1 - \rho)}{2} \sum_{j=1}^{p} \beta_j^2
$$

* **$\alpha$ (Alpha):** The overall regularization strength.
* **$\rho$ (L1 Ratio):** Controls the mix between Lasso and Ridge.
* If $\rho = 1$, it is pure **Lasso**.
* If $\rho = 0$, it is pure **Ridge**.
* If $0 < \rho < 1$, it is a **combination**.
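
As a quick sanity check of the two extremes, the sketch below (synthetic data, an arbitrarily chosen `alpha`) fits `ElasticNet` with `l1_ratio=1.0` alongside a plain `Lasso`; because the $L2$ term vanishes, the two sets of coefficients should coincide. The opposite extreme, `l1_ratio=0`, leaves only the $L2$ term, although in that case you would normally just use `Ridge` directly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data, purely for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

# l1_ratio=1.0 removes the L2 term, so Elastic Net reduces to Lasso
enet_as_lasso = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(np.allclose(enet_as_lasso.coef_, lasso.coef_))  # expected: True
```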

## 2. Why use Elastic Net?

### A. Overcoming Lasso's Limitations
Lasso tends to pick one variable from a group of highly correlated variables and ignore the others. Elastic Net is more likely to keep the whole group in the model (the "grouping effect") thanks to the Ridge component.
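
A minimal sketch of the grouping effect, using two artificially near-duplicate predictors (the values and coefficients below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200

# Two nearly identical (highly correlated) predictors plus one independent one
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # almost an exact copy of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 3 * x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

print("Lasso:      ", lasso.coef_)  # tends to load (almost) all weight on one of x1/x2
print("Elastic Net:", enet.coef_)   # tends to share the weight between x1 and x2
```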

### B. High-Dimensional Data
In cases where the number of features ($p$) is greater than the number of observations ($n$), Lasso can only select at most $n$ variables. Elastic Net can select more than $n$ variables if necessary.
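
The sketch below (random synthetic data, an arbitrary `alpha`) illustrates the idea with $p = 200$ features and only $n = 50$ samples; the exact counts will vary with the data and `alpha`, but Lasso cannot keep more than $n$ features, whereas Elastic Net can:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# More features than samples: p = 200, n = 50
X, y = make_regression(n_samples=50, n_features=200, n_informative=100,
                       noise=5, random_state=0)

lasso = Lasso(alpha=1.0, max_iter=10_000).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.3, max_iter=10_000).fit(X, y)

print("Samples:", X.shape[0])
print("Non-zero coefficients (Lasso):      ", int(np.sum(lasso.coef_ != 0)))  # at most n
print("Non-zero coefficients (Elastic Net):", int(np.sum(enet.coef_ != 0)))   # may exceed n
```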

### C. Maximum Flexibility
Because you can tune the ratio, you can "slide" your model anywhere on the spectrum between Ridge and Lasso to find the exact point that minimizes validation error.

```mermaid
graph LR
subgraph RIDGE["Ridge (L2) Coefficient Path"]
R0["$$\\alpha = 0$$"] --> R1["$$w_1, w_2, w_3$$"]
R1 --> R2["$$\\alpha \\uparrow$$"]
R2 --> R3["$$|w_i| \\downarrow$$ (Smooth Shrinkage)"]
R3 --> R4["$$w_i \\neq 0$$"]
end

subgraph LASSO["Lasso (L1) Coefficient Path"]
L0["$$\\alpha = 0$$"] --> L1["$$w_1, w_2, w_3$$"]
L1 --> L2["$$\\alpha \\uparrow$$"]
L2 --> L3["$$|w_i| \\downarrow$$ (Linear Shrinkage)"]
L3 --> L4["$$w_j = 0$$ for some j"]
end

subgraph ENET["Elastic Net (L1 + L2) Coefficient Path"]
E0["$$\\alpha = 0$$"] --> E1["$$w_1, w_2, w_3$$"]
E1 --> E2["$$\\alpha \\uparrow$$"]
E2 --> E3["$$\\text{Mixed Shrinkage}$$"]
E3 --> E4["$$\\text{Grouped Selection + Stability}$$"]
end

R4 -.->|"$$\\text{No Sparsity}$$"| L4
L4 -.->|"$$\\text{Pure Sparsity}$$"| E4
```

## 3. Key Hyperparameters in Scikit-Learn

* **`alpha`**: Constant that multiplies the penalty terms. High values mean more regularization.
* **`l1_ratio`**: The $\rho$ parameter. Scikit-Learn uses `l1_ratio=0.5` by default, giving equal weight to $L1$ and $L2$.

## 4. Implementation with Scikit-Learn

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# 0. Example data (synthetic, just for illustration -- substitute your own X, y)
X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=42)

# 1. Scaling is mandatory: the penalty depends on coefficient size,
#    so features must be on comparable scales
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Initialize and Train
# l1_ratio=0.5 means 50% Lasso, 50% Ridge
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
model.fit(X_scaled, y)

# 3. View the results
print(f"Coefficients: {model.coef_}")
```

## 5. Decision Matrix: Which one to use?

| Scenario | Recommended Model |
| --- | --- |
| Most features are useful and small. | **Ridge** |
| You suspect only a few features are actually important. | **Lasso** |
| You have many features that are highly correlated with each other. | **Elastic Net** |
| Number of features is much larger than the number of samples ($p \gg n$). | **Elastic Net** |


## 6. Automated Tuning with ElasticNetCV

As with `RidgeCV` and `LassoCV`, Scikit-Learn provides a cross-validated version, `ElasticNetCV`, which tests multiple `alpha` and `l1_ratio` values and picks the best combination for you.

```python
from sklearn.linear_model import ElasticNetCV

# Search for the best alpha and l1_ratio
model_cv = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], cv=5)
model_cv.fit(X_scaled, y)

print(f"Best Alpha: {model_cv.alpha_}")
print(f"Best L1 Ratio: {model_cv.l1_ratio_}")

```

## References for More Details

* **[Scikit-Learn ElasticNet Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html):** Understanding technical parameters like `tol` (tolerance) and `max_iter`.

---

**You've now covered all the primary linear regression models! But what if your goal isn't to predict a number, but to group similar data points together?** Head over to the [Clustering](/tutorial/category/clustering) section to explore techniques like K-Means and DBSCAN!
---
title: "Lasso Regression (L1 Regularization)"
sidebar_label: Lasso Regression
description: "Understanding L1 regularization, sparse models, and automated feature selection."
tags: [machine-learning, supervised-learning, regression, lasso, l1-regularization]
---

**Lasso Regression** (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that uses **L1 Regularization**.

While standard Linear Regression tries to minimize only the error, Lasso adds a penalty equal to the absolute value of the magnitude of the coefficients. This forces the model not only to be accurate but also to stay as simple as possible.

## 1. The Mathematical Objective

Lasso minimizes the following cost function:

$$
Cost = \text{MSE} + \alpha \sum_{j=1}^{p} |\beta_j|
$$

Where:

* **MSE (Mean Squared Error):** Keeps the model accurate.
* **$\alpha$ (Alpha):** The tuning parameter that controls the strength of the penalty.
* **$|\beta_j|$:** The absolute value of the coefficients.

## 2. Feature Selection: The Power of Zero

The most significant difference between Lasso and its sibling, [Ridge Regression](./ridge), is that Lasso can shrink coefficients **exactly to zero**.

When a coefficient becomes zero, that feature is effectively removed from the model. This makes Lasso an excellent tool for:
1. **Automated Feature Selection:** Identifying the most important variables in a dataset with hundreds of features.
2. **Model Interpretability:** Creating "sparse" models that are easier for humans to understand.
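
The mechanics behind those exact zeros can be seen in the simplest case: a single standardized feature, with the loss written as a plain sum of squares (the exact threshold constant depends on the scaling convention). The penalized solution is the OLS estimate pulled toward zero by a fixed amount and clipped at zero, the so-called soft-thresholding operator:

$$
\hat{\beta}^{\text{lasso}} = \operatorname{sign}\!\left(\hat{\beta}^{\text{OLS}}\right) \cdot \max\!\left(|\hat{\beta}^{\text{OLS}}| - \lambda,\; 0\right)
$$

Any coefficient whose OLS magnitude falls below the threshold $\lambda$ (which grows with $\alpha$) lands exactly at zero. This is why Lasso performs feature selection, while Ridge, whose penalty only rescales coefficients, does not.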

```mermaid
graph LR
subgraph RIDGE["Ridge Regression (L2)"]
A1["$$\\alpha = 0$$"] --> B1["$$w_1, w_2, w_3$$ (OLS Coefficients)"]
B1 --> C1["$$\\alpha \\uparrow$$"]
C1 --> D1["$$|w_i| \\downarrow$$ (Smooth Shrinkage)"]
D1 --> E1["$$w_i \\neq 0$$ for all i"]
E1 --> F1["$$\\text{No Feature Elimination}$$"]
end

subgraph LASSO["Lasso Regression (L1)"]
A2["$$\\alpha = 0$$"] --> B2["$$w_1, w_2, w_3$$ (OLS Coefficients)"]
B2 --> C2["$$\\alpha \\uparrow$$"]
C2 --> D2["$$|w_i| \\downarrow$$ (Linear Shrinkage)"]
D2 --> E2["$$w_j = 0$$ for some j"]
E2 --> F2["$$\\text{Automatic Feature Selection}$$"]
end

F1 -.->|"$$\\text{Shrinkage Path Comparison}$$"| F2
```

## 3. Choosing the Alpha ($\alpha$) Parameter

* **If $\alpha = 0$:** The penalty is removed, and the result is standard Ordinary Least Squares (OLS).
* **As $\alpha$ increases:** More coefficients are pushed to zero, leading to a simpler, more biased model.
* **If $\alpha$ is too high:** All coefficients become zero, and the model predicts only the mean (Underfitting).
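
The sketch below (synthetic data with 20 features, only 5 of them informative, and a hand-picked grid of `alpha` values) makes this progression concrete by counting how many coefficients survive at each penalty strength:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which actually matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Count the surviving (non-zero) coefficients as alpha grows
for alpha in [0.01, 0.1, 1, 10, 100]:
    lasso = Lasso(alpha=alpha).fit(X_scaled, y)
    n_nonzero = int(np.sum(lasso.coef_ != 0))
    print(f"alpha={alpha:>6}: {n_nonzero} non-zero coefficients")
```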

## 4. Implementation with Scikit-Learn

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 0. Example data (synthetic, just for illustration -- substitute your own training set)
X_train, y_train = make_regression(n_samples=200, n_features=20, n_informative=5,
                                   noise=10, random_state=42)
feature_names = [f"feature_{i}" for i in range(X_train.shape[1])]

# 1. Scaling is REQUIRED for Lasso
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 2. Initialize and Train
# 'alpha' is the regularization strength
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

# 3. Check which features were selected (non-zero coefficients)
importance = pd.Series(lasso.coef_, index=feature_names)
print(importance[importance != 0])
```

## 5. Lasso vs. Ridge

| Feature | Ridge ($L2$) | Lasso ($L1$) |
| --- | --- | --- |
| **Penalty** | Square of coefficients | Absolute value of coefficients |
| **Coefficients** | Shrink towards zero, but never reach it | Can shrink exactly to **zero** |
| **Use Case** | When most features are useful | When you have many "noisy" or useless features |
| **Model Type** | Dense (all features kept) | Sparse (some features removed) |

## 6. Limitations of Lasso

1. **Correlated Features:** If two features are highly correlated, Lasso tends to pick one somewhat arbitrarily and zero out the other, which can make the selected feature set unstable.
2. **Sample Size:** If the number of features exceeds the number of samples ($p > n$), Lasso can select at most $n$ features.

## References for More Details

* **[Scikit-Learn Lasso Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html):** Exploring `LassoCV`, which automatically finds the best Alpha using cross-validation.
---
title: "Ridge Regression (L2 Regularization)"
sidebar_label: Ridge Regression
description: "Mastering L2 regularization to prevent overfitting and handle multicollinearity in regression models."
tags: [machine-learning, supervised-learning, regression, ridge, l2-regularization]
---

**Ridge Regression** is an extension of linear regression that adds a regularization term to the cost function. It is specifically designed to handle **overfitting** and issues caused by **multicollinearity** (when input features are highly correlated).

## 1. The Mathematical Objective

In standard OLS (Ordinary Least Squares), the model only cares about minimizing the error. In Ridge Regression, we add a "penalty" proportional to the square of the magnitude of the coefficients ($\beta$).

The cost function becomes:

$$
Cost = \text{MSE} + \alpha \sum_{j=1}^{p} \beta_j^2
$$

* **MSE (Mean Squared Error):** The standard loss (prediction error).
* **$\alpha$ (Alpha):** The complexity parameter. It controls how much you want to penalize the size of the coefficients.
* **$\beta_j^2$:** The squared coefficients (the $L2$ penalty term). Squaring keeps the coefficients small, but rarely pushes them to exactly zero.
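
One consequence of squaring the penalty is that (for the plain sum-of-squares form of the loss, ignoring the intercept) Ridge still has a closed-form solution:

$$
\hat{\beta}^{\text{ridge}} = \left(X^\top X + \alpha I\right)^{-1} X^\top y
$$

Adding $\alpha I$ to $X^\top X$ guarantees the matrix is invertible even when features are highly correlated or $p > n$, which is precisely where ordinary OLS becomes unstable.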

## 2. Why use Ridge?

### A. Preventing Overfitting
When a model has too many features or the features are highly correlated, the coefficients ($\beta$) can become very large. This makes the model extremely sensitive to small fluctuations in the training data. Ridge "shrinks" these coefficients, making the model more stable.

### B. Handling Multicollinearity
If two variables are nearly identical (e.g., height in inches and height in centimeters), standard regression might assign one a massive positive weight and the other a massive negative weight. Ridge forces the weights to be distributed more evenly and kept small.
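
A small sketch of this behaviour, using two artificially near-identical columns (the variable names, scales, and noise levels below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two nearly identical predictors: the same quantity in two different units
height_in = rng.normal(loc=70, scale=3, size=n)
height_cm = height_in * 2.54 + rng.normal(scale=0.01, size=n)  # tiny measurement noise
X = np.column_stack([height_in, height_cm])
y = 0.5 * height_in + rng.normal(scale=1.0, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)    # typically large, unstable, opposite-signed weights
print("Ridge coefficients:", ridge.coef_)  # typically small weights shared across both columns
```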

```mermaid
graph LR
subgraph RIDGE["Ridge Regression Coefficient Shrinkage"]
A["$$\\alpha = 0$$"] --> B["$$w_1, w_2, w_3$$ (OLS Coefficients)"]
B --> C["$$\\alpha \\uparrow$$"]
C --> D["$$|w_i| \\downarrow$$ (Shrinkage)"]
D --> E["$$w_i \\to 0$$ (Never Exactly Zero)"]
E --> F["$$\\text{Reduced Model Variance}$$"]
end

subgraph OBJ["Optimization View"]
L["$$\\min \\sum (y - \\hat{y})^2 + \\alpha \\sum w_i^2$$"]
L --> G["$$\\text{Penalty Grows with } \\alpha$$"]
G --> H["$$\\text{Stronger Pull Toward Origin}$$"]
end

C -.->|"$$\\text{Controls Strength}$$"| G
F -.->|"$$\\text{Bias–Variance Tradeoff}$$"| H

```

## 3. The Alpha ($\alpha$) Trade-off

Choosing the right $\alpha$ is a balancing act between **Bias** and **Variance**:

* **$\alpha = 0$:** Equivalent to standard Linear Regression (High variance, Low bias).
* **As $\alpha \to \infty$:** The penalty dominates. Coefficients approach zero, and the model becomes a flat line (Low variance, High bias).
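
The sketch below (synthetic data, an arbitrary grid of `alpha` values) traces this trade-off by printing the overall size of the coefficient vector as `alpha` grows:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Watch the norm of the coefficient vector shrink as alpha grows
for alpha in [0.01, 1, 10, 100, 1000, 10000]:
    ridge = Ridge(alpha=alpha).fit(X_scaled, y)
    print(f"alpha={alpha:>6}: ||w|| = {np.linalg.norm(ridge.coef_):.2f}")
```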

## 4. Implementation with Scikit-Learn

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 0. Example data (synthetic, just for illustration -- substitute your own dataset)
X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 1. Scaling is MANDATORY for Ridge
# Because the penalty is based on the size of the coefficients,
# features with larger scales would otherwise be penalized unfairly.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use the SAME scaler fitted on the training set

# 2. Initialize and Train
# alpha=1.0 is the default
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# 3. Predict
y_pred = ridge.predict(X_test_scaled)
```

## 5. Ridge vs. Lasso: A Summary

| Feature | Ridge Regression ($L2$) | Lasso Regression ($L1$) |
| --- | --- | --- |
| **Penalty Term** | $\alpha \sum_{j=1}^{p} \beta_j^2$ | $\alpha \sum_{j=1}^{p} \vert \beta_j \vert$ |
| **Mathematical Goal** | Minimizes the **square** of the weights. | Minimizes the **absolute value** of the weights. |
| **Coefficient Shrinkage** | Shrinks coefficients asymptotically toward zero, but they rarely reach exactly zero. | Can shrink coefficients **exactly to zero**, effectively removing the feature. |
| **Feature Selection** | **No.** Keeps all predictors in the final model, though some may have tiny weights. | **Yes.** Acts as an automated feature selector by discarding unimportant variables. |
| **Model Complexity** | Produces a **dense** model (uses all features). | Produces a **sparse** model (uses a subset of features). |
| **Ideal Scenario** | When you have many features that all contribute a small amount to the output. | When you have many features, but only a few are actually significant. |
| **Handling Correlated Data** | Very stable; handles multicollinearity by distributing weights across correlated features. | Less stable; if features are highly correlated, it may randomly pick one and zero out the others. |

```mermaid
graph LR
subgraph L2["L2 Regularization Constraint (Ridge)"]
O1["$$w_1^2 + w_2^2 \leq t$$"] --> C1["$$\text{Circle (L2 Ball)}$$"]
C1 --> E1["$$\text{Smooth Boundary}$$"]
E1 --> S1["$$\text{Rarely touches axes}$$"]
S1 --> R1["$$w_1, w_2 \neq 0$$"]
end

subgraph L1["L1 Regularization Constraint (Lasso)"]
O2["$$|w_1| + |w_2| \leq t$$"] --> D1["$$\text{Diamond (L1 Ball)}$$"]
D1 --> C2["$$\text{Sharp Corners}$$"]
C2 --> A1["$$\text{Corners lie on axes}$$"]
A1 --> Z1["$$w_1 = 0 \ \text{or}\ w_2 = 0$$"]
end

R1 -.->|"$$\text{Geometry Explains Behavior}$$"| Z1
```

## 6. RidgeCV: Finding the Best Alpha

Finding the perfect $\alpha$ manually is tedious. Scikit-Learn provides `RidgeCV`, which uses built-in cross-validation to find the optimal alpha for your specific dataset automatically.

```python
from sklearn.linear_model import RidgeCV

# Define a list of alphas to test
alphas = [0.1, 1.0, 10.0, 100.0]

# RidgeCV finds the best one automatically
ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train_scaled, y_train)

print(f"Best Alpha: {ridge_cv.alpha_}")

```

## References for More Details

* **[Scikit-Learn: Linear Models](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression):** Technical details on the solvers used (like 'cholesky' or 'sag').