---
title: Elastic Net Regression
sidebar_label: Elastic Net
description: "Combining L1 and L2 regularization for the ultimate balance in feature selection and model stability."
tags: [machine-learning, supervised-learning, regression, elastic-net, regularization]
---

**Elastic Net** is a regularized regression method that linearly combines the $L1$ and $L2$ penalties of the [Lasso](./lasso) and [Ridge](./ridge) methods.

It was developed to overcome the limitations of Lasso, particularly when dealing with highly correlated features or situations where the number of features exceeds the number of samples.

## 1. The Mathematical Objective

Elastic Net adds both penalties to the loss function. It uses a ratio to determine how much of each penalty to apply.

The cost function is:

$$
Cost = \text{MSE} + \alpha \cdot \rho \sum_{j=1}^{p} |\beta_j| + \frac{\alpha \cdot (1 - \rho)}{2} \sum_{j=1}^{p} \beta_j^2
$$

* **$\alpha$ (Alpha):** The overall regularization strength.
* **$\rho$ (L1 Ratio):** Controls the mix between Lasso and Ridge.
* If $\rho = 1$, it is pure **Lasso**.
* If $\rho = 0$, it is pure **Ridge**.
* If $0 < \rho < 1$, it is a **combination**.
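
As a quick sanity check of the two extremes, the sketch below (synthetic data, an arbitrarily chosen `alpha`) fits `ElasticNet` with `l1_ratio=1.0` alongside a plain `Lasso`; because the $L2$ term vanishes, the two sets of coefficients should coincide. The opposite extreme, `l1_ratio=0`, leaves only the $L2$ term, although in that case you would normally just use `Ridge` directly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data, purely for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

# l1_ratio=1.0 removes the L2 term, so Elastic Net reduces to Lasso
enet_as_lasso = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(np.allclose(enet_as_lasso.coef_, lasso.coef_))  # expected: True
```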

## 2. Why use Elastic Net?

### A. Overcoming Lasso's Limitations
Lasso tends to pick one variable from a group of highly correlated variables and ignore the others. Elastic Net is more likely to keep the whole group in the model (the "grouping effect") thanks to the Ridge component.
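
A minimal sketch of the grouping effect, using two artificially near-duplicate predictors (the values and coefficients below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200

# Two nearly identical (highly correlated) predictors plus one independent one
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # almost an exact copy of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 3 * x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

print("Lasso:      ", lasso.coef_)  # tends to load (almost) all weight on one of x1/x2
print("Elastic Net:", enet.coef_)   # tends to share the weight between x1 and x2
```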

### B. High-Dimensional Data
In cases where the number of features ($p$) is greater than the number of observations ($n$), Lasso can only select at most $n$ variables. Elastic Net can select more than $n$ variables if necessary.
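
The sketch below (random synthetic data, an arbitrary `alpha`) illustrates the idea with $p = 200$ features and only $n = 50$ samples; the exact counts will vary with the data and `alpha`, but Lasso cannot keep more than $n$ features, whereas Elastic Net can:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# More features than samples: p = 200, n = 50
X, y = make_regression(n_samples=50, n_features=200, n_informative=100,
                       noise=5, random_state=0)

lasso = Lasso(alpha=1.0, max_iter=10_000).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.3, max_iter=10_000).fit(X, y)

print("Samples:", X.shape[0])
print("Non-zero coefficients (Lasso):      ", int(np.sum(lasso.coef_ != 0)))  # at most n
print("Non-zero coefficients (Elastic Net):", int(np.sum(enet.coef_ != 0)))   # may exceed n
```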

### C. Maximum Flexibility
Because you can tune the ratio, you can "slide" your model anywhere on the spectrum between Ridge and Lasso to find the exact point that minimizes validation error.

```mermaid
graph LR
subgraph RIDGE["Ridge (L2) Coefficient Path"]
R0["$$\\alpha = 0$$"] --> R1["$$w_1, w_2, w_3$$"]
R1 --> R2["$$\\alpha \\uparrow$$"]
R2 --> R3["$$|w_i| \\downarrow$$ (Smooth Shrinkage)"]
R3 --> R4["$$w_i \\neq 0$$"]
end

subgraph LASSO["Lasso (L1) Coefficient Path"]
L0["$$\\alpha = 0$$"] --> L1["$$w_1, w_2, w_3$$"]
L1 --> L2["$$\\alpha \\uparrow$$"]
L2 --> L3["$$|w_i| \\downarrow$$ (Linear Shrinkage)"]
L3 --> L4["$$w_j = 0$$ for some j"]
end

subgraph ENET["Elastic Net (L1 + L2) Coefficient Path"]
E0["$$\\alpha = 0$$"] --> E1["$$w_1, w_2, w_3$$"]
E1 --> E2["$$\\alpha \\uparrow$$"]
E2 --> E3["$$\\text{Mixed Shrinkage}$$"]
E3 --> E4["$$\\text{Grouped Selection + Stability}$$"]
end

R4 -.->|"$$\\text{No Sparsity}$$"| L4
L4 -.->|"$$\\text{Pure Sparsity}$$"| E4
```

## 3. Key Hyperparameters in Scikit-Learn

* **`alpha`**: Constant that multiplies the penalty terms. High values mean more regularization.
* **`l1_ratio`**: The $\rho$ parameter. Scikit-Learn uses `l1_ratio=0.5` by default, giving equal weight to $L1$ and $L2$.

## 4. Implementation with Scikit-Learn

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# 0. Example data (synthetic, just for illustration -- substitute your own X, y)
X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=42)

# 1. Scaling is mandatory: the penalty depends on coefficient size,
#    so features must be on comparable scales
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Initialize and Train
# l1_ratio=0.5 means 50% Lasso, 50% Ridge
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
model.fit(X_scaled, y)

# 3. View the results
print(f"Coefficients: {model.coef_}")
```

## 5. Decision Matrix: Which one to use?

| Scenario | Recommended Model |
| --- | --- |
| Most features are useful and small. | **Ridge** |
| You suspect only a few features are actually important. | **Lasso** |
| You have many features that are highly correlated with each other. | **Elastic Net** |
| Number of features is much larger than the number of samples ($p \gg n$). | **Elastic Net** |


## 6. Automated Tuning with ElasticNetCV

As with `RidgeCV` and `LassoCV`, Scikit-Learn provides a cross-validated version, `ElasticNetCV`, which tests multiple `alpha` and `l1_ratio` values and picks the best combination for you.

```python
from sklearn.linear_model import ElasticNetCV

# Search for the best alpha and l1_ratio
model_cv = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], cv=5)
model_cv.fit(X_scaled, y)

print(f"Best Alpha: {model_cv.alpha_}")
print(f"Best L1 Ratio: {model_cv.l1_ratio_}")

```

## References for More Details

* **[Scikit-Learn ElasticNet Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html):** Understanding technical parameters like `tol` (tolerance) and `max_iter`.

---

**You've now covered all the primary linear regression models! But what if your goal isn't to predict a number, but to group similar data points together?** Head over to the [Clustering](/tutorial/category/clustering) section to explore techniques like K-Means and DBSCAN!
---
title: "Lasso Regression (L1 Regularization)"
sidebar_label: Lasso Regression
description: "Understanding L1 regularization, sparse models, and automated feature selection."
tags: [machine-learning, supervised-learning, regression, lasso, l1-regularization]
---

**Lasso Regression** (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that uses **L1 Regularization**.

While standard Linear Regression tries to minimize only the error, Lasso adds a penalty equal to the absolute value of the magnitude of the coefficients. This forces the model not only to be accurate but also to stay as simple as possible.

## 1. The Mathematical Objective

Lasso minimizes the following cost function:

$$
Cost = \text{MSE} + \alpha \sum_{j=1}^{p} |\beta_j|
$$

Where:

* **MSE (Mean Squared Error):** Keeps the model accurate.
* **$\alpha$ (Alpha):** The tuning parameter that controls the strength of the penalty.
* **$|\beta_j|$:** The absolute value of the coefficients.

## 2. Feature Selection: The Power of Zero

The most significant difference between Lasso and its sibling, [Ridge Regression](./ridge), is that Lasso can shrink coefficients **exactly to zero**.

When a coefficient becomes zero, that feature is effectively removed from the model. This makes Lasso an excellent tool for:
1. **Automated Feature Selection:** Identifying the most important variables in a dataset with hundreds of features.
2. **Model Interpretability:** Creating "sparse" models that are easier for humans to understand.
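
The mechanics behind those exact zeros can be seen in the simplest case: a single standardized feature, with the loss written as a plain sum of squares (the exact threshold constant depends on the scaling convention). The penalized solution is the OLS estimate pulled toward zero by a fixed amount and clipped at zero, the so-called soft-thresholding operator:

$$
\hat{\beta}^{\text{lasso}} = \operatorname{sign}\!\left(\hat{\beta}^{\text{OLS}}\right) \cdot \max\!\left(|\hat{\beta}^{\text{OLS}}| - \lambda,\; 0\right)
$$

Any coefficient whose OLS magnitude falls below the threshold $\lambda$ (which grows with $\alpha$) lands exactly at zero. This is why Lasso performs feature selection, while Ridge, whose penalty only rescales coefficients, does not.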

```mermaid
graph LR
subgraph RIDGE["Ridge Regression (L2)"]
A1["$$\\alpha = 0$$"] --> B1["$$w_1, w_2, w_3$$ (OLS Coefficients)"]
B1 --> C1["$$\\alpha \\uparrow$$"]
C1 --> D1["$$|w_i| \\downarrow$$ (Smooth Shrinkage)"]
D1 --> E1["$$w_i \\neq 0$$ for all i"]
E1 --> F1["$$\\text{No Feature Elimination}$$"]
end

subgraph LASSO["Lasso Regression (L1)"]
A2["$$\\alpha = 0$$"] --> B2["$$w_1, w_2, w_3$$ (OLS Coefficients)"]
B2 --> C2["$$\\alpha \\uparrow$$"]
C2 --> D2["$$|w_i| \\downarrow$$ (Linear Shrinkage)"]
D2 --> E2["$$w_j = 0$$ for some j"]
E2 --> F2["$$\\text{Automatic Feature Selection}$$"]
end

F1 -.->|"$$\\text{Shrinkage Path Comparison}$$"| F2
```

## 3. Choosing the Alpha ($\alpha$) Parameter

* **If $\alpha = 0$:** The penalty is removed, and the result is standard Ordinary Least Squares (OLS).
* **As $\alpha$ increases:** More coefficients are pushed to zero, leading to a simpler, more biased model.
* **If $\alpha$ is too high:** All coefficients become zero, and the model predicts only the mean (Underfitting).
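
The sketch below (synthetic data with 20 features, only 5 of them informative, and a hand-picked grid of `alpha` values) makes this progression concrete by counting how many coefficients survive at each penalty strength:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which actually matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Count the surviving (non-zero) coefficients as alpha grows
for alpha in [0.01, 0.1, 1, 10, 100]:
    lasso = Lasso(alpha=alpha).fit(X_scaled, y)
    n_nonzero = int(np.sum(lasso.coef_ != 0))
    print(f"alpha={alpha:>6}: {n_nonzero} non-zero coefficients")
```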

## 4. Implementation with Scikit-Learn

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 0. Example data (synthetic, just for illustration -- substitute your own training set)
X_train, y_train = make_regression(n_samples=200, n_features=20, n_informative=5,
                                   noise=10, random_state=42)
feature_names = [f"feature_{i}" for i in range(X_train.shape[1])]

# 1. Scaling is REQUIRED for Lasso
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 2. Initialize and Train
# 'alpha' is the regularization strength
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

# 3. Check which features were selected (non-zero coefficients)
importance = pd.Series(lasso.coef_, index=feature_names)
print(importance[importance != 0])
```

## 5. Lasso vs. Ridge

| Feature | Ridge ($L2$) | Lasso ($L1$) |
| --- | --- | --- |
| **Penalty** | Square of coefficients | Absolute value of coefficients |
| **Coefficients** | Shrink towards zero, but never reach it | Can shrink exactly to **zero** |
| **Use Case** | When most features are useful | When you have many "noisy" or useless features |
| **Model Type** | Dense (all features kept) | Sparse (some features removed) |

## 6. Limitations of Lasso

1. **Correlated Features:** If two features are highly correlated, Lasso tends to pick one somewhat arbitrarily and zero out the other, which can make the selected feature set unstable.
2. **Sample Size:** If the number of features exceeds the number of samples ($p > n$), Lasso can select at most $n$ features.

## References for More Details

* **[Scikit-Learn Lasso Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html):** Exploring `LassoCV`, which automatically finds the best Alpha using cross-validation.
---
title: "Ridge Regression (L2 Regularization)"
sidebar_label: Ridge Regression
description: "Mastering L2 regularization to prevent overfitting and handle multicollinearity in regression models."
tags: [machine-learning, supervised-learning, regression, ridge, l2-regularization]
---

**Ridge Regression** is an extension of linear regression that adds a regularization term to the cost function. It is specifically designed to handle **overfitting** and issues caused by **multicollinearity** (when input features are highly correlated).

## 1. The Mathematical Objective

In standard OLS (Ordinary Least Squares), the model only cares about minimizing the error. In Ridge Regression, we add a "penalty" proportional to the square of the magnitude of the coefficients ($\beta$).

The cost function becomes:

$$
Cost = \text{MSE} + \alpha \sum_{j=1}^{p} \beta_j^2
$$

* **MSE (Mean Squared Error):** The standard loss (prediction error).
* **$\alpha$ (Alpha):** The complexity parameter. It controls how much you want to penalize the size of the coefficients.
* **$\beta_j^2$:** The squared coefficients (the $L2$ penalty term). Squaring keeps the coefficients small, but rarely pushes them to exactly zero.
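
One consequence of squaring the penalty is that (for the plain sum-of-squares form of the loss, ignoring the intercept) Ridge still has a closed-form solution:

$$
\hat{\beta}^{\text{ridge}} = \left(X^\top X + \alpha I\right)^{-1} X^\top y
$$

Adding $\alpha I$ to $X^\top X$ guarantees the matrix is invertible even when features are highly correlated or $p > n$, which is precisely where ordinary OLS becomes unstable.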

## 2. Why use Ridge?

### A. Preventing Overfitting
When a model has too many features or the features are highly correlated, the coefficients ($\beta$) can become very large. This makes the model extremely sensitive to small fluctuations in the training data. Ridge "shrinks" these coefficients, making the model more stable.

### B. Handling Multicollinearity
If two variables are nearly identical (e.g., height in inches and height in centimeters), standard regression might assign one a massive positive weight and the other a massive negative weight. Ridge forces the weights to be distributed more evenly and kept small.
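
A small sketch of this behaviour, using two artificially near-identical columns (the variable names, scales, and noise levels below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two nearly identical predictors: the same quantity in two different units
height_in = rng.normal(loc=70, scale=3, size=n)
height_cm = height_in * 2.54 + rng.normal(scale=0.01, size=n)  # tiny measurement noise
X = np.column_stack([height_in, height_cm])
y = 0.5 * height_in + rng.normal(scale=1.0, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)    # typically large, unstable, opposite-signed weights
print("Ridge coefficients:", ridge.coef_)  # typically small weights shared across both columns
```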

```mermaid
graph LR
subgraph RIDGE["Ridge Regression Coefficient Shrinkage"]
A["$$\\alpha = 0$$"] --> B["$$w_1, w_2, w_3$$ (OLS Coefficients)"]
B --> C["$$\\alpha \\uparrow$$"]
C --> D["$$|w_i| \\downarrow$$ (Shrinkage)"]
D --> E["$$w_i \\to 0$$ (Never Exactly Zero)"]
E --> F["$$\\text{Reduced Model Variance}$$"]
end

subgraph OBJ["Optimization View"]
L["$$\\min \\sum (y - \\hat{y})^2 + \\alpha \\sum w_i^2$$"]
L --> G["$$\\text{Penalty Grows with } \\alpha$$"]
G --> H["$$\\text{Stronger Pull Toward Origin}$$"]
end

C -.->|"$$\\text{Controls Strength}$$"| G
F -.->|"$$\\text{Bias–Variance Tradeoff}$$"| H

```

## 3. The Alpha ($\alpha$) Trade-off

Choosing the right $\alpha$ is a balancing act between **Bias** and **Variance**:

* **$\alpha = 0$:** Equivalent to standard Linear Regression (High variance, Low bias).
* **As $\alpha \to \infty$:** The penalty dominates. Coefficients approach zero, and the model becomes a flat line (Low variance, High bias).
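
The sketch below (synthetic data, an arbitrary grid of `alpha` values) traces this trade-off by printing the overall size of the coefficient vector as `alpha` grows:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Watch the norm of the coefficient vector shrink as alpha grows
for alpha in [0.01, 1, 10, 100, 1000, 10000]:
    ridge = Ridge(alpha=alpha).fit(X_scaled, y)
    print(f"alpha={alpha:>6}: ||w|| = {np.linalg.norm(ridge.coef_):.2f}")
```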

## 4. Implementation with Scikit-Learn

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 0. Example data (synthetic, just for illustration -- substitute your own dataset)
X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 1. Scaling is MANDATORY for Ridge
# Because the penalty is based on the size of the coefficients,
# features with larger scales would otherwise be penalized unfairly.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use the SAME scaler fitted on the training set

# 2. Initialize and Train
# alpha=1.0 is the default
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# 3. Predict
y_pred = ridge.predict(X_test_scaled)
```

## 5. Ridge vs. Lasso: A Summary

| Feature | Ridge Regression ($L2$) | Lasso Regression ($L1$) |
| --- | --- | --- |
| **Penalty Term** | $\alpha \sum_{j=1}^{p} \beta_j^2$ | $\alpha \sum_{j=1}^{p} \vert \beta_j \vert$ |
| **Mathematical Goal** | Minimizes the **square** of the weights. | Minimizes the **absolute value** of the weights. |
| **Coefficient Shrinkage** | Shrinks coefficients asymptotically toward zero, but they rarely reach exactly zero. | Can shrink coefficients **exactly to zero**, effectively removing the feature. |
| **Feature Selection** | **No.** Keeps all predictors in the final model, though some may have tiny weights. | **Yes.** Acts as an automated feature selector by discarding unimportant variables. |
| **Model Complexity** | Produces a **dense** model (uses all features). | Produces a **sparse** model (uses a subset of features). |
| **Ideal Scenario** | When you have many features that all contribute a small amount to the output. | When you have many features, but only a few are actually significant. |
| **Handling Correlated Data** | Very stable; handles multicollinearity by distributing weights across correlated features. | Less stable; if features are highly correlated, it may randomly pick one and zero out the others. |

```mermaid
graph LR
subgraph L2["L2 Regularization Constraint (Ridge)"]
O1["$$w_1^2 + w_2^2 \leq t$$"] --> C1["$$\text{Circle (L2 Ball)}$$"]
C1 --> E1["$$\text{Smooth Boundary}$$"]
E1 --> S1["$$\text{Rarely touches axes}$$"]
S1 --> R1["$$w_1, w_2 \neq 0$$"]
end

subgraph L1["L1 Regularization Constraint (Lasso)"]
O2["$$|w_1| + |w_2| \leq t$$"] --> D1["$$\text{Diamond (L1 Ball)}$$"]
D1 --> C2["$$\text{Sharp Corners}$$"]
C2 --> A1["$$\text{Corners lie on axes}$$"]
A1 --> Z1["$$w_1 = 0 \ \text{or}\ w_2 = 0$$"]
end

R1 -.->|"$$\text{Geometry Explains Behavior}$$"| Z1
```

## 6. RidgeCV: Finding the Best Alpha

Finding the perfect $\alpha$ manually is tedious. Scikit-Learn provides `RidgeCV`, which uses built-in cross-validation to find the optimal alpha for your specific dataset automatically.

```python
from sklearn.linear_model import RidgeCV

# Define a list of alphas to test
alphas = [0.1, 1.0, 10.0, 100.0]

# RidgeCV finds the best one automatically
ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train_scaled, y_train)

print(f"Best Alpha: {ridge_cv.alpha_}")

```

## References for More Details

* **[Scikit-Learn: Linear Models](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression):** Technical details on the solvers used (like 'cholesky' or 'sag').