From 4eb31551202e369d71a03be7648df7909b7f3f61 Mon Sep 17 00:00:00 2001 From: Ajay Dhangar Date: Fri, 2 Jan 2026 21:08:28 +0530 Subject: [PATCH 1/2] done supervised learn... --- .../regression/elastic-net.mdx | 123 ++++++++++++++++ .../supervised-learning/regression/lasso.mdx | 99 +++++++++++++ .../supervised-learning/regression/ridge.mdx | 133 ++++++++++++++++++ 3 files changed, 355 insertions(+) diff --git a/docs/machine-learning/machine-learning-core/supervised-learning/regression/elastic-net.mdx b/docs/machine-learning/machine-learning-core/supervised-learning/regression/elastic-net.mdx index e69de29..0f96e7f 100644 --- a/docs/machine-learning/machine-learning-core/supervised-learning/regression/elastic-net.mdx +++ b/docs/machine-learning/machine-learning-core/supervised-learning/regression/elastic-net.mdx @@ -0,0 +1,123 @@ +--- +title: Elastic Net Regression +sidebar_label: Elastic Net +description: "Combining L1 and L2 regularization for the ultimate balance in feature selection and model stability." +tags: [machine-learning, supervised-learning, regression, elastic-net, regularization] +--- + +**Elastic Net** is a regularized regression method that linearly combines the $L1$ and $L2$ penalties of the [Lasso](./lasso) and [Ridge](./ridge) methods. + +It was developed to overcome the limitations of Lasso, particularly when dealing with highly correlated features or situations where the number of features exceeds the number of samples. + +## 1. The Mathematical Objective + +Elastic Net adds both penalties to the loss function. It uses a ratio to determine how much of each penalty to apply. + +The cost function is: + +$$ +Cost = \text{MSE} + \alpha \cdot \rho \sum_{j=1}^{p} |\beta_j| + \frac{\alpha \cdot (1 - \rho)}{2} \sum_{j=1}^{p} \beta_j^2 +$$ + +* **$\alpha$ (Alpha):** The overall regularization strength. +* **$\rho$ (L1 Ratio):** Controls the mix between Lasso and Ridge. + * If $\rho = 1$, it is pure **Lasso**. + * If $\rho = 0$, it is pure **Ridge**. + * If $0 < \rho < 1$, it is a **combination**. + +## 2. Why use Elastic Net? + +### A. Overcoming Lasso's Limitations +Lasso tends to pick one variable from a group of highly correlated variables and ignore the others. Elastic Net is more likely to keep the whole group in the model (the "grouping effect") thanks to the Ridge component. + +### B. High-Dimensional Data +In cases where the number of features ($p$) is greater than the number of observations ($n$), Lasso can only select at most $n$ variables. Elastic Net can select more than $n$ variables if necessary. + +### C. Maximum Flexibility +Because you can tune the ratio, you can "slide" your model anywhere on the spectrum between Ridge and Lasso to find the exact point that minimizes validation error. 
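To see how the mixing works in practice, here is a minimal sketch (plain NumPy, with made-up coefficient values and $\alpha = 1$ chosen arbitrarily) that evaluates just the penalty term from the cost function above for a few values of $\rho$:

```python
import numpy as np

# Hypothetical coefficients and settings, purely for illustration
beta = np.array([0.5, -2.0, 0.0, 3.0])
alpha = 1.0

for rho in [0.0, 0.5, 1.0]:
    l1_term = alpha * rho * np.sum(np.abs(beta))         # Lasso part
    l2_term = alpha * (1 - rho) / 2 * np.sum(beta ** 2)  # Ridge part
    print(f"rho={rho:.1f}  L1={l1_term:.2f}  L2={l2_term:.2f}  penalty={l1_term + l2_term:.2f}")
```

At $\rho = 0$ the entire penalty comes from the Ridge term, at $\rho = 1$ it comes from the Lasso term, and intermediate values blend the two — exactly the knob the diagram below illustrates.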
+
+```mermaid
+graph LR
+    subgraph RIDGE["Ridge (L2) Coefficient Path"]
+    R0["$$\\alpha = 0$$"] --> R1["$$w_1, w_2, w_3$$"]
+    R1 --> R2["$$\\alpha \\uparrow$$"]
+    R2 --> R3["$$|w_i| \\downarrow$$ (Smooth Shrinkage)"]
+    R3 --> R4["$$w_i \\neq 0$$"]
+    end
+
+    subgraph LASSO["Lasso (L1) Coefficient Path"]
+    L0["$$\\alpha = 0$$"] --> L1["$$w_1, w_2, w_3$$"]
+    L1 --> L2["$$\\alpha \\uparrow$$"]
+    L2 --> L3["$$|w_i| \\downarrow$$ (Linear Shrinkage)"]
+    L3 --> L4["$$w_j = 0$$ for some j"]
+    end
+
+    subgraph ENET["Elastic Net (L1 + L2) Coefficient Path"]
+    E0["$$\\alpha = 0$$"] --> E1["$$w_1, w_2, w_3$$"]
+    E1 --> E2["$$\\alpha \\uparrow$$"]
+    E2 --> E3["$$\\text{Mixed Shrinkage}$$"]
+    E3 --> E4["$$\\text{Grouped Selection + Stability}$$"]
+    end
+
+    R4 -.->|"$$\\text{No Sparsity}$$"| L4
+    L4 -.->|"$$\\text{Pure Sparsity}$$"| E4
+```
+
+## 3. Key Hyperparameters in Scikit-Learn
+
+* **`alpha`**: Constant that multiplies the penalty terms. Higher values mean stronger regularization.
+* **`l1_ratio`**: The $\rho$ parameter. Scikit-Learn uses `l1_ratio=0.5` by default, giving equal weight to $L1$ and $L2$.
+
+## 4. Implementation with Scikit-Learn
+
+```python
+from sklearn.linear_model import ElasticNet
+from sklearn.preprocessing import StandardScaler
+
+# 1. Scaling is mandatory (the penalty is scale-sensitive)
+# X and y are assumed to be your feature matrix and target
+scaler = StandardScaler()
+X_scaled = scaler.fit_transform(X)
+
+# 2. Initialize and Train
+# l1_ratio=0.5 means 50% Lasso, 50% Ridge
+model = ElasticNet(alpha=1.0, l1_ratio=0.5)
+model.fit(X_scaled, y)
+
+# 3. View the results
+print(f"Coefficients: {model.coef_}")
+```
+
+## 5. Decision Matrix: Which one to use?
+
+| Scenario | Recommended Model |
+| --- | --- |
+| Most features are useful, each with a small effect. | **Ridge** |
+| You suspect only a few features are actually important. | **Lasso** |
+| You have many features that are highly correlated with each other. | **Elastic Net** |
+| Number of features is much larger than the number of samples ($p \gg n$). | **Elastic Net** |
+
+(A short comparison sketch at the end of this page shows these behaviors on synthetic data.)
+
+## 6. Automated Tuning with ElasticNetCV
+
+As with Ridge and Lasso, Scikit-Learn provides a cross-validation version, `ElasticNetCV`, which tests multiple `alpha` and `l1_ratio` values and picks the best combination for you.
+
+```python
+from sklearn.linear_model import ElasticNetCV
+
+# Search for the best alpha and l1_ratio
+model_cv = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], cv=5)
+model_cv.fit(X_scaled, y)
+
+print(f"Best Alpha: {model_cv.alpha_}")
+print(f"Best L1 Ratio: {model_cv.l1_ratio_}")
+```
+
+## References for More Details
+
+* **[Scikit-Learn ElasticNet Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html):** Understanding technical parameters like `tol` (tolerance) and `max_iter`.
+
+---
+
+**You've now covered all the primary linear regression models! But what if your goal isn't to predict a number, but to group similar data points together?** Head over to the [Clustering](/tutorial/category/clustering) section to explore techniques like K-Means and DBSCAN!
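As a closing illustration of the decision matrix above, here is a small, self-contained sketch on synthetic data (all dataset parameters are arbitrary choices) that fits Ridge, Lasso, and Elastic Net on a block of highly correlated features and counts how many coefficients each model keeps:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 independent features plus 5 near-duplicates of the first one
rng = np.random.default_rng(42)
n_samples = 100
X_base = rng.normal(size=(n_samples, 5))
X_dupes = X_base[:, [0]] + 0.01 * rng.normal(size=(n_samples, 5))  # correlated block
X = np.hstack([X_base, X_dupes])
y = X_base @ np.array([3.0, -2.0, 0.0, 0.0, 1.5]) + rng.normal(scale=0.5, size=n_samples)

X_scaled = StandardScaler().fit_transform(X)

for name, model in [("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1)),
                    ("ElasticNet", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    model.fit(X_scaled, y)
    print(f"{name:>10}: {np.sum(model.coef_ != 0)} non-zero coefficients out of {X.shape[1]}")
```

On a typical run, Ridge keeps every coefficient, Lasso zeroes out most of the correlated block, and Elastic Net lands somewhere in between — the grouping effect described earlier.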
\ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/supervised-learning/regression/lasso.mdx b/docs/machine-learning/machine-learning-core/supervised-learning/regression/lasso.mdx index e69de29..8ab7e78 100644 --- a/docs/machine-learning/machine-learning-core/supervised-learning/regression/lasso.mdx +++ b/docs/machine-learning/machine-learning-core/supervised-learning/regression/lasso.mdx @@ -0,0 +1,99 @@ +--- +title: "Lasso Regression (L1 Regularization)" +sidebar_label: Lasso Regression +description: "Understanding L1 regularization, sparse models, and automated feature selection." +tags: [machine-learning, supervised-learning, regression, lasso, l1-regularization] +--- + +**Lasso Regression** (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that uses **L1 Regularization**. + +While standard Linear Regression tries to minimize only the error, Lasso adds a penalty equal to the absolute value of the magnitude of the coefficients. This forces the model to not only be accurate but also as simple as possible. + +## 1. The Mathematical Objective + +Lasso minimizes the following cost function: + +$$ +Cost = \text{MSE} + \alpha \sum_{j=1}^{p} |\beta_j| +$$ + +Where: + +* **MSE (Mean Squared Error):** Keeps the model accurate. +* **$\alpha$ (Alpha):** The tuning parameter that controls the strength of the penalty. +* **$|\beta_j|$:** The absolute value of the coefficients. + +## 2. Feature Selection: The Power of Zero + +The most significant difference between Lasso and its sibling, [Ridge Regression](/tutorial/machine-learning/supervised-learning/regression/ridge), is that Lasso can shrink coefficients **exactly to zero**. + +When a coefficient becomes zero, that feature is effectively removed from the model. This makes Lasso an excellent tool for: +1. **Automated Feature Selection:** Identifying the most important variables in a dataset with hundreds of features. +2. **Model Interpretability:** Creating "sparse" models that are easier for humans to understand. + +```mermaid +graph LR + subgraph RIDGE["Ridge Regression (L2)"] + A1["$$\\alpha = 0$$"] --> B1["$$w_1, w_2, w_3$$ (OLS Coefficients)"] + B1 --> C1["$$\\alpha \\uparrow$$"] + C1 --> D1["$$|w_i| \\downarrow$$ (Smooth Shrinkage)"] + D1 --> E1["$$w_i \\neq 0$$ for all i"] + E1 --> F1["$$\\text{No Feature Elimination}$$"] + end + + subgraph LASSO["Lasso Regression (L1)"] + A2["$$\\alpha = 0$$"] --> B2["$$w_1, w_2, w_3$$ (OLS Coefficients)"] + B2 --> C2["$$\\alpha \\uparrow$$"] + C2 --> D2["$$|w_i| \\downarrow$$ (Linear Shrinkage)"] + D2 --> E2["$$w_j = 0$$ for some j"] + E2 --> F2["$$\\text{Automatic Feature Selection}$$"] + end + + F1 -.->|"$$\\text{Shrinkage Path Comparison}$$"| F2 +``` + +## 3. Choosing the Alpha ($\alpha$) Parameter + +* **If $\alpha = 0$:** The penalty is removed, and the result is standard Ordinary Least Squares (OLS). +* **As $\alpha$ increases:** More coefficients are pushed to zero, leading to a simpler, more biased model. +* **If $\alpha$ is too high:** All coefficients become zero, and the model predicts only the mean (Underfitting). + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.linear_model import Lasso +from sklearn.preprocessing import StandardScaler + +# 1. Scaling is REQUIRED for Lasso +scaler = StandardScaler() +X_train_scaled = scaler.fit_transform(X_train) + +# 2. Initialize and Train +# 'alpha' is the regularization strength +lasso = Lasso(alpha=0.1) +lasso.fit(X_train_scaled, y_train) + +# 3. 
Check which features were selected (non-zero)
+import pandas as pd
+importance = pd.Series(lasso.coef_, index=feature_names)
+print(importance[importance != 0])
+```
+
+## 5. Lasso vs. Ridge
+
+| Feature | Ridge ($L2$) | Lasso ($L1$) |
+| --- | --- | --- |
+| **Penalty** | Square of coefficients | Absolute value of coefficients |
+| **Coefficients** | Shrink towards zero, but never reach it | Can shrink exactly to **zero** |
+| **Use Case** | When most features are useful | When you have many "noisy" or useless features |
+| **Model Type** | Dense (all features kept) | Sparse (some features removed) |
+
+## 6. Limitations of Lasso
+
+1. **Correlated Features:** If two features are highly correlated, Lasso tends to arbitrarily pick one and discard the other, which can lead to instability.
+2. **Sample Size:** If $p > n$ (more features than samples), Lasso can select at most $n$ features.
+
+## References for More Details
+
+* **[Scikit-Learn Lasso Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html):** Exploring `LassoCV`, which automatically finds the best Alpha using cross-validation.
\ No newline at end of file
diff --git a/docs/machine-learning/machine-learning-core/supervised-learning/regression/ridge.mdx b/docs/machine-learning/machine-learning-core/supervised-learning/regression/ridge.mdx
index e69de29..51687cf 100644
--- a/docs/machine-learning/machine-learning-core/supervised-learning/regression/ridge.mdx
+++ b/docs/machine-learning/machine-learning-core/supervised-learning/regression/ridge.mdx
@@ -0,0 +1,133 @@
+---
+title: "Ridge Regression (L2 Regularization)"
+sidebar_label: Ridge Regression
+description: "Mastering L2 regularization to prevent overfitting and handle multicollinearity in regression models."
+tags: [machine-learning, supervised-learning, regression, ridge, l2-regularization]
+---
+
+**Ridge Regression** is an extension of linear regression that adds a regularization term to the cost function. It is specifically designed to handle **overfitting** and issues caused by **multicollinearity** (when input features are highly correlated).
+
+## 1. The Mathematical Objective
+
+In standard OLS (Ordinary Least Squares), the model only cares about minimizing the error. In Ridge Regression, we add a "penalty" proportional to the square of the magnitude of the coefficients ($\beta$).
+
+The cost function becomes:
+
+$$
+Cost = \text{MSE} + \alpha \sum_{j=1}^{p} \beta_j^2
+$$
+
+* **MSE (Mean Squared Error):** The standard loss (prediction error).
+* **$\alpha$ (Alpha):** The complexity parameter. It controls how much you want to penalize the size of the coefficients.
+* **$\beta_j^2$:** The squared coefficients (the $L2$ penalty). Squaring keeps the coefficients small but rarely pushes them to exactly zero.
+
+## 2. Why use Ridge?
+
+### A. Preventing Overfitting
+When a model has too many features or the features are highly correlated, the coefficients ($\beta$) can become very large. This makes the model extremely sensitive to small fluctuations in the training data. Ridge "shrinks" these coefficients, making the model more stable.
+
+### B. Handling Multicollinearity
+If two variables are nearly identical (e.g., height in inches and height in centimeters), standard regression might assign one a massive positive weight and the other a massive negative weight. Ridge forces the weights to be distributed more evenly and kept small.
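To make the effect above concrete, here is a minimal sketch (synthetic data; all numbers are arbitrary) that builds two nearly identical columns and compares the coefficients OLS and Ridge assign to them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two nearly identical features: the same signal measured by two noisy "sensors"
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
x1 = signal + rng.normal(scale=0.01, size=200)
x2 = signal + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2])
y = 3.0 * signal + rng.normal(scale=1.0, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)    # unstable: can be huge with opposite signs
print("Ridge coefficients:", ridge.coef_)  # small, similar weights shared by the pair
```

Because the two columns carry almost the same information, OLS can trade a large positive weight on one against a large negative weight on the other with almost no change in training error; the $L2$ penalty makes that trade expensive, so Ridge settles on small, similar weights instead.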
+
+```mermaid
+graph LR
+    subgraph RIDGE["Ridge Regression Coefficient Shrinkage"]
+    A["$$\\alpha = 0$$"] --> B["$$w_1, w_2, w_3$$ (OLS Coefficients)"]
+    B --> C["$$\\alpha \\uparrow$$"]
+    C --> D["$$|w_i| \\downarrow$$ (Shrinkage)"]
+    D --> E["$$w_i \\to 0$$ (Never Exactly Zero)"]
+    E --> F["$$\\text{Reduced Model Variance}$$"]
+    end
+
+    subgraph OBJ["Optimization View"]
+    L["$$\\min \\sum (y - \\hat{y})^2 + \\alpha \\sum w_i^2$$"]
+    L --> G["$$\\text{Penalty Grows with } \\alpha$$"]
+    G --> H["$$\\text{Stronger Pull Toward Origin}$$"]
+    end
+
+    C -.->|"$$\\text{Controls Strength}$$"| G
+    F -.->|"$$\\text{Bias–Variance Tradeoff}$$"| H
+```
+
+## 3. The Alpha ($\alpha$) Trade-off
+
+Choosing the right $\alpha$ is a balancing act between **Bias** and **Variance**:
+
+* **$\alpha = 0$:** Equivalent to standard Linear Regression (High variance, Low bias).
+* **As $\alpha \to \infty$:** The penalty dominates. Coefficients approach zero, and the model becomes a flat line at the mean of the target (Low variance, High bias).
+
+## 4. Implementation with Scikit-Learn
+
+```python
+from sklearn.linear_model import Ridge
+from sklearn.preprocessing import StandardScaler
+
+# 1. Scaling is MANDATORY for Ridge
+# Because the penalty depends on coefficient magnitudes, features on
+# different scales would otherwise be regularized inconsistently.
+scaler = StandardScaler()
+X_train_scaled = scaler.fit_transform(X_train)
+X_test_scaled = scaler.transform(X_test)  # reuse the same scaler, without refitting
+
+# 2. Initialize and Train
+# alpha=1.0 is the default
+ridge = Ridge(alpha=1.0)
+ridge.fit(X_train_scaled, y_train)
+
+# 3. Predict
+y_pred = ridge.predict(X_test_scaled)
+```
+
+## 5. Ridge vs. Lasso: A Summary
+
+| Feature | Ridge Regression ($L2$) | Lasso Regression ($L1$) |
+| --- | --- | --- |
+| **Penalty Term** | $\alpha \sum_{j=1}^{p} \beta_j^2$ | $\alpha \sum_{j=1}^{p} \vert \beta_j \vert$ |
+| **Mathematical Goal** | Minimizes the **square** of the weights. | Minimizes the **absolute value** of the weights. |
+| **Coefficient Shrinkage** | Shrinks coefficients asymptotically toward zero, but they rarely reach exactly zero. | Can shrink coefficients **exactly to zero**, effectively removing the feature. |
+| **Feature Selection** | **No.** Keeps all predictors in the final model, though some may have tiny weights. | **Yes.** Acts as an automated feature selector by discarding unimportant variables. |
+| **Model Complexity** | Produces a **dense** model (uses all features). | Produces a **sparse** model (uses a subset of features). |
+| **Ideal Scenario** | When you have many features that all contribute a small amount to the output. | When you have many features, but only a few are actually significant. |
+| **Handling Correlated Data** | Very stable; handles multicollinearity by distributing weights across correlated features. | Less stable; if features are highly correlated, it may arbitrarily pick one and zero out the others. |
+
+```mermaid
+graph LR
+    subgraph L2["L2 Regularization Constraint (Ridge)"]
+    O1["$$w_1^2 + w_2^2 \leq t$$"] --> C1["$$\text{Circle (L2 Ball)}$$"]
+    C1 --> E1["$$\text{Smooth Boundary}$$"]
+    E1 --> S1["$$\text{Rarely touches axes}$$"]
+    S1 --> R1["$$w_1, w_2 \neq 0$$"]
+    end
+
+    subgraph L1["L1 Regularization Constraint (Lasso)"]
+    O2["$$|w_1| + |w_2| \leq t$$"] --> D1["$$\text{Diamond (L1 Ball)}$$"]
+    D1 --> C2["$$\text{Sharp Corners}$$"]
+    C2 --> A1["$$\text{Corners lie on axes}$$"]
+    A1 --> Z1["$$w_1 = 0 \ \text{or}\ w_2 = 0$$"]
+    end
+
+    R1 -.->|"$$\text{Geometry Explains Behavior}$$"| Z1
+```
+
+## 6. RidgeCV: Finding the Best Alpha
+
+Finding the perfect $\alpha$ manually is tedious.
Scikit-Learn provides `RidgeCV`, which uses built-in cross-validation to find the optimal alpha for your specific dataset automatically. + +```python +from sklearn.linear_model import RidgeCV + +# Define a list of alphas to test +alphas = [0.1, 1.0, 10.0, 100.0] + +# RidgeCV finds the best one automatically +ridge_cv = RidgeCV(alphas=alphas) +ridge_cv.fit(X_train_scaled, y_train) + +print(f"Best Alpha: {ridge_cv.alpha_}") + +``` + +## References for More Details + +* **[Scikit-Learn: Linear Models](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression):** Technical details on the solvers used (like 'cholesky' or 'sag'). From e2c9e1e24cb16e8083eb8df2e0493d6f6c885876 Mon Sep 17 00:00:00 2001 From: Ajay Dhangar Date: Sat, 3 Jan 2026 09:35:36 +0530 Subject: [PATCH 2/2] Fix: broken links found --- .../supervised-learning/regression/lasso.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/machine-learning/machine-learning-core/supervised-learning/regression/lasso.mdx b/docs/machine-learning/machine-learning-core/supervised-learning/regression/lasso.mdx index 8ab7e78..0dff9d1 100644 --- a/docs/machine-learning/machine-learning-core/supervised-learning/regression/lasso.mdx +++ b/docs/machine-learning/machine-learning-core/supervised-learning/regression/lasso.mdx @@ -25,7 +25,7 @@ Where: ## 2. Feature Selection: The Power of Zero -The most significant difference between Lasso and its sibling, [Ridge Regression](/tutorial/machine-learning/supervised-learning/regression/ridge), is that Lasso can shrink coefficients **exactly to zero**. +The most significant difference between Lasso and its sibling, [Ridge Regression](./ridge), is that Lasso can shrink coefficients **exactly to zero**. When a coefficient becomes zero, that feature is effectively removed from the model. This makes Lasso an excellent tool for: 1. **Automated Feature Selection:** Identifying the most important variables in a dataset with hundreds of features.