Data Science Salary Estimator

End to End MLOps Data Science Project: "Predicting Salary of a Data Scientist in India"

With MLflow Experiment Tracking

🚀 Live Application

🌐 The application is deployed and live

Note

The initial load of the web app may take 1-2 minutes. Once loaded, refresh the page to ensure all features work correctly.

Tip

For the best experience, please refer to the Usage Guide section below to learn how to navigate and use the web app effectively.

📌 Project Overview

Developed a model to predict the salary of Data Scientists in India.
Collected data by scraping over 800 job postings from the Glassdoor website.
Cleaned and pre-processed the raw data.
Engineered new features that capture the importance of tools like 'python', 'r', 'sql', 'aws', 'spark', 'genai', 'LLMs' for a data science role.
Trained multiple machine learning algorithms and evaluated them using cross-validation and GridSearch.
Integrated MLflow to track experiments, metrics, hyperparameters, and model artifacts automatically.
Deployed the best-performing model as a Flask API.
Hosted the web app on Render for continuous availability.

🧱 Project Workflow

1. Data Collection:

Using the Selenium framework, I scraped Data Science job postings within India from the Glassdoor website.
Scraped all the available job postings (around 900).
For each job I collected the following:
- Company Name
- Job Title
- Salary Estimate
- Location of the job
- Job Description
- Rating of the company
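
The fields listed above map naturally onto one record per posting. The following is a minimal sketch of such a record and a CSV writer, using only the standard library. The class name `JobPosting`, the field names, the function `write_postings`, and the sample values are illustrative assumptions, not the project's actual scraper code; the Selenium page-navigation step is deliberately left out.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable, Optional, TextIO


@dataclass
class JobPosting:
    """One scraped posting; the field names are hypothetical."""
    company_name: str
    job_title: str
    salary_estimate: Optional[str]  # raw text, parsed later during cleaning
    location: str
    job_description: str
    rating: Optional[float]  # company rating; None when not shown


def write_postings(postings: Iterable[JobPosting], out: TextIO) -> int:
    """Write postings as CSV rows with a header; return the number of rows written."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(JobPosting)])
    writer.writeheader()
    count = 0
    for posting in postings:
        writer.writerow(asdict(posting))  # None values become empty cells
        count += 1
    return count


# Made-up example values, written to an in-memory buffer instead of a file.
sample = [JobPosting("Acme", "Data Scientist", None, "Bengaluru", "Python, SQL", 4.1)]
buffer = io.StringIO()
rows_written = write_postings(sample, buffer)
```

Keeping the salary as raw text at this stage means missing or oddly formatted estimates are preserved for the cleaning step rather than silently dropped.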

2. Data Cleaning & Preprocessing:

Once the data was scraped, I cleaned it and prepared it for model building.
During the cleaning process I did the following:
- Filled the missing values using the most suitable method (there were too many missing values to simply drop them)
- Removed unwanted text and blank spaces from the values of different columns
- Parsed numeric data from the 'Salary Estimate' column.
- Found the age of the company using the 'Founded' column.
- Created the following new columns for the skills and tools listed in the 'Job Description' column:
  - python
  - r
  - sql
  - aws
  - spark
  - genai
  - LLMs
- Created new features for type of roles, seniority levels.
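
The cleaning steps above can be sketched with the standard library alone. This is a hedged illustration, not the project's code: the salary text format (`'₹6L - ₹12L (Employer Est.)'`, amounts in lakhs), the function names, and the regex patterns are all assumptions.

```python
import re
from datetime import date
from typing import Dict, Optional, Tuple

# Skill keywords from the list above, each mapped to an assumed regex pattern.
# \b keeps 'r' from matching inside other words; it can still match a
# standalone "R" in phrases like "R&D", so treat that flag as approximate.
SKILL_PATTERNS: Dict[str, str] = {
    "python": r"\bpython\b",
    "r": r"\br\b",
    "sql": r"\bsql\b",
    "aws": r"\baws\b",
    "spark": r"\bspark\b",
    "genai": r"\bgen(?:erative)?[\s-]?ai\b",
    "LLMs": r"\bllms?\b",
}


def parse_salary_range(text: str) -> Optional[Tuple[float, float]]:
    """Return (min, max) in lakhs from text like '₹6L - ₹12L (Employer Est.)'.

    The input format is an assumption; returns None if two amounts are not found.
    """
    amounts = re.findall(r"(\d+(?:\.\d+)?)\s*L\b", text, flags=re.IGNORECASE)
    if len(amounts) < 2:
        return None
    low, high = float(amounts[0]), float(amounts[1])
    return (min(low, high), max(low, high))


def company_age(founded_year: Optional[int], today: Optional[date] = None) -> Optional[int]:
    """Age in years from the 'Founded' value; None when the year is missing or invalid."""
    current_year = (today or date.today()).year
    if founded_year is None or founded_year <= 0 or founded_year > current_year:
        return None
    return current_year - founded_year


def skill_flags(description: str) -> Dict[str, int]:
    """Return a 0/1 flag per skill, based on a case-insensitive search."""
    return {
        skill: int(re.search(pattern, description, flags=re.IGNORECASE) is not None)
        for skill, pattern in SKILL_PATTERNS.items()
    }
```

Returning `None` for unparseable salaries and invalid founding years keeps the missing-value handling explicit, so those rows can be filled later instead of carrying a misleading number.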

3. Exploratory Data Analysis & Feature Engineering:

After the data was cleaned, I analyzed it to find hidden patterns, trends, and other relationships between features.
Performed univariate, bivariate, and multivariate analysis.
Visualized the distribution of each feature and explored each feature's values and their counts.
Visualized the presence of missing values in the dataset.
Found relationships (correlations) between features.
Found the relationship between the revenue of a company and the salary it offers.
Found the companies with higher ratings (more than 4.0 and 4.5).
Found the most common industries and sectors the companies belong to, and so on.
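
To make the correlation step concrete, here is a small sketch of the Pearson coefficient written out in plain Python. In a pandas workflow the same number would normally come from `DataFrame.corr()`; the function name `pearson_correlation` is an assumption used only for illustration.

```python
import math
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length numeric columns."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and of squared deviations around each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 or -1 indicates a strong linear relationship between two features, while a value near 0 indicates little linear relationship.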

4. Model Building with MLflow tracking:

Split the dataset into train and test sets.
Trained multiple models (Linear Regression, Ridge, Lasso, Random Forest, XGBoost, CatBoost).
Logged model parameters, metrics, and artifacts to MLflow.
Used MLflow to compare the models and register the best-performing one based on R² score.
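
The comparison by R² can be sketched as follows. This is a simplified illustration under stated assumptions: the function names `r_squared`, `pick_best_model`, and `log_scores_to_mlflow` are hypothetical, and the actual training script may structure its runs differently. R² is computed by hand here to show the criterion; `sklearn.metrics.r2_score` gives the same value. The logging helper imports MLflow lazily and expects the tracking server from the setup section to be running.

```python
from typing import Dict, Sequence, Tuple


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


def pick_best_model(
    y_true: Sequence[float], predictions: Dict[str, Sequence[float]]
) -> Tuple[str, Dict[str, float]]:
    """Score each model's test-set predictions and return (best_name, all_scores)."""
    scores = {name: r_squared(y_true, preds) for name, preds in predictions.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores


def log_scores_to_mlflow(
    scores: Dict[str, float], tracking_uri: str = "http://127.0.0.1:8000"
) -> None:
    """Log one MLflow run per model; requires MLflow and a reachable tracking server."""
    import mlflow  # imported lazily so the scoring code runs without MLflow installed

    mlflow.set_tracking_uri(tracking_uri)
    for name, score in scores.items():
        with mlflow.start_run(run_name=name):
            mlflow.log_param("model_name", name)
            mlflow.log_metric("r2", score)
```

Selecting on a held-out test set keeps the comparison honest: a model that only fits the training data well will not win on R² here.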

5. Productionization & Deployment:

Built a Flask API endpoint that takes in job posting details and returns estimated salary.
Designed an intuitive web interface using HTML and CSS for user interaction.
Deployed the application on Render with continuous availability.
The live application is accessible to anyone with internet access.
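
A prediction endpoint of this kind might look like the sketch below. The route `/predict`, the field names in `REQUIRED_FIELDS`, and the `predict_salary` callable are assumptions for illustration; the project's `app.py` may use different routes and form fields. Flask is imported inside the factory so the validation logic can be exercised on its own.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical input fields; the real form fields may differ.
REQUIRED_FIELDS: Tuple[str, ...] = ("job_title", "location", "rating")


def validate_payload(payload: Any) -> List[str]:
    """Return a list of error messages; an empty list means the payload is usable."""
    if not isinstance(payload, dict):
        return ["request body must be a JSON object"]
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]
    rating = payload.get("rating")
    # bool is a subclass of int in Python, so reject it explicitly.
    if "rating" in payload and (isinstance(rating, bool) or not isinstance(rating, (int, float))):
        errors.append("rating must be a number")
    return errors


def create_app(predict_salary: Callable[[Dict[str, Any]], float]):
    """Build a Flask app exposing POST /predict; requires Flask to be installed."""
    from flask import Flask, jsonify, request  # lazy import keeps validation usable alone

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True)  # None when the body is not valid JSON
        errors = validate_payload(payload)
        if errors:
            return jsonify({"errors": errors}), 400
        return jsonify({"estimated_salary": predict_salary(payload)})

    return app
```

Passing the model in as a callable keeps the web layer independent of how the best model is loaded, which also makes the endpoint easy to test with a stub.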

🛠 Tech Stack

Technology	Description
Python	Programming language used
Render	Cloud platform for deployment and hosting
Selenium	Scraping real-world data
Flask	Web framework for UI and API integration
MLflow	Experiment tracking and model registry
HTML & CSS	Frontend design and styling
Pandas	Cleaning and preprocessing the data
NumPy	Performing numerical operations
Matplotlib	Visualization of the data

🚀 Installation & Setup

1️⃣ Clone the Repository

git clone https://github.com/Dhanush-Raj1/Data-Science-Salary-Project.git
cd Data-Science-Salary-Project

2️⃣ Create a Virtual Environment

conda create -p envi python=3.9 -y
conda activate ./envi   # same command on macOS/Linux and Windows

3️⃣ Install Dependencies

pip install -r requirements.txt

4️⃣ Start MLflow Tracking Server

mlflow ui --backend-store-uri sqlite:///mlruns.db  --default-artifact-root ./mlruns --host 127.0.0.1 --port 8000

Access the MLflow UI at: http://127.0.0.1:8000

5️⃣ Run the Training Script

python main.py

To access the Flask App

python app.py

The app will be available at: http://127.0.0.1:5000/

🌐 Usage Guide

Access the web app

1️⃣ Click the link to open the web app in your browser
2️⃣ Click the "Predict" button on the home page, which takes you to the predict page
3️⃣ Enter the company details in the respective dropdowns
4️⃣ Click the "Predict" button and scroll down to see the predicted results

📸 Screenshots

MLflow UI (model logging, best model registry)

🟠 Home Page

🔵 Predict Page

Result

🎯 Future Enhancements

✅ Add more job platforms like LinkedIn and Indeed for better data
✅ Host MLflow Tracking Server remotely for persistent experiment logs
✅ Automate retraining pipelines with GitHub Actions and CI/CD
✅ Add real-time salary updates based on market trends

🤝 Contributing

💡 Contributions, issues, and pull requests are welcome! Feel free to open an issue or submit a PR to improve this project. 🚀

📄 License

This project is licensed under the Apache License – see the LICENSE file for details.
