A complete MLOps project for predicting hotel reservation cancellations using LightGBM, with automated CI/CD pipelines using Jenkins and deployment on Google Cloud Platform (GCP).
- Overview
- Features
- Project Architecture
- Tech Stack
- Project Structure
- Installation
- Configuration
- Usage
- Pipeline Stages
- Deployment
- Web Application
- MLFlow Integration
- Contributing
- License
This project implements a complete machine learning pipeline to predict whether a hotel reservation will be canceled. It demonstrates best practices in MLOps including:
- Automated data ingestion from Google Cloud Storage
- Feature engineering and preprocessing
- Model training with hyperparameter tuning
- Experiment tracking with MLFlow
- Containerization with Docker
- CI/CD automation with Jenkins
- Cloud deployment on Google Cloud Run
- Data Ingestion: Automated data download from GCP buckets
- Data Preprocessing:
  - Handling imbalanced data using SMOTE
  - Feature selection using Random Forest
  - Label encoding for categorical variables
- Model Training:
  - LightGBM classifier with RandomizedSearchCV
  - Hyperparameter tuning
- Experiment Tracking: MLFlow for logging parameters, metrics, and artifacts
- Web Interface: Flask-based web application for predictions
- CI/CD Pipeline: Automated Jenkins pipeline for build and deployment
- Cloud Deployment: Containerized deployment on Google Cloud Run
┌─────────────────┐
│   GCP Bucket    │
│   (Raw Data)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Data Ingestion  │
│  (Download &    │
│   Split Data)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Data Processing │
│ (Preprocessing, │
│  Balancing,     │
│  Feature Sel.)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Model Training  │
│ (LightGBM with  │
│  MLFlow)        │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Flask Web App  │
│  (Predictions)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Docker      │
│ (Containerize)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Jenkins     │
│     (CI/CD)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Google Cloud   │
│       Run       │
└─────────────────┘
- Language: Python 3.8 or higher
- ML Framework: LightGBM, scikit-learn
- Data Processing: Pandas, NumPy, imbalanced-learn
- Experiment Tracking: MLFlow
- Web Framework: Flask
- Cloud Platform: Google Cloud Platform (GCS, Cloud Run)
- Containerization: Docker
- CI/CD: Jenkins
- Version Control: Git
Hotel-Reservation-Prediction/
│
├── src/                          # Source code
│   ├── data_injection.py         # Data ingestion from GCP
│   ├── data_preprocessing.py     # Data preprocessing pipeline
│   ├── model_training.py         # Model training with MLFlow
│   ├── logger.py                 # Logging configuration
│   └── custom_exception.py       # Custom exception handling
│
├── pipeline/                     # Pipeline scripts
│   └── training_pipeline.py      # Complete training pipeline
│
├── config/                       # Configuration files
│   ├── config.yaml               # Main configuration
│   ├── paths_config.py           # Path configurations
│   └── model_params.py           # Model hyperparameters
│
├── utils/                        # Utility functions
│   └── common_functions.py       # Common helper functions
│
├── templates/                    # Flask HTML templates
│   └── index.html                # Web interface
│
├── static/                       # Static files (CSS, JS)
│
├── artifacts/                    # Generated artifacts
│   ├── raw/                      # Raw data
│   ├── processed/                # Processed data
│   └── models/                   # Trained models
│
├── notebook/                     # Jupyter notebooks
│   └── notebook.ipynb            # Exploratory data analysis
│
├── jenkins/                      # Jenkins configuration
│   └── Dockerfile                # Jenkins Docker setup
│
├── application.py                # Flask application
├── Dockerfile                    # Application Dockerfile
├── Jenkinsfile                   # Jenkins pipeline definition
├── requirements.txt              # Python dependencies
├── setup.py                      # Package setup
└── README.md                     # This file
- Python 3.8 or higher
- Google Cloud Platform account (for data ingestion and deployment)
- Docker (optional, for containerization)
- Jenkins (optional, for CI/CD)
- Clone the repository
git clone https://github.com/Sameeh07/Hotel-Reservation-Prediction.git
cd Hotel-Reservation-Prediction
- Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
pip install --upgrade pip
pip install -e .
Edit config/config.yaml:
data_ingestion:
  bucket_name: "your-bucket-name" # Your GCP bucket name
  bucket_file_name: "HotelReservations.csv"
  train_ratio: 0.8
Set the path to your GCP service account key:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account-key.json"
Modify config/model_params.py to adjust hyperparameters:
LIGHTGM_PARAMS = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(5, 50),
    'learning_rate': uniform(0.01, 0.2),
    # ... more parameters
}
Run the complete training pipeline:
python pipeline/training_pipeline.py
This will:
- Download data from GCP bucket
- Split data into train/test sets
- Preprocess and balance the data
- Select top features
- Train LightGBM model with hyperparameter tuning
- Log experiments to MLFlow
- Save the trained model
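The sequence above amounts to running the stages in order and stopping at the first failure, since each stage consumes artifacts written by the previous one. A minimal sketch of that control flow follows. The stage callables `ingest`, `preprocess`, and `train` are hypothetical placeholders, not the actual entry points in `src/`.

```python
import logging
from typing import Callable, List, Tuple

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("training_pipeline")

Stage = Tuple[str, Callable[[], None]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and return the names of the stages that completed."""
    completed: List[str] = []
    for name, stage in stages:
        logger.info("Starting stage: %s", name)
        try:
            stage()
        except Exception:
            # Fail fast: later stages depend on artifacts from earlier ones.
            logger.exception("Stage failed: %s", name)
            raise
        completed.append(name)
    return completed


# Hypothetical placeholders standing in for the real stage entry points.
def ingest() -> None: ...
def preprocess() -> None: ...
def train() -> None: ...


done = run_pipeline([
    ("data_ingestion", ingest),
    ("data_preprocessing", preprocess),
    ("model_training", train),
])
print(done)
```

Re-raising after logging keeps the failure visible to a caller such as a CI job, which can then mark the run as failed.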
Data Ingestion Only:
python src/data_injection.py
Data Preprocessing Only:
python src/data_preprocessing.py
Model Training Only:
python src/model_training.py
Run the web application:
python application.py
The application will be available at http://localhost:8080
Build the Docker image:
docker build -t hotel-reservation-app .
Run the container:
docker run -p 8080:8080 hotel-reservation-app
Data Ingestion:
- Downloads CSV data from GCP bucket
- Splits data into training (80%) and testing (20%) sets
- Saves raw data to artifacts/raw/
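The 80/20 split can be illustrated with a small, reproducible shuffle-and-cut. This is a standard-library sketch only; the function name `split_rows` and the fixed seed are assumptions for illustration, and the project's own code may split differently (for example with scikit-learn).

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_rows(
    rows: Sequence[T], train_ratio: float = 0.8, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle rows with a fixed seed, then cut them into train and test parts."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, test = split_rows(range(100), train_ratio=0.8)
print(len(train), len(test))
```

Fixing the seed means repeated runs produce the same train and test sets, which keeps experiment results comparable.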
Data Preprocessing:
- Handles imbalanced data using SMOTE (Synthetic Minority Over-sampling Technique)
- Encodes categorical variables using LabelEncoder
- Selects top 10 features using Random Forest feature importance
- Handles skewness in numerical features
- Saves processed data to artifacts/processed/
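To make the SMOTE step less abstract: the technique creates synthetic minority-class rows by picking a minority sample, choosing one of its nearest minority neighbours, and interpolating a new point on the segment between them. The sketch below implements that idea with the standard library only. It is an illustration, not the project's code, which uses imbalanced-learn according to the tech stack; the function name and parameters are assumptions.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def smote_oversample(
    minority: List[Point], n_new: int, k: int = 5, seed: int = 0
) -> List[List[float]]:
    """Create n_new synthetic minority points by interpolating toward neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic: List[List[float]] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
majority_count = 10
new_points = smote_oversample(minority, n_new=majority_count - len(minority), k=2)
print(len(minority) + len(new_points))
```

Because every synthetic point lies between two real minority samples, the new rows stay inside the region the minority class already occupies.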
Model Training:
- Trains LightGBM classifier
- Performs hyperparameter tuning using RandomizedSearchCV
- Evaluates model on test set (accuracy, precision, recall, F1-score)
- Logs all parameters, metrics, and artifacts to MLFlow
- Saves trained model to artifacts/models/
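Randomized search draws a fixed number of candidate settings from the distributions in `LIGHTGM_PARAMS` instead of trying every combination. Assuming `randint` and `uniform` there come from `scipy.stats` (the import is not shown in this README), `randint(100, 500)` yields integers from 100 to 499, and `uniform(0.01, 0.2)` takes a start and a width, so it covers 0.01 to 0.21 rather than 0.01 to 0.2. The sketch below mimics that sampling with standard-library stand-ins; `SEARCH_SPACE` and `sample_candidates` are hypothetical names.

```python
import random
from typing import Callable, Dict, List

# Stdlib stand-ins that follow the scipy.stats conventions assumed above:
#   randint(low, high)  -> integer in [low, high - 1]
#   uniform(loc, scale) -> float in [loc, loc + scale)
def randint(low: int, high: int) -> Callable[[random.Random], int]:
    return lambda rng: rng.randrange(low, high)


def uniform(loc: float, scale: float) -> Callable[[random.Random], float]:
    return lambda rng: loc + scale * rng.random()


SEARCH_SPACE = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(5, 50),
    "learning_rate": uniform(0.01, 0.2),
}


def sample_candidates(space, n_iter: int, seed: int = 0) -> List[Dict[str, float]]:
    """Draw n_iter independent parameter settings from the search space."""
    rng = random.Random(seed)
    return [{name: draw(rng) for name, draw in space.items()} for _ in range(n_iter)]


for params in sample_candidates(SEARCH_SPACE, n_iter=3):
    print(params)
```

Each dictionary is one candidate. RandomizedSearchCV cross-validates a model per candidate and keeps the best-scoring one, so the number of iterations directly controls training cost.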
The Jenkinsfile defines a 4-stage pipeline:
- Clone Repository: Pulls latest code from GitHub
- Setup Environment: Creates virtual environment and installs dependencies
- Build & Push: Builds Docker image and pushes to Google Container Registry (GCR)
- Deploy: Deploys container to Google Cloud Run
- Build and push Docker image:
gcloud builds submit --tag gcr.io/YOUR_PROJECT_ID/hotel-reservation-app
- Deploy to Cloud Run:
gcloud run deploy hotel-reservation-app \
  --image gcr.io/YOUR_PROJECT_ID/hotel-reservation-app \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated
The Flask web application provides a user-friendly interface for making predictions. Users can input reservation details:
- Lead time
- Number of special requests
- Average price per room
- Arrival month and date
- Market segment type
- Number of weeknights and weekend nights
- Type of meal plan
- Room type reserved
The model predicts whether the reservation is likely to be canceled.
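Before the model can score a submission, the form values have to be converted into a numeric feature vector in the same column order used at training time. The sketch below shows one way to do that conversion with validation. The field names in `FEATURE_ORDER` are assumptions modelled on the inputs listed above, and the categorical fields (market segment, meal plan, room type) are assumed to arrive already as their label-encoded integer codes.

```python
from typing import List, Mapping

# Hypothetical form field names, in the order the model is assumed to expect.
# The real names and order must match the columns used at training time.
FEATURE_ORDER = [
    "lead_time",
    "no_of_special_requests",
    "avg_price_per_room",
    "arrival_month",
    "arrival_date",
    "market_segment_type",
    "no_of_week_nights",
    "no_of_weekend_nights",
    "type_of_meal_plan",
    "room_type_reserved",
]


def form_to_features(form: Mapping[str, str]) -> List[float]:
    """Validate submitted form values and return them as an ordered feature list."""
    missing = [name for name in FEATURE_ORDER if name not in form]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    try:
        return [float(form[name]) for name in FEATURE_ORDER]
    except ValueError as exc:
        raise ValueError(f"non-numeric form value: {exc}") from exc


example_form = {name: "1" for name in FEATURE_ORDER}
example_form["lead_time"] = "45"
example_form["avg_price_per_room"] = "99.5"
print(form_to_features(example_form))
```

Rejecting incomplete or non-numeric input at this point gives the user a clear error instead of a failure inside the model call.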
MLFlow tracks all experiments, including:
- Parameters: Model hyperparameters
- Metrics: Accuracy, precision, recall, F1-score
- Artifacts:
  - Training and test datasets
  - Trained model files
  - Model parameters
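For reference, the four logged metrics are all derived from the same confusion counts. The sketch below computes them with the standard library to show what each number means; it assumes label 1 means "canceled" and is not the project's evaluation code, which presumably relies on scikit-learn.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
```

A dictionary of this shape can be passed to `mlflow.log_metrics` inside an active run, which records every key as a separate metric.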
To view MLFlow UI:
mlflow ui
Access at http://localhost:5000
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Author: Sameeh
- GitHub: @Sameeh07
Acknowledgments:
- Dataset: Hotel Reservations Dataset
- LightGBM team for the excellent gradient boosting framework
- MLFlow community for experiment tracking tools
- Google Cloud Platform for hosting infrastructure
Note: Replace placeholder values (such as your-bucket-name and YOUR_PROJECT_ID) with your actual configuration values before running the project.