This project dives into advanced techniques for multiclass text classification. We harness BERT (Bidirectional Encoder Representations from Transformers), an open-source pre-trained language model from Google that delivers state-of-the-art results across a wide range of NLP tasks.
Having worked through NLP algorithms from Naïve Bayes to RNNs and LSTMs, we now turn to BERT for text classification. Its bidirectional encoding and pre-trained representations give the project a substantial boost in accuracy and performance.
Our goal is to leverage the pre-trained BERT model for multiclass text classification, utilizing a dataset containing over two million customer complaints about consumer financial products.
The dataset includes customer complaints with corresponding product categories. Our objective is to predict product categories based on the text of the complaints.
- Language: Python
- Libraries: pandas, torch, nltk, numpy, pickle, re, tqdm, sklearn, transformers
Before diving in, ensure you have the required packages installed. Refer to the requirements.txt file for the specific libraries and versions needed.
Installing Necessary Packages:
- Use the pip command to install required packages.
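For example, the dependencies listed in requirements.txt can be installed with a single command:

```
pip install -r requirements.txt
```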
Importing Required Libraries:
- Set the stage by importing essential libraries.
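A representative set of imports based on the libraries listed above (your exact imports may differ):

```python
import re
import pickle

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from transformers import BertModel, BertTokenizer
```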
Defining Configuration File Paths:
- Establish paths for configuration files.
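A minimal sketch of what config.py might contain; the variable names and training settings here are illustrative, while the file names match the project's input and output folders:

```python
# config.py -- illustrative variable names; adjust to match the actual project config
DATA_PATH = "input/complaints.csv"               # raw complaints data
TOKENS_PATH = "output/tokens.pkl"                # saved BERT tokens
LABELS_PATH = "output/labels.pkl"                # encoded labels
LABEL_ENCODER_PATH = "output/label_encoder.pkl"  # fitted sklearn LabelEncoder
MODEL_PATH = "output/bert_pre_trained.pth"       # fine-tuned model weights

NUM_EPOCHS = 3                                   # assumed training settings
BATCH_SIZE = 16
MAX_LENGTH = 128
```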
Processing Text Data:
- Read and preprocess the CSV file.
- Handle null values, duplicate labels, and encode labels.
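A sketch of this step with pandas and scikit-learn; the column names are assumptions about complaints.csv and may need adjusting:

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed column names; check complaints.csv for the actual ones.
df = pd.read_csv("input/complaints.csv")
df = df.rename(columns={"Consumer complaint narrative": "text", "Product": "label"})
df = df[["text", "label"]]

# Drop rows with missing complaint text or product category.
df = df.dropna(subset=["text", "label"]).reset_index(drop=True)
# Duplicate or overlapping product names can also be merged here if needed.

# Encode the product categories as integer class ids and keep the encoder for later use.
encoder = LabelEncoder()
labels = encoder.fit_transform(df["label"])

with open("output/label_encoder.pkl", "wb") as f:
    pickle.dump(encoder, f)
```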
Data Preprocessing:
- Convert text to lowercase.
- Remove punctuation, digits, consecutive instances of 'x', and extra spaces.
- Tokenize the text and save tokens.
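One way to implement the cleaning and tokenization, continuing from the DataFrame and labels built above (the cleaning rules mirror the bullets; the max_length of 128 is an assumption):

```python
import re
import pickle

from transformers import BertTokenizer

def clean_text(text: str) -> str:
    """Lowercase the text and strip runs of 'x', punctuation, digits, and extra spaces."""
    text = text.lower()
    text = re.sub(r"x{2,}", " ", text)         # redacted fields such as 'xxxx'
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation
    text = re.sub(r"\d+", " ", text)           # digits
    return re.sub(r"\s+", " ", text).strip()   # extra whitespace

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(
    df["text"].map(clean_text).tolist(),
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

# Persist the tokens and labels so training can be rerun without re-tokenizing.
with open("output/tokens.pkl", "wb") as f:
    pickle.dump(tokens, f)
with open("output/labels.pkl", "wb") as f:
    pickle.dump(labels, f)
```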
Model Building:
- Create the BERT model.
- Define PyTorch dataset functions.
- Implement functions for model training and testing.
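A compact sketch of the dataset and model classes; the actual model.py and data.py may be organised differently, and the dropout value is an assumption:

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from transformers import BertModel

class ComplaintsDataset(Dataset):
    """Wraps the tokenized inputs and integer labels so a DataLoader can batch them."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

class BertClassifier(nn.Module):
    """Pre-trained BERT encoder with a dropout layer and a linear classification head."""
    def __init__(self, num_classes: int, dropout: float = 0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output  # the [CLS] representation
        return self.classifier(self.dropout(pooled))
```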
Train BERT Model:
- Load files and split data.
- Create PyTorch datasets and data loaders.
- Define the model, loss function, and optimizer.
- Train the model and test its performance.
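A minimal training and evaluation loop under the assumptions above (tokens, labels, encoder, ComplaintsDataset, and BertClassifier come from the earlier sketches; batch size, learning rate, and epoch count are illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 80/20 train/test split over the tokenized dataset.
dataset = ComplaintsDataset(tokens, labels)
train_size = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [train_size, len(dataset) - train_size])
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
test_loader = DataLoader(test_set, batch_size=16)

# Model, loss function, and optimizer.
model = BertClassifier(num_classes=len(encoder.classes_)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for epoch in range(3):
    model.train()
    for batch in tqdm(train_loader, desc=f"epoch {epoch + 1}"):
        optimizer.zero_grad()
        logits = model(batch["input_ids"].to(device), batch["attention_mask"].to(device))
        loss = criterion(logits, batch["labels"].to(device))
        loss.backward()
        optimizer.step()

# Evaluate on the held-out split and save the fine-tuned weights.
model.eval()
correct = 0
with torch.no_grad():
    for batch in test_loader:
        logits = model(batch["input_ids"].to(device), batch["attention_mask"].to(device))
        correct += (logits.argmax(dim=1).cpu() == batch["labels"]).sum().item()
print(f"Test accuracy: {correct / len(test_set):.3f}")

torch.save(model.state_dict(), "output/bert_pre_trained.pth")
```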
Predictions on New Text:
- Make predictions on new text data using the trained model.
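A small helper of this kind, reusing the names from the sketches above, illustrates inference on a single complaint (the example text is hypothetical; the real predict.py may read its input differently):

```python
import torch

def predict_product(text: str) -> str:
    """Return the predicted product category for one complaint string."""
    model.eval()
    inputs = tokenizer(
        clean_text(text),
        padding="max_length",
        truncation=True,
        max_length=128,
        return_tensors="pt",
    ).to(device)
    with torch.no_grad():
        logits = model(inputs["input_ids"], inputs["attention_mask"])
    return encoder.inverse_transform([logits.argmax(dim=1).item()])[0]

# Hypothetical example complaint.
print(predict_product("I was charged twice for the same credit card payment."))
```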
Upon unzipping the modular_code.zip file, you'll find folders:
Input:
- Contains the analysis data, in this case, complaints.csv.

Output:
- Contains essential files for model training: bert_pre_trained.pth, label_encoder.pkl, labels.pkl, tokens.pkl.

Source:
- Holds modularized code in Python files for better organization: model.py, data.py, utils.py.

Config:
- config.py holds the project configurations.

Engine:
- Engine.py is the main file for running the entire code, training the model, and saving it in the output folder.

Notebook:
- bert.ipynb is the original notebook used during development.

Processing and Predictions:
- processing.py processes the data; predict.py makes predictions on new data.

README and Requirements:
- README.md provides detailed instructions, and requirements.txt lists the necessary libraries.
Understanding the Business Problem:
- Grasping the intricacies of multiclass text classification.
Exploring Pre-trained Models:
- Introduction to the concept and significance of pre-trained models.
BERT Model Insights:
- Understanding the architecture and functioning of BERT.
Data Preparation Techniques:
- Handling spaces, digits, and punctuation for effective model input.
BERT Tokenization:
- Implementing BERT tokenization for text processing.
Model Architecture and Training:
- Creating and training the BERT model using CUDA or CPU.
Predictions on New Text Data:
- Applying the trained model for predictions on unseen text data.
Feel free to explore the modular_code.zip for organized code snippets. The project provides a seamless experience with pre-trained models, ensuring quick and efficient use without the need to retrain from scratch.
For a more hands-on experience, refer to the bert.ipynb notebook and follow the instructions in the README.md file for detailed guidance.
Happy coding! 🚀✨