This project dives into advanced techniques for multiclass text classification. We harness BERT (Bidirectional Encoder Representations from Transformers), an open-source pre-trained language model from Google that delivers state-of-the-art results across a wide range of NLP tasks.
Having worked through NLP algorithms from Naïve Bayes to RNNs and LSTMs, we now turn to BERT for text classification. Its bidirectional encoding and pre-trained representations give the project a substantial boost in accuracy and performance.
Our goal is to leverage the pre-trained BERT model for multiclass text classification, utilizing a dataset containing over two million customer complaints about consumer financial products.
The dataset includes customer complaints with corresponding product categories. Our objective is to predict product categories based on the text of the complaints.
- Language: Python
- Libraries: pandas, torch, nltk, numpy, pickle, re, tqdm, sklearn, transformers
Before diving in, ensure you have the required packages installed. Refer to the requirements.txt file for the specific libraries and versions needed.
Installing Necessary Packages:
- Use the pip command to install required packages.
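For example, the dependencies listed in requirements.txt can be installed with a single command:

```
pip install -r requirements.txt
```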
Importing Required Libraries:
- Set the stage by importing essential libraries.
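A representative set of imports based on the libraries listed above (your exact imports may differ):

```python
import re
import pickle

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from transformers import BertModel, BertTokenizer
```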
Defining Configuration File Paths:
- Establish paths for configuration files.
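A minimal sketch of what config.py might contain; the variable names and training settings here are illustrative, while the file names match the project's input and output folders:

```python
# config.py -- illustrative variable names; adjust to match the actual project config
DATA_PATH = "input/complaints.csv"               # raw complaints data
TOKENS_PATH = "output/tokens.pkl"                # saved BERT tokens
LABELS_PATH = "output/labels.pkl"                # encoded labels
LABEL_ENCODER_PATH = "output/label_encoder.pkl"  # fitted sklearn LabelEncoder
MODEL_PATH = "output/bert_pre_trained.pth"       # fine-tuned model weights

NUM_EPOCHS = 3                                   # assumed training settings
BATCH_SIZE = 16
MAX_LENGTH = 128
```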
Processing Text Data:
- Read and preprocess the CSV file.
- Handle null values, duplicate labels, and encode labels.
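A sketch of this step with pandas and scikit-learn; the column names are assumptions about complaints.csv and may need adjusting:

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed column names; check complaints.csv for the actual ones.
df = pd.read_csv("input/complaints.csv")
df = df.rename(columns={"Consumer complaint narrative": "text", "Product": "label"})
df = df[["text", "label"]]

# Drop rows with missing complaint text or product category.
df = df.dropna(subset=["text", "label"]).reset_index(drop=True)
# Duplicate or overlapping product names can also be merged here if needed.

# Encode the product categories as integer class ids and keep the encoder for later use.
encoder = LabelEncoder()
labels = encoder.fit_transform(df["label"])

with open("output/label_encoder.pkl", "wb") as f:
    pickle.dump(encoder, f)
```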
Data Preprocessing:
- Convert text to lowercase.
- Remove punctuation, digits, consecutive instances of 'x', and extra spaces.
- Tokenize the text and save tokens.
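One way to implement the cleaning and tokenization, continuing from the DataFrame and labels built above (the cleaning rules mirror the bullets; the max_length of 128 is an assumption):

```python
import re
import pickle

from transformers import BertTokenizer

def clean_text(text: str) -> str:
    """Lowercase the text and strip runs of 'x', punctuation, digits, and extra spaces."""
    text = text.lower()
    text = re.sub(r"x{2,}", " ", text)         # redacted fields such as 'xxxx'
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation
    text = re.sub(r"\d+", " ", text)           # digits
    return re.sub(r"\s+", " ", text).strip()   # extra whitespace

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(
    df["text"].map(clean_text).tolist(),
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

# Persist the tokens and labels so training can be rerun without re-tokenizing.
with open("output/tokens.pkl", "wb") as f:
    pickle.dump(tokens, f)
with open("output/labels.pkl", "wb") as f:
    pickle.dump(labels, f)
```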
Model Building:
- Create the BERT model.
- Define PyTorch dataset functions.
- Implement functions for model training and testing.
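A compact sketch of the dataset and model classes; the actual model.py and data.py may be organised differently, and the dropout value is an assumption:

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from transformers import BertModel

class ComplaintsDataset(Dataset):
    """Wraps the tokenized inputs and integer labels so a DataLoader can batch them."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

class BertClassifier(nn.Module):
    """Pre-trained BERT encoder with a dropout layer and a linear classification head."""
    def __init__(self, num_classes: int, dropout: float = 0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output  # the [CLS] representation
        return self.classifier(self.dropout(pooled))
```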
Train BERT Model:
- Load files and split data.
- Create PyTorch datasets and data loaders.
- Define the model, loss function, and optimizer.
- Train the model and test its performance.
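A minimal training and evaluation loop under the assumptions above (tokens, labels, encoder, ComplaintsDataset, and BertClassifier come from the earlier sketches; batch size, learning rate, and epoch count are illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 80/20 train/test split over the tokenized dataset.
dataset = ComplaintsDataset(tokens, labels)
train_size = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [train_size, len(dataset) - train_size])
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
test_loader = DataLoader(test_set, batch_size=16)

# Model, loss function, and optimizer.
model = BertClassifier(num_classes=len(encoder.classes_)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for epoch in range(3):
    model.train()
    for batch in tqdm(train_loader, desc=f"epoch {epoch + 1}"):
        optimizer.zero_grad()
        logits = model(batch["input_ids"].to(device), batch["attention_mask"].to(device))
        loss = criterion(logits, batch["labels"].to(device))
        loss.backward()
        optimizer.step()

# Evaluate on the held-out split and save the fine-tuned weights.
model.eval()
correct = 0
with torch.no_grad():
    for batch in test_loader:
        logits = model(batch["input_ids"].to(device), batch["attention_mask"].to(device))
        correct += (logits.argmax(dim=1).cpu() == batch["labels"]).sum().item()
print(f"Test accuracy: {correct / len(test_set):.3f}")

torch.save(model.state_dict(), "output/bert_pre_trained.pth")
```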
Predictions on New Text:
- Make predictions on new text data using the trained model.
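A small helper of this kind, reusing the names from the sketches above, illustrates inference on a single complaint (the example text is hypothetical; the real predict.py may read its input differently):

```python
import torch

def predict_product(text: str) -> str:
    """Return the predicted product category for one complaint string."""
    model.eval()
    inputs = tokenizer(
        clean_text(text),
        padding="max_length",
        truncation=True,
        max_length=128,
        return_tensors="pt",
    ).to(device)
    with torch.no_grad():
        logits = model(inputs["input_ids"], inputs["attention_mask"])
    return encoder.inverse_transform([logits.argmax(dim=1).item()])[0]

# Hypothetical example complaint.
print(predict_product("I was charged twice for the same credit card payment."))
```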
Upon unzipping the modular_code.zip file, you'll find folders:
Input:
- Contains the analysis data, in this case, complaints.csv.

Output:
- Contains essential files for model training: bert_pre_trained.pth, label_encoder.pkl, labels.pkl, tokens.pkl.

Source:
- Holds modularized code in Python files for better organization: model.py, data.py, utils.py.

Config:
- config.py holds the project configurations.

Engine:
- Engine.py is the main file for running the entire code, training the model, and saving it in the output folder.

Notebook:
- bert.ipynb is the original notebook used during development.

Processing and Predictions:
- processing.py processes the data; predict.py makes predictions on new data.

README and Requirements:
- README.md provides detailed instructions, and requirements.txt lists the necessary libraries.
Understanding the Business Problem:
- Grasping the intricacies of multiclass text classification.
Exploring Pre-trained Models:
- Introduction to the concept and significance of pre-trained models.
BERT Model Insights:
- Understanding the architecture and functioning of BERT.
Data Preparation Techniques:
- Handling spaces, digits, and punctuation for effective model input.
BERT Tokenization:
- Implementing BERT tokenization for text processing.
Model Architecture and Training:
- Creating and training the BERT model using CUDA or CPU.
Predictions on New Text Data:
- Applying the trained model for predictions on unseen text data.
Feel free to explore the modular_code.zip for organized code snippets. The project provides a seamless experience with pre-trained models, ensuring quick and efficient use without the need to retrain from scratch.
For a more hands-on experience, refer to the bert.ipynb notebook and follow the instructions in the README.md file for detailed guidance.
Happy coding! 🚀✨