10 Steps of the Machine Learning Process:
Demystifying the Process of Building and Training Machine Learning Models
The term “machine learning” was first coined by Arthur Samuel in 1959 when he defined it as a “field of study that gives computers the ability to learn without being explicitly programmed.”
Table of Contents
- Load the data
- Preprocess and transform the data
- Split data into train, validation, and test sets
- Define the architecture of the model
- Define the loss function
- Define the optimizer with a learning rate
- Train the model for a specific number of epochs and in batches
- Evaluate the model
- Fine-tune the model (Optional)
- Make predictions on new data (Inference)
Step 1: Load the data
Data is the foundation of successful machine learning models. The phrase ‘Data! Data! Data! I cannot make bricks without clay’ is attributed to Sherlock Holmes, highlighting the crucial role of data in the model development process.
We start by loading the data, which can come in various forms such as CSV files or databases, making it available for our model to process. In Python, a popular language in the machine learning community, the pandas library is often used to load and manipulate this data as DataFrames.
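Here is a minimal sketch of this step, assuming the data lives in a CSV file (the file name `data.csv` is just a placeholder):

```python
import pandas as pd

# Load a CSV file into a DataFrame (the file name is a placeholder)
df = pd.read_csv("data.csv")

# Take a quick look at the first few rows and the column types
print(df.head())
print(df.info())
```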
Step 2: Preprocess and transform the data
This step involves two aspects:
- Transforming Data with Libraries like PyTorch and TensorFlow: In machine learning, we often utilize libraries like PyTorch (by Facebook) or TensorFlow (by Google) to build and train models. To effectively use these libraries, we need to transform our data into specific data types, such as tensors (a commonly used data structure). Converting the data into tensors allows us to perform mathematical operations and other computations efficiently, enabling seamless integration with these powerful machine learning frameworks.
- Handling Data Issues: Real-world data comes with challenges that can hinder model training and performance. Common issues include missing values, numerical features on very different scales, and categorical variables that models cannot consume directly. To address these problems, we must preprocess the data before feeding it to our machine learning models. By handling missing values, scaling features, and encoding categorical variables, we ensure that the data is in a suitable format for training and that our models can learn effectively from the available information, as shown in the sketch below.
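A minimal sketch of both aspects, continuing from the DataFrame loaded in Step 1 (the column names `age` and `city` are hypothetical):

```python
import pandas as pd
import torch
from sklearn.preprocessing import StandardScaler

# Drop rows with missing values (one of several possible strategies)
df = df.dropna()

# Scale a numerical feature ("age" is a hypothetical column name)
scaler = StandardScaler()
df[["age"]] = scaler.fit_transform(df[["age"]])

# One-hot encode a categorical feature ("city" is hypothetical)
df = pd.get_dummies(df, columns=["city"])

# Convert the processed table to a float tensor for PyTorch
features = torch.tensor(df.to_numpy(dtype="float32"))
```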
Step 3: Split data into train, validation, and test sets
Let’s understand two scenarios:
In the real world, machine learning models are often trained on massive datasets consisting of billions of data points (for simplicity, think of millions of rows in a table). Training such models can be a resource-intensive process, consuming days of compute time, enormous memory, and significant money.
Machine learning models go through multiple iterations on the same data, known as epochs. Each epoch helps the model learn and improve its performance for subsequent runs. However, this repetitive learning on the same data raises concerns about how well the model will perform on new, unseen data. Will it generalize well and deliver satisfactory results in real-world situations?
To address these concerns, data is split into three distinct groups:
- Training Set: This data subset is used to train the model. During training, the model learns from this data, adjusting its parameters to fit the patterns present in the training set.
- Validation Set: The validation set serves as a checkpoint during the training process. After each epoch, the model is tested on this unseen data to assess its performance on new instances. If the model’s performance is unsatisfactory, adjustments can be made to its architecture, hyperparameters, or other elements to improve its generalization capabilities.
- Testing Set: Once the model’s training is complete, it faces the ultimate challenge — evaluation on completely new, unseen data. The testing set simulates real-world scenarios, providing insights into the model’s ability to handle novel situations. Evaluating the model’s performance on the testing set provides a measure of its real-world effectiveness.
By splitting the data into these three groups, we create a robust training and evaluation pipeline. The model’s performance on the testing set offers valuable feedback on its generalization capabilities, ensuring that it can make accurate predictions on new data beyond the training and validation sets. Ultimately, this process helps us build machine learning models that are both efficient and effective in real-world applications.
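A common way to implement this split is scikit-learn's `train_test_split`, applied twice. In this sketch, `X` (features) and `y` (labels) are hypothetical arrays, and the 70/15/15 ratio is just one reasonable choice:

```python
from sklearn.model_selection import train_test_split

# First carve out 30% of the data, then split that portion in half,
# giving roughly 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42
)
```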
Step 4: Define the architecture of the model
In the next step, we define the architecture of the model, specifying the number of layers, the number of neurons in each layer, and the activation function used in each layer. Further details about this process will be covered in upcoming articles. Today we are focusing on the steps, not the implementation and internal workings.
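For a flavor of what this looks like, here is a minimal PyTorch sketch of a small feed-forward network; the layer sizes and activation choices are arbitrary, not a recommendation:

```python
import torch.nn as nn

# A small feed-forward network; all sizes here are arbitrary choices
model = nn.Sequential(
    nn.Linear(20, 64),   # 20 input features -> 64 hidden units
    nn.ReLU(),           # activation function
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 2),    # 2 output classes
)
```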
Step 5: Define the loss function
The loss function is a mathematical function that measures the difference between the model's predicted values and the actual values. During training it is computed on the training set to drive weight updates, and after each epoch it can also be computed on the validation set to monitor how well the model is generalizing.
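In PyTorch, common losses are available as ready-made modules. A small sketch for a classification task (the toy numbers exist only to show the shapes involved):

```python
import torch
import torch.nn as nn

# Cross-entropy for classification; nn.MSELoss() would suit regression
loss_fn = nn.CrossEntropyLoss()

# Toy illustration: two samples, two classes
predictions = torch.tensor([[2.0, 0.5], [0.3, 1.8]])  # raw model outputs (logits)
targets = torch.tensor([0, 1])                        # true class indices
loss = loss_fn(predictions, targets)
print(loss.item())  # a single number: lower means better predictions
```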
Step 6: Define the optimizer with a learning rate
Now that we have an understanding of how the loss function provides insights into the performance of our model, the next crucial step is optimization. Optimization techniques are employed to adjust the weights (parameters) of the model based on the loss. The objective of optimization is to iteratively update the weights in a way that reduces the loss in the subsequent runs (epochs).
The learning rate plays a significant role in the optimization process. It determines the magnitude of weight adjustments during each update. At each step, the optimizer computes, for every weight, the direction and size of the change suggested by the loss (the gradient), and the learning rate scales that change: new weight = old weight - learning rate × gradient. This scalar factor is set by us, the model builders, and is usually assigned a small value, commonly somewhere between 0.001 and 0.05.
By carefully selecting an appropriate learning rate, we can strike the right balance between fast convergence and overshooting the optimal weight values. An optimal learning rate enables the model to efficiently navigate the weight space and converge towards a solution that minimizes the loss, leading to a well-trained and high-performing machine learning model.
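In PyTorch, defining an optimizer with a learning rate is a single line; a sketch assuming the `model` from Step 4:

```python
import torch.optim as optim

# Stochastic gradient descent: each update applies
#   weight = weight - learning_rate * gradient
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Adam is a popular alternative that adapts the step size per weight:
# optimizer = optim.Adam(model.parameters(), lr=0.001)
```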
Step 7: Train the model for a specific number of epochs and in batches
When dealing with datasets containing billions of data points (rows), two main challenges arise. First, computing mathematical outputs for all these inputs at once can be time-consuming and memory-intensive for machine learning models. Second, waiting until the end of the entire dataset to know the loss and update weights can be inefficient.
To address these challenges, a wise approach is to use batching. Batching involves breaking down the large dataset into smaller subsets or batches. Instead of processing the entire dataset at once, the model runs on each batch, calculates the loss, and updates the weights in an iterative manner.
For instance, suppose we have a dataset with 10 billion data points. We can divide it into batches of 2 billion data points each. The model runs through the first batch (data points 1 to 2 billion), calculates the loss, and updates the weights. Then it proceeds to the next batch (data points 3 to 4 billion), and so on, until the last batch (data points 9 to 10 billion) has been processed. This entire process, covering all batches, constitutes one epoch. If we want to train the model for 5 epochs, the same procedure is repeated five times. (These numbers are only for illustration; in practice, batch sizes are much smaller, often between 32 and a few thousand examples.)
By employing batching, we can optimize the training process, efficiently manage computational resources, and ensure that the model converges to a suitable solution without being overwhelmed by the sheer size of the dataset.
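Putting Steps 4 to 7 together, here is a minimal PyTorch training-loop sketch. It assumes the `features` tensor from Step 2, a hypothetical `labels` tensor of integer class indices, and the `model`, `loss_fn`, and `optimizer` defined above; the batch size of 64 and the 5 epochs are illustrative:

```python
from torch.utils.data import DataLoader, TensorDataset

# Wrap tensors in a dataset and let the DataLoader serve it in batches
dataset = TensorDataset(features, labels)  # labels: integer class indices
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for epoch in range(5):                      # 5 epochs, as in the example above
    for batch_features, batch_labels in loader:
        optimizer.zero_grad()               # clear gradients from the last batch
        outputs = model(batch_features)     # forward pass on this batch only
        loss = loss_fn(outputs, batch_labels)
        loss.backward()                     # compute gradients
        optimizer.step()                    # update the weights
    print(f"epoch {epoch + 1}: loss = {loss.item():.4f}")
```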
Step 8: Evaluate the model
The evaluation process provides an unbiased estimate of the model’s ability to generalize to new, unseen data. The choice of evaluation metrics depends on the specific problem type. For classification tasks, metrics like accuracy, precision, recall, and F1-score are commonly used. On the other hand, for regression tasks, metrics such as mean absolute error and root mean squared error are commonly employed.
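With scikit-learn, these classification metrics are one call each. A sketch assuming a binary classification task, where `y_true` holds the test set's actual labels and `y_pred` the model's predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true: actual labels from the test set; y_pred: the model's predictions
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
```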
Step 9: Fine-tune the model (Optional)
To improve the model’s performance, we can tune hyperparameters such as the number of epochs, the learning rate, and other settings that are chosen by us before training rather than learned from the data. These hyperparameters play a crucial role in shaping the model’s behavior and overall accuracy.
Fine-tuning the model involves experimenting with different hyperparameter values to achieve the highest possible accuracy. By systematically adjusting these hyperparameters, we can observe how the model responds and identify the optimal configuration that yields the best results. This iterative process of fine-tuning allows us to unlock the full potential of the model and improve its performance on the task at hand.
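One simple, if brute-force, way to do this is a small grid search. In this sketch, `train_and_validate` is a hypothetical helper that trains the model with the given learning rate and returns its accuracy on the validation set:

```python
# Try a few candidate learning rates and keep the best one.
# train_and_validate() is a hypothetical helper: it trains the model
# with the given learning rate and returns validation accuracy.
best_lr, best_accuracy = None, 0.0
for lr in [0.001, 0.005, 0.01, 0.05]:
    accuracy = train_and_validate(learning_rate=lr)
    if accuracy > best_accuracy:
        best_lr, best_accuracy = lr, accuracy
print(f"best learning rate: {best_lr} (validation accuracy {best_accuracy:.3f})")
```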
Step 10: Make predictions on new data (Inference)
Once we are satisfied with our model, we deploy it into production, where it operates on real-world data it has never seen before. In this production environment, the model is used to make predictions, aid decision-making, and perform various other tasks that address practical scenarios and real-time challenges.
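A minimal PyTorch sketch of saving a trained model and using it for inference; the file name `model.pt` and the `new_data` tensor are placeholders:

```python
import torch

# Save the trained weights, then reload them later in production
torch.save(model.state_dict(), "model.pt")      # file name is a placeholder
model.load_state_dict(torch.load("model.pt"))

model.eval()                                    # switch to inference mode
with torch.no_grad():                           # no gradients needed for predictions
    outputs = model(new_data)                   # new_data: unseen feature tensor
    predicted_classes = outputs.argmax(dim=1)   # pick the most likely class
```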
Thank You
Thank you for taking the time to read this article. I hope it provided you with a clear understanding of the concepts and their implementation.
I value your feedback! If you have any comments or questions, please feel free to share them in the comments or email me directly.
Python has emerged as the most popular choice for machine learning due to its versatility, ease of use, and robust ecosystem of libraries. According to a survey conducted by Kaggle, a leading data science community, a staggering 78.8% of data scientists and machine learning practitioners prefer Python as their primary programming language.
I have published six articles that provide a comprehensive understanding of Python. Ensure you utilize this knowledge to its fullest potential.
Link for Part 1: https://medium.com/@siddp6/python-programming-language-part-1-6-8b937f7297bf
Link for Part 2: https://siddp6.medium.com/python-programming-language-part-2-6-403dabaa7c6a
Link for Part 3: https://medium.com/@siddp6/python-programming-language-part-3-6-ab0af8000e27
Link for Part 4: https://medium.com/@siddp6/conditionals-and-loops-python-programming-language-part-4-6-b5b1a8c9521e
Link for Part 5: https://siddp6.medium.com/functions-in-python-programming-language-part-5-6-5c2c5b1df5fe
Link for Part 6: https://siddp6.medium.com/classes-and-object-oriented-programming-oop-in-python-programming-language-part-6-6-4e2fca5e1eb9
Copyright © 2023 Siddhartha Purwar. All rights reserved. Portions of this content were enhanced grammatically and refined with the assistance of ChatGPT, an AI language model by OpenAI.