
Step-by-Step Building Block For Machine Learning Models – Analytics India Magazine

Machine learning is a process in which a machine learns hidden patterns from data and uses them to make predictions. It is a subset and an application of artificial intelligence. Machine learning has many real-life use cases today. In the banking sector, for example, lenders use machine learning models to predict whether a loan applicant is likely to default, and websites that generate your credit score also rely on machine learning for their calculations. Two of the main types of task in machine learning are classification and regression. In classification, a predictive model is trained to assign data to discrete classes, such as identifying different fruits from images passed to the model. In regression, a model is built to predict a continuous variable, such as the next day's temperature.

In this article, we focus on classification and walk through the steps required to build a classification model. We will use the Iris data set, which is publicly available for download from the UCI Machine Learning Repository. The data set contains the length and width of the sepals and petals of each flower, together with its species. We will build a machine learning model that predicts which species a flower belongs to when we pass these measurements to the model.

What Will You Learn From This Article? 

- Importing data from csv files
- Exploratory data analysis
- Data visualisation
- Splitting data into training and testing
- Building machine learning models
- Predictions by the models
- Model evaluation

Importing the data from csv files

The pandas package provides functions that are widely used for importing data sets in different formats, such as read_csv for csv files and read_excel for xlsx files. We will first import the pandas package, then read the data with read_csv and print the first 10 rows.

import pandas as pd

df = pd.read_csv('iris.csv')

print(df.head(10))



Exploratory Data Analysis 

Exploratory data analysis (EDA) is the process of exploring a data set to get familiar with it: we find out its shape, missing values, data types, and so on. All the checks performed on the data before building a machine learning model come under EDA. We will now explore the data set we just imported, starting with some basic checks.
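A minimal sketch of these basic checks could look like the following. It assumes scikit-learn's bundled copy of the Iris data (load_iris) as a stand-in for iris.csv, so the feature column names may differ from those in your file. The species column is rebuilt here only to mimic the layout described in this article.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})

# Map the integer targets back to species names so the column is categorical
df["species"] = df["species"].map(dict(enumerate(iris.target_names)))

print(df.shape)            # number of rows and columns
df.info()                  # column names, dtypes and non-null counts
print(df.isnull().sum())   # missing values per column
```

If you load the data from your own csv file, only the first three statements change; the three checks at the end stay the same.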






The data has 150 rows and 5 columns, with no null values. The species column is of object type, and all the other columns hold float values. We now transform the categorical species column into numeric labels using LabelEncoder. Use the code shown below to do the conversion.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['species'] = le.fit_transform(df['species'])
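To see what the encoder does, here is a small sketch on a hand-made list of labels (the list is illustrative, not taken from the data set). LabelEncoder sorts the distinct class names and assigns each one an integer code, and the mapping can be reversed with inverse_transform.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; not rows from the Iris data
labels = ["setosa", "virginica", "versicolor", "setosa"]

le = LabelEncoder()
codes = le.fit_transform(labels)

print(list(le.classes_))                     # classes in sorted order
print(list(codes))                           # integer code for each label
print(list(le.inverse_transform([2, 0])))    # codes mapped back to names
```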



We will now check the descriptive statistics of the data and the correlation between its columns.
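One way to compute both is sketched below. It again assumes scikit-learn's bundled Iris data in place of iris.csv, where the target is already stored as integer codes, which matches the state of the data after label encoding.

```python
from sklearn.datasets import load_iris

# Assumption: bundled Iris data, with the integer target renamed to 'species'
df = load_iris(as_frame=True).frame.rename(columns={"target": "species"})

# Count, mean, standard deviation, min, quartiles and max for each column
desc = df.describe()
print(desc)

# Pairwise Pearson correlation between all numeric columns
corr = df.corr()
print(corr)
```

Values close to 1 or -1 in the correlation table indicate columns that move together strongly, which is worth knowing before choosing features.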





Data Visualisation

Data visualisation is the graphical representation of data, used to check for outliers, patterns, the distribution of the data, and so on. Python has several data visualisation libraries, including matplotlib and seaborn. We will use the seaborn library: we first import it and then draw a pairplot of the data with the code below.

import seaborn as sns

sns.pairplot(df, hue='species')


Splitting data into training and testing

Before building a machine learning model, the data is split into two parts, called training and testing. The model is trained only on the training data; the testing data is never exposed to it during training. Once the model has been trained, we use it to compute predictions over the testing data and store them in a variable, conventionally named y_pred, although any name works. We will first define the independent variables X and the dependent variable y, and then split the data. Use the code below to do the same.




X = df.drop('species', axis=1)

y = df['species']



from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30 , random_state=1)






We have split the data and checked the shape of training as well as testing data.
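The shape check can be sketched as follows, assuming scikit-learn's bundled Iris data in place of iris.csv. With 150 rows and test_size=0.30, 45 rows go to the testing set and the remaining 105 to the training set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data stands in for the csv file used in the article
X, y = load_iris(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Rows and columns of the feature sets, then lengths of the label sets
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```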

Building Machine Learning Models

We will now build machine learning models using two different algorithms: Logistic Regression and Random Forest. Logistic regression is a linear model, whereas random forest is an ensemble method. We will first import each one and fit it on the training data. Once a model is trained, we compute predictions over the testing data and store them in separate variables. Use the code below to do the same.

Model 1

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)

Model 2

from sklearn.ensemble import RandomForestClassifier

rfcl = RandomForestClassifier()

rfcl.fit(X_train, y_train)

y_pred_rf = rfcl.predict(X_test)

Prediction by the Models 

We will now compute predictions for a few rows and check whether the models predict them correctly. We will predict rows 10 to 15 of the training data with model 1 and rows 15 to 20 with model 2, and then compare the predictions with the actual classes.

print("Prediction by model 1: ", lr.predict(X_train.iloc[10:15]))

print("\nActual Labels: \n", y_train.iloc[10:15])


print("Prediction by model 2: ", rfcl.predict(X_train.iloc[15:20]))

print("\nActual Labels: \n", y_train.iloc[15:20])


We can see that both models gave the correct predictions for the rows we selected.
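A side-by-side table can make this comparison easier to read. The sketch below is one possible way to build it, assuming scikit-learn's bundled Iris data in place of iris.csv; the comparison frame and its column names are illustrative. Note that these rows come from the training data, so a match here says little about performance on unseen data.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data stands in for the csv file used in the article
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# random_state is fixed here only to make the sketch reproducible
rfcl = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# Put actual and predicted labels for rows 15 to 20 next to each other
rows = slice(15, 20)
comparison = pd.DataFrame({
    "actual": y_train.iloc[rows].to_numpy(),
    "predicted": rfcl.predict(X_train.iloc[rows]),
})
comparison["match"] = comparison["actual"] == comparison["predicted"]
print(comparison)
```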

Model Evaluation 

Model evaluation is the process of checking the performance of a model by computing different metrics. Metrics such as accuracy and the confusion matrix are used for classification tasks, while mean squared error and mean absolute error are used for regression tasks. Since we built our models for classification, we will compute metrics that are used to evaluate classification models.
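To make the relationship between these metrics concrete, here is a small sketch on hand-made labels (illustrative values, not model output). In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so the correct predictions sit on the diagonal and accuracy is the diagonal sum divided by the total count.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels only: one sample of class 1 is misclassified as class 2
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# scikit-learn metrics take the true labels first, then the predictions
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Accuracy recovered from the matrix: correct predictions / all predictions
acc_from_cm = np.trace(cm) / cm.sum()
print(acc_from_cm, accuracy_score(y_true, y_pred))
```

Passing the arguments in the other order leaves accuracy unchanged but transposes the confusion matrix, which swaps precision and recall in a classification report.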

We will first compute the accuracy score, followed by the confusion matrix and the classification report. Use the code below to compute them.

Accuracy Score

from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

print("Logistic Regression: ", accuracy_score(y_test, y_pred_lr))

print("Random Forest: ", accuracy_score(y_test, y_pred_rf))


Confusion Matrix

print("Logistic Regression: \n", confusion_matrix(y_test, y_pred_lr))

print("\nRandom Forest: \n", confusion_matrix(y_test, y_pred_rf))


Classification Report

print("Logistic Regression: \n", classification_report(y_test, y_pred_lr))

print("\nRandom Forest: \n", classification_report(y_test, y_pred_rf))



To conclude, I hope you now understand every step required to build a machine learning model. We built a classification model to classify the species of a flower and then evaluated it using different metrics. You can now read the article “Hands-on-Guide to machine learning model deployment using Flask” to learn how to deploy these models and check their performance in real time, and the article “Model Evaluation and Error Metrics” to learn more about error metrics for model evaluation.