Practical:-2
AIM:-Data Preprocessing using Scikit Learn
This blog covers the basics of data preprocessing.
What is data preprocessing?
Data preprocessing is an important step in the data mining process. The phrase “garbage in, garbage out” is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and missing values, etc.
There are a lot of preprocessing methods but we will mainly focus on the following methodologies:
(1) Encoding the Data
(2) Normalization
(3) Standardization
(4) Imputing the Missing Values
(5) Discretization
About the Data Set:-
The data set is about Credit Card Approval Prediction. Credit score cards are a common risk control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings, so the bank can decide whether to issue a credit card to the applicant. Credit scores objectively quantify the magnitude of risk. The data set contains 19 columns.
Get More Information and Download Dataset from HERE.
Encoding:-
Label Encoding refers to converting the labels into the numeric form so as to convert them into the machine-readable form. Machine learning algorithms can then decide in a better way how those labels must be operated. It is an important pre-processing step for the structured dataset in supervised learning.
There are two types of encoders we will discuss here.
1. LabelEncoder:- When a dataset contains categorical features, we can use a Label Encoder to convert those categories into numerical values.
Here you can see the Female and Male labels will be transformed into 0 and 1.
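A minimal sketch of this step with scikit-learn's LabelEncoder (the gender values are illustrative, not taken from the actual data set). Note that LabelEncoder assigns codes in alphabetical order of the classes:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative categorical column
labels = ["Female", "Male", "Male", "Female"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# Classes are sorted alphabetically: "Female" -> 0, "Male" -> 1
print(list(encoded))      # [0, 1, 1, 0]
print(list(le.classes_))  # ['Female', 'Male']
```

The fitted encoder can also reverse the mapping with `le.inverse_transform(encoded)`, which is handy when reporting predictions back in the original labels.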
2. OneHotEncoder:-
One Hot Encoder serves the same purpose but in a different way. While a Label Encoder assigns a single number to each category, a One Hot Encoder creates a whole new binary column for each category.
Normalization:-
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information.
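One common way to do this in scikit-learn is min-max scaling, which rescales each numeric column to the [0, 1] range (the sample values below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative numeric column with very different magnitudes
X = np.array([[1.0], [5.0], [10.0]])

scaler = MinMaxScaler()           # maps each column to [0, 1]
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())           # min -> 0.0, max -> 1.0
```

The relative spacing between values is preserved; only the scale changes, so no information about the ranking of the samples is lost.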
Standardization:-
Standardization is the process of rescaling features so that each has a mean of 0 and a standard deviation of 1 (z-score scaling). Unlike normalization, it does not bound values to a fixed range, which makes it a common default when features follow roughly bell-shaped distributions or when the model assumes centered inputs.
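A minimal sketch with scikit-learn's StandardScaler (sample values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative numeric column
X = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()          # z = (x - mean) / std
X_std = scaler.fit_transform(X)

# After transforming, the column has mean ~0 and standard deviation ~1
print(X_std.mean(), X_std.std())
```

As with any scaler, `fit` should be called on the training data only, and the same fitted scaler reused to `transform` the test data, to avoid leaking test statistics into training.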
Imputing the Missing Values:-
Null values can be handled in two ways. One option is deletion: drop a row if it has a null value for a particular feature, or drop a column if more than 70–75% of its values are missing. Deletion is advised only when there are enough samples in the data set. The other option is imputation: fill in the missing values with a statistic of the column, such as its mean, median, or most frequent value.
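A minimal sketch of mean imputation with scikit-learn's SimpleImputer (sample values are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative column with one missing value
X = np.array([[1.0], [np.nan], [3.0]])

imp = SimpleImputer(strategy="mean")  # also supports "median", "most_frequent", "constant"
X_imp = imp.fit_transform(X)

# The NaN is replaced by the mean of the observed values: (1 + 3) / 2 = 2
print(X_imp.ravel())  # [1. 2. 3.]
```

For categorical columns, `strategy="most_frequent"` is the usual choice, since a mean is not defined for category labels.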
Discretization:-
In applied mathematics, discretization is the process of transferring continuous functions, models, variables, and equations into discrete counterparts. This process is usually carried out as a first step toward making them suitable for numerical evaluation and implementation on digital computers.
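In preprocessing terms, this means binning a continuous feature into a small number of intervals. A minimal sketch with scikit-learn's KBinsDiscretizer, using equal-width bins and ordinal codes (sample values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Illustrative continuous column
X = np.array([[0.0], [2.0], [4.0], [6.0]])

# 3 equal-width bins over [0, 6]: [0, 2), [2, 4), [4, 6]
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = disc.fit_transform(X)

print(X_binned.ravel())  # [0. 1. 2. 2.]
```

Setting `strategy="quantile"` instead would place roughly the same number of samples in each bin, which is often preferable for skewed features.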
Conclusion
We learned about Encoding, Normalization, Standardization, Imputing the Missing Values, and Discretization.
GitHub Link:-
https://github.com/YagnikBavishi/DataScience/tree/main/Data_Preprocessing_PR2