Is Data Cleaning Tedious? Basic Approaches to Data Preprocessing and Exploration with Pandas
“The quality of inputs decides the quality of outputs”.
In Python, Pandas is mostly used for structured data operations and preprocessing, since the data found in industry is often noisy and inconsistent. Once our business hypothesis is ready, it is time to load the data! Data generally comes in various formats: structured if we are dealing with enterprise databases and warehouses, semi-structured if it is stored in XML or JSON, and unstructured if it is not organized or does not follow a pre-defined model.

While doing data analysis with Python, Pandas is considered the most useful and powerful library. But before moving on to analysis and model building, it is crucial to check whether the data is clean and preprocessed. Most data scientists spend the majority of their time (often cited as 60-70%) cleaning data, and the rest of the time complaining about dirty data, because cleaning is the most time-consuming task and requires strenuous effort. Here, I would like to share some techniques I use for data preprocessing and manipulation, building generic functions to optimize these tasks.

Basically, data preprocessing includes cleaning and transforming the variables or features used for building predictive models. Once the data is in the right format and cleaned, a handful of techniques cover most day-to-day operations and analysis. With no further delay, here are some of the frequently used techniques, and how short code snippets (collected at the end of this article) can achieve them in no time.
1. Remove multiple columns: The data we get may contain many columns that are irrelevant to our hypothesis. Removing them can help us build accurate models and improve their performance.
2. Check missing values: No data is perfect. Missing values are always present, due to extraction errors or unfilled fields, and it is imperative to check for inconsistencies and noise.
3. Remove missing values in columns: The easiest ways to deal with noisy data are to remove the affected rows or to impute values with statistical methods.
For instance, if we have a DataFrame 'Customer' with a column 'first name' that contains missing values, we can write a function that takes any column as a parameter, removes the rows with missing values, and returns a new DataFrame.
4. Convert categorical variables to numerical variables: Some machine learning models require variables in numerical format. For instance, in a binary classification task for loan approval in banks, the label is Yes or No; we need to convert it to numerical values to use logistic regression in our model.
5. Create a frequency table: When dealing with a categorical variable, we can use a generic function that returns the frequency count for that column. It is a first step in describing a column and makes exploration easier.
6. Group columns: Grouping and aggregating is one of the most useful ways to summarize columns. For instance, given records of people with a gender category, we can summarize it with aggregate functions to gauge how important it is for our model.
7. Remove white space in columns: Values in a column can contain leading white space, which looks messy and can break comparisons.
8. Change data types: When a dataset gets larger, we may need to change data types to save memory; sometimes numerical values are also stored under other data types.
9. Remove strings in columns: Sometimes our data contains newline or other unwanted characters, which can easily be removed.
10. Create discrete values from numerical variables: For instance, a numerical value like a person's 'income' can be binned into ranges to create a new categorical feature.
11. Generate cross tabs: To get an initial view of the data and test our basic hypotheses, we can also use the crosstab function of pandas.

These are some of the techniques we can use to save time cleaning and exploring our data. Writing generic functions lets us reuse them later in any project where they apply. In every data science project, the most significant part before implementing a model is understanding the business requirement and getting the most out of the data. This is only possible when we are well informed about how the data was collected and who the audience and targeted customers are, so that cleansing becomes more efficient. It is wise to remove inconsistencies when the dataset is large, but sometimes losing this information can hurt the performance of our model and hence diminish the business value.
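The generic functions below implement each of the numbered techniques above. Column names such as 'column_1', 'col_1', and 'income' are placeholders to adapt to your own schema, and the brief usage lines after each function are hypothetical examples rather than calls against any real dataset.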
import pandas as pd

def drop_columns(column_list, df):
    '''
    Remove multiple columns that are not required for analysis.
    column_list is the list of column names and df is the DataFrame.
    '''
    df.drop(column_list, axis=1, inplace=True)
    return df
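# Hypothetical usage (column names assumed): customer = drop_columns(['customer_id', 'row_num'], customer)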
def check_missing(df):
    '''
    Check missing values in columns.
    Pass either the whole DataFrame or a single column; the counts are returned.
    '''
    return df.isna().sum()
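# Hypothetical usage: check_missing(customer) or check_missing(customer['first name'])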
def remove_missing(df, col):
    # Remove rows with missing values in the given column and return a new DataFrame
    return df[df[col].notnull()]
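# Hypothetical usage: customer = remove_missing(customer, 'first name')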
def convert_cat2num(df):
    # Convert categorical values to numerical codes
    encode_num = {'column_1': {'YES': 1, 'NO': 0},
                  'column_2': {'High': 2, 'Mid': 1, 'Low': 0}}
    df.replace(encode_num, inplace=True)
    return df
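# Hypothetical usage (assumes columns 'column_1' and 'column_2'): df = convert_cat2num(df)
# For one-hot encoding instead of ordinal codes, pd.get_dummies(df) is an alternative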
def frequency_table(df):
    # Return a frequency table; pass normalize=True to value_counts for proportions
    return df.value_counts()
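# Hypothetical usage: frequency_table(customer['gender'])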
def groupby(df, col_1):
    # Summarize columns by grouping on col_1 and counting
    return df.groupby(col_1).count()
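# Hypothetical usage: groupby(customer, 'gender')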
def remove_wspace(df, col):
    # Remove leading white space from a string column
    df[col] = df[col].str.lstrip()
    return df
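# Hypothetical usage: customer = remove_wspace(customer, 'first name')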
def change_dtypes(col_obj, col_float, df):
    # Change data types, e.g. cast an object column to int64
    # and a float64 column to float32 to save memory
    df[col_obj] = df[col_obj].astype('int64')
    df[col_float] = df[col_float].astype('float32')
    return df
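# Hypothetical usage: customer = change_dtypes('age', 'income', customer)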
def remove_col_str(df):
    # Remove newline characters from the 'col_1' text column
    df['col_1'] = df['col_1'].replace('\n', '', regex=True)
    return df
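# Hypothetical usage (assumes a 'col_1' text column): df = remove_col_str(df)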
def discrete_numerical(df):
    # Bin numerical 'income' values into ranges to create a categorical feature
    income_bin = [0, 5000, 25000, 40000, 80000]
    df['income'] = pd.cut(df['income'], bins=income_bin,
                          labels=['Low', 'Medium', 'High', 'Very High'])
    return df
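# Hypothetical usage (assumes an 'income' column): customer = discrete_numerical(customer)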
def cross_tab(df, col_1, col_2):
    # Generate a cross tabulation with row/column totals (margins)
    return pd.crosstab(df[col_1], df[col_2], margins=True)
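# Hypothetical usage: cross_tab(customer, 'gender', 'income')

To see how these generic functions chain together, here is a minimal end-to-end sketch on a made-up 'customer' DataFrame. All column names and values below are assumptions for illustration, not from any real dataset.

import pandas as pd

# Made-up customer data for illustration only
customer = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'first name': [' amy', 'bob', None, ' dan'],
    'gender': ['F', 'M', 'M', 'M'],
    'income': [3000, 30000, 50000, 12000]
})

customer = drop_columns(['customer_id'], customer)  # 1. drop irrelevant columns
print(check_missing(customer))                      # 2. count missing values per column
customer = remove_missing(customer, 'first name')   # 3. drop rows missing a name
customer = remove_wspace(customer, 'first name')    # 7. strip leading spaces
print(frequency_table(customer['gender']))          # 5. frequency counts
customer = discrete_numerical(customer)             # 10. bin income into ranges
print(cross_tab(customer, 'gender', 'income'))      # 11. cross tabulation with totals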