The 4 Cs: Correct, Complete, Create and Convert

Arnau Bagó · September 27, 2022

Data collected in real-life situations can contain errors or gaps. Trying to build a model directly with this data without first being filtered and corrected can lead to a poor performance start. With the four Cs (Correct, Complete, Create and Convert), you can avoid these problems and easily transform the dataset to build a much more robust model in the future.

1. Correct

The first step will be to identify anomalies in the existing data as outliers. This will be a visual task as one needs to be familiar with the parameters being checked. Panda’s function describe may be useful to find these outliers to later delete them if necessary.

# Descriptive statistics include those that
# summarize the central tendency, dispersion
# and shape of a dataset’s distribution, 
# excluding NaN values
print(df.describe()['list', 'of', 'parameters'])

2. Complete

Second step of the workflow will be to complete the missing information gaps of the dataset. Panda’s function isnull will return a list of all params in the dataset and the number of rows that contain empty data.

# Detect missing values.
print(df.isnull().sum())

There are three possible solutions to address information gaps in the dataset: remove features containing missing information, delete rows containing missing information, or complete the information.

To drop features with missing information function drop can be used.

# Drop feature with missing information
df.drop(['feature'], axis=1, inplace=True)

To drop rows with missing information first null values must be transformed to np.nan and later, using dropna function, selected rows can be discarted.

# Transform null values to np.nan
df['feature'].replace('', np.nan, inplace=True)
# Drop rows with missing information
df.dropna(subset=['feature'], inplace=True)

Missing data may also be filled in with statistics. Fillna function will be of help to fill the np.nan values. Median, mean or mode of column feature can be used to fill in the gaps.

# Transform null values to np.nan
df['feature'].replace('', np.nan, inplace=True)
# Complete missing information with the median
df['feature'].fillna(df['feature'].median(), inplace=True)

3. Create

After correcting and completing the dataset this should not contain empty values. Now it’s time to understand the data we have and try to create new features to enrich the dataset. It is hard to give a few guidelines in this step as the features are different for all datasets instead, let’s make an example. Imagine we have a dataset containing information about packages from a factory. We have several features with the measurement of each package, can we combine this information to create a more valuable feature? The answer is probably yes, we may use the measurements to create a new feature with the volume. New features may be of great value to make our model more robust so, don’t be afraid to spend as much time as you need in this step.

4. Convert

At this point our dataset will probably contain a mixture of numeric (e.g. volume of a package) and categoric features (e.g. city to send the package). Algorithms only understand numbers so it is time to convert the categoric features. One of this two aproaches can be used:

  • Label encoding. Each label is assigned a unique integer based on alphabetical ordering.
# Import preprocessing module from sklearn library
from sklearn import preprocessing
# label_encoder object knows how to understand word labels. 
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'City'. 
df['City']= label_encoder.fit_transform(df['City'])
  • One hot encoding. Creates additional features based on the number of unique values in the categorical feature. Every unique value in the category will be added as a feature.
# Import preprocessing module from sklearn library
from sklearn import preprocessing

# Given a dataset and a single feature, apply one hot 
# encoding to the feature
def oneHotFeature(df, feature):
    # one hot encoder object
    ohe = preprocessing.OneHotEncoder(handle_unknown='ignore')
    # create new dataframe with one hot features
    one_hot_df = pd.DataFrame(ohe.fit_transform(df[[feature]]).toarray())
    # trick to change column names to feature_n where n is a number
    columns = one_hot_df.columns
    for column in columns:
        one_hot_df.rename(columns={column: feature+"_"+str(column)},
         					inplace=True)
    
    # don't drop original column, join original and new 
    # one hot dataframe and let the user decide what to do
    return df.join(one_hot_df), one_hot_df.columns.values

Update 7/10/2022

To avoid increasing the number of features it’s common to avoid using one hot encoding leaving us with label encoding. Sometimes it is interesting to encode the categorical values using their frequency instead of simply assigning a unique integer based on the alphabetical order. This approach is called frequency encoding. The encoding will just replace all categorical features with its occurrence.

Continue with Dividing the data into train and test

Twitter