If the goal of a data scientist is to tell a story, there is nothing better than knowing the roots of its characters.

When I decided to enter the world of data science, I received an early piece of advice: a data scientist spends half the time cleaning data, so I should look for courses that tackle this topic in detail.

Although it is possible to find plenty of material covering data cleaning and ETL in general, I took this idea and decided to abstract it a little. Trying different Python tools (or whatever languages and/or software you prefer) is great and absolutely necessary to learn this skill. More important for my personal learning curve, however, was making the process enjoyable, as part of understanding something better.

Get to know where the data comes from

If you are getting some raw (or partially clean) data, that is a great opportunity to understand what is going on. More than just having a mindset of transforming everything into integers, dates, lower-case strings, and so on, seeing the way the data was constructed gives you good context on how it was obtained, and therefore lets you think about other possible inconsistencies.

As an example, let’s create some fake data.

import pandas as pd

d = {'names': ['John', 'John', 'Jonh', 'Jon', 'johm'], 'values': ['3', '6', '4', '4', '47']}
df = pd.DataFrame(data=d)
df
  names values
0 John       3
1 John       6
2 Jonh       4
3 Jon        4
4 johm      47

If we expect all the names to be “John”, we have some obvious string-fixing tasks to do. But more than that, we also get a clue about the rest of the data.

Take, for instance, John and Jonh. The letters n and h look similar, and depending on how the data was obtained, a mistake could have been made when importing it into the raw file. If the data was scraped from poorly scanned PDF files that were originally handwritten, this could easily have happened. While it doesn’t affect this particular variable much, it now throws some doubt on the other columns. Did this same scraper get the values column right, or might it have errors as well? If the source was handwritten, switching 4’s and 7’s is possible, not to mention the more alarming possibility that whoever wrote the original data was simply distracted, recorded wrong numbers, copied from the wrong place, and so on.
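
A quick way to surface these near-duplicates is to measure how similar each name is to the expected value. Here is a minimal sketch using Python’s standard difflib, reusing the small df built above; the 0.6 similarity threshold is an arbitrary assumption and should be tuned to your data.

import difflib

# Compare each lower-cased name against the expected value and keep a similarity ratio
expected = 'john'
df['similarity'] = df['names'].str.lower().apply(
    lambda s: difflib.SequenceMatcher(None, s, expected).ratio()
)

# Flag rows that are close to, but not exactly, the expected name;
# these deserve a manual look before any bulk correction
suspects = df[(df['similarity'] >= 0.6) & (df['names'].str.lower() != expected)]
print(suspects)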

Another example could be working with dates. Suppose you have a dataset that comes from an online form, and almost all the dates look nice and consistently formatted, like dd/MM/yyyy; you would infer that the date field of the form was probably a date picker. Then converting the column to datetime fails, because one of the values is in M-d-yy, suggesting that the date field was actually free text, probably with a hint on how to fill it in. And maybe the rest of the form was like that too, so now you can be more careful about handling missing values, outliers and misspellings.
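
A sketch of how that inspection could look in pandas: parsing with the expected format and errors='coerce' turns the misfits into NaT, so you can list them instead of letting a blanket conversion blow up. The dates series here is made up for illustration.

import pandas as pd

# Hypothetical column: most entries follow dd/MM/yyyy, one slipped in as M-d-yy
dates = pd.Series(['29/03/2020', '30/03/2020', '3-29-20', '31/03/2020'])

# Values that don't match the expected format become NaT instead of raising an error
parsed = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')

# The rows left as NaT are the ones that need a second look
print(dates[parsed.isna()])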

While this shouldn’t be enough to discard the dataset, it can shed some light on why some future analysis might yield unexpected or bad results. Also, having an idea of how the data travelled from its source to your computer will prepare you for what comes next.

Understanding the subject

During the first months of this year I have been working with some data from the COVID-19 pandemic, associating Google Mobility Reports with information made available by Johns Hopkins University regarding countermeasures adopted by each country, infection rates, tested population, and so on. While they are doing a great job of keeping track of the disease’s worldwide spread, getting accurate data, or any data at all, has been very tricky, considering that each country has its own policies, economic restrictions, political situation and other relevant factors.

Another contributor to this particular project had done some cleaning and analysis, looking for correlation between countermeasures (self-isolation, lockdown, among others), population mobility (using the Google data) and death rates. He noticed that some important values were missing in some of the columns, so he proceeded to fill the blanks or to remove the rows altogether, typical tasks when preparing a dataset. The final data looked like this:

df = pd.read_csv('covid_data.csv')
df.head(5)
Unnamed: 0	Symptomatic isolation - targeted	Symptomatic isolation - blanket	Asymptomatic isolation - targeted	Asymptomatic isolation - blanket	Domestic travel restriction	Nonessential business suspension	International travel restriction	Testing	Contact tracing	...	Date	Country	Confirmed Cases	Deaths	Retail & recreation	Grocery & pharmacy	Parks	Transit stations	Workplaces	Residential
0	1071	NaN	NaN	NaN	NaN	NaN	NaN	5.0	8284.0	NaN	...	2020-03-29	Romania	1815	43	-0.81	-0.53	-0.60	-0.72	-0.39	0.15
1	2746	NaN	NaN	NaN	NaN	NaN	0.5	NaN	2900.0	NaN	...	2020-03-29	Turkey	9217	131	-0.75	-0.39	-0.58	-0.71	-0.45	0.17
2	1875	NaN	NaN	NaN	2.0	NaN	1.0	4.0	3000.0	NaN	...	2020-03-29	Finland	1240	11	-0.52	-0.21	0.48	-0.59	-0.25	0.09
3	468	NaN	NaN	2.0	NaN	NaN	NaN	5.0	14901.0	NaN	...	2020-03-29	Japan	1866	54	-0.26	-0.07	-0.25	-0.41	-0.09	0.07
4	2947	NaN	NaN	NaN	NaN	NaN	0.5	6.0	43735.0	NaN	...	2020-03-29	Norway	4284	25	-0.65	-0.32	-0.05	-0.57	-0.34	0.11
5 rows × 35 columns

All these NaN are expected. They mean that the government of a particular country had not yet adopted the countermeasure referred to by the variable name by the time this analysis was done.

Then, the dataset was modified with the following lines:

df['Asymptomatic isolation - targeted'].fillna(0, inplace=True)
df['Asymptomatic isolation - blanket'].fillna(0, inplace=True)

These two columns refer to actions taken by the government to isolate people who were asymptomatic. Null values were replaced with 0, which looks fine: it means no action was taken.

df['Nonessential business suspension'].fillna(round(df['Nonessential business suspension'].mean(),1), inplace=True)
df['International travel restriction'].fillna(round(df['International travel restriction'].mean(),0), inplace=True)
df['Gatherings banned'].fillna(round(df['Gatherings banned'].mean(),0), inplace=True)

Here we notice a common practice: replacing null values with the mean of the column. While this might be a good option for continuous variables, all three of these columns are discrete (and rounding the result just masks the problem). International travel restriction is a variable ranging over the integers from 0 to 7, with 0 meaning “no travel restrictions” and 7 “all air travel cancelled”. Ultimately, a missing value in this variable means that the government did not put any restrictions in place, so the column should either be left empty or have its null values filled with 0’s, as in the previous case. It could be argued that if all countries but one restrict travelling, the lone country that didn’t will not be able to travel abroad anyway, so using the average makes sense. However, this dataset has only 50 countries, and there is great diversity in how and when each nation decided to act in this regard.
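
Following that reasoning, a sketch of the alternative would be to treat a missing countermeasure as “no measure adopted”, mirroring what was done for the isolation columns:

# Treat a missing countermeasure as "no measure adopted" (0) instead of
# pulling every country toward the sample mean
for col in ['Nonessential business suspension',
            'International travel restriction',
            'Gatherings banned']:
    df[col] = df[col].fillna(0)

The original script, however, continued with the same mean-filling approach for the next column: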

df['Testing'].fillna(round(df['Testing'].mean(),0), inplace=True)

This variable, Testing, refers to how many people were tested for the virus in each country. The data was obtained by Johns Hopkins from each country’s government website as the figures were made official.

One big issue with this data is that we are looking at countries that are very different in terms of social and economic development. This affects their politics, their healthcare systems and how information comes out of each place. For example, if we get a list of the countries, we see the following:

df['Country'].unique()
array(['Romania', 'Turkey', 'Finland', 'Japan', 'Norway', 'Moldova',
       'Canada', 'Mexico', 'Spain', 'Bosnia and Herzegovina', 'Australia',
       'Latvia', 'Malta', 'Ireland', 'Italy', 'United Kingdom', 'Iraq',
       'Israel', 'Slovakia', 'Sweden', 'Denmark', 'Czechia',
       'Netherlands', 'Belarus', 'Singapore', 'Egypt', 'Greece',
       'Namibia', 'Austria', 'Croatia', 'Bulgaria', 'Luxembourg',
       'Georgia', 'Estonia', 'Hungary', 'France', 'Poland', 'Hong Kong',
       'Taiwan', 'Kazakhstan', 'Nepal', 'Portugal', 'Lithuania',
       'South Korea', 'Switzerland', 'Germany', 'South Africa',
       'Slovenia', 'Belgium', 'Uganda'], dtype=object)

We have richer countries like Germany, Japan and South Korea, but we also get Nepal, Uganda and Namibia. Any attempt to average a variable that is fundamentally related to wealth (richer countries have more resources to test than poorer ones) will likely result in a biased value, for better or worse. This is a common problem when using, for example, GDP per capita: it simply divides a year’s total production by the population, giving equal weight to each individual, as if all the wealth were equally distributed.
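
A toy illustration of the point, with made-up numbers loosely inspired by the Testing column: one country with massive testing capacity drags the mean far above what most of the sample actually did.

import pandas as pd

# Made-up testing figures: four modest ones and one very large one
tests = pd.Series([3000, 4000, 6000, 9000, 316000])

print(tests.mean())    # 67600.0 -- higher than four of the five observations
print(tests.median())  # 6000.0  -- closer to the typical country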

Calculating averages has an intrinsic danger when dealing with samples this heterogeneous: while it can give an idea of the overall behaviour of a variable, it can fail to tell any realistic story about each observation, especially when the sample size is as small as this one. When we check the values of Testing for each country, we see the following (only the last 10):

df[['Country','Testing']].tail(10)
   Country         Testing
40 Nepal           34678.0
41 Portugal        34678.0
42 Lithuania        1154.0
43 South Korea    316664.0
44 Switzerland      4000.0
45 Germany        167000.0
46 South Africa     6438.0
47 Slovenia         9860.0
48 Belgium         18360.0
49 Uganda          34678.0

It is easy to infer the value of the mean when you see repeated results in a continuous variable that most likely would not repeat on its own. The average of Testing is 34678, and that value was applied to Nepal, Portugal, Uganda, and probably others. Much richer countries (at least in comparison to Uganda and Nepal), like Switzerland and Belgium, have far fewer tests in this dataset, which may point to a troublesome data source. Selecting the countries that have the exact same value, we see:

df[df['Testing'] == 34678]['Country']
5                    Moldova
9     Bosnia and Herzegovina
16                      Iraq
24                 Singapore
25                     Egypt
26                    Greece
27                   Namibia
30                  Bulgaria
31                Luxembourg
32                   Georgia
39                Kazakhstan
40                     Nepal
41                  Portugal
49                    Uganda
Name: Country, dtype: object

These are quite different countries, but what they have in common is that there were no recorded values for how many tests each country conducted.
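
If you inherit a dataset like this and suspect that some values were imputed, a quick frequency check surfaces them; a sketch, assuming the same df as above:

# A continuous count like Testing should rarely repeat exactly across countries;
# values that appear many times are good candidates for having been imputed
counts = df['Testing'].value_counts()
print(counts[counts > 1])

# Comparing the suspicious value with the column mean supports the suspicion
print(round(df['Testing'].mean(), 0))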

While getting the average of a variable can be useful for understanding the overall picture, replacing null values with it can lead to problems further down the road. If we proceed to a linear regression, for example, we will be feeding it incorrect values and might end up with a misleading correlation between this variable and the dependent one.
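
One lighter-touch alternative, sketched below, is simply to flag which rows actually reported a figure and leave the gaps as NaN, so that any later model or reader can treat real numbers and missing ones differently (the column name 'Testing reported' is just an illustrative choice):

# Keep track of which rows actually reported testing figures and leave the rest
# as NaN, so downstream steps can drop or impute them deliberately instead of
# silently inheriting the column mean
df['Testing reported'] = df['Testing'].notna()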

Conclusion

As an economist, I can perhaps spot some inconsistencies within a social or economic dataset. It would certainly take me much more time and proper research to do the same with a chemistry or computer hardware one.

Nevertheless, as much as data cleaning can look like a boring part of the job (which I disagree with), it is a great opportunity to learn the subject. It makes you familiar with what you are dealing with. Instead of racing to the fun machine learning parts, take some time to enjoy working on the data. Checking where all the values come from and what they represent can provide great insights for any analysis, future discussions and problem solving.