Global Socio-Economic Factors Part 1: Data Processing

An analysis of socio-economic factors affecting the world’s countries using PySpark ML
Author

Isa AlDoseri

Published

October 31, 2023

Introduction

In this dataset we study historical indicators of wealth, prosperity, and environmental harm (via CO2 production) across the world’s countries.

The data was taken from Kaggle. We will wrangle it into a form suitable for predictive modelling later on (next chapter).

import pandas as pd
from pathlib import Path

These are the datasets in question

data_folder = 'data'
!ls -1 {data_folder}/*.csv
data/co2_production.csv
data/gross_national_income_per_capital.csv
data/human_development_index.csv
data/life_expectancy_by_birth.csv

Datasets

Let’s read one of them and print the columns…

data_folder = Path(data_folder)
life_exp = pd.read_csv(data_folder/'life_expectancy_by_birth.csv')
life_exp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 37 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ISO3           195 non-null    object 
 1   Country        206 non-null    object 
 2   hdicode        191 non-null    object 
 3   region         151 non-null    object 
 4   hdi_rank_2021  191 non-null    float64
 5   le_1990        206 non-null    float64
 6   le_1991        206 non-null    float64
 7   le_1992        206 non-null    float64
 8   le_1993        206 non-null    float64
 9   le_1994        206 non-null    float64
 10  le_1995        206 non-null    float64
 11  le_1996        206 non-null    float64
 12  le_1997        206 non-null    float64
 13  le_1998        206 non-null    float64
 14  le_1999        206 non-null    float64
 15  le_2000        206 non-null    float64
 16  le_2001        206 non-null    float64
 17  le_2002        206 non-null    float64
 18  le_2003        206 non-null    float64
 19  le_2004        206 non-null    float64
 20  le_2005        206 non-null    float64
 21  le_2006        206 non-null    float64
 22  le_2007        206 non-null    float64
 23  le_2008        206 non-null    float64
 24  le_2009        206 non-null    float64
 25  le_2010        206 non-null    float64
 26  le_2011        206 non-null    float64
 27  le_2012        206 non-null    float64
 28  le_2013        206 non-null    float64
 29  le_2014        206 non-null    float64
 30  le_2015        206 non-null    float64
 31  le_2016        206 non-null    float64
 32  le_2017        206 non-null    float64
 33  le_2018        206 non-null    float64
 34  le_2019        206 non-null    float64
 35  le_2020        206 non-null    float64
 36  le_2021        206 non-null    float64
dtypes: float64(33), object(4)
memory usage: 59.7+ KB

Now let’s read the rest of the data

co2 = pd.read_csv(data_folder/'co2_production.csv')
gross_income_percapita = pd.read_csv(data_folder/'gross_national_income_per_capital.csv')
hdi_index = pd.read_csv(data_folder/'human_development_index.csv')

Let’s check that all four datasets have the same 206 rows

hdi_index.shape[0] == co2.shape[0] == gross_income_percapita.shape[0] == life_exp.shape[0]
True

We can see the same number of columns too

print(hdi_index.shape[1])
print(co2.shape[1])
print(gross_income_percapita.shape[1])
print(life_exp.shape[1])
37
37
37
37
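
All four files appear to share the same layout: five metadata columns (ISO3, Country, hdicode, region, hdi_rank_2021) plus 32 year columns (1990–2021) under a dataset-specific prefix. A minimal sanity check, assuming the other files follow the same layout as life_expectancy_by_birth.csv:

# Sketch: each dataset should have 5 metadata columns + 32 year columns
meta_cols = {'ISO3', 'Country', 'hdicode', 'region', 'hdi_rank_2021'}
for df in (life_exp, hdi_index, co2, gross_income_percapita):
    year_cols = [c for c in df.columns if c not in meta_cols]
    assert len(year_cols) == 32  # one column per year, 1990-2021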

Determining the ID column

We can treat the ISO3 & Country columns as global identifiers (we’ll single one out later on). Let’s see if there are any null values in the ISO3 column

life_exp[life_exp.ISO3.isnull()]
ISO3 Country hdicode region hdi_rank_2021 le_1990 le_1991 le_1992 le_1993 le_1994 ... le_2012 le_2013 le_2014 le_2015 le_2016 le_2017 le_2018 le_2019 le_2020 le_2021
195 NaN Very high human development NaN NaN NaN 73.776652 73.931896 74.058848 73.852640 74.007158 ... 78.578754 78.822196 79.057616 79.062435 79.264933 79.428843 79.582768 79.822662 78.789745 78.521301
196 NaN High human development NaN NaN NaN 67.315701 67.567552 67.966260 68.358081 68.636211 ... 74.212238 74.463776 74.732246 74.944071 75.152677 75.261530 75.600606 75.785826 75.120548 74.709094
197 NaN Medium human development NaN NaN NaN 58.757754 58.991558 59.592548 59.880467 60.250251 ... 67.460860 67.893291 68.411503 68.895967 69.314997 69.694263 69.998178 70.219458 69.517580 67.438318
198 NaN Low human development NaN NaN NaN 50.351409 50.608373 50.467296 50.606850 51.070213 ... 59.569566 60.010001 60.326771 60.657080 61.064843 61.424061 61.721181 62.083426 61.675690 61.310991
199 NaN Arab States NaN NaN NaN 62.973324 63.225764 63.520079 64.512882 65.562309 ... 70.261228 70.274253 70.570127 70.836099 71.019940 71.508744 71.711891 71.922194 71.002105 70.895040
200 NaN East Asia and the Pacific NaN NaN NaN 67.161758 67.383754 67.905512 68.375688 68.640580 ... 74.427861 74.677412 74.936109 75.156317 75.355177 75.419246 75.835426 76.036110 75.968330 75.579650
201 NaN Europe and Central Asia NaN NaN NaN 67.781962 67.609336 66.896095 66.718280 67.136734 ... 72.606274 72.992396 73.357649 73.681015 73.983262 74.343005 74.505959 74.700741 72.780992 72.856526
202 NaN Latin America and the Caribbean NaN NaN NaN 67.770568 68.127209 68.458998 68.786974 69.166664 ... 74.067656 74.316945 74.521732 74.579895 74.544240 74.705572 74.823478 75.014905 73.008316 72.099890
203 NaN South Asia NaN NaN NaN 58.830494 59.102221 59.727847 60.058622 60.441754 ... 67.915940 68.395439 68.905849 69.407382 69.874722 70.245467 70.517018 70.722315 69.972908 67.855530
204 NaN Sub-Saharan Africa NaN NaN NaN 49.868704 49.989060 49.900445 49.817075 49.722427 ... 57.987394 58.539213 59.010080 59.428775 59.954595 60.348837 60.735440 61.120031 60.729051 60.112467
205 NaN World NaN NaN NaN 65.144798 65.299082 65.584015 65.753616 65.986642 ... 71.289119 71.581265 71.886202 72.111793 72.370895 72.568952 72.816114 73.012099 72.257297 71.365465

11 rows × 37 columns

The NaN values aren’t related to any individual countries; rather, they belong to aggregate rows covering groups of countries (development levels, regions, and the world). We will drop these rows from the table.

life_exp_countries = life_exp.dropna(subset='ISO3', axis=0)
life_exp_countries.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 195 entries, 0 to 194
Data columns (total 37 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ISO3           195 non-null    object 
 1   Country        195 non-null    object 
 2   hdicode        191 non-null    object 
 3   region         151 non-null    object 
 4   hdi_rank_2021  191 non-null    float64
 5   le_1990        195 non-null    float64
 6   le_1991        195 non-null    float64
 7   le_1992        195 non-null    float64
 8   le_1993        195 non-null    float64
 9   le_1994        195 non-null    float64
 10  le_1995        195 non-null    float64
 11  le_1996        195 non-null    float64
 12  le_1997        195 non-null    float64
 13  le_1998        195 non-null    float64
 14  le_1999        195 non-null    float64
 15  le_2000        195 non-null    float64
 16  le_2001        195 non-null    float64
 17  le_2002        195 non-null    float64
 18  le_2003        195 non-null    float64
 19  le_2004        195 non-null    float64
 20  le_2005        195 non-null    float64
 21  le_2006        195 non-null    float64
 22  le_2007        195 non-null    float64
 23  le_2008        195 non-null    float64
 24  le_2009        195 non-null    float64
 25  le_2010        195 non-null    float64
 26  le_2011        195 non-null    float64
 27  le_2012        195 non-null    float64
 28  le_2013        195 non-null    float64
 29  le_2014        195 non-null    float64
 30  le_2015        195 non-null    float64
 31  le_2016        195 non-null    float64
 32  le_2017        195 non-null    float64
 33  le_2018        195 non-null    float64
 34  le_2019        195 non-null    float64
 35  le_2020        195 non-null    float64
 36  le_2021        195 non-null    float64
dtypes: float64(33), object(4)
memory usage: 57.9+ KB
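
Before settling on an identifier, we can quickly confirm that ISO3 (and Country) uniquely identify each row in the country-level table. A minimal check:

# Sketch: both candidate ID columns should be unique after dropping aggregates
life_exp_countries.ISO3.is_unique and life_exp_countries.Country.is_unique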

Let’s repeat this step for the other datasets and check that they match in number of rows

co2_countries = co2.dropna(subset='ISO3', axis=0)
gross_income_percapita_countries = gross_income_percapita.dropna(subset='ISO3', axis=0)
hdi_index_countries = hdi_index.dropna(subset='ISO3', axis=0)

co2_countries.shape[0] == hdi_index_countries.shape[0] == gross_income_percapita_countries.shape[0]
True
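
Since the upcoming merges will join on Country, it’s worth confirming that the datasets don’t just have the same row counts but the same set of countries. A quick sketch:

# Sketch: all four country-level frames should cover an identical set of countries
set(life_exp_countries.Country) == set(co2_countries.Country) \
    == set(hdi_index_countries.Country) == set(gross_income_percapita_countries.Country)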

Time Series Data

These datasets, when combined, have all the makings of a time-series dataset! We can combine them into one big dataset and save it to disk; this will simplify our modeling in the next chapter.

We see some miscellaneous data that can be put aside, particularly the region, hdicode, hdi_rank_2021, and ISO3 columns

# Putting misc cols aside so it's not repeated
country_miscellaneous_cols = ['region', 'hdicode', 'hdi_rank_2021', 'ISO3']
country_misc_data = life_exp_countries[['Country']+country_miscellaneous_cols]

time_series_data = life_exp_countries.drop(country_miscellaneous_cols, axis=1)\
    .merge(hdi_index_countries.drop(country_miscellaneous_cols, axis=1), on='Country', how='outer')\
    .merge(co2_countries.drop(country_miscellaneous_cols, axis=1), on='Country', how='outer')\
    .merge(gross_income_percapita_countries.drop(country_miscellaneous_cols, axis=1), on='Country', how='outer')

time_series_data
Country le_1990 le_1991 le_1992 le_1993 le_1994 le_1995 le_1996 le_1997 le_1998 ... gnipc_2012 gnipc_2013 gnipc_2014 gnipc_2015 gnipc_2016 gnipc_2017 gnipc_2018 gnipc_2019 gnipc_2020 gnipc_2021
0 Afghanistan 45.9672 46.6631 47.5955 51.4664 51.4945 52.5442 53.2433 53.6342 52.9431 ... 2125.862821 2193.553936 2178.507021 2101.589319 2077.566899 2085.487571 2054.939895 2097.889450 1997.852149 1824.190915
1 Angola 41.8933 43.8127 42.2088 42.1009 43.4217 45.8491 46.0329 46.3065 45.0570 ... 7280.845666 7478.104777 7704.231949 7652.656486 7189.426672 6861.575738 6381.521946 6082.746624 5593.142060 5465.617791
2 Albania 73.1439 73.3776 73.7148 73.9391 74.1313 74.3616 74.5923 73.9039 74.9899 ... 11146.263030 11552.982470 11691.648290 12016.297600 12484.624200 12802.148310 13302.705960 13485.311240 12996.762910 14131.110390
3 Andorra 78.4063 77.9805 80.3241 78.6633 82.6380 78.9616 80.3340 80.9439 79.4259 ... 47126.814610 46385.095200 48483.720320 49936.874540 52267.738320 52650.225760 53483.306630 54465.047400 47878.666640 51166.626610
4 United Arab Emirates 71.9004 72.2414 72.3062 72.5213 72.5982 72.6945 72.7674 72.9367 73.0658 ... 57445.954750 60005.695360 62573.505310 65577.512240 66881.329740 67667.508460 67195.095230 68590.900940 63016.401220 62573.591810
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
190 Samoa 67.6584 67.8814 68.3855 68.7009 68.9795 69.2984 69.5877 69.8746 70.1711 ... 5683.651395 5675.889504 5757.486050 6222.783552 6401.830980 6287.673021 6280.573936 6356.988690 5812.868328 5307.953374
191 Yemen 58.6994 59.0490 59.4283 59.8595 59.7135 60.4532 60.5678 61.1193 61.3748 ... 3152.900024 3212.651579 2775.842952 1785.788608 1494.230811 1302.425254 1341.656234 1349.567046 1370.601082 1314.270189
192 South Africa 63.3753 63.2649 63.3414 63.0447 62.6118 62.2616 61.4593 60.8053 60.0008 ... 13602.253520 13732.252520 13700.834560 13694.728110 13545.358590 13475.988210 13491.221790 13366.474640 12449.671040 12948.373250
193 Zambia 47.9263 47.0971 46.5119 46.2094 45.8543 45.5534 45.2326 44.9446 44.7011 ... 3333.576512 3389.478940 3263.039162 3403.471444 3237.505650 3330.552717 3418.096158 3365.410652 3178.619722 3217.767739
194 Zimbabwe 59.4264 58.0911 56.4354 54.4264 52.5878 50.5310 48.9551 47.9933 46.8192 ... 3618.629526 3632.111591 3644.856047 3638.532892 3606.750671 3728.918785 3864.012419 3674.564482 3654.289051 3809.887158

195 rows × 129 columns
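
The column count checks out: one Country column plus four indicators × 32 years each.

1 + 4 * 32  # Country + (life_exp, hdi, co2, gnipc) x years 1990-2021 = 129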

Melt Dataset

We like the earlier form of the dataset, but with 129 columns it is too wide to read comfortably. Let’s melt it!

Basically, this involves extracting the year out of the le_, hdi_, etc. column names into a separate year column, with the indicator’s value for that year next to it. We then use the (Country, year) pair as a key to merge all the variables together.

Each row will then take the form (Country, year, life_exp, hdi_index, co2, gnipc), as shown in the output below…

# Melt the dataset
life_exp_year_columns = tuple(f'le_{year}' for year in range(1990, 2022))
hdi_year_columns = tuple(f'hdi_{year}' for year in range(1990, 2022))
gnipc_year_columns = tuple(f'gnipc_{year}' for year in range(1990, 2022))
co2_year_columns = tuple(f'co2_prod_{year}' for year in range(1990, 2022))

def melt_df_by_var(df, value_name, value_vars):
    # Unpivot one indicator's year columns into (Country, year, value) rows
    melted = df.melt('Country', value_vars=value_vars, var_name='year', value_name=value_name)
    # The year is the last 4 characters of the column name, e.g. 'le_1990' -> 1990
    melted.year = melted.year.str.slice(-4).astype(int)
    return melted

melted_time_series = melt_df_by_var(time_series_data, 'life_exp', life_exp_year_columns)\
.merge(melt_df_by_var(time_series_data, 'hdi_index', hdi_year_columns), on=('Country', 'year'))\
.merge(melt_df_by_var(time_series_data, 'co2', co2_year_columns), on=('Country', 'year'))\
.merge(melt_df_by_var(time_series_data, 'gnipc', gnipc_year_columns), on=('Country', 'year'))

melted_time_series
Country year life_exp hdi_index co2 gnipc
0 Afghanistan 1990 45.9672 0.273 0.209727 2684.550019
1 Angola 1990 41.8933 NaN 0.429586 4845.706901
2 Albania 1990 73.1439 0.647 1.656902 4742.215529
3 Andorra 1990 78.4063 NaN 7.461153 43773.146500
4 United Arab Emirates 1990 71.9004 0.728 28.277672 102433.136000
... ... ... ... ... ... ...
6235 Samoa 2021 72.7675 0.707 1.238975 5307.953374
6236 Yemen 2021 63.7534 0.455 0.327510 1314.270189
6237 South Africa 2021 62.3410 0.713 7.620420 12948.373250
6238 Zambia 2021 61.2234 0.565 0.357535 3217.767739
6239 Zimbabwe 2021 59.2531 0.593 0.708562 3809.887158

6240 rows × 6 columns
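
As a sanity check, the melted frame should have exactly one row per country-year pair:

# Sketch: 195 countries x 32 years (1990-2021) = 6240 rows, 6 columns
melted_time_series.shape == (195 * 32, 6)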

Finally, we write it to disk (creating the processed folder first if it doesn’t exist)

# Ensure the output directory exists before writing
(data_folder/'processed').mkdir(parents=True, exist_ok=True)
melted_time_series.to_csv(data_folder/'processed/time_series.csv', index=False)
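
As an optional check, we can read the file back and confirm it round-trips with the expected shape:

# Sketch: verify the saved CSV loads back with the same shape
pd.read_csv(data_folder/'processed/time_series.csv').shape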

Conclusion

This finalizes our data processing. We’ll do a few more preprocessing steps before modelling in the next chapter. Stay tuned.