Data Exploratory-Olympic Games

Project Overview

This will one of my favourite projects I must say. I got this inspiration from a YouTube video uploaded by Simplilearn. In this project I will be exploring Olympic data from Kaggle by rgriffin, will be doing some data wrangling and some basic data visualisation.
Before starting the project, I have listed some of the questions that can guide me through the analysis.

  1. Top 10 highest count of participants by country
  2. Age distribution of the athletes - histogram
  3. Summer and Winter games unique sports
  4. Gender distribution. MALE & FEMALE
  5. Top 5 gold medal count per country- Gold
  6. Count of medals - Who won the most medal (not only gold)

Coding:

First thing first let's import all the libraries that I will be using for this project and import the datasets

mport pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#importing the datasets
ath = pd.read_csv('C:/Users/User/Desktop/Data Analysis project/Exploratory data - Olympic/athlete_events.csv')
reg = pd.read_csv('C:/Users/User/Desktop/Data Analysis project/Exploratory data - Olympic/noc_regions.csv')

To explain more about this dataset

The file athlete_events.csv composed of 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are as per below:

  1. ID- Unique number for each athlete
  2. Name - Athlete's name
  3. Sex - M or F
  4. Age - Integer
  5. Height - In centimeters
  6. Weight - In kilograms
  7. Team - Team name
  8. NOC - National Olympic Committee 3-letter code
  9. Games - Year and season
  10. Year - Integer
  11. Season - Summer or Winter
  12. City - Host city
  13. Sport - Sport
  14. Event - Event
  15. Medal - Gold, Silver, Bronze, or NA

The file noc_regions.csv consists of 3 columns. The data is to match against the NOC country code (National Olympic Committee 3 letter code) to the Country name. The columns are as per below:

  1. NOC (National Olympic Committee 3 letter code)
  2. Region
  3. Notes

I am going to see if the data is properly imported by using the head() function to see the first 3 rows of the dataset

ath.head()

As mentioned above the second file (noc_regions.csv) contains the country name as per NOC code. So i am going to match the 2 file together using the merge() function, join to the left and on "NOC" column. After that we will make the the column name consistent.

#merging data
ath_df = ath.merge(reg, how = 'left', on = 'NOC')
#making the column title consistent
ath_df.rename(columns= {'region':'Region', 'notes':'Notes'}, inplace=True)
									
ath_df.head()

This is where the data exploratory begins. I will start with looking at the shape of the data using the shape() function to see the number of rows and columns.

ath_df.shape

Let's also look at the data information, I want to see the data type of the data and if there are any NaN values. By using the into() function, it can provide me with the information I wanted. As you can see below it shows the data type (int, float or object) as well as the number of data that has non-null values.

ath_df.info()

To look at more statistical summary/data we can use the describe() function for example percentile, mean and std of the numerical values of the Series or DataFrame.

ath_df.describe()

As I saw on the previous summary, I notice that there are quite some of the column has a null value. Let's inspect them in more detail. I will now create a list of a column to give a boolean value (True or False). If there is a null value (NA value), will print as True. If there is no null value, print False.

#checking for null value, will display value as boolean

nan_values = ath_df.isna()
nan_values.any()

As shown above column Age, Height, Weight, Medal Region and Notes has a null value. Now, let's check the count of null values for each column.

#checking number of null values in each column

ath_df.isnull().sum()

There are a few ways to deal with missing data, like dropping the data entry, replace it by frequency or by mean values or just leave them as is. In this case, I will leave it as missing data because I can't manipulate them.

Answering Questions

1. Top 10 highest count of participants by country
Using the value_count and sort them in descending order, the column we will be utilising is 'Team'

top_10 = ath_df.Team.value_counts().sort_values(ascending=False).head(10)
top_10

Let's make it inro a bar chart.

#ploting the figure
plt.figure(figsize=(12,6))
plt.title('Number of Participants by Country')
sns.barplot(x=top_10.index, y=top_10, palette = 'Set3' )

2. Age distribution of the athletes - histogram & boxplot
I am going plot a histogram to represent the age distribution of the athletes

#age distribution

plt.figure(figsize=(12,6))
plt.title("Age distribution of the athletes")
plt.xlabel('Age')
plt.ylabel('Number of participants')
plt.hist(ath_df.Age, bins = np.arange(10,80,2), color="blue", edgecolor = 'white');

Now lets use box plot to slow the age distribution.

#using boxplot
agedis = ath_df.Age
	
plt.figure(figsize=(16, 6))
ax = sns.boxplot(data=agedis, orient="h", palette="Set2")
	
plt.title('Box plot Age Distribution')
plt.xlabel('Age')
	
plt.show()

As shown on the charts above, we can see that majority of the participants are aged between 22-26.

3. Summer and Winter games unique sports
Olympic games are divided into 2 seasons, Winter and Summer. in this analysis I am going to extract the unique sports for each season. using unique() function we can extract the unique value for 'Sport' column.

# winter sports
winter_sports = ath_df[ath_df.Season == 'Winter'].Sport.unique()
winter_sports
# summer sports
summer_sports = ath_df[ath_df.Season == 'Summer'].Sport.unique()
summer_sports

4. Gender distribution. MALE & FEMALE
In this section, I would like to see the gender distribution. Using pie chart to look at the percentage of male vs female. We first have to create an object for the gender and then we will create a pie chart based on that object.

#gender data
gender_counts = ath_df.Sex.value_counts()

# plotting the pie chart
gender_counts.plot(kind='pie',figsize=(12,6), 
autopct='%1.1f%%',startangle=90,shadow=True)


plt.title('Male and Female participants')
plt.show()

The pie chart shown above represents the percentage of males and females. We can clearly see that there are more male participants coming at 72.5%, and the number of female participants is just below 30%.

5. Top 5 gold medal count per country- Gold
In this section, I want to see 5 countries that won the most gold medal. To get that data, we need to look at the 'Medal' column. I will create a bar chart to show the data.

#gold medal by countires
goldmedals = ath_df[(ath_df.Medal == 'Gold')]
topgoldcountries= goldmedals.Region.value_counts().reset_index(name='Medal').head(5)

#ploting bar chart
goldCountries = sns.catplot(x="index", y="Medal", data=topgoldcountries, kind="bar")

goldCountries.despine(left=False)
goldCountries.set_xlabels("Top 5 Counries")
goldCountries.set_ylabels("Number of Medals")
plt.title("Gold Medals per Country")

As expected, the USA has the highest count of gold medals.

6. Count of medals - Who won the most medal (not only gold)
I am going to filter the medal column to exclude any null values because a null value means that the participant did not win any medal.

allmedals = ath_df[ath_df['Medal'].notnull()]
mostmedals = allmedals.Name.value_counts().reset_index(name='Medal')
mostmedals = mostmedals.head(10)
mostmedals

As you can see Michael Fred Phelps, II won the most medal, coming at 28 medal in total. It can be any medals which is AMAZING.