Meteorological Data Analysis

Kartik Khandelwal
5 min read · May 6, 2021

In this project, we’re going to analyze meteorological data. Our goal is to transform the raw data into information and then convert that information into knowledge.

The null hypothesis to consider is: “H0: Have the apparent temperature and humidity, compared monthly across 10 years of the data, increased due to global warming?”

H0 means we need to find out whether the average apparent temperature for a given month, say April, has increased from 2006 to 2016, and whether the average humidity for the same period has increased as well.

Let’s begin:

Let us first import the required libraries.

In [2]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Now, let’s read our data, which is stored in the ‘weatherHistory.csv’ file. You can get the dataset here.

In [3]:

data = pd.read_csv('../input/weather-dataset/weatherHistory.csv')
data.head()

Our data should not contain any missing values. To check for them, run:

In [4]:

data.isnull().sum()

Out[4]:

Formatted Date                0
Summary                       0
Precip Type                 517
Temperature (C)               0
Apparent Temperature (C)      0
Humidity                      0
Wind Speed (km/h)             0
Wind Bearing (degrees)        0
Visibility (km)               0
Loud Cover                    0
Pressure (millibars)          0
Daily Summary                 0
dtype: int64

As can clearly be seen, the feature ‘Precip Type’ has 517 null values. Thankfully, this feature is of no use to us, so we can simply drop it later. Now, let’s check how many rows we have and what data types our features have.
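As a small aside, here is a sketch of two common ways to deal with a column like ‘Precip Type’. The tiny DataFrame is made up for illustration (it is not the real dataset), and the 'unknown' label is just an assumed placeholder.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame that mimics two columns of the dataset (values are illustrative only).
sample = pd.DataFrame({
    'Precip Type': ['rain', np.nan, 'snow', np.nan],
    'Humidity': [0.89, 0.86, 0.72, 0.95],
})

# Count the missing values per column, as data.isnull().sum() does.
missing = sample.isnull().sum()

# Option 1: drop the column entirely, since it is not needed for this analysis.
without_precip = sample.drop(columns=['Precip Type'])

# Option 2: keep the column and label the gaps explicitly.
filled = sample.fillna({'Precip Type': 'unknown'})

print(missing)
print(without_precip.columns.tolist())
print(filled['Precip Type'].tolist())
```

In this project the first option is enough, because the column is not used in the analysis.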

In [5]:

data.info()

Out[5]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Formatted Date            96453 non-null  object
 1   Summary                   96453 non-null  object
 2   Precip Type               95936 non-null  object
 3   Temperature (C)           96453 non-null  float64
 4   Apparent Temperature (C)  96453 non-null  float64
 5   Humidity                  96453 non-null  float64
 6   Wind Speed (km/h)         96453 non-null  float64
 7   Wind Bearing (degrees)    96453 non-null  float64
 8   Visibility (km)           96453 non-null  float64
 9   Loud Cover                96453 non-null  float64
 10  Pressure (millibars)      96453 non-null  float64
 11  Daily Summary             96453 non-null  object
dtypes: float64(8), object(4)

Let’s print the names of all the columns:

In [6]:

data.columns

Out[6]:

Index(['Formatted Date', 'Summary', 'Precip Type', 'Temperature (C)',
'Apparent Temperature (C)', 'Humidity', 'Wind Speed (km/h)',
'Wind Bearing (degrees)', 'Visibility (km)', 'Loud Cover',
'Pressure (millibars)', 'Daily Summary'],
dtype='object')

Now, we will select only the columns that are useful to us and leave out the others. The selected columns are → ‘Formatted Date’, ‘Apparent Temperature (C)’, ‘Humidity’, ‘Daily Summary’

In [7]:

col = ['Formatted Date', 'Apparent Temperature (C)', 'Humidity', 'Daily Summary']
data = data[col].copy()  # copy so that later column assignments do not trigger SettingWithCopyWarning

To analyse the data further, we need to convert ‘Formatted Date’ to a datetime type, because its current data type is ‘object’ and time-based operations such as resampling need a datetime index. This can be done with the pandas function → pd.to_datetime(). Remember to pass utc=True, which matters when the timestamps carry different UTC offsets: everything is converted to a single time zone.

In [8]:

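data['Formatted Date'] = pd.to_datetime(data['Formatted Date'], utc=True)
data = data.set_index('Formatted Date')
# Average only the numeric columns; 'Daily Summary' is text and has no mean.
# Note: pandas 2.2+ prefers the alias 'ME' over 'M' for month-end resampling.
data = data[['Apparent Temperature (C)', 'Humidity']].resample('M').mean()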

resample('M') simply converts the hourly data to monthly data by taking the mean of each month.

This is what our data looks like:
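To see what these two steps do on a small scale, here is a minimal sketch. The four readings are made up, and the timestamp format with its offsets is an assumption for illustration. It uses a pd.offsets.MonthEnd() object instead of a string alias, which asks for the same month-end bins.

```python
import pandas as pd

# Made-up readings; the timestamp format and offsets are assumptions for illustration.
sample = pd.DataFrame({
    'Formatted Date': [
        '2006-04-01 10:00:00.000 +0200',
        '2006-04-15 10:00:00.000 +0200',
        '2006-11-05 10:00:00.000 +0100',
        '2006-11-20 10:00:00.000 +0100',
    ],
    'Humidity': [0.80, 0.90, 0.70, 0.60],
})

# utc=True converts every timestamp to UTC, so the column ends up with one time zone.
sample['Formatted Date'] = pd.to_datetime(sample['Formatted Date'], utc=True)
sample = sample.set_index('Formatted Date')

# A MonthEnd offset object requests month-end bins without relying on a string alias.
monthly = sample['Humidity'].resample(pd.offsets.MonthEnd()).mean()

# Months without any readings come out as NaN, so drop them for display.
monthly = monthly.dropna()
print(monthly)
```

Each remaining row is labelled with the last day of its month and holds the mean of that month's readings.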

In [9]:

data.head()

Out[9]:

It’s time to visualize our data using two great libraries: matplotlib and seaborn.

First, let’s look at how ‘Apparent Temperature’ and ‘Humidity’ vary over time.

In [10]:

plt.figure(figsize=(18,5))
plt.title('Variation of temp with humidity')
plt.plot(data)

Out[10]:

[<matplotlib.lines.Line2D at 0x7f8f799d7b50>,
<matplotlib.lines.Line2D at 0x7f8f79a3bc10>]

Now, let’s plot temperature and humidity for the month of April.

In [11]:

plt.figure(figsize=(15, 5))
data_of_april = data[data.index.month==4]
plt.plot(data_of_april, marker='o',label=['Apparent Temperature (C)','Humidity'] );
plt.legend(loc = 'center right',fontsize = 10)
plt.title('Relation between temperature and humidity for the month of April')
plt.show()

The graph above shows a sudden increase in temperature in 2009, which is the highest April temperature in the period. The temperature falls again after 2009, and April reaches its lowest temperature in 2015.
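Reading extremes off a graph is easy to get wrong, so it can help to check them in code. The sketch below uses a hypothetical helper, april_extremes (not part of the original notebook), and runs it on made-up monthly values, so the years it prints say nothing about the real data.

```python
import pandas as pd

def april_extremes(monthly, column='Apparent Temperature (C)'):
    """Return (year of the highest, year of the lowest) April value of `column`."""
    april = monthly.loc[monthly.index.month == 4, column]
    return april.idxmax().year, april.idxmin().year

# Made-up monthly frame covering 2006-2008, indexed by month end like the resampled data.
idx = pd.date_range('2006-01-31', periods=36, freq=pd.offsets.MonthEnd(), tz='UTC')
demo = pd.DataFrame({'Apparent Temperature (C)': 10.0}, index=idx)

# Give the three April rows distinct illustrative values.
demo.loc[demo.index.month == 4, 'Apparent Temperature (C)'] = [11.5, 13.0, 9.8]

hottest_year, coldest_year = april_extremes(demo)
print(hottest_year, coldest_year)
```

Calling april_extremes(data) on the resampled DataFrame from this post would give the years to compare with the 2009 and 2015 readings from the graph.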

Now, let’s plot the correlation between the features of our data and see what we can learn from it. The heatmap from the seaborn library is a great help here.

In [12]:

correlation = data.corr()
sns.heatmap(correlation)

Out[12]:

<AxesSubplot:>

This is our beautiful correlation heatmap for ‘Apparent Temperature’ and ‘Humidity’.
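A heatmap with only two features shows colours but no numbers, so it can be useful to read the coefficient directly. Here is a minimal sketch on made-up values (not the real dataset); DataFrame.corr() computes the Pearson correlation by default.

```python
import pandas as pd

# Made-up monthly means, chosen only to illustrate the calculation.
demo = pd.DataFrame({
    'Apparent Temperature (C)': [1.0, 5.0, 12.0, 18.0, 22.0],
    'Humidity': [0.85, 0.80, 0.70, 0.62, 0.60],
})

# Pearson correlation matrix: symmetric, with 1.0 on the diagonal.
correlation = demo.corr()

# The single off-diagonal value is the temperature-humidity correlation.
r = correlation.loc['Apparent Temperature (C)', 'Humidity']
print(round(r, 3))
```

To show the numbers inside the plot itself, sns.heatmap also accepts annot=True, and vmin=-1 with vmax=1 pins the colour scale to the full correlation range.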

Now, let’s draw a bar plot to see the relation between ‘Humidity’ and ‘Apparent Temperature’.

In [13]:

plt.figure(figsize = (18,5))
sns.barplot(x='Apparent Temperature (C)', y='Humidity', data=data_of_april)
plt.xticks(rotation=-30)
plt.title('Relation between temperature and humidity for the month of April')
plt.show()

Conclusion:

Our environment is highly affected by global warming. Our analysis shows both sudden increases and sudden decreases in temperature over the ten years. Humidity, however, neither rose nor fell; it stayed about the same over the 10 years.
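One way to put a number on "increased or not" is to fit a straight line through the yearly values of one month and look at its slope. The sketch below is only an illustration under assumptions: yearly_slope is a hypothetical helper, the April values are made up, and a slope on its own is not a statistical test.

```python
import numpy as np
import pandas as pd

def yearly_slope(monthly, column, month=4):
    """Least-squares slope (units per year) of `column` across years for one calendar month."""
    rows = monthly.loc[monthly.index.month == month, column].dropna()
    years = rows.index.year.to_numpy(dtype=float)
    # Centre the years so the fit is numerically well behaved; the slope is unchanged.
    slope, _intercept = np.polyfit(years - years.mean(), rows.to_numpy(), 1)
    return slope

# Made-up April means that rise by exactly 0.5 per year.
idx = pd.to_datetime(
    ['2006-04-30', '2007-04-30', '2008-04-30', '2009-04-30', '2010-04-30'], utc=True
)
demo = pd.DataFrame({'Apparent Temperature (C)': [10.0, 10.5, 11.0, 11.5, 12.0]}, index=idx)

slope = yearly_slope(demo, 'Apparent Temperature (C)')
print(round(slope, 3))
```

A slope near zero would match the observation that humidity stayed about the same, while a clearly positive slope would point to a rise over the period.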

“I am thankful to mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a Coding Internship Experience. Thank you www.suvenconsultants.com”.
