# importing necessary package
import sqlite3
import pandas as pd
# to create a database in the current directory called temps.db
= sqlite3.connect("temps.db") conn
Welcome! In this blog post, I’ll show you how to create a database, write a query function to access data from tables in a database, and create interesting visualizations using Plotly Express.
1. Create a Database
First, we’ll import the sqlite3
and pandas
packages. Then, using sqlite3.connect()
, we’ll create a database in our current directory called temps.db
.
We’ll be creating a new folder named datafiles
, then downloading the temperatures
datas into our new folder.
import os
# create folder named "datafiles" if it does not exist
if not os.path.exists("datafiles"):
"datafiles")
os.mkdir(
# download the files for the `temperatures` table
import urllib.request
= [f"{10 * i + 1}-{10 * (i+1)}" for i in range(190, 202)]
intervals for interval in intervals:
= f"https://raw.githubusercontent.com/PIC16B-ucla/24F/main/datasets/noaa-ghcn/decades/{interval}.csv"
url f"datafiles/{interval}.csv") urllib.request.urlretrieve(url,
Next, we’ll upload the tables temperatures
, stations
, and countries
into our database. Note that there are some NaN values in the temperatures
dataset. I’ve removed them using the prepare_df()
function. Here’s what the prepare_df()
looks like:
def prepare_df(df):
"""
prepares a piece of wide format dataframe into a long format data frame
"""
# melt to the long format table
= df.melt(
df = ["ID", "Year"],
id_vars = [f"VALUE{i}" for i in range(1, 13)],
value_vars = "Month",
var_name = "Temp"
value_name
)
# cleaning month and temp
"Month"] = df["Month"].str[5:].astype(int)
df["Temp"] = df["Temp"] / 100
df[
# removing rows where Temp == NaN
= df.dropna(subset = "Temp")
df
return df
for i, interval in enumerate(intervals):
= f"datafiles/{interval}.csv"
filepath = pd.read_csv(filepath)
df = prepare_df(df) # prepares temperatures df and removes nan values
df "temperatures", conn,
df.to_sql(= "replace" if i == 0 else "append", index = False)
if_exists
= "https://raw.githubusercontent.com/PIC16B-ucla/24F/refs/heads/main/datasets/noaa-ghcn/station-metadata.csv"
url = pd.read_csv(url)
stations "stations", conn, if_exists = "replace", index=False)
stations.to_sql(
= "https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv"
countries_url = pd.read_csv(countries_url)
countries "countries", conn, if_exists = "replace", index=False) countries.to_sql(
Let’s check if our tables have been successfully uploaded into our database.
= conn.cursor()
cursor "SELECT name FROM sqlite_master WHERE type='table'")
cursor.execute(print(cursor.fetchall())
[('temperatures',), ('stations',), ('countries',)]
Looks good! Now, after we’re done uploading tables to our database, let’s close our connection to our database using conn.close()
. It’s generally a good practice to do this.
conn.close()
2. Write a Query Function
Now that we have our tables in the database, let’s create a query function to allow easy access to our data in the database. I’ve created the function in… Here’s what the function looks like:
from climate_database import query_climate_database
import inspect
print(inspect.getsource(query_climate_database))
def query_climate_database(db_file, country, year_begin, year_end, month) :
""" Extracts climate data from a specified database based on the provided country,
year range, and month; and returns extracted data in a Pandas dataframe
Args:
db_file (string): file name for the database
country (string): name of the country for which data should be returned
year_begin (integer): the earliest year for which should be returned
year_end (integer): the latest years for which to should returned
month (integer): the month of the year for which should be returned
Returns:
df (Pandas dataframe): a dataframe of temperature readings of inputted country according to
inputted year_begin, inputted year_end, and inputted month of the year. The
resulting dataframe contains the columns `NAME`, 'LATITUDE', 'LONGITUDE`,
`Country`, `Year`, `Month`, `Temp`.
"""
with sqlite3.connect(db_file) as conn:
df = pd.read_sql_query("""
SELECT s.*, c.Name AS Country, t.*
FROM stations s
INNER JOIN countries c
ON c."FIPS 10-4" = SUBSTR(s.ID, 1, 2)
INNER JOIN temperatures t
ON t.ID = s.ID
WHERE c.Name = ? AND t.Year >= ? AND t.Year <= ? AND t.Month = ?
ORDER BY NAME
""", conn, params=(country, year_begin, year_end, month))
# returns dataframe in order of columns mentioned before
return df[["NAME", "LATITUDE", "LONGITUDE", "Country", "Year", "Month", "Temp"]]
This is what the resulting dataframe looks like:
= query_climate_database(db_file = "temps.db",
df = "India",
country = 1980,
year_begin = 2020,
year_end = 1)
month df
NAME | LATITUDE | LONGITUDE | Country | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | AGARTALA | 23.883 | 91.250 | India | 1980 | 1 | 18.21 |
1 | AGARTALA | 23.883 | 91.250 | India | 1981 | 1 | 18.25 |
2 | AGARTALA | 23.883 | 91.250 | India | 1982 | 1 | 19.31 |
3 | AGARTALA | 23.883 | 91.250 | India | 1985 | 1 | 19.25 |
4 | AGARTALA | 23.883 | 91.250 | India | 1988 | 1 | 19.54 |
... | ... | ... | ... | ... | ... | ... | ... |
3147 | VISHAKHAPATNAM | 17.717 | 83.233 | India | 2016 | 1 | 25.09 |
3148 | VISHAKHAPATNAM | 17.717 | 83.233 | India | 2017 | 1 | 23.90 |
3149 | VISHAKHAPATNAM | 17.717 | 83.233 | India | 2018 | 1 | 22.65 |
3150 | VISHAKHAPATNAM | 17.717 | 83.233 | India | 2019 | 1 | 22.20 |
3151 | VISHAKHAPATNAM | 17.717 | 83.233 | India | 2020 | 1 | 23.75 |
3152 rows × 7 columns
3. Write a Geographic Scatter Function for Yearly Temperature Increases
Now that we have easy access to certain subsets of our data, let’s visualize our data and see what we can learn from it. To do that, I’ll create a function that displays a geographic scatter plot, where eahc point shows the first coefficient of the linear regression model fitted by the temperatures each station in India from 1980 - 2020.
To evaluate the coefficients, we’ll need to implement a function that utilizes LinearRegression
from sklearn.linear_model
.
from sklearn.linear_model import LinearRegression
def coef(data_group):
""" fits data to a linear regression model and outputs the first coefficient of the fitted model
Args:
data_group (Pandas dataframe): the data for which to fit to a model
Returns:
"""
= data_group[["Year"]] # 2 brackets because X should be a df
x = data_group["Temp"] # 1 bracket because y should be a series
y = LinearRegression()
LR
LR.fit(x, y)return LR.coef_[0]
from climate_database import temperature_coefficient_plot
print(inspect.getsource(temperature_coefficient_plot))
def temperature_coefficient_plot(db_file, country, year_begin, year_end, month, min_obs, **kwargs) :
"""creates a geographic scatter plot where each point displays the first coefficent
of the linear regression model fitted for each station's temperature in given year range
Args:
db_file (string): file name for the database
country (string): name of the country for which data should be returned
year_begin (integer): the earliest year for which should be returned
year_end (integer): the latest years for which to should returned
month (integer): the month of the year for which should be returned
min_obs (integer): the minimum required number of years of data for any given station
**kwargs (optional): keyword arguments for the geographic scatter plot
Returns:
an interactive geographic scatterplot
"""
df = query_climate_database(db_file, country, year_begin, year_end, month)
# Filter out stations with observations less than min_obs
df['ObsCount'] = df.groupby('NAME')['NAME'].transform('count')
df = df[df['ObsCount'] >= min_obs]
# Fitting each station data to a linear regression model
# and put the first coefficient of the fitted models into a Pandas dataframe
coefs = df.groupby(["NAME", "Month"]).apply(coef).round(4).reset_index()
# Adding columns for latitude and longitude for each station in coefs
lat_lon = df[['NAME', 'LATITUDE', 'LONGITUDE']].drop_duplicates('NAME')
coefs = coefs.merge(lat_lon, how='left', left_on='NAME', right_on='NAME')
# Creating dictionary to match month to month name (for title of plot)
month_dict={1: "January", 2: "February", 3: "March", 4: "April",
5: "May", 6: "June", 7: "July", 8: "August", 9: "September",
10: "October", 11: "November", 12: "December"}
# Preparing the plot
fig = px.scatter_mapbox(coefs,
lat="LATITUDE",
lon="LONGITUDE",
hover_name="NAME",
color=0,
title = f"Estimates of yearly increase in termperature in {month_dict[month]} <br>for stations in {country}, years {year_begin} - {year_end}",
**kwargs)
# Updating margin and colorbar range
fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0},
coloraxis=dict(cmax=0.1, cmin=-0.1), # to make colorbar range from -0.1 to 0.1
coloraxis_colorbar=dict(title='Estimated Yearly<br>Increase (°C)<br>')
)
return fig
from plotly import express as px
= px.colors.sequential.RdBu_r
color_map
= 1
month = "India"
country = 1980
year_begin = 2020
year_end = temperature_coefficient_plot("temps.db","India", 1980, 2020, 1,
fig = 10,
min_obs = 2,
zoom ="carto-positron",
mapbox_style=color_map,
color_continuous_scale= 700)
width
fig.show()
It seems that a lot stations that surround the edges of India has a positive coefficient, which indicates that there are trends of increasing temperatures over the past decades. Notice how the stations in the central area of India yields negative coefficients.
Let’s take a look at a geographic scatter plot for the U.S. during the same time period.
4. Create Two More Interesting Figures
Let’s see how the average annual temperatures has changed over the years.
from climate_database import query_country_climate_database
= query_country_climate_database("temps.db", "India",1980, 2020)
df df.head()
NAME | LATITUDE | LONGITUDE | Year | Month | Temp | |
---|---|---|---|---|---|---|
0 | AGARTALA | 23.883 | 91.25 | 1980 | 1 | 18.21 |
1 | AGARTALA | 23.883 | 91.25 | 1980 | 3 | 26.30 |
2 | AGARTALA | 23.883 | 91.25 | 1980 | 4 | 29.72 |
3 | AGARTALA | 23.883 | 91.25 | 1980 | 5 | 27.28 |
4 | AGARTALA | 23.883 | 91.25 | 1980 | 6 | 28.56 |
We’ll be using the temperature_coefficient_plot
, which I’ve created, to implement the heatmap.
from climate_database import temperature_coefficient_plot
print(inspect.getsource(temperature_coefficient_plot))
def temperature_coefficient_plot(db_file, country, year_begin, year_end, month, min_obs, **kwargs) :
"""creates a geographic scatter plot where each point displays the first coefficent
of the linear regression model fitted for each station's temperature in given year range
Args:
db_file (string): file name for the database
country (string): name of the country for which data should be returned
year_begin (integer): the earliest year for which should be returned
year_end (integer): the latest years for which to should returned
month (integer): the month of the year for which should be returned
min_obs (integer): the minimum required number of years of data for any given station
**kwargs (optional): keyword arguments for the geographic scatter plot
Returns:
an interactive geographic scatterplot
"""
df = query_climate_database(db_file, country, year_begin, year_end, month)
# Filter out stations with observations less than min_obs
df['ObsCount'] = df.groupby('NAME')['NAME'].transform('count')
df = df[df['ObsCount'] >= min_obs]
# Fitting each station data to a linear regression model
# and put the first coefficient of the fitted models into a Pandas dataframe
coefs = df.groupby(["NAME", "Month"]).apply(coef).round(4).reset_index()
# Adding columns for latitude and longitude for each station in coefs
lat_lon = df[['NAME', 'LATITUDE', 'LONGITUDE']].drop_duplicates('NAME')
coefs = coefs.merge(lat_lon, how='left', left_on='NAME', right_on='NAME')
# Creating dictionary to match month to month name (for title of plot)
month_dict={1: "January", 2: "February", 3: "March", 4: "April",
5: "May", 6: "June", 7: "July", 8: "August", 9: "September",
10: "October", 11: "November", 12: "December"}
# Preparing the plot
fig = px.scatter_mapbox(coefs,
lat="LATITUDE",
lon="LONGITUDE",
hover_name="NAME",
color=0,
title = f"Estimates of yearly increase in termperature in {month_dict[month]} <br>for stations in {country}, years {year_begin} - {year_end}",
**kwargs)
# Updating margin and colorbar range
fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0},
coloraxis=dict(cmax=0.1, cmin=-0.1), # to make colorbar range from -0.1 to 0.1
coloraxis_colorbar=dict(title='Estimated Yearly<br>Increase (°C)<br>')
)
return fig
Based on the heatmap below, it appears that, over the years, more and more stations have been a higher annual yearly temperature.
from climate_database import yearly_avg_temp_heatmap
# setting color map for heatmap
= px.colors.sequential.deep
color_map
= yearly_avg_temp_heatmap("temps.db", "India", 1980, 2020,
fig = color_map)
color_continuous_scale
fig.show()
We can dig deeper to learn more about our data. Let’s see how the monthly averages in India has changed from 1980 to 2020, and see if there’s a particular month where the averages have increased.
from climate_database import monthly_avg_temp_lineplots
= monthly_avg_temp_lineplots("temps.db", "India", 1980, 2020,
fig = 1500,
height = 500)
width fig.show()
While none of the plots above appear to have a strongly positive linear trend, we can see that, there’s a small positive linear trend in March and April, and from June to November. This indicates that the effects of climate change may be more apparent in those months in India.