Creating Databases, Query Functions, and Interactive Plots

sqlite3
SQL
pandas
sklearn
plotly
Author

Jade Liang

Published

October 23, 2024

Welcome! In this blog post, I’ll show you how to create a database, write a query function to access data from tables in a database, and create interesting visualizations using Plotly Express.

1. Create a Database

First, we’ll import the sqlite3 and pandas packages. Then, using sqlite3.connect(), we’ll create a database in our current directory called temps.db.

# importing necessary package
import sqlite3
import pandas as pd

# to create a database in the current directory called temps.db
conn = sqlite3.connect("temps.db") 

We’ll be creating a new folder named datafiles, then downloading the temperatures datas into our new folder.

import os
# create folder named "datafiles" if it does not exist
if not os.path.exists("datafiles"): 
    os.mkdir("datafiles")

# download the files for the `temperatures` table
import urllib.request
intervals = [f"{10 * i + 1}-{10 * (i+1)}" for i in range(190, 202)]
for interval in intervals:
    url = f"https://raw.githubusercontent.com/PIC16B-ucla/24F/main/datasets/noaa-ghcn/decades/{interval}.csv"
    urllib.request.urlretrieve(url, f"datafiles/{interval}.csv")

Next, we’ll upload the tables temperatures, stations, and countries into our database. Note that there are some NaN values in the temperatures dataset. I’ve removed them using the prepare_df() function. Here’s what the prepare_df() looks like:

def prepare_df(df):
    """
    prepares a piece of wide format dataframe into a long format data frame
    """
    # melt to the long format table
    df = df.melt(
        id_vars = ["ID", "Year"],
        value_vars = [f"VALUE{i}" for i in range(1, 13)],
        var_name = "Month",
        value_name = "Temp"
    )

    # cleaning month and temp
    df["Month"] = df["Month"].str[5:].astype(int)
    df["Temp"]  = df["Temp"] / 100

    # removing rows where Temp == NaN
    df = df.dropna(subset = "Temp")

    return df
for i, interval in enumerate(intervals):
    filepath = f"datafiles/{interval}.csv"
    df = pd.read_csv(filepath)
    df = prepare_df(df) # prepares temperatures df and removes nan values
    df.to_sql("temperatures", conn, 
              if_exists = "replace" if i == 0 else "append", index = False)

url = "https://raw.githubusercontent.com/PIC16B-ucla/24F/refs/heads/main/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(url)
stations.to_sql("stations", conn, if_exists = "replace", index=False)

countries_url = "https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv"
countries = pd.read_csv(countries_url)
countries.to_sql("countries", conn, if_exists = "replace", index=False)

Let’s check if our tables have been successfully uploaded into our database.

cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('temperatures',), ('stations',), ('countries',)]

Looks good! Now, after we’re done uploading tables to our database, let’s close our connection to our database using conn.close(). It’s generally a good practice to do this.

conn.close()

2. Write a Query Function

Now that we have our tables in the database, let’s create a query function to allow easy access to our data in the database. I’ve created the function in… Here’s what the function looks like:

from climate_database import query_climate_database
import inspect
print(inspect.getsource(query_climate_database))
def query_climate_database(db_file, country, year_begin, year_end, month) :
    """ Extracts climate data from a specified database based on the provided country, 
        year range, and month; and returns extracted data in a Pandas dataframe

    Args:
        db_file (string): file name for the database
        country (string): name of the country for which data should be returned
        year_begin (integer): the earliest year for which should be returned
        year_end (integer): the latest years for which to should returned
        month (integer): the month of the year for which should be returned

    Returns:
        df (Pandas dataframe): a dataframe of temperature readings of inputted country according to 
            inputted year_begin, inputted year_end, and inputted month of the year. The
            resulting dataframe contains the columns `NAME`, 'LATITUDE', 'LONGITUDE`, 
            `Country`, `Year`, `Month`, `Temp`.
    """

    with sqlite3.connect(db_file) as conn:
        df = pd.read_sql_query("""
                    SELECT s.*, c.Name AS Country, t.*
                    FROM stations s
                    INNER JOIN countries c
                    ON c."FIPS 10-4" = SUBSTR(s.ID, 1, 2)
                    INNER JOIN temperatures t
                    ON t.ID = s.ID
                    WHERE c.Name = ? AND t.Year >= ? AND t.Year <= ? AND t.Month = ?   
                    ORDER BY NAME                            
                    """, conn, params=(country, year_begin, year_end, month))
        
        # returns dataframe in order of columns mentioned before
    return df[["NAME", "LATITUDE", "LONGITUDE", "Country", "Year", "Month", "Temp"]]    

This is what the resulting dataframe looks like:

df = query_climate_database(db_file = "temps.db",
                       country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)
df
NAME LATITUDE LONGITUDE Country Year Month Temp
0 AGARTALA 23.883 91.250 India 1980 1 18.21
1 AGARTALA 23.883 91.250 India 1981 1 18.25
2 AGARTALA 23.883 91.250 India 1982 1 19.31
3 AGARTALA 23.883 91.250 India 1985 1 19.25
4 AGARTALA 23.883 91.250 India 1988 1 19.54
... ... ... ... ... ... ... ...
3147 VISHAKHAPATNAM 17.717 83.233 India 2016 1 25.09
3148 VISHAKHAPATNAM 17.717 83.233 India 2017 1 23.90
3149 VISHAKHAPATNAM 17.717 83.233 India 2018 1 22.65
3150 VISHAKHAPATNAM 17.717 83.233 India 2019 1 22.20
3151 VISHAKHAPATNAM 17.717 83.233 India 2020 1 23.75

3152 rows × 7 columns

3. Write a Geographic Scatter Function for Yearly Temperature Increases

Now that we have easy access to certain subsets of our data, let’s visualize our data and see what we can learn from it. To do that, I’ll create a function that displays a geographic scatter plot, where eahc point shows the first coefficient of the linear regression model fitted by the temperatures each station in India from 1980 - 2020.

To evaluate the coefficients, we’ll need to implement a function that utilizes LinearRegression from sklearn.linear_model.

from sklearn.linear_model import LinearRegression
def coef(data_group):
    """ fits data to a linear regression model and outputs the first coefficient of the fitted model

    Args:
        data_group (Pandas dataframe): the data for which to fit to a model
    
    Returns:
        
    """
    x = data_group[["Year"]] # 2 brackets because X should be a df
    y = data_group["Temp"]   # 1 bracket because y should be a series
    LR = LinearRegression()
    LR.fit(x, y)
    return LR.coef_[0]
from climate_database import temperature_coefficient_plot
print(inspect.getsource(temperature_coefficient_plot))
def temperature_coefficient_plot(db_file, country, year_begin, year_end, month, min_obs, **kwargs) : 
    """creates a geographic scatter plot where each point displays the first coefficent
            of the linear regression model fitted for each station's temperature in given year range

    Args:
        db_file (string): file name for the database
        country (string): name of the country for which data should be returned
        year_begin (integer): the earliest year for which should be returned
        year_end (integer): the latest years for which to should returned
        month (integer): the month of the year for which should be returned
        min_obs (integer): the minimum required number of years of data for any given station
        **kwargs (optional): keyword arguments for the geographic scatter plot
    
    Returns:
        an interactive geographic scatterplot
    """

    df = query_climate_database(db_file, country, year_begin, year_end, month)

    # Filter out stations with observations less than min_obs
    df['ObsCount'] = df.groupby('NAME')['NAME'].transform('count')
    df = df[df['ObsCount'] >= min_obs]

    # Fitting each station data to a linear regression model
    #   and put the first coefficient of the fitted models into a Pandas dataframe
    coefs = df.groupby(["NAME", "Month"]).apply(coef).round(4).reset_index()

    # Adding columns for latitude and longitude for each station in coefs
    lat_lon = df[['NAME', 'LATITUDE', 'LONGITUDE']].drop_duplicates('NAME')
    coefs = coefs.merge(lat_lon, how='left', left_on='NAME', right_on='NAME')
    
    # Creating dictionary to match month to month name (for title of plot)
    month_dict={1: "January", 2: "February", 3: "March", 4: "April",
            5: "May", 6: "June", 7: "July", 8: "August", 9: "September",
            10: "October", 11: "November", 12: "December"}
    
    # Preparing the plot
    fig = px.scatter_mapbox(coefs,
                            lat="LATITUDE",
                            lon="LONGITUDE",
                            hover_name="NAME",
                            color=0,
                            title = f"Estimates of yearly increase in termperature in {month_dict[month]} <br>for stations in {country}, years {year_begin} - {year_end}",
                            **kwargs)
    
    # Updating margin and colorbar range
    fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0}, 
                      coloraxis=dict(cmax=0.1, cmin=-0.1), # to make colorbar range from -0.1 to 0.1
                      coloraxis_colorbar=dict(title='Estimated Yearly<br>Increase (°C)<br>')
                      )
    
    return fig
from plotly import express as px

color_map = px.colors.sequential.RdBu_r

month = 1
country = "India"
year_begin = 1980
year_end = 2020
fig = temperature_coefficient_plot("temps.db","India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map,
                                   width = 700)
                                 
fig.show()

It seems that a lot stations that surround the edges of India has a positive coefficient, which indicates that there are trends of increasing temperatures over the past decades. Notice how the stations in the central area of India yields negative coefficients.

Let’s take a look at a geographic scatter plot for the U.S. during the same time period.

4. Create Two More Interesting Figures

Let’s see how the average annual temperatures has changed over the years.

from climate_database import query_country_climate_database
df = query_country_climate_database("temps.db", "India",1980, 2020)
df.head()
NAME LATITUDE LONGITUDE Year Month Temp
0 AGARTALA 23.883 91.25 1980 1 18.21
1 AGARTALA 23.883 91.25 1980 3 26.30
2 AGARTALA 23.883 91.25 1980 4 29.72
3 AGARTALA 23.883 91.25 1980 5 27.28
4 AGARTALA 23.883 91.25 1980 6 28.56

We’ll be using the temperature_coefficient_plot, which I’ve created, to implement the heatmap.

from climate_database import temperature_coefficient_plot
print(inspect.getsource(temperature_coefficient_plot))
def temperature_coefficient_plot(db_file, country, year_begin, year_end, month, min_obs, **kwargs) : 
    """creates a geographic scatter plot where each point displays the first coefficent
            of the linear regression model fitted for each station's temperature in given year range

    Args:
        db_file (string): file name for the database
        country (string): name of the country for which data should be returned
        year_begin (integer): the earliest year for which should be returned
        year_end (integer): the latest years for which to should returned
        month (integer): the month of the year for which should be returned
        min_obs (integer): the minimum required number of years of data for any given station
        **kwargs (optional): keyword arguments for the geographic scatter plot
    
    Returns:
        an interactive geographic scatterplot
    """

    df = query_climate_database(db_file, country, year_begin, year_end, month)

    # Filter out stations with observations less than min_obs
    df['ObsCount'] = df.groupby('NAME')['NAME'].transform('count')
    df = df[df['ObsCount'] >= min_obs]

    # Fitting each station data to a linear regression model
    #   and put the first coefficient of the fitted models into a Pandas dataframe
    coefs = df.groupby(["NAME", "Month"]).apply(coef).round(4).reset_index()

    # Adding columns for latitude and longitude for each station in coefs
    lat_lon = df[['NAME', 'LATITUDE', 'LONGITUDE']].drop_duplicates('NAME')
    coefs = coefs.merge(lat_lon, how='left', left_on='NAME', right_on='NAME')
    
    # Creating dictionary to match month to month name (for title of plot)
    month_dict={1: "January", 2: "February", 3: "March", 4: "April",
            5: "May", 6: "June", 7: "July", 8: "August", 9: "September",
            10: "October", 11: "November", 12: "December"}
    
    # Preparing the plot
    fig = px.scatter_mapbox(coefs,
                            lat="LATITUDE",
                            lon="LONGITUDE",
                            hover_name="NAME",
                            color=0,
                            title = f"Estimates of yearly increase in termperature in {month_dict[month]} <br>for stations in {country}, years {year_begin} - {year_end}",
                            **kwargs)
    
    # Updating margin and colorbar range
    fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0}, 
                      coloraxis=dict(cmax=0.1, cmin=-0.1), # to make colorbar range from -0.1 to 0.1
                      coloraxis_colorbar=dict(title='Estimated Yearly<br>Increase (°C)<br>')
                      )
    
    return fig

Based on the heatmap below, it appears that, over the years, more and more stations have been a higher annual yearly temperature.

from climate_database import yearly_avg_temp_heatmap

# setting color map for heatmap
color_map = px.colors.sequential.deep

fig = yearly_avg_temp_heatmap("temps.db", "India", 1980, 2020,
                              color_continuous_scale = color_map)

fig.show()

We can dig deeper to learn more about our data. Let’s see how the monthly averages in India has changed from 1980 to 2020, and see if there’s a particular month where the averages have increased.

from climate_database import monthly_avg_temp_lineplots

fig = monthly_avg_temp_lineplots("temps.db", "India", 1980, 2020,
                                 height = 1500,
                                 width = 500)
fig.show()

While none of the plots above appear to have a strongly positive linear trend, we can see that, there’s a small positive linear trend in March and April, and from June to November. This indicates that the effects of climate change may be more apparent in those months in India.