Complete project from inspection to analysis, using Google's ecosystem (Cloud Storage, BigQuery, Colab).
According to experts, the number of heart attacks increases when the temperature drops, more specifically below 14°C (57°F). How does this statement hold up when we analyze a sample from a particular city in Brazil?
In this project I want to show some skills regarding the analysis of public data, SQL and Cloud Computing.
Nowadays, every data professional should be comfortable with cloud solutions, especially those working at startups.
The idea of this project is to show decision making, usability of the cloud, how to integrate some modules and simulate a small data operation.
Below are the links to the statements and datasets.
SOURCE 1: https://www.uol.com.br/vivabem/noticias/redacao/2019/07/07/inverno-aumenta-em-30-o-risco-de-infarto-saiba-como-se-proteger.htm
SOURCE 2: https://drauziovarella.uol.com.br/cardiovascular/numero-de-casos-de-infarto-e-avc-aumenta-no-frio/
DATASETS: http://dadosabertos.c3sl.ufpr.br/curitiba/SESPAMedicoUnidadesMunicipaisDeSaude/
2018-03-10_Sistema_E-SaudeMedicos-_Base_de_Dados.csv
2018-06-10_Sistema_E-SaudeMedicos-_Base_de_Dados.csv
2018-09-10_Sistema_E-SaudeMedicos-_Base_de_Dados.csv
IMPORTANT #1: Some of the information in the datasets is in Brazilian Portuguese (I'm originally from Brazil and decided to explore our public datasets), but I'll translate all the important information throughout this page into English. Some of the images used in this project will also be Google Translated to English.
IMPORTANT #2: The idea of this project is not to confirm whether public statements are true, nor to test the veracity of any scientific study.
Let's begin!
When navigating the web, it's very likely we will come across bold statements, such as the following:
Some of these statements can't simply be tested, but what if we decided to use publicly available data to test them out?
I then decided to download and browse through the data of a Brazilian city (since the posts were on Brazilian websites).
The city is Curitiba, which is not only very organized in terms of technology, but also a place that can get quite cold during some periods of the year.
The thing is, when I finished downloading the files, I quickly realized that no-code / low-code applications like Excel or Power BI would be a little tricky to use, since we start getting into gigabytes of data, with MILLIONS of rows to analyze. (Even trying to open one of these files in Excel resulted in a miserable crash. Time to use the real deal.)
Big datasets require big tools, and considering not everyone has a top notch machine at home or at their company, cloud solutions start to look very compelling.
I'm sure there are different ways of cleaning data, changing encodings, all the good stuff.
But since we are talking about Portuguese and its special characters, I decided to manually change a few things in Sublime Text.
Great, now with all the spaces, the "Ç" and the "Ã" and their cousins out of the way, let's upload these files to our cloud.
For starters, Cloud Storage is a badass version of standard file-hosting clouds, except it was designed with scalability in mind. It's super powerful and integrates like a breeze with BigQuery, which we will use in a moment.
BUT HEY, HELIO: Why don't you filter the dataset into a small version with only the occurrences of heart diseases and analyze it straight in Jupyter, without uploading all of that to the cloud, which could potentially generate costs?
In a real-life scenario, of course, the best decision is always the one that generates the best result while consuming the least amount of resources. But the goal here is precisely to exercise the cloud workflow end to end, and the raw files are too big to handle comfortably on a local machine anyway.
Side note: It's possible to use a multi-cloud environment, but let's make it simple.
A little step-by-step: create a bucket in Cloud Storage, then upload the three CSV files into it. That's it, quite straightforward I'd say.
Now, let's make things a little more tangible.
We want to analyze heart issues in general, specifically heart attacks, but of course the publicly available data is not only about heart diseases; there's a little bit of everything going on.
Another problem is that we don't have weather info in these datasets, which requires us to get another public dataset (which Google kindly provides) and cross it with the temperature for the particular days we are looking at.
So let's do it like this: filter the medical records down to the heart-related occurrences, pull the weather data for Curitiba, and join the two by date. And for all intents and purposes, it is time for some good old SQL, data's best friend. Let's do it.
We need to "tell" BigQuery how to manipulate the data.
Before that, we need to understand how it manages data:
That being said, we need to ask BigQuery to load the files from Cloud Storage into a DATASET.
Then we WRITE A QUERY against that data to produce the TABLE we'll use afterwards.
Since we have three files with the same variables (columns), we can create a table from the first file and upload its schema with it, then load the other two files into BigQuery by appending them to the created table, with the same schema as well.
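For reference, the same load can also be expressed in SQL with BigQuery's LOAD DATA statement instead of clicking through the console; a minimal sketch, with a hypothetical bucket and table name:

-- Load the first CSV from Cloud Storage into a table (created if it doesn't exist yet)
LOAD DATA INTO `heart-attack-ds.medical_data.e_saude_2018`
FROM FILES (
  format = 'CSV',
  uris = ['gs://my-heart-bucket/2018-03-10_Sistema_E-SaudeMedicos-_Base_de_Dados.csv'],
  skip_leading_rows = 1
);
-- Repeat for the other two CSVs: LOAD DATA INTO appends to the existing table,
-- so all three files end up in one table with the same schema.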
Since this is not a tutorial on how to use BigQuery, I'll jump straight to the Queries. Feel free to ask me for more information.
With this information I can execute my query.
The first query I executed was the following:
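(What follows is a minimal sketch rather than the exact query; the source table name and the codigo_do_cid column are assumptions for illustration, while descricao_do_cid and data are the columns we'll meet again later in the analysis.)

-- Keep only the heart-related occurrences from the raw medical records
SELECT
  data,               -- date of the medical attendance
  descricao_do_cid    -- description of the ICD / CID diagnosis code
FROM `heart-attack-ds.medical_data.e_saude_2018`
WHERE codigo_do_cid LIKE 'I2%'   -- ICD-10 codes I20-I25 cover ischemic heart diseases, including heart attacks (I21)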
Well, let's describe what's going on here: we keep only the heart-related occurrences, plus the date and diagnosis columns we'll need later.
Also notice how each query processes a determined amount of data.
That's why it's super important to write efficient queries: not only because of time, but also because you pay for the amount of data processed. Good queries not only save time, they also save money.
Now we save the query results into a new dataset / table. Good job so far.
We have our Heart Medical data, but we are not done yet.
Remember that we want to check if the cases increase when it's cold?
Google supplies weather info inside BigQuery; it's called noaa_gsod (National Oceanic and Atmospheric Administration - Global Surface Summary of the Day), and it's available all the way back to 1929.
We are going to query the GSOD 2018 data for the city of Curitiba, Brazil.
Here comes the second query:
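(Again, a minimal sketch rather than the exact query; the station filter is an assumption for illustration, so check the public stations table for the station you actually want.)

-- Daily weather for Curitiba in 2018, from the public GSOD data
SELECT
  PARSE_DATE('%Y%m%d', CONCAT(g.year, g.mo, g.da)) AS data,
  ROUND((g.temp - 32) * 5 / 9, 1) AS temp_celsius   -- GSOD temperatures come in Fahrenheit
FROM `bigquery-public-data.noaa_gsod.gsod2018` AS g
JOIN `bigquery-public-data.noaa_gsod.stations` AS s
  ON g.stn = s.usaf AND g.wban = s.wban
WHERE s.country = 'BR'
  AND s.name LIKE '%CURITIBA%'
  AND g.temp <> 9999.9   -- 9999.9 marks missing temperatures in GSOD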
Well, let's describe what's going on here: we grab the daily temperatures for Curitiba throughout 2018 and keep them alongside their dates.
I then saved this into a new dataset / table as well; it's time for the "Grand SQL Finale".
Two tables to be merged into one while retaining all the information we need. Time to JOIN.
Final query:
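(A minimal sketch of the JOIN; the intermediate table names below are placeholders, and both date columns are assumed to have been stored as DATE.)

-- Merge the heart occurrences with the weather of the day they happened
SELECT
  h.data,
  h.descricao_do_cid,
  w.temp_celsius
FROM `heart-attack-ds.occurrences_results.heart_cases` AS h
JOIN `heart-attack-ds.occurrences_results.weather_curitiba_2018` AS w
  ON h.data = w.data   -- cast one side if the date types don't match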
Let's describe the query: it JOINs the heart-occurrences table with the weather table on the date, so each medical record now carries the temperature of the day it happened. The result gets saved as heart-attack-ds.occurrences_results.heart_occurrences, the table we'll read from Colab in a moment.
Here we are!
Now we are finally about to start the so-called "analysis", using Google Colab and some Python libraries.
The idea of using a Colab notebook is the integration with Google Cloud. While it's also possible to install the libraries in Jupyter, using native tools is always interesting. Plus, the idea for this project is to use Google's solutions as much as possible.
Let's begin by importing the libraries and authenticating:
# LIBRARIES
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from google.cloud import bigquery
# AUTHENTICATION
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
# ID of the Google Cloud project
project_id = 'heart-attack-ds'
# Set up the BigQuery client
client = bigquery.Client(project=project_id)
# Set up the query
query = """
SELECT *
FROM `heart-attack-ds.occurrences_results.heart_occurrences`
"""
# Run the query and retrieve the results
query_job = client.query(query)
results = query_job.result()
# Convert the results to a pandas dataframe
heart_occurrences = results.to_dataframe()
While we did in fact have to analyze multiple things to get to this point (after all, we NEED to know what we are doing in our queries), here is the part where we look at numbers, plot things, cross variables, and all the good stuff.
Many things happened for this moment to be possible, so the legend is in fact true: The actual analysis is only a small part of the whole process.
The only thing is: this part is a piece of cake now that we've made our dataset / table as objective as possible.
# Print the dataframe to check if everything is fine
print(heart_occurrences)
# Print the head of the dataframe to get a slightly more organized view
heart_occurrences.head()
# Based on the temperature, I want to know the number of occurrences of each type of heart problem.
heart_occurrences.groupby(['temp_celsius','descricao_do_cid'])['data'].value_counts()
# Saving as a variable
value_per_temp = heart_occurrences.groupby(['temp_celsius','descricao_do_cid'])['data'].value_counts()
# I want to check the mean
value_per_temp.groupby(['temp_celsius', 'descricao_do_cid']).mean().reset_index()
# Saving as a variable
value_per_temp_mean = value_per_temp.groupby(['temp_celsius', 'descricao_do_cid']).mean().reset_index()
# I want to check how many types of ICD / CID are described in this dataset.
cids = value_per_temp_mean['descricao_do_cid'].unique()
for cid in cids:
    print(cid)
# At last, let's see how the mean of each ICD corresponds to the temperature of when it happened.
for cid in cids:
    plt.figure(figsize=(16, 6))
    plt.title(cid)
    df_cid = value_per_temp_mean[value_per_temp_mean['descricao_do_cid'] == cid]
    sns.barplot(x="temp_celsius", y="data", data=df_cid, color="coral")
On a real-life project, this is the part where I'd write my conclusions. I usually leave remarks throughout the plots so I don't lose information, then I just concatenate everything at the end.
But since this project basically had one plotting session and, so to speak, one question to be answered, we only need to look at a few things.
Why? As we can clearly see in our plots, the relationship between temperature and the mean number of heart-disease occurrences looks irrelevant when we plot all the cases in the dataset.
We also need to gather more information: in which city / country was the research behind these posts done?
Did the research follow a good methodology in the first place?
We only have data from public institutions; what about private ones?
What if the increase in the number of cases comes from a combination of weather AND other factors, such as employment rate, human development index, or something else?
Machine Learning would be excellent in that scenario to identify the most important variables. Maybe we could even look into putting a model into production.
Plus, not many cities in Brazil stay below 14°C (57°F) for long stretches of the year, so I wonder how such an unbalanced scenario would support a conclusion like the one in the titles of those posts.
Why? Because we didn't analyze the data for other years :)
Why? Because I'd need to run more and more data, possibly from multiple years, plot more visualizations, and study the plots for longer.
After calmly analyzing a good chunk of data (we are probably talking about TERABYTES), I'd be able to say with more confidence what is going on there.
If we have all the data needed: GREAT
If we have part of it: RUN IT AS A SAMPLE
If we barely have any data: It is just opinion.
And honestly, looking at the complexity of a GOOD analysis, I doubt anything like analyzing big chunks of data was actually done.
We have to always question everything.
The list goes on! There are so many things that could in fact contribute to heart attacks that it kinda feels silly to claim that the weather by itself plays a major role.
Now that's the end.
THANK YOU for sticking around and I hope this project was useful to you.
Could I plug in all the available heart-disease datasets at a larger scale? Yes.
Is there room for improvement? Yes.
But I guess that already does the trick for a GCP portfolio project.
Hope to talk to you soon!
Helio Ribeiro
helioribeiropro@gmail.com
+55 (11) 9 3932-8049