SUMMARY - ABSTRACT


A complete project, from data inspection to analysis, built on Google's ecosystem (Cloud Storage, BigQuery, Colab).

According to experts, the number of heart attacks increases when the temperature drops, more specifically below 14°C (57°F). How does this claim hold up when we analyze a sample from a particular city in Brazil?

In this project I want to demonstrate some skills in public-data analysis, SQL and cloud computing.


Nowadays, every data professional should be comfortable with cloud solutions, especially those working at startups.

The idea of this project is to show decision making, how usable the cloud is, how to integrate a few modules, and how to simulate a small data operation.

OBJECTIVE: Check whether public statements hold up by analysing public data on cloud infrastructure.

Below are the links to the statements and the datasets.

SOURCE 1: https://www.uol.com.br/vivabem/noticias/redacao/2019/07/07/inverno-aumenta-em-30-o-risco-de-infarto-saiba-como-se-proteger.htm
SOURCE 2: https://drauziovarella.uol.com.br/cardiovascular/numero-de-casos-de-infarto-e-avc-aumenta-no-frio/
DATASETS: http://dadosabertos.c3sl.ufpr.br/curitiba/SESPAMedicoUnidadesMunicipaisDeSaude/
2018-03-10_Sistema_E-SaudeMedicos-_Base_de_Dados.csv
2018-06-10_Sistema_E-SaudeMedicos-_Base_de_Dados.csv
2018-09-10_Sistema_E-SaudeMedicos-_Base_de_Dados.csv

IMPORTANT #1: Some of the information in the datasets is in Brazilian Portuguese (I'm originally from Brazil and decided to explore our public datasets), but I'll translate all the important information on this page to English. Some of the images used in this project have also been Google Translated to English.

IMPORTANT #2: The idea of this project is not to affirm whether public statements are true, nor to test the veracity of any scientific study.

USED TECHNOLOGIES
  1. SQL
  2. GCP (Google Cloud Platform)
    2.1. Cloud Storage
    2.2. BigQuery
    2.3. Google Colab
  3. Python
    3.1. Pandas
    3.2. Seaborn
    3.3. Matplotlib.pyplot

Let's begin!


1. THE QUESTION


When navigating the web, it's very likely we will come across bold statements, such as the following:

1_HEART_ATTACK_COLD.png

2_HEART_ATTACK_COLD.png


Some of these statements can't easily be tested, but what if we decided to use publicly available data to test them?

I then decided to download and browse through the data of a Brazilian city (since the posts were on Brazilian websites).

I chose the city of Curitiba, which is not only very organized in terms of technology, but is also a city that can get quite cold during some periods of the year.

Thing is, once I finished downloading the files, I quickly realized that no-code / low-code applications like Excel or Power BI would be a little tricky to use, since we start to get into gigabytes of data, with MILLIONS of rows to analyze. (Even trying to open one of these files in Excel resulted in a miserable crash. Time to use the real deal.)

3_DATA_FILES.png

4_DATA_SIZE.png

Big datasets require big tools, and considering not everyone has a top-notch machine at home or at their company, cloud solutions start to look very compelling.

2. DATA CLEANING

You know how people say that data professionals spend most of their time cleaning and preparing data for analysis instead of running ML models?

WELL.

With a portfolio project, that shall be no different.

I'm sure there are plenty of ways to clean data, change encodings, and all the good stuff.

But since we are dealing with Portuguese and its special characters, I decided to manually change a few things in Sublime Text.

BEFORE

6_SUBLIME_BEFORE.png

AFTER

7_SUBLIME_AFTER.png


Great, now with all the spaces, the "Ç" and the "Ã" and their cousins out of the way, let's upload these files to our cloud.


3. GOOGLE CLOUD STORAGE


For starters, Cloud Storage is a badass version of your standard file-upload cloud, except it was designed with scalability in mind. It's super powerful and integrates like a breeze with BigQuery, which we will use in a moment.


BUT HEY, HELIO: Why don't you filter the dataset down to a small version with only the occurrences of heart disease and analyse it straight in Jupyter, without uploading all of that to the cloud, which could potentially generate costs?

WELL...
  • FIRST, I want to simulate a real life scenario where everything is available at the cloud of choice.
  • SECOND, I want to show some SQL skills because that's what represents the day-to-day stuff of data professionals.
  • THIRD, I know the file is only 1GB... But what if it were 70GB? Tough to work in a Jupyter notebook without enough memory to load the datasets.
  • FOURTH: Cloud is cool. 🤓

In a real-life scenario, of course, the best decision is always the one that generates the best result while consuming the least amount of resources.


Side note: It's possible to use a multi-cloud environment, but let's keep it simple.

A little step-by-step ahead:

  1. Enter GCP Console
  2. Create a project (or enter an existing one; in my case it is named "Heart Attack-DS")
  3. Enter Cloud Storage on the left navigation bar
  4. Create and/or enter the bucket for your project
  5. Create a folder to make sure everything is neat
  6. Upload the 1GB of files so they are available to work with in the cloud.

In the end, it should look something like this:

9_CLOUD_STORAGE.png



That's it. Quite straightforward, I'd say.

Now, let's make things a little more tangible.


4. GOOGLE BIGQUERY / SQL - ASSEMBLING THE DATASETS


We want to analyze heart issues in general, and heart attacks specifically, but of course the publicly available data is not only about heart diseases. There's a little bit of everything going on.

Another problem is that we don't have weather info in these datasets, which requires us to get another public dataset (which Google kindly provides) and cross it against the temperature info for the particular days we are looking at.

So let's do it like this:

  1. I need to aggregate all the datasets into one.
  2. Then I will reduce its size by filtering only the data relevant to heart problems.
  3. Finally, I will join the weather data to the dataset, making it ready for analysis.

And for all intents and purposes, it is time for some good old SQL, data's best friend. Let's do it.


4.1. BIGQUERY - Aggregating the 3 datasets into 1


We need to "tell" BigQuery how to manipulate the data.

Before that, we need to understand how it manages data:

  • First, we create a DATASET. The DATASET will store the TABLE.
  • A TABLE is the equivalent of what we are used to visualizing in Pandas, Excel, etc.

That being said, we need to ask BigQuery to load the files from Cloud Storage into a DATASET.
Then we WRITE A QUERY against the DATASET to produce a TABLE to be used afterwards.

Since we have 3 files with the same variables (columns), we can create a table from the first file, uploading its schema with it, and then load the other 2 files into BigQuery by appending them to the created table, with the same schema.

Since this is not a tutorial on how to use BigQuery, I'll jump straight to the Queries. Feel free to ask me for more information.
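That said, for reference, the same load step can also be expressed as GoogleSQL LOAD DATA statements. This is a sketch of the idea, not the exact commands I ran: the bucket path, dataset and table names are placeholders, and the semicolon delimiter is an assumption about these CSVs.

-- Create the table from the first file, letting BigQuery pick up the schema
LOAD DATA INTO `heart-attack-ds.esaude.atendimentos_medicos`
FROM FILES (
  format = 'CSV',
  field_delimiter = ';',   -- assumption: the open-data CSVs are semicolon-delimited
  skip_leading_rows = 1,
  uris = ['gs://YOUR_BUCKET/esaude/2018-03-10_Sistema_E-SaudeMedicos-_Base_de_Dados.csv']
);

-- Append the other two files to the same table, reusing the same schema
LOAD DATA INTO `heart-attack-ds.esaude.atendimentos_medicos`
FROM FILES (
  format = 'CSV',
  field_delimiter = ';',
  skip_leading_rows = 1,
  uris = ['gs://YOUR_BUCKET/esaude/2018-06-10_Sistema_E-SaudeMedicos-_Base_de_Dados.csv',
          'gs://YOUR_BUCKET/esaude/2018-09-10_Sistema_E-SaudeMedicos-_Base_de_Dados.csv']
);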


We will end up with something like this:

10_TABLE_ON_BIGQUERY.png



PREVIEW OF THE TABLE, JUST TO VISUALIZE

11_TABLE_PREVIEW.png



TABLE INFO

12_TABLE_INFO.png



2.6 million rows. I must say, that looks exciting.


Now, let's work with the data.

4.2. SQL - Filtering by Heart Occurrences



The idea here is to filter the occurrences using only their ICD (International Classification of Diseases), or CID in Portuguese.

With a quick search, I found this information online (translated from PT-BR to ENG):

13_CID.png

14_CID_HEART.png

How heart diseases are referred to.

With this information I can execute my query.

The first query I executed was the following:

15_QUERY_FILTER.png

Well, let's describe what's going on here:

  • I want only 3 variables for my analysis: When, How and How Much.
  • I basically asked to extract the date and the ICD (along with its text description), and decided to count the occurrences.
  • They come from the dataset I assembled earlier.
  • I filter only the ICD codes related to heart issues.
  • I group the info so I can have a count of issues per day.
  • I create a count column to help me understand the amount and types of occurrences for each day.
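Since the query itself only appears as a screenshot, here is a sketch of what it looked like, reconstructed from the description above. The source table and the date column name are assumptions, and the exact GROUP BY in the screenshot may differ slightly; the output column names match the final table we will meet later.

SELECT
  EXTRACT(DATE FROM data_do_atendimento),   -- becomes "f0_" since no alias is given; renamed to "data" later
  codigo_do_cid,                             -- the ICD / CID code
  descricao_do_cid,                          -- the text version of the ICD / CID
  COUNT(*) AS total_ocorrencias              -- how many occurrences per group
FROM `heart-attack-ds.esaude.atendimentos_medicos`   -- the table assembled earlier (placeholder name)
WHERE codigo_do_cid LIKE 'I2%'               -- ICD-10 I20-I25: ischaemic heart diseases, including heart attack
GROUP BY 1, 2, 3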

Notice how each query processes a certain amount of data.

16_QUERY_SIZE.png

That's why it's super important to write efficient queries, not only because of time, but also because you pay for the amount of data processed. Good queries not only save time, they also save money.

Now we save the query results into a new dataset / table. Good job so far.

4.3. SQL - Gathering weather data


We have our Heart Medical data, but we are not done yet.
Remember that we want to check if the cases increase when it's cold?

Now we have to gather weather information.

Google supplies weather info inside BigQuery through the public dataset noaa_gsod (National Oceanic and Atmospheric Administration - Global Surface Summary of the Day), with data available all the way back to 1929.

We are going to query data from GSOD 2018, for the city of Curitiba, Brazil.

Here comes the second query:

17_QUERY_TEMPERATURE.png

Well, let's describe what's going on here:

  • I extracted the Day, Month and Year from the dataset (which I'll use as an index and as a reference to join with the ICD table).
  • STN stands for Station, which is where NOAA reads information from. 838420 is the station for Curitiba.
  • I also converted from Fahrenheit to Celsius, simply because I'm more used to it. I bet some people didn't know this could be done in SQL; I was surprised when I first learned it.
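For reference, here is a sketch of this second query; the exact SELECT list in the screenshot may differ, and the rounding is my own choice here.

SELECT
  da,                                              -- day of the month (stored as STRING in the public table)
  mo,                                              -- month
  year,                                            -- year
  ROUND((temp - 32) * 5 / 9, 1) AS temp_celsius    -- NOAA reports the mean temperature in Fahrenheit
FROM `bigquery-public-data.noaa_gsod.gsod2018`
WHERE stn = '838420'                               -- the weather station used for Curitiba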

Now that I've saved this into a new dataset / table, it's time for the "Grand SQL Finale".

4.4. SQL - JOINing weather data


2 Tables to be merged into 1 while retaining all the information that we need. Time to JOIN.

Final query:
18_QUERY_JOIN_TEMP.png

Let's describe the query:

  • I select all the variables that I need, regardless of which table they belong to.
  • "oco" and "tem" are table aliases, used only inside the query; they won't show up later.
  • The "data" next to f0_ means that I'm renaming the column f0_ to data, since f0_ is the automatic name we got from the first query when I used EXTRACT on the date without an alias.
  • Now pay attention to the EXTRACT:
    • It is necessary for the JOIN to work because, although both tables have date information, it is stored differently in each.
    • Example: I EXTRACTED the DAY from "oco" (my first table) and told my query that it equals the "da" column of the table "tem".
    • I had to CAST each variable AS INT64 because in my first table the information is stored as a number, while in the temperature table it is stored as a STRING.
  • That results in the final table, where I have all the heart occurrences plus the weather info for that particular day, which is exactly what I need for my analysis.
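Putting that description together, the JOIN looked roughly like this. The names of the two intermediate tables are placeholders; only the project id and the destination we will query later from Colab (heart-attack-ds.occurrences_results.heart_occurrences) are known for sure.

SELECT
  oco.f0_ AS data,                 -- rename the auto-generated f0_ column to "data"
  oco.codigo_do_cid,
  oco.descricao_do_cid,
  oco.total_ocorrencias,
  tem.temp_celsius
FROM `heart-attack-ds.occurrences.heart_filtered` AS oco      -- result of the filter query (placeholder name)
JOIN `heart-attack-ds.weather.curitiba_2018` AS tem           -- result of the weather query (placeholder name)
  ON  EXTRACT(DAY   FROM oco.f0_) = CAST(tem.da   AS INT64)   -- the date is split across columns in "tem"
  AND EXTRACT(MONTH FROM oco.f0_) = CAST(tem.mo   AS INT64)   -- and stored as STRING, hence EXTRACT + CAST
  AND EXTRACT(YEAR  FROM oco.f0_) = CAST(tem.year AS INT64)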

19_FINAL_TABLE.png

The size of the final table: 1,324 rows.

20_TABLE_SIZE.png

It is much easier to work with this information now, isn't it?


I could do several things with this dataset, including delivering it as an Excel file for a business professional. But I think Python will allow me to run the analysis faster and gather more insights quickly.

Time to go to Colab :)

5. GOOGLE COLAB / PYTHON


Here we are!

Now we are finally about to start the so called "analysis" using Google Colab and some Python libraries.

The idea of using a Colab notebook is the integration with Google Cloud. While it's also possible to install the libraries in Jupyter, using native tools is always interesting. Plus, the idea for this project is to use Google's solutions as much as possible.

Let's begin by:

  • Loading the libraries we will use
  • Authenticating our connection with BigQuery
  • Saving our final query from BQ into a Pandas Dataframe.
In [1]:
# LIBRARIES

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from google.cloud import bigquery
In [2]:
# AUTHENTICATION

from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated
In [3]:
# ID of the Google Cloud project
project_id = 'heart-attack-ds'
In [4]:
# Set up the BigQuery client
client = bigquery.Client(project=project_id)

# Set up the query
query = """
    SELECT *
    FROM `heart-attack-ds.occurrences_results.heart_occurrences`
"""

# Run the query and retrieve the results
query_job = client.query(query)
results = query_job.result()
In [5]:
# Convert the results to a pandas dataframe
heart_occurrences = results.to_dataframe()

5.1. PYTHON - Quick browse through data

Still following me here?
We've finally arrived at the part that's most commonly understood as ANALYSIS.


While we did in fact have to analyse multiple things to get to this point (after all, we NEED to know what we are doing when we write queries), this is the part where we look at numbers and plots, cross things, understand variables, and all the good stuff.

Many things happened for this moment to be possible, so the legend is in fact true: The actual analysis is only a small part of the whole process.

The only thing is: This part is a piece of cake now that we basically made our dataset / table as objective as possible.

And the reason it's so simple is because we only wanted to answer ONE QUESTION:

HEART ATTACK: DO LOW TEMPERATURES INCREASE THE NUMBER OF CASES?

Let's find out!

In [6]:
# Print the dataframe to check if everything is fine
print(heart_occurrences)
            data codigo_do_cid  \
0     2018-06-01          I200   
1     2018-06-01          I200   
2     2018-06-01          I200   
3     2018-06-04          I200   
4     2018-06-04          I200   
...          ...           ...   
1319  2018-02-19          I238   
1320  2018-02-22          I238   
1321  2018-02-22          I238   
1322  2018-08-24          I213   
1323  2018-06-21          I231   

                                       descricao_do_cid  total_ocorrencias  \
0                                       ANGINA INSTAVEL                  1   
1                                       ANGINA INSTAVEL                  1   
2                                       ANGINA INSTAVEL                  1   
3                                       ANGINA INSTAVEL                  1   
4                                       ANGINA INSTAVEL                  1   
...                                                 ...                ...   
1319  OUTRAS COMPLICACOES ATUAIS SUBSEQUENTES AO INF...                  1   
1320  OUTRAS COMPLICACOES ATUAIS SUBSEQUENTES AO INF...                  1   
1321  OUTRAS COMPLICACOES ATUAIS SUBSEQUENTES AO INF...                  1   
1322  INFARTO AGUDO TRANSMURAL DO MIOCARDIO, DE LOCA...                  1   
1323  COMUNICACAO INTERATRIAL COMO COMPLICACAO ATUAL...                  1   

      temp_celsius  
0             17.0  
1             17.0  
2             17.0  
3             11.0  
4             11.0  
...            ...  
1319          21.0  
1320          23.0  
1321          23.0  
1322          19.0  
1323          17.0  

[1324 rows x 5 columns]
In [7]:
# Print the head of the dataframe to get a slightly more organized view
heart_occurrences.head()
Out[7]:
data codigo_do_cid descricao_do_cid total_ocorrencias temp_celsius
0 2018-06-01 I200 ANGINA INSTAVEL 1 17.0
1 2018-06-01 I200 ANGINA INSTAVEL 1 17.0
2 2018-06-01 I200 ANGINA INSTAVEL 1 17.0
3 2018-06-04 I200 ANGINA INSTAVEL 1 11.0
4 2018-06-04 I200 ANGINA INSTAVEL 1 11.0
In [8]:
# Based on the temperature, I want to know the number of occurrences of each type of heart problem.
heart_occurrences.groupby(['temp_celsius','descricao_do_cid'])['data'].value_counts()
Out[8]:
temp_celsius  descricao_do_cid                                                       data      
9.0           ANGINA INSTAVEL                                                        2018-06-15    4
              ANGINA PECTORIS, NAO ESPECIFICADA                                      2018-06-15    4
10.0          ANGINA INSTAVEL                                                        2018-06-14    6
                                                                                     2018-07-11    4
                                                                                     2018-07-12    2
                                                                                                  ..
25.0          INFARTO AGUDO DO MIOCARDIO NAO ESPECIFICADO                            2018-01-22    1
                                                                                     2018-01-23    1
              INFARTO AGUDO TRANSMURAL DO MIOCARDIO DE OUTRAS LOCALIZACOES           2018-01-22    1
              OUTRAS COMPLICACOES ATUAIS SUBSEQUENTES AO INFARTO AGUDO DO MIOCARDIO  2018-02-09    1
              OUTRAS FORMAS DE ANGINA PECTORIS                                       2018-01-23    2
Name: data, Length: 430, dtype: int64
In [9]:
# Saving as a variable

value_per_temp = heart_occurrences.groupby(['temp_celsius','descricao_do_cid'])['data'].value_counts()
In [10]:
# I want to check the mean
value_per_temp.groupby(['temp_celsius', 'descricao_do_cid']).mean().reset_index()
Out[10]:
temp_celsius descricao_do_cid data
0 9.0 ANGINA INSTAVEL 4.000000
1 9.0 ANGINA PECTORIS, NAO ESPECIFICADA 4.000000
2 10.0 ANGINA INSTAVEL 3.250000
3 10.0 ANGINA PECTORIS COM ESPASMO DOCUMENTADO 1.000000
4 10.0 ANGINA PECTORIS, NAO ESPECIFICADA 6.200000
... ... ... ...
122 25.0 DOENCA ISQUEMICA AGUDA DO CORACAO NAO ESPECIFI... 1.000000
123 25.0 INFARTO AGUDO DO MIOCARDIO NAO ESPECIFICADO 1.333333
124 25.0 INFARTO AGUDO TRANSMURAL DO MIOCARDIO DE OUTRA... 1.000000
125 25.0 OUTRAS COMPLICACOES ATUAIS SUBSEQUENTES AO INF... 1.000000
126 25.0 OUTRAS FORMAS DE ANGINA PECTORIS 2.000000

127 rows × 3 columns

In [11]:
# Saving as a variable

value_per_temp_mean = value_per_temp.groupby(['temp_celsius', 'descricao_do_cid']).mean().reset_index()

5.2. PYTHON - Plotting important info

In [12]:
# I want to check how many types of ICD / CID are described in this dataset.
cids = value_per_temp_mean['descricao_do_cid'].unique()

for cid in cids:
  print(cid)
ANGINA INSTAVEL
ANGINA PECTORIS, NAO ESPECIFICADA
ANGINA PECTORIS COM ESPASMO DOCUMENTADO
DOENCA ISQUEMICA AGUDA DO CORACAO NAO ESPECIFICADA
INFARTO AGUDO DO MIOCARDIO NAO ESPECIFICADO
OUTRAS FORMAS DE ANGINA PECTORIS
INFARTO AGUDO TRANSMURAL DA PAREDE ANTERIOR DO MIOCARDIO
INFARTO DO MIOCARDIO RECORRENTE DE LOCALIZACAO NAO ESPECIFICADA
OUTRAS FORMAS DE DOENCA ISQUEMICA AGUDA DO CORACAO
OUTRAS COMPLICACOES ATUAIS SUBSEQUENTES AO INFARTO AGUDO DO MIOCARDIO
INFARTO AGUDO SUBENDOCARDICO DO MIOCARDIO
INFARTO DO MIOCARDIO RECORRENTE DE OUTRAS LOCALIZACOES
TROMBOSE CORONARIA QUE NAO RESULTA EM INFARTO DO MIOCARDIO
COMUNICACAO INTERATRIAL COMO COMPLICACAO ATUAL SUBSEQUENTE AO INFARTO AGUDO DO MIOCARDIO
INFARTO AGUDO TRANSMURAL DA PAREDE INFERIOR DO MIOCARDIO
INFARTO AGUDO TRANSMURAL DO MIOCARDIO, DE LOCALIZACAO NAO ESPECIFICADA
INFARTO DO MIOCARDIO RECORRENTE DA PAREDE ANTERIOR
INFARTO AGUDO TRANSMURAL DO MIOCARDIO DE OUTRAS LOCALIZACOES
In [13]:
# At last, let's see how the mean of each ICD corresponds to the temperature at which it happened.

for cid in cids:
  plt.figure(figsize=(16,6))
  plt.title(cid)
  df_cid = value_per_temp_mean[value_per_temp_mean['descricao_do_cid'] == cid]
  sns.barplot(x="temp_celsius", y="data", data=df_cid, color="coral")

Now that we have our plots done, let's look at them and draw some conclusions.

Thank you for sticking around!

6. CONCLUSION


In a real-life project, this is the part where I'd write my conclusions. I usually put remarks alongside the plots so I don't lose information, then I just concatenate everything at the end.

But since this project basically had one plotting session and really just one question to answer, we only need to look at a few things.


So, do low temperatures increase the occurrences of heart attacks?

THE ANSWER IS:

  • For the CITY OF CURITIBA in Brazil, IN 2018, within PUBLIC HOSPITALS, and without considering other factors such as AGE and GENDER, the answer is a clear NO.

Why? As we can clearly see in our plots, the relationship between temperature and the mean number of heart disease occurrences is negligible when we plot all the cases in the dataset.

We also need to gather more information: in which city / country was the research behind these websites done?

Did the research follow a good methodology in the first place?

We only have data for public institutions; what about private ones?

What if the increase in the number of cases comes from a combination of weather AND other factors, such as employment rate, human development index, or others?

Machine Learning would be excellent in that scenario for identifying the most important variables. Maybe we could even look into putting a model into production.

Plus, not many cities in Brazil stay below 14°C (57°F) for long stretches of the year, so I wonder how such an unbalanced scenario would support a conclusion like the one in the titles of those posts.


"BUT HELIO CAN WE AFFIRM THAT FOR ALL THE YEARS?"
  • NO.

Why? Because we didn't analyse the data for other years :)


"SO ARE THE WEBSITES WRONG?"
  • INCONCLUSIVE.

Why? Because I'd need to run much more data, possibly from multiple years, plot more visualizations, and study the plots for longer.

After calmly analysing a good chunk of data (we are probably talking about TERABYTES), I'd be able to say for sure what is going on.

If we have all the data needed: GREAT
If we have part of it: RUN IT AS A SAMPLE
If we barely have any data: It is just opinion.

And honestly, looking at the complexity of a GOOD analysis, I doubt anything like analyzing big chunks of data was actually done.

We have to always question everything.


"WHAT WOULD YOU SUGGEST?"
  • I suspect there's more to it, like nutrition and stress.
  • Let's not forget our "outliers", like people with cases in their families. They can take care of themselves, but in the end genetics plays a big role.
  • I'd ask for information such as:
    • Employment Rate;
    • Average Income (Personal and family);
    • Human Development Index;
    • Family History (Simple yes or no is fine);
    • I could consider gender (I'd guess there's a difference there);
    • If possible, some nutrition information by country or state, e.g. whether people from a particular area tend to consume a lot of trans fat or similar;
    • I would consider age;
    • Where the person was born;
    • What type of work that person has, how heavy it is;
    • Person's weight / BMI (Body Mass Index);
    • Whether the person goes to the gym or exercises.

The list goes on! There are so many things that could in fact contribute to heart attacks that it kinda feels silly to claim that the weather by itself plays a major role.


7. ENDING & FINAL REMARKS


Now that's the end.

THANK YOU for sticking around and I hope this project was useful to you.
Could I plug in all the available heart disease datasets at a larger scale? Yes.
Is there room for improvement? Yes.

But I guess this already does the trick for a GCP portfolio project.

Hope to talk to you soon!

CONTACT INFO:

Helio Ribeiro
helioribeiropro@gmail.com
+55 (11) 9 3932-8049