Combine data to investigate a hypothesis

This stuy case is based on the research question whether there is relation between town temperature and health risks. We would like to compare towns that have the most highest temperature with other towns in the Unitied States. The hypothesis is that the health risk value is higher for towns with higher temperature compared to ordinary towns. In particular the risk related to 'Adult Binch Drinking'. For this we need two datasets. The dataset of hottest towns and the dataset of USHealth_data.csv. The end goal is to run statistical comparion test. Before we can do such we first need to clean, organize and merge the data. The column 'Value' for instance contains not all float type data.

from bokeh.io import output_notebook
output_notebook()

Loading BokehJS ...

Load first dataset: hotest towns

First we load the text file hottest_towns.txt. This data we put in a dataframe. If we look with an editor in the file we see that the first two lines are not of interest, so we skip these. Furthermore there are no column labels so we add these.

import pandas as pd
df=pd.read_table('data/hottest_towns.txt', skiprows = 2, names=['place', 'days'])
df.head()

place

days

0

Phoenix, Arizona

107

1

Las Vegas, Nevada

70

2

Riverside, California

24

3

Dallas, Texas

17

4

Austin, Texas

16

As we can see we need to split the City from the State. Let's do this

df[['City', 'State']] = df['place'].str.split(',', n=1, expand=True)
ht = df.drop('place', axis = 1) #drop column place
ht.head()

days

City

State

0

107

Phoenix

Arizona

1

70

Las Vegas

Nevada

2

24

Riverside

California

3

17

Dallas

Texas

4

16

Austin

Texas

Load the health data

First we download data from http://www.bigcitieshealth.org/obesity-physical-activity and inspect it with .head()

Indicator Category

Indicator

Year

Sex

Race/Ethnicity

Value

Place

BCHC Requested Methodology

Source

Methods

Notes

90% Confidence Level - Low

90% Confidence Level - High

95% Confidence Level - Low

95% Confidence Level - High

0

Behavioral Health/Substance Abuse

Opioid-Related Unintentional Drug Overdose Mor...

2010

Both

All

1.7

Washington, DC

Age-Adjusted rate of opioid-related mortality ...

D.C. Department of Health, Center for Policy, ...

NaN

This indicator is not exclusive of other drugs...

NaN

NaN

NaN

NaN

1

Behavioral Health/Substance Abuse

Opioid-Related Unintentional Drug Overdose Mor...

2010

Both

All

2.2

Fort Worth (Tarrant County), TX

Age-adjusted rate of opioid-related mortality ...

National Center for Health Statistics

NaN

This indicator is not exclusive of other drugs...

NaN

NaN

1.5

3.0

It seems that we need to do some cleaning as well in the place column. States are abbreviated. Let us split the column like we did in the hottest town data set and than change the state with the map function.

Indicator Category

Indicator

Year

Sex

Race/Ethnicity

Value

Place

BCHC Requested Methodology

Source

Methods

Notes

90% Confidence Level - Low

90% Confidence Level - High

95% Confidence Level - Low

95% Confidence Level - High

City

State

0

Behavioral Health/Substance Abuse

Opioid-Related Unintentional Drug Overdose Mor...

2010

Both

All

1.7

Washington, DC

Age-Adjusted rate of opioid-related mortality ...

D.C. Department of Health, Center for Policy, ...

NaN

This indicator is not exclusive of other drugs...

NaN

NaN

NaN

NaN

Washington

DC

1

Behavioral Health/Substance Abuse

Opioid-Related Unintentional Drug Overdose Mor...

2010

Both

All

2.2

Fort Worth (Tarrant County), TX

Age-adjusted rate of opioid-related mortality ...

National Center for Health Statistics

NaN

This indicator is not exclusive of other drugs...

NaN

NaN

1.5

3.0

Fort Worth (Tarrant County)

TX

Indicator Category

Indicator

Year

Sex

Race/Ethnicity

Value

Place

BCHC Requested Methodology

Source

Methods

Notes

90% Confidence Level - Low

90% Confidence Level - High

95% Confidence Level - Low

95% Confidence Level - High

City

State

0

Behavioral Health/Substance Abuse

Opioid-Related Unintentional Drug Overdose Mor...

2010

Both

All

1.7

Washington, DC

Age-Adjusted rate of opioid-related mortality ...

D.C. Department of Health, Center for Policy, ...

NaN

This indicator is not exclusive of other drugs...

NaN

NaN

NaN

NaN

Washington

District of Columbia

1

Behavioral Health/Substance Abuse

Opioid-Related Unintentional Drug Overdose Mor...

2010

Both

All

2.2

Fort Worth (Tarrant County), TX

Age-adjusted rate of opioid-related mortality ...

National Center for Health Statistics

NaN

This indicator is not exclusive of other drugs...

NaN

NaN

1.5

3.0

Fort Worth (Tarrant County)

Texas

Inspect data

There are quit some missing values. We might want to drop the NaN's of the Values since this is our column of interest

Let us inspect the indicator values. We were interested in adult binch drinking

The question is wether the names of the cities and the states are comparable

As we can see we have to get rid of the (additional text) at some names. We can do such using regular expressions. The regular expression \s\([a-zA-Z\s\-]+\) matches any name starting with opening bracket (, followed by a name including spaces or '-' followed by a closing bracket )

Lastly we clean the States with strip() to make sure we have matching names

Now we cleaned and matched the city and states we can use this to merge the data

Merge both datasets to create a subset hottest towns health data

We join the datasets on City and State. Drop all the rows from that have NaN in the 'Value'

Indicator Category

Indicator

Year

Sex

Race/Ethnicity

Value

Place

BCHC Requested Methodology

Source

Methods

Notes

90% Confidence Level - Low

90% Confidence Level - High

95% Confidence Level - Low

95% Confidence Level - High

City

State

days

0

Behavioral Health/Substance Abuse

Percent of Adults Who Binge Drank

2010

Both

Black

24.3

San Antonio, TX

BRFSS (or similar) How many times during the p...

2013 BRFSS Bexar County

NaN

Data includes Bexar County, TX, not just San A...

NaN

NaN

NaN

NaN

San Antonio

Texas

8

1

Behavioral Health/Substance Abuse

Percent of Adults Who Binge Drank

2010

Both

Hispanic

25.9

San Antonio, TX

BRFSS (or similar) How many times during the p...

2013 BRFSS Bexar County

NaN

Data includes Bexar County, TX, not just San A...

NaN

NaN

NaN

NaN

San Antonio

Texas

8

Create a dataset without the hottest towns

Now we need to create the second dataset without the hottest towns. We can use a trick to do such. We can create a list of towns and select all that is not in that list.

Indicator Category

Indicator

Year

Sex

Race/Ethnicity

Value

Place

BCHC Requested Methodology

Source

Methods

Notes

90% Confidence Level - Low

90% Confidence Level - High

95% Confidence Level - Low

95% Confidence Level - High

City

State

568

Behavioral Health/Substance Abuse

Percent of Adults Who Binge Drank

2010.0

Both

All

10.9

Miami (Miami-Dade County), FL

BRFSS (or similar) How many times during the p...

BRFSS

NaN

NaN

NaN

NaN

NaN

NaN

Miami

Florida

569

Behavioral Health/Substance Abuse

Percent of Adults Who Binge Drank

2010.0

Both

All

13.3

Charlotte, NC

BRFSS (or similar) How many times during the p...

2010 NC BRFSS (Mecklenburg Sample)

NaN

NaN

NaN

NaN

8.7

17.9

Charlotte

North Carolina

Visual inspection

A histogram can give insight in the distribution and whether both categories are comparable. From the following plot we can see that both distributions are comparable, but hottest town is a bit to the right if you ignore the outlier of not hottest towns

We still see a few outliers. We might want to delete them. Of course you first need to investigate wether you can disgard them. For demonstration purpose we drop the outliers.

Statistical tests

If we assume normality we can run the t-test to compare significant differences. If we do not we should use a non parametic test. From the plot we can see that our data is not normal.

Conclusion

The hotest town Value of Binch Drinking is significant higher than the regular towns.

Last updated

Was this helpful?