Adults who binge drink in hot towns
a combining data study case
Combine data to investigate a hypothesis
This stuy case is based on the research question whether there is relation between town temperature and health risks. We would like to compare towns that have the most highest temperature with other towns in the Unitied States. The hypothesis is that the health risk value is higher for towns with higher temperature compared to ordinary towns. In particular the risk related to 'Adult Binge Drinking'. For this we need two datasets. The dataset of hottest towns and the dataset of USHealth_data.csv. The end goal is to run statistical comparion test. Before we can do such we first need to clean, organize and merge the data. The column 'Value' for instance contains not all float type data.
Loading BokehJS ...
Load first dataset: hotest towns
First we load the text file hottest_towns.txt
. This data we put in a dataframe. If we look with an editor in the file we see that the first two lines are not of interest, so we skip these. Furthermore there are no column labels so we add these.
place
days
0
Phoenix, Arizona
107
1
Las Vegas, Nevada
70
2
Riverside, California
24
3
Dallas, Texas
17
4
Austin, Texas
16
As we can see we need to split the City from the State. Let's do this
days
City
State
0
107
Phoenix
Arizona
1
70
Las Vegas
Nevada
2
24
Riverside
California
3
17
Dallas
Texas
4
16
Austin
Texas
Load the health data
Indicator Category
Indicator
Year
Sex
Race/Ethnicity
Value
Place
BCHC Requested Methodology
Source
Methods
Notes
90% Confidence Level - Low
90% Confidence Level - High
95% Confidence Level - Low
95% Confidence Level - High
0
Behavioral Health/Substance Abuse
Opioid-Related Unintentional Drug Overdose Mor...
2010
Both
All
1.7
Washington, DC
Age-Adjusted rate of opioid-related mortality ...
D.C. Department of Health, Center for Policy, ...
NaN
This indicator is not exclusive of other drugs...
NaN
NaN
NaN
NaN
1
Behavioral Health/Substance Abuse
Opioid-Related Unintentional Drug Overdose Mor...
2010
Both
All
2.2
Fort Worth (Tarrant County), TX
Age-adjusted rate of opioid-related mortality ...
National Center for Health Statistics
NaN
This indicator is not exclusive of other drugs...
NaN
NaN
1.5
3.0
It seems that we need to do some cleaning as well in the place column. States are abbreviated. Let us split the column like we did in the hottest town data set and than change the state with the map function.
Indicator Category
Indicator
Year
Sex
Race/Ethnicity
Value
Place
BCHC Requested Methodology
Source
Methods
Notes
90% Confidence Level - Low
90% Confidence Level - High
95% Confidence Level - Low
95% Confidence Level - High
City
State
0
Behavioral Health/Substance Abuse
Opioid-Related Unintentional Drug Overdose Mor...
2010
Both
All
1.7
Washington, DC
Age-Adjusted rate of opioid-related mortality ...
D.C. Department of Health, Center for Policy, ...
NaN
This indicator is not exclusive of other drugs...
NaN
NaN
NaN
NaN
Washington
DC
1
Behavioral Health/Substance Abuse
Opioid-Related Unintentional Drug Overdose Mor...
2010
Both
All
2.2
Fort Worth (Tarrant County), TX
Age-adjusted rate of opioid-related mortality ...
National Center for Health Statistics
NaN
This indicator is not exclusive of other drugs...
NaN
NaN
1.5
3.0
Fort Worth (Tarrant County)
TX
Indicator Category
Indicator
Year
Sex
Race/Ethnicity
Value
Place
BCHC Requested Methodology
Source
Methods
Notes
90% Confidence Level - Low
90% Confidence Level - High
95% Confidence Level - Low
95% Confidence Level - High
City
State
0
Behavioral Health/Substance Abuse
Opioid-Related Unintentional Drug Overdose Mor...
2010
Both
All
1.7
Washington, DC
Age-Adjusted rate of opioid-related mortality ...
D.C. Department of Health, Center for Policy, ...
NaN
This indicator is not exclusive of other drugs...
NaN
NaN
NaN
NaN
Washington
District of Columbia
1
Behavioral Health/Substance Abuse
Opioid-Related Unintentional Drug Overdose Mor...
2010
Both
All
2.2
Fort Worth (Tarrant County), TX
Age-adjusted rate of opioid-related mortality ...
National Center for Health Statistics
NaN
This indicator is not exclusive of other drugs...
NaN
NaN
1.5
3.0
Fort Worth (Tarrant County)
Texas
Inspect data
There are quite some missing values. We might want to drop the NaN's of the Values since this is our column of interest
Let us inspect the indicator values. We were interested in adult binge drinking
The question is whether the names of the cities and the states are comparable
As we can see we have to get rid of the (additional text) at some names. We can do such using regular expressions. The regular expression \s\([a-zA-Z\s\-]+\)
matches any name starting with an opening bracket (, followed by a name including spaces or '-' followed by a closing bracket )
Lastly, we clean the States with strip()
to make sure we have matching names
Now we cleaned and matched the city and states we can use this to merge the data
Merge both datasets to create a subset of hottest towns health data
We join the datasets on City and State. Drop all the rows that have NaN in the 'Value'
Indicator Category
Indicator
Year
Sex
Race/Ethnicity
Value
Place
BCHC Requested Methodology
Source
Methods
Notes
90% Confidence Level - Low
90% Confidence Level - High
95% Confidence Level - Low
95% Confidence Level - High
City
State
days
0
Behavioral Health/Substance Abuse
Percent of Adults Who Binge Drank
2010
Both
Black
24.3
San Antonio, TX
BRFSS (or similar) How many times during the p...
2013 BRFSS Bexar County
NaN
Data includes Bexar County, TX, not just San A...
NaN
NaN
NaN
NaN
San Antonio
Texas
8
1
Behavioral Health/Substance Abuse
Percent of Adults Who Binge Drank
2010
Both
Hispanic
25.9
San Antonio, TX
BRFSS (or similar) How many times during the p...
2013 BRFSS Bexar County
NaN
Data includes Bexar County, TX, not just San A...
NaN
NaN
NaN
NaN
San Antonio
Texas
8
Create a dataset without the hottest towns
Now we need to create the second dataset without the hottest towns. We can use a trick to do such. We can create a list of towns and select all that is not in that list.
Indicator Category
Indicator
Year
Sex
Race/Ethnicity
Value
Place
BCHC Requested Methodology
Source
Methods
Notes
90% Confidence Level - Low
90% Confidence Level - High
95% Confidence Level - Low
95% Confidence Level - High
City
State
568
Behavioral Health/Substance Abuse
Percent of Adults Who Binge Drank
2010.0
Both
All
10.9
Miami (Miami-Dade County), FL
BRFSS (or similar) How many times during the p...
BRFSS
NaN
NaN
NaN
NaN
NaN
NaN
Miami
Florida
569
Behavioral Health/Substance Abuse
Percent of Adults Who Binge Drank
2010.0
Both
All
13.3
Charlotte, NC
BRFSS (or similar) How many times during the p...
2010 NC BRFSS (Mecklenburg Sample)
NaN
NaN
NaN
NaN
8.7
17.9
Charlotte
North Carolina
Visual inspection
A histogram can give insight in the distribution and whether both categories are comparable. From the following plot we can see that both distributions are comparable, but hottest town is a bit to the right if you ignore the outlier of not hottest towns
Conclusion
The hot
test town Value of Binch Drinking is significantly higher than the regular towns.
Last updated