Adults who binge drink in hot towns
A case study in combining data
Combine data to investigate a hypothesis
This case study is based on the research question of whether there is a relation between town temperature and health risks. We compare the towns with the highest temperatures in the United States against the other towns in the data. The hypothesis is that the health risk value is higher for the hottest towns than for ordinary towns; in particular we look at the risk related to 'Adult Binge Drinking'. For this we need two datasets: the list of hottest towns and the US health data (BCHI-dataset_2019-03-04.csv). The end goal is to run a statistical comparison test, but before we can do so we first need to clean, organize and merge the data. The 'Value' column, for instance, does not contain only float-type data.
from bokeh.io import output_notebook
output_notebook()
Load the first dataset: hottest towns
First we load the text file hottest_towns.txt into a dataframe. If we look at the file in an editor, we see that the first two lines are not of interest, so we skip them. Furthermore, there are no column labels, so we add these ourselves.
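If you want to check why the first two lines are skipped, a quick optional peek at the raw file helps; this is a minimal sketch, assuming the file sits at data/hottest_towns.txt as in the read_table call below.
with open('data/hottest_towns.txt') as f:
    for line in f.readlines()[:5]:  # print the first few raw lines
        print(line.rstrip())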
import pandas as pd
df = pd.read_table('data/hottest_towns.txt', skiprows=2,
                   names=['place', 'days'])
df.head()
                   place  days
0       Phoenix, Arizona   107
1      Las Vegas, Nevada    70
2  Riverside, California    24
3          Dallas, Texas    17
4          Austin, Texas    16
As we can see, we need to split the city from the state. Let's do this:
df[['City', 'State']] = df['place'].str.split(',', n=1, expand=True)
ht = df.drop('place', axis = 1) #drop column place
ht.head()
   days       City       State
0   107    Phoenix     Arizona
1    70  Las Vegas      Nevada
2    24  Riverside  California
3    17     Dallas       Texas
4    16     Austin       Texas
Load the health data
First we download the data from http://www.bigcitieshealth.org/obesity-physical-activity and inspect it with .head().
df = pd.read_csv('data/BCHI-dataset_2019-03-04.csv', low_memory=False)
df.head(2).T
                             0                                                    1
Indicator Category           Behavioral Health/Substance Abuse                    Behavioral Health/Substance Abuse
Indicator                    Opioid-Related Unintentional Drug Overdose Mor...    Opioid-Related Unintentional Drug Overdose Mor...
Year                         2010                                                 2010
Sex                          Both                                                 Both
Race/Ethnicity               All                                                  All
Value                        1.7                                                  2.2
Place                        Washington, DC                                       Fort Worth (Tarrant County), TX
BCHC Requested Methodology   Age-Adjusted rate of opioid-related mortality ...    Age-adjusted rate of opioid-related mortality ...
Source                       D.C. Department of Health, Center for Policy, ...    National Center for Health Statistics
Methods                      NaN                                                  NaN
Notes                        This indicator is not exclusive of other drugs...    This indicator is not exclusive of other drugs...
90% Confidence Level - Low   NaN                                                  NaN
90% Confidence Level - High  NaN                                                  NaN
95% Confidence Level - Low   NaN                                                  1.5
95% Confidence Level - High  NaN                                                  3.0
It seems that we need to do some cleaning in the Place column as well: the states are abbreviated. Let us split the column as we did for the hottest-towns dataset and then translate the abbreviations with the map function.
df[['City', 'State']] = df['Place'].str.split(',', n=1, expand=True)
df.head(2).T
                             0                                                    1
Indicator Category           Behavioral Health/Substance Abuse                    Behavioral Health/Substance Abuse
Indicator                    Opioid-Related Unintentional Drug Overdose Mor...    Opioid-Related Unintentional Drug Overdose Mor...
Year                         2010                                                 2010
Sex                          Both                                                 Both
Race/Ethnicity               All                                                  All
Value                        1.7                                                  2.2
Place                        Washington, DC                                       Fort Worth (Tarrant County), TX
BCHC Requested Methodology   Age-Adjusted rate of opioid-related mortality ...    Age-adjusted rate of opioid-related mortality ...
Source                       D.C. Department of Health, Center for Policy, ...    National Center for Health Statistics
Methods                      NaN                                                  NaN
Notes                        This indicator is not exclusive of other drugs...    This indicator is not exclusive of other drugs...
90% Confidence Level - Low   NaN                                                  NaN
90% Confidence Level - High  NaN                                                  NaN
95% Confidence Level - Low   NaN                                                  1.5
95% Confidence Level - High  NaN                                                  3.0
City                         Washington                                           Fort Worth (Tarrant County)
State                        DC                                                   TX
states = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada',
'WY': 'Wyoming', 'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland',
'AK': 'Alaska', 'UT': 'Utah', 'OR': 'Oregon', 'MT': 'Montana', 'IL': 'Illinois',
'TN': 'Tennessee', 'DC': 'District of Columbia', 'VT': 'Vermont', 'ID': 'Idaho',
'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii', 'WI': 'Wisconsin',
'MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 'AZ': 'Arizona', 'GU': 'Guam',
'MS': 'Mississippi', 'PR': 'Puerto Rico', 'NC': 'North Carolina', 'TX': 'Texas',
'SD': 'South Dakota', 'MP': 'Northern Mariana Islands', 'IA': 'Iowa', 'MO': 'Missouri',
'CT': 'Connecticut', 'WV': 'West Virginia', 'SC': 'South Carolina', 'LA': 'Louisiana',
'KS': 'Kansas', 'NY': 'New York', 'NE': 'Nebraska', 'OK': 'Oklahoma', 'FL': 'Florida',
'CA': 'California', 'CO': 'Colorado', 'PA': 'Pennsylvania', 'DE': 'Delaware',
'NM': 'New Mexico', 'RI': 'Rhode Island', 'MN': 'Minnesota', 'VI': 'Virgin Islands',
'NH': 'New Hampshire', 'MA': 'Massachusetts', 'GA': 'Georgia', 'ND': 'North Dakota', 'VA': 'Virginia'}
df['State'] = df['State'].str.strip().map(states)
df.head(2).T
                             0                                                    1
Indicator Category           Behavioral Health/Substance Abuse                    Behavioral Health/Substance Abuse
Indicator                    Opioid-Related Unintentional Drug Overdose Mor...    Opioid-Related Unintentional Drug Overdose Mor...
Year                         2010                                                 2010
Sex                          Both                                                 Both
Race/Ethnicity               All                                                  All
Value                        1.7                                                  2.2
Place                        Washington, DC                                       Fort Worth (Tarrant County), TX
BCHC Requested Methodology   Age-Adjusted rate of opioid-related mortality ...    Age-adjusted rate of opioid-related mortality ...
Source                       D.C. Department of Health, Center for Policy, ...    National Center for Health Statistics
Methods                      NaN                                                  NaN
Notes                        This indicator is not exclusive of other drugs...    This indicator is not exclusive of other drugs...
90% Confidence Level - Low   NaN                                                  NaN
90% Confidence Level - High  NaN                                                  NaN
95% Confidence Level - Low   NaN                                                  1.5
95% Confidence Level - High  NaN                                                  3.0
City                         Washington                                           Fort Worth (Tarrant County)
State                        District of Columbia                                 Texas
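Any abbreviation that is missing from the dictionary, and any Place without a comma (for example 'U.S. Total'), ends up as NaN in the State column after the map. A quick optional check of which places are affected:
# Places whose State could not be derived or mapped
print(df.loc[df['State'].isna(), 'Place'].unique())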
Inspect data
There are quite a few missing values. We drop the rows with NaN in 'Value', since this is our column of interest.
df.isnull().sum()
Indicator Category 0
Indicator 0
Year 0
Sex 0
Race/Ethnicity 0
Value 3444
Place 0
BCHC Requested Methodology 0
Source 4964
Methods 26968
Notes 19851
90% Confidence Level - Low 31655
90% Confidence Level - High 31655
95% Confidence Level - Low 25657
95% Confidence Level - High 25616
City 0
State 2154
dtype: int64
df = df.dropna(subset = ['Value'])
df.isnull().sum()
Indicator Category 0
Indicator 0
Year 0
Sex 0
Race/Ethnicity 0
Value 0
Place 0
BCHC Requested Methodology 0
Source 4133
Methods 24226
Notes 19850
90% Confidence Level - Low 28214
90% Confidence Level - High 28214
95% Confidence Level - Low 22304
95% Confidence Level - High 22263
City 0
State 2154
dtype: int64
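As mentioned in the introduction, 'Value' does not contain only float-type data. If needed, one way to force it to a numeric dtype is an explicit coercion; this is a minimal sketch, where any entry that cannot be parsed becomes NaN and is dropped in the same way as above.
# Coerce Value to numeric; unparseable entries become NaN and are dropped
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')
df = df.dropna(subset=['Value'])
print(df['Value'].dtype)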
Let us inspect the indicator values; we are interested in adult binge drinking.
set(df['Indicator'])
{'AIDS Diagnoses Rate (Per 100,000 people)',
'All Types of Cancer Mortality Rate (Age-Adjusted; Per 100,000 people)',
'All-Cause Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Asthma Emergency Department Visit Rate (Age-Adjusted; Per 10,000)',
'Bike Score',
'Chlamydia Rate (Per 100,000 People)',
'Congenital Syphilis Rate (Per 100,000 Live Births)',
'Diabetes Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Female Breast Cancer Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Firearm-Related Emergency Department Visit Rate (Age-Adjusted; Per 10,000 people)',
'Firearm-Related Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Gonorrhea Rate (Per 100,000 People)',
'HIV Diagnoses Rate (Per 100,000 people)',
'HIV-Related Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Heart Disease Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Homicide Rate (Age-Adjusted; Per 100,000 people)',
'Infant Mortality Rate (Per 1,000 live births)',
'Life Expectancy at Birth (Years)',
'Lung Cancer Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Median Household Income (Dollars)',
'Motor Vehicle Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Opioid-Related Unintentional Drug Overdose Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Percent Foreign Born',
'Percent Living Below 200% Poverty Level',
'Percent Unemployed',
'Percent Who Only Speak English at Home',
'Percent Who Speak Spanish at Home',
'Percent of 3 and 4 Year Olds Currently Enrolled in Preschool',
'Percent of Adults 65 and Over Who Received Pneumonia Vaccine',
'Percent of Adults Who Are Obese',
'Percent of Adults Who Binge Drank',
'Percent of Adults Who Currently Smoke',
'Percent of Adults Who Meet CDC-Recommended Physical Activity Levels',
'Percent of Adults Who Received Seasonal Flu Shot',
'Percent of Children (Tested) Under Age 6 with Elevated Blood Lead Levels',
'Percent of Children Living in Poverty',
'Percent of Children Who Received Seasonal Flu Shot',
'Percent of High School Graduates (Over Age 18)',
'Percent of High School Students Who Are Obese',
'Percent of High School Students Who Binge Drank',
'Percent of High School Students Who Currently Smoke',
'Percent of High School Students Who Meet CDC-Recommended Physical Activity Levels',
'Percent of Households Whose Housing Costs Exceed 35% of Income',
'Percent of Low Birth Weight Babies Born',
'Percent of Mothers Under Age 20',
'Percent of Population 65 and Over',
'Percent of Population Under 18',
'Percent of Population Uninsured',
'Percent of Population with a Disability',
'Persons Living with HIV/AIDS Rate (Per 100,000 people)',
'Pneumonia and Influenza Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Primary and Secondary Syphilis Rate (Per 100,000 People)',
'Race/Ethnicity (Percent)',
'Rate of Laboratory Confirmed Infections Caused by Salmonella (Per 100,000 people)',
'Rate of Laboratory Confirmed Infections Caused by Shiga Toxin-Producing E-Coli (Per 100,000 people)',
'Sex (Percent)',
'Suicide Rate (Age-Adjusted; Per 100,000 people)',
'Total Population (People)',
'Transit Score',
'Tuberculosis Incidence Rate (Per 100,000 people)',
'Walkability'}
df = df[df['Indicator'] == 'Percent of Adults Who Binge Drank']
len(df)
535
The question is whether the city and state names in the two datasets are comparable.
print(sorted(set(df['City'])))
print("\n")
print(sorted(set(ht['City'])))
['Baltimore', 'Boston', 'Charlotte', 'Chicago',
'Columbus', 'Denver', 'Detroit', 'Fort Worth (Tarrant County)',
'Houston', 'Indianapolis (Marion County)', 'Kansas City',
'Las Vegas (Clark County)', 'Long Beach', 'Los Angeles',
'Miami (Miami-Dade County)', 'Minneapolis', 'New York City',
'Oakland (Alameda County)', 'Philadelphia', 'Phoenix',
'Portland (Multnomah County)', 'San Antonio',
'San Diego County', 'San Jose', 'Seattle', 'U.S. Total',
'Washington']
['Austin', 'Dallas', 'Houston', 'Kansas City', 'Las Vegas',
'Oklahoma City', 'Phoenix', 'Riverside', 'Sacramento',
'Salt Lake City', 'San Antonio']
df = df.dropna(subset = ['State'])
print((sorted(set(df['State']))))
print("\n")
print((sorted(set(ht['State']))))
['Arizona', 'California', 'Colorado', 'District of Columbia',
'Florida', 'Indiana', 'Maryland', 'Massachusetts', 'Michigan',
'Minnesota', 'Missouri', 'Nevada', 'New York', 'North Carolina',
'Ohio', 'Oregon', 'Pennsylvania', 'Texas', 'Washington']
[' Arizona ', ' California ', ' Missouri ', ' Nevada ',
' Oklahoma ', ' Texas ', ' Utah ']
As we can see, we have to get rid of the additional text in parentheses in some of the city names. We can do this with a regular expression. The pattern \s\([a-zA-Z\s\-]+\) matches a space followed by an opening parenthesis (, then a name that may contain letters, spaces or '-', followed by a closing parenthesis ).
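To see what the pattern does, here is a small standalone check with Python's re module, using one of the city names from the listing above.
import re
pattern = r"\s\([a-zA-Z\s\-]+\)"
# ' (Tarrant County)' matches the pattern and is removed
print(re.sub(pattern, "", "Fort Worth (Tarrant County)"))  # Fort Worth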
df['City'] = df['City'].str.replace(r"\s\([a-zA-Z\s\-]+\)", "", regex=True)
print((sorted(set(df['City']))))
['Baltimore', 'Boston', 'Charlotte', 'Columbus', 'Denver', 'Detroit', 'Fort Worth', 'Houston', 'Indianapolis', 'Kansas City', 'Las Vegas', 'Long Beach', 'Los Angeles', 'Miami', 'Minneapolis', 'New York City', 'Oakland', 'Philadelphia', 'Phoenix', 'Portland', 'San Antonio', 'San Diego County', 'San Jose', 'Seattle', 'Washington']
Lastly, we clean the State columns with strip() to make sure we have matching names.
df['State'] = df['State'].str.strip()
ht['State'] = ht['State'].str.strip()
Now that we have cleaned and matched the city and state names, we can merge the data.
Merge both datasets to create a subset of hottest towns health data
We join the datasets on City and State with an inner merge, so only the hottest towns that also appear in the health data are kept.
hh = pd.merge(df, ht, how='inner', left_on=['City', 'State'],
right_on=['City', 'State'])
hh.head(2).T
                             0                                                    1
Indicator Category           Behavioral Health/Substance Abuse                    Behavioral Health/Substance Abuse
Indicator                    Percent of Adults Who Binge Drank                    Percent of Adults Who Binge Drank
Year                         2010                                                 2010
Sex                          Both                                                 Both
Race/Ethnicity               Black                                                Hispanic
Value                        24.3                                                 25.9
Place                        San Antonio, TX                                      San Antonio, TX
BCHC Requested Methodology   BRFSS (or similar) How many times during the p...    BRFSS (or similar) How many times during the p...
Source                       2013 BRFSS Bexar County                              2013 BRFSS Bexar County
Methods                      NaN                                                  NaN
Notes                        Data includes Bexar County, TX, not just San A...    Data includes Bexar County, TX, not just San A...
90% Confidence Level - Low   NaN                                                  NaN
90% Confidence Level - High  NaN                                                  NaN
95% Confidence Level - Low   NaN                                                  NaN
95% Confidence Level - High  NaN                                                  NaN
City                         San Antonio                                          San Antonio
State                        Texas                                                Texas
days                         8                                                    8
Create a dataset without the hottest towns
Now we need to create the second dataset: the towns that are not among the hottest. A simple trick is to build a list of the hottest town names and select every row whose city is not in that list. Note that .where() does not remove the excluded rows but replaces their values with NaN (which is why Year shows up as 2010.0 below); we drop those NaN values before the comparison.
x = ht['City'].tolist()
nh = df.where(~df['City'].isin(x))
nh.head(2).T
                             568                                                  569
Indicator Category           Behavioral Health/Substance Abuse                    Behavioral Health/Substance Abuse
Indicator                    Percent of Adults Who Binge Drank                    Percent of Adults Who Binge Drank
Year                         2010.0                                               2010.0
Sex                          Both                                                 Both
Race/Ethnicity               All                                                  All
Value                        10.9                                                 13.3
Place                        Miami (Miami-Dade County), FL                        Charlotte, NC
BCHC Requested Methodology   BRFSS (or similar) How many times during the p...    BRFSS (or similar) How many times during the p...
Source                       BRFSS                                                2010 NC BRFSS (Mecklenburg Sample)
Methods                      NaN                                                  NaN
Notes                        NaN                                                  NaN
90% Confidence Level - Low   NaN                                                  NaN
90% Confidence Level - High  NaN                                                  NaN
95% Confidence Level - Low   NaN                                                  8.7
95% Confidence Level - High  NaN                                                  17.9
City                         Miami                                                Charlotte
State                        Florida                                              North Carolina
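As a side note, an alternative to .where() is plain boolean indexing, which removes the hottest-town rows outright instead of masking them with NaN. A minimal sketch using the same x list as above; the name nh_alt is only illustrative and is not used further.
# Illustrative alternative: drop the hottest-town rows instead of masking them
nh_alt = df[~df['City'].isin(x)]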
y_hot = hh['Value'].dropna()   # values for the hottest towns
y_nhot = nh['Value'].dropna()  # values for the other (not hottest) towns
y_hot.describe()
count    103.000000
mean      18.218447
std       11.350605
min        6.300000
25%       12.800000
50%       15.300000
75%       21.150000
max       87.500000
Name: Value, dtype: float64
y_nhot.describe()
count    392.000000
mean      19.822194
std        7.395218
min        1.700000
25%       14.600000
50%       18.700000
75%       24.000000
max       54.000000
Name: Value, dtype: float64
Visual inspection
A histogram can give insight into the distributions and whether both categories are comparable. From the following plot we can see that the two distributions largely overlap; the other (not hottest) towns sit slightly to the right, while the hottest-town data contain one large outlier.
import numpy as np
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show
def make_plot(p, data, color, label):
    # Overlay a density histogram of `data` on the figure `p`
    hist, edges = np.histogram(data, density=True, bins=50)
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
           fill_color=color, line_color="white", alpha=0.5, legend_label=label)
    p.y_range.start = 0
    p.legend.location = "center_right"
    p.legend.background_fill_color = "#fefefe"
    p.xaxis.axis_label = 'Percent of Adults Who Binge Drank'
    p.yaxis.axis_label = 'Probability'
    p.grid.grid_line_color = "white"
    return p

# Overlay the histograms of both groups on one figure
p = figure(title="Distribution", tools='', background_fill_color="#fafafa")
p = make_plot(p, y_hot, 'red', 'hottest towns')
p = make_plot(p, y_nhot, 'green', 'normal towns')
show(p)

from scipy.stats import ttest_ind
ttest_ind(np.array(y_hot), np.array(y_nhot), equal_var = False)
Ttest_indResult(statistic=-1.3601091999238046, pvalue=0.17623132973147082)
from scipy.stats import kruskal
kruskal(np.array(y_hot), np.array(y_nhot))
KruskalResult(statistic=14.72621798981739, pvalue=0.0001243056464991998)
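The Kruskal-Wallis test only tells us that the two distributions differ, not in which direction, so it helps to look at the medians directly (they match the describe() output above).
# Median percent of adults who binge drank per group
print(y_hot.median(), y_nhot.median())  # 15.3 (hottest towns) vs 18.7 (other towns)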
Conclusion
According to the Kruskal-Wallis test, the adult binge drinking values of the hottest towns differ significantly from those of the regular towns (p ≈ 0.0001), although Welch's t-test does not reach significance (p ≈ 0.18). The hottest towns actually show a somewhat lower median value than the other towns, so these data do not support the hypothesis that adult binge drinking is higher in the hottest towns.