Adults who binge drink in hot towns
A case study in combining data
Combine data to investigate a hypothesis
This case study is based on the research question of whether there is a relation between town temperature and health risks. We compare the towns with the highest temperatures in the United States against the other towns in the data. The hypothesis is that the health risk value is higher for the hottest towns than for ordinary towns; in particular we look at the risk related to 'Adult Binge Drinking'. For this we need two datasets: the list of hottest towns and the US health data (BCHI-dataset_2019-03-04.csv). The end goal is to run a statistical comparison test, but before we can do so we first need to clean, organize and merge the data. The 'Value' column, for instance, does not contain only float-type data.
from bokeh.io import output_notebook
output_notebook()
Load the first dataset: hottest towns
First we load the text file hottest_towns.txt into a dataframe. If we look at the file in an editor, we see that the first two lines are not of interest, so we skip them. Furthermore, there are no column labels, so we add these ourselves.
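If you want to check why the first two lines are skipped, a quick optional peek at the raw file helps; this is a minimal sketch, assuming the file sits at data/hottest_towns.txt as in the read_table call below.
with open('data/hottest_towns.txt') as f:
    for line in f.readlines()[:5]:  # print the first few raw lines
        print(line.rstrip())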
import pandas as pd
df = pd.read_table('data/hottest_towns.txt', skiprows=2,
                   names=['place', 'days'])
df.head()
                   place  days
0       Phoenix, Arizona   107
1      Las Vegas, Nevada    70
2  Riverside, California    24
3          Dallas, Texas    17
4          Austin, Texas    16
As we can see, we need to split the city from the state. Let's do this:
df[['City', 'State']] = df['place'].str.split(',', n=1, expand=True)
ht = df.drop('place', axis = 1) #drop column place
ht.head()
   days       City       State
0   107    Phoenix     Arizona
1    70  Las Vegas      Nevada
2    24  Riverside  California
3    17     Dallas       Texas
4    16     Austin       Texas
Load the health data
First we download the data from http://www.bigcitieshealth.org/obesity-physical-activity and inspect it with .head().
df = pd.read_csv('data/BCHI-dataset_2019-03-04.csv', low_memory=False)
df.head(2).T
                             0                                                    1
Indicator Category           Behavioral Health/Substance Abuse                    Behavioral Health/Substance Abuse
Indicator                    Opioid-Related Unintentional Drug Overdose Mor...    Opioid-Related Unintentional Drug Overdose Mor...
Year                         2010                                                 2010
Sex                          Both                                                 Both
Race/Ethnicity               All                                                  All
Value                        1.7                                                  2.2
Place                        Washington, DC                                       Fort Worth (Tarrant County), TX
BCHC Requested Methodology   Age-Adjusted rate of opioid-related mortality ...    Age-adjusted rate of opioid-related mortality ...
Source                       D.C. Department of Health, Center for Policy, ...    National Center for Health Statistics
Methods                      NaN                                                  NaN
Notes                        This indicator is not exclusive of other drugs...    This indicator is not exclusive of other drugs...
90% Confidence Level - Low   NaN                                                  NaN
90% Confidence Level - High  NaN                                                  NaN
95% Confidence Level - Low   NaN                                                  1.5
95% Confidence Level - High  NaN                                                  3.0
It seems that we need to do some cleaning in the Place column as well: the states are abbreviated. Let us split the column as we did for the hottest-towns dataset and then translate the abbreviations with the map function.
df[['City', 'State']] = df['Place'].str.split(',', n=1, expand=True)
df.head(2).T
                             0                                                    1
Indicator Category           Behavioral Health/Substance Abuse                    Behavioral Health/Substance Abuse
Indicator                    Opioid-Related Unintentional Drug Overdose Mor...    Opioid-Related Unintentional Drug Overdose Mor...
Year                         2010                                                 2010
Sex                          Both                                                 Both
Race/Ethnicity               All                                                  All
Value                        1.7                                                  2.2
Place                        Washington, DC                                       Fort Worth (Tarrant County), TX
BCHC Requested Methodology   Age-Adjusted rate of opioid-related mortality ...    Age-adjusted rate of opioid-related mortality ...
Source                       D.C. Department of Health, Center for Policy, ...    National Center for Health Statistics
Methods                      NaN                                                  NaN
Notes                        This indicator is not exclusive of other drugs...    This indicator is not exclusive of other drugs...
90% Confidence Level - Low   NaN                                                  NaN
90% Confidence Level - High  NaN                                                  NaN
95% Confidence Level - Low   NaN                                                  1.5
95% Confidence Level - High  NaN                                                  3.0
City                         Washington                                           Fort Worth (Tarrant County)
State                        DC                                                   TX
states = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada',
'WY': 'Wyoming', 'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland',
'AK': 'Alaska', 'UT': 'Utah', 'OR': 'Oregon', 'MT': 'Montana', 'IL': 'Illinois',
'TN': 'Tennessee', 'DC': 'District of Columbia', 'VT': 'Vermont', 'ID': 'Idaho',
'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii', 'WI': 'Wisconsin',
'MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 'AZ': 'Arizona', 'GU': 'Guam',
'MS': 'Mississippi', 'PR': 'Puerto Rico', 'NC': 'North Carolina', 'TX': 'Texas',
'SD': 'South Dakota', 'MP': 'Northern Mariana Islands', 'IA': 'Iowa', 'MO': 'Missouri',
'CT': 'Connecticut', 'WV': 'West Virginia', 'SC': 'South Carolina', 'LA': 'Louisiana',
'KS': 'Kansas', 'NY': 'New York', 'NE': 'Nebraska', 'OK': 'Oklahoma', 'FL': 'Florida',
'CA': 'California', 'CO': 'Colorado', 'PA': 'Pennsylvania', 'DE': 'Delaware',
'NM': 'New Mexico', 'RI': 'Rhode Island', 'MN': 'Minnesota', 'VI': 'Virgin Islands',
'NH': 'New Hampshire', 'MA': 'Massachusetts', 'GA': 'Georgia', 'ND': 'North Dakota', 'VA': 'Virginia'}
df['State'] = df['State'].str.strip().map(states)
df.head(2).T
                             0                                                    1
Indicator Category           Behavioral Health/Substance Abuse                    Behavioral Health/Substance Abuse
Indicator                    Opioid-Related Unintentional Drug Overdose Mor...    Opioid-Related Unintentional Drug Overdose Mor...
Year                         2010                                                 2010
Sex                          Both                                                 Both
Race/Ethnicity               All                                                  All
Value                        1.7                                                  2.2
Place                        Washington, DC                                       Fort Worth (Tarrant County), TX
BCHC Requested Methodology   Age-Adjusted rate of opioid-related mortality ...    Age-adjusted rate of opioid-related mortality ...
Source                       D.C. Department of Health, Center for Policy, ...    National Center for Health Statistics
Methods                      NaN                                                  NaN
Notes                        This indicator is not exclusive of other drugs...    This indicator is not exclusive of other drugs...
90% Confidence Level - Low   NaN                                                  NaN
90% Confidence Level - High  NaN                                                  NaN
95% Confidence Level - Low   NaN                                                  1.5
95% Confidence Level - High  NaN                                                  3.0
City                         Washington                                           Fort Worth (Tarrant County)
State                        District of Columbia                                 Texas
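Any abbreviation that is missing from the dictionary, and any Place without a comma (for example 'U.S. Total'), ends up as NaN in the State column after the map. A quick optional check of which places are affected:
# Places whose State could not be derived or mapped
print(df.loc[df['State'].isna(), 'Place'].unique())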
Inspect data
There are quite a few missing values. We drop the rows with NaN in 'Value', since this is our column of interest.
df.isnull().sum()
Indicator Category 0
Indicator 0
Year 0
Sex 0
Race/Ethnicity 0
Value 3444
Place 0
BCHC Requested Methodology 0
Source 4964
Methods 26968
Notes 19851
90% Confidence Level - Low 31655
90% Confidence Level - High 31655
95% Confidence Level - Low 25657
95% Confidence Level - High 25616
City 0
State 2154
dtype: int64
df = df.dropna(subset = ['Value'])
df.isnull().sum()
Indicator Category 0
Indicator 0
Year 0
Sex 0
Race/Ethnicity 0
Value 0
Place 0
BCHC Requested Methodology 0
Source 4133
Methods 24226
Notes 19850
90% Confidence Level - Low 28214
90% Confidence Level - High 28214
95% Confidence Level - Low 22304
95% Confidence Level - High 22263
City 0
State 2154
dtype: int64
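As mentioned in the introduction, 'Value' does not contain only float-type data. If needed, one way to force it to a numeric dtype is an explicit coercion; this is a minimal sketch, where any entry that cannot be parsed becomes NaN and is dropped in the same way as above.
# Coerce Value to numeric; unparseable entries become NaN and are dropped
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')
df = df.dropna(subset=['Value'])
print(df['Value'].dtype)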
Let us inspect the indicator values; we are interested in adult binge drinking.
set(df['Indicator'])
{'AIDS Diagnoses Rate (Per 100,000 people)',
'All Types of Cancer Mortality Rate (Age-Adjusted; Per 100,000 people)',
'All-Cause Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Asthma Emergency Department Visit Rate (Age-Adjusted; Per 10,000)',
'Bike Score',
'Chlamydia Rate (Per 100,000 People)',
'Congenital Syphilis Rate (Per 100,000 Live Births)',
'Diabetes Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Female Breast Cancer Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Firearm-Related Emergency Department Visit Rate (Age-Adjusted; Per 10,000 people)',
'Firearm-Related Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Gonorrhea Rate (Per 100,000 People)',
'HIV Diagnoses Rate (Per 100,000 people)',
'HIV-Related Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Heart Disease Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Homicide Rate (Age-Adjusted; Per 100,000 people)',
'Infant Mortality Rate (Per 1,000 live births)',
'Life Expectancy at Birth (Years)',
'Lung Cancer Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Median Household Income (Dollars)',
'Motor Vehicle Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Opioid-Related Unintentional Drug Overdose Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Percent Foreign Born',
'Percent Living Below 200% Poverty Level',
'Percent Unemployed',
'Percent Who Only Speak English at Home',
'Percent Who Speak Spanish at Home',
'Percent of 3 and 4 Year Olds Currently Enrolled in Preschool',
'Percent of Adults 65 and Over Who Received Pneumonia Vaccine',
'Percent of Adults Who Are Obese',
'Percent of Adults Who Binge Drank',
'Percent of Adults Who Currently Smoke',
'Percent of Adults Who Meet CDC-Recommended Physical Activity Levels',
'Percent of Adults Who Received Seasonal Flu Shot',
'Percent of Children (Tested) Under Age 6 with Elevated Blood Lead Levels',
'Percent of Children Living in Poverty',
'Percent of Children Who Received Seasonal Flu Shot',
'Percent of High School Graduates (Over Age 18)',
'Percent of High School Students Who Are Obese',
'Percent of High School Students Who Binge Drank',
'Percent of High School Students Who Currently Smoke',
'Percent of High School Students Who Meet CDC-Recommended Physical Activity Levels',
'Percent of Households Whose Housing Costs Exceed 35% of Income',
'Percent of Low Birth Weight Babies Born',
'Percent of Mothers Under Age 20',
'Percent of Population 65 and Over',
'Percent of Population Under 18',
'Percent of Population Uninsured',
'Percent of Population with a Disability',
'Persons Living with HIV/AIDS Rate (Per 100,000 people)',
'Pneumonia and Influenza Mortality Rate (Age-Adjusted; Per 100,000 people)',
'Primary and Secondary Syphilis Rate (Per 100,000 People)',
'Race/Ethnicity (Percent)',
'Rate of Laboratory Confirmed Infections Caused by Salmonella (Per 100,000 people)',
'Rate of Laboratory Confirmed Infections Caused by Shiga Toxin-Producing E-Coli (Per 100,000 people)',
'Sex (Percent)',
'Suicide Rate (Age-Adjusted; Per 100,000 people)',
'Total Population (People)',
'Transit Score',
'Tuberculosis Incidence Rate (Per 100,000 people)',
'Walkability'}
df = df[df['Indicator'] == 'Percent of Adults Who Binge Drank']
len(df)
535
The question is whether the city and state names in the two datasets are comparable.
print(sorted(set(df['City'])))
print("\n")
print(sorted(set(ht['City'])))
['Baltimore', 'Boston', 'Charlotte', 'Chicago',
'Columbus', 'Denver', 'Detroit', 'Fort Worth (Tarrant County)',
'Houston', 'Indianapolis (Marion County)', 'Kansas City',
'Las Vegas (Clark County)', 'Long Beach', 'Los Angeles',
'Miami (Miami-Dade County)', 'Minneapolis', 'New York City',
'Oakland (Alameda County)', 'Philadelphia', 'Phoenix',
'Portland (Multnomah County)', 'San Antonio',
'San Diego County', 'San Jose', 'Seattle', 'U.S. Total',
'Washington']
['Austin', 'Dallas', 'Houston', 'Kansas City', 'Las Vegas',
'Oklahoma City', 'Phoenix', 'Riverside', 'Sacramento',
'Salt Lake City', 'San Antonio']
df = df.dropna(subset = ['State'])
print((sorted(set(df['State']))))
print("\n")
print((sorted(set(ht['State']))))
['Arizona', 'California', 'Colorado', 'District of Columbia',
'Florida', 'Indiana', 'Maryland', 'Massachusetts', 'Michigan',
'Minnesota', 'Missouri', 'Nevada', 'New York', 'North Carolina',
'Ohio', 'Oregon', 'Pennsylvania', 'Texas', 'Washington']
[' Arizona ', ' California ', ' Missouri ', ' Nevada ',
' Oklahoma ', ' Texas ', ' Utah ']
As we can see, we have to get rid of the additional text in parentheses in some of the city names. We can do this with a regular expression. The pattern \s\([a-zA-Z\s\-]+\) matches a space followed by an opening parenthesis (, then a name that may contain letters, spaces or '-', followed by a closing parenthesis ).
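To see what the pattern does, here is a small standalone check with Python's re module, using one of the city names from the listing above.
import re
pattern = r"\s\([a-zA-Z\s\-]+\)"
# ' (Tarrant County)' matches the pattern and is removed
print(re.sub(pattern, "", "Fort Worth (Tarrant County)"))  # Fort Worth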
df['City'] = df['City'].str.replace(r"\s\([a-zA-Z\s\-]+\)", "", regex=True)
print((sorted(set(df['City']))))
['Baltimore', 'Boston', 'Charlotte', 'Columbus', 'Denver', 'Detroit', 'Fort Worth', 'Houston', 'Indianapolis', 'Kansas City', 'Las Vegas', 'Long Beach', 'Los Angeles', 'Miami', 'Minneapolis', 'New York City', 'Oakland', 'Philadelphia', 'Phoenix', 'Portland', 'San Antonio', 'San Diego County', 'San Jose', 'Seattle', 'Washington']
Lastly, we clean the State columns with strip() to make sure we have matching names.
df['State'] = df['State'].str.strip()
ht['State'] = ht['State'].str.strip()
Now that we have cleaned and matched the city and state names, we can merge the data.
Merge both datasets to create a subset of hottest towns health data
We join the datasets on City and State with an inner merge, so only the hottest towns that also appear in the health data are kept.
hh = pd.merge(df, ht, how='inner', left_on=['City', 'State'],
right_on=['City', 'State'])
hh.head(2).T
                             0                                                    1
Indicator Category           Behavioral Health/Substance Abuse                    Behavioral Health/Substance Abuse
Indicator                    Percent of Adults Who Binge Drank                    Percent of Adults Who Binge Drank
Year                         2010                                                 2010
Sex                          Both                                                 Both
Race/Ethnicity               Black                                                Hispanic
Value                        24.3                                                 25.9
Place                        San Antonio, TX                                      San Antonio, TX
BCHC Requested Methodology   BRFSS (or similar) How many times during the p...    BRFSS (or similar) How many times during the p...
Source                       2013 BRFSS Bexar County                              2013 BRFSS Bexar County
Methods                      NaN                                                  NaN
Notes                        Data includes Bexar County, TX, not just San A...    Data includes Bexar County, TX, not just San A...
90% Confidence Level - Low   NaN                                                  NaN
90% Confidence Level - High  NaN                                                  NaN
95% Confidence Level - Low   NaN                                                  NaN
95% Confidence Level - High  NaN                                                  NaN
City                         San Antonio                                          San Antonio
State                        Texas                                                Texas
days                         8                                                    8
Create a dataset without the hottest towns
Now we need to create the second dataset: the towns that are not among the hottest. A simple trick is to build a list of the hottest town names and select every row whose city is not in that list. Note that .where() does not remove the excluded rows but replaces their values with NaN (which is why Year shows up as 2010.0 below); we drop those NaN values before the comparison.
x = ht['City'].tolist()
nh = df.where(~df['City'].isin(x))
nh.head(2).T
                             568                                                  569
Indicator Category           Behavioral Health/Substance Abuse                    Behavioral Health/Substance Abuse
Indicator                    Percent of Adults Who Binge Drank                    Percent of Adults Who Binge Drank
Year                         2010.0                                               2010.0
Sex                          Both                                                 Both
Race/Ethnicity               All                                                  All
Value                        10.9                                                 13.3
Place                        Miami (Miami-Dade County), FL                        Charlotte, NC
BCHC Requested Methodology   BRFSS (or similar) How many times during the p...    BRFSS (or similar) How many times during the p...
Source                       BRFSS                                                2010 NC BRFSS (Mecklenburg Sample)
Methods                      NaN                                                  NaN
Notes                        NaN                                                  NaN
90% Confidence Level - Low   NaN                                                  NaN
90% Confidence Level - High  NaN                                                  NaN
95% Confidence Level - Low   NaN                                                  8.7
95% Confidence Level - High  NaN                                                  17.9
City                         Miami                                                Charlotte
State                        Florida                                              North Carolina
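As a side note, an alternative to .where() is plain boolean indexing, which removes the hottest-town rows outright instead of masking them with NaN. A minimal sketch using the same x list as above; the name nh_alt is only illustrative and is not used further.
# Illustrative alternative: drop the hottest-town rows instead of masking them
nh_alt = df[~df['City'].isin(x)]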
y_hot = hh['Value'].dropna()   # values for the hottest towns
y_nhot = nh['Value'].dropna()  # values for the other (not hottest) towns
y_hot.describe()
count    103.000000
mean      18.218447
std       11.350605
min        6.300000
25%       12.800000
50%       15.300000
75%       21.150000
max       87.500000
Name: Value, dtype: float64
y_nhot.describe()
count    392.000000
mean      19.822194
std        7.395218
min        1.700000
25%       14.600000
50%       18.700000
75%       24.000000
max       54.000000
Name: Value, dtype: float64
Visual inspection
A histogram can give insight into the distributions and whether both categories are comparable. From the following plot we can see that the two distributions largely overlap; the other (not hottest) towns sit slightly to the right, while the hottest-town data contain one large outlier.
import numpy as np
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show
def make_plot(p, data, color, label):
    # Overlay a density histogram of `data` on the figure `p`
    hist, edges = np.histogram(data, density=True, bins=50)
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
           fill_color=color, line_color="white", alpha=0.5, legend_label=label)
    p.y_range.start = 0
    p.legend.location = "center_right"
    p.legend.background_fill_color = "#fefefe"
    p.xaxis.axis_label = 'Percent of Adults Who Binge Drank'
    p.yaxis.axis_label = 'Probability'
    p.grid.grid_line_color = "white"
    return p

# Overlay the histograms of both groups on one figure
p = figure(title="Distribution", tools='', background_fill_color="#fafafa")
p = make_plot(p, y_hot, 'red', 'hottest towns')
p = make_plot(p, y_nhot, 'green', 'normal towns')
show(p)

from scipy.stats import ttest_ind
ttest_ind(np.array(y_hot), np.array(y_nhot), equal_var = False)
Ttest_indResult(statistic=-1.3601091999238046, pvalue=0.17623132973147082)
from scipy.stats import kruskal
kruskal(np.array(y_hot), np.array(y_nhot))
KruskalResult(statistic=14.72621798981739, pvalue=0.0001243056464991998)
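The Kruskal-Wallis test only tells us that the two distributions differ, not in which direction, so it helps to look at the medians directly (they match the describe() output above).
# Median percent of adults who binge drank per group
print(y_hot.median(), y_nhot.median())  # 15.3 (hottest towns) vs 18.7 (other towns)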
Conclusion
According to the Kruskal-Wallis test, the adult binge drinking values of the hottest towns differ significantly from those of the regular towns (p ≈ 0.0001), although Welch's t-test does not reach significance (p ≈ 0.18). The hottest towns actually show a somewhat lower median value than the other towns, so these data do not support the hypothesis that adult binge drinking is higher in the hottest towns.