USDA food database (JSON)

Case study based on a USDA food database wrangled into into Json format by Ashley Williams.

JSON Data

JSON has become one of the standards for sending http requests between web browsers and other applications. JSON is a very nearly valid Python code using basic types like dictionaries, arrays, strings, numbers, and booleans. So far we used json.loads() and json.dumps() to read and write JSON files. JSON data can be converted to a dataframe using the built-in pandas pd.read_json method

import pandas as pd
df = pd.read_json('data/food.json')
df.head()

description

group

nutrients

portions

tags

Cheese, caraway

Dairy and Egg Products

1008

[{'value': 25.18, 'units': 'g', 'group': 'Comp...

[{'unit': 'oz', 'amount': 1, 'grams': 28.35}]

[]

Cheese, cheddar

Dairy and Egg Products

1009

[{'value': 24.9, 'units': 'g', 'group': 'Compo...

[{'unit': 'cup, diced', 'amount': 1, 'grams': ...

[]

Cheese, edam

Dairy and Egg Products

1018

[{'value': 4.22, 'units': 'g', 'group': 'Other...

[{'unit': 'oz', 'amount': 1, 'grams': 28.35}, ...

[]

Cheese, feta

Dairy and Egg Products

1019

[{'value': 5.2, 'units': 'g', 'group': 'Other'...

[{'unit': 'cup, crumbled', 'amount': 1, 'grams...

[]

Cheese, mozzarella, part skim milk

Dairy and Egg Products

1028

[{'value': 3.27, 'units': 'g', 'group': 'Other...

[{'unit': 'oz', 'amount': 1, 'grams': 28.35}]

[]

The disadvantage of pd.read_json is that we create columns with dictionaries. Better is to select the nutrients information from the column and put it in a different dataframe along with the id number, so that we can combine the two afterwards. Below you see how the first row and column cell is selected and put in a dataframe.

nutrients = pd.DataFrame(db[0]['nutrients'])
nutrients.head(10)

description

group

units

value

Protein

Composition

25.18

Total lipid (fat)

Composition

29.20

Carbohydrate, by difference

Composition

3.06

Ash

Other

3.28

Energy

kcal

376.00

Water

Composition

39.28

Energy

1573.00

Fiber, total dietary

Composition

0.00

Calcium, Ca

Elements

673.00

Iron, Fe

Elements

0.64

To do this for the entire dataframe we need to loop through the dataframe and put each nutrients cell in a dataframe, add the id column for identification purposes. If we first create a list of all these dataframes and then concat them we have one single dataframe for all the nutrients cells in the original dataframe.

nutrients = []
for rec in db:
    fnuts = pd.DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']
    nutrients.append(fnuts)

nutrients = pd.concat(nutrients, ignore_index=True)

nutrients[150:250]

description

group

units

value

150

Arginine

Amino Acids

0.952

1008

151

Histidine

Amino Acids

0.884

1008

152

Alanine

Amino Acids

0.711

1008

153

Aspartic acid

Amino Acids

1.618

1008

154

Glutamic acid

Amino Acids

6.160

1008

155

Glycine

Amino Acids

0.439

1008

156

Proline

Amino Acids

2.838

1008

157

Serine

Amino Acids

1.472

1008

158

Cholesterol

Other

93.000

1008

159

Fatty acids, total saturated

Other

18.584

1008

160

Fatty acids, total monounsaturated

Other

8.275

1008

161

Fatty acids, total polyunsaturated

Other

0.830

1008

162

Protein

Composition

24.900

1009

163

Total lipid (fat)

Composition

33.140

1009

164

Carbohydrate, by difference

Composition

1.280

1009

165

Ash

Other

3.930

1009

166

Energy

kcal

403.000

1009

167

Sucrose

Sugars

0.240

1009

168

Lactose

Sugars

0.230

1009

169

Maltose

Sugars

0.150

1009

170

Alcohol, ethyl

Other

0.000

1009

171

Water

Composition

36.750

1009

172

Caffeine

Other

0.000

1009

173

Theobromine

Other

0.000

1009

174

Energy

1684.000

1009

175

Sugars, total

Composition

0.520

1009

176

Fiber, total dietary

Composition

0.000

1009

177

Calcium, Ca

Elements

721.000

1009

178

Iron, Fe

Elements

0.680

1009

179

Magnesium, Mg

Elements

28.000

1009

...

220

Fatty acids, total monounsaturated

Other

9.391

1009

221

Fatty acids, total polyunsaturated

Other

0.942

1009

222

Dihydrophylloquinone

Vitamins

mcg

0.000

1009

223

Tryptophan

Amino Acids

0.320

1009

224

Threonine

Amino Acids

0.886

1009

225

Isoleucine

Amino Acids

1.546

1009

226

Leucine

Amino Acids

2.385

1009

227

Lysine

Amino Acids

2.072

1009

228

Methionine

Amino Acids

0.652

1009

229

Cystine

Amino Acids

0.125

1009

230

Phenylalanine

Amino Acids

1.311

1009

231

Tyrosine

Amino Acids

1.202

1009

232

Valine

Amino Acids

1.663

1009

233

Arginine

Amino Acids

0.941

1009

234

Histidine

Amino Acids

0.874

1009

235

Alanine

Amino Acids

0.703

1009

236

Aspartic acid

Amino Acids

1.600

1009

237

Glutamic acid

Amino Acids

6.092

1009

238

Glycine

Amino Acids

0.429

1009

239

Proline

Amino Acids

2.806

1009

240

Serine

Amino Acids

1.456

1009

241

Protein

Composition

24.900

1009

242

Total lipid (fat)

Composition

33.140

1009

243

Carbohydrate, by difference

Composition

1.280

1009

244

Ash

Other

3.930

1009

245

Energy

kcal

403.000

1009

246

Sucrose

Sugars

0.240

1009

247

Lactose

Sugars

0.230

1009

248

Maltose

Sugars

0.150

1009

249

Alcohol, ethyl

Other

0.000

1009

100 rows × 5 columns

#check for duplicates
print(len(nutrients))
nutrients.duplicated().sum()

nutrients = nutrients.drop_duplicates()

print(len(nutrients))
nutrients.duplicated().sum()

Since we put the nutrients info in a separate dataframe we can eliminate that from the original database

columns_to_keep = ['description',
                   'group',
                   'id',
                   'manufacturer']
df = df[columns_to_keep]
df.head()

description

group

manufacturer

Cheese, caraway

Dairy and Egg Products

1008

Cheese, cheddar

Dairy and Egg Products

1009

Cheese, edam

Dairy and Egg Products

1018

Cheese, feta

Dairy and Egg Products

1019

Cheese, mozzarella, part skim milk

Dairy and Egg Products

1028

Now we reduced the dimensions in the dataframes we can easily merge them. First we rename the columns that are in both dataframes

df = df.rename(columns = {'description':'food', 
                          'group':'food_group'}, copy = False)

df.head()

food

food_group

Cheese, caraway

Dairy and Egg Products

1008

Cheese, cheddar

Dairy and Egg Products

1009

Cheese, edam

Dairy and Egg Products

1018

Cheese, feta

Dairy and Egg Products

1019

Cheese, mozzarella, part skim milk

Dairy and Egg Products

1028

nutrients = nutrients.rename(columns = {'description':'nutrients',
                                        'group': 'nutrient_group'})
nutrients.head()

nutrients

nutrient_group

units

value

Protein

Composition

25.18

1008

Total lipid (fat)

Composition

29.20

1008

Carbohydrate, by difference

Composition

3.06

1008

Ash

Other

3.28

1008

Energy

kcal

376.00

1008

ndata = pd.merge(nutrients, df, on='id', how='outer')
ndata.head(20)

nutrients

nutrient_group

units

value

food

food_group

manufacturer

Protein

Composition

25.180

1008

Cheese, caraway

Dairy and Egg Products

Total lipid (fat)

Composition

29.200

1008

Cheese, caraway

Dairy and Egg Products

Carbohydrate, by difference

Composition

3.060

1008

Cheese, caraway

Dairy and Egg Products

Ash

Other

3.280

1008

Cheese, caraway

Dairy and Egg Products

Energy

kcal

376.000

1008

Cheese, caraway

Dairy and Egg Products

Water

Composition

39.280

1008

Cheese, caraway

Dairy and Egg Products

Energy

1573.000

1008

Cheese, caraway

Dairy and Egg Products

Fiber, total dietary

Composition

0.000

1008

Cheese, caraway

Dairy and Egg Products

Calcium, Ca

Elements

673.000

1008

Cheese, caraway

Dairy and Egg Products

Iron, Fe

Elements

0.640

1008

Cheese, caraway

Dairy and Egg Products

Magnesium, Mg

Elements

22.000

1008

Cheese, caraway

Dairy and Egg Products

Phosphorus, P

Elements

490.000

1008

Cheese, caraway

Dairy and Egg Products

Potassium, K

Elements

93.000

1008

Cheese, caraway

Dairy and Egg Products

Sodium, Na

Elements

690.000

1008

Cheese, caraway

Dairy and Egg Products

Zinc, Zn

Elements

2.940

1008

Cheese, caraway

Dairy and Egg Products

Copper, Cu

Elements

0.024

1008

Cheese, caraway

Dairy and Egg Products

Manganese, Mn

Elements

0.021

1008

Cheese, caraway

Dairy and Egg Products

Selenium, Se

Elements

mcg

14.500

1008

Cheese, caraway

Dairy and Egg Products

Vitamin A, IU

Vitamins

1054.000

1008

Cheese, caraway

Dairy and Egg Products

Retinol

Vitamins

mcg

262.000

1008

Cheese, caraway

Dairy and Egg Products

With the merged data we can conduct any analysis we like

%matplotlib notebook
result = ndata.groupby(['nutrients', 'food_group'])['value'].quantile(0.5)
result['Total lipid (fat)'].sort_values().plot(kind='barh')

PreviousAdults who binge drink in hot towns NextBreastcancer NKI data

Last updated 2 years ago