USDA food database (JSON)

Case study based on a USDA food database wrangled into into Json format by Ashley Williams.

JSON Data

JSON has become one of the standards for sending http requests between web browsers and other applications. JSON is a very nearly valid Python code using basic types like dictionaries, arrays, strings, numbers, and booleans. So far we used json.loads() and json.dumps() to read and write JSON files. JSON data can be converted to a dataframe using the built-in pandas pd.read_json method

import pandas as pd
df = pd.read_json('data/food.json')
df.head()

description

group

id

nutrients

portions

tags

0

Cheese, caraway

Dairy and Egg Products

1008

[{'value': 25.18, 'units': 'g', 'group': 'Comp...

[{'unit': 'oz', 'amount': 1, 'grams': 28.35}]

[]

1

Cheese, cheddar

Dairy and Egg Products

1009

[{'value': 24.9, 'units': 'g', 'group': 'Compo...

[{'unit': 'cup, diced', 'amount': 1, 'grams': ...

[]

2

Cheese, edam

Dairy and Egg Products

1018

[{'value': 4.22, 'units': 'g', 'group': 'Other...

[{'unit': 'oz', 'amount': 1, 'grams': 28.35}, ...

[]

3

Cheese, feta

Dairy and Egg Products

1019

[{'value': 5.2, 'units': 'g', 'group': 'Other'...

[{'unit': 'cup, crumbled', 'amount': 1, 'grams...

[]

4

Cheese, mozzarella, part skim milk

Dairy and Egg Products

1028

[{'value': 3.27, 'units': 'g', 'group': 'Other...

[{'unit': 'oz', 'amount': 1, 'grams': 28.35}]

[]

The disadvantage of pd.read_json is that we create columns with dictionaries. Better is to select the nutrients information from the column and put it in a different dataframe along with the id number, so that we can combine the two afterwards. Below you see how the first row and column cell is selected and put in a dataframe.

nutrients = pd.DataFrame(db[0]['nutrients'])
nutrients.head(10)

description

group

units

value

0

Protein

Composition

g

25.18

1

Total lipid (fat)

Composition

g

29.20

2

Carbohydrate, by difference

Composition

g

3.06

3

Ash

Other

g

3.28

4

Energy

Energy

kcal

376.00

5

Water

Composition

g

39.28

6

Energy

Energy

kJ

1573.00

7

Fiber, total dietary

Composition

g

0.00

8

Calcium, Ca

Elements

mg

673.00

9

Iron, Fe

Elements

mg

0.64

To do this for the entire dataframe we need to loop through the dataframe and put each nutrients cell in a dataframe, add the id column for identification purposes. If we first create a list of all these dataframes and then concat them we have one single dataframe for all the nutrients cells in the original dataframe.

nutrients = []
for rec in db:
    fnuts = pd.DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']
    nutrients.append(fnuts)

nutrients = pd.concat(nutrients, ignore_index=True)
nutrients[150:250]

description

group

units

value

id

150

Arginine

Amino Acids

g

0.952

1008

151

Histidine

Amino Acids

g

0.884

1008

152

Alanine

Amino Acids

g

0.711

1008

153

Aspartic acid

Amino Acids

g

1.618

1008

154

Glutamic acid

Amino Acids

g

6.160

1008

155

Glycine

Amino Acids

g

0.439

1008

156

Proline

Amino Acids

g

2.838

1008

157

Serine

Amino Acids

g

1.472

1008

158

Cholesterol

Other

mg

93.000

1008

159

Fatty acids, total saturated

Other

g

18.584

1008

160

Fatty acids, total monounsaturated

Other

g

8.275

1008

161

Fatty acids, total polyunsaturated

Other

g

0.830

1008

162

Protein

Composition

g

24.900

1009

163

Total lipid (fat)

Composition

g

33.140

1009

164

Carbohydrate, by difference

Composition

g

1.280

1009

165

Ash

Other

g

3.930

1009

166

Energy

Energy

kcal

403.000

1009

167

Sucrose

Sugars

g

0.240

1009

168

Lactose

Sugars

g

0.230

1009

169

Maltose

Sugars

g

0.150

1009

170

Alcohol, ethyl

Other

g

0.000

1009

171

Water

Composition

g

36.750

1009

172

Caffeine

Other

mg

0.000

1009

173

Theobromine

Other

mg

0.000

1009

174

Energy

Energy

kJ

1684.000

1009

175

Sugars, total

Composition

g

0.520

1009

176

Fiber, total dietary

Composition

g

0.000

1009

177

Calcium, Ca

Elements

mg

721.000

1009

178

Iron, Fe

Elements

mg

0.680

1009

179

Magnesium, Mg

Elements

mg

28.000

1009

...

...

...

...

...

...

220

Fatty acids, total monounsaturated

Other

g

9.391

1009

221

Fatty acids, total polyunsaturated

Other

g

0.942

1009

222

Dihydrophylloquinone

Vitamins

mcg

0.000

1009

223

Tryptophan

Amino Acids

g

0.320

1009

224

Threonine

Amino Acids

g

0.886

1009

225

Isoleucine

Amino Acids

g

1.546

1009

226

Leucine

Amino Acids

g

2.385

1009

227

Lysine

Amino Acids

g

2.072

1009

228

Methionine

Amino Acids

g

0.652

1009

229

Cystine

Amino Acids

g

0.125

1009

230

Phenylalanine

Amino Acids

g

1.311

1009

231

Tyrosine

Amino Acids

g

1.202

1009

232

Valine

Amino Acids

g

1.663

1009

233

Arginine

Amino Acids

g

0.941

1009

234

Histidine

Amino Acids

g

0.874

1009

235

Alanine

Amino Acids

g

0.703

1009

236

Aspartic acid

Amino Acids

g

1.600

1009

237

Glutamic acid

Amino Acids

g

6.092

1009

238

Glycine

Amino Acids

g

0.429

1009

239

Proline

Amino Acids

g

2.806

1009

240

Serine

Amino Acids

g

1.456

1009

241

Protein

Composition

g

24.900

1009

242

Total lipid (fat)

Composition

g

33.140

1009

243

Carbohydrate, by difference

Composition

g

1.280

1009

244

Ash

Other

g

3.930

1009

245

Energy

Energy

kcal

403.000

1009

246

Sucrose

Sugars

g

0.240

1009

247

Lactose

Sugars

g

0.230

1009

248

Maltose

Sugars

g

0.150

1009

249

Alcohol, ethyl

Other

g

0.000

1009

100 rows × 5 columns

#check for duplicates
print(len(nutrients))
nutrients.duplicated().sum()
389355





14179
nutrients = nutrients.drop_duplicates()
print(len(nutrients))
nutrients.duplicated().sum()
375176





0

Since we put the nutrients info in a separate dataframe we can eliminate that from the original database

columns_to_keep = ['description',
                   'group',
                   'id',
                   'manufacturer']
df = df[columns_to_keep]
df.head()

description

group

id

manufacturer

0

Cheese, caraway

Dairy and Egg Products

1008

1

Cheese, cheddar

Dairy and Egg Products

1009

2

Cheese, edam

Dairy and Egg Products

1018

3

Cheese, feta

Dairy and Egg Products

1019

4

Cheese, mozzarella, part skim milk

Dairy and Egg Products

1028

Now we reduced the dimensions in the dataframes we can easily merge them. First we rename the columns that are in both dataframes

df = df.rename(columns = {'description':'food', 
                          'group':'food_group'}, copy = False)
df.head()

food

food_group

id

0

Cheese, caraway

Dairy and Egg Products

1008

1

Cheese, cheddar

Dairy and Egg Products

1009

2

Cheese, edam

Dairy and Egg Products

1018

3

Cheese, feta

Dairy and Egg Products

1019

4

Cheese, mozzarella, part skim milk

Dairy and Egg Products

1028

nutrients = nutrients.rename(columns = {'description':'nutrients',
                                        'group': 'nutrient_group'})
nutrients.head()

nutrients

nutrient_group

units

value

id

0

Protein

Composition

g

25.18

1008

1

Total lipid (fat)

Composition

g

29.20

1008

2

Carbohydrate, by difference

Composition

g

3.06

1008

3

Ash

Other

g

3.28

1008

4

Energy

Energy

kcal

376.00

1008

ndata = pd.merge(nutrients, df, on='id', how='outer')
ndata.head(20)

nutrients

nutrient_group

units

value

id

food

food_group

manufacturer

0

Protein

Composition

g

25.180

1008

Cheese, caraway

Dairy and Egg Products

1

Total lipid (fat)

Composition

g

29.200

1008

Cheese, caraway

Dairy and Egg Products

2

Carbohydrate, by difference

Composition

g

3.060

1008

Cheese, caraway

Dairy and Egg Products

3

Ash

Other

g

3.280

1008

Cheese, caraway

Dairy and Egg Products

4

Energy

Energy

kcal

376.000

1008

Cheese, caraway

Dairy and Egg Products

5

Water

Composition

g

39.280

1008

Cheese, caraway

Dairy and Egg Products

6

Energy

Energy

kJ

1573.000

1008

Cheese, caraway

Dairy and Egg Products

7

Fiber, total dietary

Composition

g

0.000

1008

Cheese, caraway

Dairy and Egg Products

8

Calcium, Ca

Elements

mg

673.000

1008

Cheese, caraway

Dairy and Egg Products

9

Iron, Fe

Elements

mg

0.640

1008

Cheese, caraway

Dairy and Egg Products

10

Magnesium, Mg

Elements

mg

22.000

1008

Cheese, caraway

Dairy and Egg Products

11

Phosphorus, P

Elements

mg

490.000

1008

Cheese, caraway

Dairy and Egg Products

12

Potassium, K

Elements

mg

93.000

1008

Cheese, caraway

Dairy and Egg Products

13

Sodium, Na

Elements

mg

690.000

1008

Cheese, caraway

Dairy and Egg Products

14

Zinc, Zn

Elements

mg

2.940

1008

Cheese, caraway

Dairy and Egg Products

15

Copper, Cu

Elements

mg

0.024

1008

Cheese, caraway

Dairy and Egg Products

16

Manganese, Mn

Elements

mg

0.021

1008

Cheese, caraway

Dairy and Egg Products

17

Selenium, Se

Elements

mcg

14.500

1008

Cheese, caraway

Dairy and Egg Products

18

Vitamin A, IU

Vitamins

IU

1054.000

1008

Cheese, caraway

Dairy and Egg Products

19

Retinol

Vitamins

mcg

262.000

1008

Cheese, caraway

Dairy and Egg Products

With the merged data we can conduct any analysis we like

%matplotlib notebook
result = ndata.groupby(['nutrients', 'food_group'])['value'].quantile(0.5)
result['Total lipid (fat)'].sort_values().plot(kind='barh')

Last updated