USDA food database (JSON)
Case study based on a USDA food database wrangled into into Json format by Ashley Williams.
JSON Data
JSON has become one of the standards for sending http requests between web browsers and other applications. JSON is a very nearly valid Python code using basic types like dictionaries, arrays, strings, numbers, and booleans. So far we used json.loads()
and json.dumps()
to read and write JSON files. JSON data can be converted to a dataframe using the built-in pandas pd.read_json
method
description
group
id
nutrients
portions
tags
0
Cheese, caraway
Dairy and Egg Products
1008
[{'value': 25.18, 'units': 'g', 'group': 'Comp...
[{'unit': 'oz', 'amount': 1, 'grams': 28.35}]
[]
1
Cheese, cheddar
Dairy and Egg Products
1009
[{'value': 24.9, 'units': 'g', 'group': 'Compo...
[{'unit': 'cup, diced', 'amount': 1, 'grams': ...
[]
2
Cheese, edam
Dairy and Egg Products
1018
[{'value': 4.22, 'units': 'g', 'group': 'Other...
[{'unit': 'oz', 'amount': 1, 'grams': 28.35}, ...
[]
3
Cheese, feta
Dairy and Egg Products
1019
[{'value': 5.2, 'units': 'g', 'group': 'Other'...
[{'unit': 'cup, crumbled', 'amount': 1, 'grams...
[]
4
Cheese, mozzarella, part skim milk
Dairy and Egg Products
1028
[{'value': 3.27, 'units': 'g', 'group': 'Other...
[{'unit': 'oz', 'amount': 1, 'grams': 28.35}]
[]
The disadvantage of pd.read_json
is that we create columns with dictionaries. Better is to select the nutrients information from the column and put it in a different dataframe along with the id
number, so that we can combine the two afterwards. Below you see how the first row and column cell is selected and put in a dataframe.
description
group
units
value
0
Protein
Composition
g
25.18
1
Total lipid (fat)
Composition
g
29.20
2
Carbohydrate, by difference
Composition
g
3.06
3
Ash
Other
g
3.28
4
Energy
Energy
kcal
376.00
5
Water
Composition
g
39.28
6
Energy
Energy
kJ
1573.00
7
Fiber, total dietary
Composition
g
0.00
8
Calcium, Ca
Elements
mg
673.00
9
Iron, Fe
Elements
mg
0.64
To do this for the entire dataframe we need to loop through the dataframe and put each nutrients cell in a dataframe, add the id
column for identification purposes. If we first create a list of all these dataframes and then concat them we have one single dataframe for all the nutrients cells in the original dataframe.
description
group
units
value
id
150
Arginine
Amino Acids
g
0.952
1008
151
Histidine
Amino Acids
g
0.884
1008
152
Alanine
Amino Acids
g
0.711
1008
153
Aspartic acid
Amino Acids
g
1.618
1008
154
Glutamic acid
Amino Acids
g
6.160
1008
155
Glycine
Amino Acids
g
0.439
1008
156
Proline
Amino Acids
g
2.838
1008
157
Serine
Amino Acids
g
1.472
1008
158
Cholesterol
Other
mg
93.000
1008
159
Fatty acids, total saturated
Other
g
18.584
1008
160
Fatty acids, total monounsaturated
Other
g
8.275
1008
161
Fatty acids, total polyunsaturated
Other
g
0.830
1008
162
Protein
Composition
g
24.900
1009
163
Total lipid (fat)
Composition
g
33.140
1009
164
Carbohydrate, by difference
Composition
g
1.280
1009
165
Ash
Other
g
3.930
1009
166
Energy
Energy
kcal
403.000
1009
167
Sucrose
Sugars
g
0.240
1009
168
Lactose
Sugars
g
0.230
1009
169
Maltose
Sugars
g
0.150
1009
170
Alcohol, ethyl
Other
g
0.000
1009
171
Water
Composition
g
36.750
1009
172
Caffeine
Other
mg
0.000
1009
173
Theobromine
Other
mg
0.000
1009
174
Energy
Energy
kJ
1684.000
1009
175
Sugars, total
Composition
g
0.520
1009
176
Fiber, total dietary
Composition
g
0.000
1009
177
Calcium, Ca
Elements
mg
721.000
1009
178
Iron, Fe
Elements
mg
0.680
1009
179
Magnesium, Mg
Elements
mg
28.000
1009
...
...
...
...
...
...
220
Fatty acids, total monounsaturated
Other
g
9.391
1009
221
Fatty acids, total polyunsaturated
Other
g
0.942
1009
222
Dihydrophylloquinone
Vitamins
mcg
0.000
1009
223
Tryptophan
Amino Acids
g
0.320
1009
224
Threonine
Amino Acids
g
0.886
1009
225
Isoleucine
Amino Acids
g
1.546
1009
226
Leucine
Amino Acids
g
2.385
1009
227
Lysine
Amino Acids
g
2.072
1009
228
Methionine
Amino Acids
g
0.652
1009
229
Cystine
Amino Acids
g
0.125
1009
230
Phenylalanine
Amino Acids
g
1.311
1009
231
Tyrosine
Amino Acids
g
1.202
1009
232
Valine
Amino Acids
g
1.663
1009
233
Arginine
Amino Acids
g
0.941
1009
234
Histidine
Amino Acids
g
0.874
1009
235
Alanine
Amino Acids
g
0.703
1009
236
Aspartic acid
Amino Acids
g
1.600
1009
237
Glutamic acid
Amino Acids
g
6.092
1009
238
Glycine
Amino Acids
g
0.429
1009
239
Proline
Amino Acids
g
2.806
1009
240
Serine
Amino Acids
g
1.456
1009
241
Protein
Composition
g
24.900
1009
242
Total lipid (fat)
Composition
g
33.140
1009
243
Carbohydrate, by difference
Composition
g
1.280
1009
244
Ash
Other
g
3.930
1009
245
Energy
Energy
kcal
403.000
1009
246
Sucrose
Sugars
g
0.240
1009
247
Lactose
Sugars
g
0.230
1009
248
Maltose
Sugars
g
0.150
1009
249
Alcohol, ethyl
Other
g
0.000
1009
100 rows × 5 columns
Since we put the nutrients info in a separate dataframe we can eliminate that from the original database
description
group
id
manufacturer
0
Cheese, caraway
Dairy and Egg Products
1008
1
Cheese, cheddar
Dairy and Egg Products
1009
2
Cheese, edam
Dairy and Egg Products
1018
3
Cheese, feta
Dairy and Egg Products
1019
4
Cheese, mozzarella, part skim milk
Dairy and Egg Products
1028
Now we reduced the dimensions in the dataframes we can easily merge them. First we rename the columns that are in both dataframes
food
food_group
id
0
Cheese, caraway
Dairy and Egg Products
1008
1
Cheese, cheddar
Dairy and Egg Products
1009
2
Cheese, edam
Dairy and Egg Products
1018
3
Cheese, feta
Dairy and Egg Products
1019
4
Cheese, mozzarella, part skim milk
Dairy and Egg Products
1028
nutrients
nutrient_group
units
value
id
0
Protein
Composition
g
25.18
1008
1
Total lipid (fat)
Composition
g
29.20
1008
2
Carbohydrate, by difference
Composition
g
3.06
1008
3
Ash
Other
g
3.28
1008
4
Energy
Energy
kcal
376.00
1008
nutrients
nutrient_group
units
value
id
food
food_group
manufacturer
0
Protein
Composition
g
25.180
1008
Cheese, caraway
Dairy and Egg Products
1
Total lipid (fat)
Composition
g
29.200
1008
Cheese, caraway
Dairy and Egg Products
2
Carbohydrate, by difference
Composition
g
3.060
1008
Cheese, caraway
Dairy and Egg Products
3
Ash
Other
g
3.280
1008
Cheese, caraway
Dairy and Egg Products
4
Energy
Energy
kcal
376.000
1008
Cheese, caraway
Dairy and Egg Products
5
Water
Composition
g
39.280
1008
Cheese, caraway
Dairy and Egg Products
6
Energy
Energy
kJ
1573.000
1008
Cheese, caraway
Dairy and Egg Products
7
Fiber, total dietary
Composition
g
0.000
1008
Cheese, caraway
Dairy and Egg Products
8
Calcium, Ca
Elements
mg
673.000
1008
Cheese, caraway
Dairy and Egg Products
9
Iron, Fe
Elements
mg
0.640
1008
Cheese, caraway
Dairy and Egg Products
10
Magnesium, Mg
Elements
mg
22.000
1008
Cheese, caraway
Dairy and Egg Products
11
Phosphorus, P
Elements
mg
490.000
1008
Cheese, caraway
Dairy and Egg Products
12
Potassium, K
Elements
mg
93.000
1008
Cheese, caraway
Dairy and Egg Products
13
Sodium, Na
Elements
mg
690.000
1008
Cheese, caraway
Dairy and Egg Products
14
Zinc, Zn
Elements
mg
2.940
1008
Cheese, caraway
Dairy and Egg Products
15
Copper, Cu
Elements
mg
0.024
1008
Cheese, caraway
Dairy and Egg Products
16
Manganese, Mn
Elements
mg
0.021
1008
Cheese, caraway
Dairy and Egg Products
17
Selenium, Se
Elements
mcg
14.500
1008
Cheese, caraway
Dairy and Egg Products
18
Vitamin A, IU
Vitamins
IU
1054.000
1008
Cheese, caraway
Dairy and Egg Products
19
Retinol
Vitamins
mcg
262.000
1008
Cheese, caraway
Dairy and Egg Products
With the merged data we can conduct any analysis we like
Last updated