Combine data

Concat, merge or join

For combining data we can use the methods concat, merge and join. Sometimes all three methods can be applied to get the same end result, it depends on your data and your preference

pd.concat([df_01, df_02])
pd.merge(left, right)
left.join(right)

Concat

with concat we can combine data. This is especcially handy in combining arrays. By default, the concat works with axis=0, e.g the row concatenation. It glues the rows of one dataframe or array to another

df_01 = pd.DataFrame(np.random.randn(3, 5))
df_01

0.799711

0.546198

-1.499229

0.030387

2.282779

0.341611

1.417359

-2.053733

0.687576

-0.275441

-0.603197

-0.981207

-1.839125

-1.083722

-1.129557

df_02 = pd.DataFrame(np.random.randn(3,5))
df_02

-1.058341

2.681832

0.752952

-0.348513

-0.289674

1.101420

1.136794

-0.515453

-0.106401

1.208821

-0.295428

1.292897

-0.584524

-2.071780

-1.573926

df_03 = pd.concat([df_01, df_02])
df_03

0.799711

0.546198

-1.499229

0.030387

2.282779

0.341611

1.417359

-2.053733

0.687576

-0.275441

-0.603197

-0.981207

-1.839125

-1.083722

-1.129557

-1.058341

2.681832

0.752952

-0.348513

-0.289674

1.101420

1.136794

-0.515453

-0.106401

1.208821

-0.295428

1.292897

-0.584524

-2.071780

-1.573926

if you pass axis=1 it will glue the dataframes in the column direction

df_03 = pd.concat([df_01, df_02], axis=1)
df_03

0.799711

0.546198

-1.499229

0.030387

2.282779

-1.058341

2.681832

0.752952

-0.348513

-0.289674

0.341611

1.417359

-2.053733

0.687576

-0.275441

1.101420

1.136794

-0.515453

-0.106401

1.208821

-0.603197

-0.981207

-1.839125

-1.083722

-1.129557

-0.295428

1.292897

-0.584524

-2.071780

-1.573926

Merge

In the JSON study case we saw that we had to combine data sources. Before we look into the ways how we merge data we should consider the difference between outer, inner, left and right joins. The following picture shows the differences:

You can also perform a SQL-style join using the .merge() function:

pd.merge?

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
left

key

left_value

right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, '51,3']})
right

key

right_value

51,3

pd.merge(left, right, how='inner', left_on=['key'], right_on=['key'])

key

left_value

right_value

pd.merge(left, right, how='outer', left_on=['key'], right_on=['key'])

key

left_value

right_value

1.0

2.0

3.0

NaN

51,3

pd.merge(left, right, how='right', left_on=['key'], right_on=['key'])

key

left_value

right_value

1.0

2.0

NaN

51,3

pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])

key

left_value

right_value

NaN

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3], 'other_key': ['X','Y','Z']})
left

key

left_value

other_key

right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3,'53,2', 5], 'some_key': ['W','Y', 'Z']})
right

key

right_value

some_key

53,2

pd.merge(left, right, how='inner', left_on=['key', 'other_key'], right_on=['key', 'some_key' ])

key

left_value

other_key

right_value

some_key

53,2

df_03 = pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
df_03

key

left_value

other_key

right_value

some_key

53,2

NaN

Join

Pandas DataFrame has also a join function for merging by index. However overlapping columns cannot exist.

left

key

left_value

other_key

right

key

right_value

some_key

53,2

left.set_index('key').join(right.set_index('key'), how='outer')

left_value

other_key

right_value

some_key

key

1.0

2.0

53,2

3.0

NaN

right = right.rename(columns = {'key': 'name'})
right

name

right_value

some_key

53,2

df_04 = left.join(right, how='outer')
df_04

key

left_value

other_key

name

right_value

some_key

53,2

With the on= argument you can match indexes with keys. For example:

left1 = pd.DataFrame({'key': ['a','b','a','a','b','c'], 'value': range(6)})
left1

key

value

right1 = pd.DataFrame({'group_val': [3.5,7]}, index = ['a','b'])
right1

group_val

3.5

7.0

df_05 = left1.join(right1, on='key')
df_05

key

value

group_val

3.5

7.0

3.5

7.0

NaN

Merging on indexes with merge is also possible

left1 = left1.set_index('key')
left1

value

key

df_06 = pd.merge(left1, right1, how = 'outer', left_index=True, right_index=True)
df_06

value

group_val

3.5

7.0

NaN

PreviousReshape with melt NextNormalization

Last updated 2 years ago