Combine data
Concat, merge or join
For combining data we can use the methods concat
, merge
and join
. Sometimes all three methods can be applied to get the same end result, it depends on your data and your preference
pd.concat([df_01, df_02])
pd.merge(left, right)
left.join(right)
Concat
with concat we can combine data. This is especcially handy in combining arrays. By default, the concat works with axis=0
, e.g the row concatenation. It glues the rows of one dataframe or array to another
df_01 = pd.DataFrame(np.random.randn(3, 5))
df_01
0
1
2
3
4
0
0.799711
0.546198
-1.499229
0.030387
2.282779
1
0.341611
1.417359
-2.053733
0.687576
-0.275441
2
-0.603197
-0.981207
-1.839125
-1.083722
-1.129557
df_02 = pd.DataFrame(np.random.randn(3,5))
df_02
0
1
2
3
4
0
-1.058341
2.681832
0.752952
-0.348513
-0.289674
1
1.101420
1.136794
-0.515453
-0.106401
1.208821
2
-0.295428
1.292897
-0.584524
-2.071780
-1.573926
df_03 = pd.concat([df_01, df_02])
df_03
0
1
2
3
4
0
0.799711
0.546198
-1.499229
0.030387
2.282779
1
0.341611
1.417359
-2.053733
0.687576
-0.275441
2
-0.603197
-0.981207
-1.839125
-1.083722
-1.129557
0
-1.058341
2.681832
0.752952
-0.348513
-0.289674
1
1.101420
1.136794
-0.515453
-0.106401
1.208821
2
-0.295428
1.292897
-0.584524
-2.071780
-1.573926
if you pass axis=1
it will glue the dataframes in the column direction
df_03 = pd.concat([df_01, df_02], axis=1)
df_03
0
1
2
3
4
0
1
2
3
4
0
0.799711
0.546198
-1.499229
0.030387
2.282779
-1.058341
2.681832
0.752952
-0.348513
-0.289674
1
0.341611
1.417359
-2.053733
0.687576
-0.275441
1.101420
1.136794
-0.515453
-0.106401
1.208821
2
-0.603197
-0.981207
-1.839125
-1.083722
-1.129557
-0.295428
1.292897
-0.584524
-2.071780
-1.573926
Merge
In the JSON study case we saw that we had to combine data sources. Before we look into the ways how we merge data we should consider the difference between outer, inner, left and right joins. The following picture shows the differences:

You can also perform a SQL-style join using the .merge() function:
pd.merge?
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
left
key
left_value
0
A
1
1
B
2
2
C
3
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, '51,3']})
right
key
right_value
0
A
3
1
B
4
2
D
51,3
pd.merge(left, right, how='inner', left_on=['key'], right_on=['key'])
key
left_value
right_value
0
A
1
3
1
B
2
4
pd.merge(left, right, how='outer', left_on=['key'], right_on=['key'])
key
left_value
right_value
0
A
1.0
3
1
B
2.0
4
2
C
3.0
NaN
3
D
NaN
51,3
pd.merge(left, right, how='right', left_on=['key'], right_on=['key'])
key
left_value
right_value
0
A
1.0
3
1
B
2.0
4
2
D
NaN
51,3
pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
key
left_value
right_value
0
A
1
3
1
B
2
4
2
C
3
NaN
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3], 'other_key': ['X','Y','Z']})
left
key
left_value
other_key
0
A
1
X
1
B
2
Y
2
C
3
Z
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3,'53,2', 5], 'some_key': ['W','Y', 'Z']})
right
key
right_value
some_key
0
A
3
W
1
B
53,2
Y
2
D
5
Z
pd.merge(left, right, how='inner', left_on=['key', 'other_key'], right_on=['key', 'some_key' ])
key
left_value
other_key
right_value
some_key
0
B
2
Y
53,2
Y
df_03 = pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
df_03
key
left_value
other_key
right_value
some_key
0
A
1
X
3
W
1
B
2
Y
53,2
Y
2
C
3
Z
NaN
NaN
Join
Pandas DataFrame has also a join function for merging by index. However overlapping columns cannot exist.
left
key
left_value
other_key
0
A
1
X
1
B
2
Y
2
C
3
Z
right
key
right_value
some_key
0
A
3
W
1
B
53,2
Y
2
D
5
Z
left.set_index('key').join(right.set_index('key'), how='outer')
left_value
other_key
right_value
some_key
key
A
1.0
X
3
W
B
2.0
Y
53,2
Y
C
3.0
Z
NaN
NaN
D
NaN
NaN
5
Z
right = right.rename(columns = {'key': 'name'})
right
name
right_value
some_key
0
A
3
W
1
B
53,2
Y
2
D
5
Z
df_04 = left.join(right, how='outer')
df_04
key
left_value
other_key
name
right_value
some_key
0
A
1
X
A
3
W
1
B
2
Y
B
53,2
Y
2
C
3
Z
D
5
Z
With the on=
argument you can match indexes with keys. For example:
left1 = pd.DataFrame({'key': ['a','b','a','a','b','c'], 'value': range(6)})
left1
key
value
0
a
0
1
b
1
2
a
2
3
a
3
4
b
4
5
c
5
right1 = pd.DataFrame({'group_val': [3.5,7]}, index = ['a','b'])
right1
group_val
a
3.5
b
7.0
df_05 = left1.join(right1, on='key')
df_05
key
value
group_val
0
a
0
3.5
1
b
1
7.0
2
a
2
3.5
3
a
3
3.5
4
b
4
7.0
5
c
5
NaN
Merging on indexes with merge
is also possible
left1 = left1.set_index('key')
left1
value
key
a
0
b
1
a
2
a
3
b
4
c
5
df_06 = pd.merge(left1, right1, how = 'outer', left_index=True, right_index=True)
df_06
value
group_val
a
0
3.5
a
2
3.5
a
3
3.5
b
1
7.0
b
4
7.0
c
5
NaN
Last updated
Was this helpful?