Combine data
For combining data we can use the methods concat
, merge
and join
. Sometimes all three methods can be applyed to get the same end result, it depends on your data and your preference.
import pandas as pd
import numpy as np
pd.concat?
pd.merge?
pd.DataFrame.join?
Concat
with concat we can combine data. This is especially handy in combining arrays. By default the concat works with axis=0
, e.g the row concatenation. It glues the rows of one dataframe or array to another
df_01 = pd.DataFrame(np.random.randn(3, 5))
df_01
0
0.113879
1.824252
-1.007082
-0.411709
-0.129588
1
0.571164
-1.167565
-1.462957
-0.573230
-0.974223
2
0.136346
-0.870894
-1.320389
1.000776
2.227670
df_02 = pd.DataFrame(np.random.randn(3,5))
df_02
0
-0.656365
0.998600
-2.124309
0.574141
-1.108821
1
1.082656
0.051674
-1.077717
1.066916
1.290093
2
0.460137
0.015859
0.216606
1.164776
0.050458
df_03 = pd.concat([df_01, df_02])
df_03
0
0.113879
1.824252
-1.007082
-0.411709
-0.129588
1
0.571164
-1.167565
-1.462957
-0.573230
-0.974223
2
0.136346
-0.870894
-1.320389
1.000776
2.227670
0
-0.656365
0.998600
-2.124309
0.574141
-1.108821
1
1.082656
0.051674
-1.077717
1.066916
1.290093
2
0.460137
0.015859
0.216606
1.164776
0.050458
if you pass axis=1
it will glue the dataframes in the column direction
df_03 = pd.concat([df_01, df_02], axis=1)
df_03
0
0.113879
1.824252
-1.007082
-0.411709
-0.129588
-0.656365
0.998600
-2.124309
0.574141
-1.108821
1
0.571164
-1.167565
-1.462957
-0.573230
-0.974223
1.082656
0.051674
-1.077717
1.066916
1.290093
2
0.136346
-0.870894
-1.320389
1.000776
2.227670
0.460137
0.015859
0.216606
1.164776
0.050458
In case of unequal shape it will fill the gaps with NaN
A = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]}).set_index('key')
B = pd.DataFrame({'key': ['A', 'B', 'X', 'Y'], 'value': [3, 4, 5, 7]}).set_index('key')
A
key
A
1
B
2
C
3
B
key
A
3
B
4
X
5
Y
7
Merge
You can also perform a SQL-style join using the .merge() function:
pd.merge?
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
left
0
A
1
1
B
2
2
C
3
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, '51,3']})
right
0
A
3
1
B
4
2
D
51,3
pd.merge(left, right, how='inner', left_on=['key'], right_on=['key'])
0
A
1
3
1
B
2
4
pd.merge(left, right, how='outer', left_on=['key'], right_on=['key'])
0
A
1.0
3
1
B
2.0
4
2
C
3.0
NaN
3
D
NaN
51,3
pd.merge(left, right, how='right', left_on=['key'], right_on=['key'])
0
A
1.0
3
1
B
2.0
4
2
D
NaN
51,3
pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
0
A
1
3
1
B
2
4
2
C
3
NaN
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3], 'other_key': ['X','Y','Z']})
left
0
A
1
X
1
B
2
Y
2
C
3
Z
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3,'53,2', 5], 'some_key': ['W','Y', 'Z']})
right
0
A
3
W
1
B
53,2
Y
2
D
5
Z
pd.merge(left, right, how='inner', left_on=['key', 'other_key'], right_on=['key', 'some_key' ])
0
B
2
Y
53,2
Y
df_03 = pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
df_03
0
A
1
X
3
W
1
B
2
Y
53,2
Y
2
C
3
Z
NaN
NaN
Join
Pandas DataFrame has als a join function for merging by index. However overlapping columns cannot exist.
left
0
A
1
X
1
B
2
Y
2
C
3
Z
right
0
A
3
W
1
B
53,2
Y
2
D
5
Z
left.set_index('key').join(right.set_index('key'), how='outer')
key
A
1.0
X
3
W
B
2.0
Y
53,2
Y
C
3.0
Z
NaN
NaN
D
NaN
NaN
5
Z
right = right.rename(columns = {'key': 'name'})
right
0
A
3
W
1
B
53,2
Y
2
D
5
Z
df_04 = left.join(right, how='outer')
df_04
0
A
1
X
A
3
W
1
B
2
Y
B
53,2
Y
2
C
3
Z
D
5
Z
With the on=
argument you can match indexes with keys. For example:
left1 = pd.DataFrame({'key': ['a','b','a','a','b','c'], 'value': range(6)})
left1
0
a
0
1
b
1
2
a
2
3
a
3
4
b
4
5
c
5
right1 = pd.DataFrame({'group_val': [3.5,7]}, index = ['a','b'])
right1
a
3.5
b
7.0
df_05 = left1.join(right1, on='key')
df_05
0
a
0
3.5
1
b
1
7.0
2
a
2
3.5
3
a
3
3.5
4
b
4
7.0
5
c
5
NaN
Merging on indexes with merge
is also possible
left1 = left1.set_index('key')
left1
key
a
0
b
1
a
2
a
3
b
4
c
5
df_06 = pd.merge(left1, right1, how = 'outer', left_index=True, right_index=True)
df_06
a
0
3.5
a
2
3.5
a
3
3.5
b
1
7.0
b
4
7.0
c
5
NaN
Last updated
Was this helpful?