Combine data

For combining data we can use the methods concat, merge and join. Sometimes all three methods can be applyed to get the same end result, it depends on your data and your preference.

import pandas as pd
import numpy as np
pd.concat?
pd.merge?
pd.DataFrame.join?

Concat

with concat we can combine data. This is especially handy in combining arrays. By default the concat works with axis=0, e.g the row concatenation. It glues the rows of one dataframe or array to another

df_01 = pd.DataFrame(np.random.randn(3, 5))
df_01

0
1
2
3
4

0

0.113879

1.824252

-1.007082

-0.411709

-0.129588

1

0.571164

-1.167565

-1.462957

-0.573230

-0.974223

2

0.136346

-0.870894

-1.320389

1.000776

2.227670

df_02 = pd.DataFrame(np.random.randn(3,5))
df_02

0
1
2
3
4

0

-0.656365

0.998600

-2.124309

0.574141

-1.108821

1

1.082656

0.051674

-1.077717

1.066916

1.290093

2

0.460137

0.015859

0.216606

1.164776

0.050458

df_03 = pd.concat([df_01, df_02])
df_03

0
1
2
3
4

0

0.113879

1.824252

-1.007082

-0.411709

-0.129588

1

0.571164

-1.167565

-1.462957

-0.573230

-0.974223

2

0.136346

-0.870894

-1.320389

1.000776

2.227670

0

-0.656365

0.998600

-2.124309

0.574141

-1.108821

1

1.082656

0.051674

-1.077717

1.066916

1.290093

2

0.460137

0.015859

0.216606

1.164776

0.050458

if you pass axis=1 it will glue the dataframes in the column direction

df_03 = pd.concat([df_01, df_02], axis=1)
df_03

0
1
2
3
4
0
1
2
3
4

0

0.113879

1.824252

-1.007082

-0.411709

-0.129588

-0.656365

0.998600

-2.124309

0.574141

-1.108821

1

0.571164

-1.167565

-1.462957

-0.573230

-0.974223

1.082656

0.051674

-1.077717

1.066916

1.290093

2

0.136346

-0.870894

-1.320389

1.000776

2.227670

0.460137

0.015859

0.216606

1.164776

0.050458

In case of unequal shape it will fill the gaps with NaN

A = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]}).set_index('key')
B = pd.DataFrame({'key': ['A', 'B', 'X', 'Y'], 'value': [3, 4, 5, 7]}).set_index('key')
A

value

key

A

1

B

2

C

3

B

value

key

A

3

B

4

X

5

Y

7

Merge

You can also perform a SQL-style join using the .merge() function:

pd.merge?
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
left 

key
left_value

0

A

1

1

B

2

2

C

3

right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, '51,3']})
right

key
right_value

0

A

3

1

B

4

2

D

51,3

pd.merge(left, right, how='inner', left_on=['key'], right_on=['key'])

key
left_value
right_value

0

A

1

3

1

B

2

4

pd.merge(left, right, how='outer', left_on=['key'], right_on=['key'])
key
left_value
right_value

0

A

1.0

3

1

B

2.0

4

2

C

3.0

NaN

3

D

NaN

51,3

pd.merge(left, right, how='right', left_on=['key'], right_on=['key'])
key
left_value
right_value

0

A

1.0

3

1

B

2.0

4

2

D

NaN

51,3

pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
key
left_value
right_value

0

A

1

3

1

B

2

4

2

C

3

NaN

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3], 'other_key': ['X','Y','Z']})
left

key
left_value
other_key

0

A

1

X

1

B

2

Y

2

C

3

Z

right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3,'53,2', 5], 'some_key': ['W','Y', 'Z']})
right

key
right_value
some_key

0

A

3

W

1

B

53,2

Y

2

D

5

Z

pd.merge(left, right, how='inner', left_on=['key', 'other_key'], right_on=['key', 'some_key' ])

key
left_value
other_key
right_value
some_key

0

B

2

Y

53,2

Y

df_03 = pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
df_03
key
left_value
other_key
right_value
some_key

0

A

1

X

3

W

1

B

2

Y

53,2

Y

2

C

3

Z

NaN

NaN

Join

Pandas DataFrame has als a join function for merging by index. However overlapping columns cannot exist.

left

key
left_value
other_key

0

A

1

X

1

B

2

Y

2

C

3

Z

right

key
right_value
some_key

0

A

3

W

1

B

53,2

Y

2

D

5

Z

left.set_index('key').join(right.set_index('key'), how='outer')
left_value
other_key
right_value
some_key

key

A

1.0

X

3

W

B

2.0

Y

53,2

Y

C

3.0

Z

NaN

NaN

D

NaN

NaN

5

Z

right = right.rename(columns = {'key': 'name'})
right

name
right_value
some_key

0

A

3

W

1

B

53,2

Y

2

D

5

Z

df_04 = left.join(right, how='outer')
df_04
key
left_value
other_key
name
right_value
some_key

0

A

1

X

A

3

W

1

B

2

Y

B

53,2

Y

2

C

3

Z

D

5

Z

With the on= argument you can match indexes with keys. For example:

left1 = pd.DataFrame({'key': ['a','b','a','a','b','c'], 'value': range(6)})
left1

key
value

0

a

0

1

b

1

2

a

2

3

a

3

4

b

4

5

c

5

right1 = pd.DataFrame({'group_val': [3.5,7]}, index = ['a','b'])
right1

group_val

a

3.5

b

7.0

df_05 = left1.join(right1, on='key')
df_05
key
value
group_val

0

a

0

3.5

1

b

1

7.0

2

a

2

3.5

3

a

3

3.5

4

b

4

7.0

5

c

5

NaN

Merging on indexes with merge is also possible

left1 = left1.set_index('key')
left1

value

key

a

0

b

1

a

2

a

3

b

4

c

5

df_06 = pd.merge(left1, right1, how = 'outer', left_index=True, right_index=True)
df_06
value
group_val

a

0

3.5

a

2

3.5

a

3

3.5

b

1

7.0

b

4

7.0

c

5

NaN

Last updated

Was this helpful?