Combine data

To combine data we can use the methods concat, merge and join. Often all three can be applied to reach the same end result; which one you choose depends on your data and your preference.
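As a minimal sketch of that overlap, the snippet below builds two small frames that share the same index and combines them in all three ways. The frames `prices` and `stock` and their values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Two hypothetical frames that share the same index labels.
prices = pd.DataFrame({'price': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
stock = pd.DataFrame({'stock': [10, 20, 30]}, index=['a', 'b', 'c'])

# Three routes to the same side-by-side result.
via_concat = pd.concat([prices, stock], axis=1)
via_merge = pd.merge(prices, stock, left_index=True, right_index=True)
via_join = prices.join(stock)

# Compare the three results pairwise.
same = via_concat.equals(via_merge) and via_merge.equals(via_join)
print(same)
```

The three calls only coincide here because the index labels line up exactly; once they differ, the default join type of each method (outer for concat, inner for merge, left for join) starts to matter.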

import pandas as pd
import numpy as np

pd.concat?

pd.merge?

pd.DataFrame.join?

Concat

With concat we can combine data along an axis. This is especially handy for combining arrays. By default concat works with axis=0, i.e. row-wise concatenation: it glues the rows of one DataFrame or array onto another.

df_01 = pd.DataFrame(np.random.randn(3, 5))
df_01

          0         1         2         3         4
0  0.113879  1.824252 -1.007082 -0.411709 -0.129588
1  0.571164 -1.167565 -1.462957 -0.573230 -0.974223
2  0.136346 -0.870894 -1.320389  1.000776  2.227670

df_02 = pd.DataFrame(np.random.randn(3,5))
df_02

          0         1         2         3         4
0 -0.656365  0.998600 -2.124309  0.574141 -1.108821
1  1.082656  0.051674 -1.077717  1.066916  1.290093
2  0.460137  0.015859  0.216606  1.164776  0.050458

df_03 = pd.concat([df_01, df_02])
df_03

          0         1         2         3         4
0  0.113879  1.824252 -1.007082 -0.411709 -0.129588
1  0.571164 -1.167565 -1.462957 -0.573230 -0.974223
2  0.136346 -0.870894 -1.320389  1.000776  2.227670
0 -0.656365  0.998600 -2.124309  0.574141 -1.108821
1  1.082656  0.051674 -1.077717  1.066916  1.290093
2  0.460137  0.015859  0.216606  1.164776  0.050458
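Row-wise concat keeps the original index labels, so they repeat in the result. The sketch below shows one way to get a fresh index instead, using the `ignore_index` parameter of `pd.concat`. The frames `top` and `bottom` are hypothetical stand-ins for the random frames used in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins with predictable contents.
top = pd.DataFrame(np.zeros((3, 2)))
bottom = pd.DataFrame(np.ones((3, 2)))

# Default behaviour: each frame keeps its own labels, so 0, 1, 2 appear twice.
stacked = pd.concat([top, bottom])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and numbers the rows anew.
renumbered = pd.concat([top, bottom], ignore_index=True)
print(renumbered.index.tolist())
```

Duplicate labels are legal in pandas, but a lookup such as `stacked.loc[0]` then returns two rows, which is easy to overlook.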

If you pass axis=1, concat glues the DataFrames together in the column direction instead.

df_03 = pd.concat([df_01, df_02], axis=1)
df_03

          0         1         2         3         4         0         1         2         3         4
0  0.113879  1.824252 -1.007082 -0.411709 -0.129588 -0.656365  0.998600 -2.124309  0.574141 -1.108821
1  0.571164 -1.167565 -1.462957 -0.573230 -0.974223  1.082656  0.051674 -1.077717  1.066916  1.290093
2  0.136346 -0.870894 -1.320389  1.000776  2.227670  0.460137  0.015859  0.216606  1.164776  0.050458
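Column-wise concat of frames with the same column labels produces duplicate column names. One possible way to keep the two halves apart is the `keys` parameter of `pd.concat`, which adds an outer column level. The names `first`, `second` and the key labels below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with identical column labels 0 and 1.
first = pd.DataFrame(np.zeros((3, 2)))
second = pd.DataFrame(np.ones((3, 2)))

# keys= labels each input, giving a two-level column index.
wide = pd.concat([first, second], axis=1, keys=['first', 'second'])
print(wide.columns.tolist())

# Selecting on the outer level returns the columns that came from `second`.
from_second = wide['second']
print(from_second.shape)
```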

When the frames do not share the same labels along the other axis, the gaps are filled with NaN.

A = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]}).set_index('key')
B = pd.DataFrame({'key': ['A', 'B', 'X', 'Y'], 'value': [3, 4, 5, 7]}).set_index('key')

A

     value
key
A        1
B        2
C        3

B

     value
key
A        3
B        4
X        5
Y        7
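The NaN filling can be sketched with the two frames defined above. The snippet concatenates them column-wise, so rows are aligned on the `key` index and labels present in only one frame get NaN in the other frame's column. The variable names `outer` and `inner` are introduced here for illustration.

```python
import pandas as pd

A = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]}).set_index('key')
B = pd.DataFrame({'key': ['A', 'B', 'X', 'Y'], 'value': [3, 4, 5, 7]}).set_index('key')

# Default join='outer': the union of both indexes, gaps filled with NaN.
outer = pd.concat([A, B], axis=1)
print(outer)

# join='inner' keeps only the labels present in both frames, so no NaN appears.
inner = pd.concat([A, B], axis=1, join='inner')
print(inner)
```

Because integers cannot hold NaN, the columns of `outer` are converted to floats, which is why whole numbers show up as 1.0, 2.0 and so on.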

Merge

You can also perform a SQL-style join using the .merge() function:

pd.merge?

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
left

  key  left_value
0   A           1
1   B           2
2   C           3

right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, '51,3']})
right

  key  right_value
0   A            3
1   B            4
2   D         51,3

pd.merge(left, right, how='inner', left_on=['key'], right_on=['key'])

  key  left_value  right_value
0   A           1            3
1   B           2            4

pd.merge(left, right, how='outer', left_on=['key'], right_on=['key'])

  key  left_value  right_value
0   A         1.0            3
1   B         2.0            4
2   C         3.0          NaN
3   D         NaN         51,3

pd.merge(left, right, how='right', left_on=['key'], right_on=['key'])

  key  left_value  right_value
0   A         1.0            3
1   B         2.0            4
2   D         NaN         51,3

pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])

  key  left_value  right_value
0   A           1            3
1   B           2            4
2   C           3          NaN
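To see which side each row of a merge came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses frames modelled on the ones above, but with a purely numeric `right_value` column, which is an assumption made for this example. It also uses `on='key'`, the shorthand for passing the same column name to both `left_on` and `right_on`.

```python
import pandas as pd

# Modelled on the frames above; the numeric right_value is an assumption.
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, 5]})

# Outer merge keeps every key; _merge records the origin of each row.
merged = pd.merge(left, right, how='outer', on='key', indicator=True)
print(merged)

# Map each key to its origin: 'both', 'left_only' or 'right_only'.
origin = merged.set_index('key')['_merge'].astype(str).to_dict()
print(origin)
```

Filtering on `_merge` is a convenient way to find keys that failed to match before deciding between an inner, left, right or outer merge.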

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3], 'other_key': ['X','Y','Z']})
left

  key  left_value  other_key
0   A           1          X
1   B           2          Y
2   C           3          Z

right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3,'53,2', 5], 'some_key': ['W','Y', 'Z']})
right

  key  right_value  some_key
0   A            3         W
1   B         53,2         Y
2   D            5         Z

pd.merge(left, right, how='inner', left_on=['key', 'other_key'], right_on=['key', 'some_key' ])

  key  left_value  other_key  right_value  some_key
0   B           2          Y         53,2         Y

df_03 = pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
df_03

  key  left_value  other_key  right_value  some_key
0   A           1          X            3         W
1   B           2          Y         53,2         Y
2   C           3          Z          NaN       NaN

Join

A pandas DataFrame also has a join method for merging by index. However, the two frames may not have overlapping column names unless you pass lsuffix and/or rsuffix.
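The restriction on overlapping columns can be sketched as follows. The frames and the suffix strings are hypothetical; `lsuffix` and `rsuffix` are parameters of `DataFrame.join`.

```python
import pandas as pd

# Hypothetical frames whose column names overlap completely.
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})

# Without suffixes, join refuses to combine frames with overlapping columns.
try:
    left.join(right)
    overlap_error = None
except ValueError as err:
    overlap_error = str(err)
print(overlap_error)

# With suffixes, the clashing names are made unique and the join succeeds.
joined = left.join(right, lsuffix='_left', rsuffix='_right')
print(joined.columns.tolist())
```

Note that this join still aligns rows on the default integer index, not on the `key` column; set `key` as the index first if the rows should be matched by key.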

left

  key  left_value  other_key
0   A           1          X
1   B           2          Y
2   C           3          Z

right

  key  right_value  some_key
0   A            3         W
1   B         53,2         Y
2   D            5         Z

left.set_index('key').join(right.set_index('key'), how='outer')

     left_value  other_key  right_value  some_key
key
A           1.0          X            3         W
B           2.0          Y         53,2         Y
C           3.0          Z          NaN       NaN
D           NaN        NaN            5         Z

right = right.rename(columns = {'key': 'name'})
right

  name  right_value  some_key
0    A            3         W
1    B         53,2         Y
2    D            5         Z

df_04 = left.join(right, how='outer')
df_04

  key  left_value  other_key  name  right_value  some_key
0   A           1          X     A            3         W
1   B           2          Y     B         53,2         Y
2   C           3          Z     D            5         Z

With the on= argument you can match a key column of the calling DataFrame against the index of the other DataFrame. For example:

left1 = pd.DataFrame({'key': ['a','b','a','a','b','c'], 'value': range(6)})
left1

  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

right1 = pd.DataFrame({'group_val': [3.5,7]}, index = ['a','b'])
right1

   group_val
a        3.5
b        7.0

df_05 = left1.join(right1, on='key')
df_05

  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN
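The same column-to-index match can also be written with `pd.merge`, as the sketch below suggests. It rebuilds the two frames from this example and compares `join(..., on='key')` with a left merge that uses `left_on='key'` and `right_index=True`. The variable names `via_join` and `via_merge` are introduced here for illustration.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# join: match the 'key' column of left1 against the index of right1.
via_join = left1.join(right1, on='key')

# The same match spelled out with merge; join defaults to how='left'.
via_merge = pd.merge(left1, right1, left_on='key', right_index=True, how='left')

print(via_join.equals(via_merge))
```

The explicit `how='left'` matters: merge defaults to an inner join, which would drop the row with key `c` because it has no match in `right1`.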

Merging on indexes is also possible with merge:

left1 = left1.set_index('key')
left1

     value
key
a        0
b        1
a        2
a        3
b        4
c        5

df_06 = pd.merge(left1, right1, how = 'outer', left_index=True, right_index=True)
df_06

   value  group_val
a      0        3.5
a      2        3.5
a      3        3.5
b      1        7.0
b      4        7.0
c      5        NaN
