Combine data

Concat, merge or join

For combining data we can use the methods concat, merge and join. Sometimes all three methods can be applied to get the same end result, it depends on your data and your preference

pd.concat([df_01, df_02])
pd.merge(left, right)
left.join(right)

Concat

with concat we can combine data. This is especcially handy in combining arrays. By default, the concat works with axis=0, e.g the row concatenation. It glues the rows of one dataframe or array to another

df_01 = pd.DataFrame(np.random.randn(3, 5))
df_01

0

1

2

3

4

0

0.799711

0.546198

-1.499229

0.030387

2.282779

1

0.341611

1.417359

-2.053733

0.687576

-0.275441

2

-0.603197

-0.981207

-1.839125

-1.083722

-1.129557

df_02 = pd.DataFrame(np.random.randn(3,5))
df_02

0

1

2

3

4

0

-1.058341

2.681832

0.752952

-0.348513

-0.289674

1

1.101420

1.136794

-0.515453

-0.106401

1.208821

2

-0.295428

1.292897

-0.584524

-2.071780

-1.573926

df_03 = pd.concat([df_01, df_02])
df_03

0

1

2

3

4

0

0.799711

0.546198

-1.499229

0.030387

2.282779

1

0.341611

1.417359

-2.053733

0.687576

-0.275441

2

-0.603197

-0.981207

-1.839125

-1.083722

-1.129557

0

-1.058341

2.681832

0.752952

-0.348513

-0.289674

1

1.101420

1.136794

-0.515453

-0.106401

1.208821

2

-0.295428

1.292897

-0.584524

-2.071780

-1.573926

if you pass axis=1 it will glue the dataframes in the column direction

df_03 = pd.concat([df_01, df_02], axis=1)
df_03

0

1

2

3

4

0

1

2

3

4

0

0.799711

0.546198

-1.499229

0.030387

2.282779

-1.058341

2.681832

0.752952

-0.348513

-0.289674

1

0.341611

1.417359

-2.053733

0.687576

-0.275441

1.101420

1.136794

-0.515453

-0.106401

1.208821

2

-0.603197

-0.981207

-1.839125

-1.083722

-1.129557

-0.295428

1.292897

-0.584524

-2.071780

-1.573926

Merge

In the JSON study case we saw that we had to combine data sources. Before we look into the ways how we merge data we should consider the difference between outer, inner, left and right joins. The following picture shows the differences:

source: educba.com

You can also perform a SQL-style join using the .merge() function:

pd.merge?
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
left

key

left_value

0

A

1

1

B

2

2

C

3

right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, '51,3']})
right

key

right_value

0

A

3

1

B

4

2

D

51,3

pd.merge(left, right, how='inner', left_on=['key'], right_on=['key'])

key

left_value

right_value

0

A

1

3

1

B

2

4

pd.merge(left, right, how='outer', left_on=['key'], right_on=['key'])

key

left_value

right_value

0

A

1.0

3

1

B

2.0

4

2

C

3.0

NaN

3

D

NaN

51,3

pd.merge(left, right, how='right', left_on=['key'], right_on=['key'])

key

left_value

right_value

0

A

1.0

3

1

B

2.0

4

2

D

NaN

51,3

pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])

key

left_value

right_value

0

A

1

3

1

B

2

4

2

C

3

NaN

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3], 'other_key': ['X','Y','Z']})
left

key

left_value

other_key

0

A

1

X

1

B

2

Y

2

C

3

Z

right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3,'53,2', 5], 'some_key': ['W','Y', 'Z']})
right

key

right_value

some_key

0

A

3

W

1

B

53,2

Y

2

D

5

Z

pd.merge(left, right, how='inner', left_on=['key', 'other_key'], right_on=['key', 'some_key' ])

key

left_value

other_key

right_value

some_key

0

B

2

Y

53,2

Y

df_03 = pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
df_03

key

left_value

other_key

right_value

some_key

0

A

1

X

3

W

1

B

2

Y

53,2

Y

2

C

3

Z

NaN

NaN

Join

Pandas DataFrame has also a join function for merging by index. However overlapping columns cannot exist.

left

key

left_value

other_key

0

A

1

X

1

B

2

Y

2

C

3

Z

right

key

right_value

some_key

0

A

3

W

1

B

53,2

Y

2

D

5

Z

left.set_index('key').join(right.set_index('key'), how='outer')

left_value

other_key

right_value

some_key

key

A

1.0

X

3

W

B

2.0

Y

53,2

Y

C

3.0

Z

NaN

NaN

D

NaN

NaN

5

Z

right = right.rename(columns = {'key': 'name'})
right

name

right_value

some_key

0

A

3

W

1

B

53,2

Y

2

D

5

Z

df_04 = left.join(right, how='outer')
df_04

key

left_value

other_key

name

right_value

some_key

0

A

1

X

A

3

W

1

B

2

Y

B

53,2

Y

2

C

3

Z

D

5

Z

With the on= argument you can match indexes with keys. For example:

left1 = pd.DataFrame({'key': ['a','b','a','a','b','c'], 'value': range(6)})
left1

key

value

0

a

0

1

b

1

2

a

2

3

a

3

4

b

4

5

c

5

right1 = pd.DataFrame({'group_val': [3.5,7]}, index = ['a','b'])
right1

group_val

a

3.5

b

7.0

df_05 = left1.join(right1, on='key')
df_05

key

value

group_val

0

a

0

3.5

1

b

1

7.0

2

a

2

3.5

3

a

3

3.5

4

b

4

7.0

5

c

5

NaN

Merging on indexes with merge is also possible

left1 = left1.set_index('key')
left1

value

key

a

0

b

1

a

2

a

3

b

4

c

5

df_06 = pd.merge(left1, right1, how = 'outer', left_index=True, right_index=True)
df_06

value

group_val

a

0

3.5

a

2

3.5

a

3

3.5

b

1

7.0

b

4

7.0

c

5

NaN

Last updated