# Combine data

For combining data we can use the methods `concat`, `merge` and `join`. Sometimes all three methods can be applied to get the same end result, it depends on your data and your preference

```python
pd.concat([df_01, df_02])
pd.merge(left, right)
left.join(right)
```

## Concat

with concat we can combine data. This is especcially handy in combining arrays. By default, the concat works with `axis=0`, e.g the row concatenation. It glues the rows of one dataframe or array to another

```python
df_01 = pd.DataFrame(np.random.randn(3, 5))
df_01
```

|   | 0         | 1         | 2         | 3         | 4         |
| - | --------- | --------- | --------- | --------- | --------- |
| 0 | 0.799711  | 0.546198  | -1.499229 | 0.030387  | 2.282779  |
| 1 | 0.341611  | 1.417359  | -2.053733 | 0.687576  | -0.275441 |
| 2 | -0.603197 | -0.981207 | -1.839125 | -1.083722 | -1.129557 |

```python
df_02 = pd.DataFrame(np.random.randn(3,5))
df_02
```

|   | 0         | 1        | 2         | 3         | 4         |
| - | --------- | -------- | --------- | --------- | --------- |
| 0 | -1.058341 | 2.681832 | 0.752952  | -0.348513 | -0.289674 |
| 1 | 1.101420  | 1.136794 | -0.515453 | -0.106401 | 1.208821  |
| 2 | -0.295428 | 1.292897 | -0.584524 | -2.071780 | -1.573926 |

```python
df_03 = pd.concat([df_01, df_02])
df_03
```

|   | 0         | 1         | 2         | 3         | 4         |
| - | --------- | --------- | --------- | --------- | --------- |
| 0 | 0.799711  | 0.546198  | -1.499229 | 0.030387  | 2.282779  |
| 1 | 0.341611  | 1.417359  | -2.053733 | 0.687576  | -0.275441 |
| 2 | -0.603197 | -0.981207 | -1.839125 | -1.083722 | -1.129557 |
| 0 | -1.058341 | 2.681832  | 0.752952  | -0.348513 | -0.289674 |
| 1 | 1.101420  | 1.136794  | -0.515453 | -0.106401 | 1.208821  |
| 2 | -0.295428 | 1.292897  | -0.584524 | -2.071780 | -1.573926 |

if you pass `axis=1` it will glue the dataframes in the column direction

```python
df_03 = pd.concat([df_01, df_02], axis=1)
df_03
```

\
&#x20;  &#x20;

| 0 | 1         | 2         | 3         | 4         | 0         | 1         | 2        | 3         | 4         |           |
| - | --------- | --------- | --------- | --------- | --------- | --------- | -------- | --------- | --------- | --------- |
| 0 | 0.799711  | 0.546198  | -1.499229 | 0.030387  | 2.282779  | -1.058341 | 2.681832 | 0.752952  | -0.348513 | -0.289674 |
| 1 | 0.341611  | 1.417359  | -2.053733 | 0.687576  | -0.275441 | 1.101420  | 1.136794 | -0.515453 | -0.106401 | 1.208821  |
| 2 | -0.603197 | -0.981207 | -1.839125 | -1.083722 | -1.129557 | -0.295428 | 1.292897 | -0.584524 | -2.071780 | -1.573926 |

\
&#x20;  &#x20;

## Merge

In the JSON study case we saw that we had to combine data sources. Before we look into the ways how we merge data we should consider the difference between outer, inner, left and right joins. The following picture shows the differences:

![source: educba.com](/files/-MK0ButhrZ0tOdikYNcT)

You can also perform a SQL-style join using the .merge() function:

```python
pd.merge?
```

```python
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
left
```

\
&#x20;  &#x20;

|   | key | left\_value |
| - | --- | ----------- |
| 0 | A   | 1           |
| 1 | B   | 2           |
| 2 | C   | 3           |

```python
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, '51,3']})
right
```

|   | key | right\_value |
| - | --- | ------------ |
| 0 | A   | 3            |
| 1 | B   | 4            |
| 2 | D   | 51,3         |

```python
pd.merge(left, right, how='inner', left_on=['key'], right_on=['key'])
```

<br>

|   | key | left\_value | right\_value |
| - | --- | ----------- | ------------ |
| 0 | A   | 1           | 3            |
| 1 | B   | 2           | 4            |

```python
pd.merge(left, right, how='outer', left_on=['key'], right_on=['key'])
```

|   | key | left\_value | right\_value |
| - | --- | ----------- | ------------ |
| 0 | A   | 1.0         | 3            |
| 1 | B   | 2.0         | 4            |
| 2 | C   | 3.0         | NaN          |
| 3 | D   | NaN         | 51,3         |

```python
pd.merge(left, right, how='right', left_on=['key'], right_on=['key'])
```

<br>

|   | key | left\_value | right\_value |
| - | --- | ----------- | ------------ |
| 0 | A   | 1.0         | 3            |
| 1 | B   | 2.0         | 4            |
| 2 | D   | NaN         | 51,3         |

```python
pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
```

\
&#x20;  <br>

|   | key | left\_value | right\_value |
| - | --- | ----------- | ------------ |
| 0 | A   | 1           | 3            |
| 1 | B   | 2           | 4            |
| 2 | C   | 3           | NaN          |

```python
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3], 'other_key': ['X','Y','Z']})
left
```

<br>

|   | key | left\_value | other\_key |
| - | --- | ----------- | ---------- |
| 0 | A   | 1           | X          |
| 1 | B   | 2           | Y          |
| 2 | C   | 3           | Z          |

```python
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3,'53,2', 5], 'some_key': ['W','Y', 'Z']})
right
```

|   | key | right\_value | some\_key |
| - | --- | ------------ | --------- |
| 0 | A   | 3            | W         |
| 1 | B   | 53,2         | Y         |
| 2 | D   | 5            | Z         |

```python
pd.merge(left, right, how='inner', left_on=['key', 'other_key'], right_on=['key', 'some_key' ])
```

\
&#x20;   <br>

|   | key | left\_value | other\_key | right\_value | some\_key |
| - | --- | ----------- | ---------- | ------------ | --------- |
| 0 | B   | 2           | Y          | 53,2         | Y         |

```python
df_03 = pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
df_03
```

\
&#x20; &#x20;

|   | key | left\_value | other\_key | right\_value | some\_key |
| - | --- | ----------- | ---------- | ------------ | --------- |
| 0 | A   | 1           | X          | 3            | W         |
| 1 | B   | 2           | Y          | 53,2         | Y         |
| 2 | C   | 3           | Z          | NaN          | NaN       |

## Join

Pandas DataFrame has also a join function for merging by index. However overlapping columns cannot exist.

```python
left
```

|   | key | left\_value | other\_key |
| - | --- | ----------- | ---------- |
| 0 | A   | 1           | X          |
| 1 | B   | 2           | Y          |
| 2 | C   | 3           | Z          |

```python
right
```

|   | key | right\_value | some\_key |
| - | --- | ------------ | --------- |
| 0 | A   | 3            | W         |
| 1 | B   | 53,2         | Y         |
| 2 | D   | 5            | Z         |

```python
left.set_index('key').join(right.set_index('key'), how='outer')
```

\
&#x20; &#x20;

|     | left\_value | other\_key | right\_value | some\_key |
| --- | ----------- | ---------- | ------------ | --------- |
| key |             |            |              |           |
| A   | 1.0         | X          | 3            | W         |
| B   | 2.0         | Y          | 53,2         | Y         |
| C   | 3.0         | Z          | NaN          | NaN       |
| D   | NaN         | NaN        | 5            | Z         |

```python
right = right.rename(columns = {'key': 'name'})
right
```

\
&#x20;  &#x20;

|   | name | right\_value | some\_key |
| - | ---- | ------------ | --------- |
| 0 | A    | 3            | W         |
| 1 | B    | 53,2         | Y         |
| 2 | D    | 5            | Z         |

```python
df_04 = left.join(right, how='outer')
df_04
```

\
&#x20; <br>

|   | key | left\_value | other\_key | name | right\_value | some\_key |
| - | --- | ----------- | ---------- | ---- | ------------ | --------- |
| 0 | A   | 1           | X          | A    | 3            | W         |
| 1 | B   | 2           | Y          | B    | 53,2         | Y         |
| 2 | C   | 3           | Z          | D    | 5            | Z         |

With the `on=` argument you can match indexes with keys. For example:

```python
left1 = pd.DataFrame({'key': ['a','b','a','a','b','c'], 'value': range(6)})
left1
```

<br>

|   | key | value |
| - | --- | ----- |
| 0 | a   | 0     |
| 1 | b   | 1     |
| 2 | a   | 2     |
| 3 | a   | 3     |
| 4 | b   | 4     |
| 5 | c   | 5     |

```python
right1 = pd.DataFrame({'group_val': [3.5,7]}, index = ['a','b'])
right1
```

\
&#x20; <br>

|   | group\_val |
| - | ---------- |
| a | 3.5        |
| b | 7.0        |

```python
df_05 = left1.join(right1, on='key')
df_05
```

\
&#x20; &#x20;

|   | key | value | group\_val |
| - | --- | ----- | ---------- |
| 0 | a   | 0     | 3.5        |
| 1 | b   | 1     | 7.0        |
| 2 | a   | 2     | 3.5        |
| 3 | a   | 3     | 3.5        |
| 4 | b   | 4     | 7.0        |
| 5 | c   | 5     | NaN        |

Merging on indexes with `merge` is also possible

```python
left1 = left1.set_index('key')
left1
```

\
&#x20; &#x20;

|     | value |
| --- | ----- |
| key |       |
| a   | 0     |
| b   | 1     |
| a   | 2     |
| a   | 3     |
| b   | 4     |
| c   | 5     |

```python
df_06 = pd.merge(left1, right1, how = 'outer', left_index=True, right_index=True)
df_06
```

\
&#x20; &#x20;

|   | value | group\_val |
| - | ----- | ---------- |
| a | 0     | 3.5        |
| a | 2     | 3.5        |
| a | 3     | 3.5        |
| b | 1     | 7.0        |
| b | 4     | 7.0        |
| c | 5     | NaN        |


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://fennaf.gitbook.io/bfvm19prog1/data-wrangling/combine-data.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
