Module 4b: Expanding dataframes#
In this section you will learn how:
to add new rows, columns and dataframes to an existing dataframe.
There are different methods to grow your dataframes, depending on what you attach to them. Here we will explore ways to combine pandas rows, columns and entire dataframes.
import pandas as pd
import numpy as np
# load the dataframe from the previous section, method is further explained in module 4d
transposed_dict_df = pd.read_csv("transposed_dict_df-4b.csv", index_col=0)
Adding new rows and columns#
Letās say we want to change our mind and keep the participant IDs as a standalone column, but not as row labels. Adding a new column is very straightforward - itās like selecting a column by label and assigning values to it:
# make sure the number of entries in the new column matches the number of rows in the dataframe
transposed_dict_df["ID"] = transposed_dict_df.index
transposed_dict_df
| gender | age | height | weight | smoking | diseases | ID | |
|---|---|---|---|---|---|---|---|
| participant1 | NaN | NaN | NaN | 70.9 | False | ('Asthma', 'Diabetes') | participant1 |
| participant2 | male | 40.0 | 1.88 | 50.0 | True | () | participant2 |
| participant3 | male | 18.0 | 1.65 | 73.0 | True | ('Lung cancer',) | participant3 |
| participant4 | female | 83.0 | 1.72 | 87.0 | False | ('Cardio vascular disease', 'Alzheimers') | participant4 |
| participant5 | female | 55.0 | 1.68 | 50.0 | False | ('Asthma', 'Anxiety') | participant5 |
| participant6 | NaN | NaN | NaN | 64.0 | True | ('Diabetes',) | participant6 |
| participant7 | male | 21.0 | 1.90 | 92.9 | False | ('Asthma', 'Colon cancer') | participant7 |
| participant8 | NaN | NaN | NaN | 75.4 | False | () | participant8 |
| participant9 | female | 32.0 | 1.66 | 90.7 | True | ('Depression',) | participant9 |
| participant10 | male | 67.0 | 1.78 | 82.3 | False | ('Anxiety', 'Diabetes', 'Cardio vascular disea... | participant10 |
Same analogy applies to appending rows:
transposed_dict_df.loc["participant11"] = ["female", 34, 1.55, 64, False, (), "participant11"]
transposed_dict_df
| gender | age | height | weight | smoking | diseases | ID | |
|---|---|---|---|---|---|---|---|
| participant1 | NaN | NaN | NaN | 70.9 | False | ('Asthma', 'Diabetes') | participant1 |
| participant2 | male | 40.0 | 1.88 | 50.0 | True | () | participant2 |
| participant3 | male | 18.0 | 1.65 | 73.0 | True | ('Lung cancer',) | participant3 |
| participant4 | female | 83.0 | 1.72 | 87.0 | False | ('Cardio vascular disease', 'Alzheimers') | participant4 |
| participant5 | female | 55.0 | 1.68 | 50.0 | False | ('Asthma', 'Anxiety') | participant5 |
| participant6 | NaN | NaN | NaN | 64.0 | True | ('Diabetes',) | participant6 |
| participant7 | male | 21.0 | 1.90 | 92.9 | False | ('Asthma', 'Colon cancer') | participant7 |
| participant8 | NaN | NaN | NaN | 75.4 | False | () | participant8 |
| participant9 | female | 32.0 | 1.66 | 90.7 | True | ('Depression',) | participant9 |
| participant10 | male | 67.0 | 1.78 | 82.3 | False | ('Anxiety', 'Diabetes', 'Cardio vascular disea... | participant10 |
| participant11 | female | 34.0 | 1.55 | 64.0 | False | () | participant11 |
Since we donāt need participant information twice and it would be nice to keep 0-based indexing reference for the rows when visualising the dataframe, letās rename rows to numbers again:
n_rows = transposed_dict_df.shape[0]
transposed_dict_df.index = np.arange(0, n_rows)
transposed_dict_df
| gender | age | height | weight | smoking | diseases | ID | |
|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | 70.9 | False | ('Asthma', 'Diabetes') | participant1 |
| 1 | male | 40.0 | 1.88 | 50.0 | True | () | participant2 |
| 2 | male | 18.0 | 1.65 | 73.0 | True | ('Lung cancer',) | participant3 |
| 3 | female | 83.0 | 1.72 | 87.0 | False | ('Cardio vascular disease', 'Alzheimers') | participant4 |
| 4 | female | 55.0 | 1.68 | 50.0 | False | ('Asthma', 'Anxiety') | participant5 |
| 5 | NaN | NaN | NaN | 64.0 | True | ('Diabetes',) | participant6 |
| 6 | male | 21.0 | 1.90 | 92.9 | False | ('Asthma', 'Colon cancer') | participant7 |
| 7 | NaN | NaN | NaN | 75.4 | False | () | participant8 |
| 8 | female | 32.0 | 1.66 | 90.7 | True | ('Depression',) | participant9 |
| 9 | male | 67.0 | 1.78 | 82.3 | False | ('Anxiety', 'Diabetes', 'Cardio vascular disea... | participant10 |
| 10 | female | 34.0 | 1.55 | 64.0 | False | () | participant11 |
Rearranging columns#
Also visually it would be more consistent to keep the ID column next to the row index at the beginning. You can rearrange all the columns again by listing all of them, but this time in your desired order:
transposed_dict_df = transposed_dict_df[['ID', 'gender', 'age', 'height', 'weight', 'smoking', 'diseases']]
transposed_dict_df
| ID | gender | age | height | weight | smoking | diseases | |
|---|---|---|---|---|---|---|---|
| 0 | participant1 | NaN | NaN | NaN | 70.9 | False | ('Asthma', 'Diabetes') |
| 1 | participant2 | male | 40.0 | 1.88 | 50.0 | True | () |
| 2 | participant3 | male | 18.0 | 1.65 | 73.0 | True | ('Lung cancer',) |
| 3 | participant4 | female | 83.0 | 1.72 | 87.0 | False | ('Cardio vascular disease', 'Alzheimers') |
| 4 | participant5 | female | 55.0 | 1.68 | 50.0 | False | ('Asthma', 'Anxiety') |
| 5 | participant6 | NaN | NaN | NaN | 64.0 | True | ('Diabetes',) |
| 6 | participant7 | male | 21.0 | 1.90 | 92.9 | False | ('Asthma', 'Colon cancer') |
| 7 | participant8 | NaN | NaN | NaN | 75.4 | False | () |
| 8 | participant9 | female | 32.0 | 1.66 | 90.7 | True | ('Depression',) |
| 9 | participant10 | male | 67.0 | 1.78 | 82.3 | False | ('Anxiety', 'Diabetes', 'Cardio vascular disea... |
| 10 | participant11 | female | 34.0 | 1.55 | 64.0 | False | () |
Joining pandas dataframes#
What if you have another dictionary/pandas dataframe with more participant data and you would like to combine them with this table? Or you would like to append another row, but you cannot call .loc anymore because your rows donāt have string labels anymore? Then consider pd.concat() or dataframe.merge() depending on your use case.
Columns#
Letās create an example column-based additional dataset:
additional_col_df = {
"exercise_frequency": ["low", "medium", "high", "medium", "low", "high", "medium", "low", "high", "medium", "low"],
"alcohol_consumption": [np.nan, 0, 5, 1, 3, 0, 4, 2, 1, 0, 2]}
additional_col_df = pd.DataFrame(additional_col_df)
additional_col_df
| exercise_frequency | alcohol_consumption | |
|---|---|---|
| 0 | low | NaN |
| 1 | medium | 0.0 |
| 2 | high | 5.0 |
| 3 | medium | 1.0 |
| 4 | low | 3.0 |
| 5 | high | 0.0 |
| 6 | medium | 4.0 |
| 7 | low | 2.0 |
| 8 | high | 1.0 |
| 9 | medium | 0.0 |
| 10 | low | 2.0 |
If you simply wish to concatenate two dataframes along the columns, you can just add them in a list to pd.concat and specify the column axis:
concatenated_df = pd.concat([transposed_dict_df, additional_col_df], axis=1)
concatenated_df
| ID | gender | age | height | weight | smoking | diseases | exercise_frequency | alcohol_consumption | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | participant1 | NaN | NaN | NaN | 70.9 | False | ('Asthma', 'Diabetes') | low | NaN |
| 1 | participant2 | male | 40.0 | 1.88 | 50.0 | True | () | medium | 0.0 |
| 2 | participant3 | male | 18.0 | 1.65 | 73.0 | True | ('Lung cancer',) | high | 5.0 |
| 3 | participant4 | female | 83.0 | 1.72 | 87.0 | False | ('Cardio vascular disease', 'Alzheimers') | medium | 1.0 |
| 4 | participant5 | female | 55.0 | 1.68 | 50.0 | False | ('Asthma', 'Anxiety') | low | 3.0 |
| 5 | participant6 | NaN | NaN | NaN | 64.0 | True | ('Diabetes',) | high | 0.0 |
| 6 | participant7 | male | 21.0 | 1.90 | 92.9 | False | ('Asthma', 'Colon cancer') | medium | 4.0 |
| 7 | participant8 | NaN | NaN | NaN | 75.4 | False | () | low | 2.0 |
| 8 | participant9 | female | 32.0 | 1.66 | 90.7 | True | ('Depression',) | high | 1.0 |
| 9 | participant10 | male | 67.0 | 1.78 | 82.3 | False | ('Anxiety', 'Diabetes', 'Cardio vascular disea... | medium | 0.0 |
| 10 | participant11 | female | 34.0 | 1.55 | 64.0 | False | () | low | 2.0 |
What if your additional table included participant IDs in a different order to the first table and also a different name? Thatās when transposed_dict_df.merge comes in handy - it can match the rows according to the desired label. In contrast to pd.concat, this method can only be used for the column-aligned concatenation.
# adding IDs to the additional dataframe
additional_col_df["participant_name"] = ['participant6', 'participant1', 'participant2', 'participant3',
'participant4', 'participant5', 'participant7', 'participant8',
'participant9', 'participant10', 'participant11']
additional_col_df
| exercise_frequency | alcohol_consumption | participant_name | |
|---|---|---|---|
| 0 | low | NaN | participant6 |
| 1 | medium | 0.0 | participant1 |
| 2 | high | 5.0 | participant2 |
| 3 | medium | 1.0 | participant3 |
| 4 | low | 3.0 | participant4 |
| 5 | high | 0.0 | participant5 |
| 6 | medium | 4.0 | participant7 |
| 7 | low | 2.0 | participant8 |
| 8 | high | 1.0 | participant9 |
| 9 | medium | 0.0 | participant10 |
| 10 | low | 2.0 | participant11 |
Argument left_on refers to the column label in transposed_dict_df that we want to match with the column in additional_col_df specified in right_on:
merged_df = transposed_dict_df.merge(additional_col_df, left_on="ID", right_on="participant_name")
merged_df
| ID | gender | age | height | weight | smoking | diseases | exercise_frequency | alcohol_consumption | participant_name | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | participant1 | NaN | NaN | NaN | 70.9 | False | ('Asthma', 'Diabetes') | medium | 0.0 | participant1 |
| 1 | participant2 | male | 40.0 | 1.88 | 50.0 | True | () | high | 5.0 | participant2 |
| 2 | participant3 | male | 18.0 | 1.65 | 73.0 | True | ('Lung cancer',) | medium | 1.0 | participant3 |
| 3 | participant4 | female | 83.0 | 1.72 | 87.0 | False | ('Cardio vascular disease', 'Alzheimers') | low | 3.0 | participant4 |
| 4 | participant5 | female | 55.0 | 1.68 | 50.0 | False | ('Asthma', 'Anxiety') | high | 0.0 | participant5 |
| 5 | participant6 | NaN | NaN | NaN | 64.0 | True | ('Diabetes',) | low | NaN | participant6 |
| 6 | participant7 | male | 21.0 | 1.90 | 92.9 | False | ('Asthma', 'Colon cancer') | medium | 4.0 | participant7 |
| 7 | participant8 | NaN | NaN | NaN | 75.4 | False | () | low | 2.0 | participant8 |
| 8 | participant9 | female | 32.0 | 1.66 | 90.7 | True | ('Depression',) | high | 1.0 | participant9 |
| 9 | participant10 | male | 67.0 | 1.78 | 82.3 | False | ('Anxiety', 'Diabetes', 'Cardio vascular disea... | medium | 0.0 | participant10 |
| 10 | participant11 | female | 34.0 | 1.55 | 64.0 | False | () | low | 2.0 | participant11 |
Here is a visual summary of the differences between pd.concat and dataframe.merge for merging dataframes by columns:
Rows#
Combining dataframes along the rows is very similar to pd.concat. Just specify the rows axis. It is also sometimes useful to include ignore_index=True to preserve continuous row numbering if you donāt keep row labels.
# creating a row-based table for the example
additional_rows_df = {"ID": ["participant17", "participant13", "participant14"], "gender": ["male", "female", "male"], "age": [70, 45, 29], "height": [1.82, 1.60, 1.75], "weight": [78.5, 62.3, 88.0], "smoking": [False, True, False], "diseases": [("Anxiety",), ("Diabetes", "Hypertension"), ()]}
additional_rows_df = pd.DataFrame(additional_rows_df)
additional_rows_df
| ID | gender | age | height | weight | smoking | diseases | |
|---|---|---|---|---|---|---|---|
| 0 | participant17 | male | 70 | 1.82 | 78.5 | False | (Anxiety,) |
| 1 | participant13 | female | 45 | 1.60 | 62.3 | True | (Diabetes, Hypertension) |
| 2 | participant14 | male | 29 | 1.75 | 88.0 | False | () |
# ignore index to have continuous rows numbering
concatenated_df = pd.concat([transposed_dict_df, additional_rows_df], axis=0, ignore_index=True)
concatenated_df
| ID | gender | age | height | weight | smoking | diseases | |
|---|---|---|---|---|---|---|---|
| 0 | participant1 | NaN | NaN | NaN | 70.9 | False | ('Asthma', 'Diabetes') |
| 1 | participant2 | male | 40.0 | 1.88 | 50.0 | True | () |
| 2 | participant3 | male | 18.0 | 1.65 | 73.0 | True | ('Lung cancer',) |
| 3 | participant4 | female | 83.0 | 1.72 | 87.0 | False | ('Cardio vascular disease', 'Alzheimers') |
| 4 | participant5 | female | 55.0 | 1.68 | 50.0 | False | ('Asthma', 'Anxiety') |
| 5 | participant6 | NaN | NaN | NaN | 64.0 | True | ('Diabetes',) |
| 6 | participant7 | male | 21.0 | 1.90 | 92.9 | False | ('Asthma', 'Colon cancer') |
| 7 | participant8 | NaN | NaN | NaN | 75.4 | False | () |
| 8 | participant9 | female | 32.0 | 1.66 | 90.7 | True | ('Depression',) |
| 9 | participant10 | male | 67.0 | 1.78 | 82.3 | False | ('Anxiety', 'Diabetes', 'Cardio vascular disea... |
| 10 | participant11 | female | 34.0 | 1.55 | 64.0 | False | () |
| 11 | participant17 | male | 70.0 | 1.82 | 78.5 | False | (Anxiety,) |
| 12 | participant13 | female | 45.0 | 1.60 | 62.3 | True | (Diabetes, Hypertension) |
| 13 | participant14 | male | 29.0 | 1.75 | 88.0 | False | () |
Save the dataframe to work on it in the next section. The method .to_csv will be explained in more detail in module 4d.
transposed_dict_df.to_csv("transposed_dict_df-4c.csv", index=False)
Here we covered the basics of merging the dataframes with pd.concat and dafaframe.merge. However they have many more functionalities, such as taking the union or intersection (join="outer"/join="inner") of column/row labels in pd.concat, which you can learn more about here.