Module 4a: Creating a pandas dataframe
In this section you will learn how to:
create a pandas dataframe,
assign labels to columns and rows.
# pd is a common abbreviation for pandas when importing the library
import pandas as pd
import numpy as np
Creating a pandas dataframe
A pandas dataframe is a data structure designed for storing and manipulating tabular data. It can be created from a Python dictionary. As an example, let us first recreate the dictionary dataset from module 2:
participant1_gender = 'female'
participant1_age = 25
participant1_weight = 70.9
participant1_height = 175
participant1_smoking = False
participant1_diseases = ('Asthma', 'Diabetes')
participant1_information = [participant1_gender, participant1_age, participant1_height, participant1_weight, participant1_smoking, participant1_diseases]
participant2_information = ['male', 40, 1.79, 103.4, True, ()]
participant3_information = ['male', 18, 1.75, 85.1, True, ('Lung cancer',)]
participant4_information = ['female', 83, 1.63, 55.9, False, ('Cardio vascular disease', 'Alzheimers')]
participant5_information = ['female', 55, 1.68, 50.0, False, ('Asthma', 'Anxiety')]
participant6_information = ['female', 32, 1.59, 64.0, True, ('Diabetes',)]
participant7_information = ['male', 21, 1.90, 92.9, False, ('Asthma', 'Colon cancer')]
participant8_information = ['male', 46, 1.71, 75.4, False, ()]
participant9_information = ['female', 32, 1.66, 90.7, True, ('Depression',)]
participant10_information = ['male', 67, 1.78, 82.3, False, ('Anxiety', 'Diabetes', 'Cardio vascular disease')]
participants_clinical_information = {
'participant1': participant1_information,
'participant2': participant2_information,
'participant3': participant3_information,
'participant4': participant4_information,
'participant5': participant5_information,
'participant6': participant6_information,
'participant7': participant7_information,
'participant8': participant8_information,
'participant9': participant9_information,
'participant10': participant10_information
}
A pandas dataframe can be built from various data structures, such as NumPy arrays, Python lists or dictionaries. Keep in mind that Python names are case-sensitive, so the constructor must be written exactly as pd.DataFrame, with an upper-case D and F.
# make a Pandas dataframe out of the dictionary
dict_df = pd.DataFrame(participants_clinical_information) # alternatively the input could also be a NumPy array
dict_df # mentioning a variable at the end of the cell makes it appear in the output
| | participant1 | participant2 | participant3 | participant4 | participant5 | participant6 | participant7 | participant8 | participant9 | participant10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | male | male | female | female | female | male | male | female | male |
| 1 | 25 | 40 | 18 | 83 | 55 | 32 | 21 | 46 | 32 | 67 |
| 2 | 175 | 1.79 | 1.75 | 1.63 | 1.68 | 1.59 | 1.9 | 1.71 | 1.66 | 1.78 |
| 3 | 70.9 | 103.4 | 85.1 | 55.9 | 50.0 | 64.0 | 92.9 | 75.4 | 90.7 | 82.3 |
| 4 | False | True | True | False | False | True | False | False | True | False |
| 5 | (Asthma, Diabetes) | () | (Lung cancer,) | (Cardio vascular disease, Alzheimers) | (Asthma, Anxiety) | (Diabetes,) | (Asthma, Colon cancer) | () | (Depression,) | (Anxiety, Diabetes, Cardio vascular disease) |
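As an aside, the other input types mentioned above behave slightly differently from a dictionary. The sketch below uses a small toy dataset (the names rows, list_df and array_df are made up for this illustration; the values are the age and weight of the first two participants) to show that, with a list of lists or a 2D NumPy array, each inner list becomes a row rather than a column:

```python
import numpy as np
import pandas as pd

# toy data for illustration: [age, weight] of the first two participants
rows = [[25, 70.9], [40, 103.4]]

# from a list of lists: each inner list becomes one ROW
list_df = pd.DataFrame(rows)

# from a 2D NumPy array: the array layout is kept, so rows stay rows
array_df = pd.DataFrame(np.array(rows))

print(list_df.shape)   # (2, 2)
print(array_df.shape)  # (2, 2)
```

The remainder of this section continues with dict_df, the dataframe built from the dictionary.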
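The dataframe is not ideally oriented: each participant is a column and each variable is a row. This happens because the dictionary keys (“participant1”, etc.) are treated as column headers by default when a dictionary is loaded into a pandas dataframe. Let us transpose the table so that each participant becomes a row: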
# transposing the dataframe
transposed_dict_df = dict_df.T
transposed_dict_df
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| participant1 | female | 25 | 175 | 70.9 | False | (Asthma, Diabetes) |
| participant2 | male | 40 | 1.79 | 103.4 | True | () |
| participant3 | male | 18 | 1.75 | 85.1 | True | (Lung cancer,) |
| participant4 | female | 83 | 1.63 | 55.9 | False | (Cardio vascular disease, Alzheimers) |
| participant5 | female | 55 | 1.68 | 50.0 | False | (Asthma, Anxiety) |
| participant6 | female | 32 | 1.59 | 64.0 | True | (Diabetes,) |
| participant7 | male | 21 | 1.9 | 92.9 | False | (Asthma, Colon cancer) |
| participant8 | male | 46 | 1.71 | 75.4 | False | () |
| participant9 | female | 32 | 1.66 | 90.7 | True | (Depression,) |
| participant10 | male | 67 | 1.78 | 82.3 | False | (Anxiety, Diabetes, Cardio vascular disease) |
You can access information about the dataframe shape just like in NumPy:
transposed_dict_df.shape
(10, 6)
Many of the pandas functions described from here on have the form dataframe.method(). To apply a method to a dataframe, write the dataframe's name first, then a dot, then the method name followed by ().
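Not everything after the dot is a method, though: .shape and .T, used above, are attributes and are written without parentheses. The following minimal sketch contrasts the two on a made-up dataframe (toy_df and its column names "a" and "b" are hypothetical and not part of the participant data):

```python
import pandas as pd

# hypothetical toy dataframe, not the participant data
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# .shape is an attribute: no parentheses
n_rows, n_cols = toy_df.shape

# .transpose() is a method: the parentheses are required to call it
flipped = toy_df.transpose()

print(n_rows, n_cols)            # 3 2
print(flipped.shape)             # (2, 3)
print(flipped.equals(toy_df.T))  # True: .T is a shortcut for .transpose()
```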
Naming rows and columns
We successfully swapped rows and columns, but the columns are labelled only with the numbers 0 to 5. Let us give them descriptive names. To do this, we need to create a Python list with column names. The order of the labels needs to match the column order. In transposed_dict_df the participant IDs are not treated as a separate column but as row labels; the absence of a number above them in the header indicates that. So at the moment, the number of labels must match the number of recognised dataframe columns, which is six:
# naming columns
transposed_dict_df.columns = ["gender", "age", "height", "weight", "smoking", "diseases"]
transposed_dict_df
| | gender | age | height | weight | smoking | diseases |
|---|---|---|---|---|---|---|
| participant1 | female | 25 | 175 | 70.9 | False | (Asthma, Diabetes) |
| participant2 | male | 40 | 1.79 | 103.4 | True | () |
| participant3 | male | 18 | 1.75 | 85.1 | True | (Lung cancer,) |
| participant4 | female | 83 | 1.63 | 55.9 | False | (Cardio vascular disease, Alzheimers) |
| participant5 | female | 55 | 1.68 | 50.0 | False | (Asthma, Anxiety) |
| participant6 | female | 32 | 1.59 | 64.0 | True | (Diabetes,) |
| participant7 | male | 21 | 1.9 | 92.9 | False | (Asthma, Colon cancer) |
| participant8 | male | 46 | 1.71 | 75.4 | False | () |
| participant9 | female | 32 | 1.66 | 90.7 | True | (Depression,) |
| participant10 | male | 67 | 1.78 | 82.3 | False | (Anxiety, Diabetes, Cardio vascular disease) |
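Transposing and then naming the columns can also be done in a single step. The sketch below is one possible alternative, not the approach used in this module: it relies on pd.DataFrame.from_dict with orient="index", which treats each dictionary key as a row label. Only two participants are copied from the dictionary above to keep the example short, and the names participants, column_names and direct_df are chosen for this illustration:

```python
import pandas as pd

# two participants copied from the dictionary above, shortened for brevity
participants = {
    "participant2": ["male", 40, 1.79, 103.4, True, ()],
    "participant3": ["male", 18, 1.75, 85.1, True, ("Lung cancer",)],
}
column_names = ["gender", "age", "height", "weight", "smoking", "diseases"]

# orient="index" turns each dictionary key into a row label;
# columns= names the columns in the same call
direct_df = pd.DataFrame.from_dict(
    participants, orient="index", columns=column_names
)

print(direct_df.shape)  # (2, 6)
print(direct_df.loc["participant3", "age"])  # 18
```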
Naming rows follows the same logic: assign a list or array of labels to transposed_dict_df.index:
n_rows = transposed_dict_df.shape[0] # get the number of rows from the shape tuple
# here we will simply assign a number to each row
# np.arange is a NumPy function that generates a sequence of numbers
transposed_dict_df.index = np.arange(0, n_rows)
transposed_dict_df
| | gender | age | height | weight | smoking | diseases |
|---|---|---|---|---|---|---|
| 0 | female | 25 | 175 | 70.9 | False | (Asthma, Diabetes) |
| 1 | male | 40 | 1.79 | 103.4 | True | () |
| 2 | male | 18 | 1.75 | 85.1 | True | (Lung cancer,) |
| 3 | female | 83 | 1.63 | 55.9 | False | (Cardio vascular disease, Alzheimers) |
| 4 | female | 55 | 1.68 | 50.0 | False | (Asthma, Anxiety) |
| 5 | female | 32 | 1.59 | 64.0 | True | (Diabetes,) |
| 6 | male | 21 | 1.9 | 92.9 | False | (Asthma, Colon cancer) |
| 7 | male | 46 | 1.71 | 75.4 | False | () |
| 8 | female | 32 | 1.66 | 90.7 | True | (Depression,) |
| 9 | male | 67 | 1.78 | 82.3 | False | (Anxiety, Diabetes, Cardio vascular disease) |
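Both .columns and .index expect exactly one label per column or row. As a rough illustration of what happens otherwise, the sketch below assigns too few row labels to a made-up dataframe (toy_df is hypothetical); pandas refuses the assignment with a ValueError and leaves the dataframe unchanged:

```python
import pandas as pd

# hypothetical toy dataframe with 3 rows and 2 columns
toy_df = pd.DataFrame([[1, 2], [3, 4], [5, 6]])

message = None
try:
    # only 2 labels for 3 rows: pandas raises a ValueError
    toy_df.index = ["a", "b"]
except ValueError as error:
    message = str(error)
    print(message)

# with one label per row the assignment succeeds
toy_df.index = ["a", "b", "c"]
print(list(toy_df.index))  # ['a', 'b', 'c']
```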
Printing .columns and .index by themselves lets you view their labels:
print(transposed_dict_df.columns)
print(transposed_dict_df.index)
Index(['gender', 'age', 'height', 'weight', 'smoking', 'diseases'], dtype='object')
Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
Save the dataframe to work on it in the next section. The method .to_csv will be explained in more detail in module 4d. Passing index=False leaves the row labels out of the file.
transposed_dict_df.to_csv("transposed_dict_df-4a.csv", index=False)
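To see what index=False does without touching the file saved above, the following sketch writes a made-up two-row dataframe to an in-memory text buffer and reads it back. The names toy_df, buffer, csv_text and restored are hypothetical, and pd.read_csv is used here ahead of its introduction later in the course. Note that CSV is plain text, so the tuples in the diseases column come back as strings:

```python
import io
import pandas as pd

# toy data for illustration, taken from the first two participants
toy_df = pd.DataFrame(
    {"age": [25, 40], "diseases": [("Asthma", "Diabetes"), ()]},
    index=["participant1", "participant2"],
)

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)  # row labels are not written
csv_text = buffer.getvalue()
print(csv_text)

# read the text back: rows get a fresh 0, 1, ... index
restored = pd.read_csv(io.StringIO(csv_text))
print(list(restored.index))                  # [0, 1]
print(type(restored.loc[0, "diseases"]))     # <class 'str'>
```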