Module 4a: Creating a pandas dataframe

In this section you will learn how to:

  • create a pandas dataframe,

  • assign labels to columns and rows.

# pd is a common abbreviation for pandas when importing the library
import pandas as pd
import numpy as np

Creating a pandas dataframe

A pandas dataframe is a data structure designed for storing and manipulating tabular data. It can be created from a Python dictionary. As an example, let us first recreate the dictionary dataset from module 2:

participant1_gender = 'female'
participant1_age = 25
participant1_weight = 70.9
participant1_height = 175
participant1_smoking = False
participant1_diseases = ('Asthma', 'Diabetes')

participant1_information = [participant1_gender, participant1_age, participant1_height, participant1_weight, participant1_smoking, participant1_diseases]

participant2_information = ['male', 40, 1.79, 103.4, True, ()]

participant3_information = ['male', 18, 1.75, 85.1, True, ('Lung cancer',)]
participant4_information = ['female', 83, 1.63, 55.9, False, ('Cardio vascular disease', 'Alzheimers')]
participant5_information = ['female', 55, 1.68, 50.0, False, ('Asthma', 'Anxiety')]
participant6_information = ['female', 32, 1.59, 64.0, True, ('Diabetes',)]
participant7_information = ['male', 21, 1.90, 92.9, False, ('Asthma', 'Colon cancer')]
participant8_information = ['male', 46, 1.71, 75.4, False, ()]
participant9_information = ['female', 32, 1.66, 90.7, True, ('Depression',)]
participant10_information = ['male', 67, 1.78, 82.3, False, ('Anxiety', 'Diabetes', 'Cardio vascular disease')]

participants_clinical_information = {
    'participant1': participant1_information,
    'participant2': participant2_information,
    'participant3': participant3_information,
    'participant4': participant4_information,
    'participant5': participant5_information,
    'participant6': participant6_information,
    'participant7': participant7_information,
    'participant8': participant8_information,
    'participant9': participant9_information,
    'participant10': participant10_information
}

pandas can build a dataframe from various data structures, such as NumPy arrays, Python lists or dictionaries. Keep in mind that Python names are case-sensitive, so the dataframe constructor must be written exactly as pd.DataFrame, with an upper-case D and F.
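As a minimal sketch of the input types mentioned above, the example below builds small dataframes from a list of lists, a NumPy array and a dictionary. The toy values and the names rows, column_names and numeric_array are made up for illustration and are not part of the participant dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two rows, three columns
rows = [["female", 25, 70.9],
        ["male", 40, 103.4]]
column_names = ["gender", "age", "weight"]

# From a list of lists: each inner list becomes one row
df_from_lists = pd.DataFrame(rows, columns=column_names)

# From a 2-D NumPy array: each array row becomes one dataframe row
numeric_array = np.array([[25, 70.9],
                          [40, 103.4]])
df_from_array = pd.DataFrame(numeric_array, columns=["age", "weight"])

# From a dictionary of lists: each key becomes a column label
df_from_dict = pd.DataFrame({"age": [25, 40], "weight": [70.9, 103.4]})

print(df_from_lists.shape)
print(df_from_array.shape)
print(df_from_dict.shape)
```

Note the difference in orientation: lists and array rows are read row by row, while dictionary keys are read as columns.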

# make a Pandas dataframe out of the dictionary
dict_df = pd.DataFrame(participants_clinical_information) # alternatively the input could also be a NumPy array
dict_df # mentioning a variable at the end of the cell makes it appear in the output
participant1 participant2 participant3 participant4 participant5 participant6 participant7 participant8 participant9 participant10
0 female male male female female female male male female male
1 25 40 18 83 55 32 21 46 32 67
2 175 1.79 1.75 1.63 1.68 1.59 1.9 1.71 1.66 1.78
3 70.9 103.4 85.1 55.9 50.0 64.0 92.9 75.4 90.7 82.3
4 False True True False False True False False True False
5 (Asthma, Diabetes) () (Lung cancer,) (Cardio vascular disease, Alzheimers) (Asthma, Anxiety) (Diabetes,) (Asthma, Colon cancer) () (Depression,) (Anxiety, Diabetes, Cardio vascular disease)

The dataframe is not oriented the way we want. By default, the dictionary keys (“participant1”, etc.) become column headers when a dictionary is loaded into a pandas dataframe, so each participant ends up in a column rather than in a row. Let us transpose the table:

# transposing the dataframe
transposed_dict_df = dict_df.T
transposed_dict_df
               0       1    2       3       4        5
participant1   female  25   175     70.9    False    (Asthma, Diabetes)
participant2   male    40   1.79    103.4   True     ()
participant3   male    18   1.75    85.1    True     (Lung cancer,)
participant4   female  83   1.63    55.9    False    (Cardio vascular disease, Alzheimers)
participant5   female  55   1.68    50.0    False    (Asthma, Anxiety)
participant6   female  32   1.59    64.0    True     (Diabetes,)
participant7   male    21   1.9     92.9    False    (Asthma, Colon cancer)
participant8   male    46   1.71    75.4    False    ()
participant9   female  32   1.66    90.7    True     (Depression,)
participant10  male    67   1.78    82.3    False    (Anxiety, Diabetes, Cardio vascular disease)
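Transposing is not the only way to get one participant per row. As a hedged alternative sketch, pd.DataFrame.from_dict with orient="index" treats each dictionary key as a row label from the start. The shortened dictionary small_dict below is hypothetical and only stands in for the full dataset.

```python
import pandas as pd

# Hypothetical two-participant dictionary, shortened for illustration
small_dict = {
    "participant1": ["female", 25],
    "participant2": ["male", 40],
}

# orient="index" tells pandas to treat each key as a row label
rows_df = pd.DataFrame.from_dict(small_dict, orient="index")

# Building column-wise and transposing gives the same labels and values
transposed_df = pd.DataFrame(small_dict).T

print(rows_df)
print(transposed_df)
```

The two results hold the same labels and values, although the column data types pandas infers may differ between them.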

You can access information about the dataframe shape just like in NumPy:

transposed_dict_df.shape
(10, 6)

Many of the pandas functions described from now on have the form dataframe.method(). To apply a method to a dataframe, write the dataframe's name first, then a dot, then the method name followed by (). Attributes such as .shape and .T, by contrast, are written without parentheses.
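The difference between reading an attribute and calling a method can be sketched as follows. The dataframe example_df is a made-up toy table; head() and transpose() are standard pandas methods chosen here only as illustrations.

```python
import pandas as pd

# Hypothetical toy dataframe
example_df = pd.DataFrame({"age": [25, 40, 18], "weight": [70.9, 103.4, 85.1]})

# Attributes are read without parentheses
n_rows, n_cols = example_df.shape

# Methods are called with parentheses, optionally with arguments
first_two_rows = example_df.head(2)   # first two rows only
transposed = example_df.transpose()   # method equivalent of the .T attribute

print(n_rows, n_cols)
print(first_two_rows)
```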

Naming rows and columns

We successfully swapped rows and columns, but the columns are labelled only with the default numbers 0 to 5, because the participant lists carried no variable names. Let us add proper labels. To do this, we need to create a Python list with column names. The order of the labels needs to match the column order. In transposed_dict_df the participant IDs are not treated as a separate column but as row labels (the index); the absence of a number above them in the header indicates that. So at the moment, the number of labels must match the number of recognised dataframe columns, which is six:

# naming columns
transposed_dict_df.columns = ["gender", "age", "height", "weight", "smoking", "diseases"]
transposed_dict_df
               gender  age  height  weight  smoking  diseases
participant1   female  25   175     70.9    False    (Asthma, Diabetes)
participant2   male    40   1.79    103.4   True     ()
participant3   male    18   1.75    85.1    True     (Lung cancer,)
participant4   female  83   1.63    55.9    False    (Cardio vascular disease, Alzheimers)
participant5   female  55   1.68    50.0    False    (Asthma, Anxiety)
participant6   female  32   1.59    64.0    True     (Diabetes,)
participant7   male    21   1.9     92.9    False    (Asthma, Colon cancer)
participant8   male    46   1.71    75.4    False    ()
participant9   female  32   1.66    90.7    True     (Depression,)
participant10  male    67   1.78    82.3    False    (Anxiety, Diabetes, Cardio vascular disease)
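To illustrate why the number of labels has to match, the sketch below assigns too few labels to a made-up two-column dataframe and catches the resulting error. It also shows rename(), a standard pandas method that changes only selected labels. The names example_df and renamed_df are hypothetical.

```python
import pandas as pd

# Hypothetical toy dataframe with two unnamed columns
example_df = pd.DataFrame([["female", 25], ["male", 40]])

# Too few labels: pandas refuses the assignment with a ValueError
mismatch_message = None
try:
    example_df.columns = ["gender"]
except ValueError as error:
    mismatch_message = str(error)
    print(mismatch_message)

# The correct number of labels is accepted
example_df.columns = ["gender", "age"]

# rename() changes selected labels only and returns a new dataframe
renamed_df = example_df.rename(columns={"age": "age_years"})
print(list(renamed_df.columns))
```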

Naming rows follows the same logic: assign a list or array of labels to transposed_dict_df.index. Note that the new labels replace the participant IDs:

n_rows = transposed_dict_df.shape[0] # get the number of rows from the shape tuple

# here we will simply assign a number to each row
# np.arange is a NumPy function that generates a sequence of numbers (here 0 to n_rows - 1)
transposed_dict_df.index = np.arange(0, n_rows)
transposed_dict_df
   gender  age  height  weight  smoking  diseases
0  female  25   175     70.9    False    (Asthma, Diabetes)
1  male    40   1.79    103.4   True     ()
2  male    18   1.75    85.1    True     (Lung cancer,)
3  female  83   1.63    55.9    False    (Cardio vascular disease, Alzheimers)
4  female  55   1.68    50.0    False    (Asthma, Anxiety)
5  female  32   1.59    64.0    True     (Diabetes,)
6  male    21   1.9     92.9    False    (Asthma, Colon cancer)
7  male    46   1.71    75.4    False    ()
8  female  32   1.66    90.7    True     (Depression,)
9  male    67   1.78    82.3    False    (Anxiety, Diabetes, Cardio vascular disease)
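A possible alternative to assigning np.arange by hand is the reset_index method, sketched below on a made-up two-row dataframe. With drop=True the old row labels are discarded, and without it they are kept as an ordinary column, which may be preferable if the participant IDs are still needed.

```python
import pandas as pd

# Hypothetical toy dataframe with participant IDs as row labels
example_df = pd.DataFrame(
    {"gender": ["female", "male"], "age": [25, 40]},
    index=["participant1", "participant2"],
)

# drop=True discards the old labels and numbers the rows from 0
renumbered_df = example_df.reset_index(drop=True)

# Without drop=True the old labels move into a column named "index"
kept_df = example_df.reset_index()

print(list(renumbered_df.index))
print(list(kept_df.columns))
```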

Printing .columns and .index by themselves lets you view their labels:

print(transposed_dict_df.columns)
print(transposed_dict_df.index)
Index(['gender', 'age', 'height', 'weight', 'smoking', 'diseases'], dtype='object')
Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
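The Index objects printed above can also be converted to plain Python lists or searched for a label. The sketch below uses a made-up dataframe; tolist() and the in operator are standard ways of working with a pandas Index.

```python
import pandas as pd

# Hypothetical toy dataframe
example_df = pd.DataFrame({"gender": ["female", "male"], "age": [25, 40]})

# Convert the labels to ordinary Python lists
column_labels = example_df.columns.tolist()
row_labels = example_df.index.tolist()

# Check whether a column label exists
has_age = "age" in example_df.columns

print(column_labels)
print(row_labels)
print(has_age)
```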

Save the dataframe so that you can work on it in the next section. Setting index=False leaves the row labels out of the file. The method .to_csv will be explained in more detail in module 4d.

transposed_dict_df.to_csv("transposed_dict_df-4a.csv", index=False)
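As a small, hedged illustration of what index=False does, the sketch below writes a made-up dataframe to CSV text in memory rather than to a file and then reads it back. The names example_df, csv_text and reloaded_df are hypothetical, and reading CSV files is only touched on here.

```python
import io
import pandas as pd

# Hypothetical toy dataframe with string row labels and a tuple column
example_df = pd.DataFrame(
    {"age": [25, 40], "diseases": [("Asthma", "Diabetes"), ()]},
    index=["participant1", "participant2"],
)

# With no file path, to_csv returns the CSV text instead of writing a file
csv_text = example_df.to_csv(index=False)

# Read the text back as if it were a file
reloaded_df = pd.read_csv(io.StringIO(csv_text))

print(list(reloaded_df.index))                # row labels were not saved
print(type(reloaded_df.loc[0, "diseases"]))   # tuples come back as text
```

Two things are worth noticing: the row labels are not stored, so the reloaded rows are numbered from 0 again, and the tuples in the diseases column are read back as plain strings.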