Module 4a: Creating a pandas dataframe

In this section you will learn how to:

create a pandas dataframe,
assign labels to columns and rows.

# pd is a common abbreviation for pandas when importing the library
import pandas as pd
import numpy as np

Creating a pandas dataframe

A pandas dataframe is a data structure designed for storing and manipulating tabular data. It can be created from a Python dictionary. As an example, let us first recreate the dictionary dataset from module 2:

participant1_gender = 'female'
participant1_age = 25
participant1_weight = 70.9
participant1_height = 175
participant1_smoking = False
participant1_diseases = ('Asthma', 'Diabetes')

participant1_information = [participant1_gender, participant1_age, participant1_height, participant1_weight, participant1_smoking, participant1_diseases]

participant2_information = ['male', 40, 1.79, 103.4, True, ()]

participant3_information = ['male', 18, 1.75, 85.1, True, ('Lung cancer',)]
participant4_information = ['female', 83, 1.63, 55.9, False, ('Cardio vascular disease', 'Alzheimers')]
participant5_information = ['female', 55, 1.68, 50.0, False, ('Asthma', 'Anxiety')]
participant6_information = ['female', 32, 1.59, 64.0, True, ('Diabetes',)]
participant7_information = ['male', 21, 1.90, 92.9, False, ('Asthma', 'Colon cancer')]
participant8_information = ['male', 46, 1.71, 75.4, False, ()]
participant9_information = ['female', 32, 1.66, 90.7, True, ('Depression',)]
participant10_information = ['male', 67, 1.78, 82.3, False, ('Anxiety', 'Diabetes', 'Cardio vascular disease')]

participants_clinical_information = {
    'participant1': participant1_information,
    'participant2': participant2_information,
    'participant3': participant3_information,
    'participant4': participant4_information,
    'participant5': participant5_information,
    'participant6': participant6_information,
    'participant7': participant7_information,
    'participant8': participant8_information,
    'participant9': participant9_information,
    'participant10': participant10_information
}

The pd.DataFrame constructor accepts various data structures, such as NumPy arrays, Python lists or dictionaries. Python names are case-sensitive, so the constructor must be written exactly as pd.DataFrame, with an upper-case D and F; pd.dataframe or pd.Dataframe will raise an error.

# make a Pandas dataframe out of the dictionary
dict_df = pd.DataFrame(participants_clinical_information) # alternatively the input could also be a NumPy array
dict_df # mentioning a variable at the end of the cell makes it appear in the output

	participant1	participant2	participant3	participant4	participant5	participant6	participant7	participant8	participant9	participant10
0	female	male	male	female	female	female	male	male	female	male
1	25	40	18	83	55	32	21	46	32	67
2	175	1.79	1.75	1.63	1.68	1.59	1.9	1.71	1.66	1.78
3	70.9	103.4	85.1	55.9	50.0	64.0	92.9	75.4	90.7	82.3
4	False	True	True	False	False	True	False	False	True	False
5	(Asthma, Diabetes)	()	(Lung cancer,)	(Cardio vascular disease, Alzheimers)	(Asthma, Anxiety)	(Diabetes,)	(Asthma, Colon cancer)	()	(Depression,)	(Anxiety, Diabetes, Cardio vascular disease)
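The constructor also takes the other inputs mentioned above. The following is a minimal sketch using hypothetical toy data (the names rows, list_df and array_df are not part of the course material). Note that a list of lists or a 2-D array is read row by row, whereas the dictionary above was read column by column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two participants, three values each
rows = [
    ["female", 25, 70.9],
    ["male", 40, 103.4],
]

# A list of lists is read row by row: each inner list becomes one dataframe row
list_df = pd.DataFrame(rows)

# A 2-D NumPy array behaves the same way: one array row per dataframe row
array_df = pd.DataFrame(np.array([[25, 70.9], [40, 103.4]]))

print(list_df.shape)   # (2, 3)
print(array_df.shape)  # (2, 2)
```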

The dataframe is not oriented the way we want. By default, the dictionary keys (“participant1”, etc.) become column headers when the dictionary is loaded into a pandas dataframe, so each participant occupies a column instead of a row. Let us transpose the table:

# transposing the dataframe
transposed_dict_df = dict_df.T
transposed_dict_df

	0	1	2	3	4	5
participant1	female	25	175	70.9	False	(Asthma, Diabetes)
participant2	male	40	1.79	103.4	True	()
participant3	male	18	1.75	85.1	True	(Lung cancer,)
participant4	female	83	1.63	55.9	False	(Cardio vascular disease, Alzheimers)
participant5	female	55	1.68	50.0	False	(Asthma, Anxiety)
participant6	female	32	1.59	64.0	True	(Diabetes,)
participant7	male	21	1.9	92.9	False	(Asthma, Colon cancer)
participant8	male	46	1.71	75.4	False	()
participant9	female	32	1.66	90.7	True	(Depression,)
participant10	male	67	1.78	82.3	False	(Anxiety, Diabetes, Cardio vascular disease)
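As an alternative to transposing, pandas offers pd.DataFrame.from_dict, which can treat the dictionary keys as row labels from the start. The sketch below uses a hypothetical two-participant subset (small_dict and rows_df are illustrative names) and is one possible approach, not a replacement for the steps in this module:

```python
import pandas as pd

# Hypothetical two-participant subset of the dictionary above
small_dict = {
    "participant2": ["male", 40, 1.79, 103.4, True, ()],
    "participant3": ["male", 18, 1.75, 85.1, True, ("Lung cancer",)],
}

# orient="index" turns each dictionary key into a row label
# instead of a column header, so no transpose is needed
rows_df = pd.DataFrame.from_dict(small_dict, orient="index")

print(rows_df.shape)        # (2, 6)
print(list(rows_df.index))  # ['participant2', 'participant3']
```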

You can access information about the dataframe shape just like in NumPy:

transposed_dict_df.shape

(10, 6)

Many of the pandas functions described from now on have the format dataframe.method(). To apply a method to a dataframe, write the name of the dataframe first, then a dot, then the method name followed by parentheses ().
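Not everything after the dot needs parentheses. The .shape used above is an attribute (stored information), while methods perform an action and must be called with (). A small sketch with a hypothetical dataframe called demo_df shows the difference:

```python
import pandas as pd

# Hypothetical toy dataframe used only for this illustration
demo_df = pd.DataFrame({"age": [25, 40, 18], "weight": [70.9, 103.4, 85.1]})

# Attribute: accessed without parentheses
print(demo_df.shape)  # (3, 2)

# Method: called with parentheses; head(2) returns the first two rows
first_two = demo_df.head(2)
print(first_two.shape)  # (2, 2)
```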

Naming rows and columns

We successfully swapped rows and columns, but the columns are now labelled only with the numbers 0 to 5. Let us give them descriptive names. To do this, we need to create a Python list of column names, in the same order as the columns. In transposed_dict_df the participant IDs are not treated as a separate column but as row labels; the absence of a number above them in the header row indicates that. So the number of labels must match the number of recognised dataframe columns, which is six:

# naming columns
transposed_dict_df.columns = ["gender", "age", "height", "weight", "smoking", "diseases"]
transposed_dict_df

	gender	age	height	weight	smoking	diseases
participant1	female	25	175	70.9	False	(Asthma, Diabetes)
participant2	male	40	1.79	103.4	True	()
participant3	male	18	1.75	85.1	True	(Lung cancer,)
participant4	female	83	1.63	55.9	False	(Cardio vascular disease, Alzheimers)
participant5	female	55	1.68	50.0	False	(Asthma, Anxiety)
participant6	female	32	1.59	64.0	True	(Diabetes,)
participant7	male	21	1.9	92.9	False	(Asthma, Colon cancer)
participant8	male	46	1.71	75.4	False	()
participant9	female	32	1.66	90.7	True	(Depression,)
participant10	male	67	1.78	82.3	False	(Anxiety, Diabetes, Cardio vascular disease)
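If the number of labels does not match the number of columns, pandas refuses the assignment with a ValueError. The following sketch illustrates this on a hypothetical two-column dataframe (demo_df and mismatch_message are illustrative names):

```python
import pandas as pd

# Hypothetical toy dataframe with two unnamed columns
demo_df = pd.DataFrame([["female", 25], ["male", 40]])

mismatch_message = None
try:
    # Three labels for two columns: pandas raises a ValueError
    demo_df.columns = ["gender", "age", "height"]
except ValueError as error:
    mismatch_message = str(error)
    print(mismatch_message)

# The correct number of labels is accepted
demo_df.columns = ["gender", "age"]
print(list(demo_df.columns))  # ['gender', 'age']
```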

Naming rows follows the same logic. Assign a list or array of labels, one per row, to transposed_dict_df.index:

n_rows = transposed_dict_df.shape[0] # get the number of rows from the shape tuple

# here we will simply assign a number to each row
# np.arange is a NumPy function that generates a sequence of numbers
transposed_dict_df.index = np.arange(0, n_rows)
transposed_dict_df

	gender	age	height	weight	smoking	diseases
0	female	25	175	70.9	False	(Asthma, Diabetes)
1	male	40	1.79	103.4	True	()
2	male	18	1.75	85.1	True	(Lung cancer,)
3	female	83	1.63	55.9	False	(Cardio vascular disease, Alzheimers)
4	female	55	1.68	50.0	False	(Asthma, Anxiety)
5	female	32	1.59	64.0	True	(Diabetes,)
6	male	21	1.9	92.9	False	(Asthma, Colon cancer)
7	male	46	1.71	75.4	False	()
8	female	32	1.66	90.7	True	(Depression,)
9	male	67	1.78	82.3	False	(Anxiety, Diabetes, Cardio vascular disease)
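Row labels do not have to be numbers. As a sketch, text labels similar to the earlier participant IDs could be generated and assigned in the same way; demo_df and labels are hypothetical names used only here:

```python
import pandas as pd

# Hypothetical toy dataframe with three rows
demo_df = pd.DataFrame({"age": [25, 40, 18]})

# Build one text label per row; numbering starts at 1 to mirror the participant IDs
labels = [f"participant{i + 1}" for i in range(demo_df.shape[0])]
demo_df.index = labels

print(list(demo_df.index))  # ['participant1', 'participant2', 'participant3']
```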

Printing .columns and .index by themselves lets you view the current labels:

print(transposed_dict_df.columns)
print(transposed_dict_df.index)

Index(['gender', 'age', 'height', 'weight', 'smoking', 'diseases'], dtype='object')
Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
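The Index objects printed above can be converted to ordinary Python lists or checked for a given label. A brief sketch, again on a hypothetical demo_df:

```python
import pandas as pd

# Hypothetical toy dataframe used only for this illustration
demo_df = pd.DataFrame({"gender": ["female", "male"], "age": [25, 40]})

# Convert the labels to plain Python lists
column_labels = list(demo_df.columns)
row_labels = list(demo_df.index)
print(column_labels)  # ['gender', 'age']
print(row_labels)     # [0, 1]

# Check whether a label exists before using it
has_age = "age" in demo_df.columns
has_height = "height" in demo_df.columns
print(has_age, has_height)  # True False
```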

Save the dataframe so that you can work on it in the next section. The argument index=False leaves the row labels out of the file. The method .to_csv will be explained in more detail in module 4d.

transposed_dict_df.to_csv("transposed_dict_df-4a.csv", index=False)
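To see what index=False changes, the sketch below writes a hypothetical two-row dataframe to CSV text instead of a file (to_csv returns a string when no path is given) and reads it back. The names demo_df, csv_without_index, csv_with_index and restored_df are assumptions made for this illustration:

```python
import io
import pandas as pd

# Hypothetical toy dataframe with text row labels
demo_df = pd.DataFrame(
    {"gender": ["female", "male"], "age": [25, 40]},
    index=["participant1", "participant2"],
)

# Without a file path, to_csv returns the CSV content as a string
csv_without_index = demo_df.to_csv(index=False)  # row labels are left out
csv_with_index = demo_df.to_csv()                # row labels become the first column

print(csv_without_index)
print(csv_with_index)

# Reading the index-free text back gives default row numbers 0, 1, ...
restored_df = pd.read_csv(io.StringIO(csv_without_index))
print(list(restored_df.index))  # [0, 1]
```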
