Module 4d: Saving and loading a dataframe

Module 4d: Saving and loading a dataframe#

In this module you will learn how:

  • to save and load a dataframe.

import pandas as pd

# load the dataframe from the previous section
transposed_dict_df = pd.read_csv("transposed_dict_df-4c.csv")

After spending a while on preparing your dataframe, it would be worthwhile to save it for later reuse. One of the common file formats for pandas dataframes are comma-separated csv files or tab-separated files. For example, tab-based files are frequently used in bioinformatics analysis of sequencing data, often kept in tab-delimited BED format.

You can save your tab-separated tsv file with transposed_dict_df.to_csv(). The method looks confusing with the csv-oriented name, but it works perfectly fine with tab files, because you can specify the delimiter with sep keyword. Tab spacing is indicated with \t.

# remove indexing, so it's easier to read the file later
# header is removed to highlight one data loading issue (discussed later)
transposed_dict_df.to_csv("example_dataframe.tsv", header=None, index=False, sep="\t") 

Now let’s try loading our tab-seperated file into a pandas dataframe:

pd.read_csv("example_dataframe.tsv", sep="\t")
participant1 Unnamed: 1 Unnamed: 2 Unnamed: 3 70.9 False ('Asthma', 'Diabetes')
0 participant2 male 40.0 1.88 50.0 True ()
1 participant3 male 18.0 1.65 73.0 True ('Lung cancer',)
2 participant4 female 83.0 1.72 87.0 False ('Cardio vascular disease', 'Alzheimers')
3 participant5 female 55.0 1.68 50.0 False ('Asthma', 'Anxiety')
4 participant6 NaN NaN NaN 64.0 True ('Diabetes',)
5 participant7 male 21.0 1.90 92.9 False ('Asthma', 'Colon cancer')
6 participant8 NaN NaN NaN 75.4 False ()
7 participant9 female 32.0 1.66 90.7 True ('Depression',)
8 participant10 male 67.0 1.78 82.3 False ('Anxiety', 'Diabetes', 'Cardio vascular disea...
9 participant11 female 34.0 1.55 64.0 False ()

We’ve loaded the file - but look at the very first row! It is treated as column names automatically. When loading datasets, you might frequently encounter dataframes that do not have header column names prespecified, so it is good to specify header = None when loading a file like this.

tab_file = pd.read_csv("example_dataframe.tsv", delimiter="\t", header=None)
tab_file
0 1 2 3 4 5 6
0 participant1 NaN NaN NaN 70.9 False ('Asthma', 'Diabetes')
1 participant2 male 40.0 1.88 50.0 True ()
2 participant3 male 18.0 1.65 73.0 True ('Lung cancer',)
3 participant4 female 83.0 1.72 87.0 False ('Cardio vascular disease', 'Alzheimers')
4 participant5 female 55.0 1.68 50.0 False ('Asthma', 'Anxiety')
5 participant6 NaN NaN NaN 64.0 True ('Diabetes',)
6 participant7 male 21.0 1.90 92.9 False ('Asthma', 'Colon cancer')
7 participant8 NaN NaN NaN 75.4 False ()
8 participant9 female 32.0 1.66 90.7 True ('Depression',)
9 participant10 male 67.0 1.78 82.3 False ('Anxiety', 'Diabetes', 'Cardio vascular disea...
10 participant11 female 34.0 1.55 64.0 False ()