Module 4d: Saving and loading a dataframe


In this module you will learn how to save and load a dataframe.

import pandas as pd

# load the dataframe from the previous section
transposed_dict_df = pd.read_csv("transposed_dict_df-4c.csv")

After spending a while preparing your dataframe, it is worthwhile to save it for later reuse. Two common file formats for pandas dataframes are comma-separated (csv) and tab-separated (tsv) files. For example, tab-separated files are frequently used in bioinformatics analysis of sequencing data, which is often kept in the tab-delimited BED format.

You can save a dataframe as a tab-separated tsv file with transposed_dict_df.to_csv(). The csv-oriented method name may look confusing, but it works perfectly well for tab-separated files, because you can specify the delimiter with the sep keyword. A tab is written as \t.

# index=False leaves out the row index, so the file is easier to read later
# header=False leaves out the column names, to highlight a data loading issue (discussed later)
transposed_dict_df.to_csv("example_dataframe.tsv", header=False, index=False, sep="\t")
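To see what the sep keyword does to the written text, here is a minimal sketch on a small made-up dataframe. The name toy_df and its columns are assumptions for illustration only and are not part of the dataset used in this module. Calling to_csv() without a file path returns the text instead of writing a file, which makes the tabs easy to inspect.

```python
import pandas as pd

# Hypothetical two-row dataframe, used only for illustration
toy_df = pd.DataFrame({"participant": ["p1", "p2"], "weight": [70.9, 50.0]})

# With no path given, to_csv() returns the file content as a string
tsv_text = toy_df.to_csv(sep="\t", index=False)

# repr() makes the tab characters visible as \t
for line in tsv_text.splitlines():
    print(repr(line))
```

Each printed line should show the values separated by \t rather than by commas.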

Now let’s try loading our tab-separated file into a pandas dataframe:

pd.read_csv("example_dataframe.tsv", sep="\t")

	participant1	Unnamed: 1	Unnamed: 2	Unnamed: 3	70.9	False	('Asthma', 'Diabetes')
0	participant2	male	40.0	1.88	50.0	True	()
1	participant3	male	18.0	1.65	73.0	True	('Lung cancer',)
2	participant4	female	83.0	1.72	87.0	False	('Cardio vascular disease', 'Alzheimers')
3	participant5	female	55.0	1.68	50.0	False	('Asthma', 'Anxiety')
4	participant6	NaN	NaN	NaN	64.0	True	('Diabetes',)
5	participant7	male	21.0	1.90	92.9	False	('Asthma', 'Colon cancer')
6	participant8	NaN	NaN	NaN	75.4	False	()
7	participant9	female	32.0	1.66	90.7	True	('Depression',)
8	participant10	male	67.0	1.78	82.3	False	('Anxiety', 'Diabetes', 'Cardio vascular disea...
9	participant11	female	34.0	1.55	64.0	False	()

We’ve loaded the file, but look at the very first row! pandas has automatically treated it as the column names, so participant1 has become a header and the missing values in that row appear as Unnamed: 1 to Unnamed: 3. You will frequently encounter data files that have no header row, so it is good practice to pass header=None when loading a file like this.

# delimiter is an alias for sep; header=None tells pandas the file has no header row
tab_file = pd.read_csv("example_dataframe.tsv", delimiter="\t", header=None)
tab_file

	0	1	2	3	4	5	6
0	participant1	NaN	NaN	NaN	70.9	False	('Asthma', 'Diabetes')
1	participant2	male	40.0	1.88	50.0	True	()
2	participant3	male	18.0	1.65	73.0	True	('Lung cancer',)
3	participant4	female	83.0	1.72	87.0	False	('Cardio vascular disease', 'Alzheimers')
4	participant5	female	55.0	1.68	50.0	False	('Asthma', 'Anxiety')
5	participant6	NaN	NaN	NaN	64.0	True	('Diabetes',)
6	participant7	male	21.0	1.90	92.9	False	('Asthma', 'Colon cancer')
7	participant8	NaN	NaN	NaN	75.4	False	()
8	participant9	female	32.0	1.66	90.7	True	('Depression',)
9	participant10	male	67.0	1.78	82.3	False	('Anxiety', 'Diabetes', 'Cardio vascular disea...
10	participant11	female	34.0	1.55	64.0	False	()
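With header=None the columns are simply numbered 0, 1, 2 and so on. If you know what each column holds, you can supply labels through the names keyword of pd.read_csv(). The sketch below uses a small in-memory tab-separated text rather than the file above, and the column names in it are assumptions chosen for illustration, not the names used in the previous module.

```python
import io

import pandas as pd

# Hypothetical headerless tab-separated content, used only for illustration
headerless_tsv = "participant1\tmale\t40.0\nparticipant2\tfemale\t55.0\n"

# Default behaviour: the first data row is consumed as the column names
default_read = pd.read_csv(io.StringIO(headerless_tsv), sep="\t")

# header=None keeps every row as data and numbers the columns
no_header = pd.read_csv(io.StringIO(headerless_tsv), sep="\t", header=None)

# names= assigns labels while still treating every row as data
column_names = ["participant", "sex", "age"]  # assumed labels
named = pd.read_csv(
    io.StringIO(headerless_tsv), sep="\t", header=None, names=column_names
)

print(len(default_read), len(no_header), len(named))
print(list(named.columns))
```

Comparing the row counts is a quick way to check that no data row was silently turned into a header.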