Module 4d: Saving and loading a dataframe#
In this module you will learn how:
to save and load a dataframe.
import pandas as pd
# load the dataframe from the previous section
transposed_dict_df = pd.read_csv("transposed_dict_df-4c.csv")
After spending a while on preparing your dataframe, it would be worthwhile to save it for later reuse. One of the common file formats for pandas dataframes are comma-separated csv files or tab-separated files. For example, tab-based files are frequently used in bioinformatics analysis of sequencing data, often kept in tab-delimited BED format.
You can save your tab-separated tsv file with transposed_dict_df.to_csv()
. The method looks confusing with the csv-oriented name, but it works perfectly fine with tab files, because you can specify the delimiter with sep
keyword. Tab spacing is indicated with \t
.
# remove indexing, so it's easier to read the file later
# header is removed to highlight one data loading issue (discussed later)
transposed_dict_df.to_csv("example_dataframe.tsv", header=None, index=False, sep="\t")
Now let’s try loading our tab-seperated file into a pandas dataframe:
pd.read_csv("example_dataframe.tsv", sep="\t")
participant1 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | 70.9 | False | ('Asthma', 'Diabetes') | |
---|---|---|---|---|---|---|---|
0 | participant2 | male | 40.0 | 1.88 | 50.0 | True | () |
1 | participant3 | male | 18.0 | 1.65 | 73.0 | True | ('Lung cancer',) |
2 | participant4 | female | 83.0 | 1.72 | 87.0 | False | ('Cardio vascular disease', 'Alzheimers') |
3 | participant5 | female | 55.0 | 1.68 | 50.0 | False | ('Asthma', 'Anxiety') |
4 | participant6 | NaN | NaN | NaN | 64.0 | True | ('Diabetes',) |
5 | participant7 | male | 21.0 | 1.90 | 92.9 | False | ('Asthma', 'Colon cancer') |
6 | participant8 | NaN | NaN | NaN | 75.4 | False | () |
7 | participant9 | female | 32.0 | 1.66 | 90.7 | True | ('Depression',) |
8 | participant10 | male | 67.0 | 1.78 | 82.3 | False | ('Anxiety', 'Diabetes', 'Cardio vascular disea... |
9 | participant11 | female | 34.0 | 1.55 | 64.0 | False | () |
We’ve loaded the file - but look at the very first row! It is treated as column names automatically. When loading datasets, you might frequently encounter dataframes that do not have header column names prespecified, so it is good to specify header = None
when loading a file like this.
tab_file = pd.read_csv("example_dataframe.tsv", delimiter="\t", header=None)
tab_file
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | participant1 | NaN | NaN | NaN | 70.9 | False | ('Asthma', 'Diabetes') |
1 | participant2 | male | 40.0 | 1.88 | 50.0 | True | () |
2 | participant3 | male | 18.0 | 1.65 | 73.0 | True | ('Lung cancer',) |
3 | participant4 | female | 83.0 | 1.72 | 87.0 | False | ('Cardio vascular disease', 'Alzheimers') |
4 | participant5 | female | 55.0 | 1.68 | 50.0 | False | ('Asthma', 'Anxiety') |
5 | participant6 | NaN | NaN | NaN | 64.0 | True | ('Diabetes',) |
6 | participant7 | male | 21.0 | 1.90 | 92.9 | False | ('Asthma', 'Colon cancer') |
7 | participant8 | NaN | NaN | NaN | 75.4 | False | () |
8 | participant9 | female | 32.0 | 1.66 | 90.7 | True | ('Depression',) |
9 | participant10 | male | 67.0 | 1.78 | 82.3 | False | ('Anxiety', 'Diabetes', 'Cardio vascular disea... |
10 | participant11 | female | 34.0 | 1.55 | 64.0 | False | () |