Module 4e: Exploratory data analysis#
In this section you will learn how to:
do exploratory data analysis in pandas,
filter dataframes according to criteria,
handle missing data.
In this section we will focus on pandas methods for exploratory data analysis. To highlight their features, we will use a larger file: a popular stroke prediction dataset on Kaggle from Federico Soriano Palacios. It contains columns with clinical features such as age, gender, previous conditions such as hypertension, and smoking status. Each row represents the information for a single patient. You can read more about the dataset here.
import pandas as pd
import re # regular expressions library for checking string patterns
It is recommended to copy the commands from this section into your own Jupyter notebook and fetch the dataset yourself from here to become better acquainted with pandas. However, if you prefer to work through this submodule in “live code” mode on the website, just run the script below to load the dataset:
import sys
import os

# make the shared course folder importable and load the prepared dataset
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..', 'shared')))
import setup_code
stroke_dataset = setup_code.stroke_data
If you decided to run this section locally on your computer and downloaded the dataset from Kaggle, you might want to open it in Excel or a text editor to inspect it further. You will notice that it is a comma-separated csv file. You can load the dataset with pandas using pd.read_csv and specifying the filename.
# # You can comment/uncomment the following lines with: CTRL + / (or CMD + / on Mac)
# # specify the path to the dataset
# filepath = "/path/to/your/dataset/"
# stroke_dataset = pd.read_csv(filepath+"healthcare-dataset-stroke-data.csv")
Let us now explore the file and understand what data we are dealing with. Just running the cell with the dataframe name gives you a preview of the dataframe:
stroke_dataset
|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5105 | 18234 | Female | 80.0 | 1 | 0 | Yes | Private | Urban | 83.75 | NaN | never smoked | 0 |
5106 | 44873 | Female | 81.0 | 0 | 0 | Yes | Self-employed | Urban | 125.20 | 40.0 | never smoked | 0 |
5107 | 19723 | Female | 35.0 | 0 | 0 | Yes | Self-employed | Rural | 82.99 | 30.6 | never smoked | 0 |
5108 | 37544 | Male | 51.0 | 0 | 0 | Yes | Private | Rural | 166.29 | 25.6 | formerly smoked | 0 |
5109 | 44679 | Female | 44.0 | 0 | 0 | Yes | Govt_job | Urban | 85.28 | 26.2 | Unknown | 0 |
5110 rows × 12 columns
Alternatively, you can choose to show only the first 5 rows (.head(...)) or the final 5 rows (.tail(...)). You can easily change the number of rows returned via the n=5 argument.
stroke_dataset.head(n=5)
# or just stroke_dataset.head(5) without specifying "n=" as it is the only parameter in this method
|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
If you supply a negative integer to .head(), e.g. -10, it returns all rows of your dataframe except the last 10.
stroke_dataset.head(-10)
|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9046 | Male | 67.00 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
1 | 51676 | Female | 61.00 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
2 | 31112 | Male | 80.00 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
3 | 60182 | Female | 49.00 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
4 | 1665 | Female | 79.00 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5095 | 68598 | Male | 1.08 | 0 | 0 | No | children | Rural | 79.15 | 17.4 | Unknown | 0 |
5096 | 41512 | Male | 57.00 | 0 | 0 | Yes | Govt_job | Rural | 76.62 | 28.2 | never smoked | 0 |
5097 | 64520 | Male | 68.00 | 0 | 0 | Yes | Self-employed | Urban | 91.68 | 40.8 | Unknown | 0 |
5098 | 579 | Male | 9.00 | 0 | 0 | No | children | Urban | 71.88 | 17.5 | Unknown | 0 |
5099 | 7293 | Male | 40.00 | 0 | 0 | Yes | Private | Rural | 83.94 | NaN | smokes | 0 |
5100 rows × 12 columns
stroke_dataset.tail(n=5) # or just stroke_dataset.tail(5)
|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|
5105 | 18234 | Female | 80.0 | 1 | 0 | Yes | Private | Urban | 83.75 | NaN | never smoked | 0 |
5106 | 44873 | Female | 81.0 | 0 | 0 | Yes | Self-employed | Urban | 125.20 | 40.0 | never smoked | 0 |
5107 | 19723 | Female | 35.0 | 0 | 0 | Yes | Self-employed | Rural | 82.99 | 30.6 | never smoked | 0 |
5108 | 37544 | Male | 51.0 | 0 | 0 | Yes | Private | Rural | 166.29 | 25.6 | formerly smoked | 0 |
5109 | 44679 | Female | 44.0 | 0 | 0 | Yes | Govt_job | Urban | 85.28 | 26.2 | Unknown | 0 |
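As a side note, if you want a quick look at rows from across the whole table rather than just its ends, .sample() returns randomly chosen rows. A minimal sketch (the random_state argument just fixes the random seed so the draw is reproducible):
# draw 5 random rows; fixing random_state makes the result reproducible
stroke_dataset.sample(n=5, random_state=0)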
You will get even more comprehensive high-level information by running commands like stroke_dataset.info(), stroke_dataset.describe() or stroke_dataset.agg(...).
# this lists all columns, the number of rows/entries, the number of non-null values each column has
# and the type of values stored in each column
stroke_dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 5110 non-null int64
1 gender 5110 non-null object
2 age 5110 non-null float64
3 hypertension 5110 non-null int64
4 heart_disease 5110 non-null int64
5 ever_married 5110 non-null object
6 work_type 5110 non-null object
7 Residence_type 5110 non-null object
8 avg_glucose_level 5110 non-null float64
9 bmi 4909 non-null float64
10 smoking_status 5110 non-null object
11 stroke 5110 non-null int64
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
In the info() output, the numbers in brackets in the dtypes line tell you how many columns have a given type (types are further explained in module 2). In this example, each column has the type float64, int64 or object, meaning that the values are stored as floats, integers or a generic “object”. The last one usually indicates that the column stores a mix of types, or that the values are all character strings, like in this dataset. The 64 in float64/int64 is the number of bits used to store each value.
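If you only need the column types without the rest of the info() summary, the dtypes attribute gives you exactly that:
# list only the type of values stored in each column
stroke_dataset.dtypes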
# this command computes summary statistics for the values in each column with float or integer type
stroke_dataset.describe()
|  | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
---|---|---|---|---|---|---|---|
count | 5110.000000 | 5110.000000 | 5110.000000 | 5110.000000 | 5110.000000 | 4909.000000 | 5110.000000 |
mean | 36517.829354 | 43.226614 | 0.097456 | 0.054012 | 106.147677 | 28.893237 | 0.048728 |
std | 21161.721625 | 22.612647 | 0.296607 | 0.226063 | 45.283560 | 7.854067 | 0.215320 |
min | 67.000000 | 0.080000 | 0.000000 | 0.000000 | 55.120000 | 10.300000 | 0.000000 |
25% | 17741.250000 | 25.000000 | 0.000000 | 0.000000 | 77.245000 | 23.500000 | 0.000000 |
50% | 36932.000000 | 45.000000 | 0.000000 | 0.000000 | 91.885000 | 28.100000 | 0.000000 |
75% | 54682.000000 | 61.000000 | 0.000000 | 0.000000 | 114.090000 | 33.100000 | 0.000000 |
max | 72940.000000 | 82.000000 | 1.000000 | 1.000000 | 271.740000 | 97.600000 | 1.000000 |
This distribution is not useful for the stroke column, because the column seems to contain only binary values. However, we are not completely sure, so let's confirm it with .unique():
stroke_column = stroke_dataset['stroke'] # select only the 'stroke' column
print(stroke_column.unique()) # list the unique values in the 'stroke' column
[1 0]
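A closely related method is .value_counts(), which lists the unique values and also counts how often each one occurs:
# count how many times each unique value appears in the 'stroke' column
stroke_column.value_counts()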
Now we can be certain that the “stroke” column includes only binary values. The same can be said about the “heart_disease” and “hypertension” columns. We can design our own summary statistics dataframe for binary data using .agg(...) or its longer alias .aggregate(...). We need to provide the function with a list of strings naming the operations to apply; each operation is applied to every column independently. These can be commonly used pandas operations like mean, std, min, max, etc. Here we are interested in summing all the values to see how many non-zero entries are stored.
binary_columns = stroke_dataset[["stroke", "heart_disease", "hypertension"]] # select binary columns
binary_columns.agg(["count", "sum"]) # keep "count" to see how many entries are in total
|  | stroke | heart_disease | hypertension |
---|---|---|---|
count | 5110 | 5110 | 5110 |
sum | 249 | 276 | 498 |
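.agg(...) also accepts a dictionary if you want a different operation for each column. For instance, we can sum the stroke column while taking the mean of the hypertension column (the mean of a binary column is the fraction of ones):
# per-column operations: total number of strokes, fraction of patients with hypertension
binary_columns.agg({"stroke": "sum", "hypertension": "mean"})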
What about columns with string values classified as object type? No problem! .describe(...) has this covered. The include argument allows you to specify the types of columns to summarise as a list. Here we are only interested in the object columns, so we pass a single-element list:
stroke_dataset.describe(include=['object'])
|  | gender | ever_married | work_type | Residence_type | smoking_status |
---|---|---|---|---|---|
count | 5110 | 5110 | 5110 | 5110 | 5110 |
unique | 3 | 2 | 5 | 2 | 4 |
top | Female | Yes | Private | Urban | never smoked |
freq | 2994 | 3353 | 2925 | 2596 | 1892 |
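If you would rather summarise numeric and object columns in a single table, .describe() also accepts include='all'; statistics that do not apply to a column's type are filled with NaN:
# summarise every column at once, regardless of type
stroke_dataset.describe(include='all')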
All the outputs from the summary functions apart from .info() are also dataframes and can be manipulated as such.
Filtering for values according to criteria#
You might be interested in further inspecting data that meet certain criteria. You can filter for them in a format like this:
dataframe[ (condition1) bitwise_operator (condition2) ]
if you want to combine two conditions.
Each condition statement is wrapped in parentheses, and the statements are joined with a logical bitwise operator like:
&
denoting “and”,|
standing for “or”.
You can easily extend it to more than 2 conditions.
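There is also the ~ operator, which negates a condition. A quick sketch selecting all entries outside urban areas:
# select all entries that do NOT have an urban residence type
stroke_dataset[~(stroke_dataset["Residence_type"] == "Urban")]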
Here are some examples:
# single condition: select all entries with an urban residence type
stroke_dataset[stroke_dataset["Residence_type"] == "Urban"]
|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
5 | 56669 | Male | 81.0 | 0 | 0 | Yes | Private | Urban | 186.21 | 29.0 | formerly smoked | 1 |
7 | 10434 | Female | 69.0 | 0 | 0 | No | Private | Urban | 94.39 | 22.8 | never smoked | 1 |
9 | 60491 | Female | 78.0 | 0 | 0 | Yes | Private | Urban | 58.57 | 24.2 | Unknown | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5101 | 36901 | Female | 45.0 | 0 | 0 | Yes | Private | Urban | 97.95 | 24.5 | Unknown | 0 |
5103 | 22127 | Female | 18.0 | 0 | 0 | No | Private | Urban | 82.85 | 46.9 | Unknown | 0 |
5105 | 18234 | Female | 80.0 | 1 | 0 | Yes | Private | Urban | 83.75 | NaN | never smoked | 0 |
5106 | 44873 | Female | 81.0 | 0 | 0 | Yes | Self-employed | Urban | 125.20 | 40.0 | never smoked | 0 |
5109 | 44679 | Female | 44.0 | 0 | 0 | Yes | Govt_job | Urban | 85.28 | 26.2 | Unknown | 0 |
2596 rows × 12 columns
# multiple conditions: select all people with no heart disease who live in urban areas and have an average glucose level higher than 150
stroke_dataset[(stroke_dataset["heart_disease"] == 0) & (stroke_dataset["Residence_type"] == "Urban") & (stroke_dataset["avg_glucose_level"] > 150)]
|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
5 | 56669 | Male | 81.0 | 0 | 0 | Yes | Private | Urban | 186.21 | 29.0 | formerly smoked | 1 |
17 | 34120 | Male | 75.0 | 1 | 0 | Yes | Private | Urban | 221.29 | 25.8 | smokes | 1 |
21 | 13861 | Female | 52.0 | 1 | 0 | Yes | Self-employed | Urban | 233.29 | 48.9 | never smoked | 1 |
22 | 68794 | Female | 79.0 | 0 | 0 | Yes | Self-employed | Urban | 228.70 | 26.6 | never smoked | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5035 | 49773 | Female | 78.0 | 0 | 0 | Yes | Private | Urban | 203.36 | 28.7 | formerly smoked | 0 |
5048 | 28788 | Male | 40.0 | 0 | 0 | Yes | Private | Urban | 191.15 | NaN | smokes | 0 |
5061 | 38009 | Male | 41.0 | 0 | 0 | Yes | Private | Urban | 223.78 | 32.3 | never smoked | 0 |
5063 | 68967 | Male | 39.0 | 0 | 0 | Yes | Private | Urban | 179.38 | 27.7 | Unknown | 0 |
5086 | 65411 | Female | 51.0 | 0 | 0 | Yes | Private | Urban | 152.56 | 21.8 | Unknown | 0 |
313 rows × 12 columns
Mini-exercise 1
Select entries that have either hypertension or heart disease.
# your code goes here
💡 Solution
One of the possible ways to achieve this result:
stroke_dataset[(stroke_dataset["hypertension"] == 1) | (stroke_dataset["heart_disease"] == 1)]
Selecting only specific columns after filtering for conditions#
In the previous examples, you obtained rows with all columns included. If you are interested in only one or a couple of columns, you can use .loc and specify the condition in the place where you would normally provide row labels. It follows this format:
dataframe.loc[condition][column_label]
Take a look at the following examples:
stroke_dataset.loc[stroke_dataset["Residence_type"] == "Urban"]["age"]
0 67.0
3 49.0
5 81.0
7 69.0
9 78.0
...
5101 45.0
5103 18.0
5105 80.0
5106 81.0
5109 44.0
Name: age, Length: 2596, dtype: float64
# select 2 columns that meet 3 criteria
stroke_dataset.loc[(stroke_dataset["heart_disease"] == 0) & (stroke_dataset["Residence_type"] == "Urban") & (stroke_dataset["avg_glucose_level"] > 150)][["heart_disease", "gender"]]
|  | heart_disease | gender |
---|---|---|
3 | 0 | Female |
5 | 0 | Male |
17 | 0 | Male |
21 | 0 | Female |
22 | 0 | Female |
... | ... | ... |
5035 | 0 | Female |
5048 | 0 | Male |
5061 | 0 | Male |
5063 | 0 | Male |
5086 | 0 | Female |
313 rows × 2 columns
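As a side note, .loc also accepts the column labels as a second argument inside the same brackets. This selects rows and columns in a single step and is generally the recommended form, since the chained indexing above can cause surprises when assigning values:
# equivalent selection in a single .loc call: rows matching the conditions, two columns
stroke_dataset.loc[(stroke_dataset["heart_disease"] == 0)
                   & (stroke_dataset["Residence_type"] == "Urban")
                   & (stroke_dataset["avg_glucose_level"] > 150),
                   ["heart_disease", "gender"]]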
Filtering string values#
You can also filter for string values that contain or start with a certain phrase using .str
attribute. Here only a couple of examples will be mentioned, but you can explore more options in the pandas documentation.
.str
attribute only works for a pandas series, which is just a 1D NumPy array with additional labels from pandas [Pandas, 2025]. In this case you would need to select a single row or column first before filtering for string phrases.
# obtain a Boolean pandas series (i.e. consisting of only True/False values)
# where True marks entries that start with "ne"
never_bool = stroke_dataset["smoking_status"].str.startswith("ne")
Some .str methods like .str.contains allow you to search for a phrase case-insensitively using the flags argument. This keyword accepts flags from re, Python's regular expressions library:
# the re.IGNORECASE flag makes the match case-insensitive
gov_bool = stroke_dataset["work_type"].str.contains("gov", flags=re.IGNORECASE)
The series that these methods return can also be used as a condition for filtering pandas rows:
# return entries that contain "gov" in work_type and start with "ne" in smoking_status
stroke_dataset[gov_bool & never_bool]
|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|
34 | 14248 | Male | 48.0 | 0 | 0 | No | Govt_job | Urban | 84.20 | 29.7 | never smoked | 1 |
54 | 8752 | Female | 63.0 | 0 | 0 | Yes | Govt_job | Urban | 197.54 | NaN | never smoked | 1 |
59 | 5111 | Female | 54.0 | 1 | 0 | Yes | Govt_job | Urban | 180.93 | 27.7 | never smoked | 1 |
99 | 4639 | Female | 69.0 | 0 | 0 | Yes | Govt_job | Urban | 82.81 | 28.0 | never smoked | 1 |
101 | 63973 | Female | 77.0 | 0 | 0 | Yes | Govt_job | Rural | 190.32 | 31.4 | never smoked | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5044 | 27616 | Male | 33.0 | 0 | 0 | Yes | Govt_job | Rural | 81.10 | 24.8 | never smoked | 0 |
5053 | 43496 | Female | 46.0 | 0 | 0 | Yes | Govt_job | Urban | 55.84 | 27.8 | never smoked | 0 |
5057 | 38349 | Female | 49.0 | 0 | 0 | Yes | Govt_job | Urban | 69.92 | 47.6 | never smoked | 0 |
5092 | 56799 | Male | 76.0 | 0 | 0 | Yes | Govt_job | Urban | 82.35 | 38.9 | never smoked | 0 |
5096 | 41512 | Male | 57.0 | 0 | 0 | Yes | Govt_job | Rural | 76.62 | 28.2 | never smoked | 0 |
282 rows × 12 columns
Handling missing data#
Conducting statistical analysis with missing data is sometimes non-trivial. In the stroke dataset, for example, an entry may include values for every column apart from one category. In this section you will learn some useful techniques for handling this type of data.
Mini-exercise 2
We need to check which columns have missing values in stroke_dataset
. Print out unique values for each column apart from “id” and report which categories have unknown data entries.
# your code goes here
💡 Solution
One of the possible ways to achieve this result:
labels = stroke_dataset.columns[1:] # ignore "id" column
for label in labels:
    print(label)
    print(stroke_dataset[label].unique())
# The columns "smoking_status" and "bmi" have missing data.
If you did the exercise above, you might have noticed that two columns have missing data and that it is encoded differently: “smoking_status” uses “Unknown”, while “bmi” uses “nan”. What format is “nan”? Is it recognised as a missing value in pandas? Let's check it with .isnull():
# obtain a Boolean dataframe (i.e. consisting of only True/False values)
# True means a value is "nan"
df_isnull = stroke_dataset.isnull()
# Boolean True/False values are also treated as ones/zeros in Python
# summing along columns will indicate how many True values we have
df_isnull.sum()
id 0
gender 0
age 0
hypertension 0
heart_disease 0
ever_married 0
work_type 0
Residence_type 0
avg_glucose_level 0
bmi 201
smoking_status 0
stroke 0
dtype: int64
This shows that the “nan”s in “bmi” are recognised by pandas. However, “Unknown” in “smoking_status” is just a string. It should be changed to pd.NA to streamline further data manipulation tasks.
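If a proportion is more informative than a raw count, taking the mean of the Boolean dataframe gives the fraction of missing values in each column:
# the mean of True/False values equals the fraction of missing entries per column
df_isnull.mean()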
Mini-exercise 3
Replace all “Unknown” instances with pd.NA
in stroke_dataset
.
# your code goes here
💡 Solution
One of the possible ways to achieve this result:
stroke_dataset.loc[stroke_dataset["smoking_status"] == "Unknown", "smoking_status"] = pd.NA
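An equivalent way is the .replace() method, which swaps one value for another across the whole column:
# replace every "Unknown" in the column with the pandas missing-value marker
stroke_dataset["smoking_status"] = stroke_dataset["smoking_status"].replace("Unknown", pd.NA)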
Choosing an appropriate way to handle missing data in statistical analysis is at times a non-trivial task. Here you will implement one common approach for a continuous variable.
Mini-exercise 4
Replace missing values with the mean of all the other values in “bmi” category.
Look up the pandas documentation to learn how to compute the mean without NA values.
# your code goes here
💡 Solution
One of the possible ways to achieve this result:
mean_val = stroke_dataset["bmi"].mean(skipna=True)
stroke_dataset.loc[stroke_dataset["bmi"].isnull(), "bmi"] = mean_val
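pandas also provides .fillna() for exactly this task, so the second line could equivalently be written as:
# fill every missing bmi value with the column mean
stroke_dataset["bmi"] = stroke_dataset["bmi"].fillna(mean_val)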
References#
Pandas. pandas.Series — pandas documentation. https://pandas.pydata.org/docs/reference/api/pandas.Series.html, 2025. Accessed: 2025-06-25.