Module 4e: Exploratory data analysis#

In this section you will learn how:

  • to do exploratory data analysis in pandas,

  • to filter dataframes according to criteria,

  • to handle missing data.

In this section we will focus on pandas methods for exploratory data analysis. To highlight these features, we will use a larger file: the popular stroke prediction dataset from Kaggle, published by Federico Soriano Palacios. It contains columns with clinical features such as age, gender, pre-existing conditions such as hypertension, and smoking status. Each row represents the information for a single patient. You can read more about the dataset here.

import pandas as pd
import re # regular expressions library for checking string patterns

It is recommended to copy the commands from this section into your own Jupyter notebook and fetch the dataset yourself from here, to become better familiarised with pandas. However, if you prefer to work through this submodule in “live code” mode on the website, simply run the script below to load the dataset:

import sys
import os
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..', 'shared')))
import setup_code
stroke_dataset = setup_code.stroke_data

If you decided to run this section locally on your computer and downloaded the dataset from Kaggle, you might want to open it in Excel or a text editor to inspect it further. You will notice that it is a comma-separated values (CSV) file. You can load the dataset with pandas using pd.read_csv and specifying the filename.

# # You can comment/uncomment the following lines with: CTRL + / (or CMD + / on Mac)

# # specify the path to the dataset
# filepath = "/path/to/your/dataset/"
# stroke_dataset = pd.read_csv(filepath+"healthcare-dataset-stroke-data.csv")

Let us now explore the file and understand what data we are dealing with. Just running the cell with the dataframe name gives you a preview of the dataframe:

stroke_dataset
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 Male 67.0 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
1 51676 Female 61.0 0 0 Yes Self-employed Rural 202.21 NaN never smoked 1
2 31112 Male 80.0 0 1 Yes Private Rural 105.92 32.5 never smoked 1
3 60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
4 1665 Female 79.0 1 0 Yes Self-employed Rural 174.12 24.0 never smoked 1
... ... ... ... ... ... ... ... ... ... ... ... ...
5105 18234 Female 80.0 1 0 Yes Private Urban 83.75 NaN never smoked 0
5106 44873 Female 81.0 0 0 Yes Self-employed Urban 125.20 40.0 never smoked 0
5107 19723 Female 35.0 0 0 Yes Self-employed Rural 82.99 30.6 never smoked 0
5108 37544 Male 51.0 0 0 Yes Private Rural 166.29 25.6 formerly smoked 0
5109 44679 Female 44.0 0 0 Yes Govt_job Urban 85.28 26.2 Unknown 0

5110 rows × 12 columns
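
If you only need the dimensions rather than a preview, the .shape attribute reports them as a (rows, columns) tuple:

# number of rows and columns, without printing any data
stroke_dataset.shape
(5110, 12)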

Alternatively, you can choose to display only the first 5 rows (.head(...)) or the final 5 rows (.tail(...)). You can easily change the number of rows returned via the n=5 argument.

stroke_dataset.head(n=5) 
# or just stroke_dataset.head(5) without specifying "n=" as it is the only parameter in this method
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 Male 67.0 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
1 51676 Female 61.0 0 0 Yes Self-employed Rural 202.21 NaN never smoked 1
2 31112 Male 80.0 0 1 Yes Private Rural 105.92 32.5 never smoked 1
3 60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
4 1665 Female 79.0 1 0 Yes Self-employed Rural 174.12 24.0 never smoked 1

If you supply a negative integer to .head(), e.g. -10, it returns the dataframe without its last 10 rows.

stroke_dataset.head(-10) 
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 Male 67.00 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
1 51676 Female 61.00 0 0 Yes Self-employed Rural 202.21 NaN never smoked 1
2 31112 Male 80.00 0 1 Yes Private Rural 105.92 32.5 never smoked 1
3 60182 Female 49.00 0 0 Yes Private Urban 171.23 34.4 smokes 1
4 1665 Female 79.00 1 0 Yes Self-employed Rural 174.12 24.0 never smoked 1
... ... ... ... ... ... ... ... ... ... ... ... ...
5095 68598 Male 1.08 0 0 No children Rural 79.15 17.4 Unknown 0
5096 41512 Male 57.00 0 0 Yes Govt_job Rural 76.62 28.2 never smoked 0
5097 64520 Male 68.00 0 0 Yes Self-employed Urban 91.68 40.8 Unknown 0
5098 579 Male 9.00 0 0 No children Urban 71.88 17.5 Unknown 0
5099 7293 Male 40.00 0 0 Yes Private Rural 83.94 NaN smokes 0

5100 rows × 12 columns

stroke_dataset.tail(5) # equivalently stroke_dataset.tail(n=5)
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
5105 18234 Female 80.0 1 0 Yes Private Urban 83.75 NaN never smoked 0
5106 44873 Female 81.0 0 0 Yes Self-employed Urban 125.20 40.0 never smoked 0
5107 19723 Female 35.0 0 0 Yes Self-employed Rural 82.99 30.6 never smoked 0
5108 37544 Male 51.0 0 0 Yes Private Rural 166.29 25.6 formerly smoked 0
5109 44679 Female 44.0 0 0 Yes Govt_job Urban 85.28 26.2 Unknown 0

You will get even more comprehensive high-level information by running commands like stroke_dataset.info(), stroke_dataset.describe() or stroke_dataset.agg(...).

# this lists all columns, the number of rows/entries, the number of non-null values each column has 
# and the type of values stored in each column
stroke_dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB

In the .info() output, the numbers in brackets in the dtypes row tell you how many columns have a given type (types are explained further in module 2). In this example, each column has the type int64, float64 or object, meaning the values are stored as integers, floats or a generic “object”. The last one usually indicates that the column stores a mix of types, or that the values are all character strings, as in this dataset. The 64 in float64/int64 is the number of bits used to store each value.
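
If you only want the column types, the .dtypes attribute returns them as a pandas series indexed by column name:

# list only the dtype of each column
stroke_dataset.dtypes
id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object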

# this command computes summary statistics for every column with float or integer type
stroke_dataset.describe()
id age hypertension heart_disease avg_glucose_level bmi stroke
count 5110.000000 5110.000000 5110.000000 5110.000000 5110.000000 4909.000000 5110.000000
mean 36517.829354 43.226614 0.097456 0.054012 106.147677 28.893237 0.048728
std 21161.721625 22.612647 0.296607 0.226063 45.283560 7.854067 0.215320
min 67.000000 0.080000 0.000000 0.000000 55.120000 10.300000 0.000000
25% 17741.250000 25.000000 0.000000 0.000000 77.245000 23.500000 0.000000
50% 36932.000000 45.000000 0.000000 0.000000 91.885000 28.100000 0.000000
75% 54682.000000 61.000000 0.000000 0.000000 114.090000 33.100000 0.000000
max 72940.000000 82.000000 1.000000 1.000000 271.740000 97.600000 1.000000

The distribution is not useful for the stroke column, because it seems to contain only binary values. However, we are not completely sure, so let’s confirm it with .unique():

stroke_column = stroke_dataset['stroke'] # select only the 'stroke' column
print(stroke_column.unique()) # list the unique values in the 'stroke' column
[1 0]
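
The other two candidate columns can be checked in the same way:

# confirm that "hypertension" and "heart_disease" are also binary
for column in ["hypertension", "heart_disease"]:
    print(column, stroke_dataset[column].unique())
hypertension [0 1]
heart_disease [1 0]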

Now we can be certain that the “stroke”, “heart_disease” and “hypertension” columns contain only binary values. We can design our own general statistics dataframe for binary data using .agg(...) or .aggregate(...). We pass the method a list of strings naming the operations to apply; each operation is applied to every column independently. These can be commonly used pandas operations like mean, std, min, max, etc. Here we are interested in summing all the values to see how many non-zero entries are stored.

binary_columns = stroke_dataset[["stroke", "heart_disease", "hypertension"]] # select binary columns
binary_columns.agg(["count", "sum"]) # keep "count" to see how many entries are in total
stroke heart_disease hypertension
count 5110 5110 5110
sum 249 276 498

What about columns with string values classified as the object type? No problem: .describe(...) has this covered. The include argument lets you specify a list of column types to include. Here we are only interested in the object columns, so we pass a single-element list:

stroke_dataset.describe(include=['object'])
gender ever_married work_type Residence_type smoking_status
count 5110 5110 5110 5110 5110
unique 3 2 5 2 4
top Female Yes Private Urban never smoked
freq 2994 3353 2925 2596 1892

All the outputs from the summary functions, apart from .info(), are themselves dataframes and can be manipulated as such.
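
For example, since the output of .describe() is a dataframe, you can index it with .loc like any other:

# store the summary table and pick out a single statistic from it
summary = stroke_dataset.describe()
summary.loc["mean", "age"]  # the mean age, 43.226614 in the table above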

Filtering for values according to criteria#

You might be interested in further inspecting data that meet certain criteria. You can filter for them in a format like this:

dataframe[ (condition1) bitwise_operator (condition2) ] if you want to search for two conditions.

The condition statements are enclosed in parentheses and combined with logical bitwise operators like:

  • & denoting “and”,

  • | standing for “or”.

You can easily extend it to more than 2 conditions.

Here are some examples:

# single condition: select all entries with an urban residence type
stroke_dataset[stroke_dataset["Residence_type"] == "Urban"]
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 Male 67.0 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
3 60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
5 56669 Male 81.0 0 0 Yes Private Urban 186.21 29.0 formerly smoked 1
7 10434 Female 69.0 0 0 No Private Urban 94.39 22.8 never smoked 1
9 60491 Female 78.0 0 0 Yes Private Urban 58.57 24.2 Unknown 1
... ... ... ... ... ... ... ... ... ... ... ... ...
5101 36901 Female 45.0 0 0 Yes Private Urban 97.95 24.5 Unknown 0
5103 22127 Female 18.0 0 0 No Private Urban 82.85 46.9 Unknown 0
5105 18234 Female 80.0 1 0 Yes Private Urban 83.75 NaN never smoked 0
5106 44873 Female 81.0 0 0 Yes Self-employed Urban 125.20 40.0 never smoked 0
5109 44679 Female 44.0 0 0 Yes Govt_job Urban 85.28 26.2 Unknown 0

2596 rows × 12 columns

# multiple conditions: select all people with no heart disease who live in urban areas and have an average glucose level above 150
stroke_dataset[(stroke_dataset["heart_disease"] == 0) & (stroke_dataset["Residence_type"] == "Urban") & (stroke_dataset["avg_glucose_level"] > 150)]
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
3 60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
5 56669 Male 81.0 0 0 Yes Private Urban 186.21 29.0 formerly smoked 1
17 34120 Male 75.0 1 0 Yes Private Urban 221.29 25.8 smokes 1
21 13861 Female 52.0 1 0 Yes Self-employed Urban 233.29 48.9 never smoked 1
22 68794 Female 79.0 0 0 Yes Self-employed Urban 228.70 26.6 never smoked 1
... ... ... ... ... ... ... ... ... ... ... ... ...
5035 49773 Female 78.0 0 0 Yes Private Urban 203.36 28.7 formerly smoked 0
5048 28788 Male 40.0 0 0 Yes Private Urban 191.15 NaN smokes 0
5061 38009 Male 41.0 0 0 Yes Private Urban 223.78 32.3 never smoked 0
5063 68967 Male 39.0 0 0 Yes Private Urban 179.38 27.7 Unknown 0
5086 65411 Female 51.0 0 0 Yes Private Urban 152.56 21.8 Unknown 0

313 rows × 12 columns

Mini-exercise 1

Select entries that have either hypertension or heart disease.

# your code goes here
💡 Solution

One of the possible ways to achieve this result:

stroke_dataset[(stroke_dataset["hypertension"] == 1) | (stroke_dataset["heart_disease"] == 1)]

Selecting only specific columns after filtering for conditions#

In the previous examples, you obtained rows for all columns. If you are interested in only one or a couple of columns, you should use .loc and specify the condition where you would normally provide row labels, followed by the column label(s). It keeps this format:

dataframe.loc[condition, column_labels]

Take a look at the following examples:

stroke_dataset.loc[stroke_dataset["Residence_type"] == "Urban", "age"]
0       67.0
3       49.0
5       81.0
7       69.0
9       78.0
        ... 
5101    45.0
5103    18.0
5105    80.0
5106    81.0
5109    44.0
Name: age, Length: 2596, dtype: float64
# select 2 columns that meet 3 criteria
stroke_dataset.loc[(stroke_dataset["heart_disease"] == 0) & (stroke_dataset["Residence_type"] == "Urban") & (stroke_dataset["avg_glucose_level"] > 150), ["heart_disease", "gender"]]
heart_disease gender
3 0 Female
5 0 Male
17 0 Male
21 0 Female
22 0 Female
... ... ...
5035 0 Female
5048 0 Male
5061 0 Male
5063 0 Male
5086 0 Female

313 rows × 2 columns

Filtering string values#

You can also filter for string values that contain or start with a certain phrase using the .str attribute. Only a couple of examples are mentioned here, but you can explore more options in the pandas documentation.

The .str attribute only works on a pandas series, which is essentially a 1D NumPy array with additional labels from pandas [Pandas, 2025]. This means you need to select a single row or column before filtering for string phrases.

# obtain a Boolean pandas series (i.e. consisting of only True/False values)
# that is True wherever the value starts with "ne"

never_bool = stroke_dataset["smoking_status"].str.startswith("ne")

Some .str methods like .str.contains allow you to search for a phrase case-insensitively using the flags argument. This keyword accepts flags from Python’s regular expressions library re:

# re.IGNORECASE flag allows you to ignore cases
gov_bool = stroke_dataset["work_type"].str.contains("gov", flags=re.IGNORECASE) 

The series that these methods return can also be used as a condition for filtering pandas rows:

# return entries that contain "gov" in work_type and start with "ne" in smoking_status
stroke_dataset[gov_bool & never_bool]
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
34 14248 Male 48.0 0 0 No Govt_job Urban 84.20 29.7 never smoked 1
54 8752 Female 63.0 0 0 Yes Govt_job Urban 197.54 NaN never smoked 1
59 5111 Female 54.0 1 0 Yes Govt_job Urban 180.93 27.7 never smoked 1
99 4639 Female 69.0 0 0 Yes Govt_job Urban 82.81 28.0 never smoked 1
101 63973 Female 77.0 0 0 Yes Govt_job Rural 190.32 31.4 never smoked 1
... ... ... ... ... ... ... ... ... ... ... ... ...
5044 27616 Male 33.0 0 0 Yes Govt_job Rural 81.10 24.8 never smoked 0
5053 43496 Female 46.0 0 0 Yes Govt_job Urban 55.84 27.8 never smoked 0
5057 38349 Female 49.0 0 0 Yes Govt_job Urban 69.92 47.6 never smoked 0
5092 56799 Male 76.0 0 0 Yes Govt_job Urban 82.35 38.9 never smoked 0
5096 41512 Male 57.0 0 0 Yes Govt_job Rural 76.62 28.2 never smoked 0

282 rows × 12 columns

Handling missing data#

Conducting statistical analysis with missing data is sometimes non-trivial. In the stroke dataset, for instance, an entry may include values for every column except one. In this section you will work through some useful ways of handling this type of data.

Mini-exercise 2

We need to check which columns have missing values in stroke_dataset. Print out unique values for each column apart from “id” and report which categories have unknown data entries.

# your code goes here
💡 Solution

One of the possible ways to achieve this result:

labels = stroke_dataset.columns[1:]  # ignore the "id" column

for label in labels:
    print(label)
    print(stroke_dataset[label].unique())

# The columns "smoking_status" and "bmi" have missing data.

If you did the exercise above, you might have noticed that two columns have missing data and that they mark it differently: “smoking_status” uses the string “Unknown”, whereas “bmi” uses “nan”. What format is “nan”? Is it recognised as a missing value in pandas? Let’s check with .isnull():

# obtain a Boolean dataframe (i.e. consisting of only True/False values)
# True means a value is "nan"
df_isnull = stroke_dataset.isnull() 

# Boolean True/False values are treated as ones/zeros in Python
# summing along columns will indicate how many True values we have
df_isnull.sum()
id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

This shows that the “nan” values in “bmi” are recognised by pandas as missing. However, “Unknown” in “smoking_status” is just a string. It should be changed to pd.NA to streamline further data manipulation tasks.

Mini-exercise 3

Replace all “Unknown” instances with pd.NA in stroke_dataset.

# your code goes here
💡 Solution

One of the possible ways to achieve this result:

stroke_dataset.loc[stroke_dataset["smoking_status"] == "Unknown", "smoking_status"] = pd.NA
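
An alternative one-liner uses the .replace method to achieve the same result:

# swap every "Unknown" in the column for the pandas missing-value marker
stroke_dataset["smoking_status"] = stroke_dataset["smoking_status"].replace("Unknown", pd.NA)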

Choosing an appropriate way to handle missing data in statistical analysis can be a non-trivial task. Here you will implement one such approach, mean imputation, for a continuous variable.

Mini-exercise 4

Replace the missing values with the mean of all the other values in the “bmi” category.

Look up the pandas documentation to learn how to compute the mean without NA values.

# your code goes here
💡 Solution

One of the possible ways to achieve this result:

mean_val = stroke_dataset["bmi"].mean(skipna=True)

stroke_dataset.loc[stroke_dataset["bmi"].isnull(), "bmi"] = mean_val
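
The same imputation can also be written with .fillna, which replaces missing values in a single call:

# fill every missing "bmi" value with the column mean computed above
stroke_dataset["bmi"] = stroke_dataset["bmi"].fillna(mean_val)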

References#

[Pan25]

Pandas. pandas.Series — pandas documentation. https://pandas.pydata.org/docs/reference/api/pandas.Series.html, 2025. Accessed: 2025-06-25.