Module 4e: Exploratory data analysis#

In this section you will learn how:

  • to do exploratory data analysis in pandas,

  • to filter dataframes according to criteria,

  • to handle missing data.

In this section we will focus on pandas methods for exploratory data analysis. To highlight these features, we will use a larger file: the popular stroke prediction dataset from Kaggle, published by Federico Soriano Palacios. It contains columns with clinical features such as age, gender, pre-existing conditions such as hypertension, and smoking status. Each row represents the information for a single patient. You can read more about the dataset here.

import pandas as pd
import re # regular expressions library for checking string patterns

It is recommended to copy the commands from this section into your own Jupyter notebook and fetch the dataset yourself from here, to become better familiarised with pandas. However, if you prefer to work through this submodule in “live code” mode on the website, simply run the script below to load the dataset:

import sys
import os
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..', 'shared')))
import setup_code
stroke_dataset = setup_code.stroke_data

If you decided to run this section locally on your computer and downloaded the dataset from Kaggle, you might want to open it in Excel or a text editor to inspect it further. You will notice that it is a comma-separated values (CSV) file. You can load the dataset with pandas using pd.read_csv and specifying the filename.

# # You can comment/uncomment the following lines with: CTRL + / (or CMD + / on Mac)

# # specify the path to the dataset
# filepath = "/path/to/your/dataset/"
# stroke_dataset = pd.read_csv(filepath+"healthcare-dataset-stroke-data.csv")

Let us now explore the file and understand what data we are dealing with. Just running the cell with the dataframe name gives you a preview of the dataframe:

stroke_dataset
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 Male 67.0 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
1 51676 Female 61.0 0 0 Yes Self-employed Rural 202.21 NaN never smoked 1
2 31112 Male 80.0 0 1 Yes Private Rural 105.92 32.5 never smoked 1
3 60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
4 1665 Female 79.0 1 0 Yes Self-employed Rural 174.12 24.0 never smoked 1
... ... ... ... ... ... ... ... ... ... ... ... ...
5105 18234 Female 80.0 1 0 Yes Private Urban 83.75 NaN never smoked 0
5106 44873 Female 81.0 0 0 Yes Self-employed Urban 125.20 40.0 never smoked 0
5107 19723 Female 35.0 0 0 Yes Self-employed Rural 82.99 30.6 never smoked 0
5108 37544 Male 51.0 0 0 Yes Private Rural 166.29 25.6 formerly smoked 0
5109 44679 Female 44.0 0 0 Yes Govt_job Urban 85.28 26.2 Unknown 0

5110 rows × 12 columns
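
If you only need the dimensions rather than a preview, the .shape attribute reports them as a (rows, columns) tuple:

# number of rows and columns, without printing any data
stroke_dataset.shape
(5110, 12)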

Alternatively, you can choose to display only the first 5 rows (.head(...)) or the final 5 rows (.tail(...)). You can easily change the number of rows returned via the n=5 argument.

stroke_dataset.head(n=5) 
# or just stroke_dataset.head(5) without specifying "n=" as it is the only parameter in this method
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 Male 67.0 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
1 51676 Female 61.0 0 0 Yes Self-employed Rural 202.21 NaN never smoked 1
2 31112 Male 80.0 0 1 Yes Private Rural 105.92 32.5 never smoked 1
3 60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
4 1665 Female 79.0 1 0 Yes Self-employed Rural 174.12 24.0 never smoked 1

If you supply a negative integer to .head(), e.g. -10, it returns the dataframe without its last 10 rows.

stroke_dataset.head(-10) 
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 Male 67.00 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
1 51676 Female 61.00 0 0 Yes Self-employed Rural 202.21 NaN never smoked 1
2 31112 Male 80.00 0 1 Yes Private Rural 105.92 32.5 never smoked 1
3 60182 Female 49.00 0 0 Yes Private Urban 171.23 34.4 smokes 1
4 1665 Female 79.00 1 0 Yes Self-employed Rural 174.12 24.0 never smoked 1
... ... ... ... ... ... ... ... ... ... ... ... ...
5095 68598 Male 1.08 0 0 No children Rural 79.15 17.4 Unknown 0
5096 41512 Male 57.00 0 0 Yes Govt_job Rural 76.62 28.2 never smoked 0
5097 64520 Male 68.00 0 0 Yes Self-employed Urban 91.68 40.8 Unknown 0
5098 579 Male 9.00 0 0 No children Urban 71.88 17.5 Unknown 0
5099 7293 Male 40.00 0 0 Yes Private Rural 83.94 NaN smokes 0

5100 rows × 12 columns

stroke_dataset.tail(5) # equivalently stroke_dataset.tail(n=5)
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
5105 18234 Female 80.0 1 0 Yes Private Urban 83.75 NaN never smoked 0
5106 44873 Female 81.0 0 0 Yes Self-employed Urban 125.20 40.0 never smoked 0
5107 19723 Female 35.0 0 0 Yes Self-employed Rural 82.99 30.6 never smoked 0
5108 37544 Male 51.0 0 0 Yes Private Rural 166.29 25.6 formerly smoked 0
5109 44679 Female 44.0 0 0 Yes Govt_job Urban 85.28 26.2 Unknown 0

You will get even more comprehensive high-level information by running commands like stroke_dataset.info(), stroke_dataset.describe() or stroke_dataset.agg(...).

# this lists all columns, the number of rows/entries, the number of non-null values each column has 
# and the type of values stored in each column
stroke_dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB

In the .info() output, the numbers in brackets in the dtypes row tell you how many columns have a given type (types are explained further in module 2). In this example, each column has the type int64, float64 or object, meaning the values are stored as integers, floats or a generic “object”. The last one usually indicates that the column stores a mix of types, or that the values are all character strings, as in this dataset. The 64 in float64/int64 is the number of bits used to store each value.
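
If you only want the column types, the .dtypes attribute returns them as a pandas series indexed by column name:

# list only the dtype of each column
stroke_dataset.dtypes
id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object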

# this command computes summary statistics for every column with float or integer type
stroke_dataset.describe()
id age hypertension heart_disease avg_glucose_level bmi stroke
count 5110.000000 5110.000000 5110.000000 5110.000000 5110.000000 4909.000000 5110.000000
mean 36517.829354 43.226614 0.097456 0.054012 106.147677 28.893237 0.048728
std 21161.721625 22.612647 0.296607 0.226063 45.283560 7.854067 0.215320
min 67.000000 0.080000 0.000000 0.000000 55.120000 10.300000 0.000000
25% 17741.250000 25.000000 0.000000 0.000000 77.245000 23.500000 0.000000
50% 36932.000000 45.000000 0.000000 0.000000 91.885000 28.100000 0.000000
75% 54682.000000 61.000000 0.000000 0.000000 114.090000 33.100000 0.000000
max 72940.000000 82.000000 1.000000 1.000000 271.740000 97.600000 1.000000

The distribution is not useful for the stroke column, because it seems to contain only binary values. However, we are not completely sure, so let’s confirm it with .unique():

stroke_column = stroke_dataset['stroke'] # select only the 'stroke' column
print(stroke_column.unique()) # list the unique values in the 'stroke' column
[1 0]
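
The other two candidate columns can be checked in the same way:

# confirm that "hypertension" and "heart_disease" are also binary
for column in ["hypertension", "heart_disease"]:
    print(column, stroke_dataset[column].unique())
hypertension [0 1]
heart_disease [1 0]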

Now we can be certain that the “stroke”, “heart_disease” and “hypertension” columns contain only binary values. We can design our own general statistics dataframe for binary data using .agg(...) or .aggregate(...). We pass the method a list of strings naming the operations to apply; each operation is applied to every column independently. These can be commonly used pandas operations like mean, std, min, max, etc. Here we are interested in summing all the values to see how many non-zero entries are stored.

binary_columns = stroke_dataset[["stroke", "heart_disease", "hypertension"]] # select binary columns
binary_columns.agg(["count", "sum"]) # keep "count" to see how many entries are in total
stroke heart_disease hypertension
count 5110 5110 5110
sum 249 276 498

What about columns with string values classified as the object type? No problem: .describe(...) has this covered. The include argument lets you specify a list of column types to include. Here we are only interested in the object columns, so we pass a single-element list:

stroke_dataset.describe(include=['object'])
gender ever_married work_type Residence_type smoking_status
count 5110 5110 5110 5110 5110
unique 3 2 5 2 4
top Female Yes Private Urban never smoked
freq 2994 3353 2925 2596 1892

All the outputs from the summary functions, apart from .info(), are themselves dataframes and can be manipulated as such.
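
For example, since the output of .describe() is a dataframe, you can index it with .loc like any other:

# store the summary table and pick out a single statistic from it
summary = stroke_dataset.describe()
summary.loc["mean", "age"]  # the mean age, 43.226614 in the table above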

Filtering for values according to criteria#

You might be interested in further inspecting data that meet certain criteria. You can filter for them in a format like this:

dataframe[ (condition1) bitwise_operator (condition2) ] if you want to search for two conditions.

The condition statements are enclosed in parentheses and combined with logical bitwise operators like:

  • & denoting “and”,

  • | standing for “or”.

You can easily extend it to more than 2 conditions.

Here are some examples:

# single condition: select all entries with an urban residence type
stroke_dataset[stroke_dataset["Residence_type"] == "Urban"]
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 Male 67.0 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
3 60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
5 56669 Male 81.0 0 0 Yes Private Urban 186.21 29.0 formerly smoked 1
7 10434 Female 69.0 0 0 No Private Urban 94.39 22.8 never smoked 1
9 60491 Female 78.0 0 0 Yes Private Urban 58.57 24.2 Unknown 1
... ... ... ... ... ... ... ... ... ... ... ... ...
5101 36901 Female 45.0 0 0 Yes Private Urban 97.95 24.5 Unknown 0
5103 22127 Female 18.0 0 0 No Private Urban 82.85 46.9 Unknown 0
5105 18234 Female 80.0 1 0 Yes Private Urban 83.75 NaN never smoked 0
5106 44873 Female 81.0 0 0 Yes Self-employed Urban 125.20 40.0 never smoked 0
5109 44679 Female 44.0 0 0 Yes Govt_job Urban 85.28 26.2 Unknown 0

2596 rows × 12 columns

# multiple conditions: select all people with no heart disease who live in urban areas and have an average glucose level above 150
stroke_dataset[(stroke_dataset["heart_disease"] == 0) & (stroke_dataset["Residence_type"] == "Urban") & (stroke_dataset["avg_glucose_level"] > 150)]
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
3 60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
5 56669 Male 81.0 0 0 Yes Private Urban 186.21 29.0 formerly smoked 1
17 34120 Male 75.0 1 0 Yes Private Urban 221.29 25.8 smokes 1
21 13861 Female 52.0 1 0 Yes Self-employed Urban 233.29 48.9 never smoked 1
22 68794 Female 79.0 0 0 Yes Self-employed Urban 228.70 26.6 never smoked 1
... ... ... ... ... ... ... ... ... ... ... ... ...
5035 49773 Female 78.0 0 0 Yes Private Urban 203.36 28.7 formerly smoked 0
5048 28788 Male 40.0 0 0 Yes Private Urban 191.15 NaN smokes 0
5061 38009 Male 41.0 0 0 Yes Private Urban 223.78 32.3 never smoked 0
5063 68967 Male 39.0 0 0 Yes Private Urban 179.38 27.7 Unknown 0
5086 65411 Female 51.0 0 0 Yes Private Urban 152.56 21.8 Unknown 0

313 rows × 12 columns

Mini-exercise 1

Select entries that have either hypertension or heart disease.

# your code goes here
💡 Solution

One of the possible ways to achieve this result:

stroke_dataset[(stroke_dataset["hypertension"] == 1) | (stroke_dataset["heart_disease"] == 1)]

Selecting only specific columns after filtering for conditions#

In the previous examples, you obtained rows for all columns. If you are interested in only one or a couple of columns, you should use .loc and specify the condition where you would normally provide row labels, followed by the column label(s). It keeps this format:

dataframe.loc[condition, column_labels]

Take a look at the following examples:

stroke_dataset.loc[stroke_dataset["Residence_type"] == "Urban", "age"]
0       67.0
3       49.0
5       81.0
7       69.0
9       78.0
        ... 
5101    45.0
5103    18.0
5105    80.0
5106    81.0
5109    44.0
Name: age, Length: 2596, dtype: float64
# select 2 columns that meet 3 criteria
stroke_dataset.loc[(stroke_dataset["heart_disease"] == 0) & (stroke_dataset["Residence_type"] == "Urban") & (stroke_dataset["avg_glucose_level"] > 150), ["heart_disease", "gender"]]
heart_disease gender
3 0 Female
5 0 Male
17 0 Male
21 0 Female
22 0 Female
... ... ...
5035 0 Female
5048 0 Male
5061 0 Male
5063 0 Male
5086 0 Female

313 rows × 2 columns

Filtering string values#

You can also filter for string values that contain or start with a certain phrase using the .str attribute. Only a couple of examples are mentioned here, but you can explore more options in the pandas documentation.

The .str attribute only works on a pandas series, which is essentially a 1D NumPy array with additional labels from pandas [Pandas, 2025]. This means you need to select a single row or column before filtering for string phrases.

# obtain a Boolean pandas series (i.e. consisting of only True/False values)
# that is True wherever the value starts with "ne"

never_bool = stroke_dataset["smoking_status"].str.startswith("ne")

Some .str methods like .str.contains allow you to search for a phrase case-insensitively using the flags argument. This keyword accepts flags from Python’s regular expressions library re:

# re.IGNORECASE flag allows you to ignore cases
gov_bool = stroke_dataset["work_type"].str.contains("gov", flags=re.IGNORECASE) 

The series that these methods return can also be used as a condition for filtering pandas rows:

# return entries that contain "gov" in work_type and start with "ne" in smoking_status
stroke_dataset[gov_bool & never_bool]
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
34 14248 Male 48.0 0 0 No Govt_job Urban 84.20 29.7 never smoked 1
54 8752 Female 63.0 0 0 Yes Govt_job Urban 197.54 NaN never smoked 1
59 5111 Female 54.0 1 0 Yes Govt_job Urban 180.93 27.7 never smoked 1
99 4639 Female 69.0 0 0 Yes Govt_job Urban 82.81 28.0 never smoked 1
101 63973 Female 77.0 0 0 Yes Govt_job Rural 190.32 31.4 never smoked 1
... ... ... ... ... ... ... ... ... ... ... ... ...
5044 27616 Male 33.0 0 0 Yes Govt_job Rural 81.10 24.8 never smoked 0
5053 43496 Female 46.0 0 0 Yes Govt_job Urban 55.84 27.8 never smoked 0
5057 38349 Female 49.0 0 0 Yes Govt_job Urban 69.92 47.6 never smoked 0
5092 56799 Male 76.0 0 0 Yes Govt_job Urban 82.35 38.9 never smoked 0
5096 41512 Male 57.0 0 0 Yes Govt_job Rural 76.62 28.2 never smoked 0

282 rows × 12 columns

Handling missing data#

Conducting statistical analysis with missing data is sometimes non-trivial. In the stroke dataset, for instance, an entry may include values for every column except one. In this section you will work through some useful ways of handling this type of data.

Mini-exercise 2

We need to check which columns have missing values in stroke_dataset. Print out unique values for each column apart from “id” and report which categories have unknown data entries.

# your code goes here
💡 Solution

One of the possible ways to achieve this result:

labels = stroke_dataset.columns[1:]  # ignore the "id" column

for label in labels:
    print(label)
    print(stroke_dataset[label].unique())

# The columns "smoking_status" and "bmi" have missing data.

If you did the exercise above, you might have noticed that two columns have missing data and that they mark it differently: “smoking_status” uses the string “Unknown”, whereas “bmi” uses “nan”. What format is “nan”? Is it recognised as a missing value in pandas? Let’s check with .isnull():

# obtain a Boolean dataframe (i.e. consisting of only True/False values)
# True means a value is "nan"
df_isnull = stroke_dataset.isnull() 

# Boolean True/False values are treated as ones/zeros in Python
# summing along columns will indicate how many True values we have
df_isnull.sum()
id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

This shows that the “nan” values in “bmi” are recognised by pandas as missing. However, “Unknown” in “smoking_status” is just a string. It should be changed to pd.NA to streamline further data manipulation tasks.

Mini-exercise 3

Replace all “Unknown” instances with pd.NA in stroke_dataset.

# your code goes here
💡 Solution

One of the possible ways to achieve this result:

stroke_dataset.loc[stroke_dataset["smoking_status"] == "Unknown", "smoking_status"] = pd.NA
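
An alternative one-liner uses the .replace method to achieve the same result:

# swap every "Unknown" in the column for the pandas missing-value marker
stroke_dataset["smoking_status"] = stroke_dataset["smoking_status"].replace("Unknown", pd.NA)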

Choosing an appropriate way to handle missing data in statistical analysis can be a non-trivial task. Here you will implement one such approach, mean imputation, for a continuous variable.

Mini-exercise 4

Replace the missing values with the mean of all the other values in the “bmi” category.

Look up the pandas documentation to learn how to compute the mean without NA values.

# your code goes here
💡 Solution

One of the possible ways to achieve this result:

mean_val = stroke_dataset["bmi"].mean(skipna=True)

stroke_dataset.loc[stroke_dataset["bmi"].isnull(), "bmi"] = mean_val
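
The same imputation can also be written with .fillna, which replaces missing values in a single call:

# fill every missing "bmi" value with the column mean computed above
stroke_dataset["bmi"] = stroke_dataset["bmi"].fillna(mean_val)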

References#

[Pan25]

Pandas. pandas.Series — pandas documentation. https://pandas.pydata.org/docs/reference/api/pandas.Series.html, 2025. Accessed: 2025-06-25.