Module 4b: Selecting rows and columns in pandas#
In this section you will learn how:
to select specific rows and columns in a dataframe,
to fill a dataframe with values.
Selecting rows and columns in a pandas dataframe follows different rules from accessing elements in NumPy arrays or Python lists. You can do that either by specifying their labels or indices.
import pandas as pd
import numpy as np
# load the dataframe from the previous section, method is further explained in module 4d
transposed_dict_df = pd.read_csv("transposed_dict_df-4a.csv")
Selection by labels - columns#
This is one of the main ways to select a single column in a Pandas dataframe by its name: referring to the label as a string character in brackets:
# select a single column by name
transposed_dict_df["height"]
0 175.00
1 1.79
2 1.75
3 1.63
4 1.68
5 1.59
6 1.90
7 1.71
8 1.66
9 1.78
Name: height, dtype: float64
Now if you want to select multiple columns, you need to put their names in a Python list, that then goes into the dataframe square brackets:
# select multiple columns by name
transposed_dict_df[["height", "weight"]]
height | weight | |
---|---|---|
0 | 175.00 | 70.9 |
1 | 1.79 | 103.4 |
2 | 1.75 | 85.1 |
3 | 1.63 | 55.9 |
4 | 1.68 | 50.0 |
5 | 1.59 | 64.0 |
6 | 1.90 | 92.9 |
7 | 1.71 | 75.4 |
8 | 1.66 | 90.7 |
9 | 1.78 | 82.3 |
Mini-exercise 1
Give names to the rows in transposed_dict_df
as they used to be - i.e. row 0 gets label “participant1” etc. Don’t type participant labels manually, automate it.
# your code goes here
💡 Solution
One of the possible ways to achieve this result:
participant_labels = []
n_rows = transposed_dict_df.shape[0]
for i in range(0, n_rows):
label = "participant"+str(i+1)
participant_labels.append(label)
transposed_dict_df.index = participant_labels
Selection by labels - rows#
Let us reuse the dataframe that you updated in the exercise above. To select a row by its label, you need to use .loc
:
# selecting a single row
print("Single row selection")
print(transposed_dict_df.loc["participant1"])
print("------")
print("Multiple row selection")
# selecting multiple rows - you need to put the labels in a list
# like in multiple column selection previously
transposed_dict_df.loc[["participant1", "participant8"]]
Single row selection
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[6], line 3
1 # selecting a single row
2 print("Single row selection")
----> 3 print(transposed_dict_df.loc["participant1"])
4 print("------")
5 print("Multiple row selection")
File ~\AppData\Local\Programs\Python\Python313\Lib\site-packages\pandas\core\indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
1189 maybe_callable = com.apply_if_callable(key, self.obj)
1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File ~\AppData\Local\Programs\Python\Python313\Lib\site-packages\pandas\core\indexing.py:1431, in _LocIndexer._getitem_axis(self, key, axis)
1429 # fall thru to straight lookup
1430 self._validate_key(key, axis)
-> 1431 return self._get_label(key, axis=axis)
File ~\AppData\Local\Programs\Python\Python313\Lib\site-packages\pandas\core\indexing.py:1381, in _LocIndexer._get_label(self, label, axis)
1379 def _get_label(self, label, axis: AxisInt):
1380 # GH#5567 this will fail if the label is not present in the axis.
-> 1381 return self.obj.xs(label, axis=axis)
File ~\AppData\Local\Programs\Python\Python313\Lib\site-packages\pandas\core\generic.py:4301, in NDFrame.xs(self, key, axis, level, drop_level)
4299 new_index = index[loc]
4300 else:
-> 4301 loc = index.get_loc(key)
4303 if isinstance(loc, np.ndarray):
4304 if loc.dtype == np.bool_:
File ~\AppData\Local\Programs\Python\Python313\Lib\site-packages\pandas\core\indexes\range.py:417, in RangeIndex.get_loc(self, key)
415 raise KeyError(key) from err
416 if isinstance(key, Hashable):
--> 417 raise KeyError(key)
418 self._check_indexing_error(key)
419 raise KeyError(key)
KeyError: 'participant1'
Selecting both rows and columns by label#
If you want to select rows and columns by label together, you need to specify rows with .loc
first and then in the next brackets list desired columns.
# single row and single column
print(transposed_dict_df.loc["participant8"]["weight"])
75.4
Mini-exercise 2
Select smoking and diseases data for participants 5 and 9 in one line using labels.
# your code goes here
💡 Solution
One of the possible ways to achieve this result:
print(transposed_dict_df.loc[["participant5", "participant9"]][["smoking", "diseases"]])
Selecting values by their integer position#
If you don’t want to use labels, row and column selection can be done with indices - you need to use .iloc[.., ..]
for this. For example, to access diseases (column 5) present in participant 3 (row 2), you need to type transposed_dict_df.iloc[2, 5]
. Pandas dataframes also use 0-based indexing, slicing rules are very similar to NumPy arrays.
# selecting a single value
print("Single value")
print(transposed_dict_df.iloc[2, 5])
print("----")
print("Range")
# select a range (the end position 7, 5 are not included in the selected range)
print(transposed_dict_df.iloc[2:7, 3:5])
print("----")
print("Specific values")
# selecting multiple specific values - provide indexes in a list
print(transposed_dict_df.iloc[[0, 5, 7], [0, 1, 2]])
print("----")
print("Entire row")
# select the entire row
print(transposed_dict_df.iloc[2, :])
print("----")
print("Entire column")
# select the entire column
print(transposed_dict_df.iloc[:, 0])
Single value
('Lung cancer',)
----
Range
weight smoking
participant3 85.1 True
participant4 55.9 False
participant5 50.0 False
participant6 64.0 True
participant7 92.9 False
----
Specific values
gender age height
participant1 female 25 175.00
participant6 female 32 1.59
participant8 male 46 1.71
----
Entire row
gender male
age 18
height 1.75
weight 85.1
smoking True
diseases ('Lung cancer',)
Name: participant3, dtype: object
----
Entire column
participant1 female
participant2 male
participant3 male
participant4 female
participant5 female
participant6 female
participant7 male
participant8 male
participant9 female
participant10 male
Name: gender, dtype: object
Filling out values in a datafame#
Sometimes you might want to change values in a dataframe. Here are some examples for how to do it for a single value or larger data structures:
transposed_dict_df.iloc[[0, 5, 7], [0, 1, 2]] = np.nan # insert a null value
transposed_dict_df.iloc[[0, 5, 7], [0, 1, 2]]
gender | age | height | |
---|---|---|---|
participant1 | NaN | NaN | NaN |
participant6 | NaN | NaN | NaN |
participant8 | NaN | NaN | NaN |
# can fill out the range of values with a numpy array, Python lists or a Pandas dataframe
revised_weight_height = np.array([[1.88, 1.65, 1.72], [50, 73, 87]])
transposed_dict_df.iloc[[1, 2, 3], [2, 3]] = revised_weight_height.T
transposed_dict_df.iloc[[1, 2, 3], [2, 3]]
height | weight | |
---|---|---|
participant2 | 1.88 | 50.0 |
participant3 | 1.65 | 73.0 |
participant4 | 1.72 | 87.0 |
Mini-exercise 3
Participant 1 has their height recorded in centimeters in transposed_dict_df
. Change it to meters and update the dataframe.
# your code goes here
💡 Solution
One of the possible ways to achieve this result:
transposed_dict_df.loc["participant1"][["height"]] = 1.75
Save the dataframe to work on it in the next section. The method .to_csv
will be explained in more detail in module 4d.
transposed_dict_df.to_csv("transposed_dict_df-4b.csv")