3 minute read

How do you perform element-wise operations on a DataFrame and another element data structure?

Take the following dataframe:

data={
    'hot_dogs': [5, 11, 3, 14, 15],
    'pizza':[9, 2, 11, 12, 6],
    'burgers':[8, 9, 7, 5, 6]
}
years = pd.Series([2015, 2016, 2017, 2018, 2019], name='year')

df = pd.DataFrame(data=data, index=years)

df
hot_dogs pizza burgers
year
2015 5 9 8
2016 11 2 9
2017 3 11 7
2018 14 12 5
2019 15 6 6

Let’s say that, for each year, you want to compute each item as a proportion of the total items sold. In 2015, for instance, we’d expect a proportion of .23 for hot dogs (5 / (5 + 9 + 8)) and .41 for pizza (9 / (5 + 9 + 8)).

It’s simple enough to find the sum of each row:

annual_sum = df.sum(axis=1)
annual_sum
year
2015    22
2016    22
2017    21
2018    31
2019    27
dtype: int64

So just a matter of dividing our dataframe by the sum of each of its rows, right?

df / annual_sum
2015 2016 2017 2018 2019 burgers hot_dogs pizza
year
2015 NaN NaN NaN NaN NaN NaN NaN NaN
2016 NaN NaN NaN NaN NaN NaN NaN NaN
2017 NaN NaN NaN NaN NaN NaN NaN NaN
2018 NaN NaN NaN NaN NaN NaN NaN NaN
2019 NaN NaN NaN NaN NaN NaN NaN NaN

Uh oh, what the heck happened?

Well, Pandas is attempting to align the columns of df (‘pizza’, ‘burgers’, ‘hot_dogs’) with the index of annual_sum (‘2015’, ‘2016’, ‘2017’, ‘2018’, ‘2019’) and then perform the division, but the two do not align. For this operation to work, each pizza value in df would need to be divided be a value sharing that same index pizza.

Although it doesn’t serve our purposes here, the following illustrates this behaviour:

foods = pd.Series([5, 10, 20], index=['pizza', 'burgers', 'pie'])
foods
pizza       5
burgers    10
pie        20
dtype: int64
df / foods
burgers hot_dogs pie pizza
year
2015 0.8 NaN NaN 1.8
2016 0.9 NaN NaN 0.4
2017 0.7 NaN NaN 2.2
2018 0.5 NaN NaN 2.4
2019 0.6 NaN NaN 1.2

Now, each value in df['burgerss'] is divided by 10 (the burgers-indexed value in foods), and each value in df['pizza'] is divded by 5 (the pizza-indexed value in foods). But notice the NaNs that result: NaNs for the df['hot_dogs'] values, which in this case do not have a matching divisor in foods, and NaN’s for a new column pie, since there was no matching dividend in df.

This is not what we want. To perform this division operation elementwise for each row, we’d instead have to use an object method: pd.DataFrame.div.

On it’s own, this method produces a behavior idential to that which we observed before:

df.div(annual_sum)
2015 2016 2017 2018 2019 burgers hot_dogs pizza
year
2015 NaN NaN NaN NaN NaN NaN NaN NaN
2016 NaN NaN NaN NaN NaN NaN NaN NaN
2017 NaN NaN NaN NaN NaN NaN NaN NaN
2018 NaN NaN NaN NaN NaN NaN NaN NaN
2019 NaN NaN NaN NaN NaN NaN NaN NaN

Why? Because by default the axis 1 or columns, which means that pandas is once again attempting to match the columns of df (‘pizza’, ‘burgers’, ‘hot_dogs’) with the index of annual_sum (‘2015’, ‘2016’, ‘2017’, ‘2018’, ‘2019’), which we already saw doesn’t work.

All we have to do is specify axis=0 so that we’re matching the index of df to annual sum, and in this case there is a match, since we’re performing division on each 2015 value of df and the matching 2015 value of annual_sum, and so on:

df.div(annual_sum, axis=0)
hot_dogs pizza burgers
year
2015 0.227273 0.409091 0.363636
2016 0.500000 0.090909 0.409091
2017 0.142857 0.523810 0.333333
2018 0.451613 0.387097 0.161290
2019 0.555556 0.222222 0.222222

Now we know that pizza apparently accounted for almsot 41% of all sales in 2015 but only 9% the following year–weird!

Tech:

Python Pandas

Tags:

Updated: