4 minute read

Here’s an interesting puzzle I encountered.

I’m working with a dataset with two features of interest: naive datetimes (naive meaning without any timezone information) and timezones. What I’d like is a new feature that has all the date and time information in the datetime feature, except converted to UTC time.

Seems simple enough, right? Well… no. Especially not for a very, very big dataset.

robot foot race
It's a race! By DALL-E 2

The Data

Here’s a simplified version of the dataset I’m working with:

data.head()
datetime_naive timezone
0 2022-10-08 19:13:00 America/New_York
1 2022-11-04 08:00:00 America/Los_Angeles
2 2023-04-05 09:00:00 America/New_York
3 2022-11-25 14:40:00 America/Denver
4 2023-03-11 09:00:00 America/New_York
data.dtypes
datetime_naive    datetime64[ns]
timezone                  object
dtype: object

As promised, there’s datetime_naive, which contains our date/time information, and timezone, which contains the timezone string.

Now, if every value in the datetime_naive feature belonged to the same timezone (say, “America/New_York”), converting to UTC after first localizing to East Coast time would be trivial:

tz = 'America/New_York'

data['datetime_nyc'] = data['datetime_naive'].dt.tz_localize(tz)
data['datetime_utc'] = data['datetime_nyc'].dt.tz_convert('UTC')

data.head()
datetime_naive timezone datetime_nyc datetime_utc
0 2022-10-08 19:13:00 America/New_York 2022-10-08 19:13:00-04:00 2022-10-08 23:13:00+00:00
1 2022-11-04 08:00:00 America/Los_Angeles 2022-11-04 08:00:00-04:00 2022-11-04 12:00:00+00:00
2 2023-04-05 09:00:00 America/New_York 2023-04-05 09:00:00-04:00 2023-04-05 13:00:00+00:00
3 2022-11-25 14:40:00 America/Denver 2022-11-25 14:40:00-05:00 2022-11-25 19:40:00+00:00
4 2023-03-11 09:00:00 America/New_York 2023-03-11 09:00:00-05:00 2023-03-11 14:00:00+00:00
data.dtypes
datetime_naive                      datetime64[ns]
timezone                                    object
datetime_nyc      datetime64[ns, America/New_York]
datetime_utc                   datetime64[ns, UTC]
dtype: object

First I localized datetime_naive to East Coast time (datetime_nyc), and then I converted that to UTC (datetime_utc). You’ll notice that datetime_naive and datetime_nyc are almost identical, except the localized version includes the UTC offset information, which above is either -04:00 or -05:00 depending on DST. The dtypes differ, as well, with datetime_nyc typed as a localized datetime64 object.

Then, I converted the localized datetime_nyc to UTC time, so you’ll notice that the UTC offset for datetime_utc (as expected) is now +00:00. Its dtype is localized to UTC, as well.
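
A quick aside, since those DST-dependent offsets came up: localizing naive timestamps can trip over DST transitions, where a wall-clock time either occurs twice or never occurs at all. This dataset didn’t need it, but tz_localize takes ambiguous and nonexistent arguments for exactly those edge cases. Something like:

# Illustration only: mark fall-back duplicates and spring-forward gaps as NaT
# ('infer', 'shift_forward', etc. are other options)
data['datetime_nyc'] = data['datetime_naive'].dt.tz_localize(
    tz, ambiguous='NaT', nonexistent='NaT'
)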

So, that’s easy enough. But that isn’t our problem. Our problem is complicated by the fact that each value in datetime_naive comes with its own timezone, so calling .dt.tz_localize() won’t work: we aren’t localizing the entire feature to a single timezone.
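
To make that concrete, the naive thing you might try first (a hypothetical snippet) falls over, because tz_localize expects one timezone for the whole Series:

# Doesn't work: tz_localize wants a single timezone (a string or tzinfo object),
# not a per-row Series of timezone names; in the pandas versions I've used, this errors out.
data['datetime_utc'] = data['datetime_naive'].dt.tz_localize(data['timezone'])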

Solution 1: .apply() yourself!

robot foot race
Let's see what you can do. By DALL-E 2

My first thought was to convert each datetime_naive record individually using .apply() in tandem with the pytz library. Something like this:

import pytz

def convert_to_utc(dt, tz):
    # Build the timezone object, attach it to the naive datetime, then convert to UTC
    tz = pytz.timezone(tz)
    localized = tz.localize(dt)
    utc = localized.astimezone(pytz.utc)
    return utc

data['datetime_utc'] = data.apply(lambda row: convert_to_utc(row['datetime_naive'], row['timezone']), axis=1)
data.head()
datetime_naive timezone datetime_utc
0 2022-10-08 19:13:00 America/New_York 2022-10-08 23:13:00+00:00
1 2022-11-04 08:00:00 America/Los_Angeles 2022-11-04 15:00:00+00:00
2 2023-04-05 09:00:00 America/New_York 2023-04-05 13:00:00+00:00
3 2022-11-25 14:40:00 America/Denver 2022-11-25 21:40:00+00:00
4 2023-03-11 09:00:00 America/New_York 2023-03-11 14:00:00+00:00
data.dtypes
datetime_naive         datetime64[ns]
timezone                       object
datetime_utc      datetime64[ns, UTC]
dtype: object

Using .apply(), we call convert_to_utc() for each row of the dataframe, passing the paired datetime_naive and timezone values as arguments. convert_to_utc() then localizes the datetime_naive value according to the timezone before converting the localized value to UTC.
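
As a quick sanity check, calling the helper on a single value reproduces the first row of the table above:

from datetime import datetime

convert_to_utc(datetime(2022, 10, 8, 19, 13), 'America/New_York')
# returns 2022-10-08 23:13:00+00:00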

Now, this works. Sort of. This dataset has 1,000 records. The one I was working with when I encountered this problem had millions. I’m not sure how long the operation took because I was too impatient, so let’s just say it took a while. .apply() is not the most efficient route.

How to proceed? I scratched my head over this for a while. At first I thought maybe I could chunk the dataset and run the same .apply() call on one (smaller) chunk at a time; after all, it works well enough for 1,000 records, doesn’t it? My thinking was that maybe I was bumping up against a memory issue. I implemented this (sketched below), but was still too impatient to find out how long it took to process the entire dataset.
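
For the record, the chunked version looked roughly like this (a sketch of the idea; the chunk size is arbitrary):

import pandas as pd

def chunked_convert_to_utc(df, chunk_size=100_000):
    results = []
    for start in range(0, len(df), chunk_size):
        # Same row-by-row .apply() as before, just on a smaller slice at a time
        chunk = df.iloc[start:start + chunk_size]
        results.append(
            chunk.apply(
                lambda row: convert_to_utc(row['datetime_naive'], row['timezone']),
                axis=1,
            )
        )
    return pd.concat(results)

In hindsight, chunking was never going to fix the speed problem: it only limits how much is in memory at once, and every row still pays the same per-row Python overhead.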

Solution 2: Vector? Check, sir

robot foot race
Specter of the vector. By DALL-E 2

Knowing how fast it was to simply localize an entire datetime feature to a given timezone and then convert that, I kept on thinking that there must be a vectorized solution to this. Maybe if I process all the datetimes for a given timezone at once?

That ended up looking something like this:

def vectorized_convert_to_utc(df, dt_col, tz_col):
    # Get the list of unique timezones
    tzs = df[tz_col].unique()

    for tz in tzs:
        # Create mask for the current timezone
        is_current_tz = (df[tz_col] == tz)

        # Localize all datetime records matching `tz` and convert to UTC
        df.loc[is_current_tz, 'datetime_utc'] = (
            df.loc[is_current_tz, dt_col].dt.tz_localize(tz).dt.tz_convert('UTC')
        )

    return df['datetime_utc']

data['datetime_utc'] = vectorized_convert_to_utc(data, 'datetime_naive', 'timezone')
data.head()
datetime_naive timezone datetime_utc
0 2022-10-08 19:13:00 America/New_York 2022-10-08 23:13:00+00:00
1 2022-11-04 08:00:00 America/Los_Angeles 2022-11-04 15:00:00+00:00
2 2023-04-05 09:00:00 America/New_York 2023-04-05 13:00:00+00:00
3 2022-11-25 14:40:00 America/Denver 2022-11-25 21:40:00+00:00
4 2023-03-11 09:00:00 America/New_York 2023-03-11 14:00:00+00:00
data.dtypes
datetime_naive         datetime64[ns]
timezone                       object
datetime_utc      datetime64[ns, UTC]
dtype: object

The result is the same here, but with my other dataset of millions of records (instead of 1,000) the two approaches were miles apart, since the vectorized computation actually finished.
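
If you prefer groupby to an explicit mask loop, an equivalent-looking variation (a sketch, not the code I benchmarked) would be:

import pandas as pd

def groupby_convert_to_utc(df, dt_col, tz_col):
    parts = [
        # Localize each timezone's group in one vectorized shot, then convert to UTC
        group.dt.tz_localize(tz).dt.tz_convert('UTC')
        for tz, group in df.groupby(tz_col)[dt_col]
    ]
    # Stitch the groups back together in the original row order
    return pd.concat(parts).reindex(df.index)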

The results?

robot foot race
We got a winner, folks. By DALL-E 2

Let’s speed-test the two approaches side by side:

# `.apply()`
%timeit data.apply(lambda row: convert_to_utc(row['datetime_naive'], row['timezone']), axis=1)
136 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Vectorized
%timeit vectorized_convert_to_utc(data, 'datetime_naive', 'timezone')
2.32 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Hot damn! That right there is why the vectorized solution crunched through millions of records no problem while .apply() left me hanging: it’s ~60 times faster. Which checks out, since even the vectorized version took maybe 5 seconds to execute on the full dataset on my machine, meaning .apply() would’ve taken around 5 minutes.
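
And just to make sure the speedup isn’t buying a different answer, a quick check (again, a sketch) that both approaches agree on every row:

apply_result = data.apply(lambda row: convert_to_utc(row['datetime_naive'], row['timezone']), axis=1)
vectorized_result = vectorized_convert_to_utc(data, 'datetime_naive', 'timezone')

# Both should produce identical UTC timestamps for every record
assert (apply_result == vectorized_result).all()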

Tech:

Python, Pandas