ENH: consistent strftime behaviour #58179

smarie · 2024-04-08T09:07:33Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

strftime behaviour is currently inconsistent for datetime arrays, for example with small dates (year < 1000):

> dta = pd.DatetimeIndex(np.array(['0020-01-01', '2020-01-02'], 'datetime64[s]'))
> dta.strftime(None)  # Default (fast, not os-dependent)
Index(['20-01-01', '2020-01-02'], dtype='object')  # no zero padding

> dta.strftime("%Y-%m-%d")  # Falls back on default, so fast and not os-dependent
Index(['20-01-01', '2020-01-02'], dtype='object')  # no zero padding

> dta.strftime("%Y_%m_%d")  # Custom format, uses Timestamp.strftime
Index(['0020_01_01', '2020_01_02'], dtype='object')  # Note the zero-padding on the year on windows, that does not happen on linux

Also, Timestamp.strftime has the following behaviour:

> dta[0].strftime("%Y-%m-%d")  # Relies on datetime.strftime > OS-dependent.
'0020-01-01'  # Note the zero-padding on the year on windows, that does not happen on linux
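
The OS dependence described above can be reproduced with the standard library alone, because `datetime.strftime` delegates to the platform C library. The following is a minimal sketch, not pandas code: the helper name `format_ymd` is hypothetical, and the exact OS-dependent output may also vary with the Python version.

```python
from datetime import datetime

d = datetime(20, 1, 1)

# Platform-dependent: may print '20-01-01' (e.g. with glibc on Linux)
# or '0020-01-01' (e.g. on Windows), depending on the C library.
os_result = d.strftime("%Y-%m-%d")


def format_ymd(dt: datetime) -> str:
    """Format as YYYY-MM-DD with explicit zero padding (hypothetical helper).

    Uses Python string formatting only, so the result does not depend
    on the platform's strftime implementation.
    """
    return f"{dt.year:04d}-{dt.month:02d}-{dt.day:02d}"


portable_result = format_ymd(d)
print(os_result, portable_result)
```

The second value is always zero-padded, whereas the first one depends on where the code runs.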

A similar inconsistency can be observed with negative dates:

> dta = pd.DatetimeIndex(np.array(['-0020-01-01', '2020-01-02'], 'datetime64[s]'))
> dta.strftime(None)  # Default (fast, not os-dependent)
Index(['-20-01-01', '2020-01-02'], dtype='object')  # note: year is not zero-padded

> dta.strftime("%Y-%m-%d")  # Falls back on default
Index(['-20-01-01', '2020-01-02'], dtype='object')  # note: year is not zero-padded

> dta.strftime("%Y_%m_%d")  # Custom so falls back on Timestamp.strftime
NotImplementedError: strftime not yet supported on Timestamps which are outside the range of Python's standard library. For now, please call the components you need (such as `.year` and `.month`) and construct your string from there.

> dta[0].strftime("%Y-%m-%d")  # Relies on datetime.strftime > OS-dependent.
NotImplementedError: strftime not yet supported on Timestamps which are outside the range of Python's standard library. For now, please call the components you need (such as `.year` and `.month`) and construct your string from there.

Why is this so?

strftime was initially implemented through the C strftime, which is platform-dependent. This is how CPython implements it too, in datetime.strftime. (Note that the Python documentation makes some choices here: for example, it shows the Windows-style output for %Y, because this is the format that strptime requires in order to parse it.)

However, strftime becomes slow when executed cell by cell on arrays, because the format string is parsed again for every cell. This problem was detected in #44764 and fixed, with appreciable performance gains, in #46116 (comment).
The fix bypasses the OS-dependent strftime and instead uses Python string formatting, which is much faster. One drawback is the lack of transparency for users about which engine is used when.
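
To illustrate the idea behind the fix, here is a minimal sketch of converting a strftime format into a Python string template once, then reusing it for every element. This is not the pandas implementation: the names `convert_strftime_format`, `format_all` and `UnsupportedDirective` are hypothetical, only a handful of directives are covered, and zero-padding the year to four digits is an assumption made for the example.

```python
from datetime import datetime

# Hypothetical mapping from strftime directive letters to str.format fields.
# Padding the year to 4 digits is an assumption chosen for this sketch.
_FIELDS = {
    "Y": "{year:04d}",
    "m": "{month:02d}",
    "d": "{day:02d}",
    "H": "{hour:02d}",
    "M": "{minute:02d}",
    "S": "{second:02d}",
    "%": "%",  # '%%' is a literal percent sign
}


class UnsupportedDirective(ValueError):
    """Raised when a directive has no string-template equivalent here."""


def convert_strftime_format(fmt: str) -> str:
    """Translate a strftime format into a str.format template."""
    parts = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%":
            code = fmt[i + 1 : i + 2]  # empty string if '%' is the last char
            if code not in _FIELDS:
                raise UnsupportedDirective(f"unsupported directive: %{code}")
            parts.append(_FIELDS[code])
            i += 2
        else:
            # Escape literal braces so str.format leaves them untouched.
            parts.append(ch.replace("{", "{{").replace("}", "}}"))
            i += 1
    return "".join(parts)


def format_all(values, fmt: str) -> list:
    """Format every value, parsing the strftime format only once."""
    template = convert_strftime_format(fmt)
    return [
        template.format(
            year=v.year, month=v.month, day=v.day,
            hour=v.hour, minute=v.minute, second=v.second,
        )
        for v in values
    ]


print(format_all([datetime(20, 1, 1), datetime(2020, 1, 2)], "%Y_%m_%d"))
```

Because the template is built before the loop, the per-element cost is a single `str.format` call, and the output no longer depends on the platform C library.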

Feature Description

I propose that we introduce an engine parameter to the datetime/period array-like strftime operation, in the same spirit as the pyarrow/fastparquet engine switch in to_parquet: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html

This parameter could have three values:

'pystr': converts the strftime format to a Python string template, and uses the Python string formatting engine. Raises an UnsupportedStrFmtDirective error when a strftime directive is not supported by the engine. This engine is significantly faster than 'os' on large arrays.
'os': uses the platform (C library) strftime through datetime.strftime. This engine is OS-dependent.
'auto' (default): if the strftime format can be successfully converted to a Python string template, uses 'pystr'. Otherwise, silently falls back to 'os'.
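
The three engine values could dispatch roughly as in the sketch below. It is written against plain `datetime` objects rather than pandas arrays, and everything in it is an assumption made for illustration: the function name `strftime_with_engine`, the small set of supported directives, the four-digit year padding, and the stand-in definition of `UnsupportedStrFmtDirective` (only the error name comes from the proposal).

```python
import re
from datetime import datetime


class UnsupportedStrFmtDirective(ValueError):
    """Stand-in definition; only the name comes from the proposal."""


# Hypothetical subset of directives supported by the 'pystr' engine.
_PY_FIELDS = {"Y": "{0.year:04d}", "m": "{0.month:02d}", "d": "{0.day:02d}", "%": "%"}


def _to_template(fmt: str) -> str:
    """Convert a strftime format to a str.format template, or raise."""
    def repl(match):
        code = match.group(1)
        if code not in _PY_FIELDS:
            raise UnsupportedStrFmtDirective(f"unsupported directive: %{code}")
        return _PY_FIELDS[code]

    # Escape literal braces first, then substitute each '%x' directive.
    escaped = fmt.replace("{", "{{").replace("}", "}}")
    return re.sub(r"%(.)", repl, escaped)


def strftime_with_engine(values, fmt: str, engine: str = "auto") -> list:
    """Format datetimes with the requested engine (hypothetical API)."""
    if engine not in ("auto", "pystr", "os"):
        raise ValueError(f"unknown engine: {engine!r}")
    if engine in ("auto", "pystr"):
        try:
            template = _to_template(fmt)
        except UnsupportedStrFmtDirective:
            if engine == "pystr":
                raise  # explicit 'pystr' never falls back
            # 'auto' silently falls through to the 'os' engine below
        else:
            return [template.format(v) for v in values]
    # 'os' engine: platform-dependent C strftime via datetime.strftime
    return [v.strftime(fmt) for v in values]


dates = [datetime(20, 1, 1), datetime(2020, 1, 2)]
print(strftime_with_engine(dates, "%Y_%m_%d", engine="pystr"))
print(strftime_with_engine(dates, "%Y_%m_%d", engine="os"))
```

With an explicit `engine="pystr"`, an unsupported directive such as `%A` raises instead of quietly changing behaviour, which is the transparency the proposal is after.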

In addition, I propose to add an equivalent parameter to the instance-level Timestamp.strftime and Period.strftime.

Alternative Solutions

Alternatively, if we think that applying this change to the instance-level Timestamp.strftime and Period.strftime could negatively impact performance for some use cases, we could sacrifice consistency between the array strftime and the instance strftime, and keep 'os' as the default value for the instance-level methods.

Additional Context

This was identified during discussions in #51298 (comment)


smarie added the Enhancement and Needs Triage labels on Apr 8, 2024

smarie mentioned this issue on Apr 8, 2024 in: [READY] perf improvements for strftime #51298
