ENH: consistent strftime behaviour #58179

Open
1 of 3 tasks
smarie opened this issue Apr 8, 2024 · 0 comments
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

smarie commented Apr 8, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

strftime behaviour is currently inconsistent for datetime arrays, for example with small dates (year < 1000):

> import numpy as np
> import pandas as pd
> dta = pd.DatetimeIndex(np.array(['0020-01-01', '2020-01-02'], 'datetime64[s]'))
> dta.strftime(None)  # Default (fast, not os-dependent)
Index(['20-01-01', '2020-01-02'], dtype='object')  # no zero padding

> dta.strftime("%Y-%m-%d")  # Falls back on default, so fast and not os-dependent
Index(['20-01-01', '2020-01-02'], dtype='object')  # no zero padding

> dta.strftime("%Y_%m_%d")  # Custom format, uses Timestamp.strftime
Index(['0020_01_01', '2020_01_02'], dtype='object')  # Note the zero-padded year on Windows; on Linux the year is not padded

Also, Timestamp.strftime has the following behaviour:

> dta[0].strftime("%Y-%m-%d")  # Relies on datetime.strftime, so OS-dependent
'0020-01-01'  # Note the zero-padded year on Windows; on Linux the year is not padded
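
The same OS dependence can be seen with the standard library alone. The following is a minimal sketch, not pandas code: it formats a year-20 date once through `datetime.strftime`, whose `%Y` output may or may not be zero-padded depending on the platform (and possibly the Python version), and once through Python string formatting, whose output does not depend on the platform. The four-digit padding in the second case is a choice made for this illustration.

```python
from datetime import datetime

d = datetime(20, 1, 1)

# Goes through the platform's C strftime: the year may come out as
# '20' or '0020' depending on where this runs.
os_dependent = d.strftime("%Y-%m-%d")

# Python string formatting: the padding is decided here, not by the OS.
portable = f"{d.year:04d}-{d.month:02d}-{d.day:02d}"

print(os_dependent)
print(portable)  # '0020-01-01' on every platform
```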

A similar inconsistency can be observed with negative years:

> dta = pd.DatetimeIndex(np.array(['-0020-01-01', '2020-01-02'], 'datetime64[s]'))
> dta.strftime(None)  # Default (fast, not os-dependent)
Index(['-20-01-01', '2020-01-02'], dtype='object')  # note: year is not zero-padded

> dta.strftime("%Y-%m-%d")  # Falls back on default
Index(['-20-01-01', '2020-01-02'], dtype='object')  # note: year is not zero-padded

> dta.strftime("%Y_%m_%d")  # Custom so falls back on Timestamp.strftime
NotImplementedError: strftime not yet supported on Timestamps which are outside the range of Python's standard library. For now, please call the components you need (such as `.year` and `.month`) and construct your string from there.

> dta[0].strftime("%Y-%m-%d")  # Relies on datetime.strftime, so OS-dependent
NotImplementedError: strftime not yet supported on Timestamps which are outside the range of Python's standard library. For now, please call the components you need (such as `.year` and `.month`) and construct your string from there.
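
The error message suggests building the string from the individual components. A minimal sketch of that workaround follows, assuming a hypothetical helper `format_ymd` (not part of pandas) and an assumed convention of a leading sign followed by a four-digit zero-padded year:

```python
def format_ymd(year: int, month: int, day: int, sep: str = "-") -> str:
    """Format a date from its components, independently of the OS.

    Hypothetical helper: the sign-plus-four-digits convention for the
    year is an assumption made for this sketch.
    """
    sign = "-" if year < 0 else ""
    return f"{sign}{abs(year):04d}{sep}{month:02d}{sep}{day:02d}"


print(format_ymd(-20, 1, 1))          # '-0020-01-01'
print(format_ymd(2020, 1, 2, "_"))    # '2020_01_02'
```

With the array above, the components should be available as `dta.year`, `dta.month` and `dta.day`.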

Why is this so?

strftime was initially implemented through the C strftime, which is platform-dependent. This is also how CPython implements it, in datetime.strftime. (Note that the Python documentation makes some choices: for example, it shows the Windows-style output for %Y, because this is the format that strptime requires in order to parse it.)

However, strftime becomes slow when executed cell by cell on arrays, because the format string is parsed again for every cell. This problem was reported in #44764 and fixed, with appreciable performance gains; see #46116 (comment).
The fix bypasses the OS-dependent strftime and instead uses Python string formatting, which is much faster. One drawback is that users have no way of telling which engine is used in which case.
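
The idea behind the fast path can be sketched as follows. This is an illustration only, not the pandas implementation: `to_template`, `format_dates` and the `_FIELDS` mapping are hypothetical names, only three directives are handled, and the year is left unpadded to mirror the fast-path outputs shown earlier. The format string is converted to a Python string template once, and that template is then reused for every element.

```python
import re
from datetime import date

# Hypothetical mapping from a few strftime directives to str.format fields.
_FIELDS = {"Y": "{year}", "m": "{month:02d}", "d": "{day:02d}", "%": "%"}


def to_template(fmt: str) -> str:
    """Convert a strftime format into a str.format template."""
    # Literal braces must be doubled so str.format leaves them alone.
    escaped = fmt.replace("{", "{{").replace("}", "}}")

    def repl(match: re.Match) -> str:
        key = match.group(1)
        if key not in _FIELDS:
            raise ValueError(f"unsupported directive: %{key}")
        return _FIELDS[key]

    return re.sub(r"%(.)", repl, escaped)


def format_dates(dates, fmt: str) -> list:
    template = to_template(fmt)  # the format is parsed once, not per element
    return [template.format(year=d.year, month=d.month, day=d.day) for d in dates]


print(format_dates([date(20, 1, 1), date(2020, 1, 2)], "%Y_%m_%d"))
# ['20_01_01', '2020_01_02']
```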

Feature Description

I propose that we introduce an engine parameter to the datetime/period array-like strftime operation, in the same spirit as the engine parameter of to_parquet, which switches between pyarrow and fastparquet: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html

This parameter could have three values:

  • 'pystr': converts the strftime format to a Python string template and uses the Python string formatting engine. Raises an UnsupportedStrFmtDirective error when a strftime directive is not supported by the engine. This engine is significantly faster than 'os' on large arrays.
  • 'os': uses the platform (C library) strftime through datetime.strftime. This engine is OS-dependent.
  • 'auto' (default): if the strftime format can be successfully converted to a Python string template, uses 'pystr'. Otherwise, silently falls back on 'os'.
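
The three values could dispatch roughly as in the sketch below. It works on standard-library `date` objects rather than pandas arrays, and everything except the engine names and the `UnsupportedStrFmtDirective` error is a hypothetical illustration (the function names, the supported directives, and the unpadded year), not a proposed implementation.

```python
from datetime import date


class UnsupportedStrFmtDirective(ValueError):
    """Raised when the 'pystr' engine cannot convert a directive."""


# Hypothetical subset of directives supported by the 'pystr' engine.
_PYSTR_FIELDS = {"Y": "{year}", "m": "{month:02d}", "d": "{day:02d}"}


def _to_template(fmt: str) -> str:
    out, i = [], 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            code = fmt[i + 1]
            if code == "%":
                out.append("%")
            elif code in _PYSTR_FIELDS:
                out.append(_PYSTR_FIELDS[code])
            else:
                raise UnsupportedStrFmtDirective(f"%{code}")
            i += 2
        else:
            # Double literal braces so str.format leaves them alone.
            out.append(ch * 2 if ch in "{}" else ch)
            i += 1
    return "".join(out)


def strftime_sketch(dates, fmt: str, engine: str = "auto") -> list:
    if engine not in ("auto", "pystr", "os"):
        raise ValueError(f"unknown engine: {engine!r}")
    if engine != "os":
        try:
            template = _to_template(fmt)
        except UnsupportedStrFmtDirective:
            if engine == "pystr":
                raise  # 'pystr' reports the unsupported directive
            # 'auto' silently falls back on 'os' below
        else:
            return [template.format(year=d.year, month=d.month, day=d.day)
                    for d in dates]
    # 'os' engine: platform strftime, element by element.
    return [d.strftime(fmt) for d in dates]


dates = [date(20, 1, 1), date(2020, 1, 2)]
print(strftime_sketch(dates, "%Y-%m-%d", engine="pystr"))
# ['20-01-01', '2020-01-02']
```

With `engine="auto"`, a format such as `"%A"` would go through the 'os' path in this sketch, while `engine="pystr"` would raise `UnsupportedStrFmtDirective` for it.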

In addition, I propose to add an equivalent parameter to the instance-level Timestamp.strftime and Period.strftime.

Alternative Solutions

Alternatively, if we think that changing the engine used by the instance-level Timestamp.strftime and Period.strftime could hurt performance for some usages, we could sacrifice consistency between the array strftime and the instance strftime, and keep a default value of 'os' for the instance-level methods.

Additional Context

This was identified during discussions in #51298 (comment)
