`strftime` behaviour is currently inconsistent for datetime arrays, for example with small dates (year < 1000):
```python
>>> import numpy as np
>>> import pandas as pd
>>> dta = pd.DatetimeIndex(np.array(['0020-01-01', '2020-01-02'], 'datetime64[s]'))
>>> dta.strftime(None)  # Default (fast, not OS-dependent)
Index(['20-01-01', '2020-01-02'], dtype='object')  # no zero padding
>>> dta.strftime("%Y-%m-%d")  # Falls back on default, so fast and not OS-dependent
Index(['20-01-01', '2020-01-02'], dtype='object')  # no zero padding
>>> dta.strftime("%Y_%m_%d")  # Custom format, uses Timestamp.strftime
Index(['0020_01_01', '2020_01_02'], dtype='object')  # Note the zero-padding on the year on Windows, which does not happen on Linux
```
Also, Timestamp.strftime has the following behaviour:
```python
>>> dta[0].strftime("%Y-%m-%d")  # Relies on datetime.strftime, hence OS-dependent
'0020-01-01'  # Note the zero-padding on the year on Windows, which does not happen on Linux
```
A similar inconsistency can be observed with negative dates:
```python
>>> dta = pd.DatetimeIndex(np.array(['-0020-01-01', '2020-01-02'], 'datetime64[s]'))
>>> dta.strftime(None)  # Default (fast, not OS-dependent)
Index(['-20-01-01', '2020-01-02'], dtype='object')  # note: year is not zero-padded
>>> dta.strftime("%Y-%m-%d")  # Falls back on default
Index(['-20-01-01', '2020-01-02'], dtype='object')  # note: year is not zero-padded
>>> dta.strftime("%Y_%m_%d")  # Custom, so falls back on Timestamp.strftime
NotImplementedError: strftime not yet supported on Timestamps which are outside the range of Python's standard library. For now, please call the components you need (such as `.year` and `.month`) and construct your string from there.
>>> dta[0].strftime("%Y-%m-%d")  # Relies on datetime.strftime, hence OS-dependent
NotImplementedError: strftime not yet supported on Timestamps which are outside the range of Python's standard library. For now, please call the components you need (such as `.year` and `.month`) and construct your string from there.
```
Why is it so?
`strftime` was initially implemented through the C `strftime`, which is platform-dependent. This is also how CPython implements `datetime.strftime`. (Note that the documentation makes some choices: for example, it shows the Windows-style output for `%Y`, because this is the format that `strptime` requires in order to parse it.)
However, `strftime` becomes slow when executed cell by cell on arrays, because the format string is parsed again for every cell. This problem was detected in #44764 and fixed with appreciable performance gains (see #46116 (comment)).
The fix bypasses the OS-dependent `strftime` and instead uses Python string formatting, which is much faster. One drawback is that users cannot easily tell which engine is used in which case.
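To illustrate the idea behind the fix, here is a minimal sketch that parses the format once and then applies plain Python string formatting to each element. It is not pandas' actual implementation: the helper names (`to_template`, `format_per_cell`, `format_with_template`) and the three-directive mapping are assumptions made for this example only.

```python
from datetime import datetime

# Illustrative subset of strftime directives mapped to %-style template fields.
# The year is deliberately left unpadded, matching the default output shown above.
_DIRECTIVE_TO_FIELD = {
    "%Y": "%(year)d",
    "%m": "%(month)02d",
    "%d": "%(day)02d",
}


def to_template(fmt: str) -> str:
    """Translate a strftime format into a %-style template (hypothetical helper)."""
    out = []
    i = 0
    while i < len(fmt):
        if fmt[i] != "%":
            out.append(fmt[i])
            i += 1
            continue
        directive = fmt[i : i + 2]
        if directive == "%%":
            out.append("%%")  # a literal percent stays escaped in the template
        elif directive in _DIRECTIVE_TO_FIELD:
            out.append(_DIRECTIVE_TO_FIELD[directive])
        else:
            raise ValueError(f"unsupported directive: {directive!r}")
        i += 2
    return "".join(out)


def format_per_cell(values, fmt):
    """Baseline: datetime.strftime interprets `fmt` again for every element."""
    return [v.strftime(fmt) for v in values]


def format_with_template(values, fmt):
    """Parse `fmt` once, then apply plain string formatting to each element."""
    template = to_template(fmt)
    return [
        template % {"year": v.year, "month": v.month, "day": v.day}
        for v in values
    ]


dates = [datetime(20, 1, 1), datetime(2020, 1, 2)]
print(format_with_template(dates, "%Y_%m_%d"))  # ['20_01_01', '2020_01_02']
```

Because the template path never calls the C library, its output for year 20 is the same on every platform, whereas `format_per_cell` inherits whatever padding the OS `strftime` applies.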
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Feature Description
I propose that we introduce an `engine` parameter to the datetime/period array-like `strftime` operation, in the same spirit as the pyarrow/fastparquet `engine` switcher in `to_parquet` (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html). This parameter could have three values:
- `'pystr'`: converts the strftime format to a Python string template, and uses the Python string formatting engine. Raises an `UnsupportedStrFmtDirective` error when a strftime directive is not supported by the engine. This engine is significantly faster than `'os'` on large arrays.
- `'os'`: uses the platform (C lib) `strftime` through `datetime.strftime`. This engine is OS-dependent.
- `'auto'` (default): if the strftime format can be successfully converted to a Python string template, uses `'pystr'`. Otherwise, silently falls back on `'os'`.

In addition, I propose to add an equivalent parameter to the instance-level `Timestamp.strftime` and `Period.strftime`.
Alternative Solutions
Alternatively, if we think that the instance-level `Timestamp.strftime` and `Period.strftime` could potentially negatively impact performance for some usages, we could sacrifice consistency between the array `strftime` and the instance `strftime`, and keep a default value of `'os'` for the instance one.
Additional Context
This was identified during discussions in #51298 (comment)