-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Add date dtype #34441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add date dtype #34441
Conversation
Hello @zbrookle! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-05-29 16:47:47 UTC |
# Conflicts: # pandas/core/arrays/__init__.py # pandas/core/arrays/integer.py # pandas/core/dtypes/dtypes.py # pandas/tests/dtypes/test_common.py
) | ||
raise ValueError(msg) | ||
|
||
if values.dtype == INTEGER_BACKEND: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why aren't you simply keeping ordinals since epoch? its performant and much simpler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am keeping them, as I understand, the view just changes the outer representation, but not the backend. The same thing is done in the datetime array
@property | ||
def _box_func(self): | ||
# TODO Implement Datestamp of a similar form in cython | ||
return lambda x: Timestamp(x, freq="D", tz="utc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to be timezone naive by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's why I have a todo to create a date stamp, but that will need to be implemented in cython, which I can do, I just wanted to get something working first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldnt we want datetime.date objects anyway (or Period[D] objects)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel Yeah probably this is just a place holder
could DateArray just be an alias for PeriodArray[D]? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the motivation for solving just date frequency, rather than making a generic datetime array with arbitrary frequency? What's the additional complexity in allowing user-configurable frequency?
pandas/core/dtypes/dtypes.py
Outdated
|
||
@property | ||
def type(self): | ||
return Timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an appropriate scalar type? Why not datetime.date
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger No it's not, I didn't use datetime.date because I thought there should be a cython implemented data structure rather than the native python one, so that's currently a place holder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the scalar type is ever used in cython code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In _libs/tslib.pyx is the Timestamp implementation which is written in cython, I figured a similar thing might need to be done for date
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But DateArray is itself an array of dates. So the type here should be datetime.date
. if it helps, that class is in fact implemented in C.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok yeah, fair enough, if it's already implemented in C, then I can change it
|
||
@property | ||
def na_value(self): | ||
return NaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be discussed in detail. We need to decide if we want NaT
or NA
semantics. I think I'd prefer NA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 on changing to NA at this time, we have a very well established policy of NaT in all datetimelikes.
@TomAugspurger The motivation is that most data frameworks have a dedicated date datatype. It's pretty common in databases and spread sheets that an end user might want just the date type when exporting. I think for all other uses the datetime64 type is fine because that lines up well with the "timestamp" type that's pretty commonly used in databases and other places. However that type doesn't line up when a user wants a date type |
@jbrockmendel I know you had suggested this on the original issue #32473 but I think that there a couple reasons to have a different dedicated type.
|
I think a
So maybe we first return to the issue to get a bit more agreement on those discussion items before further working on the PR. |
(and @zbrookle, thanks a lot for looking into this!) |
I've been looking at what it would take to support non-nano dt64/td64, am putting together an email to pandas-dev about the options/tradeoffs. From an implementation standpoint, the hard parts are a) timezones and b) combing the code for all the places where we assume always-nanos. Date does have the advantage that we probably wouldn't want a tzaware date, which simplifies the implementation. |
|
…8, datetime64[ns]
great idea. this PR is stale however and would need quite a bit to rebase. Please open a new one if interested. |
This is the beginning of the date data type, and it definitely works properly on a high level. For some of the places where strings might need to be converted and such I used the cython code that was implemented in tslib. The time complexity is still linear, but some of those methods may need to be rewritten for dates in cython, which I'm happy to do.