Skip to content

Add date dtype #34441

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 31 commits into from
Closed

Add date dtype #34441

wants to merge 31 commits into from

Conversation

zbrookle
Copy link
Contributor

This is the beginning of the date data type, and it definitely works properly on a high level. For some of the places where strings might need to be converted and such I used the cython code that was implemented in tslib. The time complexity is still linear, but some of those methods may need to be rewritten for dates in cython, which I'm happy to do.

@pep8speaks
Copy link

pep8speaks commented May 28, 2020

Hello @zbrookle! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 2040:1: E302 expected 2 blank lines, found 1
Line 2049:1: E302 expected 2 blank lines, found 1

Line 5:89: E501 line too long (94 > 88 characters)
Line 17:9: E126 continuation line over-indented for hanging indent
Line 20:5: E121 continuation line under-indented for hanging indent
Line 79:1: E302 expected 2 blank lines, found 1
Line 93:1: E302 expected 2 blank lines, found 1
Line 102:1: E302 expected 2 blank lines, found 1
Line 114:1: E302 expected 2 blank lines, found 1
Line 127:1: E302 expected 2 blank lines, found 1
Line 134:1: E302 expected 2 blank lines, found 1
Line 179:1: E305 expected 2 blank lines after class or function definition, found 1
Line 184:40: W292 no newline at end of file

Comment last updated at 2020-05-29 16:47:47 UTC

zbrookle added 3 commits May 28, 2020 17:01
# Conflicts:
#	pandas/core/arrays/__init__.py
#	pandas/core/arrays/integer.py
#	pandas/core/dtypes/dtypes.py
#	pandas/tests/dtypes/test_common.py
)
raise ValueError(msg)

if values.dtype == INTEGER_BACKEND:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why aren't you simply keeping ordinals since epoch? its performant and much simpler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am keeping them, as I understand, the view just changes the outer representation, but not the backend. The same thing is done in the datetime array

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. labels May 28, 2020
@property
def _box_func(self):
# TODO Implement Datestamp of a similar form in cython
return lambda x: Timestamp(x, freq="D", tz="utc")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to be timezone naive by default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's why I have a todo to create a date stamp, but that will need to be implemented in cython, which I can do, I just wanted to get something working first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldnt we want datetime.date objects anyway (or Period[D] objects)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel Yeah probably this is just a place holder

@zbrookle zbrookle marked this pull request as draft May 28, 2020 23:21
@jbrockmendel
Copy link
Member

could DateArray just be an alias for PeriodArray[D]?

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for solving just date frequency, rather than making a generic datetime array with arbitrary frequency? What's the additional complexity in allowing user-configurable frequency?


@property
def type(self):
return Timestamp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an appropriate scalar type? Why not datetime.date?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger No it's not, I didn't use datetime.date because I thought there should be a cython implemented data structure rather than the native python one, so that's currently a place holder

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the scalar type is ever used in cython code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In _libs/tslib.pyx is the Timestamp implementation which is written in cython, I figured a similar thing might need to be done for date

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But DateArray is itself an array of dates. So the type here should be datetime.date. if it helps, that class is in fact implemented in C.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok yeah, fair enough, if it's already implemented in C, then I can change it


@property
def na_value(self):
return NaT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be discussed in detail. We need to decide if we want NaT or NA semantics. I think I'd prefer NA.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 on changing to NA at this time, we have a very well established policy of NaT in all datetimelikes.

@zbrookle
Copy link
Contributor Author

What's the motivation for solving just date frequency, rather than making a generic datetime array with arbitrary frequency? What's the additional complexity in allowing user-configurable frequency?

@TomAugspurger The motivation is that most data frameworks have a dedicated date datatype. It's pretty common in databases and spread sheets that an end user might want just the date type when exporting. I think for all other uses the datetime64 type is fine because that lines up well with the "timestamp" type that's pretty commonly used in databases and other places. However that type doesn't line up when a user wants a date type

@zbrookle
Copy link
Contributor Author

could DateArray just be an alias for PeriodArray[D]?

@jbrockmendel I know you had suggested this on the original issue #32473 but I think that there a couple reasons to have a different dedicated type.

  1. When a user prints out a dataframe, thinking they're using a date, I think it would be really misleading to have Period[D] displayed as the dtype rather than date.

  2. When you export to other formats, like parquet for example, the cleanest way to map to what other frameworks have as a date type, is for there to be a dedicated pandas date type as well.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 29, 2020

I think a date dtype would be cool to have in pandas, but I think there needs to be some more high-level discussion first:

  • Since this is a new dtype, IMO it should subclass BaseMaskedArray, and not the DatetimeLikeArray
  • Related to the above, IMO we should directly use pd.NA as missing value indicator
  • There are actually multiple ways this could be stored. Now it is "days since 1970-01-01", but it could also be microseconds since 1970-01-01, which still gives a good date range but would make a conversion between date dtype <-> future timestamp dtype with µs resolution a no-op (this exists in Arrow, but will ask there whether someones knows is that is a common type)

So maybe we first return to the issue to get a bit more agreement on those discussion items before further working on the PR.

@jorisvandenbossche
Copy link
Member

(and @zbrookle, thanks a lot for looking into this!)

@jbrockmendel
Copy link
Member

What's the motivation for solving just date frequency, rather than making a generic datetime array with arbitrary frequency? What's the additional complexity in allowing user-configurable frequency?

I think a date dtype would be cool to have in pandas, but I think there needs to be some more high-level discussion first:

I've been looking at what it would take to support non-nano dt64/td64, am putting together an email to pandas-dev about the options/tradeoffs. From an implementation standpoint, the hard parts are a) timezones and b) combing the code for all the places where we assume always-nanos.

Date does have the advantage that we probably wouldn't want a tzaware date, which simplifies the implementation.

@zbrookle
Copy link
Contributor Author

I think a date dtype would be cool to have in pandas, but I think there needs to be some more high-level discussion first:

  • Since this is a new dtype, IMO it should subclass BaseMaskedArray, and not the DatetimeLikeArray
  • Related to the above, IMO we should directly use pd.NA as missing value indicator
  • There are actually multiple ways this could be stored. Now it is "days since 1970-01-01", but it could also be microseconds since 1970-01-01, which still gives a good date range but would make a conversion between date dtype <-> future timestamp dtype with µs resolution a no-op (this exists in Arrow, but will ask there whether someones knows is that is a common type)

So maybe we first return to the issue to get a bit more agreement on those discussion items before further working on the PR.

  • I agree that those things should definitely be discussed further, which is why I held off on writing any cython code yet. Most of the changes that I've made can be easily adapted to use days or microseconds, since I considered both options.

  • The reason I used days was because numpy uses days since 1970-01-01 in it's datetime64[D] implementation so I figured it should be consistent, and if a user explicitly specifies date, using days will in theory allow a user to expand further out from epoch than with microseconds.

  • The pd.NA is something that I'd prefer as I think it ends up being easier to work with from my experience especially when using arrow.

  • I think that if we want to use pd.NA then maybe it would be worth having a nullable datetimelike that this would then inherit from

@jorisvandenbossche jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Sep 5, 2020
@arw2019 arw2019 added the Stale label Nov 6, 2020
@jreback
Copy link
Contributor

jreback commented Dec 29, 2020

great idea. this PR is stale however and would need quite a bit to rebase. Please open a new one if interested.

@jreback jreback closed this Dec 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Feature request, date type
8 participants