Add date dtype #34441

zbrookle · 2020-05-28T20:55:54Z

closes ENH: Feature request, date type #32473

This is the beginning of the date data type, and it works properly at a high level. In the places where strings and similar inputs need to be converted, I reused the Cython code already implemented in tslib. The time complexity is still linear, but some of those methods may need to be rewritten in Cython specifically for dates, which I'm happy to do.

pep8speaks · 2020-05-28T20:56:01Z

Hello @zbrookle! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file pandas/core/internals/blocks.py:

Line 2040:1: E302 expected 2 blank lines, found 1
Line 2049:1: E302 expected 2 blank lines, found 1

In the file pandas/tests/arrays/test_dates.py:

Line 5:89: E501 line too long (94 > 88 characters)
Line 17:9: E126 continuation line over-indented for hanging indent
Line 20:5: E121 continuation line under-indented for hanging indent
Line 79:1: E302 expected 2 blank lines, found 1
Line 93:1: E302 expected 2 blank lines, found 1
Line 102:1: E302 expected 2 blank lines, found 1
Line 114:1: E302 expected 2 blank lines, found 1
Line 127:1: E302 expected 2 blank lines, found 1
Line 134:1: E302 expected 2 blank lines, found 1
Line 179:1: E305 expected 2 blank lines after class or function definition, found 1
Line 184:40: W292 no newline at end of file

Comment last updated at 2020-05-29 16:47:47 UTC

jreback · 2020-05-28T22:17:20Z

pandas/core/arrays/dates.py

+            )
+            raise ValueError(msg)
+
+        if values.dtype == INTEGER_BACKEND:


Why aren't you simply keeping ordinals since epoch? It's performant and much simpler.

I am keeping them. As I understand it, the view just changes the outer representation, not the backend. The same thing is done in the datetime array.
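For readers following this thread, here is a minimal sketch of the "view" idea using plain NumPy rather than the PR's DateArray (the variable names are illustrative only). It assumes nothing beyond NumPy's `datetime64[D]` dtype, which stores int64 day counts since 1970-01-01.

```python
import numpy as np

# Day-resolution dates; NumPy backs these with int64 days since 1970-01-01.
dates = np.array(["1970-01-01", "1970-01-02", "2020-05-28"], dtype="datetime64[D]")

# .view() reinterprets the same buffer as integers without copying anything.
ordinals = dates.view("i8")

# Viewing back restores the date representation; the stored integers never change.
round_trip = ordinals.view("datetime64[D]")

print(ordinals[:2])                       # the first two day offsets
print(np.shares_memory(dates, ordinals))  # both arrays use one buffer
```

The point of the sketch is only that the integer ordinals and the date representation can share one buffer, so choosing a date-like outer type does not by itself give up ordinal storage.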

mroeschke · 2020-05-28T22:27:30Z

pandas/core/arrays/dates.py

+    @property
+    def _box_func(self):
+        # TODO Implement Datestamp of a similar form in cython
+        return lambda x: Timestamp(x, freq="D", tz="utc")


I think we want to be timezone naive by default

Yes, that's why I have a TODO to create a date stamp, but that will need to be implemented in Cython. I can do that; I just wanted to get something working first.

Wouldn't we want datetime.date objects anyway (or Period[D] objects)?

@jbrockmendel Yeah, probably. This is just a placeholder.

jbrockmendel · 2020-05-28T23:29:23Z

could DateArray just be an alias for PeriodArray[D]?

TomAugspurger

What's the motivation for solving just date frequency, rather than making a generic datetime array with arbitrary frequency? What's the additional complexity in allowing user-configurable frequency?

pandas/core/arrays/dates.py

TomAugspurger · 2020-05-29T12:01:08Z

pandas/core/dtypes/dtypes.py

+
+    @property
+    def type(self):
+        return Timestamp


Is this an appropriate scalar type? Why not datetime.date?

@TomAugspurger No, it's not. I didn't use datetime.date because I thought there should be a Cython-implemented data structure rather than the native Python one, so that's currently a placeholder.

I don't think the scalar type is ever used in cython code.

In _libs/tslib.pyx is the Timestamp implementation which is written in cython, I figured a similar thing might need to be done for date

But DateArray is itself an array of dates, so the type here should be datetime.date. If it helps, that class is in fact implemented in C.

Ok yeah, fair enough, if it's already implemented in C, then I can change it
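As a rough illustration of what boxing to `datetime.date` could look like when the backing values are days since the epoch, here is a standard-library sketch. The helper names `box_days` and `unbox_date` are hypothetical and are not taken from the PR.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def box_days(days_since_epoch: int) -> date:
    """Hypothetical boxing helper: integer day offset -> datetime.date."""
    return EPOCH + timedelta(days=days_since_epoch)


def unbox_date(value: date) -> int:
    """Hypothetical inverse: datetime.date -> integer day offset."""
    return (value - EPOCH).days


boxed = box_days(1)
print(boxed)                       # the day after the epoch
print(hasattr(boxed, "tzinfo"))    # date objects carry no timezone attribute
```

A `datetime.date` has no timezone slot at all, so a scalar of this type is timezone-naive by construction.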

TomAugspurger · 2020-05-29T12:01:30Z

pandas/core/dtypes/dtypes.py

+
+    @property
+    def na_value(self):
+        return NaT


This needs to be discussed in detail. We need to decide if we want NaT or NA semantics. I think I'd prefer NA.

-1 on changing to NA at this time; we have a very well established policy of NaT in all datetimelikes.
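To make the NaT versus NA question concrete, the sketch below contrasts how the two existing pandas sentinels behave in comparisons. It uses only `pd.NaT` and `pd.NA` as they exist in pandas 1.0 and later, and it says nothing about which one a date dtype should adopt.

```python
import pandas as pd

# NaT follows NaN-like semantics: comparisons yield a plain boolean.
nat_eq = pd.NaT == pd.NaT
nat_ne = pd.NaT != pd.NaT

# NA propagates through comparisons: the result is itself pd.NA.
na_eq = pd.NA == pd.NA

# NA also follows Kleene logic, so "unknown OR True" is known to be True.
na_or_true = pd.NA | True

# Both are recognised as missing by pd.isna.
both_missing = pd.isna(pd.NaT) and pd.isna(pd.NA)

print(nat_eq, nat_ne, na_eq, na_or_true, both_missing)
```

In practice this difference shows up in boolean masks: comparisons against NaT produce ordinary `bool` results, while comparisons against NA produce nullable booleans.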

zbrookle · 2020-05-29T14:32:42Z

> What's the motivation for solving just date frequency, rather than making a generic datetime array with arbitrary frequency? What's the additional complexity in allowing user-configurable frequency?

@TomAugspurger The motivation is that most data frameworks have a dedicated date datatype. It's pretty common in databases and spreadsheets that an end user might want just the date type when exporting. I think for all other uses the datetime64 type is fine, because it lines up well with the "timestamp" type that's commonly used in databases and other places. However, that type doesn't line up when a user wants a date type.

zbrookle · 2020-05-29T14:36:48Z

> could DateArray just be an alias for PeriodArray[D]?

@jbrockmendel I know you had suggested this on the original issue #32473, but I think there are a couple of reasons to have a different, dedicated type.

When a user prints out a dataframe, thinking they're using a date, I think it would be really misleading to have Period[D] displayed as the dtype rather than date.
When you export to other formats, such as Parquet, the cleanest way to map to what other frameworks have as a date type is for pandas to have a dedicated date type as well.
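For context on the display concern, this small sketch shows what a day-frequency period column currently reports as its dtype and scalar type. It uses only the existing `pd.period_range` API; no date dtype from this PR is involved.

```python
import pandas as pd

# A Series of day-frequency periods, the closest existing stand-in for dates.
series = pd.Series(pd.period_range("2020-05-28", periods=3, freq="D"))

dtype_name = str(series.dtype)   # how the dtype is labelled when printed
first = series.iloc[0]           # scalars are pd.Period, i.e. spans of time

print(dtype_name)
print(type(first).__name__)
print(first.start_time, first.end_time)
```

Each element is a span with a distinct start and end, which is part of why the label and the semantics differ from what a user expecting a plain calendar date might assume.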

jorisvandenbossche · 2020-05-29T14:40:05Z

I think a date dtype would be cool to have in pandas, but I think there needs to be some more high-level discussion first:

Since this is a new dtype, IMO it should subclass BaseMaskedArray, and not the DatetimeLikeArray
Related to the above, IMO we should directly use pd.NA as missing value indicator
There are actually multiple ways this could be stored. Now it is "days since 1970-01-01", but it could also be microseconds since 1970-01-01, which still gives a good date range but would make a conversion between date dtype <-> future timestamp dtype with µs resolution a no-op (this exists in Arrow, but I will ask there whether someone knows if that is a common type).

So maybe we first return to the issue to get a bit more agreement on those discussion items before further working on the PR.
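The storage trade-off can be sketched with NumPy's existing `datetime64` units, as an illustration only and not as the PR's implementation. With day storage, converting to a microsecond timestamp rescales every integer; if the dates were already stored as microseconds at midnight, the same conversion would be a plain reinterpretation.

```python
import numpy as np

MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000

# The same two dates at day resolution and at microsecond resolution.
as_days = np.array(["1970-01-02", "2020-05-28"], dtype="datetime64[D]")
as_micros = as_days.astype("datetime64[us]")  # this cast multiplies the stored integers

day_ints = as_days.view("i8")
micro_ints = as_micros.view("i8")

# The backing integers differ by exactly the number of microseconds in a day.
print(day_ints)
print(micro_ints // MICROS_PER_DAY)
```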

jorisvandenbossche · 2020-05-29T14:41:09Z

(and @zbrookle, thanks a lot for looking into this!)

jbrockmendel · 2020-05-29T14:57:03Z

> What's the motivation for solving just date frequency, rather than making a generic datetime array with arbitrary frequency? What's the additional complexity in allowing user-configurable frequency?

> I think a date dtype would be cool to have in pandas, but I think there needs to be some more high-level discussion first:

I've been looking at what it would take to support non-nano dt64/td64, am putting together an email to pandas-dev about the options/tradeoffs. From an implementation standpoint, the hard parts are a) timezones and b) combing the code for all the places where we assume always-nanos.

Date does have the advantage that we probably wouldn't want a tzaware date, which simplifies the implementation.

zbrookle · 2020-05-29T14:58:40Z

> I think a date dtype would be cool to have in pandas, but I think there needs to be some more high-level discussion first:
>
> Since this is a new dtype, IMO it should subclass BaseMaskedArray, and not the DatetimeLikeArray
>
> Related to the above, IMO we should directly use pd.NA as missing value indicator
>
> There are actually multiple ways this could be stored. Now it is "days since 1970-01-01", but it could also be microseconds since 1970-01-01, which still gives a good date range but would make a conversion between date dtype <-> future timestamp dtype with µs resolution a no-op (this exists in Arrow, but I will ask there whether someone knows if that is a common type)
>
> So maybe we first return to the issue to get a bit more agreement on those discussion items before further working on the PR.

I agree that those things should definitely be discussed further, which is why I have held off on writing any Cython code so far. Most of the changes that I've made can be easily adapted to use days or microseconds, since I considered both options.
The reason I used days is that NumPy uses days since 1970-01-01 in its datetime64[D] implementation, so I figured it should be consistent. Also, if a user explicitly specifies date, using days will in theory allow them to go further out from the epoch than microseconds would.
I'd prefer pd.NA, as in my experience it ends up being easier to work with, especially when using Arrow.
I think that if we want to use pd.NA, then it may be worth having a nullable datetimelike that this would then inherit from.
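The range argument for days versus microseconds can be checked with back-of-the-envelope arithmetic. The sketch below assumes a signed 64-bit backing integer in both cases and the average Gregorian year length; it is only an estimate, not a statement about any particular implementation's limits.

```python
INT64_MAX = 2**63 - 1
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000
DAYS_PER_YEAR = 365.2425  # average Gregorian year

# Furthest distance from the epoch, in years, for each storage unit.
years_with_micros = INT64_MAX / MICROS_PER_DAY / DAYS_PER_YEAR
years_with_days = INT64_MAX / DAYS_PER_YEAR

print(f"microseconds: about {years_with_micros:,.0f} years either side of 1970")
print(f"days:         about {years_with_days:.2e} years either side of 1970")
```

Microsecond storage still covers hundreds of thousands of years in each direction, so the extra reach of day storage matters mostly in theory.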

jreback · 2020-12-29T20:41:26Z

Great idea. This PR is stale, however, and would need quite a bit of work to rebase. Please open a new one if interested.

zbrookle added 20 commits May 21, 2020 16:34

ENH: Add date dtype implementation

ac8e285

ENH: Date type now functions as expected

0ad60de

TST: Start adding unit tests for dates

ae1a498

ENH: ints, datetimes and objects can convert to datearray

5c5ee4b

ENH: Add proper formatting for dates

224b59d

ENH: Add initilization tests from datetime, int, and object numpy arrays

5213efe

ENH: All conversions and displays for date object now behave properly

e000786

CLN: Remove print statements

a9ac366

ENH: Can now convert date to object, string, int, and datetime64

4b441f3

CLN: Move dtype testing to test_common

4ec5d72

BUG: Raise exception when given incompatible dtype ndarray

539444e

ENH: Add integer able to convert to date

6db4aea

BUG: Fix numpy kind for date dtype

6f4eb44

ENH: Add conversion from datetime to date

0e30fa5

BUG: Add copy to from sequence

a26a4f7

BUG: Fix cast date as date type

af37183

ENH: Remove unneeded tests

69b297f

CLN: Remove main

9aab22d

CLN: Fix linting errors

2f3f579

CLN: Fix mypy errors

eb947d7

zbrookle added 3 commits May 28, 2020 17:01

CLN: Fix pep8 errors

a6d6bc5

CLN: Fix pep8 problems

61d07f9

Merge branch 'master' into add_date_dtype

85e71fd

# Conflicts: # pandas/core/arrays/__init__.py # pandas/core/arrays/integer.py # pandas/core/dtypes/dtypes.py # pandas/tests/dtypes/test_common.py

jreback requested changes May 28, 2020

jreback added Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. labels May 28, 2020

mroeschke reviewed May 28, 2020

zbrookle added 2 commits May 28, 2020 18:40

BUG: Fix convert date to string for newest pandas

5673cf3

BUG: Fix convert integer to date for new framework

79c9254

zbrookle marked this pull request as draft May 28, 2020 23:21

TomAugspurger reviewed May 29, 2020

DOC: Remove incorrect warning from docstring

e9c8d96

zbrookle added 5 commits May 29, 2020 11:36

ENH: Add support for int and datetime series converting to date

c209de1

ENH: Override from backing data

f207989

ENH: Add support for conversion from date series to object, string, i…

31fa485

…8, datetime64[ns]

BUG: String conversion was resulting in object numpy array

73e278b

ENH: Change DateType type to datetime.date

068e9bc

jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Sep 5, 2020

arw2019 added the Stale label Nov 6, 2020

jreback closed this Dec 29, 2020
