API: Add DataFrame.assign method #9239
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -450,6 +450,82 @@ available to insert at a particular location in the columns: | |
df.insert(1, 'bar', df['one']) | ||
df | ||
|
||
.. _dsintro.chained_assignment: | ||
|
||
Assigning New Columns in Method Chains | ||
-------------------------------------- | ||
|
||
.. versionadded:: 0.16.0 | ||
|
||
Inspired by `dplyr's | ||
<http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html#mutate>`__ | ||
``mutate`` verb, DataFrame has an :meth:`~pandas.DataFrame.assign` | ||
method that allows you to easily create new columns that are potentially | ||
derived from existing columns. | ||
|
||
.. ipython:: python | ||
|
||
iris = read_csv('data/iris.data') | ||
iris.head() | ||
|
||
(iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength']) | ||
.head()) | ||
|
||
Above was an example of inserting a precomputed value. We can also pass in | ||
a function of one argument to be evaluated on the DataFrame being assigned to. | ||
|
||
.. ipython:: python | ||
|
||
iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] / | ||
x['SepalLength'])).head() | ||
|
||
``assign`` **always** returns a copy of the data, leaving the original | ||
DataFrame untouched. | ||
Review comment: Maybe explain why you would use a callable (as opposed to straight assignment).
Reply: I do that down on line 498, but agreed. I'll put a sentence here.
|
||
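The copy semantics described above can be checked with a small sketch. The toy frame below is a hypothetical stand-in for the iris data (which is not reproduced here), and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the iris measurements.
df = pd.DataFrame({"SepalLength": [5.1, 4.9, 6.3],
                   "SepalWidth": [3.5, 3.0, 3.3]})

# assign returns a new DataFrame that carries the extra column.
with_ratio = df.assign(sepal_ratio=df["SepalWidth"] / df["SepalLength"])

# The original frame is left untouched; only the returned object changed.
original_columns = list(df.columns)
returned_columns = list(with_ratio.columns)
```

Because the result is a separate object, the call has to be bound to a name (or chained) for the new column to be usable.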
Passing a callable, as opposed to an actual value to be inserted, is | ||
useful when you don't have a reference to the DataFrame at hand. This is | ||
common when using ``assign`` in chains of operations. For example, | ||
we can limit the DataFrame to just those observations with a Sepal Length | ||
greater than 5, calculate the ratio, and plot: | ||
|
||
.. ipython:: python | ||
|
||
@savefig basics_assign.png | ||
(iris.query('SepalLength > 5') | ||
.assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength, | ||
PetalRatio = lambda x: x.PetalWidth / x.PetalLength) | ||
.plot(kind='scatter', x='SepalRatio', y='PetalRatio')) | ||
Review comment: I don't believe you need the parens here either.
Reply: I'm using parens instead of
|
||
Since a function is passed in, the function is computed on the DataFrame | ||
being assigned to. Importantly, this is the DataFrame that's been filtered | ||
to those rows with sepal length greater than 5. The filtering happens first, | ||
and then the ratio calculations. This is an example where we didn't | ||
have a reference to the *filtered* DataFrame available. | ||
Review comment: Make it clear that this is a deferred operation (so that the purpose is to have the filtering happen first).
|
||
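A minimal sketch of the deferred evaluation discussed above, using a hypothetical three-row frame in place of the iris data. The helper `sepal_ratio` and the `seen_lengths` list are illustrative names, added only to record which frame the callable receives.

```python
import pandas as pd

# Hypothetical stand-in for the iris data: one row fails the filter.
iris_like = pd.DataFrame({
    "SepalLength": [4.8, 5.5, 6.0],
    "SepalWidth": [3.0, 2.2, 3.0],
})

seen_lengths = []  # records how many rows the callable was handed

def sepal_ratio(frame):
    # `frame` is whatever DataFrame assign is operating on at call time.
    seen_lengths.append(len(frame))
    return frame["SepalWidth"] / frame["SepalLength"]

result = (iris_like.query("SepalLength > 5")
          .assign(SepalRatio=sepal_ratio))
```

The callable is handed the two filtered rows rather than the full three-row frame, which is the reason for passing a function instead of a precomputed Series in a chain.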
The function signature for ``assign`` is simply ``**kwargs``. The keys | ||
are the column names for the new fields, and the values are either a value | ||
to be inserted (for example, a ``Series`` or NumPy array), or a function | ||
of one argument to be called on the ``DataFrame``. A *copy* of the original | ||
DataFrame is returned, with the new values inserted. | ||
|
||
.. warning:: | ||
|
||
Since the function signature of ``assign`` is ``**kwargs``, a dictionary, | ||
the order of the new columns in the resulting DataFrame cannot be guaranteed. | ||
|
||
All expressions are computed first, and then assigned. So you can't refer | ||
to another column being assigned in the same call to ``assign``. For example: | ||
|
||
.. ipython:: | ||
:verbatim: | ||
|
||
In [1]: # Don't do this, bad reference to `C` | ||
df.assign(C = lambda x: x['A'] + x['B'], | ||
D = lambda x: x['A'] + x['C']) | ||
In [2]: # Instead, break it into two assigns | ||
(df.assign(C = lambda x: x['A'] + x['B']) | ||
.assign(D = lambda x: x['A'] + x['C'])) | ||
|
||
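The recommended two-step form can be sketched with a small hypothetical frame (columns `A` and `B` are assumed integer data chosen for illustration). Only the chained form is shown, since the behavior of the single-call form depends on the pandas version in use.

```python
import pandas as pd

# Hypothetical data; the column names follow the example above.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# The first assign creates C; the second receives a frame that already has C.
result = (df.assign(C=lambda x: x["A"] + x["B"])
            .assign(D=lambda x: x["A"] + x["C"]))
```

Each `assign` call returns a new frame, so the second lambda is evaluated against the output of the first call and `df` itself keeps only `A` and `B`.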
Indexing / Selection | ||
~~~~~~~~~~~~~~~~~~~~ | ||
The basics of indexing are as follows: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,6 +29,47 @@ New features | |
|
||
This method is also exposed by the lower level ``Index.get_indexer`` and ``Index.get_loc`` methods. | ||
|
||
- DataFrame assign method | ||
|
||
Inspired by `dplyr's | ||
<http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html#mutate>`__ ``mutate`` verb, DataFrame has a new | ||
:meth:`~pandas.DataFrame.assign` method. | ||
The function signature for ``assign`` is simply ``**kwargs``. The keys | ||
are the column names for the new fields, and the values are either a value | ||
to be inserted (for example, a ``Series`` or NumPy array), or a function | ||
of one argument to be called on the ``DataFrame``. The new values are inserted, | ||
and the entire DataFrame (with all original and new columns) is returned. | ||
|
||
.. ipython :: python | ||
|
||
iris = read_csv('data/iris.data') | ||
iris.head() | ||
|
||
iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head() | ||
|
||
Above was an example of inserting a precomputed value. We can also pass in | ||
a function to be evaluated. | ||
Review comment: Say that the purpose of the callable is to have a deferred operation.
||
|
||
.. ipython :: python | ||
|
||
iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] / | ||
x['SepalLength'])).head() | ||
|
||
The power of ``assign`` comes when used in chains of operations. For example, | ||
we can limit the DataFrame to just those observations with a Sepal Length greater than 5, | ||
calculate the ratio, and plot: | ||
|
||
.. ipython:: python | ||
|
||
(iris.query('SepalLength > 5') | ||
.assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength, | ||
PetalRatio = lambda x: x.PetalWidth / x.PetalLength) | ||
.plot(kind='scatter', x='SepalRatio', y='PetalRatio')) | ||
|
||
.. image:: _static/whatsnew_assign.png | ||
|
||
See the :ref:`documentation <dsintro.chained_assignment>` for more. (:issue:`9229`) | ||
|
||
.. _whatsnew_0160.api: | ||
|
||
.. _whatsnew_0160.api_breaking: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2220,6 +2220,88 @@ def insert(self, loc, column, value, allow_duplicates=False): | |
self._data.insert( | ||
loc, column, value, allow_duplicates=allow_duplicates) | ||
|
||
def assign(self, **kwargs): | ||
""" | ||
Assign new columns to a DataFrame, returning a new object | ||
(a copy) with all the original columns in addition to the new ones. | ||
|
||
Review comment: Can you add here a
Reply: Good idea. Done.
||
.. versionadded:: 0.16.0 | ||
|
||
Parameters | ||
---------- | ||
kwargs : keyword, value pairs | ||
keywords are the column names. If the values are | ||
callable, they are computed on the DataFrame and | ||
assigned to the new columns. If the values are | ||
not callable, (e.g. a Series, scalar, or array), | ||
they are simply assigned. | ||
|
||
Returns | ||
------- | ||
df : DataFrame | ||
A new DataFrame with the new columns in addition to | ||
all the existing columns. | ||
|
||
Notes | ||
----- | ||
Since ``kwargs`` is a dictionary, the order of your | ||
arguments may not be preserved, and so the order of the | ||
new columns is not well defined. Assigning multiple | ||
columns within the same ``assign`` is possible, but you cannot | ||
reference other columns created within the same ``assign`` call. | ||
|
||
Examples | ||
-------- | ||
>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)}) | ||
|
||
Where the value is a callable, evaluated on `df`: | ||
|
||
>>> df.assign(ln_A = lambda x: np.log(x.A)) | ||
A B ln_A | ||
0 1 0.426905 0.000000 | ||
1 2 -0.780949 0.693147 | ||
2 3 -0.418711 1.098612 | ||
3 4 -0.269708 1.386294 | ||
4 5 -0.274002 1.609438 | ||
5 6 -0.500792 1.791759 | ||
6 7 1.649697 1.945910 | ||
7 8 -1.495604 2.079442 | ||
8 9 0.549296 2.197225 | ||
9 10 -0.758542 2.302585 | ||
|
||
Where the value already exists and is inserted: | ||
|
||
>>> newcol = np.log(df['A']) | ||
>>> df.assign(ln_A=newcol) | ||
A B ln_A | ||
0 1 0.426905 0.000000 | ||
1 2 -0.780949 0.693147 | ||
2 3 -0.418711 1.098612 | ||
3 4 -0.269708 1.386294 | ||
4 5 -0.274002 1.609438 | ||
5 6 -0.500792 1.791759 | ||
6 7 1.649697 1.945910 | ||
7 8 -1.495604 2.079442 | ||
8 9 0.549296 2.197225 | ||
9 10 -0.758542 2.302585 | ||
""" | ||
data = self.copy() | ||
Review comment: I think there has been a previous comment about this, but two things:
Reply: This would violate pandas data model. The e.g. If you allowed inplace chaining
would then add
Reply: @jorisvandenbossche
Reply: @jreback Thanks for explaining! (This does make me think we should really have some better docs about the internals, but of course someone has to write them and keep them up to date.) So assigning with
Reply: No, I was saying side-effects, meaning that whether this creates a new block and/or consolidates is an implementation detail (it actually creates a new block if it's a new dtype, then consolidates).
Reply: Yep, thinking more about it now, it is indeed logical, if you are chaining, that it returns a new object. It should just be clear from the docs, as @TomAugspurger adapted them now.
||
|
||
# do all calculations first... | ||
results = {} | ||
for k, v in kwargs.items(): | ||
|
||
if callable(v): | ||
results[k] = v(data) | ||
else: | ||
results[k] = v | ||
|
||
# ... and then assign | ||
for k, v in results.items(): | ||
data[k] = v | ||
Review comment: Better to use
Reply: No, this is correct; this is by definition a string setting of a column.
Reply: The keys in
||
|
||
return data | ||
|
||
def _sanitize_column(self, key, value): | ||
# Need to make sure new columns (which go into the BlockManager as new | ||
# blocks) are always copied | ||
|
Review comment: You don't need the parens here.
Reply: I call ``.head`` on the next line.
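Returning to the ``assign`` implementation in the last diff hunk: its "compute everything first, then assign" structure can be sketched without pandas. The function `assign_sketch` below is a hypothetical stand-in that uses a plain dict of lists as the "frame"; it is not the pandas code, only an illustration of why a callable cannot see a column created in the same call under that two-phase design.

```python
def assign_sketch(data, **kwargs):
    """Toy two-phase assign; `data` is a plain dict mapping names to lists."""
    new_data = dict(data)  # shallow copy, so the caller's dict is not modified

    # Phase 1: evaluate every callable before any new column exists.
    results = {}
    for key, value in kwargs.items():
        results[key] = value(new_data) if callable(value) else value

    # Phase 2: only now write the new columns into the copy.
    for key, value in results.items():
        new_data[key] = value
    return new_data


table = {"A": [1, 2], "B": [3, 4]}
add_ab = lambda d: [a + b for a, b in zip(d["A"], d["B"])]

out = assign_sketch(table, C=add_ab)

# Referring to C inside the same call fails, because C is not written
# until phase 2.
try:
    assign_sketch(table, C=add_ab, D=lambda d: d["C"])
    same_call_failed = False
except KeyError:
    same_call_failed = True
```

Splitting the work into two calls avoids the failure, since the second call starts from a dict that already contains `C`.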