API: Add DataFrame.assign method #9239

Merged 1 commit on Mar 1, 2015
Binary file added doc/source/_static/whatsnew_assign.png
2 changes: 2 additions & 0 deletions doc/source/basics.rst
@@ -11,6 +11,7 @@
from pandas.compat import lrange
options.display.max_rows=15


==============================
Essential Basic Functionality
==============================
@@ -793,6 +794,7 @@ This is equivalent to the following
result
result.loc[:,:,'ItemA']


.. _basics.reindexing:


76 changes: 76 additions & 0 deletions doc/source/dsintro.rst
@@ -450,6 +450,82 @@ available to insert at a particular location in the columns:
df.insert(1, 'bar', df['one'])
df

.. _dsintro.chained_assignment:

Assigning New Columns in Method Chains
--------------------------------------

.. versionadded:: 0.16.0

Inspired by `dplyr's
<http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html#mutate>`__
``mutate`` verb, DataFrame has an :meth:`~pandas.DataFrame.assign`
method that allows you to easily create new columns that are potentially
derived from existing columns.

.. ipython:: python

iris = read_csv('data/iris.data')
iris.head()

(iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])
Contributor:

you don't need the parens here

Contributor Author:

I call .head on the next line.

.head())

Above was an example of inserting a precomputed value. We can also pass in
a function of one argument to be evaluated on the DataFrame being assigned to.

.. ipython:: python

iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
x['SepalLength'])).head()

``assign`` **always** returns a copy of the data, leaving the original
Contributor:

maybe explain why you would use a callable (as opposed to straight assignment)

Contributor Author:

I do that down on line 498, but agreed. I'll put a sentence here.

DataFrame untouched.
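
A minimal sketch of this copy behaviour (an editorial illustration, not part of the diff, assuming the ``iris`` frame loaded above):

    cols_before = iris.columns.tolist()
    iris.assign(dummy=1)                          # returns a new DataFrame; the result is discarded here
    assert iris.columns.tolist() == cols_before   # the original frame is unchanged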

Passing a callable, as opposed to an actual value to be inserted, is
useful when you don't have a reference to the DataFrame at hand. This is
common when using ``assign`` in chains of operations. For example,
we can limit the DataFrame to just those observations with a Sepal Length
greater than 5, calculate the ratio, and plot:

.. ipython:: python

@savefig basics_assign.png
(iris.query('SepalLength > 5')
Contributor:

I don't believe you need the parens here either

Contributor Author:

I'm using parens instead of \ to do line-continuation.

.assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
.plot(kind='scatter', x='SepalRatio', y='PetalRatio'))

Since a function is passed in, the function is computed on the DataFrame
being assigned to. Importantly, this is the DataFrame that's been filtered
Contributor:

make it clear that this is a deferred operation (so that the purpose is to have the filtering happen first)

to those rows with sepal length greater than 5. The filtering happens first,
and then the ratio calculations. This is an example where we didn't
have a reference to the *filtered* DataFrame available.
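
For contrast, here is a sketch of the same computation without a callable; it needs an explicit intermediate reference to the filtered frame (an editorial illustration, not part of the diff):

    filtered = iris.query('SepalLength > 5')
    filtered = filtered.assign(SepalRatio=filtered.SepalWidth / filtered.SepalLength,
                               PetalRatio=filtered.PetalWidth / filtered.PetalLength)
    filtered.plot(kind='scatter', x='SepalRatio', y='PetalRatio')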

The function signature for ``assign`` is simply ``**kwargs``. The keys
are the column names for the new fields, and the values are either a value
to be inserted (for example, a ``Series`` or NumPy array), or a function
of one argument to be called on the ``DataFrame``. A *copy* of the original
DataFrame is returned, with the new values inserted.
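
As a compact, hedged illustration of both kinds of value (not part of the diff; assumes a toy frame ``df``):

    df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
    df.assign(C=df['A'] + df['B'],        # a precomputed Series is inserted as-is
              D=lambda x: x['A'] * 2)     # a callable is evaluated on the DataFrame first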

.. warning::

Since the function signature of ``assign`` is ``**kwargs``, which is captured as a dictionary,
the order of the new columns in the resulting DataFrame cannot be guaranteed.

All expressions are computed first, and then assigned. So you can't refer
to another column being assigned in the same call to ``assign``. For example:

.. ipython::
:verbatim:

In [1]: # Don't do this, bad reference to `C`
df.assign(C = lambda x: x['A'] + x['B'],
D = lambda x: x['A'] + x['C'])
In [2]: # Instead, break it into two assigns
(df.assign(C = lambda x: x['A'] + x['B'])
.assign(D = lambda x: x['A'] + x['C']))
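
If a deterministic column order matters after such a call, one workaround (an editorial sketch, not something the docs prescribe, assuming the toy ``df`` from the example above) is to pin the order explicitly on the result:

    result = df.assign(C=df['A'] + df['B'], D=df['A'] - df['B'])
    result = result[['A', 'B', 'C', 'D']]   # pin the column order explicitly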

Indexing / Selection
~~~~~~~~~~~~~~~~~~~~
The basics of indexing are as follows:
41 changes: 41 additions & 0 deletions doc/source/whatsnew/v0.16.0.txt
@@ -29,6 +29,47 @@ New features

This method is also exposed by the lower level ``Index.get_indexer`` and ``Index.get_loc`` methods.

- DataFrame assign method

Inspired by `dplyr's
<http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html#mutate>`__ ``mutate`` verb, DataFrame has a new
:meth:`~pandas.DataFrame.assign` method.
The function signature for ``assign`` is simply ``**kwargs``. The keys
are the column names for the new fields, and the values are either a value
to be inserted (for example, a ``Series`` or NumPy array), or a function
of one argument to be called on the ``DataFrame``. The new values are inserted,
and the entire DataFrame (with all original and new columns) is returned.

.. ipython :: python

iris = read_csv('data/iris.data')
iris.head()

iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()

Above was an example of inserting a precomputed value. We can also pass in
a function to be evaluated.
Contributor:

say that the purpose of the callable is to have a deferred operation


.. ipython :: python

iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
x['SepalLength'])).head()

The power of ``assign`` comes when used in chains of operations. For example,
we can limit the DataFrame to just those with a Sepal Length greater than 5,
calculate the ratio, and plot:

.. ipython:: python

(iris.query('SepalLength > 5')
.assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
.plot(kind='scatter', x='SepalRatio', y='PetalRatio'))

.. image:: _static/whatsnew_assign.png

See the :ref:`documentation <dsintro.chained_assignment>` for more. (:issue:`9229`)

.. _whatsnew_0160.api:

.. _whatsnew_0160.api_breaking:
82 changes: 82 additions & 0 deletions pandas/core/frame.py
@@ -2220,6 +2220,88 @@ def insert(self, loc, column, value, allow_duplicates=False):
self._data.insert(
loc, column, value, allow_duplicates=allow_duplicates)

def assign(self, **kwargs):
"""
Assign new columns to a DataFrame, returning a new object
(a copy) with all the original columns in addition to the new ones.

Member:

Can you add here a versionadded as well?

Contributor Author:

Good idea. Done.

.. versionadded:: 0.16.0

Parameters
----------
kwargs : keyword, value pairs
keywords are the column names. If the values are
callable, they are computed on the DataFrame and
assigned to the new columns. If the values are
not callable (e.g. a Series, scalar, or array),
they are simply assigned.

Returns
-------
df : DataFrame
A new DataFrame with the new columns in addition to
all the existing columns.

Notes
-----
Since ``kwargs`` is a dictionary, the order of your
arguments may not be preserved, and so the order of the
new columns is not well defined. Assigning multiple
columns within the same ``assign`` is possible, but you cannot
reference other columns created within the same ``assign`` call.

Examples
--------
>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})

Where the value is a callable, evaluated on `df`:

>>> df.assign(ln_A = lambda x: np.log(x.A))
A B ln_A
0 1 0.426905 0.000000
1 2 -0.780949 0.693147
2 3 -0.418711 1.098612
3 4 -0.269708 1.386294
4 5 -0.274002 1.609438
5 6 -0.500792 1.791759
6 7 1.649697 1.945910
7 8 -1.495604 2.079442
8 9 0.549296 2.197225
9 10 -0.758542 2.302585

Where the value already exists and is inserted:

>>> newcol = np.log(df['A'])
>>> df.assign(ln_A=newcol)
A B ln_A
0 1 0.426905 0.000000
1 2 -0.780949 0.693147
2 3 -0.418711 1.098612
3 4 -0.269708 1.386294
4 5 -0.274002 1.609438
5 6 -0.500792 1.791759
6 7 1.649697 1.945910
7 8 -1.495604 2.079442
8 9 0.549296 2.197225
9 10 -0.758542 2.302585
"""
data = self.copy()
Member:

I think there has been a previous comment about this, but two things:

  • Is this actually necessary? (But I probably also do not yet fully understand pandas' data model.) E.g. does df['a'] = .. always copy?
  • In the probable case of misunderstanding (so this is my actual comment :-), I would maybe add a note about this in the docstring? DataFrame.append has this, in the sense that it says it returns a new object.

Contributor:

This would violate the pandas data model. The assign method would then have side effects (without it being obvious that it does), and reasoning about chaining would become very difficult.

e.g. if you allowed inplace chaining

df.assign(C=df.A/df.C)

would then add C to the ORIGINAL frame. (I have some commentary on this later.)

Contributor:

@jorisvandenbossche
df['a'] = ... NEVER copies. That is the point: it's an inplace assignment.

Member:

@jreback Thanks for explaining! (This does make me think we should really have some better docs about the internals .. but of course, someone has to write them and keep them up to date.)

So assigning with df['a'] = .. adds a new block and does not consolidate it with another block if one exists of that type? Why not have the same approach here? What are the side effects you are talking about with df['a'] = .. ?

Contributor:

No, I was saying side effects meaning that df IS modified, as opposed to df.assign(...) which returns a NEW object. df['a'] = .. is just like it says: an assignment INPLACE.

Whether this creates a new block and/or consolidates is an implementation detail (it actually creates a new block if it's a new dtype, then consolidates).

Member:

Yep, thinking more about it now, it is indeed logical, if you are chaining, that it returns a new object. It should just be clear from the docs, as @TomAugspurger adapted them now.
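
A tiny sketch of the distinction discussed above (editorial, not part of the patch): ``df['a'] = ...`` mutates ``df`` in place, while ``assign`` returns a new object and leaves ``df`` alone:

    df = DataFrame({'A': [1, 2, 3]})
    df['B'] = df['A'] * 2            # in-place: df itself gains column B
    new = df.assign(C=df['A'] * 3)   # df is untouched; only `new` has column C
    assert 'C' not in df.columns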


# do all calculations first...
results = {}
for k, v in kwargs.items():
    if callable(v):
        results[k] = v(data)
    else:
        results[k] = v

# ... and then assign
for k, v in results.items():
    data[k] = v
Member:

Better to use .loc here, __setitem__ can behave unexpectedly depending on input.

Contributor:

no, this is correct; this is by definition a string setting of a column.
Maybe just assert that the keys are strings (I think that the function call would raise beforehand if they were not, in any event).

Contributor Author:

The keys in **kwargs are required to be strings by Python. No need to check.
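
A quick sketch of why no explicit check is needed (an editorial illustration): Python itself rejects non-string keywords at the call site:

    df = DataFrame({'A': [1, 2, 3]})
    df.assign(**{1: df['A']})   # raises TypeError: keywords must be strings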


return data

def _sanitize_column(self, key, value):
# Need to make sure new columns (which go into the BlockManager as new
# blocks) are always copied
54 changes: 54 additions & 0 deletions pandas/tests/test_frame.py
@@ -13965,6 +13965,60 @@ def test_select_dtypes_bad_arg_raises(self):
with tm.assertRaisesRegexp(TypeError, 'data type.*not understood'):
    df.select_dtypes(['blargy, blarg, blarg'])

def test_assign(self):
    df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
    original = df.copy()
    result = df.assign(C=df.B / df.A)
    expected = df.copy()
    expected['C'] = [4, 2.5, 2]
    assert_frame_equal(result, expected)

    # lambda syntax
    result = df.assign(C=lambda x: x.B / x.A)
    assert_frame_equal(result, expected)

    # original is unmodified
    assert_frame_equal(df, original)

    # Non-Series array-like
    result = df.assign(C=[4, 2.5, 2])
    assert_frame_equal(result, expected)
    # original is unmodified
    assert_frame_equal(df, original)

    result = df.assign(B=df.B / df.A)
    expected = expected.drop('B', axis=1).rename(columns={'C': 'B'})
    assert_frame_equal(result, expected)

    # overwrite
    result = df.assign(A=df.A + df.B)
    expected = df.copy()
    expected['A'] = [5, 7, 9]
    assert_frame_equal(result, expected)

    # lambda
    result = df.assign(A=lambda x: x.A + x.B)
    assert_frame_equal(result, expected)

def test_assign_multiple(self):
    df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
    result = df.assign(C=[7, 8, 9], D=df.A, E=lambda x: x.B)
    expected = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
                          'D': [1, 2, 3], 'E': [4, 5, 6]})
    # column order isn't preserved
    assert_frame_equal(result.reindex_like(expected), expected)

def test_assign_bad(self):
    df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
    # non-keyword argument
    with tm.assertRaises(TypeError):
        df.assign(lambda x: x.A)
    with tm.assertRaises(AttributeError):
        df.assign(C=df.A, D=df.A + df.C)
    with tm.assertRaises(KeyError):
        df.assign(C=lambda df: df.A, D=lambda df: df['A'] + df['C'])
    with tm.assertRaises(KeyError):
        df.assign(C=df.A, D=lambda x: x['A'] + x['C'])

def skip_if_no_ne(engine='numexpr'):
    if engine == 'numexpr':