Data Manipulation and Analysis Using Python

NGCM Summer Academy 2015

This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course consists of a 2-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in NumPy, SciPy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to bootstrapping methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.

Instructors

Christopher Fonnesbeck (Vanderbilt University)
Skipper Seabold (Civis Analytics)

Outline

Introduction to NumPy and Pandas

June 22, 09:30 - 13:00

NumPy arrays and indexing
Multidimensional arrays
Array methods and functions
Series and DataFrame objects
Importing data
Setting options
Categorical data
Indexing, data selection and subsetting
where and query
Hierarchical indexing
Reading and writing files
Sorting and ranking
Missing data
Data summarization
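
As a rough taste of a few of the topics above (selection, `query`, missing data and summarization), here is a minimal sketch. It is not taken from the course notebooks; the DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the course works with its own datasets instead.
df = pd.DataFrame(
    {
        "site": ["a", "a", "b", "b"],
        "value": [10.0, np.nan, 7.0, 3.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Label-based selection with .loc, and subsetting with a boolean mask
first_value = df.loc["s1", "value"]
site_b = df[df["site"] == "b"]

# The same subset expressed with query
site_b_query = df.query("site == 'b'")

# Missing data: count the gaps, then fill them with the column mean
n_missing = int(df["value"].isnull().sum())
filled = df["value"].fillna(df["value"].mean())

# Summary statistics (NaN values are skipped by default)
summary = df["value"].describe()
print(summary)
```

Filling with the mean is only one of several possible choices for missing values; it is used here because it fits in one line.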

Data Wrangling with Pandas

June 22, 14:30 - 17:30

Date/time types
Merging and joining DataFrame objects
Concatenation
Text data operations
Reshaping DataFrame objects
Pivoting
Data transformation
Rolling and window operations
Permutation and sampling
Data aggregation and GroupBy operations
Out-of-core workflows
Performance
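
The merge, GroupBy and pivot topics listed above fit together roughly as in the following sketch. The two tables, their column names and their values are assumptions made for illustration and do not come from the course material.

```python
import pandas as pd

# Hypothetical tables for illustration only
visits = pd.DataFrame(
    {
        "patient": [1, 1, 2, 3],
        "date": pd.to_datetime(
            ["2015-06-22", "2015-06-23", "2015-06-22", "2015-06-23"]
        ),
        "score": [4.0, 6.0, 5.0, 9.0],
    }
)
patients = pd.DataFrame(
    {"patient": [1, 2, 3], "group": ["treated", "control", "treated"]}
)

# Join the visit records to the patient attributes on the shared key
merged = pd.merge(visits, patients, on="patient", how="left")

# Split-apply-combine: mean score within each group
group_means = merged.groupby("group")["score"].mean()

# Reshape from long to wide: one row per patient, one column per date
wide = merged.pivot(index="patient", columns="date", values="score")
print(group_means)
print(wide)
```

Patients without a visit on a given date show up as NaN in the wide table, which ties the reshaping step back to the handling of missing data.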

Plotting and Visualization

June 23, 09:30 - 13:00

Plotting in Pandas vs Matplotlib
Bar plots
Histograms
Box plots
Grouped plots
Scatterplots
Trellis plots

Statistical Data Modeling

June 23, 14:30 - 17:30

Statistical operations in pandas
Statistical modeling
Fitting data to probability distributions
Fitting regression models
Model selection
Cross-validation
Bootstrapping
Working with missing data
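
To give an idea of what fitting a regression model and bootstrapping look like in practice, here is a minimal NumPy-only sketch. The data are simulated, and the seed, sample size and number of resamples are arbitrary choices made for illustration; the course itself may use different tools and datasets.

```python
import numpy as np

rng = np.random.RandomState(42)  # fixed seed so reruns give the same numbers

# Simulated data (an assumption): y depends linearly on x, plus Gaussian noise
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Least-squares fit of a straight line
slope, intercept = np.polyfit(x, y, deg=1)

# Nonparametric bootstrap: resample (x, y) pairs with replacement and refit
n_boot = 1000
boot_slopes = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.randint(0, x.size, size=x.size)
    boot_slopes[i] = np.polyfit(x[idx], y[idx], deg=1)[0]

# Percentile interval for the slope
lower, upper = np.percentile(boot_slopes, [2.5, 97.5])
print("slope = {:.3f}, 95% interval = ({:.3f}, {:.3f})".format(slope, lower, upper))
```

The percentile interval is the simplest bootstrap interval; other variants exist and may behave better for small samples.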

Prerequisites

This is an intermediate-level computing course, so some previous experience with Python is required. Some undergraduate-level statistics is also recommended.

Software Requirements

Python 3.4 or newer. We recommend installing the free Anaconda distribution of Python, available from Continuum Analytics.

The following packages should be installed on your system:

ipython>=3.0
numpy>=1.9
pandas>=0.16.2
scipy
matplotlib
scikit-learn
seaborn
patsy
numexpr
bottleneck
xlrd
jinja2
tornado
pyzmq
jsonschema
mpld3

If you have installed Anaconda, these should already be available to you.
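
One way to check which of these packages can be imported is sketched below. This helper is not part of the repository; `check_modules` and `MODULES` are hypothetical names, and the script only reports the installed versions without comparing them against the minimums listed above.

```python
import importlib

# Import names for the packages listed above. A few differ from the name
# used to install them (ipython -> IPython, scikit-learn -> sklearn,
# pyzmq -> zmq).
MODULES = [
    "IPython", "numpy", "pandas", "scipy", "matplotlib", "sklearn",
    "seaborn", "patsy", "numexpr", "bottleneck", "xlrd", "jinja2",
    "tornado", "zmq", "jsonschema", "mpld3",
]


def check_modules(names):
    """Map each module name to its version string, or None if it cannot be imported."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Covers missing packages as well as installs that fail on import
            found[name] = None
        else:
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    for name, version in sorted(check_modules(MODULES).items()):
        print("{:<12} {}".format(name, version or "MISSING"))
```

Anything reported as MISSING can then be installed with conda or pip before the course.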

Getting this repository

git clone https://github.com/fonnesbeck/ngcm_pandas_course.git

Make sure you have the requirements installed.

cd ngcm_pandas_course

If using pip,

pip install -r requirements.txt
