Pandas Cheat Sheet

Pandas Installation


conda install pandas	The best way to get pandas is via conda
python3 -m pip install –upgrade pandas	Via PyPI
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas	Install using Tsinghua mirror (recommended in China)

Key Abbreviations and Package Imports


df	Any Pandas DataFrame object
s	Any Pandas Series object

Importing Methods


import pandas as pd	Import pandas and abbreviate as pd
import numpy as np	Import numpy and abbreviate as np

Data Selection


df[col]	Returns column with label `col` as a Series
df[[col1, col2]]	Returns multiple columns as a DataFrame
s.iloc[0]	Selection by position
s.loc[‘index_one’]	Selection by index label
df.iloc[0,:]	Returns the first row
df.iloc[0,0]	Returns the first element of the first column

Data Cleaning


df.columns = [‘a’,‘b’,‘c’]	Rename columns
pd.isnull()	Checks for null values, returns a Boolean array
pd.notnull()	Checks for non-null values, returns a Boolean array
df.dropna()	Drop all rows that contain null values
df.dropna(axis=1)	Drop all columns that contain null values
df.dropna(axis=1,thresh=n)	Drop all rows have less than n non-null values
df.fillna(x)	Replace all null values with x
s.astype(float)	Convert the datatype of the series to float
s.replace(1,‘one’)	Replace all values equal to 1 with ‘one’
s.replace([1,3],[‘one’,’three’])	Replace 1 with ‘one’ and 3 with ’three’
df.rename(columns=lambda x: x + 1)	Mass renaming of columns
df.rename(columns={‘old_name’: ’new_name’})	Selective renaming
df.set_index(‘column_one’)	Change the index
df.rename(index=lambda x: x + 1)	Mass renaming of index

Importing Data


pd.read_csv(filename)	From a CSV file
pd.read_table(filename)	From a delimited text file
pd.read_excel(filename)	From an Excel file
pd.read_sql(query, connection_object)	Read from a SQL table/database
pd.read_json(json_string)	Read from a JSON formatted string
pd.read_html(url)	Parses a URL, string or file and extracts tables
pd.read_clipboard()	Takes the contents of your clipboard and passes it to read_table()
pd.DataFrame(dict)	From a dict, keys for columns names, values for data as lists

Exporting Data


df.to_csv(filename)	Write to a CSV file
df.to_excel(filename)	Write to an Excel file
df.to_sql(table_name, connection_object)	Write to a SQL table
df.to_json(filename)	Write to a file in JSON format

Filter, Sort, and GroupBy


df[df[col] > 0.5]	Rows where the column `col` is greater than 0.5
df.sort_values(col1)	Sort values by `col1` in ascending order
df.sort_values(col2, ascending=False)	Sort values by `col2` in descending order
df.sort_values([col1,col2], ascending=[True,False])	Sort values by `col1` ascending then `col2` descending
df.groupby(col)	Returns a groupby object for values from one column
df.groupby([col1,col2])	Returns a groupby object for values from multiple columns
df.groupby(col1)[col2]	Returns the mean of the values in `col2`, grouped by the values in `col1`
df.pivot_table(index=col1, values=[col2,col3], aggfunc=max)	Create a pivot table that groups by `col1` and calculates the max of `col2` and `col3`
df.groupby(col1).agg(np.mean)	Find the average across all columns for every unique `col1` group
data.apply(np.mean)	Apply the function `np.mean` across each column
data.apply(np.max,axis=1)	Apply the function `np.max` across each row

Creating Test Objects


pd.DataFrame(np.random.rand(20,5))	5 columns and 20 rows of random floats
pd.Series(my_list)	Create a series from an iterable `my_list`
df.index = pd.date_range(‘1900/1/30’, periods=df.shape[0])	Add a date index

Viewing/Inspecting Data


df.head(n)	First n rows of the DataFrame
df.tail(n)	Last n rows of the DataFrame
df.shape	Number of rows and columns
df.info()	Index, Datatype and Memory information
df.describe()	Summary statistics for numerical columns
s.value_counts(dropna=False)	View unique values and counts
df.apply(pd.Series.value_counts)	Unique values and counts for all columns

Joins/Combining


df1.append(df2)	Add the rows in `df2` to the end of `df1` (columns should be identical)
pd.concat([df1, df2],axis=1)	Add the columns in `df2` to the end of `df1` (rows should be identical)
df1.join(df2,on=col1,how=‘inner’)	SQL-style join the columns in `df1` with the columns on `df2` where the rows for `col` have identical values

Statistics


df.describe()	Summary statistics for numerical columns
df.mean()	Returns the mean of all columns
df.corr()	Returns the correlation between columns in a DataFrame
df.count()	Returns the number of non-null values in each DataFrame column
df.max()	Returns the highest value in each column
df.min()	Returns the lowest value in each column
df.median()	Returns the median of each column
df.std()	Returns the standard deviation of each column