Skip to content

Pandas Cheat Sheet

Pandas Installation

conda install pandas The best way to get pandas is via conda
python3 -m pip install –upgrade pandas Via PyPI
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas Install using Tsinghua mirror (recommended in China)

Key Abbreviations and Package Imports

df Any Pandas DataFrame object
s Any Pandas Series object

Importing Methods

import pandas as pd Import pandas and abbreviate as pd
import numpy as np Import numpy and abbreviate as np

Data Selection

df[col] Returns column with label col as a Series
df[[col1, col2]] Returns multiple columns as a DataFrame
s.iloc[0] Selection by position
s.loc[‘index_one’] Selection by index label
df.iloc[0,:] Returns the first row
df.iloc[0,0] Returns the first element of the first column

Data Cleaning

df.columns = [‘a’,‘b’,‘c’] Rename columns
pd.isnull() Checks for null values, returns a Boolean array
pd.notnull() Checks for non-null values, returns a Boolean array
df.dropna() Drop all rows that contain null values
df.dropna(axis=1) Drop all columns that contain null values
df.dropna(axis=1,thresh=n) Drop all rows have less than n non-null values
df.fillna(x) Replace all null values with x
s.astype(float) Convert the datatype of the series to float
s.replace(1,‘one’) Replace all values equal to 1 with ‘one’
s.replace([1,3],[‘one’,’three’]) Replace 1 with ‘one’ and 3 with ’three’
df.rename(columns=lambda x: x + 1) Mass renaming of columns
df.rename(columns={‘old_name’: ’new_name’}) Selective renaming
df.set_index(‘column_one’) Change the index
df.rename(index=lambda x: x + 1) Mass renaming of index

Importing Data

pd.read_csv(filename) From a CSV file
pd.read_table(filename) From a delimited text file
pd.read_excel(filename) From an Excel file
pd.read_sql(query, connection_object) Read from a SQL table/database
pd.read_json(json_string) Read from a JSON formatted string
pd.read_html(url) Parses a URL, string or file and extracts tables
pd.read_clipboard() Takes the contents of your clipboard and passes it to read_table()
pd.DataFrame(dict) From a dict, keys for columns names, values for data as lists

Exporting Data

df.to_csv(filename) Write to a CSV file
df.to_excel(filename) Write to an Excel file
df.to_sql(table_name, connection_object) Write to a SQL table
df.to_json(filename) Write to a file in JSON format

Filter, Sort, and GroupBy

df[df[col] > 0.5] Rows where the column col is greater than 0.5
df.sort_values(col1) Sort values by col1 in ascending order
df.sort_values(col2, ascending=False) Sort values by col2 in descending order
df.sort_values([col1,col2], ascending=[True,False]) Sort values by col1 ascending then col2 descending
df.groupby(col) Returns a groupby object for values from one column
df.groupby([col1,col2]) Returns a groupby object for values from multiple columns
df.groupby(col1)[col2] Returns the mean of the values in col2, grouped by the values in col1
df.pivot_table(index=col1, values=[col2,col3], aggfunc=max) Create a pivot table that groups by col1 and calculates the max of col2 and col3
df.groupby(col1).agg(np.mean) Find the average across all columns for every unique col1 group
data.apply(np.mean) Apply the function np.mean across each column
data.apply(np.max,axis=1) Apply the function np.max across each row

Creating Test Objects

pd.DataFrame(np.random.rand(20,5)) 5 columns and 20 rows of random floats
pd.Series(my_list) Create a series from an iterable my_list
df.index = pd.date_range(‘1900/1/30’, periods=df.shape[0]) Add a date index

Viewing/Inspecting Data

df.head(n) First n rows of the DataFrame
df.tail(n) Last n rows of the DataFrame
df.shape Number of rows and columns
df.info() Index, Datatype and Memory information
df.describe() Summary statistics for numerical columns
s.value_counts(dropna=False) View unique values and counts
df.apply(pd.Series.value_counts) Unique values and counts for all columns

Joins/Combining

df1.append(df2) Add the rows in df2 to the end of df1 (columns should be identical)
pd.concat([df1, df2],axis=1) Add the columns in df2 to the end of df1 (rows should be identical)
df1.join(df2,on=col1,how=‘inner’) SQL-style join the columns in df1 with the columns on df2 where the rows for col have identical values

Statistics

df.describe() Summary statistics for numerical columns
df.mean() Returns the mean of all columns
df.corr() Returns the correlation between columns in a DataFrame
df.count() Returns the number of non-null values in each DataFrame column
df.max() Returns the highest value in each column
df.min() Returns the lowest value in each column
df.median() Returns the median of each column
df.std() Returns the standard deviation of each column