Pandas Cheat Sheet
Pandas Installation
| conda install pandas | The best way to get pandas is via conda |
| python3 -m pip install –upgrade pandas | Via PyPI |
| pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas | Install using Tsinghua mirror (recommended in China) |
Key Abbreviations and Package Imports
| df | Any Pandas DataFrame object |
| s | Any Pandas Series object |
Importing Methods
| import pandas as pd | Import pandas and abbreviate as pd |
| import numpy as np | Import numpy and abbreviate as np |
Data Selection
| df[col] | Returns column with label col as a Series |
| df[[col1, col2]] | Returns multiple columns as a DataFrame |
| s.iloc[0] | Selection by position |
| s.loc[‘index_one’] | Selection by index label |
| df.iloc[0,:] | Returns the first row |
| df.iloc[0,0] | Returns the first element of the first column |
Data Cleaning
| df.columns = [‘a’,‘b’,‘c’] | Rename columns |
| pd.isnull() | Checks for null values, returns a Boolean array |
| pd.notnull() | Checks for non-null values, returns a Boolean array |
| df.dropna() | Drop all rows that contain null values |
| df.dropna(axis=1) | Drop all columns that contain null values |
| df.dropna(axis=1,thresh=n) | Drop all rows have less than n non-null values |
| df.fillna(x) | Replace all null values with x |
| s.astype(float) | Convert the datatype of the series to float |
| s.replace(1,‘one’) | Replace all values equal to 1 with ‘one’ |
| s.replace([1,3],[‘one’,’three’]) | Replace 1 with ‘one’ and 3 with ’three’ |
| df.rename(columns=lambda x: x + 1) | Mass renaming of columns |
| df.rename(columns={‘old_name’: ’new_name’}) | Selective renaming |
| df.set_index(‘column_one’) | Change the index |
| df.rename(index=lambda x: x + 1) | Mass renaming of index |
Importing Data
| pd.read_csv(filename) | From a CSV file |
| pd.read_table(filename) | From a delimited text file |
| pd.read_excel(filename) | From an Excel file |
| pd.read_sql(query, connection_object) | Read from a SQL table/database |
| pd.read_json(json_string) | Read from a JSON formatted string |
| pd.read_html(url) | Parses a URL, string or file and extracts tables |
| pd.read_clipboard() | Takes the contents of your clipboard and passes it to read_table() |
| pd.DataFrame(dict) | From a dict, keys for columns names, values for data as lists |
Exporting Data
| df.to_csv(filename) | Write to a CSV file |
| df.to_excel(filename) | Write to an Excel file |
| df.to_sql(table_name, connection_object) | Write to a SQL table |
| df.to_json(filename) | Write to a file in JSON format |
Filter, Sort, and GroupBy
| df[df[col] > 0.5] | Rows where the column col is greater than 0.5 |
| df.sort_values(col1) | Sort values by col1 in ascending order |
| df.sort_values(col2, ascending=False) | Sort values by col2 in descending order |
| df.sort_values([col1,col2], ascending=[True,False]) | Sort values by col1 ascending then col2 descending |
| df.groupby(col) | Returns a groupby object for values from one column |
| df.groupby([col1,col2]) | Returns a groupby object for values from multiple columns |
| df.groupby(col1)[col2] | Returns the mean of the values in col2, grouped by the values in col1 |
| df.pivot_table(index=col1, values=[col2,col3], aggfunc=max) | Create a pivot table that groups by col1 and calculates the max of col2 and col3 |
| df.groupby(col1).agg(np.mean) | Find the average across all columns for every unique col1 group |
| data.apply(np.mean) | Apply the function np.mean across each column |
| data.apply(np.max,axis=1) | Apply the function np.max across each row |
Creating Test Objects
| pd.DataFrame(np.random.rand(20,5)) | 5 columns and 20 rows of random floats |
| pd.Series(my_list) | Create a series from an iterable my_list |
| df.index = pd.date_range(‘1900/1/30’, periods=df.shape[0]) | Add a date index |
Viewing/Inspecting Data
| df.head(n) | First n rows of the DataFrame |
| df.tail(n) | Last n rows of the DataFrame |
| df.shape | Number of rows and columns |
| df.info() | Index, Datatype and Memory information |
| df.describe() | Summary statistics for numerical columns |
| s.value_counts(dropna=False) | View unique values and counts |
| df.apply(pd.Series.value_counts) | Unique values and counts for all columns |
Joins/Combining
| df1.append(df2) | Add the rows in df2 to the end of df1 (columns should be identical) |
| pd.concat([df1, df2],axis=1) | Add the columns in df2 to the end of df1 (rows should be identical) |
| df1.join(df2,on=col1,how=‘inner’) | SQL-style join the columns in df1 with the columns on df2 where the rows for col have identical values |
Statistics
| df.describe() | Summary statistics for numerical columns |
| df.mean() | Returns the mean of all columns |
| df.corr() | Returns the correlation between columns in a DataFrame |
| df.count() | Returns the number of non-null values in each DataFrame column |
| df.max() | Returns the highest value in each column |
| df.min() | Returns the lowest value in each column |
| df.median() | Returns the median of each column |
| df.std() | Returns the standard deviation of each column |