Introduction
Pandas is a powerful Python library for data manipulation and analysis. It provides data structures like Series and DataFrame to handle tabular and time-series data efficiently.
Pandas uses labeled axes and automatic alignment, which makes joins, reshaping, and time-series analysis far safer than manual list handling.
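A minimal sketch of what label alignment means in practice: arithmetic between two Series matches on index labels, not positions.

```python
import pandas as pd

# Two Series with the same labels in different orders
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Addition aligns on labels: x -> 1 + 30, y -> 2 + 20, z -> 3 + 10
total = a + b
print(total)
```

With plain lists, the same operation would silently pair elements by position; here pandas pairs them by label.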
Installation
# Using pip
pip install pandas

# Using conda
conda install pandas
For reproducibility, use virtual environments and pin versions in requirements.txt or environment.yml.
Pandas Series
import pandas as pd

# Create a Series
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)
print(s['b'])
print(s.values, s.index)
A Series is a labeled 1D array with automatic index alignment and vectorized operations.
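A short sketch of those vectorized operations: arithmetic and boolean filtering apply element-wise and preserve the labels.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Element-wise arithmetic keeps the index
doubled = s * 2

# Boolean masks are labeled too, and filtering preserves labels
mask = s > 15
filtered = s[mask]
print(doubled)
print(filtered)
```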
DataFrame Basics
data = {'Name':['Alice','Bob'],'Age':[25,30]}
df = pd.DataFrame(data)
print(df)
print(df['Name'])
print(df.iloc[0], df.loc[0])
A DataFrame is a 2D labeled table. Columns can have different dtypes and are stored in optimized internal blocks.
Reading & Writing Data
df = pd.read_csv('data.csv')
df.to_excel('output.xlsx', index=False)
df.to_json('data.json')
Specify dtype and parse_dates to speed up reading and avoid type inference mistakes.
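For example (using an in-memory CSV with made-up column names `id`, `price`, and `date` for illustration):

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for data.csv
csv_data = io.StringIO("id,price,date\n1,9.99,2026-01-01\n2,19.50,2026-01-02\n")

# Declaring dtypes up front skips inference; parse_dates yields datetime64 columns
df = pd.read_csv(
    csv_data,
    dtype={"id": "int32", "price": "float64"},
    parse_dates=["date"],
)
print(df.dtypes)
```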
Indexing & Selection
print(df['Age'])
print(df.loc[0, 'Name'])
print(df.iloc[0:2, 0:2])
print(df[df['Age'] > 25])
Use .loc for label-based selection and .iloc for positional selection to keep logic explicit.
Operations
df['Age'] = df['Age'] + 1
df['Salary'] = [50000, 60000]
print(df)
print(df.describe())
print(df.mean(numeric_only=True))  # skip non-numeric columns like Name
Operations are vectorized and typically much faster than Python loops. Use .assign() for method chaining.
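A small sketch of .assign() in a chain: each call returns a new DataFrame, so the original stays untouched.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# assign() returns a copy, which keeps a method chain readable and side-effect free
result = (
    df.assign(Salary=[50000, 60000])
      .assign(AgeNextYear=lambda d: d["Age"] + 1)
)
print(result)
```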
Handling Missing Data
df.isnull()
df.dropna(inplace=True)    # drop rows with missing values
df.fillna(0, inplace=True) # ...or fill them instead
Choose strategies based on business meaning: dropping, filling, or interpolating can each be correct depending on context.
Grouping & Aggregation
grouped = df.groupby('Name')['Age'].mean()
print(grouped)
agg = df.agg({'Age':'mean','Salary':'sum'})
print(agg)
GroupBy follows split‑apply‑combine. Use multiple aggregations and named outputs for readability.
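A sketch of named aggregation, which gives each output column an explicit, readable name:

```python
import pandas as pd

df = pd.DataFrame({"Dept": ["IT", "IT", "HR"], "Salary": [50000, 60000, 40000]})

# Named aggregation: keyword = (column, aggregation function)
summary = df.groupby("Dept").agg(
    avg_salary=("Salary", "mean"),
    headcount=("Salary", "count"),
)
print(summary)
```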
Merging & Joining
df1 = pd.DataFrame({'ID':[1,2],'Name':['Alice','Bob']})
df2 = pd.DataFrame({'ID':[1,2],'Salary':[50000,60000]})
merged = pd.merge(df1, df2, on='ID')
print(merged)
Control join behavior with the how parameter and check key uniqueness with validate to avoid accidental many‑to‑many merges.
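A sketch of both parameters: a left join that keeps unmatched rows, with validate raising if the key is not unique on both sides.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Carol"]})
df2 = pd.DataFrame({"ID": [1, 2], "Salary": [50000, 60000]})

# how='left' keeps every row of df1; validate='one_to_one' raises
# MergeError if ID repeats on either side
merged = pd.merge(df1, df2, on="ID", how="left", validate="one_to_one")
print(merged)
```

Carol has no matching salary, so her Salary comes back as NaN instead of being silently dropped.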
Pivot & Reshape
df = pd.DataFrame({'Date':['2026-01-01','2026-01-01'],'Type':['A','B'],'Value':[10,20]})
pivoted = df.pivot(index='Date',columns='Type',values='Value')
print(pivoted)
Use pivot_table when you need aggregation with duplicate keys.
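For example, with two rows sharing the same (Date, Type) key, pivot() would raise, while pivot_table() aggregates them:

```python
import pandas as pd

# Duplicate (Date, Type) pairs: pivot() would fail, pivot_table() averages them
df = pd.DataFrame({
    "Date": ["2026-01-01", "2026-01-01", "2026-01-01"],
    "Type": ["A", "A", "B"],
    "Value": [10, 30, 20],
})
table = df.pivot_table(index="Date", columns="Type", values="Value", aggfunc="mean")
print(table)
```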
Core Data Structures
🔹 Series (1D Data)
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)
print(type(s))
🔹 DataFrame (2D Data)
df = pd.DataFrame({
'Name': ['Alice','Bob','Charlie'],
'Age': [25,30,35]
})
print(df)
print(type(df))
Both structures are built on top of NumPy arrays with labeled indexes for alignment and safer operations.
DataFrame Attributes & Info
print(df.shape)    # (rows, columns)
print(df.columns)  # column names
print(df.index)    # index values
print(df.dtypes)   # data types
df.info()          # full summary (prints directly, returns None)
print(df.head())   # first 5 rows
print(df.tail(2))  # last 2 rows
Use info() and dtypes early to confirm schema and spot unexpected nulls or mixed types.
Column Operations
df['Salary'] = [50000, 60000, 70000]
print(df)
df.rename(columns={'Name':'Employee'}, inplace=True)
print(df)
df.drop('Age', axis=1, inplace=True)
print(df)
Prefer vectorized column operations and avoid row-wise loops for performance.
Row Operations
print(df.loc[0])   # label-based
print(df.iloc[1])  # position-based
df.loc[3] = ['David', 65000]
print(df)
df.drop(1, inplace=True)
print(df)
Row insertion can be expensive for large data; consider building a list of rows and creating a DataFrame once.
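A sketch of that pattern: collect plain dicts in a list, then construct the DataFrame in one call instead of growing it row by row.

```python
import pandas as pd

# Appending with df.loc[...] copies data repeatedly; build the rows first
rows = []
for i in range(3):
    rows.append({"Name": f"Emp{i}", "Salary": 50000 + i * 1000})

df = pd.DataFrame(rows)
print(df)
```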
Filtering & Conditional Selection
high_salary = df[df['Salary'] > 60000]
print(high_salary)
print(df[(df['Salary'] > 50000) & (df['Salary'] < 70000)])
print(df.query("Salary > 60000"))
Use boolean masks and query() to keep filters readable and fast.
Sorting & Ranking
df.sort_values(by='Salary', ascending=False, inplace=True)
print(df)
df['Rank'] = df['Salary'].rank(ascending=False)
print(df)
Sorting can be done by multiple columns and with stable ordering for reproducible results.
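For example, sorting by department ascending and salary descending, with a stable algorithm so equal keys keep their original order:

```python
import pandas as pd

df = pd.DataFrame({
    "Dept": ["IT", "HR", "IT", "HR"],
    "Salary": [60000, 45000, 60000, 40000],
})

# Multiple sort keys with per-key directions; mergesort is stable
ordered = df.sort_values(
    by=["Dept", "Salary"],
    ascending=[True, False],
    kind="mergesort",
)
print(ordered)
```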
Handling Missing Data
df.isnull()
df.notnull()
df.fillna(0, inplace=True)
df.ffill(inplace=True)  # forward fill (fillna(method='ffill') is deprecated)
df.bfill(inplace=True)  # backward fill
df.dropna(axis=0)  # remove rows with missing values
df.dropna(axis=1)  # remove columns with missing values
Consider using interpolate() for time-series data or domain-specific imputation.
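A sketch of linear interpolation on a datetime-indexed Series, where gaps are filled from the neighbouring values:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2026-01-01", periods=5)
ts = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0], index=dates)

# Linear interpolation estimates each NaN from its neighbours
filled = ts.interpolate(method="linear")
print(filled)
```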
Statistical & Aggregation Functions
print(df.mean(numeric_only=True))
print(df.sum(numeric_only=True))
print(df.min())
print(df.max())
print(df.count())
print(df.std(numeric_only=True))
print(df.var(numeric_only=True))
print(df.describe())
Use describe(include='all') for mixed types and quick data profiling.
GroupBy (Split → Apply → Combine)
data = {
'Dept': ['IT','IT','HR','HR'],
'Salary': [50000,60000,40000,45000]
}
df = pd.DataFrame(data)
print(df.groupby('Dept').mean())
print(df.groupby('Dept').sum())
print(df.groupby('Dept').agg(['mean','max','count']))
GroupBy supports custom functions and multiple keys for multi-dimensional summaries.
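A sketch of both ideas: a custom aggregation function (salary spread per department) and grouping on two keys at once.

```python
import pandas as pd

df = pd.DataFrame({
    "Dept": ["IT", "IT", "HR", "HR"],
    "Level": ["Jr", "Sr", "Jr", "Sr"],
    "Salary": [50000, 60000, 40000, 45000],
})

# Custom function: salary range within each department
spread = df.groupby("Dept")["Salary"].agg(lambda s: s.max() - s.min())

# Multiple keys: the result gets a (Dept, Level) MultiIndex
means = df.groupby(["Dept", "Level"])["Salary"].mean()
print(spread)
print(means)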
String Methods
df['Name'] = df['Name'].str.upper()
df['Name'] = df['Name'].str.lower()
df['HasA'] = df['Name'].str.contains('a')  # boolean column; don't overwrite Name
df['CleanName'] = df['Name'].str.strip()
print(df)
String methods are vectorized and handle missing values safely with na parameters.
DateTime & Time Series
df['Date'] = pd.to_datetime(['2026-01-01', '2026-01-02', '2026-01-03'])
print(df['Date'].dt.year)
print(df['Date'].dt.month)
print(df['Date'].dt.day)
df.set_index('Date', inplace=True)
print(df.resample('D').mean())
Time-series workflows often use resample, rolling, and timezone-aware datetimes.
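A sketch combining all three: a timezone-aware hourly series, downsampled to daily means, then smoothed with a 2-day rolling window.

```python
import pandas as pd

# 48 hours of timezone-aware data
idx = pd.date_range("2026-01-01", periods=48, freq="h", tz="UTC")
ts = pd.Series(range(48), index=idx)

daily = ts.resample("D").mean()   # one value per calendar day
smooth = daily.rolling(2).mean()  # 2-day moving average
print(daily)
print(smooth)
```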
apply(), map() & lambda
df['Bonus'] = df['Salary'].apply(lambda x: x * 0.1)
print(df)
df['Category'] = df['Salary'].map(lambda x: 'High' if x > 60000 else 'Low')
print(df)
Prefer vectorized operations when possible; apply is flexible but slower on large data.
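For comparison, a sketch of vectorized equivalents for both examples above, using plain arithmetic and numpy.where:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Salary": [50000, 65000, 70000]})

# Vectorized versions of the apply/map computations
df["Bonus"] = df["Salary"] * 0.1
df["Category"] = np.where(df["Salary"] > 60000, "High", "Low")
print(df)
```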
Handling Duplicates
df.duplicated()
df.drop_duplicates(inplace=True)
df.drop_duplicates(subset=['Employee'], keep='first')
Use subset and keep to control which duplicates are preserved.
Pandas Performance Tips
- Use vectorized operations
- Avoid loops
- Use categorical dtype
- Use query() for filtering
df['Salary'] = df['Salary'] * 1.1 # vectorized
Consider categorical dtype, chunking, and avoiding chained indexing for large datasets.
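A sketch of the categorical-dtype tip: a low-cardinality string column stored as codes plus a small lookup table usually uses far less memory.

```python
import pandas as pd

# A repetitive string column (low cardinality, many rows)
df = pd.DataFrame({"Dept": ["IT", "HR", "IT", "HR"] * 1000})

before = df["Dept"].memory_usage(deep=True)
df["Dept"] = df["Dept"].astype("category")
after = df["Dept"].memory_usage(deep=True)
print(before, after)
```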
Pandas Mastery Checklist
- Series & DataFrame
- Indexing & filtering
- GroupBy & aggregation
- Missing data handling
- Time series
- Merge, join, concat
- Apply & lambda
Revisit these topics regularly with real datasets to build intuition.
Advanced Pandas
- Time-Series Analysis
- MultiIndex & Hierarchical Indexing
- Window Functions — rolling, expanding
- Vectorization — avoiding loops
- Integration with NumPy, Matplotlib, and Seaborn
dates = pd.date_range('2026-01-01', periods=5)
ts = pd.Series([1,2,3,4,5], index=dates)
print(ts.rolling(2).mean())
print(ts.expanding().sum())
Advanced workflows often combine rolling windows with resampling and multi-index operations.
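A sketch of combining the two window ideas with resampling: an hourly series downsampled to daily totals, then a 3-day rolling sum.

```python
import pandas as pd

# 72 hours of data, one unit per hour
idx = pd.date_range("2026-01-01", periods=72, freq="h")
ts = pd.Series(1, index=idx)

daily = ts.resample("D").sum()     # 24 per full day
rolling3 = daily.rolling(3).sum()  # 3-day running total
print(rolling3)
```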