Data Analysis with Python: Learn Pandas Basics Step-by-Step

aakash

Jul 01, 2025

Introduction

Data analysis with Python has become an essential skill for professionals across various industries. By using Python’s powerful libraries like Pandas, data analysis has never been easier and more efficient. Whether you’re handling big data, performing scientific research, or working with business datasets, Python provides the tools to manage and analyze data effectively.

The demand for highly skilled professionals who can play with data, understand complex mathematical strategies by analysing complex datasets has increased abruptly. Python has become a top choice for data analysis due to its efficiency, clear syntax, and rich library ecosystem. Python and libraries like Pandas, Matplotlib, pyplot, numpy, and seaborn are widely used for data analysis. It provides structured data and features for efficient data manipulation, cleaning, and analysis.

What is Pandas in Data Analysis?

Pandas is a renowned library for data analysis in Python. It allows flexible data manipulation and analysis through two primary data structures: Series (1D) and DataFrame (2D). The series is one-dimensional, whereas the Data Frame is two-dimensional. These structures allow users to organize, label, and manage data. We can analyze any numerical or textual data extracted from any source, such as CSV files, SQL databases, or Excel spreadsheets. Pandas can perform various data-related operations, including:

Fast loading, efficient cleaning, and transformation of large datasets
Excellent alignment and indexing that allows clear organisation and easy access to data
Handling of missing and null values to increase data quality
Flexibility for the reshaping and merging of datasets for in-depth analysis

Python vs Excel vs R vs SQL: Which One’s Better for Data Analysis?

Excel is a user-friendly tool for analysing small datasets. It has special built-in features like Pivot Tables and charts for visual representation of data. But for large datasets consisting of more than 10,000 rows, we need efficient tools like Python with libraries like pandas and seaborn that can handle millions of rows easily.
R is a powerful tool commonly used for statistical analysis. Apart from analysis, it can perform various other tasks, including modelling, clustering, and Time-series modelling, and can also handle large datasets. However, Python with pandas is more versatile and integrates seamlessly with machine Learning workflows.
SQL is used for querying, filtering, and aggregating datasets. It is widely used in industry and is highly efficient for extracting and joining large datasets. However, SQL is limited to data retrieval and transformation. For advanced analysis, Pandas offers great flexibility and a set of analytical tools with Python scripts.
While Excel is great for small-scale analysis and SQL excels at querying large databases, Python with Pandas combines power, flexibility, and scalability — making it a top choice for end-to-end data analysis and automation.

Installing and Importing Pandas for Data Analysis with Python

Installing Pandas:

Using pip:
pip install pandas

Using conda:
conda install pandas

How to import Pandas in a script or notebook:
import pandas as pd

To check if Pandas is installed properly:
print(pd.__version__)

Insights about Pandas Data Structures

Series

A dimensional labelled array that can hold any type of data can hold integers, floats, strings, or even Python objects. Each value has its index label that helps in accessing and manipulating easy and efficient.

Example:

import pandas as pd
# Creating a Series from a list
data = [10, 20, 30, 40]
series = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series)

Output:

a    10  
b    20  
c    30  
d    40  
dtype: int64

DataFrame

Two-dimensional labelled data structure that helps to organise information into rows and columns.

From a List of Lists

import pandas as pd

data = [
    [1, 'Alice', 85],
    [2, 'Bob', 90],
    [3, 'Charlie', 78]
]
df = pd.DataFrame(data, columns=['ID', 'Name', 'Score'])
print(df)

From a Dictionary

data = {
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 78]
}
df = pd.DataFrame(data)
print(df)

From a NumPy Array

import numpy as np

array_data = np.array([
    [1, 85],
    [2, 90],
    [3, 78]
])
df = pd.DataFrame(array_data, columns=['ID', 'Score'])
print(df)

From a CSV File

Assuming students.csv:

ID,Name,Score
1,Alice,85
2,Bob,90
3,Charlie,78
df = pd.read_csv('students.csv')
print(df)

Importing and Exporting Data

How to read data:

CSV: pd.read_csv('file.csv')
Excel: pd.read_excel('file.xlsx')
JSON: pd.read_json('file.json')

Writing data:

CSV: df.to_csv('output.csv')
Excel: df.to_excel('output.xlsx')
JSON: df.to_json('output.json')

Solution to common file path error:

Double-check the file path and extension.
Use raw strings (prefix with r) for Windows paths.
Ensure you have read/write permissions.

Exploring Data and Summary Stats

print(df.head())
print(df.info())
print(df.describe())

head() – Shows the first 5 rows
tail() – Shows the last 5 rows
sample() – Returns a random sample
info() – Displays DataFrame summary
describe() – Summary statistics

Cleaning Data in Pandas

Handling missing values:

df.isnull()
df.dropna()
df.fillna(value)

Dropping duplicates:

df.drop_duplicates()

Converting data types:

df['col'] = df['col'].astype(int)

Parsing dates and categoricals:

pd.to_datetime(df['date_col'])
df['col'] = df['col'].astype('category')

Selecting and Filtering Data

loc[] – Label-based:

df.loc[2, 'Name']
df.loc[df['Age'] > 25]

iloc[] – Position-based:

df.iloc[0, 1]
df.iloc[0:3, :]

Slicing:

df['Name']
df[['Name', 'Age']]

Transforming Data

Sorting: df.sort_values('column'), df.sort_index()
Renaming: df.rename(columns={'old': 'new'})
Applying functions: df['col'].apply(func), df['col'].map(dict)
Reshaping: df.melt(), df.pivot(), df.transpose()

Grouping and Aggregating

df.groupby('column').mean()
df.groupby(['col1', 'col2']).sum()
df.agg({'col1': 'mean', 'col2': 'sum'})

Merging and Joining DataFrames

pd.concat([df1, df2])
pd.merge(df1, df2, on='key')
df1.join(df2, how='left')

Join types: inner, outer, left, right

Visualizing Results in Data Analysis with Pandas

df.plot.line()
df.plot.bar()
df.plot.hist()
df.plot.box()

Customization:

Add titles, axis labels, and legends
For advanced plots, use Matplotlib or Seaborn

import matplotlib.pyplot as plt
plt.savefig('plot.png')

Hands-On Example: Analyzing a Public Dataset

Step 1: Load Data

import pandas as pd

df = pd.read_csv('World Happiness Report 2021.csv')

Step 2: Explore Data

print(df.head())
print(df.info())
print(df.describe())

Step 3: Clean Data

df['Score'].fillna(df['Score'].median(), inplace=True)
df.drop_duplicates(inplace=True)

Step 4: Analyze and Visualize

# Average happiness score by region
avg_score = df.groupby('Regional indicator')['Score'].mean()
print(avg_score)

# Plot average score by region
avg_score.plot(kind='bar', title='Average Happiness Score by Region')

import matplotlib.pyplot as plt
plt.ylabel('Score')
plt.show()

Step 5: Summarize Insights

Some regions consistently report higher happiness scores.
Visualizations reveal disparities and trends across regions.

Common Mistakes and Best Practices in Pandas

SettingWithCopyWarning

# Problematic
df_subset = df[df['age'] > 30]
df_subset['eligible'] = True

# Correct
df.loc[df['age'] > 30, 'eligible'] = True

Tips:

Avoid modifying raw data directly
Chain pandas methods for step-by-step clarity
Use chunks for large datasets
Comment code and use descriptive variables
Maintain reproducibility and collaboration clarity

Conclusion

Pandas act as essential resources for data analysis. They empower users to effectively manage, clean, and analyse real-world datasets. To become proficient in Python and Pandas, practice working with real-world datasets like those from Kaggle or open government portals. This will solidify your skills and prepare you for advanced topics such as time series analysis, predictive modeling, and machine learning workflows. With more practice, one can easily jump to more advanced topics like Time series and advanced data visualisation or combine Pandas with machine learning workflows.