
Data Analysis with Python: Learn Pandas Basics Step-by-Step

Data analysis with Python using tools like Pandas, Excel, R, and SQL for efficient data manipulation and visualization.

Introduction

Data analysis with Python has become an essential skill for professionals across various industries. By using Python’s powerful libraries like Pandas, data analysis has never been easier and more efficient. Whether you’re handling big data, performing scientific research, or working with business datasets, Python provides the tools to manage and analyze data effectively.

The demand for skilled professionals who can work with data and draw insight from complex datasets has grown sharply. Python has become a top choice for data analysis due to its efficiency, clear syntax, and rich library ecosystem. Libraries such as Pandas, NumPy, Matplotlib, and Seaborn are widely used for data analysis; Pandas in particular provides structured data objects and features for efficient data manipulation, cleaning, and analysis.

What is Pandas in Data Analysis?

Pandas is a renowned library for data analysis in Python. It enables flexible data manipulation and analysis through two primary data structures: Series (one-dimensional) and DataFrame (two-dimensional). These structures allow users to organize, label, and manage data. You can analyze numerical or textual data extracted from any source, such as CSV files, SQL databases, or Excel spreadsheets. Pandas can perform various data-related operations, including:

  • Fast loading, efficient cleaning, and transformation of large datasets
  • Excellent alignment and indexing that allows clear organisation and easy access to data
  • Handling of missing and null values to increase data quality
  • Flexibility for the reshaping and merging of datasets for in-depth analysis

Comparison of data analysis tools

Python vs Excel vs R vs SQL: Which One’s Better for Data Analysis?

  • Excel is a user-friendly tool for analysing small datasets. It has built-in features like Pivot Tables and charts for visual representation of data. For larger datasets, however, say beyond roughly 10,000 rows, you need more efficient tools like Python with libraries such as Pandas, which can handle millions of rows easily.
  • R is a powerful tool commonly used for statistical analysis. Beyond analysis, it can perform various other tasks, including modelling, clustering, and time-series modelling, and can also handle large datasets. However, Python with Pandas is more versatile and integrates seamlessly with machine learning workflows.
  • SQL is used for querying, filtering, and aggregating datasets. It is widely used in industry and is highly efficient for extracting and joining large datasets. However, SQL is largely limited to data retrieval and transformation. For advanced analysis, Pandas offers great flexibility and a rich set of analytical tools within Python scripts.

While Excel is great for small-scale analysis and SQL excels at querying large databases, Python with Pandas combines power, flexibility, and scalability, making it a top choice for end-to-end data analysis and automation.

Installing and Importing Pandas for Data Analysis with Python

Installing Pandas:

Using pip:
pip install pandas

Using conda:
conda install pandas

How to import Pandas in a script or notebook:
import pandas as pd

To check if Pandas is installed properly:
print(pd.__version__)

Insights about Pandas Data Structures

Series

A one-dimensional labelled array that can hold any type of data: integers, floats, strings, or even Python objects. Each value has an index label, which makes accessing and manipulating data easy and efficient.

Example:

import pandas as pd
# Creating a Series from a list
data = [10, 20, 30, 40]
series = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series)

Output:

a    10  
b    20  
c    30  
d    40  
dtype: int64

DataFrame

A two-dimensional labelled data structure that organises information into rows and columns.

From a List of Lists

import pandas as pd

data = [
    [1, 'Alice', 85],
    [2, 'Bob', 90],
    [3, 'Charlie', 78]
]
df = pd.DataFrame(data, columns=['ID', 'Name', 'Score'])
print(df)

From a Dictionary

data = {
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 78]
}
df = pd.DataFrame(data)
print(df)

From a NumPy Array

import numpy as np

array_data = np.array([
    [1, 85],
    [2, 90],
    [3, 78]
])
df = pd.DataFrame(array_data, columns=['ID', 'Score'])
print(df)

From a CSV File

Assuming students.csv:

ID,Name,Score
1,Alice,85
2,Bob,90
3,Charlie,78

df = pd.read_csv('students.csv')
print(df)

Importing and Exporting Data

How to read data:

  • CSV: pd.read_csv('file.csv')
  • Excel: pd.read_excel('file.xlsx')
  • JSON: pd.read_json('file.json')

Writing data:

  • CSV: df.to_csv('output.csv')
  • Excel: df.to_excel('output.xlsx')
  • JSON: df.to_json('output.json')
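
A quick round-trip sketch of these read/write calls (the file name is a placeholder):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df.to_csv('output.csv', index=False)   # index=False avoids writing the row index as a column
df_back = pd.read_csv('output.csv')
print(df_back)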

Solutions to common file path errors:

  • Double-check the file path and extension.
  • Use raw strings (prefix with r) for Windows paths.
  • Ensure you have read/write permissions.

Exploring Data and Summary Stats

print(df.head())
print(df.info())
print(df.describe())

  • head() – Shows the first 5 rows (by default)
  • tail() – Shows the last 5 rows (by default)
  • sample() – Returns a random sample of rows
  • info() – Displays a summary of the DataFrame (columns, dtypes, non-null counts)
  • describe() – Summary statistics for numeric columns

Cleaning Data in Pandas

Handling missing values:

df.isnull()
df.dropna()
df.fillna(value)

Dropping duplicates:

df.drop_duplicates()

Converting data types:

df['col'] = df['col'].astype(int)

Parsing dates and categoricals:

df['date_col'] = pd.to_datetime(df['date_col'])
df['col'] = df['col'].astype('category')
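
Putting these cleaning steps together, a minimal end-to-end sketch (the data and column names are invented for illustration):

import pandas as pd

# Hypothetical raw data: a missing name, a string-typed age column, a duplicate row
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Bob', None],
    'age': ['25', '30', '30', '40'],
    'joined': ['2021-01-05', '2021-03-12', '2021-03-12', '2021-07-01']
})

df = df.drop_duplicates()                    # drop the duplicated Bob row
df['name'] = df['name'].fillna('Unknown')    # fill the missing name
df['age'] = df['age'].astype(int)            # convert strings to integers
df['joined'] = pd.to_datetime(df['joined'])  # parse date strings
print(df.dtypes)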

Selecting and Filtering Data

loc[] – Label-based:

df.loc[2, 'Name']
df.loc[df['Age'] > 25]

iloc[] – Position-based:

df.iloc[0, 1]
df.iloc[0:3, :]

Slicing:

df['Name']
df[['Name', 'Age']]
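
A small runnable sketch tying these together (the Name/Age data is made up for illustration):

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 32, 28]
})

print(df.loc[df['Age'] > 25])      # boolean filtering with loc
print(df.iloc[0, 1])               # first row, second column -> 24
print(df[['Name', 'Age']].head(2)) # column selection plus row slicing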

Transforming Data

  • Sorting: df.sort_values('column'), df.sort_index()
  • Renaming: df.rename(columns={'old': 'new'})
  • Applying functions: df['col'].apply(func), df['col'].map(dict)
  • Reshaping: df.melt(), df.pivot(), df.transpose()
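
A brief sketch applying a few of these transforms to a toy DataFrame:

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [85, 90]})

df = df.rename(columns={'score': 'points'})     # rename a column
df = df.sort_values('points', ascending=False)  # sort by values
df['grade'] = df['points'].apply(lambda p: 'A' if p >= 90 else 'B')
long_df = df.melt(id_vars='name')               # reshape wide -> long
print(long_df)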

Grouping and Aggregating

df.groupby('column').mean()
df.groupby(['col1', 'col2']).sum()
df.agg({'col1': 'mean', 'col2': 'sum'})
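
For example, on a hypothetical sales table:

import pandas as pd

sales = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'product': ['A', 'B', 'A', 'B'],
    'revenue': [100, 150, 200, 120]
})

print(sales.groupby('region')['revenue'].mean())           # mean revenue per region
print(sales.groupby(['region', 'product'])['revenue'].sum())
print(sales.agg({'revenue': ['mean', 'sum']}))             # multiple aggregations at once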

Merging and Joining DataFrames

pd.concat([df1, df2])
pd.merge(df1, df2, on='key')
df1.join(df2, how='left')

Join types: inner, outer, left, right
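
A minimal sketch contrasting two join types (the customer/order data is invented):

import pandas as pd

customers = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
orders = pd.DataFrame({'id': [1, 2, 2], 'amount': [250, 99, 40]})

# Inner join keeps only ids present in both frames
print(pd.merge(customers, orders, on='id', how='inner'))

# Left join keeps every customer, filling missing orders with NaN
print(pd.merge(customers, orders, on='id', how='left'))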

Visualizing Results in Data Analysis with Pandas

df.plot.line()
df.plot.bar()
df.plot.hist()
df.plot.box()

Customization:

  • Add titles, axis labels, and legends
  • For advanced plots, use Matplotlib or Seaborn

To save a figure to disk:

import matplotlib.pyplot as plt
plt.savefig('plot.png')
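
Putting the customization tips together, a minimal sketch with invented data:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar'], 'sales': [120, 135, 150]})

ax = df.plot.bar(x='month', y='sales', title='Monthly Sales', legend=False)
ax.set_xlabel('Month')   # label the axes on the returned Axes object
ax.set_ylabel('Sales')
plt.show()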

Hands-On Example: Analyzing a Public Dataset

Step 1: Load Data

import pandas as pd

df = pd.read_csv('World Happiness Report 2021.csv')

Step 2: Explore Data

print(df.head())
print(df.info())
print(df.describe())

Step 3: Clean Data

df['Score'] = df['Score'].fillna(df['Score'].median())
df.drop_duplicates(inplace=True)

Step 4: Analyze and Visualize

# Average happiness score by region
avg_score = df.groupby('Regional indicator')['Score'].mean()
print(avg_score)

# Plot average score by region
import matplotlib.pyplot as plt

avg_score.plot(kind='bar', title='Average Happiness Score by Region')
plt.ylabel('Score')
plt.show()

Step 5: Summarize Insights

  • Some regions consistently report higher happiness scores.
  • Visualizations reveal disparities and trends across regions.

Common Mistakes and Best Practices in Pandas

SettingWithCopyWarning

# Problematic
df_subset = df[df['age'] > 30]
df_subset['eligible'] = True

# Correct
df.loc[df['age'] > 30, 'eligible'] = True

Tips:

  • Avoid modifying raw data directly
  • Chain pandas methods for step-by-step clarity
  • Use chunked reading for large datasets (see the sketch after this list)
  • Comment code and use descriptive variables
  • Maintain reproducibility and collaboration clarity
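
For the chunking tip, a hedged sketch of processing a large CSV in pieces (the file name and the 'amount' column are placeholders):

import pandas as pd

total = 0
# Process the file 100,000 rows at a time instead of loading it all at once
for chunk in pd.read_csv('big_file.csv', chunksize=100_000):
    total += chunk['amount'].sum()
print(total)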

Conclusion

Pandas is an essential tool for data analysis: it empowers users to effectively manage, clean, and analyse real-world datasets. To become proficient in Python and Pandas, practice with real-world datasets such as those from Kaggle or open government portals. This will solidify your skills and prepare you for advanced topics such as time-series analysis, advanced data visualisation, predictive modeling, and machine learning workflows.

