Table of Content

Top SQL Queries Every Data Scientist Should Know

SQL for Data Scientists

Introduction

SQL for Data Scientists is essential for efficiently managing and analyzing large datasets. SQL (Structured Query Language) enables data professionals to store, retrieve, update, and delete data from relational databases. In the context of data science, where extracting insights from vast amounts of data is key, SQL acts as the foundation for data retrieval, transformation, and exploration. Mastering SQL for Data Scientists significantly enhances the speed and quality of data analysis, enabling faster insights and more informed decision-making.

Mastering SQL queries may significantly improve the workflows of data processing techniques. This allows the organisation to have faster insights into data and a more agile decision-making process. For instance, a well-structured query can be used to extract the behavioural pattern of the user using an extensive transactional database. Thus, proficient in SQL skills leads to efficient data handling and insightful analyses.

Selecting Data: The SELECT Statement

The SELECT statement is the fundamental block for exploring any data from any tables in an SQL query. This statement is used for retrieving data from one or more tables. 

Basic Syntax:

– Fetch all rows and columns from a table named ‘datSci’
SELECT *
FROM Courses;

Here, (*) the asterisk is used in the query to select all columns from the table. To fetch specific columns, column names should be listed instead of asterisk, separated by commas:

Example

– Fetch the listed columns: user ID, username
SELECT userID, userName
From Courses;

Most datasets have duplicate entries. In order to find unique data in a column, “DISTINCT” keyword is used to retrieve unique records from a dataset.

-Find all the unique course names from the Course table
SELECT DISTINCT courseName
FROM Courses;

Filtering Data : The WHERE Clause

The WHERE clause helps us to filter rows based on certain conditions. This enables us to focus on a subset of data which meets the given conditions.

Basic Filtering Conditions:
Comparison operators like =,>,<,>=,<= and <> are used in the “WHERE” clause to filter the data. 

Example

– Select the user who enrolled in the course “Data Science”
SELECT userName, email
FROM Courses
WHERE courseName = “Data Science”
–Find the course duration less than 10 week
SELECT courseName, durationWeeks
FROM Courses
WHERE durationWeeks <10;

Logical Operators: AND, OR, NOT:

To combine multiple conditions, logical operators such as AND, OR and NOT are used as follows:

Example:

–Find users registered on course “Data Science” with scholarship greater than 90
SELECT Courses_userName, courseName, marks
FROM Courses c
JOIN Schlorarship s ON c.userID = s.userId
WHERE courseName = “Data Science” AND marks >90;
–Find users registered on course “Data Science” or pass percentage is less than 70
SELECT userName, courseName, marks
FROM Courses c
LEFT JOIN Schlorarship s ON c.userID = s.userId
WHERE courseName = “Data Science” OR marks < 70;
–Find users who are not enrolled in the course “Data Science”
SELECT userName, courseName
FROM Courses c
WHERE NOT courseName = “Data Science”;

Aggregating Data with GROUP BY and Aggregate Functions

Common Aggregate functions such as SUM(), AVG(), COUNT() are used to perform calculations on sets of rows and the GROUP BY clause groups rows with the same values in one or more columns into a summary row.

Example

–Calculate the total fees amount paid
SELECT SUM(AmountPaid) AS totalPayments
FROM Paymentlist;
–Count the number of users paid for the course
SELECT DISTINCT COUNT(*) AS totalPaidUsers
FROM Courses–Find the
SELECT AVG(marks) As averageMarksFROM Marks;

Grouping Data for Summary Reports

GROUP BY is used to combine functions into specific groups

Example

–Calculate the number of users registered for Data Science course
SELECT userID, courseName, COUNT(userID) as total_members
FROM Users
GROUP BY courseName –Find the total payment done for each course
SELECT c.courseName, SUM(p.AmountPaid) as totalPayments
FROM Courses c
JOIN Paymentlist p ON c.courseID = p.CourseID
GROUP BY c.courseName;

Filtering Grouped Results: The Having Clause

The “HAVING” clause filters the output of the GROUP BY query. It works similar to the “WHERE” function. However, it works only on grouped rows rather than individual rows. The significant difference between “WHERE” and “HAVING” is that the “WHERE” clause filters rows before the group is created, while the “HAVING” clause filters groups after it is made.

Example

–Find courses with total payments greater than 1000
SELECT c.courseName, SUM(p.AmountPaid) as totalPayments
FROM Courses c
JOIN Paymentlists p ON c.courseID = p.courseID
GROUP BY c,courseName
HAVING SUM(p.AmountPaid)>1000;–Find cities with more than 2 registered users.
SELECT u.cities,
COUNT (u.userID) AS userCount
FROM Users u
GROUP BY u.cities
HAVING COUNT(u.userID) >2;

The above example filters the groups by cities. It only includes cities where the number of users is greater than 2.

Sorting Data with ORDER BY

The ORDER BY clause is used to sort in ascending or descending order based on one or more columns.

Sorting in Ascending and Descending Order

Example

–Sort the usernames in Ascending order
SELECT userName
FROM Users
ORDER BY userName ASC;
– Sort the duration of weeks in descending order
SELECT courseID, DurationWeeks
FROM Courses
ORDER BY DurationWeeks DESC;

Multi-Column Sorting for Better Insights

The SQL in Data Science needs to sort multiple columns for better insights.

Example

– Sort users by course name by first (ascending) and then by username (ascending)
SELECT courseName, userName
FROM users
ORDER BY courseName ASC, userName ASC;

Combining Data: SQL Joins

Joins are used for combining two or more table data based on similar columns.

INNER JOIN : It is used to return the result when both tables have the same rows.

Example

–Find the users ID and name who have made the payments
SELECT p.userID , u.userName
FROM Users u
INNER JOIN paymentlists p ON u.userID = p.userID;

LEFT JOIN (LEFT OUTER JOIN): It matches the columns from the left table with the right table and returns the results from the left table. If there is no match in the right table, it results NULL

Example

–Display all courses from the course table and their payments though there is no payments made.
SELECT c.courseName, SUM(p.amountPaid) AS totalPayments
FROM Courses c
LEFT JOIN Paymentlist p ON c.courseID = p.courseID
GROUP BY c.courseName
ORDER BY c.courseName;

RIGHT JOIN : t matches the columns from the left table with the right table and returns the results from the right table. If there is no match in the left table, it results NULL

Example

– Show all payments and its course Name, though course details are missing in Payment lists
SELECT p.paymentID, p.amountPaid, c.courseName
FROM Courses c
RIGHT JOIN paymentlists p ON c.courseID = p.courseID
ORDER BY p.paymentID

FULL OUTER JOIN : It combines all rows from both tables, irrespective of missing matches.

Use Case: Combining sales data with customer feedback where some sales lack feedback and some lack sales records.

GitHub Example : The “SQL Join Playground” on Github gives an interactive way to visualize and understand FULL OUTER JOIN and outer JOIN examples.

Subqueries: Nesting for Flexibility

Queries nested within others to enhance SQL’s flexibility are called subqueries. It will appear in WHERE clauses. Correlated subqueries are used when the inner query depends on the outer query for each iteration. Non-correlated subqueries are used for independent results.

Window Functions for Advanced Analysis

Window functions enable advanced analysis by performing arithmetic across a set of related rows without grouping. ROW_NUMBER(), RANK(), and DENSE_RANK() assign unique ranked values within a window. NTILE() divides rows into a specified number of groups.

UseCase : This window function is used in segmenting customers into quartiles for behavior analysis.
For more in-depth knowledge, search “Window Functions for Data Science” on Medium for practical applications.

Handling NULLS Gracefully

SQL uses the IS NULL and IS NOT NULL operators to identify null values in the result. COALESCE() is used to result in a specified default value in the place of NULLs. NULLIF() returns the NULL value, if the conditions written for “if” are not satisfied.

Converting Data Types: CAST() and CONVERT()

In SQL for Data Scientists, CAST() and CONVERT() are used to change data types. This is important in data science for comparison and operations across different data formats. CAST() is used in ANSI-SQL standard. CONVERT() is used mainly in date and times for styling options.

Example

CAST (‘2025-04-19’ AS VARCHAR(10));

CONVERT(DATE, 2025-04-19, 101) – Here, 101 is the code for the date format mm/dd/yyyy.

CONVERT(INT, 135.67);

KDnuggets 2-Part Guide on Data Type Conversion can be used for further reference and practical application of these functions.

Conclusion

For Data Science, SQL is necessary for data manipulation tasks such as querying and joining. Use this article to sharpen your SQL skills to excel easily in data science. A Successful Journey to Data Science begins with mastering SQL for Data Scientists, then advancing to tools like NoSQL and BigQuery. This prepares aspirants to master SQL, adapts to new data technologies, and thrives in the fast lane.

FAQ

With dedicated effort, 6 months is ideal.

DSA is crucial, but practical development skills are also needed.

Work on projects, join coding competitions, and practice daily.

Don't just learn... Master it!

With expert mentors, hands-on projects, and a community of learners, we make skill-building easy and impactfull

Related Blog

6

Min Read

Data Analysts and Data Scientists both work with data but in...
6

Min Read

Data Analysts and Data Scientists both work with data but in...

Related Blog

Scroll to Top