Introduction
SQL for Data Scientists is essential for efficiently managing and analyzing large datasets. SQL (Structured Query Language) enables data professionals to store, retrieve, update, and delete data from relational databases. In the context of data science, where extracting insights from vast amounts of data is key, SQL acts as the foundation for data retrieval, transformation, and exploration. Mastering SQL for Data Scientists significantly enhances the speed and quality of data analysis, enabling faster insights and more informed decision-making.
Mastering SQL queries may significantly improve the workflows of data processing techniques. This allows the organisation to have faster insights into data and a more agile decision-making process. For instance, a well-structured query can be used to extract the behavioural pattern of the user using an extensive transactional database. Thus, proficient in SQL skills leads to efficient data handling and insightful analyses.
Selecting Data: The SELECT Statement
The SELECT statement is the fundamental block for exploring any data from any tables in an SQL query. This statement is used for retrieving data from one or more tables.
Basic Syntax:
– Fetch all rows and columns from a table named ‘datSci’ SELECT * FROM Courses; |
Here, (*) the asterisk is used in the query to select all columns from the table. To fetch specific columns, column names should be listed instead of asterisk, separated by commas:
Example
– Fetch the listed columns: user ID, username SELECT userID, userName From Courses; |
Most datasets have duplicate entries. In order to find unique data in a column, “DISTINCT” keyword is used to retrieve unique records from a dataset.
-Find all the unique course names from the Course table SELECT DISTINCT courseName FROM Courses; |
Filtering Data : The WHERE Clause
The WHERE clause helps us to filter rows based on certain conditions. This enables us to focus on a subset of data which meets the given conditions.
Basic Filtering Conditions:
Comparison operators like =,>,<,>=,<= and <> are used in the “WHERE” clause to filter the data.
Example
– Select the user who enrolled in the course “Data Science” SELECT userName, email FROM Courses WHERE courseName = “Data Science” –Find the course duration less than 10 week SELECT courseName, durationWeeks FROM Courses WHERE durationWeeks <10; |
Logical Operators: AND, OR, NOT:
To combine multiple conditions, logical operators such as AND, OR and NOT are used as follows:
Example:
–Find users registered on course “Data Science” with scholarship greater than 90 SELECT Courses_userName, courseName, marks FROM Courses c JOIN Schlorarship s ON c.userID = s.userId WHERE courseName = “Data Science” AND marks >90; –Find users registered on course “Data Science” or pass percentage is less than 70 SELECT userName, courseName, marks FROM Courses c LEFT JOIN Schlorarship s ON c.userID = s.userId WHERE courseName = “Data Science” OR marks < 70; –Find users who are not enrolled in the course “Data Science” SELECT userName, courseName FROM Courses c WHERE NOT courseName = “Data Science”; |
Aggregating Data with GROUP BY and Aggregate Functions
Common Aggregate functions such as SUM(), AVG(), COUNT() are used to perform calculations on sets of rows and the GROUP BY clause groups rows with the same values in one or more columns into a summary row.
Example
–Calculate the total fees amount paid SELECT SUM(AmountPaid) AS totalPayments FROM Paymentlist; –Count the number of users paid for the course SELECT DISTINCT COUNT(*) AS totalPaidUsers FROM Courses–Find the SELECT AVG(marks) As averageMarksFROM Marks; |
Grouping Data for Summary Reports
GROUP BY is used to combine functions into specific groups
Example
–Calculate the number of users registered for Data Science course SELECT userID, courseName, COUNT(userID) as total_members FROM Users GROUP BY courseName –Find the total payment done for each course SELECT c.courseName, SUM(p.AmountPaid) as totalPayments FROM Courses c JOIN Paymentlist p ON c.courseID = p.CourseID GROUP BY c.courseName; |
Filtering Grouped Results: The Having Clause
The “HAVING” clause filters the output of the GROUP BY query. It works similar to the “WHERE” function. However, it works only on grouped rows rather than individual rows. The significant difference between “WHERE” and “HAVING” is that the “WHERE” clause filters rows before the group is created, while the “HAVING” clause filters groups after it is made.
Example
–Find courses with total payments greater than 1000 SELECT c.courseName, SUM(p.AmountPaid) as totalPayments FROM Courses c JOIN Paymentlists p ON c.courseID = p.courseID GROUP BY c,courseName HAVING SUM(p.AmountPaid)>1000;–Find cities with more than 2 registered users. SELECT u.cities, COUNT (u.userID) AS userCount FROM Users u GROUP BY u.cities HAVING COUNT(u.userID) >2; |
The above example filters the groups by cities. It only includes cities where the number of users is greater than 2.
Sorting Data with ORDER BY
The ORDER BY clause is used to sort in ascending or descending order based on one or more columns.
Sorting in Ascending and Descending Order
Example
–Sort the usernames in Ascending order SELECT userName FROM Users ORDER BY userName ASC; – Sort the duration of weeks in descending order SELECT courseID, DurationWeeks FROM Courses ORDER BY DurationWeeks DESC; |
Multi-Column Sorting for Better Insights
The SQL in Data Science needs to sort multiple columns for better insights.
Example
– Sort users by course name by first (ascending) and then by username (ascending) SELECT courseName, userName FROM users ORDER BY courseName ASC, userName ASC; |
Combining Data: SQL Joins
Joins are used for combining two or more table data based on similar columns.
INNER JOIN : It is used to return the result when both tables have the same rows.
Example
–Find the users ID and name who have made the payments SELECT p.userID , u.userName FROM Users u INNER JOIN paymentlists p ON u.userID = p.userID; |
LEFT JOIN (LEFT OUTER JOIN): It matches the columns from the left table with the right table and returns the results from the left table. If there is no match in the right table, it results NULL
Example
–Display all courses from the course table and their payments though there is no payments made. SELECT c.courseName, SUM(p.amountPaid) AS totalPayments FROM Courses c LEFT JOIN Paymentlist p ON c.courseID = p.courseID GROUP BY c.courseName ORDER BY c.courseName; |
RIGHT JOIN : t matches the columns from the left table with the right table and returns the results from the right table. If there is no match in the left table, it results NULL
Example
– Show all payments and its course Name, though course details are missing in Payment lists SELECT p.paymentID, p.amountPaid, c.courseName FROM Courses c RIGHT JOIN paymentlists p ON c.courseID = p.courseID ORDER BY p.paymentID |
FULL OUTER JOIN : It combines all rows from both tables, irrespective of missing matches.
Use Case: Combining sales data with customer feedback where some sales lack feedback and some lack sales records.
GitHub Example : The “SQL Join Playground” on Github gives an interactive way to visualize and understand FULL OUTER JOIN and outer JOIN examples.
Subqueries: Nesting for Flexibility
Queries nested within others to enhance SQL’s flexibility are called subqueries. It will appear in WHERE clauses. Correlated subqueries are used when the inner query depends on the outer query for each iteration. Non-correlated subqueries are used for independent results.
Window Functions for Advanced Analysis
Window functions enable advanced analysis by performing arithmetic across a set of related rows without grouping. ROW_NUMBER(), RANK(), and DENSE_RANK() assign unique ranked values within a window. NTILE() divides rows into a specified number of groups.
UseCase : This window function is used in segmenting customers into quartiles for behavior analysis.
For more in-depth knowledge, search “Window Functions for Data Science” on Medium for practical applications.
Handling NULLS Gracefully
SQL uses the IS NULL and IS NOT NULL operators to identify null values in the result. COALESCE() is used to result in a specified default value in the place of NULLs. NULLIF() returns the NULL value, if the conditions written for “if” are not satisfied.
Converting Data Types: CAST() and CONVERT()
In SQL for Data Scientists, CAST() and CONVERT() are used to change data types. This is important in data science for comparison and operations across different data formats. CAST() is used in ANSI-SQL standard. CONVERT() is used mainly in date and times for styling options.
Example
CAST (‘2025-04-19’ AS VARCHAR(10));
CONVERT(DATE, 2025-04-19, 101) – Here, 101 is the code for the date format mm/dd/yyyy.
CONVERT(INT, 135.67);
KDnuggets 2-Part Guide on Data Type Conversion can be used for further reference and practical application of these functions.
Conclusion
For Data Science, SQL is necessary for data manipulation tasks such as querying and joining. Use this article to sharpen your SQL skills to excel easily in data science. A Successful Journey to Data Science begins with mastering SQL for Data Scientists, then advancing to tools like NoSQL and BigQuery. This prepares aspirants to master SQL, adapts to new data technologies, and thrives in the fast lane.