Image by Author
SQL is one of the most used tools in data science for data manipulation. No wonder since SQL was created to query and manipulate data in databases. Because of its purpose, it offers a wide range of techniques for data manipulation.
Some belong to a more advanced spectrum and can be very useful in data science.
1. Subqueries and Correlated Subqueries
An SQL subquery is a query within another (main) query. They are typically used in the SELECT statement but can also be used in INSERT, UPDATE, and DELETE. Subqueries can be used in FROM, WHERE, HAVING, and JOIN clauses.
A correlated subquery is a special type of subquery that depends on the results of the main query.
When to Use Them:
- Filtering
- Calculations
- Restructuring data
- Row-by-row data evaluations (correlated subqueries)
- Dynamic lookups (correlated subqueries)
2. Common Table Expressions (CTE)
CTE is a temporary result set that can be referenced in SELECT, INSERT, UPDATE, or DELETE statements. In many cases, they are often nothing but neatly written subqueries. However, one significant difference between them is that CTEs are reusable in the main query, unlike subqueries.
When to Use Them:
- Recursive queries
- When the same ‘subquery’ result needs to be used across multiple steps
- Breaking down complex queries into smaller logical components
3. Recursive Queries
In SQL, recursive queries are written in recursive CTEs. A recursive query references itself, making it perfect for querying hierarchical and graph data structures.
When to Use Them:
- Finding descendants in a hierarchical structure (e.g., organizational chart or a product category tree)
- Calculating hierarchical paths (e.g., finding the reporting chain from an employee to the CEO)
- Generating sequential data
- Traversing graphs (e.g., finding all possible routes between locations in a transportation network)
- Nested totals (e.g., sales per product, product category, and grand total)
4. Window Functions
Window functions allow you to perform calculations across rows related to the current row, with the need to aggregate data.
When to Use Them:
5. Set Operators
Set operators are used to combine the results of two or more SELECT queries into a single output. They are:
- UNION: Combines the queries’ outputs and removes duplicate rows.
- UNION ALL: Combines the queries’ outputs, including duplicates.
- INTERSECT: Returns only rows present in all queries’ outputs.
- EXCEPT (or MINUS): Returns only rows appearing in the first query’s output but not others.
When to Use Them:
- Comparing datasets
- Filtering results across multiple tables
6. GROUP BY Extensions
GROUP BY is a standard SQL clause for data aggregation. However, you can perform more complex groupings using these GROUP BY extensions:
- GROUPING SETS: For multiple groupings in a single GROUP BY.
- ROLLUP: For creating subtotals and grand totals in a single query.
- CUBE: For creating all possible combinations of aggregations for columns in GROUP BY, including subtotals for each level and a grand total.
When to Use Them:
- Hierarchical summaries
- Multi-dimensional analysis
- Generating various aggregate views
7. String Functions
Complex data often includes textual fields that require manipulation to extract insights or prepare for analysis. In SQL, many string functions help you with that, such as:
- TRIM(): Removes leading and/or trailing characters from a string.
- REPLACE(): Substitutes all occurrences of a substring within a string with a new substring.
- SUBSTRING(): Extracts a part of a string from a specified starting position and length.
- LIKE: Allows pattern matching within strings using wildcard characters, such as % and _.
- PATINDEX(): Returns the starting position of a pattern within a string or zero if the pattern is not found.
- RegEx: Provides a way to search, match, and manipulate strings based on complex patterns.
- SPLIT_PART(): Splits a string by a delimiter and returns a particular segment based on an index.
- STRING_AGG(): Concatenates values from multiple rows into a single string, separated by a specified delimiter.
When to Use Them:
- Data cleaning
- Pattern matching
- Text parsing and tokenization
- Text aggregation
Conclusion
Consider learning and incorporating these SQL techniques into your data manipulation techniques in data science projects. They will definitely satisfy many of your needs.
Nate Rosidi is a data scientist and in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
This post was originally published on here