Table of Contents

Description

This tutorial is based on the book "SQL Practice Problems". The book uses fake database created by Microsoft called Northwind and modifies it by adding new entries. The book provides the answer on MS SQL server 2016.

Notes:

The Order data is corrupted. But it can be recreated using Yugabyte github.

Imports

Load the database

Get Pandas dataframes

Pyspark dataframes

Peek at dataframes

PySpark dataframes

Advanced SQL Questions

32. High-value customers

We want to send all of our high-value customers a special VIP gift. We're defining high-value customers as those who've made at least 1 order with a total value (not including the discount) equal to $10,000 or more. We only want to consider orders made in the year 1997.

33. High-value customers - total orders

The manager has changed his mind. Instead of
requiring that customers have at least one individual
orders totaling 10,000 or more, he wants to define
high-value customers as those who have orders
totaling 15,000 or more in 1997. How would you
change the answer to the problem above?

34. High-value customers - with discount

Change the above query to use the discount when calculating high-value customers. Order by the total amount which includes the discount.

35. Month-end orders

At the end of the month, salespeople are likely to try much harder to get orders, to meet their month-end quotas. Show all orders made on the last day of the month. Order by EmployeeID and OrderID

36. Orders with many line items

The Northwind mobile app developers are testing an app that customers will use to show orders. In order to make sure that even the largest orders will show up correctly on the app, they'd like some samples of orders that have lots of individual line items. Show the 10 orders with the most line items, in order of total line items.

37. Orders - random assortment

The Northwind mobile app developers would now like to just get a random assortment of orders for beta testing on their app. Show a random set of 2% of all orders.

38. Orders - accidental double-entry

Janet Leverling, one of the salespeople, has come to you with a request. She thinks that she accidentally double-entered a line item on an order, with a different ProductID, but the same quantity. She remembers that the quantity was 60 or more. Show all the OrderIDs with line items that match this, in order of OrderID.

39. Orders - accidental double-entry details

Based on the previous question, we now want to show details of the order, for orders that match the above criteria.

40. Orders - accidental double-entry

details, derived table Here's another way of getting the same results as in the previous problem, using a derived table instead of a CTE. However, there's a bug in this SQL. It returns 20 rows instead of 16. Correct the SQL. Problem SQL:

Select
  OrderDetails.OrderID,
  ProductID,
  UnitPrice,
  Quantity,
  Discount
From OrderDetails
  Join (
    Select
      OrderID
      From OrderDetails
      Where Quantity >= 60
      Group By OrderID, Quantity
      Having Count(*) > 1
       ) PotentialProblemOrders
  on PotentialProblemOrders.OrderID = OrderDetails.OrderID
Order by OrderID, ProductID

41. Late orders

Some customers are complaining about their orders arriving late. Which orders are late?

42. Late orders - which employees?

Some salespeople have more orders arriving late than others. Maybe they're not following up on the order process, and need more training. Which salespeople have the most orders arriving late?

43. Late orders vs. total orders

Andrew, the VP of sales, has been doing some more thinking some more about the problem of late orders. He realizes that just looking at the number of orders arriving late for each salesperson isn't a good idea. It needs to be compared against the total number of orders per salesperson. Return results like the following:

Expected Result
EmployeeID LastName AllOrders LateOrders
----------- -------------------- ----------- -----------
1 Davolio 123 3
2 Fuller 96 4
3 Leverling 127 5
4 Peacock 156 10
6 Suyama 67 3
7 King 72 4
8 Callahan 104 5
9 Dodsworth 43 5

45. Late orders vs. total orders - fix null

Continuing on the answer for above query, let's fix the results for row 5 - Buchanan. He should have a 0 instead of a Null in LateOrders.

46. Late orders vs. total orders - percentage

Now we want to get the percentage of late orders over total orders.

47. Late orders vs. total orders - fix decimal

So now for the PercentageLateOrders, we get a decimal value like we should. But to make the output easier to read, let's cut the PercentLateOrders off at 2 digits to the right of the decimal point.

48. Customer grouping

Andrew Fuller, the VP of sales at Northwind, would like to do a sales campaign for existing customers. He'd like to categorize customers into groups, based on how much they ordered in 1997. Then, depending on which group the customer is in, he will target the customer with different sales materials. The customer grouping categories are 0 to 1,000, 1,000 to 5,000, 5,000 to 10,000, and over 10,000. A good starting point for this query is the answer from the problem “High-value customers - total orders. We don’t want to show customers who don’t have any orders in 1997. Order the results by CustomerID.

49. Customer grouping - fix null

There's a bug with the answer for the previous question. The CustomerGroup value for one of the rows is null. Fix the SQL so that there are no nulls in the CustomerGroup field.

50. Customer grouping with percentage

Based on the above query, show all the defined CustomerGroups, and the percentage in each. Sort by the total in each group, in descending order.

51. Customer grouping - flexible

52. Countries with suppliers or customers

Some Northwind employees are planning a business trip, and would like to visit as many suppliers and customers as possible. For their planning, they’d like to see a list of all countries where suppliers and/or customers are based.

53. Countries with suppliers or customers, version 2

The employees going on the business trip don’t want just a raw list of countries, they want more details. We’d like to see output like the below, in the Expected Results:

SupplierCountry CustomerCountry
--------------- ---------------
NULL Argentina
Australia NULL

54. Countries with suppliers or customers - version 3

The output of the above is improved, but it’s still not ideal What we’d really like to see is the country name, the total suppliers, and the total customers.

55. First order in each country

Looking at the Orders table—we’d like to show details for each order that was the first in that particular country, ordered by OrderID. So, we need one row per ShipCountry, and CustomerID, OrderID, and OrderDate should be of the first order from that country.

Expected Results
ShipCountry CustomerID OrderID OrderDate
--------------- ---------- ----------- ----------
Argentina OCEAN 10409 2015-01-09

56. Customers with multiple orders in 5 day period

There are some customers for whom freight is a major expense when ordering from Northwind. However, by batching up their orders, and making one larger order instead of multiple smaller orders in a short period of time, they could reduce their freight costs significantly. Show those customers who have made more than 1 order in a 5 day period. The sales people will use this to help customers reduce their costs. Note: There are more than one way of solving this kind of problem. For this problem, we will not be using Window functions.

Expected Result
CustomerID InitialOrderID InitialOrderDate nextid nextDate
NextOrderID NextOrderDate DaysBetween


cid   iordid idate      nxtid nxtdate    daysBetwn 
------------------------------------------------
ANTON 10677 2015-09-22 10682 2015-09-25 3
AROUT 10741 2015-11-14 10743 2015-11-17 3
BERGS 10278 2014-08-12 10280 2014-08-14 2
BERGS 10444 2015-02-12 10445 2015-02-13 1
BERGS 10866 2016-02-03 10875 2016-02-06 3

57. Customers with multiple orders in 5 day period, version 2

There’s another way of solving the problem above, using Window functions. We would like to see the following results.

Time Taken