五个最难的SQL任务:数据缺失、透视、独热编码

应用开发2025-11-04 20:07:4718

许多人体验到了云数据仓库内集中计算的个最速度和效率优势,但同时大家也认识到这种方法存在一些缺点。任热编缺点之一是据缺需要学习和执行不同语言(尤其是SQL)的查询,这增加了很多复杂性。失透视独

为了解决这个问题,个最本文分享一些最难以在SQL中学习和执行的任热编具体转换,并提供实际所需的据缺SQL代码。

1. Datespine

Datespine是失透视独生成日期索引的转换。

想象下,个最您正在分析每日销售数据,任热编表格如下:

因16日和17日没有销售,据缺对应的失透视独行缺失。如果想计算平均每日销售额或构建时间序列预测模型,个最那这种格式是任热编个问题,我们需要插入缺失日期的据缺行。

基本概念:

生成或选择唯一日期生成或选择唯一产品交叉连接(笛卡尔积)1&2的所有组合将第3步的结果与原始数据进行外连接 复制WITH GLOBAL_SPINE AS ( SELECT ROW_NUMBER() OVER ( ORDER BY NULL ) as INTERVAL_ID, DATEADD( day, (INTERVAL_ID - 1), 2020-01-01T00:00 :: timestamp_ntz ) as SPINE_START, DATEADD( day, INTERVAL_ID, 2020-01-01T00:00 :: timestamp_ntz ) as SPINE_END FROM TABLE ( GENERATOR(ROWCOUNT => 1097) ) ), GROUPS AS ( SELECT product, MIN(sales_date) AS LOCAL_START, MAX(sales_date) AS LOCAL_END FROM My_First_Table GROUP BY product ), GROUP_SPINE AS ( SELECT product, SPINE_START AS GROUP_START, SPINE_END AS GROUP_END FROM GROUPS G CROSS JOIN LATERAL ( SELECT SPINE_START, SPINE_END FROM GLOBAL_SPINE S WHERE S.SPINE_START >= G.LOCAL_START ) ) SELECT G.product AS GROUP_BY_product, GROUP_START, GROUP_END, T.* FROM GROUP_SPINE G LEFT JOIN My_First_Table T ON sales_date >= G.GROUP_START AND sales_date < G.GROUP_END AND G.product = T.product;1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48.49.50.51.52.53.54.55.56.

最终结果如下所示:

2. Pivot

Pivot是一种将行数据转换为列数据的操作,以便更好地进行分析和可视化。

有时,在进行分析时,免费源码下载您希望重新构造表格。例如,有个罗列学生、科目和成绩表格,我们想把具体科目分解为每个列。

之前:

复制SELECT Student, MATHEMATICS, GEOGRAPHY, PHYS_ED FROM ( SELECT Student, Grade, Subject FROM skool) PIVOT ( AVG ( Grade ) FOR Subject IN ( Mathematics, Geography, Phys Ed ) ) as p ( Student, MATHEMATICS, GEOGRAPHY, PHYS_ED );1.2.3.4.

结果:

3. One-Hot编码(或“虚拟”变量)

One-Hot编码是一种将分类变量转换为数字变量的方法。

这里是一个使用STATE作为列进行独热编码的示例。

之前:

复制SELECT *, CASE WHEN State = AL THEN 1 ELSE 0 END as STATE_AL, CASE WHEN State = AK THEN 1 ELSE 0 END as STATE_AK, CASE WHEN State = AZ THEN 1 ELSE 0 END as STATE_AZ, CASE WHEN State = AR THEN 1 ELSE 0 END as STATE_AR, CASE WHEN State = AS THEN 1 ELSE 0 END as STATE_AS, CASE WHEN State = CA THEN 1 ELSE 0 END as STATE_CA, CASE WHEN State = CO THEN 1 ELSE 0 END as STATE_CO, CASE WHEN State = CT THEN 1 ELSE 0 END as STATE_CT, CASE WHEN State = DC THEN 1 ELSE 0 END as STATE_DC, CASE WHEN State = FL THEN 1 ELSE 0 END as STATE_FL, CASE WHEN State = GA THEN 1 ELSE 0 END as STATE_GA, CASE WHEN State = HI THEN 1 ELSE 0 END as STATE_HI, CASE WHEN State = ID THEN 1 ELSE 0 END as STATE_ID, CASE WHEN State = IL THEN 1 ELSE 0 END as STATE_IL, CASE WHEN State = IN THEN 1 ELSE 0 END as STATE_IN, CASE WHEN State = IA THEN 1 ELSE 0 END as STATE_IA, CASE WHEN State = KS THEN 1 ELSE 0 END as STATE_KS, CASE WHEN State = KY THEN 1 ELSE 0 END as STATE_KY, CASE WHEN State = LA THEN 1 ELSE 0 END as STATE_LA, CASE WHEN State = ME THEN 1 ELSE 0 END as STATE_ME, CASE WHEN State = MD THEN 1 ELSE 0 END as STATE_MD, CASE WHEN State = MA THEN 1 ELSE 0 END as STATE_MA, CASE WHEN State = MI THEN 1 ELSE 0 END as STATE_MI, CASE WHEN State = MN THEN 1 ELSE 0 END as STATE_MN, CASE WHEN State = MS THEN 1 ELSE 0 END as STATE_MS, CASE WHEN State = MO THEN 1 ELSE 0 END as STATE_MO, CASE WHEN State = MT THEN 1 ELSE 0 END as STATE_MT, CASE WHEN State = NE THEN 1 ELSE 0 END as STATE_NE, CASE WHEN State = NV THEN 1 ELSE 0 END as STATE_NV, CASE WHEN State = NH THEN 1 ELSE 0 END as STATE_NH, CASE WHEN State = NJ THEN 1 ELSE 0 END as STATE_NJ, CASE WHEN State = NM THEN 1 ELSE 0 END as STATE_NM, CASE WHEN State = NY THEN 1 ELSE 0 END as STATE_NY, CASE WHEN State = NC THEN 1 ELSE 0 END as STATE_NC, CASE WHEN State = ND THEN 1 ELSE 0 END as STATE_ND, CASE WHEN State = OH THEN 1 ELSE 0 END as STATE_OH, CASE WHEN State = OK THEN 1 ELSE 0 END as STATE_OK, CASE WHEN State = OR THEN 1 ELSE 0 END as STATE_OR, CASE WHEN State = PA THEN 1 ELSE 0 END as STATE_PA, CASE WHEN State = RI THEN 1 ELSE 0 END as STATE_RI, CASE WHEN State = SC THEN 1 ELSE 0 END as STATE_SC, CASE WHEN State = SD THEN 1 ELSE 0 END as STATE_SD, CASE WHEN State = TN THEN 1 ELSE 0 END as STATE_TN, CASE WHEN State = TX THEN 1 ELSE 0 END as STATE_TX, CASE WHEN State = UT THEN 1 ELSE 0 END as STATE_UT, CASE WHEN State = VT THEN 1 ELSE 0 END as STATE_VT, CASE WHEN State = VA THEN 1 ELSE 0 END as STATE_VA, CASE WHEN State = WA THEN 1 ELSE 0 END as STATE_WA, CASE WHEN State = WV THEN 1 ELSE 0 END as STATE_WV, CASE WHEN State = WI THEN 1 ELSE 0 END as STATE_WI, CASE WHEN State = WY THEN 1 ELSE 0 END as STATE_WY FROM BABYTABLE;1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48.49.50.51.52.53.

结果:

4. 市场篮子分析

市场篮子分析是购物篮分析和挖掘关联规则的一种方法。在这个过程中,首先需要对数据进行格式化,以便将每笔交易聚合到单个记录中。对于个人电脑来说,这一步骤可能很具挑战性,因为它涉及数据处理和转换。然而,数据仓库专为高效处理这些数据而设计,因此它更适合执行市场篮子分析所需的数据格式化任务。数据仓库提供了便捷的功能,使数据格式化更加容易,从而支持购物篮分析和关联规则的挖掘。

复制WITH order_detail as ( SELECT SALESORDERNUMBER, listagg(ENGLISHPRODUCTNAME, , ) WITHIN group ( order by ENGLISHPRODUCTNAME ) as ENGLISHPRODUCTNAME_listagg, COUNT(ENGLISHPRODUCTNAME) as num_products FROM transactions GROUP BY SALESORDERNUMBER ) SELECT ENGLISHPRODUCTNAME_listagg, count(SALESORDERNUMBER) as NumTransactions FROM order_detail where num_products > 1 GROUP BY ENGLISHPRODUCTNAME_listagg order by count(SALESORDERNUMBER) desc;1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.

结果:

5. 时间序列聚合

时间序列聚合是指将时间序列数据按照一定的时间间隔进行汇总和聚合,广泛用语分析数据。b2b信息网然而,要正确执行时间序列聚合,关键因素之一是需要对数据进行适当的格式化,尤其是在使用窗口函数时。这一步骤的正确处理对于获得准确且有意义的聚合结果至关重要。

例如,假设我们想计算过去14天的平均销售额。使用窗口函数的方法要求我们将所有销售数据转换为每天一行的格式。然而,实际上,大多数销售数据是以交易级别存储的,这让格式转换变得困难。这就是时间序列聚合发挥作用的地方。通过时间序列聚合,我们可以创建历史指标的聚合结果,无需重新格式化整个数据集。如果我们想一次添加多个指标,它也会派上用场:

过去14天内的平均销售额过去6个月中最大的购买统计过去 90 天内的亿华云计算不同产品类型

如果您想使用窗口函数,则需要通过几个步骤独立构建每个指标。

处理此问题的更好方法是使用公共表表达式 (CTE) 来定义每个预先聚合的历史窗口。

例如:

复制WITH BASIC_OFFSET_14DAY AS ( SELECT A.CustomerID, A.TransactionDate, AVG(B.PurchaseAmount) as AVG_PURCHASEAMOUNT_PAST14DAY, MAX(B.PurchaseAmount) as MAX_PURCHASEAMOUNT_PAST14DAY, COUNT(DISTINCT B.TransactionID) as COUNT_DISTINCT_TRANSACTIONID_PAST14DAY FROM My_First_Table A INNER JOIN My_First_Table B ON A.CustomerID = B.CustomerID AND 1 = 1 WHERE B.TransactionDate >= DATEADD(day, -14, A.TransactionDate) AND B.TransactionDate <= A.TransactionDate GROUP BY A.CustomerID, A.TransactionDate ), BASIC_OFFSET_90DAY AS ( SELECT A.CustomerID, A.TransactionDate, AVG(B.PurchaseAmount) as AVG_PURCHASEAMOUNT_PAST90DAY, MAX(B.PurchaseAmount) as MAX_PURCHASEAMOUNT_PAST90DAY, COUNT(DISTINCT B.TransactionID) as COUNT_DISTINCT_TRANSACTIONID_PAST90DAY FROM My_First_Table A INNER JOIN My_First_Table B ON A.CustomerID = B.CustomerID AND 1 = 1 WHERE B.TransactionDate >= DATEADD(day, -90, A.TransactionDate) AND B.TransactionDate <= A.TransactionDate GROUP BY A.CustomerID, A.TransactionDate ), BASIC_OFFSET_180DAY AS ( SELECT A.CustomerID, A.TransactionDate, AVG(B.PurchaseAmount) as AVG_PURCHASEAMOUNT_PAST180DAY, MAX(B.PurchaseAmount) as MAX_PURCHASEAMOUNT_PAST180DAY, COUNT(DISTINCT B.TransactionID) as COUNT_DISTINCT_TRANSACTIONID_PAST180DAY FROM My_First_Table A INNER JOIN My_First_Table B ON A.CustomerID = B.CustomerID AND 1 = 1 WHERE B.TransactionDate >= DATEADD(day, -180, A.TransactionDate) AND B.TransactionDate <= A.TransactionDate GROUP BY A.CustomerID, A.TransactionDate ) SELECT src.*, BASIC_OFFSET_14DAY.AVG_PURCHASEAMOUNT_PAST14DAY, BASIC_OFFSET_14DAY.MAX_PURCHASEAMOUNT_PAST14DAY, BASIC_OFFSET_14DAY.COUNT_DISTINCT_TRANSACTIONID_PAST14DAY, BASIC_OFFSET_90DAY.AVG_PURCHASEAMOUNT_PAST90DAY, BASIC_OFFSET_90DAY.MAX_PURCHASEAMOUNT_PAST90DAY, BASIC_OFFSET_90DAY.COUNT_DISTINCT_TRANSACTIONID_PAST90DAY, BASIC_OFFSET_180DAY.AVG_PURCHASEAMOUNT_PAST180DAY, BASIC_OFFSET_180DAY.MAX_PURCHASEAMOUNT_PAST180DAY, BASIC_OFFSET_180DAY.COUNT_DISTINCT_TRANSACTIONID_PAST180DAY FROM My_First_Table src LEFT OUTER JOIN BASIC_OFFSET_14DAY ON BASIC_OFFSET_14DAY.TransactionDate = src.TransactionDate AND BASIC_OFFSET_14DAY.CustomerID = src.CustomerID LEFT OUTER JOIN BASIC_OFFSET_90DAY ON BASIC_OFFSET_90DAY.TransactionDate = src.TransactionDate AND BASIC_OFFSET_90DAY.CustomerID = src.CustomerID LEFT OUTER JOIN BASIC_OFFSET_180DAY ON BASIC_OFFSET_180DAY.TransactionDate = src.TransactionDate AND BASIC_OFFSET_180DAY.CustomerID = src.CustomerID;1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48.49.50.51.52.53.54.55.56.57.58.59.60.61.62.63.64.65.66.67.68.69.70.71.72.73.

结果:

结语

希望本篇文章有助于阐明数据从业者在操作现代数据堆栈时会遇到的不同问题。当涉及查询云数据仓库时,SQL是一把双刃剑。虽然将计算集中在云数据仓库中可以提高速度,但有时需要一些额外的SQL技能。

本文地址:http://www.bzve.cn/news/494f2199484.html
版权声明

本文仅代表作者观点,不代表本站立场。
本文系作者授权发表,未经许可,不得转载。

全站热门

探究电脑密码错误的原因与解决方法(密码保护下的安全隐患与应对策略)

循序渐进讲解负载均衡vivoGateway(VGW)

2023年数据中心存储冷却最新技术

年研发支出300亿美元,苹果想要自研更多芯片?

电脑主机配置和安装教程(从零开始学习电脑主机配置和安装,让你的电脑运行更流畅)

干货!分享Nginx搭建Web测试报告服务器的落地方案

重复利用数据中心多余热量的四种创新方法

日本修订芯片战略:到 2030 年国产半导体行业销售额提高两倍至 15 万亿日元

友情链接

滇ICP备2023006006号-39