python - Weighted moving average in Pyspark

I'm writing an anomaly detection algorithm for time series in Pyspark. I want to calculate a weighted moving average over a (-3, 3) or (-4, 4) window. Right now I am using lag and lead over a window and multiplying the values by a set of weights. My window is currently (-2, 2).

I want to know if there is another way to calculate the weighted moving average in Pyspark.

The code I am currently using is:

data_frame_1 = spark_data_frame.withColumn(
    "weighted_score_predicted",
    (weights[0] * lag(column_metric, 1).over(w) +
     weights[1] * lag(column_metric, 2).over(w) +
     weights[2] * lead(column_metric, 1).over(w) +
     weights[3] * lead(column_metric, 2).over(w)) / 2
).na.drop()
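
The snippet assumes a window specification w defined elsewhere; a minimal sketch, using hypothetical column names series_id and timestamp for the partitioning and ordering columns:

from pyspark.sql import Window

# Hypothetical window spec assumed by the snippet above;
# "series_id" and "timestamp" are placeholder column names.
w = Window.partitionBy("series_id").orderBy("timestamp")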


1 Reply


You can generalize your current code:

from pyspark.sql.functions import coalesce, lit, col, lead, lag
from operator import add
from functools import reduce

def weighted_average(c, window, offsets, weights):
    assert len(weights) == len(offsets)

    # Map an offset to a column: negative offsets look back (lag),
    # positive offsets look ahead (lead), and zero is the current row.
    def value(i):
        if i < 0: return lag(c, -i).over(window)
        if i > 0: return lead(c, i).over(window)
        return c

    # Create a list of Columns
    # - `value_i * weight_i` if `value_i IS NOT NULL` 
    # - literal 0 otherwise
    values = [coalesce(value(i) * w, lit(0)) for i, w in zip(offsets, weights)]

    # or sum(values, lit(0))
    return reduce(add, values, lit(0))

It can be used as:

from pyspark.sql.window import Window

df = spark.createDataFrame([
    ("a", 1, 1.4), ("a", 2, 8.0), ("a", 3, -1.0), ("a", 4, 2.4),
    ("a", 5, 99.0), ("a", 6, 3.0), ("a", 7, -1.0), ("a", 8, 0.0)
]).toDF("id", "time", "value")

w = Window.partitionBy("id").orderBy("time")
offsets, weights = [-2, -1, 0, 1, 2], [0.1, 0.2, 0.4, 0.2, 0.1]

result = df.withColumn("avg", weighted_average(
    col("value"), w, offsets, weights
))
result.show()

## +---+----+-----+-------------------+ 
## | id|time|value|                avg|
## +---+----+-----+-------------------+
## |  a|   1|  1.4|               2.06|
## |  a|   2|  8.0| 3.5199999999999996|
## |  a|   3| -1.0|              11.72|
## |  a|   4|  2.4|              21.66|
## |  a|   5| 99.0| 40.480000000000004|
## |  a|   6|  3.0|              21.04|
## |  a|   7| -1.0|               10.1|
## |  a|   8|  0.0|0.10000000000000003|
## +---+----+-----+-------------------+
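
For example, at time=1 the offsets -2 and -1 fall before the start of the partition, so those terms coalesce to 0 and the result is 0.4 * 1.4 + 0.2 * 8.0 + 0.1 * (-1.0) = 2.06.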

Note:

You might consider normalizing the results for frames with missing lags or leads, dividing each average by the sum of the weights that actually contributed:

result.withColumn(
    "normalization_factor",
    weighted_average(lit(1), w, offsets, weights)
).withColumn(
    "normalized_avg",
    col("avg") / col("normalization_factor")
).show()

## +---+----+-----+-------------------+--------------------+------------------+ 
## | id|time|value|                avg|normalization_factor|    normalized_avg|
## +---+----+-----+-------------------+--------------------+------------------+
## |  a|   1|  1.4|               2.06|  0.7000000000000001|2.9428571428571426|
## |  a|   2|  8.0| 3.5199999999999996|                 0.9|3.9111111111111105|
## |  a|   3| -1.0|              11.72|  1.0000000000000002|11.719999999999999|
## |  a|   4|  2.4|              21.66|  1.0000000000000002|21.659999999999997|
## |  a|   5| 99.0| 40.480000000000004|  1.0000000000000002|             40.48|
## |  a|   6|  3.0|              21.04|  1.0000000000000002|21.039999999999996|
## |  a|   7| -1.0|               10.1|  0.9000000000000001| 11.22222222222222|
## |  a|   8|  0.0|0.10000000000000003|  0.7000000000000001|0.1428571428571429|
## +---+----+-----+-------------------+--------------------+------------------+
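
Another way to express the whole computation, assuming Spark 3.0+ and groups that fit comfortably in memory on a single executor, is a grouped pandas UDF that delegates to pandas' rolling machinery. This is a sketch rather than a drop-in replacement: a centered rolling window yields null (instead of a partial, un-normalized sum) at the edges of each group:

import pandas as pd
from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, DoubleType
)

weights = [0.1, 0.2, 0.4, 0.2, 0.1]

schema = StructType([
    StructField("id", StringType()),
    StructField("time", LongType()),
    StructField("value", DoubleType()),
    StructField("avg", DoubleType()),
])

def add_weighted_avg(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("time")
    # Centered 5-row window; rows without a full window get NaN,
    # which Spark reads back as null.
    pdf["avg"] = (
        pdf["value"]
        .rolling(window=len(weights), center=True)
        .apply(lambda x: (x * weights).sum(), raw=True)
    )
    return pdf

df.groupBy("id").applyInPandas(add_weighted_avg, schema=schema).show()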
