Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
529 views
in Technique[技术] by (71.8m points)

streaming - Explaining The Count Sketch Algorithm

Can someone explain how the Count Sketch Algorithm works? I still can't figure out how hashes are used, for example. I have a hard time understanding this paper.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This streaming algorithm instantiates the following framework.

  1. Find a randomized streaming algorithm whose output (as a random variable) has the desired expectation but usually high variance (i.e., noise).

  2. To reduce the variance/noise, run many independent copies in parallel and combine their outputs.

Usually 1 is more interesting than 2. This algorithm's 2 actually is somewhat nonstandard, but I'm going to talk about 1 only.

Suppose we're processing the input

a b c a b a .

With three counters, there's no need to hash.

a: 3, b: 2, c: 1

Let's suppose however that we have just one. There are eight possible functions h : {a, b, c} -> {+1, -1}. Here is a table of the outcomes.

 h  |
abc |  X = counter
----+--------------
+++ | +3 +2 +1 =  6
++- | +3 +2 -1 =  4
+-- | +3 -2 -1 =  0
+-+ | +3 -2 +1 =  2
--+ | -3 -2 +1 = -4
--- | -3 -2 -1 = -6
-+- | -3 +2 -1 = -2
-++ | -3 +2 +1 =  0

Now we can calculate expectations

            (6 + 4 + 0 + 2) - (-4 + -6 + -2 + 0)
E[h(a) X] = ------------------------------------ = 24/8 = 3
                             8

            (6 + 4 + -2 + 0) - (0 + 2 + -4 + -6)
E[h(b) X] = ------------------------------------ = 16/8 = 2
                             8

            (6 + 2 + -4 + 0) - (4 + 0 + -6 + -2)
E[h(c) X] = ------------------------------------ =  8/8 = 1 .
                             8

What's going on here? For a, say, we can decompose X = Y + Z, where Y is the change in the sum for the as, and Z is the sum for the non-as. By the linearity of expectation, we have

E[h(a) X] = E[h(a) Y] + E[h(a) Z] .

E[h(a) Y] is a sum with a term for each occurrence of a that is h(a)^2 = 1, so E[h(a) Y] is the number of occurrences of a. The other term E[h(a) Z] is zero; even given h(a), each other hash value is equally likely to be plus or minus one and so contributes zero in expectation.

In fact, the hash function doesn't need to be uniform random, and good thing: there would be no way to store it. It suffices for the hash function to be pairwise independent (any two particular hash values are independent). For our simple example, a random choice of the following four functions suffices.

abc

+++
+--
-+-
--+

I'll leave the new calculations to you.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...