Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
119 views
in Technique[技术] by (71.8m points)

sql - Window Functions or Common Table Expressions: count previous rows within range

I'd like to use a window function to determine, for each row, the total number of preceding records meeting a certain criteria.

A specific example:

clone=# d test
              Table "pg_temp_2.test"
 Column |            Type             | Modifiers 
--------+-----------------------------+-----------
 id     | bigint                      | 
 date   | timestamp without time zone | 

I'd like to know for each date the count of rows within '1 hour previous' to that date.

Can I do this with a window function? Or do I need to investigate CTE's?

I really want to be able to write something like (not working):

SELECT id, date, count(*) OVER (HAVING previous_rows.date >= (date - '1 hour'::interval))
FROM test;

I can write this by joining test against itself, as below - but this won't scale with particularly large tables.

SELECT a.id, a.date, count(b.*)-1 
FROM test a, test b 
WHERE (b.date >= a.date - '1 hour'::interval AND b.date < a.date)
GROUP BY 1,2
ORDER BY 2;

Is this something I could do with a recursive query? Or a regular CTE? CTEs aren't something I know a whole lot about just yet. I have a feeling I'm going to very soon. :)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I don't think you can do this cheaply with a plain query, CTEs and window functions - their frame definition is static, but you need a dynamic frame (depending on column values).

Generally, you'll have to define lower and upper bound of your window carefully: The following queries exclude the current row and include the lower border.
There is still a minor difference: the function includes previous peers of the current row, while the correlated subquery excludes them ...

Test case

Using ts instead of reserved word date as column name.

CREATE TABLE test (
  id  bigint
, ts  timestamp
);

ROM - Roman's query

Use CTEs, aggregate timestamps into an array, unnest, count ...
While correct, performance deteriorates drastically with more than a hand full of rows. There are a couple of performance killers here. See below.

ARR - count array elements

I took Roman's query and tried to streamline it a bit:

  • Remove 2nd CTE which is not necessary.
  • Transform 1st CTE into subquery, which is faster.
  • Direct count() instead of re-aggregating into an array and counting with array_length().

But array handling is expensive, and performance still deteriorates badly with more rows.

SELECT id, ts
     , (SELECT count(*)::int - 1
        FROM   unnest(dates) x
        WHERE  x >= sub.ts - interval '1h') AS ct
FROM (
   SELECT id, ts
        , array_agg(ts) OVER(ORDER BY ts) AS dates
   FROM   test
   ) sub;

COR - correlated subquery

You could solve it with a simple correlated subquery. A lot faster, but still ...

SELECT id, ts
     , (SELECT count(*)
        FROM   test t1
        WHERE  t1.ts >= t.ts - interval '1h'
        AND    t1.ts < t.ts) AS ct
FROM   test t
ORDER  BY ts;

FNC - Function

Loop over rows in chronological order with a row_number() in plpgsql function and combine that with a cursor over the same query, spanning the desired time frame. Then we can just subtract row numbers:

CREATE OR REPLACE FUNCTION running_window_ct(_intv interval = '1 hour')
  RETURNS TABLE (id bigint, ts timestamp, ct int)
  LANGUAGE plpgsql AS
$func$
DECLARE
   cur   CURSOR FOR
         SELECT t.ts + _intv AS ts1, row_number() OVER (ORDER BY t.ts) AS rn
         FROM   test t ORDER BY t.ts;
   rec   record;
   rn    int;

BEGIN
   OPEN cur;
   FETCH cur INTO rec;
   ct := -1;  -- init

   FOR id, ts, rn IN
      SELECT t.id, t.ts, row_number() OVER (ORDER BY t.ts)
      FROM   test t ORDER BY t.ts
   LOOP
      IF rec.ts1 >= ts THEN
         ct := ct + 1;
      ELSE
         LOOP
            FETCH cur INTO rec;
            EXIT WHEN rec.ts1 >= ts;
         END LOOP;
         ct := rn - rec.rn;
      END IF;

      RETURN NEXT;
   END LOOP;
END
$func$;

Call with default interval of one hour:

SELECT * FROM running_window_ct();

Or with any interval:

SELECT * FROM running_window_ct('2 hour - 3 second');

db<>fiddle here
Old sqlfiddle

Benchmark

With the table from above I ran a quick benchmark on my old test server: (PostgreSQL 9.1.9 on Debian).

-- TRUNCATE test;
INSERT INTO test
SELECT g, '2013-08-08'::timestamp
         + g * interval '5 min'
         + random() * 300 * interval '1 min' -- halfway realistic values
FROM   generate_series(1, 10000) g;

CREATE INDEX test_ts_idx ON test (ts);
ANALYZE test;  -- temp table needs manual analyze

I varied the bold part for each run and took the best of 5 with EXPLAIN ANALYZE.

100 rows
ROM: 27.656 ms
ARR: 7.834 ms
COR: 5.488 ms
FNC: 1.115 ms

1000 rows
ROM: 2116.029 ms
ARR: 189.679 ms
COR: 65.802 ms
FNC: 8.466 ms

5000 rows
ROM: 51347 ms !!
ARR: 3167 ms
COR: 333 ms
FNC: 42 ms

100000 rows
ROM: DNF
ARR: DNF
COR: 6760 ms
FNC: 828 ms

The function is the clear victor. It is fastest by an order of magnitude and scales best.
Array handling cannot compete.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...