caching - How to query PostgreSQL real database, not its cache?

Question

Welcome To Ask or Share your Answers For Others

caching - How to query PostgreSQL real database, not its cache?

posted Oct 6, 2021 in Technique[技术] by 深蓝 (71.8m points)

caching - How to query PostgreSQL real database, not its cache?

I have an app written in Go (I don't think the programming language is relevant though...) that should act as a scheduler. It connects to a PostgreSQL database and calculates at what time is the next run for some tasks. The app sleeps until it's time to run the next task and then runs it (it actually publishes a message in a queue and another app is in charge with running the task).

In order to reduce the risk of not running a scheduled task at a given time, this app should be run on multiple nodes and only one node should publish the message to run the task.

The table in the DB has these two columns:

id serial
next_run int4

When the app starts it performs this query:

SELECT * FROM task;

After this it calculates all the times at which the tasks should be run. When it's time to run the first task, all nodes calculate the following next run and they perform this query:

UPDATE task SET next_run = ${new_next_run} WHERE id = ${id} AND next_run = ${current_next_run};

If the number of affected rows is 1, it means the node was the first one to reach the database and it is the one that will publish the message to run the task. In theory this should work and I even created a small prototype to test this produces the expected result.

In practice, there is a scenario in which this logic fails. When the app starts, it checks for all tasks and calculates their next runs. If the next task should run in less than 3 seconds (approx.), when the update query is performed, it doesn't check the "where" clause against the real database, but against what was saved in the cache. Or at least this is what I suppose happens...

If the first task to run is in more than 3 seconds, the tests pass:

T: node 1 -> rows affected = 1; publish message
T+100 milliseconds: node 2 -> rows affected = 0; don't publish message
T+130 milliseconds: node 3 -> rows affected = 0; don't publish message
..........
T+300 milliseconds: node n -> rows affected = 0; don't publish message

If the first task to run is in less than 3 seconds, the tests fail and my log looks like this:

T: node 1 -> rows affected = 1; publish message
T+100 milliseconds: node 2 -> rows affected = 1; publish message
T+130 milliseconds: node 3 -> rows affected = 0; don't publish message
T+180 milliseconds: node 4 -> rows affected = 1; publish message
T+200 milliseconds: node 5 -> rows affected = 1; publish message
T+220 milliseconds: node 6 -> rows affected = 0; don't publish message
..........
T+300 milliseconds: node n -> rows affected = 0; don't publish message

The only explanation making sense is that the UPDATE is performed using the cache, not the real database and that's why it gives unpredictable results. But I know in PostgreSQL there's no way to flush the cache using a query, so I come back to the question in the title: How to query PostgreSQL real database, not its cache?

Later edit:

The pseudocode:

start program
declare global tasks variable

read all tasks from DB
    SELECT * FROM task
for each task in tasks
    calculate next_run and save it in the global tasks variable
    if calculated next_run != next_run from database then
        update next_run in database
            UPDATE task SET next_run = ${next_run} WHERE id = ${id}
    endif
endfor

for true
    sort tasks by next_run asc
    extract first task from tasks
    if next_run > now
        sleep (next_run - now) seconds
    endif

    read a fresh instance of task from DB
        SELECT * FROM task WHERE id = ${id}
    calculate subsequent next_run
    update subsequent next_run in tasks
    update subsequent next_run in DB
        BEGIN;
        SELECT id FROM task WHERE id = ${id} AND next_run = ${current_next_run} FOR UPDATE;
        UPDATE task SET next_run = ${subsequent_next_run} WHERE id = ${id};
        COMMIT;
    if transaction succeeded then // rows in result set and rows affected
        run the task
    endif
endfor

question from:https://stackoverflow.com/questions/66051975/how-to-query-postgresql-real-database-not-its-cache

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-06T03:10:15+0000

All data modifications are performed only in the cache (shared buffers). Later, dirty pages from the cache are persisted to disk. The upshot is that the cache always contains the latest data, it can never be "stale".

So you have to look for another explanation, perhaps in your code.

Looking at your code, I see two oddities:

In the beginning, before the endless loop, you sometimes update next_run of task. Couldn't that lead to strangeness if either other threads have a different idea of what calculated next_run is or if that first update interferes with the processing in the endless loop of another thread?
In the database transaction in the endless loop, you use WHERE next_run = ${current_next_run} in the SELECT ... FOR UPDATE, but the condition is missing from the actual UPDATE, so the latter could modify a row even if its next_run has been modified by a concurrent UPDATE.

By the way, there is no need to do something like

BEGIN;
SELECT ... WHERE xy FOR UPDATE;
UPDATE ... WHERE xy;
COMMIT;

The UPDATE alone would do the same thing, since an UPDATE is always atomic in PostgreSQL: the value read is guaranteed to be the most recent visible version at the time of the data modification.

Categories

caching - How to query PostgreSQL real database, not its cache?

caching - How to query PostgreSQL real database, not its cache?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags