Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
470 views
in Technique[技术] by (71.8m points)

data modeling - Aggregation in Cassandra across partitions

I have a Data model like below,

CREATE TABLE appstat.nodedata (
    nodeip text,
    timestamp timestamp,
    flashmode text,
    physicalusage int,
    readbw int,
    readiops int,
    totalcapacity int,
    writebw int,
    writeiops int,
    writelatency int,
    PRIMARY KEY (nodeip, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)

where, nodeip - primary key and timestamp - clustering key (Sorted by descinding oder to get the latest),

Sample data in this table,

SELECT * from nodedata WHERE nodeip = '172.30.56.60' LIMIT 2;

 nodeip       | timestamp                       | flashmode | physicalusage | readbw | readiops | totalcapacity | writebw | writeiops | writelatency
--------------+---------------------------------+-----------+---------------+--------+----------+---------------+---------+-----------+--------------
 172.30.56.60 | 2017-12-08 06:13:07.161000+0000 |       yes |            34 |     57 |       19 |            27 |       8 |        89 |           57
 172.30.56.60 | 2017-12-08 06:12:07.161000+0000 |       yes |            70 |      6 |       43 |            88 |      79 |        83 |           89

This is properly available and whenever I need to get the statistics I am able to get the data using the partition key like below,

SELECT nodeip,readbw,timestamp FROM nodedata WHERE nodeip = '172.30.56.60' AND timestamp < 1512652272989 AND timestamp > 1512537899000;

Also successfully aggregate the data like below,

SELECT sum(readbw) FROM nodedata WHERE nodeip = '172.30.56.60' AND timestamp < 1512652272989 AND timestamp > 1512537899000;

Now comes the next use case, Where I need to get the cluster data (All the data's of the four nodes),

Like below,

SELECT nodeip,readbw,timestamp FROM nodedata WHERE nodeip IN ('172.30.56.60','172.30.56.61','172.30.56.62','172.30.56.63') AND timestamp < 1512652272989 AND timestamp > 1512537899000;

But It clearly mentioned in number of sites that, 'IN query' has lots of performance hiccups, So what is your suggestion in this Data Model of 'nodedata' table mentioned above? (NOTE: Doing Multiple queries in different partitions are okay which I feel like a last option)

Do you have a better approach (or) re-designing this data model in a better way (or) Any better solution to retrieve the data from multiple partitions?

Any help would be really appreciable.

Thanks,
Harry

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Yes, the use of IN on the partition key is discouraged as it put more load on coordinating node, especially if many partitions will be specified in IN clause. Multiple separate requests done async, for example, could even be more performant, and make less load on coordinating nodes.

Also, you need into account the size of your partitions - from quick look to schema, I see that every partition will grow to ~55Mb in one year if you're doing sampling every minute. Having too wide partitions could also lead to some performance problems (although not always, depends on the use case). Maybe you'll need to change partition key to include year, or year+month to make smaller partitions. But in this case, some additional logic should be added to your code when you retrieve data that span several years/months.

P.S. Maybe this is not a fully answering to your question, but commentary field is too small for it :-)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...