Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share



ClickHouse log shows hash of uncompressed files doesn't match

ClickHouse logs printed the error messages as below frequently:

2021.01.07 00:55:24.112567 [ 6418 ] {} <Error> vms.analysis_data (7056dab3-3677-455b-a07a-4d16904479b4): 
Code: 40, e.displayText() = DB::Exception: Checksums of parts don't match: 
hash of uncompressed files doesn't match (version 20.11.4.13 (official build)). 
Data after merge is not byte-identical to data on another replicas. There could be several reasons: 
1. Using newer version of compression library after server update. 
2. Using another compression method. 
3. Non-deterministic compression algorithm (highly unlikely). 
4. Non-deterministic merge algorithm due to logical error in code. 
5. Data corruption in memory due to bug in code. 
6. Data corruption in memory due to hardware issue. 
7. Manual modification of source data after server startup. 
8. Manual modification of checksums stored in ZooKeeper. 
9. Part format related settings like 'enable_mixed_granularity_parts' are different on different replicas. 
We will download merged part from replica to force byte-identical result.

We use the same version (20.11.4.13) and the same compression method (LZ4) on all data nodes in the production environment, and we have not modified the data files or the values stored in ZooKeeper.
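For reference, one way to double-check that every replica really runs the same build is to query `version()` across the cluster (the cluster name `our_cluster` below is a placeholder for whatever is defined in `remote_servers`):

```sql
-- Report the ClickHouse build running on every replica of the cluster.
SELECT hostName(), version()
FROM clusterAllReplicas('our_cluster', system, one);
```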

So my questions are:

  • What causes this error? More specifically, in which cases does the ClickHouse server throw these exceptions?
  • Is there a checksum-checking mechanism among the replicas while parts are being merged?
  • I also found that on one of our data nodes there are many folders named like "ignored_20201208_23116_23116_0" in the detached folder. Are these files the corrupted data caused by the problem above?

Thanks.

question from:https://stackoverflow.com/questions/65623292/clickhouse-log-shows-hash-of-uncompressed-files-doesnt-match


1 Reply


You need to upgrade all nodes to 20.11.6.6 as soon as possible.

The reason for these errors is a serious bug related to AIO.


The ignored_ folders are not related to this bug. You can safely remove them.

Inactive parts are not deleted immediately, because fsync is not called when a new part is written, i.e. for some time the new (merged) part exists only in the server's RAM (OS page cache). So if the server reboots unexpectedly, the new merged part can be lost or damaged. During startup, ClickHouse checks the integrity of parts; if it detects a problem, it returns the inactive parts to the active list and merges them again later. In that case the broken part is renamed (the prefix broken_ is added) and moved to the detached folder. If the integrity check finds no problem with the merged part, the original inactive parts are renamed (the prefix ignored_ is added) and moved to the detached folder.
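To see what is sitting in the detached folder (and why) before deleting anything, you can query `system.detached_parts`; a specific detached part can then be dropped with `ALTER TABLE ... DROP DETACHED PART`. The table and part names below are taken from the question and are examples only:

```sql
-- List detached parts and the reason they were detached (ignored_, broken_, ...).
SELECT database, table, reason, name
FROM system.detached_parts
WHERE database = 'vms' AND table = 'analysis_data';

-- Drop one ignored_ part once you have confirmed it is safe to remove.
-- The allow_drop_detached setting must be enabled for this statement.
ALTER TABLE vms.analysis_data
    DROP DETACHED PART 'ignored_20201208_23116_23116_0'
    SETTINGS allow_drop_detached = 1;
```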


