How to ensure no data loss for Kafka data ingestion through Spark Structured Streaming?

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:23:23+0000

According to the Structured Streaming + Kafka Integration Guide, Spark itself keeps track of the offsets and does not commit any offsets back to Kafka. That means that if your Structured Streaming job fails and you restart it, all necessary information on the offsets is available in Spark's checkpoint files, provided the query was started with a checkpoint location. That way your application knows where it left off and continues processing the remaining data.
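As a minimal sketch of what that looks like in practice, the query below reads from Kafka and writes to files with a checkpoint location set. The broker address, topic name, and paths are placeholders I made up for illustration, and the job assumes the spark-sql-kafka-0-10 package is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object KafkaIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-ingest").getOrCreate()

    // Source: broker address and topic name are hypothetical placeholders.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // Sink: checkpointLocation is where Spark records the consumed offsets.
    // Restarting the job with the same location resumes from those offsets.
    val query = input
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")
      .option("path", "/tmp/kafka-ingest/output")                   // placeholder path
      .option("checkpointLocation", "/tmp/kafka-ingest/checkpoint") // placeholder path
      .start()

    query.awaitTermination()
  }
}
```

The checkpoint directory should live on reliable storage (for example HDFS or an object store) and should not be shared between different queries.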

I have written more details about setting group.id and Spark's checkpointing of offsets in another post.

Here are the most important Kafka specific configurations for your Spark Structured Streaming jobs:

group.id: Kafka source will create a unique group id for each query automatically. According to the code, the group.id will automatically be set to

val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"

auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it.

enable.auto.commit: Kafka source doesn't commit any offset.

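Therefore, Structured Streaming manages the offsets internally and does not commit them back to Kafka, not even automatically. Before Spark 3.0 it was not possible to define a custom group.id for the Kafka consumer at all. Later releases document a kafka.group.id source option, but the offsets are still tracked in the checkpoint rather than in Kafka.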
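To illustrate the source options mentioned above, here is a hedged sketch of a reader that sets startingOffsets instead of auto.offset.reset. The object name, broker address, and topic are hypothetical. I also set failOnDataLoss explicitly, which is not discussed above; it is a documented Kafka source option that defaults to true.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaSourceOptions {
  // Builds a streaming DataFrame over a Kafka topic.
  // Broker address and topic name are placeholders for illustration.
  def kafkaSource(spark: SparkSession): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      // Used only when the query starts without an existing checkpoint;
      // a restarted query resumes from the offsets stored in the checkpoint.
      .option("startingOffsets", "earliest")
      // Fail the query if data may have been lost (e.g. offsets out of range)
      // instead of silently skipping ahead.
      .option("failOnDataLoss", "true")
      .load()
}
```

Note that group.id, auto.offset.reset, and enable.auto.commit are deliberately absent: as described above, the Kafka source manages these itself.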
