Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
212 views
in Technique[技术] by (71.8m points)

How to ensure no data loss for kafka data ingestion through Spark Structured Streaming?

I have a long running spark structured streaming job which is ingesting kafka data. I have one concern as below. If the job is failed due to some reason and restart later, how to ensure kafka data will be ingested from the breaking point instead of always ingesting current and later data when the job is restarting. Do I need to specifiy explicitly something like consumer group and auto.offet.reset, etc? Are they supported in spark kafka ingestion? Thanks!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

According to the Spark Structured Integration Guide, Spark itself is keeping track of the offsets and there are no offsets committed back to Kafka. That means if your Spark Streaming job fails and you restart it all necessary information on the offsets is stored in Spark's checkpointing files. That way your application will know where it left off and continue to process the remaining data.

I have written more details about setting group.id and Spark's checkpointing of offsets in another post

Here are the most important Kafka specific configurations for your Spark Structured Streaming jobs:

group.id:?Kafka source will create a unique group id for each query automatically. According to the?code?the group.id will automatically be set to

val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}

auto.offset.reset:?Set the source option startingOffsets to specify where to start instead.?Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it

enable.auto.commit:?Kafka source doesn’t commit any offset.

Therefore, in Structured Streaming it is currently not possible to define your custom group.id for Kafka Consumer and Structured Streaming is managing the offsets internally and not committing back to Kafka (also not automatically).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...