Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
497 views
in Technique[技术] by (71.8m points)

hadoop - What is Lineage In Spark?

How lineage helps to recompute data?

For example, I'm having several nodes computing data for 30 minutes each. If one fails after 15 minutes, can we recompute data processed in 15 minutes again using lineage without giving 15 minutes again?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Everything to understand about lineage is in the definition of RDD.

So let's review that :

RDDs are immutable distributed collection of elements of your data that can be stored in memory or disk across a cluster of machines. The data is partitioned across machines in your cluster that can be operated in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant as they track data lineage information to rebuild lost data automatically on failure

So there is mainly 2 things to understand:

Unfortunately, these topics are quite long to discuss in a single answer. I recommend you take some time reading them along with this following article about Data Lineage.

And now to answer your question and doubts:

If an executor fails computing your data, after 15 minutes, it will go back to your last checkpoint, whether it's from the source or cache in memory and/or on disk.

Thus, it will not save you those 15 minutes that you have mentioned!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...