hadoop - What is Lineage In Spark?

Question

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:58:59+0000

Everything to understand about lineage is in the definition of RDD.

So let's review that:

An RDD is an immutable, distributed collection of elements of your data that can be stored in memory or on disk across a cluster of machines. The data is partitioned across the machines in your cluster, and the partitions can be operated on in parallel through a low-level API that offers transformations and actions. RDDs are fault tolerant because they track data lineage information, which lets Spark rebuild lost data automatically on failure.
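To make the idea of lineage concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API or implementation: the `ToyRDD` class and its methods are hypothetical names made up for illustration. Each object remembers only its parent and the function that derives it, so the data can always be rebuilt by replaying that chain. In Spark itself, `RDD.toDebugString` prints the lineage of a real RDD.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it stores how to compute data, not the data."""

    def __init__(self, parent=None, fn=None, source=None, name="source"):
        self.parent = parent    # the dataset this one was derived from
        self.fn = fn            # the transformation applied to the parent's data
        self.source = source    # only set on the root of the lineage
        self.name = name

    def map(self, f):
        # A transformation returns a new object; nothing is computed yet.
        return ToyRDD(parent=self, fn=lambda data: [f(x) for x in data], name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda data: [x for x in data if pred(x)], name="filter")

    def lineage(self):
        # Walk back through the parents to list every step up to the source.
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def collect(self):
        # An action replays the whole lineage starting from the source.
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.collect())


numbers = ToyRDD(source=range(1, 6))
result = numbers.map(lambda x: x * 10).filter(lambda x: x > 20)

lineage = result.lineage()   # ['filter', 'map', 'source']
first = result.collect()     # [30, 40, 50]
second = result.collect()    # "lost" data is rebuilt by replaying the lineage
```

Because every step is immutable and recorded, losing a result costs nothing but the time needed to recompute it.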

So there are mainly two things to understand:

Unfortunately, these topics are too long to cover in a single answer. I recommend you take some time to read about them, along with the following article about data lineage.

And now to answer your question and doubts:

If an executor fails 15 minutes into computing your data, Spark will go back to your last checkpoint, whether that is the original source or data cached in memory and/or on disk, and recompute from there.

Thus, it will not save you those 15 minutes that you have mentioned!
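The effect of a cache or checkpoint on recovery time can be sketched with a small plain-Python model. This is an illustration under stated assumptions, not Spark code: `recompute_cost`, the stage names, and the split of the 15 minutes into 5, 7, and 3 are all hypothetical, and the model assumes the cached data itself survived the failure.

```python
def recompute_cost(stages, materialized=frozenset()):
    """Minutes needed to rebuild the final stage's output after it is lost.

    Walks the lineage backwards from the last stage and stops at the nearest
    stage whose output is still available (cached or checkpointed).
    """
    cost = 0
    for name, minutes in reversed(stages):
        if name in materialized:
            break           # this stage's output can be reused as-is
        cost += minutes     # this stage has to be run again
    return cost


# Hypothetical pipeline whose stages add up to the 15 minutes in the question.
stages = [("read", 5), ("join", 7), ("aggregate", 3)]

no_cache = recompute_cost(stages)               # everything is replayed from the source
with_cache = recompute_cost(stages, {"join"})   # only the work after the cached stage is replayed
```

With nothing cached, the whole 15 minutes is spent again. Persisting the output of an expensive intermediate stage is what shortens the recovery.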
