scala - Does a Spark Dataset method serialize the computation itself?

I have a Dataset with multiple columns. A function needs to be invoked on each row to compute a result from the data within that row, so I used a case class with a method and created a Dataset from it. For example,

case class testCase(x: Double, a1: Array[Double], a2: Array[Double]) {
    var someInt = 0
    def myMethod1(): Unit = {...}    // uses x, a1 and a2
    def myMethod2(): Unit = {...}    // uses x, a1 and a2
    def result(): Int = someInt      // return the computed value
}

It is called from main() as:

val res = myDS.map(_.result()).toDF("result")

The problem I am facing is that, while the code produces correct results, this statement does not run concurrently, unlike the other parts of the program. Irrespective of the number of executors, cores, and repartitioning, only one instance of the method seems to run at a time!
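For reference, here is a quick sketch of how I sanity-check the parallelism (myDS is the Dataset shown above):

    // Quick check: if the Dataset has only one partition, the map will
    // effectively run serially regardless of executors and cores.
    println(s"partitions = ${myDS.rdd.getNumPartitions}")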

Any hints as to what I should look at would be appreciated.


1 Reply


The testCase case class should not be mutable: if you concurrently modify the state of an object, your program will be non-deterministic because of that. With the little information you have given, what looks wrong is this snippet:

var someInt = 0

That value is probably being modified concurrently by several tasks, and I am pretty sure you don't want that.

Can you explain what you are trying to do?
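
In the meantime, here is a minimal sketch of an immutable alternative, assuming the result can be computed purely from the row's fields (the body below is just a placeholder for your real computation):

    case class testCase(x: Double, a1: Array[Double], a2: Array[Double]) {
      // No mutable state: the result is derived from the fields alone,
      // so each Spark task can evaluate it independently and safely.
      def result(): Int =
        (x * (a1.sum + a2.sum)).toInt   // placeholder for the real logic
    }

    val res = myDS.map(_.result()).toDF("result")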


