If you have a jarfile that specifies "Prime" as the main-class already, then at a basic level it's as simple as:
gcloud dataproc jobs submit spark --cluster ${CLUSTER_NAME} --jar prime-jarfile.jar
If you have a jarfile that doesn't specify the main class, you can pass the jarfile via "--jars" (with an 's' at the end) and specify the "--class" instead:
gcloud dataproc jobs submit spark --cluster ${CLUSTER_NAME} --jars prime-jarfile.jar --class Prime
Note, however, that since you specify setMaster("local"), that setting overrides the cluster's own Spark environment settings, and the job will only run using threads on the master node. Simply remove the .setMaster("local") call entirely and the job will automatically pick up the YARN configuration inside the Dataproc cluster and actually run across multiple worker nodes.
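For concreteness, here is a minimal sketch of what the driver setup might look like once the hard-coded master is gone; the class layout and app name are assumptions for illustration, not your actual code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class Prime {
  public static void main(String[] args) {
    // A hard-coded .setMaster("local") would override the cluster settings and
    // run everything in threads on the master node, so it is left out here;
    // the job then inherits the YARN master that Dataproc configures in
    // spark-defaults.conf.
    SparkConf conf = new SparkConf().setAppName("Prime");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // ... build RDDs and run the computation as before ...

    sc.stop();
  }
}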
Also, I realize this is just a getting-started exercise so it probably doesn't matter, but you almost certainly won't see any "speedup" in real distributed mode because:
- The computation that uses Spark is too "cheap" compared even to the time it takes to load an integer.
- The number of elements being processed is too small compared to the overhead of starting remote execution.
- The number of partitions (4) is probably too small for dynamic executor allocation to kick in, so the partitions might just end up running mostly one after another.
So you might see more "interesting" results if, for example, each number you parallelize represents a large "range" for a worker to check; say, the number "0" means "count primes between 0 and 1,000,000", "1" means "count primes between 1,000,000 and 2,000,000", etc. Then you might have something like:
// Start with an rdd that just parallelizes the numbers 0 through 999 inclusive, with something like 100 to 1000 "slices".
JavaRDD<Integer> countsPerRange = rdd.map(e -> countPrimesInRange(e*1000000, (e+1)*1000000));
int totalCount = countsPerRange.reduce((a, b) -> a + b);
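Putting that together, a self-contained sketch of the whole job might look roughly like the following; countPrimesInRange here is a hypothetical trial-division helper I'm adding for illustration, not something from your original code, and the app name and slice count are likewise just placeholders:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Prime {
  // Hypothetical helper: counts primes in [start, end) by simple trial division.
  static int countPrimesInRange(int start, int end) {
    int count = 0;
    for (int n = Math.max(start, 2); n < end; n++) {
      boolean isPrime = true;
      for (int d = 2; (long) d * d <= n; d++) {
        if (n % d == 0) {
          isPrime = false;
          break;
        }
      }
      if (isPrime) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Same setup as above: no setMaster, so the Dataproc cluster's YARN
    // configuration is picked up automatically.
    SparkConf conf = new SparkConf().setAppName("Prime");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Each of the numbers 0..999 stands for a range of one million integers.
    List<Integer> rangeIds = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      rangeIds.add(i);
    }
    JavaRDD<Integer> rdd = sc.parallelize(rangeIds, 200);  // a few hundred slices

    JavaRDD<Integer> countsPerRange =
        rdd.map(e -> countPrimesInRange(e * 1000000, (e + 1) * 1000000));
    int totalCount = countsPerRange.reduce((a, b) -> a + b);
    System.out.println("Primes below 1,000,000,000: " + totalCount);

    sc.stop();
  }
}

That way each task does on the order of a million primality checks, so the per-task compute comfortably dominates the scheduling and serialization overhead.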