
How to run a Java parallel algorithm on Google Dataproc cluster?

I have a simple Java parallel algorithm implemented using Spark, but I am not sure how to run it on a Google Dataproc cluster. I found a lot of resources online that use Python or Scala, but not enough for Java. Here is the code:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Prime {

    List<Integer> primes = new ArrayList<>();

    //Method to calculate and count the prime numbers
    public void countPrime(int n){
        for (int i = 2; i < n; i++){
            boolean  isPrime = true;

            //check if the number is prime or not
            for (int j = 2; j < i; j++){
                if (i % j == 0){
                    isPrime = false;
                    break;  // exit the inner for loop
                }
            }

            //add the primes into the List
            if (isPrime){
                primes.add(i);
            }
        }
    }

    //Main method to run the program
    public static void main(String[]args){
        //creating javaSparkContext object
        SparkConf conf = new SparkConf().setAppName("haha").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        //new prime object
        Prime prime = new Prime();
        prime.countPrime(100000);

        //parallelize the collection
        JavaRDD<Integer> rdd = sc.parallelize(prime.primes, 4);
        long count = rdd.filter(e -> e == 2 || e % 2 != 0).count();
        System.out.println("Prime count: " + count);
        sc.stop();
    }
}
Question from: https://stackoverflow.com/questions/65601192/how-to-run-a-java-parallel-algorithm-on-google-dataproc-cluster


1 Reply


If you have a jarfile that already specifies "Prime" as the main class, then at a basic level it's as simple as:

gcloud dataproc jobs submit spark --cluster ${CLUSTER_NAME} --jar prime-jarfile.jar

If you have a jarfile that doesn't specify the main class, you can submit the jarfile with "--jars" (with an 's' at the end) and specify the "--class" instead:

gcloud dataproc jobs submit spark --cluster ${CLUSTER_NAME} --jars prime-jarfile.jar --class Prime
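Depending on your gcloud version, you may also need to pass the cluster's region explicitly; a hedged example, assuming a REGION variable of your own:

gcloud dataproc jobs submit spark --cluster=${CLUSTER_NAME} --region=${REGION} --jars=prime-jarfile.jar --class=Prime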

Note, however, that since you specify setMaster("local"), that setting overrides the cluster's own Spark environment settings, and the job will only run using threads on the master node. You simply need to remove .setMaster("local") entirely, and the job will automatically pick up the YARN configuration within the Dataproc cluster and actually run on multiple worker nodes.
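For example, the relevant lines in main then become (a minimal sketch; the app name is the arbitrary one from your code):

// No setMaster(...): spark-submit and the cluster's YARN config supply the master.
SparkConf conf = new SparkConf().setAppName("haha");
JavaSparkContext sc = new JavaSparkContext(conf);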

Also, I realize this is just a getting-started exercise, so it probably doesn't matter, but you almost certainly won't see any "speedup" in real distributed mode, because:

  1. The computation that uses Spark is too "cheap" compared even to the time it takes to load an integer.
  2. The number of elements being processed is too small compared to the overhead of starting remote execution.
  3. The number of partitions (4) is probably too small for dynamic executor allocation to kick in, so the partitions may just end up running mostly one after another.

So you might see more "interesting" results if, for example, each number you parallelize represents a large "range" for a worker to check; for example, if the number 0 means "count primes between 0 and 1,000,000", 1 means "count primes between 1,000,000 and 2,000,000", etc. Then you might have something like:

// Assume rdd simply parallelizes the numbers 0 through 999 inclusive,
// with somewhere between 100 and 1000 "slices" (partitions).
JavaRDD<Integer> countsPerRange = rdd.map(e -> countPrimesInRange(e * 1000000, (e + 1) * 1000000));
int totalCount = countsPerRange.reduce((a, b) -> a + b);
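Putting that together, a self-contained sketch might look like the following. The countPrimesInRange helper is the one named above; its trial-division body, the PrimeRanges class name, the app name, and the 200-slice partition count are all illustrative assumptions, not something prescribed by Spark or Dataproc.

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PrimeRanges {

    // Hypothetical helper: counts primes in [start, end) by trial division.
    static int countPrimesInRange(int start, int end) {
        int count = 0;
        for (int n = Math.max(start, 2); n < end; n++) {
            boolean isPrime = true;
            for (int d = 2; (long) d * d <= n; d++) {
                if (n % d == 0) {
                    isPrime = false;
                    break;
                }
            }
            if (isPrime) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // No setMaster(...): on Dataproc, the YARN master comes from the cluster config.
        SparkConf conf = new SparkConf().setAppName("prime-ranges");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // One element per million-wide range: 0 => [0, 1,000,000), 1 => [1,000,000, 2,000,000), ...
        List<Integer> ranges = IntStream.range(0, 1000).boxed().collect(Collectors.toList());
        JavaRDD<Integer> rdd = sc.parallelize(ranges, 200); // 200 slices is an arbitrary choice

        int totalCount = rdd
                .map(e -> countPrimesInRange(e * 1000000, (e + 1) * 1000000))
                .reduce((a, b) -> a + b);

        System.out.println("Total primes below 1,000,000,000: " + totalCount);
        sc.stop();
    }
}

With 1,000 range elements, each task performs millions of trial divisions, so the per-task work finally dwarfs the scheduling and serialization overhead that made the original version look sequential.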
