java - Apache Spark taking 5 to 6 minutes for simple count of 1 billion rows from Cassandra

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I am using the Spark Cassandra connector. A simple count takes 5 to 6 minutes to fetch the data from a Cassandra table. In the Spark log I see a very large number of tasks and executors, so the reason might be that Spark splits the job into too many tasks.

Below is my code example:

public static void main(String[] args) {

    SparkConf conf = new SparkConf(true).setMaster("local[4]")
            .setAppName("App_Name")
            .set("spark.cassandra.connection.host", "127.0.0.1");

    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<Demo_Bean> empRDD = javaFunctions(sc).cassandraTable("dev",
            "demo");
    System.out.println("Row Count: " + empRDD.count());
}

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:52:35+0000

After searching on Google, I found the issue in the latest spark-cassandra-connector. The parameter spark.cassandra.input.split.size_in_mb has a default value of 64 MB, but the code interprets it as 64 bytes, so the table is cut into far too many tiny splits. As a workaround, pass the value in bytes: spark.cassandra.input.split.size_in_mb = 64 * 1024 * 1024 = 67108864.

Here is an example:

public static void main(String[] args) {

    SparkConf conf = new SparkConf(true).setMaster("local[4]")
            .setAppName("App_Name")
            .set("spark.cassandra.connection.host", "127.0.0.1")
            // Workaround: 64 * 1024 * 1024, because the value is read as bytes
            .set("spark.cassandra.input.split.size_in_mb", "67108864");

    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<Demo_Bean> empRDD = javaFunctions(sc).cassandraTable("dev",
            "demo");
    System.out.println("Row Count: " + empRDD.count());
}
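
To see why the unit mix-up matters, here is a minimal sketch of the arithmetic only. It does not call Spark or the connector. The class name SplitSizeSketch, both helper methods, and the 10 GiB table size are hypothetical, and the ceiling division is a rough stand-in for however the connector really computes its splits.

```java
public class SplitSizeSketch {
    static final long BYTES_PER_MB = 1024L * 1024L;

    // Value to pass for spark.cassandra.input.split.size_in_mb when the
    // number is read as bytes (the behavior described in the answer above).
    static String workaroundValue(long intendedMegabytes) {
        return Long.toString(intendedMegabytes * BYTES_PER_MB);
    }

    // Rough split count: table size divided by split size, rounded up.
    // Illustrative only; this is an assumption, not the connector's algorithm.
    static long estimatedSplits(long tableBytes, long splitBytes) {
        return (tableBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long tableBytes = 10L * 1024L * BYTES_PER_MB; // hypothetical 10 GiB table

        System.out.println(workaroundValue(64));                            // 67108864
        System.out.println(estimatedSplits(tableBytes, 64));                // 167772160
        System.out.println(estimatedSplits(tableBytes, 64 * BYTES_PER_MB)); // 160
    }
}
```

Under these assumptions, a 64-byte split size would produce millions of tiny tasks where a 64 MB split size produces a few hundred, which would match the large number of tasks seen in the log.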
