Things to know when you are dealing with Apache Spark

Apache Spark is an interesting big data framework with which we can do amazing things when dealing with big data. In the journey of learning Spark, many developers struggle to decide when to use plain Java and when to use the Spark framework. During development there were plenty of scenarios where it was more efficient/effective to use plain Java instead of Spark.

Spark is needed where there are complex transformations on a large dataset. A simple case where only lifting and shifting of data is required can be achieved with other technologies as well. Let's review certain pointers which I have learned during Spark application development.

Packaging a Spark application using Java

Using the Maven Shade plugin is not encouraged, unless there is no other option apart from using it; there are informative write-ups explaining why the Shade plugin should be avoided.
From my perspective, for running a Spark application the Spring Boot Maven plugin is the best option for managing everything, including library dependencies, even if your application is used as a scheduled job where you need spark-submit or Spark Launcher to launch it. In the other case, where your Java app uses both Spark and Spring and exposes a controller/API to drive the application, the Spring Boot Maven plugin is also the best choice.
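For reference, a minimal Spring Boot Maven plugin configuration in pom.xml could look like the sketch below (it assumes the plugin version is inherited from the Spring Boot parent; adjust to your setup):

<build>
    <plugins>
        <plugin>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-maven-plugin</artifactId>
            <executions>
                <execution>
                    <goals>
                        <!-- repackage turns the plain jar into the executable fat/uber jar -->
                        <goal>repackage</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>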
There are two main challenges when we use spark-submit or Spark Launcher to launch an application packaged with the Spring Boot Maven plugin.

1. Main Class

When we package a fat jar/uber jar using the Spring Boot Maven plugin, it packages the class files and Java libraries the Spring way and not the way Spark expects. Inside the uber jar generated by the plugin we have BOOT-INF, META-INF and org folders. So when we give the main class to spark-submit or Spark Launcher as a parameter, Spark will not be able to find that class, because the package/path specified in the parameter no longer matches the structure of the jar file. Even if you specify the full location starting with BOOT-INF for the main class, it will not work, because Spring launches the application through a different main class.

Below is the link which describes the main class that should be used for launching the fat jar generated by the Spring Boot Maven plugin.

https://docs.spring.io/spring-boot/docs/current/reference/html/executable-jar.html#appendix.executable-jar.launching

At a high level, the MANIFEST.MF file inside the uber jar contains the entries below. Main-Class is the actual main class used to initialize the Spring-related machinery, after which your customized main class, given by the Start-Class entry, is started.

Main-Class: org.springframework.boot.loader.JarLauncher
Start-Class: com.mycompany.project.MyApplication

So, as a conclusion, specifying the main class as "org.springframework.boot.loader.JarLauncher" in spark-submit or Spark Launcher resolves this problem. This only works if you are using the Spring Boot Maven plugin for packaging the jar.
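As an illustration, launching such a jar programmatically with Spark's SparkLauncher API could look like the sketch below (the Spark home, jar path, master URL and arguments are placeholders, not values from a real setup):

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class JobTrigger {
    public static void main(String[] args) throws Exception {
        // All paths and the master URL below are placeholders.
        SparkAppHandle handle = new SparkLauncher()
                .setSparkHome("/opt/spark")                          // local Spark installation
                .setAppResource("/path/to/my-spring-boot-app.jar")   // uber jar built by the Spring Boot Maven plugin
                .setMainClass("org.springframework.boot.loader.JarLauncher") // Spring's launcher, not your own main class
                .setMaster("<master-url>")
                .setDeployMode("cluster")
                .addAppArgs("arg1", "arg2")
                .startApplication();

        // The handle can be polled (or given listeners) to track the submission state.
        System.out.println("Submitted, current state: " + handle.getState());
    }
}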

2. External common libraries used in pom.xml + Spark installation

Another issue which might occur while launching an uber jar packaged with the Spring Boot Maven plugin through spark-submit or Spark Launcher is jar conflicts. When we package the jar using the Spring Boot Maven plugin, it copies the dependencies into the BOOT-INF/lib folder. Let's say you are using the dependency below in pom.xml.

<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.10</version>
</dependency>

Now let's say this dependency already exists with a different version in the Spark installation; in that case it will cause class conflicts. Since this turns into a class loader issue, it is better not to use libraries which might create such conflicts, because the application might fail. I have faced this kind of issue with logger and JSON libraries, since there are multiple JSON/logger options available. As a resolution you can exclude those classes or libraries, or replace the library with an alternative one, as sketched below.
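As a sketch of both options in pom.xml (artifact names other than gson are hypothetical): either rely on the version that already ships with the Spark installation by marking the dependency as provided, or exclude the conflicting transitive library from the dependency that drags it in.

<!-- Option 1: use the version shipped with the Spark installation -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.10</version>
    <scope>provided</scope> <!-- not copied into BOOT-INF/lib -->
</dependency>

<!-- Option 2: exclude a conflicting transitive library (hypothetical example) -->
<dependency>
    <groupId>com.example</groupId>
    <artifactId>some-client</artifactId>
    <version>1.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>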



Deployment/Triggering a Spark Application

There are plenty of ways in which we can trigger or deploy an application on a Spark cluster. We will look at two scenarios in which we can use Spark in application development.

Deploy as a Job in a Kubernetes cluster

We can deploy the Spark application as a job in a Kubernetes cluster. If we choose this approach, the main method will be triggered if you are using Java, because that is the entry point of the application. The application will not be running 24*7; we can also schedule it in the Kubernetes cluster so that the job is triggered periodically. In this case we can use the command below to trigger the Spark job.
 
  
  ./bin/spark-submit \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  --driver-memory <value>g \
  --executor-memory <value>g \
  --executor-cores <number of cores> \
  --jars <comma separated dependencies> \
  --class <main-class> \
  <application-jar> \
  [application-arguments]
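For illustration, a submission against a Kubernetes master could look like the sketch below (the API server address, container image, jar path and resource values are placeholders, not values from a real setup):

  ./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name spark-demo-job \
  --class org.springframework.boot.loader.JarLauncher \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<registry>/spark-demo:latest \
  --driver-memory 2g \
  --executor-memory 4g \
  --executor-cores 2 \
  local:///opt/app/spark-demo.jar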






Deploy as REST in a Kubernetes cluster
Another way to trigger the business logic of the Spark application is to have a REST application deployed and running in the cluster. Using that application we will be able to trigger the business logic or the task. In this case we can pass the job parameters as URL parameters to the REST API application. Below is sample code for the same.

 

curl -X POST http://sparkendpoint.com/v1/submissions/create \
  --header "Content-Type:application/json;charset=UTF-8" \
  --data '{
  "appResource": "file:/home/user/spark_pi.py",
  "sparkProperties": {
    "spark.executor.memory": "8g",
    "spark.master": "spark://192.168.1.1:7077",
    "spark.driver.memory": "8g",
    "spark.driver.cores": "2",
    "spark.eventLog.enabled": "false",
    "spark.app.name": "Spark REST API - PI",
    "spark.submit.deployMode": "cluster",
    "spark.driver.supervise": "true"
  },
  "clientSparkVersion": "2.4.0",
  "mainClass": "org.apache.spark.deploy.SparkSubmit",
  "environmentVariables": {
    "SPARK_ENV_LOADED": "1"
  },
  "action": "CreateSubmissionRequest",
  "appArgs": [ "/home/user/spark_pi.py",  "80" ]
}'






