Setting Up Your Spark/Spark SQL Development Environment using PyCharm - Part 2
Since there are already instructions on how to set up your environment to run Spark using the REPL, I will assume that you have already:
1. Installed PyCharm
2. Installed Scala
3. Installed Spark
4. Set up a development Hadoop cluster, or are using the Hortonworks Sandbox (http://hortonworks.com/sandbox)
5. Successfully run "Hello World" (Word Count) from your laptop against a text file, and via spark-submit on your cluster, preferably from a Python script written in PyCharm and then SFTP'ed to the cluster for submission
At this point, you have a development environment set up. You can create Python scripts in PyCharm, transfer them to the Hadoop cluster, and run them via spark-submit in either local or YARN mode.
But there are a couple of things missing:
1. Intellisense for syntax help and highlighting while developing with the pyspark objects and classes.
2. An integrated way to deploy or copy the code (PySpark scripts) from the development environment to the Hadoop cluster.
3. An integrated way to run the PySpark script on the cluster from PyCharm.
There are a couple of options depending on how you work, so let's set them all up.
Depending on your version of PyCharm, whether you have a PC or a Mac, or other environmental factors, these steps may differ slightly. But the idea can be applied to either environment.
So, let's start with the Intellisense.
Set up Intellisense:
1. Click File -> Settings -> Project:<YourProject> -> Project Interpreter
2. Click the gear icon to the right of the Project Interpreter dropdown
3. Click More... from the context menu
4. Choose the interpreter, then click the "Show Paths" icon (bottom right)
5. Click the + icon to add the following paths:
6. Click OK, OK, OK
Go ahead and test your new intellisense capabilities.
Set up Deployment Server
1. Click Tools -> Deployment -> Configurations...
2. Under the Connections tab:
a. Select SFTP in the Type: dropdown list
b. Enter the host for the Spark Driver program on your Hadoop cluster. Choose an edge node if you have one; otherwise choose a master node. If you only have the Sandbox, then your host choice is easy.
c. Enter 22 in the Port: textbox
d. Enter your home directory or the path where you are going to launch your pyspark scripts from on the cluster in the Root path: textbox
e. Enter your username in the User Name textbox.
f. Enter the password in the Password textbox (if you are using key pair authentication, choose that option instead).
g. Ignore the web server root URL. This functionality was envisioned for copying/deploying files to a web server, but we can leverage it for our purposes.
h. At the top, if you haven't already, enter a name for this connection.
i. Click OK
Test this by doing your first deployment: with a Python script selected, click Tools -> Deployment -> Upload to <ConnectionName>.
Set up WinSCP
1. Install WinSCP
2. Click File -> Settings -> Tools -> External Tools
3. Click the + icon
4. Enter "Deploy to <ConnectionName>" in the Name textbox
5. Enter "C:\Program Files (x86)\WinSCP\WinSCP.com" in the Program textbox (your path may be different). Make sure you use WinSCP.com and not WinSCP.exe.
6. Enter '/command "open sftp://<username>:<password>@<hostURL>/" "put $FilePath$ <target directory>/" "exit"' in the Parameters textbox. Do not enter the surrounding tick marks, and replace the information in angle brackets with your own values.
7. Enter "C:\Program Files (x86)\WinSCP" in the Working directory textbox
8. Click OK, OK
Click Tools -> External Tools (the top one) -> Deploy to <ConnectionName> to test the deployment. You now have two ways to get your code over to the cluster.
Set up Execute PySpark
1. Click File -> Settings -> Tools -> Remote SSH External Tools
2. Click the + icon
3. Enter "Execute PySpark on Cluster" in the Name textbox
4. Click the Deployment server option button under Connection settings and choose the deployment server from the deployment set up.
5. Enter "spark-submit" in the Program textbox
6. Enter "--master yarn $FileName$" in the Parameters textbox (add/change the Spark parameters as you like)
7. Leave $ProjectFileDir$ for the Working directory textbox
8. Click OK, OK
Click Tools -> External Tools (the bottom one) -> Execute PySpark on Cluster
Set up Run Configuration for File
1. First, run the script locally (it will fail)
2. Click Run -> Edit Configurations
3. The name of the script should be selected. If not, select it.
4. Click the Show command line afterwards checkbox
5. Click the + icon under the Before Launch: Activate Tool Window
6. Click Run External Tool.
7. Select the "Deploy to <ConnectionName>" tool.
8. Click OK
9. Click the + icon again
10. This time, Click the Run Remote External Tool option
11. Select the Execute PySpark on Cluster tool
12. Click OK
Click Run -> Run '<file name>' With Coverage. Three windows will display: the Deploy window, the Execute PySpark window, and the local process window. The local process will (should) fail.
Putting it All Together
Now we essentially have two ways we can work, both with their pros and cons.
1. Deployment and Execute Method
In this method we:
1. Click Tools -> Deployment -> Upload to <ConnectionName>
2. Click Tools -> External Tools (the bottom one) -> Execute PySpark on Cluster
Pros - Nice and clean. One click to deploy the code and another step to execute it
Cons - It's two steps
2. Run Method
In this method we:
1. Click Run -> Run '<file name>' With Coverage
OR Run -> Run '<file name>'
OR Shift + F10
Pros - One click deploy and execute. There is also a shortcut key combo.
Cons - It's a hack. The local run command fails and is ignored; it's only there so we can trigger the pre-processing steps (deploying and running on the server)
So there you have it: a solution for the purists, and an easier one for you hackers out there (you know who you are).