Since there are already instructions on how to set up your environment to run Spark using the REPL, I will assume that you have already:
1. Installed PyCharm
2. Installed Scala
3. Installed Spark
4. Set up a development Hadoop cluster, or are using the Hortonworks Sandbox (http://hortonworks.com/sandbox)
5. Successfully run "Hello World" (Word Count) from your laptop against a text file, and via spark-submit on your cluster, preferably from a Python script written in PyCharm and then SFTP'ed to the cluster for submission.
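For reference, a word count in that spirit might look like the following sketch. The function names and the split into a pure-Python helper plus an RDD version are illustrative choices of mine, not part of any distribution:

```python
"""word_count.py -- a minimal "Hello World" word count sketch."""
from collections import Counter


def count_words(lines):
    """Pure-Python equivalent of the Spark job, handy for checking
    the logic locally without a cluster."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


def spark_word_count(sc, path):
    """The same count expressed as RDD transformations.
    `sc` is a live SparkContext; `path` is a local or HDFS text file."""
    return (sc.textFile(path)
              .flatMap(lambda line: line.split())   # one record per word
              .map(lambda word: (word, 1))          # pair each word with 1
              .reduceByKey(lambda a, b: a + b)      # sum counts per word
              .collectAsMap())
```

When submitting on the cluster you would create the context yourself, e.g. `from pyspark import SparkContext; sc = SparkContext(appName="WordCount")`, then run `spark-submit word_count.py` (adding `--master yarn` for YARN mode).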
At this point, you have a development environment set up: you can create Python scripts in PyCharm, transfer them to the Hadoop cluster, and run them via spark-submit in either local or YARN mode.
But a few things are still missing:
1. IntelliSense for syntax help and highlighting of the pyspark objects and classes while you develop.
2. An integrated way to deploy or copy your code (pyspark scripts) from the development environment to the Hadoop cluster.
3. An integrated way to run a pyspark script on the cluster from PyCharm.
There are a couple of options depending on how you work, so let's set them all up.
Depending on your version of PyCharm, whether you are on a PC or a Mac, and other environment variables, these steps may differ, but the same ideas apply in any environment.

Set up IntelliSense
1. Open your interpreter settings (File -> Settings -> Project Interpreter, or PyCharm -> Preferences -> Project Interpreter on a Mac)
2. Click the gear icon to the right of the Project Interpreter dropdown
3. Click More... from the context menu
4. Choose the interpreter, then click the "Show Paths" icon (bottom right)
5. Click the + icon to add the following paths:
6. Click OK, OK, OK
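The paths in question are usually the `python` directory of your Spark installation plus the bundled Py4J zip. Exact locations vary by Spark version and install, so treat this snippet as a sketch that finds likely candidates rather than a definitive list:

```python
import glob
import os


def pyspark_paths(spark_home):
    """Return the paths PyCharm needs for pyspark IntelliSense:
    the pyspark sources and the bundled py4j zip. The py4j filename
    changes between Spark versions, hence the glob."""
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip"))
    return [python_dir] + py4j_zips

# Example (the SPARK_HOME value is a placeholder for your install):
# pyspark_paths(r"C:\spark-1.6.0")
```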
Go ahead and test your new IntelliSense capabilities.
Set up Deployment Server
1. Click Tools -> Deployment -> Configurations...
2. Under the Connections tab:
a. Select SFTP in the Type: dropdown list
b. Enter the host for the Spark driver program on your Hadoop cluster. Choose an edge node if you have one; otherwise choose a master node. If you only have the sandbox, then your host choice is easy.
c. Enter 22 in the Port: textbox
d. Enter your home directory or the path where you are going to launch your pyspark scripts from on the cluster in the Root path: textbox
e. Enter your username in the User Name textbox.
f. Enter the password in the Password textbox (if you are using key pair authentication, choose that option instead).
g. Ignore the web server root URL. This functionality was envisioned for copying/deploying files to a web server, but we can leverage it for our purposes.
h. At the top, if you haven't already, enter a name for this connection.
i. Click OK
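Collected in one place, the connection settings from steps (a) through (h) amount to something like the mapping below. The host and paths are placeholders for your own cluster, and the small validator simply mirrors the checks described above:

```python
# Connection settings from steps (a)-(h); substitute your cluster's values.
DEPLOYMENT = {
    "type": "SFTP",
    "host": "edge-node.example.com",  # edge node, master node, or sandbox
    "port": 22,
    "root_path": "/home/<username>",  # where you launch pyspark scripts
    "user": "<username>",
    "auth": "password",               # or "key pair"
}


def validate(conn):
    """Catch the common misconfigurations before clicking OK."""
    problems = []
    if conn.get("type") != "SFTP":
        problems.append("type must be SFTP")
    if conn.get("port") != 22:
        problems.append("SFTP usually listens on port 22")
    if not conn.get("root_path"):
        problems.append("root path is required")
    return problems
```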
Test this by doing your first deployment: with a Python script selected, click Tools -> Deployment -> Upload to <ConnectionName>.
Set up an External Tool for Deployment
1. Click File -> Settings (or PyCharm -> Preferences on a Mac)
2. Navigate to Tools -> External Tools
3. Click the + icon to add a new external tool
4. Enter "Deploy to <ConnectionName>" in the Name textbox
5. Enter "C:\Program Files (x86)\WinSCP\WinSCP.com" in the Program textbox (your path may be different). Make sure you use WinSCP.com and not WinSCP.exe.
6. Enter '/command "open sftp://<username>:<password>@<hostURL>/" "put $FilePath$ <target directory>/" "exit"' in the Parameters textbox. Do not enter the outer tick marks, and replace the placeholders in angle brackets with your own values.
7. Enter "C:\Program Files (x86)\WinSCP" in the Working directory textbox
8. Click OK, OK
Click Tools -> External Tools (the top one) -> Deploy to <ConnectionName> to test the deployment. You now have two ways to get your code over to the cluster.
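If it helps to see how WinSCP reads that step-6 argument string, here is a small helper (hypothetical, not part of any tool) that assembles it from its pieces. In PyCharm, $FilePath$ is substituted automatically for the selected script; here the script path is passed explicitly, and all values shown are placeholders:

```python
def winscp_args(username, password, host, file_path, target_dir):
    """Build the step-6 argument string for WinSCP.com's /command mode.
    Each quoted segment is one WinSCP scripting command, run in order:
    open a session, upload the file, then exit."""
    return (
        '/command '
        '"open sftp://{user}:{pw}@{host}/" '
        '"put {src} {dst}/" '
        '"exit"'
    ).format(user=username, pw=password, host=host,
             src=file_path, dst=target_dir)


# Placeholder values standing in for PyCharm's $FilePath$ and your cluster:
print(winscp_args("hadoop", "secret", "edge-node.example.com",
                  "wordcount.py", "/home/hadoop"))
```

Keep in mind that embedding a password on the command line (as step 6 does) exposes it to anyone who can view the process list; key pair authentication avoids that.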