last hacked on Jul 22, 2017

**Apache Spark** is an open-source, fast, general-purpose cluster computing system. Originally developed at UC Berkeley, the Spark codebase was later donated to the Apache Software Foundation. It was developed in response to limitations of the **MapReduce** cluster computing paradigm used by **Apache Hadoop**. Key components of Apache Spark include **Spark Core**, **Spark SQL**, **Spark Streaming**, **MLlib**, **GraphX**, and **Cluster Managers**.

## Spark Core

Spark Core contains the components for task scheduling, memory management, and fault recovery. It is also home to the API that defines *resilient distributed datasets* (RDDs), which are Spark's main programming abstraction. An RDD represents a collection of items distributed across many compute nodes that can be manipulated in parallel.

## Spark SQL

Spark SQL allows for working with structured data, for example by querying it with SQL.

## Spark Streaming

Spark Streaming allows for processing of live streams of data.

## MLlib

MLlib is Spark's machine learning library. It comes with a variety of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, among others.

## GraphX

GraphX is a library for manipulating graphs and performing graph-parallel computations.

## Cluster Managers

Spark is designed to scale from one to thousands of compute nodes. To accomplish this, Spark can run on a variety of cluster managers, including Hadoop YARN, Apache Mesos, and the Spark Standalone Scheduler.
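
To make the RDD idea a little more concrete, here is a minimal plain-Python sketch of "split a collection into partitions, work on each partition in parallel, then combine the results". This is only an analogy and not Spark's API: the helper names `partition` and `parallel_map_reduce` are made up for illustration, and threads on a single machine stand in for compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(items, num_partitions):
    """Split a list into num_partitions chunks by striding through it."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def parallel_map_reduce(items, map_fn, reduce_fn, num_partitions=4):
    """Map and reduce each partition in its own worker, then merge the partial results.

    reduce_fn should be associative and commutative, because the
    partitioning above does not preserve the original item order.
    """
    chunks = [chunk for chunk in partition(items, num_partitions) if chunk]
    if not chunks:
        raise ValueError("items must not be empty")

    def process(chunk):
        # Each worker only ever sees its own partition.
        return reduce(reduce_fn, (map_fn(x) for x in chunk))

    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(process, chunks))
    return reduce(reduce_fn, partials)


# Sum of the squares of 1..10, computed partition by partition.
total = parallel_map_reduce(list(range(1, 11)), lambda x: x * x, lambda a, b: a + b)
print(total)
```

Spark does the same kind of thing across a cluster, and additionally handles scheduling, data movement, and recovery when a node fails.
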
# Setting up Apache Spark

This project assumes you are running on a fresh copy of **Ubuntu 16.04**. To install Apache Spark, you must first install a few dependencies, including Python, Java, SBT, and Scala. Let's get going!

You can use the following script on your new instance to install Spark: <a href="https://raw.githubusercontent.com/kaizenraizen/AWSSparkSetupScript/master/setupSpark.sh" download>setupSpark.sh</a>. The script is a compilation of the commands explained below, so you can refer to the steps below for more clarity.

To transfer the script onto your new instance, you can use the `scp` command. The syntax is as follows:

    $ scp -i "key.pem" setupSpark.sh ubuntu@(your instance's IP):/home/ubuntu

## Install Python

**PySpark** requires **Python 2.7** to be available as `python`. Normally, you should already meet this dependency; if you do not, then enter into your terminal:

    $ sudo apt install python-minimal

## Install Java

**Spark** depends on **Java**. To install `java8`, enter into your terminal:

    $ sudo apt-get update
    $ sudo apt-get install default-jre
    $ sudo apt-get install default-jdk
    $ sudo add-apt-repository ppa:webupd8team/java
    $ sudo apt-get update
    $ sudo apt-get install oracle-java8-installer
    $ sudo update-alternatives --config java

Select the `/usr/lib/jvm/java-8-oracle/jre/bin/java` option when prompted.

Set your `JAVA_HOME` variable. For this, open `/etc/environment` and add this code:

    JAVA_HOME="/usr/lib/jvm/java-8-oracle"

Save and quit the file, then reload it. For this, enter into your terminal:

    $ source /etc/environment
    $ echo $JAVA_HOME

The terminal should return `/usr/lib/jvm/java-8-oracle`.

## Install SBT

**SBT** is a build tool for **Java** and **Scala**.
To install it, enter into your terminal:

    $ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
    $ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
    $ sudo apt-get update
    $ sudo apt-get install sbt

## Install Scala

**Spark** depends on **Scala**. To install it, enter into your terminal:

    $ sudo apt-get update
    $ sudo apt-get install scala

## Install Apache Spark

Now install **Spark** on your system. This guide uses the 2.1.0 release built for Hadoop 2.7:

    $ cd ~
    $ curl http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz | tar xvz -C .
    $ mv spark-2.1.0-bin-hadoop2.7 spark
    $ sudo mv spark/ /usr/lib/
    $ cd /usr/lib/spark/conf/
    $ cp spark-env.sh.template spark-env.sh

Open `/usr/lib/spark/conf/spark-env.sh` and add this code:

    JAVA_HOME=/usr/lib/jvm/java-8-oracle
    SPARK_WORKER_MEMORY=4g

To disable IPv6, open `/etc/sysctl.conf` and add this code:

    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1

Open `~/.bashrc` and add this code:

    export JAVA_HOME=/usr/lib/jvm/java-8-oracle
    export SBT_HOME=/usr/share/sbt-launcher-packaging/bin/sbt-launch.jar
    export SPARK_HOME=/usr/lib/spark
    export PATH=$PATH:$JAVA_HOME/bin
    export PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin

Restart your terminal.

# Run Apache Spark

The Spark API is available in several languages, including Python and Scala.

## Spark API via Python

To access the Spark API with Python, start PySpark. For this, enter into your terminal:

    $ pyspark

Now you can interact with the Spark API by entering Python commands. Let's try a simple line-count example. It reads Spark's own `README.md`, so start `pyspark` from `/usr/lib/spark` or adjust the path:

    >>> lines = sc.textFile("README.md")
    >>> lines.count()
    >>> lines.first()

Here `lines.count()` returns the number of lines in the file, and `lines.first()` returns its first line.

Congratulations. You have successfully set up Apache Spark on your system, and are ready to get more sophisticated with your own distributed computing jobs. To continue learning, check out the [official Apache Spark documentation](http://spark.apache.org/docs/latest/quick-start.html).
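
If `pyspark` refuses to start, the usual suspects are the environment variables and `PATH` entries from `~/.bashrc`. The sketch below is one way to sanity-check them. It is not part of Spark: `check_spark_env` is a hypothetical helper name, and the script assumes you run it with `python3`, since it relies on `shutil.which`.

```python
import os
import shutil


def check_spark_env(env=None):
    """Return a list of human-readable problems; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []

    # JAVA_HOME and SPARK_HOME should both be set and point at real directories.
    for var in ("JAVA_HOME", "SPARK_HOME"):
        value = env.get(var)
        if not value:
            problems.append(var + " is not set")
        elif not os.path.isdir(value):
            problems.append(var + " points to a missing directory: " + value)

    # The java and pyspark launchers should be reachable through PATH.
    for tool in ("java", "pyspark"):
        if shutil.which(tool, path=env.get("PATH", "")) is None:
            problems.append(tool + " was not found on PATH")

    return problems


if __name__ == "__main__":
    issues = check_spark_env()
    for issue in issues:
        print("problem:", issue)
    if not issues:
        print("environment looks ready for pyspark")
```

An empty list of problems does not prove that Spark itself works, but it does rule out the most common setup mistakes before you start digging through logs.
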

