Apache Spark


Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.

Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, and Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including the Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3. Spark also supports a pseudo-distributed local mode, usually used only for development or testing, where distributed storage is not required and the local file system can be used instead; in this scenario, Spark runs on a single machine with one executor per CPU core.
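The choice of cluster manager is made at launch time via the `--master` option. A sketch of the common forms (the standalone host and port below are illustrative, not actual cluster values):

```shell
# Pseudo-distributed local mode: one executor thread per CPU core
% /util/spark/bin/spark-shell --master local[*]

# Standalone (native Spark cluster); host and port are illustrative
% /util/spark/bin/spark-shell --master spark://master-host:7077

# Hadoop YARN; requires HADOOP_CONF_DIR to point at the Hadoop configuration
% /util/spark/bin/spark-shell --master yarn --deploy-mode client
```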

Spark had over 465 contributors in 2014, making it the most active project in the Apache Software Foundation and one of the most active Big Data open-source projects.


Compute Systems          Invocation                          Version(s)
Red Hat Linux (64-bit)   % /util/spark/bin/beeline           1.6.1 (default)
                         % /util/spark/bin/pyspark
                         % /util/spark/bin/run-example
                         % /util/spark/bin/spark-class
                         % /util/spark/bin/sparkR
                         % /util/spark/bin/spark-shell
                         % /util/spark/bin/spark-sql
                         % /util/spark/bin/spark-submit
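Of the commands above, `run-example` and `spark-submit` launch non-interactive jobs. A minimal sketch (`app.py` is a placeholder for a user application, not a provided file):

```shell
# Run a bundled example: approximate pi using 10 partitions
% /util/spark/bin/run-example SparkPi 10

# Submit a user application in local mode with 4 executor threads
# (app.py is a placeholder path)
% /util/spark/bin/spark-submit --master local[4] app.py
```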


Notes

  1. Exit pyspark with quit().
  2. Exit spark-shell with Ctrl-d (Control d).
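As a quick illustration of the in-memory workflow described above, a pyspark session can cache a dataset and query it repeatedly without re-reading from disk (`input.txt` is a placeholder path; `sc` is the SparkContext the shell predefines):

```shell
% /util/spark/bin/pyspark
>>> lines = sc.textFile("input.txt").cache()        # load into cluster memory
>>> lines.count()                                   # first action materializes the cache
>>> lines.filter(lambda l: "error" in l).count()    # served from memory, not disk
>>> quit()
```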


Instructors

  1. Bina Ramamurthy, instructor
  2. Oliver Kennedy, instructor


References

  1. https://en.wikipedia.org/wiki/Apache_Spark
  2. https://spark.apache.org/