Compared to existing technologies, Spark significantly speeds up the analysis and generates results in real time while effectively storing data in-memory. Spark can perform both in-memory and on-disk and hence will effectively handle the cases of where in-memory is insufficient to handle complete data set or they do not fit in the total of all the cluster memories. Execution in in-memory or on-disk memory can be adjusted according to your application requirements. Since, Spark performs many operations, keeping intermediate results in-memory, it enables to achieve improved performance if you are constantly using the same set of data.
Apache Spark has optimized Mapreduce functions, reducing the computational cost involved in processing the data and also enables the user to optimize data processing pipelines by supporting lazy-evaluation. As mentioned above, Spark supports additional functionalities than MapReduce for data processing and analysis and has improved performance in generating arbitrary operator graphs as well.
Spark was originally written in Scala language and runs inside a Java Virtual Machine (JVM). A command line tool is available for Scala and Python and it provides programming interfaces in Scala, Java and Python languages. In addition to these three languages, applications running with Spark framework can be also implemented in R and Clojure.
Comparison with Hadoop
As far as big data analysis is considered, Hadoop has been a promising solution for over a decade. With its added features and optimizations, Apache Spark comes as a promising alternative for Hadoop MapReduce. Let’s dig deeper to see how.
Hadoop would be a perfect for a solution including sequential data processing, where each step would consist of a Map and a Reduce function with a single pass over the input. But for certain applications involving several passes over the input, it comes with the cost of maintaining the MapReduce operations at each step of the computation. This workflow of Hadoop MapReduce requires constant writing to and reading from the cluster memories and may significantly slow down the system with added time taken for these memory operations. Some other considerations which make using Hadoop painful are the complexity of configuring and maintaining the clusters and integration of different third party tools based on what the application includes, whether it includes machine learning or processing of streamed data, etc.
With its optimized design, Apache Spark executes on a set up similar to Hadoop Distributed File System and has achieved improved performance over the existing, while providing added functionalities. With its functions implemented using Directed Acyclic Graphs, Spark enables to develop pipelines involving multiprocessing of the data. It shares the data across these data structures in-memory and allows to process the same set of data in parallel. Spark comes with utilities to develop applications including different types of data analysis and processing and hence comes as a comprehensive solution as illustrated in the next section.