Apache Spark RDD and SparkSQL with Java Samples

M.D
3 min readSep 26, 2020


Hi everyone, I’m learning big data technologies and I’m sharing my notes about some of them with you. I developed a Java application with which you can learn how Spark RDDs (Resilient Distributed Datasets) and Spark SQL commands work.

Let’s start..

What’s Spark?

‘Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast queries against data of any size. Simply put, Spark is a fast and general engine for large-scale data processing.’

  • Apache Spark Core : ‘Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and referencing datasets in external storage systems.’
  • Spark SQL : ‘Spark SQL is Apache Spark’s module for working with structured data. The interfaces offered by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed.’
  • Spark Streaming : ‘This component allows Spark to process real-time streaming data. Data can be ingested from many sources like Kafka, Flume, and HDFS (Hadoop Distributed File System). Then the data can be processed using complex algorithms and pushed out to file systems, databases, and live dashboards.’
  • MLlib (Machine Learning Library) : ‘Apache Spark is equipped with a rich library known as MLlib. This library contains a wide array of machine learning algorithms: classification, regression, clustering, and collaborative filtering. It also includes other tools for constructing, evaluating, and tuning ML Pipelines. All these functionalities help Spark scale out across a cluster.’
  • GraphX : ‘Spark also comes with a library to manipulate graph databases and perform computations called GraphX. GraphX unifies the ETL (Extract, Transform, and Load) process, exploratory analysis, and iterative graph computation within a single system.’

You can use static data sources such as MongoDB, Elasticsearch, and PostgreSQL, or streaming data sources such as Kafka and Amazon Kinesis.

Spark Core provides the RDD API; however, it’s generally preferred to use Datasets and DataFrames instead of raw RDDs.
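As a minimal sketch of that difference (class name and sample data are my own, assuming Spark is on the classpath), here is the same filter written with the low-level RDD API and with the Dataset API, which goes through the Catalyst optimizer:

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class RddVsDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("RddVsDataset")
                .master("local[*]")   // local mode, just for the demo
                .getOrCreate();

        // Low-level API: an RDD of integers, filtered with a plain lambda
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        JavaRDD<Integer> rdd = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        long evenViaRdd = rdd.filter(n -> n % 2 == 0).count();   // 2

        // Higher-level API: a typed Dataset, optimized by Catalyst
        Dataset<Integer> ds = spark.createDataset(
                Arrays.asList(1, 2, 3, 4, 5), Encoders.INT());
        long evenViaDs = ds.filter(
                (FilterFunction<Integer>) n -> n % 2 == 0).count();   // 2

        System.out.println(evenViaRdd + " / " + evenViaDs);
        spark.stop();
    }
}
```

Both counts are the same; the Dataset version additionally gives Spark schema information it can use to optimize the query.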

Spark Features

  • Fast processing : ‘The most important feature of Apache Spark that has made the big data world choose this technology over others is its speed. Big data is characterized by volume, variety, velocity, and veracity, which need to be processed at a higher speed. Spark contains the Resilient Distributed Dataset (RDD), which saves time in reading and writing operations, allowing it to run almost ten to one hundred times faster than Hadoop.’
  • Flexibility : ‘Apache Spark supports multiple languages and allows developers to write applications in Java, Scala, R, or Python.’
  • In-memory computing : ‘Spark stores the data in the RAM of servers, which allows quick access and in turn accelerates the speed of analytics.’
  • Real-time processing : ‘Spark is able to process real-time streaming data. Unlike MapReduce, which processes only stored data, Spark is able to process real-time data and is, therefore, able to produce instant outcomes.’
  • Better analytics : ‘In contrast to MapReduce, which includes only Map and Reduce functions, Spark includes much more than that. Apache Spark consists of a rich set of SQL queries, machine learning algorithms, complex analytics, etc. With all these functionalities, analytics can be performed in a better fashion with the help of Spark.’

( https://chartio.com/learn/data-analytics/what-is-spark/ )

More Info about RDD : https://intellipaat.com/blog/tutorial/spark-tutorial/programming-with-rdds/

RDD Operations

We have two kinds of operations: transformations and actions.

  • Transformations create a new RDD: map, filter, flatMap, mapPartitions, distinct, sample, union, subtract.
  • Actions are used for computing and saving results: count, first, show, etc.
  • SparkSession is the newer entry point for reading CSV, JSON, and other data sources, instead of parsing lines manually with a map transformation.
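The transformation/action split above can be sketched as follows (sample data and class name are my own; transformations are lazy and nothing executes until an action runs):

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class RddOperations {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("RddOperations")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        JavaRDD<String> lines = jsc.parallelize(Arrays.asList(
                "spark makes big data simple",
                "spark is fast"));

        // Transformations: each returns a new RDD, nothing is computed yet
        JavaRDD<String> words = lines.flatMap(
                line -> Arrays.asList(line.split(" ")).iterator());
        JavaRDD<String> distinctWords = words.distinct();

        // Actions: these trigger the actual computation
        System.out.println(lines.first());            // "spark makes big data simple"
        System.out.println(words.count());            // 8 words in total
        System.out.println(distinctWords.count());    // 7 distinct words

        spark.stop();
    }
}
```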

Spark SQL

You can also write Spark jobs with SQL commands. That’s usually more performant than hand-written RDD code, and the queries are converted to RDD operations automatically under the hood.
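As a small sketch of this style (the Person bean, class name, and sample rows are my own, assuming Spark is on the classpath), you register a DataFrame as a temporary view and query it with SQL:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlFilterSample {
    // JavaBean so Spark can infer the schema (name: string, age: int)
    public static class Person implements Serializable {
        private String name;
        private int age;
        public Person() {}
        public Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SqlFilterSample")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> people = spark.createDataFrame(
                Arrays.asList(new Person("Ada", 36), new Person("Linus", 17)),
                Person.class);

        // Register the DataFrame as a temporary view and query it with SQL
        people.createOrReplaceTempView("people");
        Dataset<Row> adults = spark.sql("SELECT name FROM people WHERE age >= 18");
        System.out.println(adults.count());   // 1 row matches

        spark.stop();
    }
}
```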

In the GitHub project you can find different samples of RDD and Spark SQL usage. Each task is shown in a separate main method: just open a class such as SQLFilter and run its main method.

GitHub Project :

For more information :

I hope it was useful. See you next time!

Written by M.D

Software Engineer, Full Stack Developer