Monday 2 May 2016

Apache Spark as a service - The essentials and an overview of services by IBM and Databricks

Spark provides the flexibility you need to succeed, with a focus on shortening your time to deployment, making your business users self-sufficient, and accelerating your time to value. In this post I discuss what Spark as a service has to offer:

Introduction to Spark
Spark is an open source analytics engine used for fast, large-scale data processing in near real time. It supports iterative and interactive processing and can be regarded as an alternative to MapReduce. Apache Spark can process data from data repositories, including the Hadoop Distributed File System (HDFS) and Amazon Simple Storage Service (S3), and it can process unstructured data as well. Its main advantage is that it supports both in-memory processing and disk-based processing. Spark can be used from high-level programming languages such as Java, Scala, and R, which makes it popular among data scientists building analytical applications. Spark uses memory to its full advantage: recently read data is kept in memory, which allows faster query execution.

Spark vs MapReduce

  • SQL queries generally execute much faster in Spark than in MapReduce.
  • Spark is typically run on hundreds of nodes, whereas MapReduce can run on thousands of nodes.
  • MapReduce is ideal for batch processing, whereas Spark is better suited to real-time processing because it uses in-memory storage and processing.
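
The in-memory point above can be illustrated with a small sketch. This is plain Python, not the Spark API: `CountingSource`, `iterate_from_disk`, and `iterate_in_memory` are hypothetical names made up for this illustration, and the "iterative job" is just a toy update. It only shows why keeping data in memory helps when the same dataset is processed repeatedly.

```python
from typing import List


class CountingSource:
    """Hypothetical stand-in for a slow, disk-backed dataset; counts full reads."""

    def __init__(self, records: List[float]) -> None:
        self._records = list(records)
        self.reads = 0

    def read(self) -> List[float]:
        # Every call simulates one full scan of the stored data.
        self.reads += 1
        return list(self._records)


def iterate_from_disk(source: CountingSource, passes: int) -> float:
    """Re-read the dataset on every pass, like a chain of separate batch jobs."""
    estimate = 0.0
    for _ in range(passes):
        data = source.read()
        estimate = (estimate + sum(data) / len(data)) / 2  # toy iterative update
    return estimate


def iterate_in_memory(source: CountingSource, passes: int) -> float:
    """Read the dataset once, keep it in memory, and reuse it on every pass."""
    data = source.read()
    estimate = 0.0
    for _ in range(passes):
        estimate = (estimate + sum(data) / len(data)) / 2  # same toy update
    return estimate
```

Both functions return the same result, but the first scans the source once per pass while the second scans it only once, which is the kind of saving in-memory processing aims for in iterative workloads.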

Spark SQL
Spark SQL is a component of the Spark big data framework that allows SQL queries to be run on data. It can query structured, columnar, and tabular data in Spark. It can also be integrated with Hadoop data stores such as Hive, allowing interaction with data in HDFS.

Spark's MLlib 
MLlib contains functionality such as statistics, regression, classification, and collaborative filtering, which is needed for machine learning and real-time analysis.

Spark Streaming
Spark Streaming turns a continuous flow of big data into data streams; analysis and processing are then performed on these streams using the Spark Streaming module.
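
The idea of cutting a continuous flow into small pieces that are processed one after another can be sketched as follows. This is plain Python, not the Spark Streaming API: `micro_batches` and `word_counts_per_batch` are hypothetical helpers, and the sketch assumes batches of a fixed record count, whereas the real module groups records by time interval.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a (possibly endless) stream into consecutive batches of batch_size records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def word_counts_per_batch(stream: Iterable[str], batch_size: int) -> Iterator[Dict[str, int]]:
    """Run the same small batch computation (a word count) on each micro-batch."""
    for batch in micro_batches(stream, batch_size):
        counts: Counter = Counter()
        for line in batch:
            counts.update(line.split())
        yield dict(counts)
```

Because each batch is processed independently, results become available shortly after the data arrives instead of only once the whole input has been collected.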

Spark GraphX
This component provides processing of large graphs, such as changing the edges and vertices of a graph dataset.
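
To make "changing edges and vertices" concrete, here is a minimal sketch of a graph dataset held as a vertex table and an edge set. It is plain Python, not the GraphX API; `TinyGraph` and its methods are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class TinyGraph:
    """Hypothetical directed graph: labelled vertices plus (source, destination) edges."""

    vertices: Dict[int, str] = field(default_factory=dict)
    edges: Set[Tuple[int, int]] = field(default_factory=set)

    def add_vertex(self, vid: int, label: str) -> None:
        self.vertices[vid] = label

    def add_edge(self, src: int, dst: int) -> None:
        # Only allow edges between vertices that already exist.
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be existing vertices")
        self.edges.add((src, dst))

    def remove_vertex(self, vid: int) -> None:
        # Removing a vertex also removes every edge that touches it.
        del self.vertices[vid]
        self.edges = {(s, d) for (s, d) in self.edges if vid not in (s, d)}

    def out_degrees(self) -> Dict[int, int]:
        # Count outgoing edges per vertex, including vertices with none.
        degrees = {vid: 0 for vid in self.vertices}
        for src, _ in self.edges:
            degrees[src] += 1
        return degrees
```

A graph engine performs operations of this kind on graphs far too large for a single machine; the sketch only shows the shape of the data and of the edits.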

Now let us see which factors make Spark suitable as a service, in other words, why we can consider Spark as a service.

Spark-as-a-service 
Spark is provided as a cloud-based service by IBM and Databricks because of its advantages over other processing frameworks such as MapReduce (which I discussed earlier). Companies like Databricks allow faster deployment of Spark. Databricks eases tasks such as cluster building and configuration, and it takes care of process monitoring and resource monitoring.

IBM's Spark-as-a-service offers new APIs, tackles unstructured data, and follows IBM's analytics on Spark on IBM Bluemix. IBM Datacap Insight Cloud Services delivers data science based on external data about events and people. The company said that the services do not require deep knowledge of big data analysis. Whenever we come across IBM and cognitive computing, we think of Watson, but IBM Datacap Insight also provides a way to handle unstructured data (i.e. machine-unfriendly data).