Saturday 17 September 2016

BIG DATA ADVANCED ANALYTICS : ELASTICSEARCH FOR HADOOP

This post discusses Hadoop and Elasticsearch.
Let me begin with a brief introduction to Hadoop and Elasticsearch (as many beginners may not be familiar with Elasticsearch).

Elasticsearch :
Elasticsearch is a great tool for document indexing and powerful full-text search. Its JSON-based domain-specific query language (DSL) is simple and powerful, and Elastic's ELK analytics stack is gaining momentum in web analytics use cases for these reasons (an example query follows the list below):
  • It is very easy to get a toy instance of Elasticsearch running with a small sample dataset.
  • Application developers are more comfortable maintaining a second Elasticsearch instance than a completely new technology stack like Hadoop.
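For example, a simple full-text match query in the DSL looks roughly like this (the index name logs and the field message are made up for illustration):

POST /logs/_search
{
  "query": {
    "match": { "message": "hadoop analytics" }
  }
}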

What about Hadoop?
HDFS separates data from state in its node architecture, using one over-arching node (the NameNode) that manages state for the entire cluster, and several daughter nodes (DataNodes) that store only data. The DataNodes execute commands from their master node, while the NameNode journals all namespace operations in an edit log. This allows a standby NameNode to quickly recreate the state of the system without needing to talk to another master during failover. This makes the system extremely fault tolerant, and it avoids the split-brain scenario that causes data loss among masters that must communicate with each other to restore state.

ELASTICSEARCH FOR HADOOP: 

Implementing a Hadoop instance as the backbone of an analytics system has a steep learning curve, but it’s well worth the effort. In the end, you’ll be much better off for its rock-solid data ingestion and broad compatibility with a number of third-party analytics tools, including Elasticsearch. There are a couple of advantages to running Elasticsearch on Hadoop, namely:

  • Speedy Search with Big Data Analytics.
  • Seamlessly Move Data between Elasticsearch and Hadoop.
  • Visualize HDFS Data in Real-Time with Kibana.
  • Sub-second Search Queries and Analytics on Hadoop Data.
  • Enhanced Security, Including Basic HTTP Authentication.
  • Works with Any Flavor of Hadoop Distribution.         
Hadoop also has a broad ecosystem of tools that support bulk uploading and ingestion of data, along with SQL engines to support the full querying power you expect from a standard database. On the other hand, it can be argued that standing up Hadoop, Zookeeper, and a Kafka ingestion agent requires as much domain-specific knowledge as Elasticsearch. Thus, the raw power and stability of Hadoop comes at the price of heavy setup and maintenance costs.
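As a hedged illustration of the integration, the es-hadoop connector's Hive support lets you expose an Elasticsearch index as an external Hive table, so HiveQL reads and writes go straight to Elasticsearch (the table, field, index and source-table names below are made up; es.resource and es.nodes are standard es-hadoop settings):

CREATE EXTERNAL TABLE es_logs (id STRING, message STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource' = 'logs/entry', 'es.nodes' = 'localhost:9200');

-- Rows written here are indexed into Elasticsearch; SELECTs read back from the index.
INSERT OVERWRITE TABLE es_logs SELECT id, message FROM hdfs_logs;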


Wednesday 24 August 2016

Evolutionary algorithm to tackle big-data clustering (in a nutshell)

Introduction:


Evolutionary Algorithms belong to the Evolutionary Computation field of study, which is concerned with computational methods inspired by the processes and mechanisms of biological evolution. The process of evolution by means of natural selection (descent with modification) was proposed by Darwin. Evolutionary Algorithms investigate computational systems that resemble simplified versions of the processes and mechanisms of evolution, toward achieving the effects of those processes and mechanisms, namely the development of adaptive systems. I will provide a brief introduction to genetic algorithms and their application to the big data clustering problem.


Genetic algorithm: 

Genetic Algorithms (GAs) are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics. The following steps are involved in a genetic algorithm (a minimal code sketch follows the list):




  • Initialization: the initial population typically contains several hundred or thousand possible solutions. Often, it is generated randomly, covering the entire range of possible solutions (the search space). A common refinement is to "seed" the population with solutions that are already known to be good.
  • Selection: a proportion of the existing population is selected to breed a new generation. Individual solutions are selected through a fitness-based process, where fitter solutions (as measured by a fitness function) are typically more likely to be selected.
  • Genetic operators: the next step is to generate a second-generation population of solutions from those selected, through a combination of genetic operators: crossover (also called recombination) and mutation. By producing a "child" solution using crossover and mutation, a new solution is created which typically shares many of the characteristics of its "parents".
  • Termination: these steps are repeated until a solution is found that satisfies minimum criteria, or a fixed number of generations is reached.
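To make these steps concrete, here is a minimal, self-contained sketch of a genetic algorithm in Java on the classic toy problem of maximizing the number of 1-bits in a bit string; the population size, genome length and mutation rate are arbitrary choices for illustration.

import java.util.Arrays;
import java.util.Random;

// Minimal genetic algorithm on bit strings ("OneMax"): fitness = number of 1-bits.
public class SimpleGA {
    static final int POP_SIZE = 50, GENOME_LEN = 32, GENERATIONS = 100;
    static final double MUTATION_RATE = 0.01;
    static final Random rnd = new Random();

    public static void main(String[] args) {
        // Initialization: random population
        boolean[][] pop = new boolean[POP_SIZE][GENOME_LEN];
        for (boolean[] ind : pop)
            for (int i = 0; i < GENOME_LEN; i++) ind[i] = rnd.nextBoolean();

        for (int gen = 0; gen < GENERATIONS; gen++) {
            boolean[][] next = new boolean[POP_SIZE][];
            for (int i = 0; i < POP_SIZE; i++) {
                // Selection: a tournament of size 2 picks each parent
                boolean[] p1 = tournament(pop), p2 = tournament(pop);
                // Genetic operators: one-point crossover followed by mutation
                boolean[] child = crossover(p1, p2);
                mutate(child);
                next[i] = child;
            }
            pop = next; // Termination here is simply a fixed number of generations
        }
        System.out.println("Best fitness: " + fitness(best(pop)));
    }

    static int fitness(boolean[] ind) {
        int f = 0;
        for (boolean b : ind) if (b) f++;
        return f;
    }

    static boolean[] tournament(boolean[][] pop) {
        boolean[] a = pop[rnd.nextInt(pop.length)], b = pop[rnd.nextInt(pop.length)];
        return fitness(a) >= fitness(b) ? a : b;
    }

    static boolean[] crossover(boolean[] p1, boolean[] p2) {
        int cut = rnd.nextInt(GENOME_LEN);
        boolean[] child = Arrays.copyOf(p1, GENOME_LEN);
        System.arraycopy(p2, cut, child, cut, GENOME_LEN - cut);
        return child;
    }

    static void mutate(boolean[] ind) {
        for (int i = 0; i < ind.length; i++)
            if (rnd.nextDouble() < MUTATION_RATE) ind[i] = !ind[i];
    }

    static boolean[] best(boolean[][] pop) {
        boolean[] b = pop[0];
        for (boolean[] ind : pop) if (fitness(ind) > fitness(b)) b = ind;
        return b;
    }
}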

Employing a genetic algorithm on the big data clustering problem:

Clustering is one of the important data mining problems, especially for big data analysis.
The goal of data clustering is to organize a set of n objects into k clusters such that objects in the same cluster are more similar to each other than to objects in different clusters. Clustering is one of the most popular tools for data exploration and data organization and has been widely used in almost every scientific discipline that collects data. The main issues are: (i) how to define pairwise similarity between objects, and (ii) how to efficiently cluster hundreds of millions of objects. Usually k-means clustering is used, but it has drawbacks when it comes to big data clustering. Here we will see how genetic algorithms may come in handy for big data clustering.
  • Initialization: for evaluating fitness you can use the Davies-Bouldin index (https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index). After determining the fitness value of each candidate solution, all viable solutions are collected into the population, keeping those with the best fitness.
  • Selection: you can use the tournament selection procedure (https://en.wikipedia.org/wiki/Tournament_selection).
  • Genetic operators: crossover can be organised around a red-black tree (https://en.wikipedia.org/wiki/Red%E2%80%93black_tree), a structure heavily used in algorithm design for optimization and for creating efficient buckets for data storage. Crossing over disjoint sets of solutions can then expose hidden relationships between them; hence crossover is accomplished as a crossover of disjoint sets.
  • Termination: the newly generated population replaces the older one, and in turn forms a newer population through the mating and selection procedures. The whole procedure is repeated until the termination condition is met (a sketch of a clustering-oriented encoding follows the list).
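As a hedged sketch of how a clustering solution can be represented for such a GA (not the exact scheme of any particular paper), a chromosome can simply be the k cluster centroids flattened into one array. The fitness below is the negative within-cluster sum of squared distances; in practice the Davies-Bouldin index mentioned above would replace it, and the GA loop sketched earlier would operate on these arrays.

import java.util.Random;

// Encoding a clustering solution: a chromosome holds K centroids of dimension DIM.
public class ClusteringChromosome {
    static final int K = 2, DIM = 2;
    static final double[][] DATA = { {1, 1}, {1.5, 2}, {8, 8}, {9, 8.5}, {8.5, 9} };

    public static void main(String[] args) {
        Random rnd = new Random();
        // Initialization: random centroids inside the data range (here 0..10)
        double[] chromosome = new double[K * DIM];
        for (int i = 0; i < chromosome.length; i++) chromosome[i] = rnd.nextDouble() * 10;
        System.out.println("Fitness of random chromosome: " + fitness(chromosome));
        // Selection, crossover, mutation and termination then proceed as in the
        // generic GA loop sketched earlier, treating each chromosome as one individual.
    }

    // Higher is better: negative total squared distance of each point to its nearest centroid
    static double fitness(double[] chromosome) {
        double total = 0;
        for (double[] point : DATA) {
            double best = Double.MAX_VALUE;
            for (int c = 0; c < K; c++) {
                double d = 0;
                for (int j = 0; j < DIM; j++) {
                    double diff = point[j] - chromosome[c * DIM + j];
                    d += diff * diff;
                }
                best = Math.min(best, d);
            }
            total += best;
        }
        return -total;
    }
}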
*The AnyScale Learning For All (ALFA) Group at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) aims to solve the most challenging big-data problems — questions that go beyond the scope of typical analytics. ALFA applies the latest machine learning and evolutionary computing concepts to target very complex problems that involve high dimensionality. http://news.mit.edu/2015/una-may-oreilly-evolutionary-approaches-big-data-problems-0114 *
  

Sunday 10 July 2016

NoSQL : Essentials and Short tutorial on Mongodb - Part 2

In this post, I am going to present a short and precise tutorial on MongoDB, its important commands, and its advanced features.

Documents and Collections 

Unlike an RDBMS, MongoDB has no rows, columns, tables, or joins. Instead, MongoDB employs documents and collections. Documents are sets of name/value pairs and can store arrays of values; to get a clear picture, think of documents as rows and collections as tables in a traditional database. Collections store documents. Now let me create a document:
vehicle = {
    name: "uden",
    type: "four-wheeler",
    company: "ferrai"
}
The example below shows how to create the same document using the Java driver (com.mongodb.BasicDBObject):
DBObject vehicle = new BasicDBObject("name", "uden")
        .append("type", "four-wheeler")
        .append("company", "ferrai");
Documents can also contain nested documents (sub-documents); for example, company could itself be a document holding the company's name and country.

Performing Insert, Update, Delete and Query

Let us see how to create a database and perform insert, update, read and delete operations. Creating a document was discussed in the previous section. A database is created (and switched to) using
use udendb
where udendb is the database name I will use in this tutorial; you can name your database whatever you desire.
Now let us insert the document into the database using the insert() command:
db.col.insert(vehicle)
Here col is the collection name; MongoDB creates the collection automatically the first time you insert into it.
You can also add several documents at once by passing them as an array. After inserting a document into the database, you can check the databases using
show dbs 
There is a default database known as test which stores collections if you don't wish to create a database.
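The same insert can also be done from Java; a minimal sketch, assuming the 2.x-era MongoDB Java driver that matches the BasicDBObject example earlier (host, port and names are illustrative):

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class InsertVehicle {
    public static void main(String[] args) {
        MongoClient mongoClient = new MongoClient("localhost", 27017);
        DB db = mongoClient.getDB("udendb");           // same database as above
        DBCollection col = db.getCollection("col");    // same collection as above

        DBObject vehicle = new BasicDBObject("name", "uden")
                .append("type", "four-wheeler")
                .append("company", "ferrai");
        col.insert(vehicle);                           // equivalent of db.col.insert(vehicle)

        mongoClient.close();
    }
}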
Update: in order to update values, the update() method is used; consider the following example:
db.col.update({'company':'ferrai'},{$set:{'company':'mahindra'}},{multi:false})
Here col is the collection name, and I have updated the company from ferrai to mahindra. If you want to apply the update to multiple documents, you need to set multi to true; in my example I have set multi to false since I only want to update a single document.
Query: to query documents, you can use the pretty() method along with find(), which displays the results in a structured way, for example:
db.col.find().pretty()
You can also query by criteria or conditions, using AND (denoted by a , (comma) between the conditions) and OR (denoted by $or):
dB.col.find({"color":{yellow},$or[{"by":"ferrai"},{"company":"mahindra"}]}).pretty()
Here the results will display vehicles that are yellow and whose company is either ferrai or mahindra.
Delete: deleting is a simple task which can be achieved using the remove() method.
db.col.remove({}) removes all documents in the collection.
db.col.remove({'company':'ferrai'}) removes only the documents matching the condition given within the parentheses.


Using GridFS

GridFS is a file system which can be used for storing and retrieving files such as images, videos, etc. It is mainly used to store files larger than 16 MB (the size limit of a single document).

fs.files is used to store the files' metadata.

fs.chunks is used to store the file chunks, where each chunk is given an ObjectId.

Now I will show you how to add an image file using GridFS:
mongofiles.exe -d gridfs put image.jpg
This command is run with the mongofiles.exe utility in the bin directory; gridfs is the database name (given by the -d flag) and the put command stores the image file in the gridfs database.
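The equivalent upload can be done from Java with the driver's GridFS class; a minimal sketch, again assuming the 2.x-era driver (database name and file path are illustrative):

import java.io.File;
import java.io.IOException;

import com.mongodb.DB;
import com.mongodb.MongoClient;
import com.mongodb.gridfs.GridFS;
import com.mongodb.gridfs.GridFSInputFile;

public class GridFSUpload {
    public static void main(String[] args) throws IOException {
        MongoClient mongoClient = new MongoClient("localhost", 27017);
        DB db = mongoClient.getDB("gridfs");            // same database as the mongofiles example

        GridFS gfs = new GridFS(db);                    // uses the default "fs" bucket
        GridFSInputFile file = gfs.createFile(new File("image.jpg"));
        file.setFilename("image.jpg");
        file.save();                                    // writes fs.files metadata and fs.chunks data

        mongoClient.close();
    }
}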

Mongodb Mapreduce 
Mapreduce is a large-scale data processing tool which is also supported by MongoDB. The command syntax is:
db.collection.mapReduce(
    function() { emit(key, value); },                   // map function
    function(key, values) { return reduceFunction; },   // reduce function
    { out: collection, query: document, sort: document, limit: number }
)
Let me explain with an example: I am going to collect all cars which are yellow in color, group them by company, and then count the number of cars manufactured by ferrai. Consider the document created in the first section (vehicle). The mapReduce call would be:
db.vehicle.mapReduce(
    function() { emit(this.company, 1); },
    function(key, values) { return Array.sum(values); },
    { query: { color: "yellow" }, out: "total_ferrai_cars" }
)
The result will be,
{ results : "total ferrai cars"
counts:{"input":18,"emit":3,"reduce":16,"output":2};}
The result shows that 18 documents matched the query for yellow, the map function emitted 3 key/value pairs according to the key, and the reduce function grouped equal keys into 2 outputs. Hence there are two cars manufactured by ferrai.

Mongodb Text search
MongoDB text search enables searching for specified words. Let me show you how to perform a text search with the help of an example.
Consider the vehicle document created in the first section; I want to search for the word ferrai in the company field. First you need to create a text index using this command:
db.vehicle.ensureIndex({company:"text"})
Now we can search the text using,
db.vehicle.find({ $text: { $search: "ferrai" } })


Conclusion: I covered the basic commands and some of MongoDB's advanced features. I have provided links below if you want more detailed information.

  • For more on create ,read,update and bulk write https://docs.mongodb.com/manual/crud/
  • To download Mongodb http://www.mongodb.org/downloads
Useful books to learn 







Monday 13 June 2016

NoSQL : Essentials and Short tutorial on Mongodb - Part 1

This post will provide insights into NoSQL; as a continuation, in the next post I will provide a short and precise tutorial on MongoDB.

About Databases: before diving into NoSQL, let me discuss the importance of databases. Databases are fundamental systems found in every organisation; they provide storage, retrieval, manipulation and analysis of data. Databases are vital to business because they radically enhance the advantage that data offers. In other words, a database converts data into meaningful information. In my point of view, storing data isn't hard to achieve, but deriving value or information from stored data is a tedious process, and optimal solutions are hard to find.


RDBMS: the relational model was proposed in E.F. Codd's 1970 paper "A Relational Model of Data for Large Shared Data Banks", and it made data modeling and application programming much easier. The relational model is well suited to client-server programming.


NoSQL: NoSQL databases encompass modern techniques such as simple design, enabling the storage of enormous amounts of data. NoSQL has become a popular architecture for handling large data volumes because it can be more efficient with regard to the processing power required to handle large files. Relational databases simply don't perform well unless they are given structured data. NoSQL databases are new enough that many database engineers will have some difficulty handling them; hence the emergence of NoSQL databases such as MongoDB and Neo4j makes things easier and provides developers with agility and flexibility.
In an RDBMS, the developer needs to design the data schema from the outset, SQL queries are then run against the database, and if that application/database undergoes any change, such as an update, the developer must be contacted again. Most NoSQL databases are open source and come with built-in communities. Daniel Doubrovkine, Art.sy's head of engineering, states that NoSQL databases like MongoDB are simple to start with and get more complex over time. The syntax of NoSQL differs from SQL and needs some training for new users; however, NoSQL provides plenty of online forums and documentation.
Migrating to NoSQL can be done by writing a bunch of SELECT * FROM statements against the relational database and then loading the data into your NoSQL document [or key/value, column, graph] model using the language of your choice; the statements can then be rewritten as NoSQL operations such as insert() and find().
Personal user information, social graphs, geolocation data, user-generated content and machine logging data are just a few examples where data has been increasing exponentially; SQL is not well suited to these types of data. The CAP theorem is the guiding principle when you talk about NoSQL databases, or in fact when designing any distributed system.
The CAP theorem states that there are three basic requirements: Consistency, Availability and Partition Tolerance.



Practically speaking, it is impossible to satisfy all three requirements at once; a distributed system can guarantee at most two of them at any given time.

NoSQL Categories:
Each of these categories has its own attributes and limitations.

Key-value: key-value stores are designed to handle large amounts of data, stored in the form of hash tables, and they hold values such as strings, JSON, etc. For example, "ANIMAL" is the key for the value "Lion". Key-value stores satisfy the availability and partition-tolerance properties of the CAP theorem.

Column-oriented databases: store data in columns, where every column is treated individually. Examples: SimpleDB, Cassandra, BigTable.

Document-oriented databases: collections of documents, with the data stored inside those documents. Examples: MongoDB, CouchDB.

Graph databases: store data as a graph, where each node represents an entity and each edge represents a relationship between two entities (nodes). Examples: OrientDB, Neo4j.








Saturday 4 June 2016

DEVELOPMENTS AND PROGRESS IN APACHE HADOOP HDFS

About HDFS:
HDFS is a Java-based file system that provides scalable and reliable data storage, designed to span large clusters of commodity servers; in other words, it is a distributed, Java-based file system for storing large volumes of data. HDFS forms the data management layer along with YARN. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4,500 servers, supporting close to a billion files and blocks. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage. HDFS provides the following features, which ensure high availability, fault tolerance, scalability and efficient storage of data (a small example of using the HDFS Java API follows the list below).

Rack awareness: takes a node's physical location into account when scheduling tasks and allocating storage.

Standby NameNode: the main component for providing redundancy and high availability.

Less data movement: tasks are processed on the physical node where the data resides, so data movement is reduced and high aggregate bandwidth is achieved.
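As a small, hedged illustration of working with HDFS programmatically (the NameNode address and paths below are made up), the standard org.apache.hadoop.fs API can be used to copy a local file into the distributed file system:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // cluster address is illustrative

        FileSystem fs = FileSystem.get(conf);
        // Upload a local file; HDFS transparently splits it into blocks
        // and replicates them across DataNodes.
        fs.copyFromLocalFile(new Path("events.log"), new Path("/data/events.log"));
        System.out.println("Exists in HDFS: " + fs.exists(new Path("/data/events.log")));
        fs.close();
    }
}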


Features of HDFS


PROGRESS IN HDFS:

  • HDFS version 2.3.0 provides centralized cache management, heterogeneous storage, an implementation of OpenStack Swift support, and HTTP support.
  • Version 2.4.0 provides metadata compatibility, rolling upgrades (allows upgrading individual HDFS daemons).
  • Version 2.5.0 provides incremental data copy and extended attributes for attaching additional metadata to files.

Monday 2 May 2016

Apache Spark as a service -The essentials and overview of services by IBM and Databricks

Spark provides the flexibility you need to succeed, focusing on simplifying your time to deployment, making your business users self-sufficient, and accelerating your time to value. Below, I discuss what Spark as a service has to offer.

Introduction about Spark 
Spark is an open-source analytics engine used for rapid, large-scale data processing in real time. Spark provides iterative and interactive processing and can be regarded as an alternative to MapReduce. Apache Spark can process data from data repositories including the Hadoop Distributed File System (HDFS) and Amazon Simple Storage Service (S3), and it can process unstructured data too. The main advantage is that Spark supports both in-memory and disk-based processing. Spark can be used from high-level programming languages such as Java, Scala and R, which makes it popular among data scientists for building analytical applications. Spark uses memory to its full advantage: recently read data is kept in memory, allowing faster query execution.

Spark vs Mapreduce 

  • SQL queries are executed much faster in Spark than Mapreduce.
  • Spark runs on hundreds of nodes, whereas Mapreduce can run on thousands of nodes.
  • Mapreduce is ideal for batch processing, while Spark is ideal for real-time processing since it uses in-memory storage and processing.
Spark SQL 
Spark SQL is a component of the Spark big data framework which allows SQL queries on data. It can query structured, columnar and tabular data in Spark, and it can be integrated with Hadoop databases like Hive, allowing interaction with Hadoop HDFS (a small example follows below).
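As a hedged sketch of how this looks in code, assuming Spark 2.x's SparkSession API and a made-up JSON dataset in HDFS, a DataFrame can be registered as a view and queried with plain SQL:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("VehicleQuery")
                .master("local[*]")              // local run; on a cluster this comes from spark-submit
                .getOrCreate();

        // Load a JSON file from HDFS into a DataFrame (path is hypothetical)
        Dataset<Row> vehicles = spark.read().json("hdfs:///data/vehicles.json");

        // Register as a temporary view and query it with plain SQL
        vehicles.createOrReplaceTempView("vehicles");
        Dataset<Row> yellow = spark.sql(
                "SELECT company, COUNT(*) AS cnt FROM vehicles WHERE color = 'yellow' GROUP BY company");
        yellow.show();

        spark.stop();
    }
}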

Spark's MLlib 
MLlib contains functionality such as statistics, regression, classification, clustering and collaborative filtering, which is needed for machine learning and real-time analysis (a clustering example follows below).
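For instance, here is a minimal, hedged sketch of training a k-means model with MLlib's RDD-based Java API on a tiny in-memory dataset (the app name, master and data points are illustrative):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("KMeansExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A toy dataset; in a real job this would come from HDFS via sc.textFile(...)
        JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
                Vectors.dense(1.0, 1.0), Vectors.dense(1.5, 2.0),
                Vectors.dense(8.0, 8.0), Vectors.dense(9.0, 8.5)));

        // Train k-means with k = 2 clusters and at most 20 iterations
        KMeansModel model = KMeans.train(points.rdd(), 2, 20);
        System.out.println("Cluster centers: " + Arrays.toString(model.clusterCenters()));

        sc.stop();
    }
}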

Spark Streaming
Spark Streaming converts live streams of big data into discretized data streams (DStreams); analysis and processing then take place on these streams using the Spark Streaming module.
Spark GraphX
This component provides processing on big data graphs, such as transforming the edges and vertices of a graph dataset.

Now let us see what makes Spark suitable as a service, in other words why we can consider Spark as a service.

Spark-as-a-service 
Spark is provided as a cloud-based service by IBM and Databricks because of its advantages over other processing frameworks such as Mapreduce (which I discussed earlier). Companies like Databricks allow faster deployment of Spark: Databricks eases processes such as cluster building and configuration, and process and resource monitoring are taken care of by Databricks. IBM's Spark-as-a-service offers new APIs, tackles unstructured data, and is delivered as IBM Analytics for Apache Spark on IBM Bluemix. IBM's DataCap Insight Cloud services deliver data science based on external data about events and people. The company says these services don't require deep knowledge of big data analysis. Whenever we come across IBM and cognitive computing we think of Watson, but this IBM DataCap Insight offering also provides a way to handle unstructured (i.e. machine-unfriendly) data.