Wednesday, 24 August 2016

Evolutionary algorithm to tackle big-data clustering (in a nutshell )

Introduction:


Evolutionary Algorithms belong to the Evolutionary Computation field of study concerned with computational methods inspired by the process and mechanisms of biological evolution. The process of evolution by means of natural selection (descent with modification) was proposed by Darwin.Evolutionary Algorithms are concerned with investigating computational systems that resemble simplified versions of the processes and mechanisms of evolution toward achieving the effects of these processes and mechanisms, namely the development of adaptive systems.I will provide a brief introduction about genetic algorithm and its application in big data clustering problem.


Genetic algorithm: 

Genetic Algorithms (GAs) are adaptive heuristic search algorithm based on the evolutionary ideas of natural selection and genetics. These are the following steps which involves in genetic algorithm:




  • Initialization : typically contains several hundreds or thousands of possible solutions. Often, the initial population is generated randomly, allowing the entire range of possible solutions (the search space) . Best way is to create select ("seed") with initial optimal solutions.
  • Selection: a proportion of the existing population is selected to breed a new generation. Individual solutions are selected through a fitness-based process, where fitter solutions (as measured by a fitness function) are typically more likely to be selected.
  • Genetic operators:The next step is to generate a second generation population of solutions from those selected through a combination of genetic operators: crossover (also called recombination), and mutation. By producing a "child" solution using the crossover and mutation, a new solution is created which typically shares many of the characteristics of its "parents".
  • Termination:These steps are repeated until a solution is found that satisfies minimum criteria,fixed number of generations reached.

Employing Genetic algorithm to big data problem:

The clustering is one of the important data mining issue especially for big data analysis,
The goal of data clustering is to organize a set of n objects into k clusters such that objects in the same cluster are more similar to each other than objects in different clusters. Clustering is one of the most popular tools for data exploration and data organization that has been widely used in almost every scientific discipline that collects data. I will introduce the data clustering problem , main issues: (i) how to define pairwise similarity between objects? and (ii) how to efficiently cluster hundreds of millions of objects?. Usually k-means clustering is used but it has drawbacks when it comes into big data clustering.Here we will see how genetic algorithms may come in handy for big data clustering.
  • Initialization:For evaluating fitness you can use Davies-Bouldin index (https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index), after determining the fitness value of each possible solutions, all viable solutions are included in a set, each set contains a set of population with best fitness.
  • Selection:For selection you can use Tournament selection procedure (https://en.wikipedia.org/wiki/Tournament_selection). 
  • Genetic operators:  Genetic crossover  can be accomplished with concept of RB- Tree ( https://en.wikipedia.org/wiki/Red%E2%80%93black_tree ),is heavily used in algorithm design to make heavy optimization and create efficient bucket for data storage,now by cross over we will get hidden relationship between the disjoint sets henceforth we accomplished crossover by crossover of disjoint sets. 
  • Termination: A new population as generated replaces the older population. This population would again form a newer population using mating and selection procedure. This whole procedure would be repeated again and again until the termination condition is met. 
*The AnyScale Learning For All (ALFA) Group at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) aims to solve the most challenging big-data problems — questions that go beyond the scope of typical analytics. ALFA applies the latest machine learning and evolutionary computing concepts to target very complex problems that involve high dimensionality. http://news.mit.edu/2015/una-may-oreilly-evolutionary-approaches-big-data-problems-0114 *