Monday, 13 June 2016

NoSQL: Essentials and a Short Tutorial on MongoDB - Part 1

This post provides an overview of NoSQL. In the next post, as a continuation, I will give a short and precise tutorial on MongoDB.

About Databases: Before diving into NoSQL, let me discuss the importance of databases. Databases are fundamental systems found in every organisation; they provide storage, retrieval, manipulation and analysis of data. Databases are vital to a business because they greatly increase the value it can get from its data. In other words, a database helps convert data into meaningful information. In my view, storing data isn't hard to achieve, but deriving value or information from stored data is a tedious process, and optimal solutions are hard to find.

RDBMS: The relational model was proposed in E. F. Codd's 1970 paper "A Relational Model of Data for Large Shared Data Banks", which made data modeling and application programming much easier. The relational model is well suited to client-server programming.

NoSQL: NoSQL encompasses modern techniques such as simple design and the ability to store enormous amounts of data. It has become a popular architecture for handling large data volumes, because NoSQL databases can be more efficient with regard to the processing power required to handle large files. Relational databases generally do not perform well unless they are given structured data.

NoSQL databases are new enough that many database engineers will have some difficulty working with them. The emergence of NoSQL databases such as MongoDB and Neo4j should make things easier and give developers agility and flexibility. In an RDBMS, the developer needs to design the data schema from the outset, and SQL queries are then run against the database; if the application or database later undergoes changes such as updates, the developer must be brought in again. Most NoSQL databases are open source and come with built-in communities. According to Daniel Doubrovkine, a head of engineering, NoSQL databases like MongoDB are simple to start with and get more complex over time.

The syntax of NoSQL databases differs from SQL, and new users need some training. However, NoSQL databases have plenty of online forums and documentation. Migrating to NoSQL can be done by running a series of SELECT * FROM statements against the relational database and then loading the data into your NoSQL document (or key/value, column, graph) model using the language of your choice. You can then rewrite your queries as NoSQL statements using calls such as insert() and find().
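The migration idea above can be sketched in a few lines of Python. This is only an illustration under some loud assumptions: the `users` table and its rows are made up, SQLite stands in for the relational database, and the `Collection` class is a toy in-memory stand-in that I wrote to mimic the insert()/find() style. It is not a real MongoDB client.

```python
import sqlite3

# Stand-in relational source: an in-memory SQLite table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be converted to dicts by column name
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Chennai"), (2, "Ravi", "Pune"), (3, "Meena", "Chennai")],
)


class Collection:
    """Toy in-memory document collection; NOT a real MongoDB client."""

    def __init__(self):
        self.docs = []

    def insert(self, doc):
        # Store a copy so later changes to the caller's dict do not leak in.
        self.docs.append(dict(doc))

    def find(self, query=None):
        # Return documents whose fields equal every key/value pair in `query`.
        query = query or {}
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]


def migrate(connection, table, collection):
    """Copy every row of `table` into `collection`, one document per row."""
    # The table name is interpolated only because it is a trusted constant here.
    for row in connection.execute(f"SELECT * FROM {table}"):
        collection.insert(dict(row))
    return len(collection.docs)


users = Collection()
migrated = migrate(conn, "users", users)
chennai_users = users.find({"city": "Chennai"})
```

With a real document database you would replace `Collection` with the driver's own collection object, but the shape of the loop (select rows, turn each into a document, insert it) stays the same.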
Personal user information, social graphs, geolocation data, user-generated content and machine logging data are just a few examples where the volume of data has been increasing exponentially, and SQL databases are often a poor fit for these types of data. The CAP theorem is the guiding principle when you talk about NoSQL databases, or indeed when designing any distributed system.
The CAP theorem names three basic requirements: Consistency, Availability and Partition Tolerance.

Practically speaking, a distributed system cannot guarantee all three at once; when a network partition occurs, it has to give up either consistency or availability.
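To make the trade-off concrete, here is a toy simulation of two replicas that get cut off from each other. Everything in it (the `TwoNodeStore` class, the "CP" and "AP" mode names as constructor arguments, the `partitioned` flag) is my own invention for illustration; real databases handle partitions with far more machinery.

```python
class Replica:
    """One copy of a single stored value."""

    def __init__(self, name):
        self.name = name
        self.value = None


class TwoNodeStore:
    """Toy two-replica register used to illustrate the CAP trade-off."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a, self.b = Replica("a"), Replica("b")
        self.partitioned = False  # True means a and b cannot talk

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent: refuse the write, giving up availability.
                return False
            # Stay available: accept locally, so the replicas may now disagree.
            replica.value = value
            return True
        # No partition: update both replicas together.
        replica.value = value
        other.value = value
        return True


cp = TwoNodeStore("CP")
cp.write(cp.a, "v1")
cp.partitioned = True
cp_accepted = cp.write(cp.a, "v2")
cp_consistent = cp.a.value == cp.b.value

ap = TwoNodeStore("AP")
ap.write(ap.a, "v1")
ap.partitioned = True
ap_accepted = ap.write(ap.a, "v2")
ap_consistent = ap.a.value == ap.b.value
```

During the partition the CP store rejects the write and both replicas still agree, while the AP store accepts it and the replicas diverge until they can be reconciled.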

NoSQL Categories:
Each of these categories has its own attributes and limitations.

Key-Value: Key-value stores are designed to handle large amounts of data. Data is stored in the form of hash tables, and values can be strings, JSON, etc. For example, "ANIMAL" is the key for the value "Lion". Key-value stores typically favour the availability and partition tolerance parts of the CAP theorem.
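As a rough sketch of the idea, a key-value store can be imitated with a Python dict (which is itself a hash table) holding JSON-encoded values. The helper names `put` and `get` and the `user:42` key are my own choices for illustration, not any product's API.

```python
import json

# The whole "database" is one hash table mapping string keys to JSON text.
store = {}


def put(key, value):
    """Serialise the value to JSON and store it under the key."""
    store[key] = json.dumps(value)


def get(key, default=None):
    """Look the key up and decode its JSON value, or return the default."""
    raw = store.get(key)
    return default if raw is None else json.loads(raw)


put("ANIMAL", "Lion")
put("user:42", {"name": "Asha", "langs": ["ta", "en"]})

animal = get("ANIMAL")
user = get("user:42")
missing = get("no-such-key")
```

Note that the store only ever looks things up by key; it knows nothing about what is inside the value.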

Column-oriented databases: store data in columns, and every column is treated individually. Examples: SimpleDB, Cassandra, Bigtable.
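The difference between a row layout and a column layout can be shown with plain Python structures. This is a simplified sketch with made-up records; real column stores add partitioning, compression and on-disk formats that are not shown here.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Chennai", "age": 31},
    {"id": 2, "city": "Pune", "age": 27},
    {"id": 3, "city": "Chennai", "age": 45},
]


def to_columns(records):
    """Regroup the records so that each column is stored as its own list."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column only needs to read that column's values.
average_age = sum(columns["age"]) / len(columns["age"])
```

Because each column lives on its own, a query such as an average over `age` never has to touch `id` or `city`.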

Document-oriented databases: store data as a collection of documents. Examples: MongoDB, CouchDB.

Graph databases: store data in a graph, where each node represents an entity and each edge represents a relationship between two entities (nodes). Examples: OrientDB, Neo4j.
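A small sketch of the graph model, using an adjacency list with labelled edges and a breadth-first traversal. The people, the `FOLLOWS` label and the `reachable_within` helper are all hypothetical; a real graph database would give you a query language for this instead.

```python
from collections import deque

# Each node maps to a list of (relationship label, target node) edges.
edges = {
    "alice": [("FOLLOWS", "bob"), ("FOLLOWS", "carol")],
    "bob": [("FOLLOWS", "dave")],
    "carol": [("FOLLOWS", "dave")],
    "dave": [],
}


def reachable_within(start, max_hops):
    """Breadth-first search: nodes reachable from start in at most max_hops edges."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand nodes that are already at the hop limit
        for _label, neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                queue.append((neighbour, hops + 1))
    return found


one_hop = reachable_within("alice", 1)
two_hops = reachable_within("alice", 2)
```

Questions like "who can alice reach in two hops" are natural in this model, whereas in a relational schema they would usually need repeated self-joins.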

Saturday, 4 June 2016


About HDFS:
HDFS is a distributed, Java-based file system that provides scalable and reliable storage for large volumes of data, and it was designed to span large clusters of commodity servers. Together with YARN, HDFS forms the data management layer of Hadoop. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every scale. HDFS has the following features, which ensure high availability, fault tolerance, scalability and efficient storage of data.

Rack awareness: Takes a node's physical location into account when scheduling tasks.

Standby NameNode: The main component for providing redundancy and high availability.

Less data movement: Tasks are processed on the physical node where the data resides, so data movement is reduced and aggregate bandwidth increases.
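Rack awareness and reduced data movement come together when deciding where a task should run. The sketch below is my own simplification, not the actual Hadoop scheduler: the `TOPOLOGY` map, the node and rack names, and the `pick_node` function are all hypothetical. It prefers a node that already holds the data, then a node on the same rack, and only then any other free node.

```python
# Hypothetical cluster topology: node name -> rack name.
TOPOLOGY = {
    "node1": "rack1", "node2": "rack1",
    "node3": "rack2", "node4": "rack2",
}


def pick_node(replica_nodes, free_nodes, topology=TOPOLOGY):
    """Return (node, locality) for a task that reads a block stored on replica_nodes."""
    replica_nodes = set(replica_nodes)
    free = sorted(free_nodes)  # sorted only to make the choice deterministic

    # 1. Node-local: a free node already holds the data, so nothing moves.
    for node in free:
        if node in replica_nodes:
            return node, "node-local"

    # 2. Rack-local: a free node shares a rack with a replica.
    replica_racks = {topology[n] for n in replica_nodes}
    for node in free:
        if topology[node] in replica_racks:
            return node, "rack-local"

    # 3. Off-rack: any free node; the data has to cross racks.
    if free:
        return free[0], "off-rack"
    return None, "none"


local_choice = pick_node(["node1"], ["node1", "node3"])
rack_choice = pick_node(["node1"], ["node2", "node3"])
remote_choice = pick_node(["node1"], ["node3", "node4"])
```

The further down the list the function has to go, the more data has to travel over the network before the task can start.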

Features of HDFS


  • HDFS version 2.3.0 provides centralized cache management, heterogeneous storage, an OpenStack Swift implementation and HTTP support.
  • Version 2.4.0 provides metadata compatibility and rolling upgrades (which allow individual HDFS daemons to be upgraded).
  • Version 2.5.0 provides incremental data copy and extended attributes for accessing metadata.