Today’s data traffic on the internet has reached a staggering level. According to researchers at the University of Leicester, printing out the entire internet would require no fewer than 136 billion sheets of A4 paper, and counting. The world is constantly changing, and so are companies. Regular standalone BI solutions are becoming too simple and outdated to cope with today’s business needs.
The time for Big Data has come. I have often heard the misconception that Big Data is defined by size alone. Nothing could be further from the truth. It is about volume, variety and velocity. Streaming data flows constantly from all kinds of devices in all kinds of formats. Adapt or fall behind. But where do we draw the line between data and information? How well can we read and filter the information and turn it into valuable knowledge?
Knowledge is beautiful. I recently attended the Hadoop conference in Dublin. Among the dozens of speakers, there was one I particularly enjoyed. A bold British journalist spoke to us about data, information and the power of visualization. We see numbers and graphics, but how relevant are they, and in relation to what? How much is a billion, and how do you represent it? Connecting all that information and making sense of it is even more difficult.
A normal day on the internet looks like this: 294 billion emails sent, 2 million blog posts published, 172 million different people visiting Facebook, 4.7 billion minutes spent on Facebook, and Apple selling more iPhones than there are babies born. Connecting the dots and representing these facts well is a big challenge. It requires a beast to deal with it all, or maybe just an elephant!? But you know what they say: a picture is worth a thousand words. Let’s have a look at our daily reality. We have all had a rather difficult moment choosing a Friday night movie. So much information, so many comments, ratings, pros and cons, that it becomes a difficult choice. But not anymore. Check out a Hollywood movie classification in a very clear and dynamic representation here. Cool, right?
The little elephant… In 2003 Google released a public paper on the Google File System, followed shortly after by another paper on MapReduce. The ideas were soon implemented in open source and further developed by Apache; today the result is known as Apache Hadoop. Over the past years the project has grown in popularity, and today we have a Hadoop ecosystem consisting of the core components HDFS and MapReduce, with some 64 other projects around them.
HDFS – or Hadoop Distributed File System – is the main storage system used by Hadoop applications, and it is fault tolerant. Data is split into blocks that are distributed across the nodes of the cluster, and each block is replicated to three different nodes (by default) to achieve fault tolerance. HDFS is NOT a database!
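To make the splitting and replication concrete, here is a toy, in-process sketch of the idea. The block size and replication factor mirror common HDFS defaults, but the node names and the round-robin placement policy are made up for illustration (real HDFS placement is rack-aware).

```python
# Toy model of HDFS block splitting and replica placement (illustration only).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default block size
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size_bytes):
    """Return how many blocks a file of the given size occupies."""
    return max(1, -(-file_size_bytes // BLOCK_SIZE))  # ceiling division

def place_replicas(block_id, nodes):
    """Pick REPLICATION distinct nodes for one block.
    Round-robin toy policy; real HDFS is rack-aware."""
    return [nodes[(block_id + i) % len(nodes)] for i in range(REPLICATION)]

nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
file_size = 300 * 1024 * 1024                          # a 300 MB file
blocks = split_into_blocks(file_size)                  # -> 3 blocks
placement = {b: place_replicas(b, nodes) for b in range(blocks)}
```

A 300 MB file yields three blocks, and each block ends up on three distinct nodes, so losing any single node never loses data.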
MapReduce – is a programming paradigm that allows massive horizontal scalability and distributed computing through data sharding, a “share nothing” architecture. Unlike relational data processing languages, in MapReduce the code is brought to the data and not vice versa. In simple words, the computation is scheduled to run on the nodes where the data is already stored, rather than moving the data to the code.
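The classic example of the paradigm is word count. The sketch below runs the three phases – map, shuffle, reduce – as plain Python functions in one process, just to show the data flow; in real Hadoop each phase is distributed across the cluster.

```python
# Minimal in-process sketch of the MapReduce word-count pattern.
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # shuffle/sort: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is flowing"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts -> {"big": 2, "data": 2, "is": 2, "flowing": 1}
```

Because the map and reduce steps only see key–value pairs and never share state, each one can run on whichever node already holds its slice of the input.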
Other projects that play a significant role in the Hadoop ecosystem are: Hive, Pig (what?!!), HBase, Spark, Sqoop (SQL-to-Hadoop), Impala, ZooKeeper, Flume, Oozie and Ambari.
A very simplified version of HDFS and MapReduce looks like this:
Trends and future of Hadoop
1. Availability
2. Data governance
3. Integration with existing platforms
4. Performance
1. Just a few years ago, dealing with HDFS and MapReduce was a rather difficult task. Developers needed strong programming skills, and the interfaces were neither attractive nor user friendly. Manipulating MapReduce jobs or retrieving data from HBase required solid Java programming skills. These days, thanks to projects like Hive, Pig and Spark, which act as platforms on top of HDFS, MapReduce and HBase, it has become a lot easier to interact with them. For example, Hive’s language is called HiveQL and it is SQL-like. A Hive query against HDFS is broken down into MapReduce jobs, but you don’t have to worry about that! Furthermore, thanks to the Apache NiFi project for creating data flows, following your data from beginning to end has become incredibly handy.
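To see what Hive spares you from, here is a sketch of how a SQL-like aggregation – something such as `SELECT product, SUM(amount) FROM sales GROUP BY product` (table, columns and data are hypothetical) – decomposes into the map and reduce steps that Hive would otherwise make you write by hand.

```python
# Sketch: the map/reduce steps behind a SQL-like GROUP BY aggregation.
from collections import defaultdict

# hypothetical "sales" table as (product, amount) rows
sales = [("tv", 300), ("phone", 500), ("tv", 200), ("phone", 100)]

def map_phase(rows):
    # map: emit (group key, value) pairs
    for product, amount in rows:
        yield product, amount

def reduce_phase(pairs):
    # shuffle + reduce: aggregate the values per key
    totals = defaultdict(int)
    for product, amount in pairs:
        totals[product] += amount
    return dict(totals)

result = reduce_phase(map_phase(sales))
# result -> {"tv": 500, "phone": 600}
```

HiveQL generates plans of exactly this shape for you, which is why a one-line query can stand in for a hand-written MapReduce job.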
2. The need for more rigorous data governance arose as soon as companies across the globe began using Hadoop extensively. To deal with this problem, the Apache Ambari and Apache Falcon projects joined forces. What Apache Falcon does is:
• Define data pipelines
• Monitor data pipelines in coordination with Ambari
• Trace pipelines for dependencies, tagging, audits and lineage.
Apache Oozie is Hadoop’s workflow scheduler, but mature Hadoop clusters can run hundreds to thousands of Oozie coordinator jobs. At that level of complexity, it becomes difficult to manage so many data set and process definitions, which leads to some common mistakes: processes might use the wrong copies of data sets, data sets and processes may be duplicated, and it becomes increasingly difficult to track down where a particular data set originated. Falcon addresses these data governance challenges with high-level, reusable “entities” that can be defined once and reused many times. Data management policies are defined in Falcon entities and manifested as Oozie workflows.
3. It is a well-known fact that switching from a well-established technological solution to a new one can be perceived as massively disruptive to a company’s activity. Novelty can sometimes be scary and difficult to accept, but not taking appropriate action in good time will cost you competitive advantage in the near future. Open source technologies like Hadoop are a disruptive way of doing IT, but they definitely do not oppose existing solutions. One of the main focuses at the Hadoop conference in Dublin was ways of consolidating traditional BI platforms with the Hadoop ecosystem. The advantage of doing so is fairly obvious: you get a much clearer and broader picture of your current business. Imagine your sales are dropping and you cannot identify a pattern. Your BI solution tells you they went down, but you may also want to know why. The answer may reside in the unstructured data captured with the help of Hadoop. And the examples could go on forever. I strongly believe we all need to keep an open mind, embrace the opportunities for improvement and make the most of them.
4. Another focus at the conference in Dublin was performance. The Hadoop community is constantly working on improving data processing performance, and projects like Hive attract high interest. The recent Hive release introduced LLAP (Live Long and Process), which provides significantly better execution performance for analytical Hive queries through persistent daemons and in-memory caching. Last but not least, projects like Spark and Impala are also contributing to and focusing on speeding up data processing.
The Apache Hadoop ecosystem is growing and changing rapidly, with new projects coming up almost every month. It is very difficult to keep up with the changes, and I therefore believe it is important to focus on a specific area and specialize in it. I also believe that ten years from now we will have a much more unified, clear and solid Hadoop ecosystem. Let us dare to tame the beast and turn the odds in our favor!
If you have any questions about Hadoop please send an email to: Ionut.Strambeanu@advectas.se






