The conventional solution to big data analysis separated the analysis layer from the storage layer. The first step in analyzing big data was to know exactly what one was looking for; it was almost like having to know the answer before asking the question. One would make certain assumptions about the data, build the ETL (extract, transform, load) logic to populate an analytics database (typically a relational database), and then build an interactive querying front end of dashboards and reports to help validate those assumptions. This methodology did yield results, but it made it difficult to find something one did not know existed in the data, and the richness of the original information was lost in the ETL process. Further, as the volume of data increased, the process became difficult to scale: to accommodate new data, one had to "archive" older data, in effect consigning it to dungeons too deep to pull out for analysis.
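The schema-first constraint described above can be sketched in a few lines. This is a minimal, hypothetical illustration using SQLite: the table names and fields are invented, but the point is general: the schema must exist before any data is loaded, and an unforeseen field cannot be stored until the schema is explicitly altered.

```python
import sqlite3

# Hypothetical sketch of the conventional schema-on-write flow:
# the schema must be declared before any data can be loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# ETL step: transform raw records into the agreed-upon schema, then load.
raw_records = ["north,100.0", "south,250.5"]
rows = [tuple(r.split(",")) for r in raw_records]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# A record with an unforeseen field (say, a product code) cannot be
# loaded until the schema is explicitly extended first.
conn.execute("ALTER TABLE sales ADD COLUMN product TEXT")
conn.execute("INSERT INTO sales VALUES ('east', 75.0, 'widget')")

print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 3
```

Any raw detail that the ETL step drops (extra fields, original formatting) is gone from the analytics database, which is the loss of richness noted above.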
With new big data technologies such as Hadoop and related projects like Hive, Sqoop, Flume, and Pig (whose informal names belie their capabilities), the entire process of analyzing big data has undergone a major transformation.
The fundamental difference in the big data approach is that one does not need to know what one is looking for while building the data repository. If the repository is built on HDFS and the data is processed with MapReduce, analytical views can be built dynamically without using ETL to populate an intermediate analytics database. Most analytical queries can be implemented directly in MapReduce and run directly against the HDFS repository. For those queries that do require a relational database, such a database can be populated dynamically by running ETL over the HDFS repository when one wants to analyze the data, rather than when the data is collected.
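The MapReduce pattern at the heart of this approach can be sketched in plain Python. This is a toy illustration, not Hadoop: the record layout and values are invented, but it shows how an analytical query (total per region) runs directly over raw records, with no intermediate analytics database.

```python
from collections import defaultdict

# Raw records as they might sit in the file store (hypothetical layout:
# date, region, amount, whitespace-separated).
raw_lines = [
    "2013-01-01 north 100",
    "2013-01-01 south 250",
    "2013-01-02 north 75",
]

def map_phase(line):
    # Map: parse a raw record and emit (key, value) pairs.
    date, region, amount = line.split()
    yield region, int(amount)

def reduce_phase(pairs):
    # Reduce: aggregate all values sharing a key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = (kv for line in raw_lines for kv in map_phase(line))
print(reduce_phase(pairs))  # {'north': 175, 'south': 250}
```

In a real Hadoop cluster the map and reduce functions run in parallel across the nodes holding the HDFS blocks, but the programming model is the same: the query is brought to the raw data, not the data to a prebuilt database.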
The following table neatly summarizes the difference between a conventional analytics approach and the big data approach:
| Analytics using an RDBMS | Big data analytics |
| --- | --- |
| Schema must be created before data is loaded | Data is simply copied to the file store; no special transformation is required |
| An explicit load operation transforms the data into the database's internal structure | A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns |
| New columns must be added explicitly before data for those columns can be loaded | New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it |
| Reads are fast | Loads are fast |
| Standards / governance | Flexibility / agility |
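The SerDe rows in the table can be illustrated with a small schema-on-read sketch. The delimiter and field names here are hypothetical; the point is that the raw lines never change, and updating the deserializer makes a new column appear retroactively across all existing data.

```python
# Raw lines stay unchanged in the file store; a deserializer ("SerDe")
# extracts columns only at query time (hypothetical pipe-delimited layout).
raw_lines = [
    "north|100",          # older records: region and amount only
    "south|250|widget",   # newer records carry an extra product field
]

def serde_v1(line):
    # Original SerDe: knows only two columns.
    fields = line.split("|")
    return {"region": fields[0], "amount": int(fields[1])}

def serde_v2(line):
    # Updated SerDe: the new column appears retroactively for all data;
    # older records simply yield None for the missing field.
    fields = line.split("|")
    return {
        "region": fields[0],
        "amount": int(fields[1]),
        "product": fields[2] if len(fields) > 2 else None,
    }

print([serde_v2(line) for line in raw_lines])
```

No reload or ALTER TABLE is needed: the "schema" lives in the reader, so swapping `serde_v1` for `serde_v2` is the entire migration.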