Enterprises generate different kinds of log files that record events in various fields of activity:
- Web site traffic logs
- Email logs
- Employee attendance logs
- Sales call center logs
- Customer support logs
- System events on servers and networks
- Instrumentation logs
Analyzing such log files is challenging because the volume of data generated is very large and the structure of the data varies widely. Conventional database solutions are not suitable for analyzing such log files because they cannot handle such large volumes of data efficiently. This is where big data technologies come to the rescue.
To illustrate how big data technologies can help analyze log files efficiently, consider the case of analyzing web traffic logs. These log files contain a wealth of information useful to the business. Analyzing them yields insights into website traffic patterns, user activity, user interests, and more.
Examples of analytical data from web traffic logs:
- Which parts of the website users are interested in
- Users’ activity
- Language, country / territory, city
- Browser, OS, service provider
- New visitors / returning visitors
- Most / least viewed page
- Traffic sources (search / referral / direct)
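Extracting fields such as the requested page, referrer, and user agent from raw log lines is the first step toward the metrics above. The following is a minimal sketch in Python, assuming the logs are in Apache's Combined Log Format (a common but not universal layout; the sample line and field names are illustrative):

```python
import re

# Regex for Apache Combined Log Format (an assumption about the log layout):
# ip ident user [timestamp] "METHOD path PROTO" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields from one log line, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

# Illustrative sample line (not real traffic data)
sample = ('203.0.113.9 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /products/widget HTTP/1.1" 200 2326 '
          '"https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0)"')

record = parse_line(sample)
print(record["path"])      # /products/widget
print(record["referrer"])  # https://www.google.com/
```

From the `referrer` field one can classify traffic sources (search, referral, or direct), and the `agent` field can be further parsed to identify browser and OS.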
From the above information, one can evaluate:
- What potential customers like
- Which part of website needs to be improved
- From which sources visitors are coming
- Country / territory information of visitors
This log data may range from terabytes to zettabytes, so a big data solution is needed to analyze it. The problem can be solved using Hadoop, which is designed to process large volumes of data efficiently. Because log files are generated by several sources, the data must first be collected into a place where Hadoop can analyze it; this data collection can be done using Flume. After collecting the data, we can run a MapReduce job to process and analyze the log files.
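The MapReduce step can be sketched as a pair of mapper and reducer functions, here in the style of Hadoop Streaming (which lets such Python scripts run on a Hadoop cluster). This simulates the shuffle phase in-process with `sorted()` and `groupby`; the path extraction is deliberately simplified, and the sample log lines are illustrative:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit (page, 1) for each requested path in the access log."""
    for line in lines:
        parts = line.split('"')
        if len(parts) > 1:
            request = parts[1].split()  # e.g. ['GET', '/home', 'HTTP/1.1']
            if len(request) >= 2:
                yield (request[1], 1)

def reducer(pairs):
    """Reduce phase: sum the counts per page. Hadoop delivers keys grouped
    and sorted; sorted() + groupby simulates that behavior here."""
    for page, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (page, sum(count for _, count in group))

# Illustrative input lines (not real traffic data)
logs = [
    '1.2.3.4 - - [t] "GET /home HTTP/1.1" 200 100 "-" "UA"',
    '1.2.3.5 - - [t] "GET /about HTTP/1.1" 200 100 "-" "UA"',
    '1.2.3.6 - - [t] "GET /home HTTP/1.1" 200 100 "-" "UA"',
]
print(dict(reducer(mapper(logs))))  # {'/about': 1, '/home': 2}
```

Sorting the resulting counts gives the most and least viewed pages directly; the same map/reduce pattern applies to counting visitors by country, browser, or traffic source.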