Getting Big Data to Work for You
More than three-quarters of companies are investing or planning to invest in big data in the next two years, according to a recent survey of IT and business leaders by Gartner, Inc. Yet a Capgemini survey across multiple companies revealed that only 13 percent of respondents believed their Big Data implementations were in full-scale production, with predictive insights extensively integrated into business operations. Another 35 percent believed their implementations were in “partial” production, and the remaining 53 percent confessed that they were either at the proof-of-concept (POC) stage (doing one or thinking about one) or awaiting budget for a POC.
So, where’s the gap?
In my experience from multiple Big Data implementations, and from building componentized platforms for them, the following are the key challenges impeding faster adoption of Big Data, along with suggested ways to overcome them.
Data warehousing and ETL skills, newer NoSQL skills such as MongoDB and Cassandra, and infrastructure skills such as Amazon AWS and Apache Hadoop are reasonably available. What is scarce is the strong “Data Science” skill set required for one of the high-ticket promises of Big Data: predictive modeling and machine learning. Data Science is ideally the intersection of three pillars: a strong mathematical and statistical background, hacking (aka programming) skills, and deep domain expertise.
Integrating multi-channel, varied data sources at modern volumes
Another challenge is handling volume, velocity and variety: data ranges from structured sources such as RDBMSs and data warehouses to unstructured data such as social media, clickstreams and sensor feeds. The concerns include ETL, homogenization, clean-up, enrichment and semantic association.
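To make “homogenization” concrete, here is a minimal sketch of mapping records from two differently shaped sources onto one common schema. The field names (`cust_id`, `order_ts`, `uid`, `action`, `epoch_ms`) and the target schema are hypothetical, chosen only for illustration; a real implementation would depend on the actual sources.

```python
from datetime import datetime, timezone


def from_rdbms(row: dict) -> dict:
    """Map a (hypothetical) warehouse order row onto the common schema."""
    return {
        "customer_id": str(row["cust_id"]),
        "event": "purchase",
        # Assumption: the warehouse stores naive ISO timestamps in UTC.
        "timestamp": datetime.fromisoformat(row["order_ts"]).replace(tzinfo=timezone.utc),
        "source": "rdbms",
    }


def from_clickstream(event: dict) -> dict:
    """Map a (hypothetical) clickstream event onto the common schema."""
    return {
        "customer_id": str(event["uid"]),
        # Clean-up step: normalize whitespace and case of the action name.
        "event": event["action"].strip().lower(),
        "timestamp": datetime.fromtimestamp(event["epoch_ms"] / 1000, tz=timezone.utc),
        "source": "clickstream",
    }


def homogenize(rdbms_rows, click_events):
    """Return all records in the common schema, ordered by time."""
    records = [from_rdbms(r) for r in rdbms_rows]
    records += [from_clickstream(e) for e in click_events]
    return sorted(records, key=lambda r: r["timestamp"])
```

Once every source is expressed in the same shape, enrichment and semantic association can be written once against the common schema rather than once per source.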
Budget allocation for “Big Data projects” is one of the biggest challenges in enterprises today, as teams struggle to justify a hard ROI on the investment. While this should ease as the field moves up the standard maturity continuum, a few ideas that can work well are starting small (rather than being too ambitious) to showcase quick results, and using innovative “pay for results” business models that play well with the psyche of people on the edge of a decision about investment in Big
Getting the right data and infrastructure architecture for performance and scalability
This is an obvious technical challenge that should be easily addressable but becomes tremendously complex because of the many variables involved. Recent investments in legacy infrastructure severely restrict the ability to design an ideal “future-ready” architecture. The endeavor is to arrive at the most “optimal” architecture, one which allows for as much re-use of legacy infrastructure
The Big Data tech stack has been evolving far too fast, with technologies climbing the hype curve and then suddenly losing favor to a newer alternative (e.g. Storm vs. Spark). Until the stack stabilizes, the technology architecture roadmap should be crafted very thoughtfully, based on a detailed “assessment” of what is trending in the market and, more importantly, of the internal stack and future needs.
Turn-around time from Data acquisition to insights
One of the most common problems we’ve encountered is the high overall turn-around time from data acquisition through clean-up and modeling to deploying models at scale in production. In many cases it is long enough that the data is no longer very relevant. A typical flow looks like this:
• Data ingestion (from multiple sources)
• Data clean-up and transformations/ enrichment
• Iterative model development (Data Science)
• Deploying models in production
In more enterprises than not, this is a sequential process, and the key bottlenecks are data clean-up, exporting and importing data in and out of statistical modeling tools like R, and, eventually, coding the models into production applications. Three to four weeks to get data ready for statistical analysis, and a similar time-frame for manual analysis and iterative model building by the data scientists (for well-formulated problems), is very typical. Again, more often than not, we’ve seen the models and recommendations from the data science team being hard-coded into applications, which takes another few weeks of development, QA and release management. By that time, in many use cases, the data may no longer be very relevant!
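One way to shorten the last step is to hand the model over as a data artifact that the production application loads, instead of re-coding it by hand. The sketch below assumes a simple logistic regression whose coefficients are exported as JSON; the artifact layout and the function names `export_model` and `load_scorer` are hypothetical, not a description of any particular tool.

```python
import json
import math


def export_model(coefficients: dict, intercept: float) -> str:
    """Data-science side: serialize the fitted parameters as a JSON artifact."""
    return json.dumps({
        "type": "logistic_regression",
        "intercept": intercept,
        "coefficients": coefficients,
    })


def load_scorer(artifact: str):
    """Production side: build a scoring function from the artifact."""
    spec = json.loads(artifact)
    if spec["type"] != "logistic_regression":
        raise ValueError("unsupported model type: " + spec["type"])

    def score(features: dict) -> float:
        # Linear combination of the features, then the logistic function.
        # Features missing from the input are treated as 0.0 (an assumption).
        z = spec["intercept"] + sum(
            weight * features.get(name, 0.0)
            for name, weight in spec["coefficients"].items()
        )
        return 1.0 / (1.0 + math.exp(-z))

    return score
```

With this arrangement, a retrained model means publishing a new artifact rather than starting another development, QA and release cycle for the scoring logic itself.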
Data quality is one of the most underestimated issues in Big Data implementations, especially with respect to the schedule and cost implications it can have. Our approach has been a pragmatic one: build realistic schedules and costs for handling data quality issues into the plan.
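Realistic planning is easier when the scale of the problem is measured early. As a rough sketch, assuming records arrive as dictionaries, the function below counts missing required values and duplicate keys in a sample; the function name and the choice of checks are illustrative assumptions, not a complete quality framework.

```python
from collections import Counter


def profile_quality(records, required_fields, key_field):
    """Summarize missing required values and duplicate keys in a sample."""
    missing = Counter()
    for rec in records:
        for field in required_fields:
            # Treat absent, None and empty-string values as missing.
            if rec.get(field) in (None, ""):
                missing[field] += 1

    key_counts = Counter(rec.get(key_field) for rec in records)
    # Every occurrence of a key beyond the first counts as a duplicate.
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)

    return {
        "rows": len(records),
        "missing": dict(missing),
        "duplicate_keys": duplicates,
    }
```

Running such a profile on a sample of each source before committing to dates gives a first estimate of how much clean-up effort the schedule needs to absorb.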
Data Governance and Security
Historically, Data Governance has followed what would be called a “waterfall” approach in software development parlance: data was governed as it was discovered or brought into the enterprise, which typically meant that enterprises would integrate all data and govern it to the highest required standard. This is going to be slow, especially in the “agile” world of Big Data. “Agile” governance would entail discovering, understanding and “profiling” the data, and then applying appropriate controls without inhibiting speed and flexibility. A comprehensive yet “agile” data governance mechanism would not only ensure that enterprises protect their own and their customers’ information assets, but also leave room to deploy innovative Big Data approaches and technologies.
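As a small illustration of “profile first, then apply appropriate controls”, the sketch below tags a column as sensitive when most of its values look like e-mail addresses and masks only those columns. The e-mail pattern, the 0.8 threshold, the tag names and the hash-based masking are all assumptions made for the example; real governance rules would be set by the enterprise's own policies.

```python
import hashlib
import re

# Deliberately loose pattern, good enough for profiling purposes only.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_columns(records, threshold=0.8):
    """Tag each column 'sensitive' or 'open' based on its observed values."""
    columns = {name for rec in records for name in rec}
    tags = {}
    for col in columns:
        values = [str(rec[col]) for rec in records if rec.get(col) not in (None, "")]
        hits = sum(1 for v in values if EMAIL.match(v))
        is_sensitive = bool(values) and hits / len(values) >= threshold
        tags[col] = "sensitive" if is_sensitive else "open"
    return tags


def apply_controls(records, tags):
    """Mask values in sensitive columns; leave open columns untouched."""
    def mask(value):
        # Truncated SHA-256 digest: hides the raw value but keeps it joinable.
        return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]

    return [
        {
            name: mask(value) if tags.get(name) == "sensitive" and value not in (None, "") else value
            for name, value in rec.items()
        }
        for rec in records
    ]
```

Only the columns that profiling flags are subject to the control, so the rest of the data remains freely usable for exploration.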