By most accounts, the current data explosion is a good thing. From monitoring traffic congestion to forecasting epidemics, we’re told there are wondrous things that can be done with all the data becoming available to us. But the sheer volume of it – from log files to stock charts to customer profiles – means that, despite the host of new products cropping up to help us manage it, the information now being hoarded and stored is bound to lead to problems.
A forthcoming book by Nassim Taleb, author of The Black Swan, deals with this very problem. In Antifragile, Taleb argues that data is toxic, and not just in large quantities:
“The more frequently you look at data, the more noise you are disproportionally likely to get (rather than the valuable part called the signal)", he writes, "hence the higher the noise to signal ratio".
"Say you look at information on a yearly basis, for stock prices or the fertilizer sales of your father-in-law’s factory, or inflation numbers in Vladivostok. Assume further that for what you are observing, at the yearly frequency the ratio of signal to noise is about one to one (say half noise, half signal) — it means that about half of changes are real improvements or degradations, the other half comes from randomness. This ratio is what you get from yearly observations.
But if you look at the very same data on a daily basis, the composition would change to 95 per cent noise, 5 per cent signal. And if you observe data on an hourly basis, as people immersed in the news and markets price variations do, the split becomes 99.5 per cent noise to .5 per cent signal. That is two hundred times more noise than signal ...
...The best solution is to only look at very large changes in data or conditions, never small ones".
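Taleb’s ratios fall out of standard random-walk scaling: over an observation window of length t, a steady trend (the signal) grows linearly with t, while random fluctuation (the noise) grows only with the square root of t, so the shorter the window, the more the noise dominates. A minimal Python sketch of that scaling, calibrated – as in the quote – so that the yearly ratio is one to one; the 252 trading days and eight trading hours per day are illustrative assumptions, not figures from the book:

```python
import math

def signal_to_noise(window_years, mu=1.0, sigma=1.0):
    """Signal-to-noise ratio for one observation over a window.

    Toy model: signal (trend) = mu * t grows linearly with time,
    while noise (randomness) = sigma * sqrt(t) follows random-walk
    scaling, so S/N shrinks as the window shrinks.
    """
    return (mu * window_years) / (sigma * math.sqrt(window_years))

yearly = signal_to_noise(1.0)              # calibrated to 1:1
daily = signal_to_noise(1.0 / 252)         # ~252 trading days a year
hourly = signal_to_noise(1.0 / (252 * 8))  # ~8 trading hours a day

print(f"yearly S/N: {yearly:.3f}")   # 1.000
print(f"daily  S/N: {daily:.3f}")    # ~0.063
print(f"hourly S/N: {hourly:.3f}")   # ~0.022
```

On this toy calibration a daily observation is already about 94 per cent noise and an hourly one about 98 per cent – the same order of magnitude as the 95/5 and 99.5/0.5 splits Taleb describes.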
Additionally, most businesses feel far from confident in their ability to handle all this data. According to a survey last year by The Economist (PDF), the volume of corporate data is growing by up to 60 per cent each year, but a mere 17 per cent report using more than 75 per cent of their data, suggesting that most companies collect a lot of data but have no idea what to do with most of it.
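To put that growth rate in perspective, a quick compounding check – assuming, purely for illustration, that the 60 per cent figure holds steady – shows how fast the volume snowballs:

```python
# Data volume compounding at 60 per cent a year.
growth = 1.60
for years in (1, 3, 5):
    factor = growth ** years
    print(f"after {years} year(s): {factor:.1f}x the original volume")
```

At that rate, a company’s data volume roughly decuples every five years.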
However, despite these assertions, there is undoubted value in Big Data; the problem, it seems, lies in our ability to manage it and to know when we have enough.
According to The Register, a former senior IT executive at one of Silicon Valley's largest web companies acknowledged that his company stores every log file – and does absolutely nothing with them. It has been suggested that such data should be deleted, to keep it manageable and to avoid security breaches, or that we should work out which data is likely to be of use and focus on that. The danger, then, is that things that could be useful in the future may be deleted or simply never recorded. Data analysis company Splunk saw its shares soar last month on the premise that machine-generated log files are a previously overlooked gold mine of business insight.
For security professionals, better metrics for analysing the data would provide a solution. A recent survey by Dimensional Research found that almost three-quarters of security professionals believe the volume of data they have to deal with makes it hard to filter, analyse and assess changes in risk. Some 46 per cent blamed a lack of adequate metrics to make the information actionable; 41 per cent said they did not have a real-time view of their security standing; and 81 per cent believed security tools with improved metrics would increase overall security effectiveness.
Mike Lloyd is CTO of RedSeal Networks, a US security solutions provider. He says it is not just security professionals who are experiencing this problem; it is shared by government and industry:
“We are drowning in details; we have mountains of facts but very little useful information,” he told Infosecurity magazine. “People agree that the right way to deal with this is using metrics. What the surveys have in common, in the public space and the commercial space, is that there is a distinct hunger out there; even after years of talking about this issue, people are still struggling to find the right metrics solution.”
The problem, moreover, is not so much big data as fast data, as companies of all sizes wrestle with making sense of diverse structured, semi-structured and unstructured data sets to help them make quick decisions.
Dell, however, believes it may have a solution. Its Quickstart Data Warehouse Appliance is based on the new PowerEdge 12G servers and Microsoft's SQL Server 2012, and the company says it will be the first data warehouse appliance running SQL Server 2012 (codenamed 'Denali'). Currently in beta testing, it is due to launch in the second quarter of this year.
Meanwhile, IBM is betting heavily on business analytics as a key driver over the next five years. In July 2009 the company launched its Smart Analytics Systems: clusters of server nodes shipped with pre-configured operating systems and software. Some ran Cognos modules and others IBM's InfoSphere Warehouse variant of its DB2 database, merging data warehousing and analytics in one cluster.
IBM gradually fleshed out the boxes and created an entry-level machine, the Smart Analytics System 5710. Some mid-range companies have quite large data-crunching jobs, and for these customers IBM created the Smart Analytics System 7700, which uses servers based on Power7 RISC processors, similar to the nodes in IBM’s Watson machine.
In Europe, a fairly large company might only need an analytics system that would qualify as a mid-range box in the US, which is why Netezza, IBM's analytics arm, has created a cut-down version of its data warehousing appliance, called Skimmer, sold as the Netezza 100 series.
The future of analytics in the mid-range is not clear, so IBM could be pointing the way. In February it completed its $440m acquisition of retail analytics software provider DemandTec, which offered its software on private slices of its own cloud.
Google has also been quick to move into this space, launching BigQuery, which is currently in beta testing and available on an invitation-only basis.
Google says the BigQuery engine will be able to scan billions of rows of data in seconds using an SQL-like query language. For those who require a front end, French start-up We Are Cloud has created Bime, a business analytics tool that runs on Amazon Web Services' compute cloud and stores data in Google BigQuery.
The company has 200 customers, most of them outside France, and the service is available in French, English, Dutch and Chinese, with other languages in the works. According to co-founder Rachel Delacour, it is designed for sharing data and query results through dashboards and other graphical representations.
“Traditional on-premise business intelligence tools are not inherently collaborative or cost-effective,” she says. “Cloud solutions are, even though they are not necessarily good at delivering performance on all data sets.”
It could be seen as a slowly inflating life raft for businesses finding themselves in the deep end – and it certainly beats trying to do the stats in Excel, which, it seems, is still surprisingly common.