A recent special section of the Wall Street Journal outlines a variety of aspects of “Big Data”. Some of the key observations include:
- We are talking really big … Facebook and Google have 100+ petabytes of data (10^17 bytes)
- Beneficial results of analysis include: cutting medical costs at Harrah’s by encouraging employees to use urgent care facilities instead of emergency rooms; identifying the potential for new turn signals in Ford cars; and isolating breast cancer survival factors.
- Companies need to know why they are collecting data (“haystacks without needles” is a delightful phrase from Darlan Shirazi). There is a lack of coherence: lots of collection, but no normalization or interconnection. And of course there is the inter-company race to be “THE” experts in applying big data (along with elbows in ribs, knives in backs, and data un-shared).
- We don’t have enough professionals trained to analyze the data, or to properly formulate the software that sifts out the desired answers. (US demand could exceed supply by 50% circa 2018.)
I first encountered big data many years ago, when we were providing systems with nine- and seven-track tape drives to oil companies so they could copy their field records over to new media. This effort has continued ever since: to keep the data available, to avoid “bit rot”, and to perform new analysis. It is this last point that is actually key. Old data is not dead data. New techniques for evaluating old data can yield new insights and results. The aforementioned insight on breast cancer was a result of looking at previously analyzed results in a slightly different way. And of course old records can be invaluable. One of the larger cybercrimes publicized was the 2008 theft of oil company data logs, analysis and estimated values that may inform the acquisition by China of rights to a number of potential oil fields.
There are some interesting big data characteristics to consider:
- Some of the databases are coherent, collected by a single entity, albeit for multiple forms of analysis.
- However, cross-connecting data sources may yield very interesting results: consider what might result from cross-indexing genomic data with Facebook entries, or with “PatientsLikeMe” information.
- There is a good and growing business in tools that can provide various “analytics” capabilities. This will be a leading area for application of artificial intelligence techniques as well. (Of course we call things AI until they become mainstream, then give them different names.)
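As a rough illustration of the cross-connecting idea, and of why normalization matters before two sources can be interconnected, here is a minimal Python sketch. Everything in it is hypothetical: the record fields, the sample values, and the helper names `normalize_key` and `cross_index` are made up for illustration and do not come from any of the systems mentioned above.

```python
def normalize_key(raw):
    """Reduce an identifier to a canonical form (trimmed, lowercase)."""
    return raw.strip().lower()


def cross_index(source_a, source_b, key):
    """Join two lists of dict records on a normalized shared key."""
    # Build a lookup from normalized key to the matching records in source_b.
    index = {}
    for record in source_b:
        index.setdefault(normalize_key(record[key]), []).append(record)

    joined = []
    for record in source_a:
        norm = normalize_key(record[key])
        for match in index.get(norm, []):
            merged = {**match, **record}  # fields from source_a win on conflict
            merged[key] = norm            # store the canonical key
            joined.append(merged)
    return joined


# Hypothetical records from two independently collected sources.
clinic = [
    {"id": " P001 ", "outcome": "improved"},
    {"id": "p002", "outcome": "unchanged"},
]
survey = [
    {"id": "p001", "exercise": "daily"},
    {"id": "P003", "exercise": "rarely"},
]

linked = cross_index(clinic, survey, "id")
print(linked)
```

Without the normalization step, " P001 " and "p001" would never match, which is the "lots of collection, but no interconnection" problem in miniature. Real sources rarely share a clean key at all, so this should be read as a sketch of the idea, not a recipe.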
There will be many (many, many) applications here that will improve human health, welfare and the human condition. And I was almost able to complete this without mentioning privacy issues, but I just failed.