Tim Estes

Predicting Egypt – Could Cloud Computing Have Helped?

If “Big Data” can be seen as a cloud, it will be necessary for software to help us all make sense of all this data

On January 25, 2011, the overthrow of the Egyptian government began - quietly at first, with non-violent demonstrations of long-held grievances. But by February 11, Mubarak had resigned, the government of a long-stable country had been overturned, and the future of the entire region was altered forever. With the overwhelming amount of web content and social media now available, we are compelled to ask: "Could we have seen this coming?"

Yes, is the answer - provided we're willing to look at information in a new way. To a person analyzing "big data," this means that we must include the broadest range of data available. Increasingly, social media and other web-based content represent a rich source of timely information and would be very useful to a broader understanding of people, events, organizations, etc. However, these new data sources come with complications like volume, validity, and understanding in context.

We decided to explore this issue in more detail. We wanted to find out what connections could be uncovered when the data came from social media and web content.

Where to Find the Information
In today's Internet/web-centric environment, the good news is that the answers to our questions are all around us. The bad news is that the information is random and unstructured; that is, the author's actual meaning is not easy to determine.

For our case study we started with a collection of 16 million blog entries that had been automatically collected by a Web information service over the past couple of years. These blog posts had no unifying topic or authors - simply a large amount of social media data. This social media data presented three key challenges that had to be managed in order to get meaningful results.

  • Context: With data from random blogs with no unifying topics or authors, there were a large number of words that were identical but had different meanings depending on the context.
  • Scale: In 2009 it was estimated that there were 800,000 petabytes of data, and that figure is expected to grow by more than 44 times in the next decade. More important is the fact that over 80% of this data is unstructured, leading to many challenges with automated understanding.
  • Associations: Once the challenges of context and scale are overcome there is a critical need to consider all of the analyzed data to look for associations, code words, linkages, etc. This becomes critical in order to understand the true intended meaning of the author.

Cloud-Scale Analytics Architecture
Analytics in Big Data requires a massively scalable architecture that can handle the quantity, speed and variety of the data sources being analyzed. Increasingly, open source solutions such as Hadoop and Cassandra are forming the foundation of these architectures. The use of Hadoop is growing rapidly in cloud computing architectures because it provides highly scalable computing environments; this computing environment was essential for processing the millions of blog posts into our analytic engine. Cassandra is a scalable data storage architecture used by some of the largest new media environments on the Internet. For our analytics, the scalability provided by Cassandra was necessary to handle the large amount of descriptive information (derived metadata) for this corpus.

One of the challenges of analyzing large amounts of social media is to avoid "pruning" the data, because it's hard to know in advance what might be necessary. To solve this problem, it's necessary to keep everything, as well as all of the descriptive information, links, etc., that help the software fully understand all of the information in context.
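The processing model that Hadoop popularized can be sketched in a few lines. What follows is a minimal, single-process illustration of the map, shuffle and reduce steps in plain Python; it does not use Hadoop's own API, and the sample posts and function names are hypothetical rather than taken from the study described here.

```python
from collections import defaultdict

# Hypothetical sample posts; the 16-million-post corpus is not reproduced here.
POSTS = [
    "Cairo is a powder keg",
    "Prices in Cairo keep rising",
    "A quiet day in Alexandria",
]

def map_phase(post):
    """Emit a (word, 1) pair for every word in one post, as a mapper would."""
    for word in post.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def run_job(posts):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for post in posts for pair in map_phase(post))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = run_job(POSTS)
print(counts["cairo"])  # number of mentions of "cairo" across all posts
```

Because each post is mapped independently and each key is reduced independently, a framework can spread both steps across many machines, which is what makes the approach suitable for corpora of this size.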

What are the challenges of this new data?

  • Many sources for today's information
  • Scalable infrastructure required
  • Automation required to understand entities in context

The Data Was Illuminating
We learned a lot about both the information buried in everyday blogs and the process for extracting useful information from these vast sources. Once the data was analyzed, it was presented in a link analysis map. This display graphically shows the connections among people, places and ideas/concepts. We were fascinated to see the large volume of messages connecting "Cairo" to phrases such as "powder keg" and related problematic phrases. By drilling further into the information, it was possible to see the complete post and the author to better validate the seriousness of the information. Further probing of the analyzed data showed that the social media discussions happening 6-9 months prior to the social unrest were remarkably prescient, not only about the impending nature of the unrest but also about the underlying social reasons for the desire for change.
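One simple way to approximate such a link map is to count how often terms appear together in the same post. The sketch below assumes a hypothetical, hand-picked set of terms and a few invented posts, and it uses naive substring matching; it is an illustration of the idea, not the entity extraction used in the analysis described above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical terms of interest and sample posts, invented for illustration.
TERMS = {"cairo", "powder keg", "unemployment", "alexandria"}
POSTS = [
    "Cairo feels like a powder keg these days",
    "Unemployment in Cairo is a powder keg",
    "Visited Alexandria and then Cairo",
]

def terms_in(post):
    """Return the known terms found in a post, sorted so edge keys are stable."""
    text = post.lower()
    return sorted(term for term in TERMS if term in text)

def build_link_map(posts):
    """Count, for every pair of terms, the number of posts mentioning both."""
    edges = Counter()
    for post in posts:
        for a, b in combinations(terms_in(post), 2):
            edges[(a, b)] += 1
    return edges

link_map = build_link_map(POSTS)

# List the links from strongest to weakest.
for (a, b), weight in link_map.most_common():
    print(f"{a} <-> {b}: {weight}")
```

Each pair is an edge in the graph and its count is the edge weight, so heavily discussed connections such as "cairo" and "powder keg" stand out once the map is drawn.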

Example of Social Media "Link Map"

"Big Data" Is the New "Oil"
The critical nuggets of information are out there in the world's ever-expanding data repositories, but finding them represents one of the biggest challenges and opportunities of the coming decade. This has often been likened to the challenge of discovering oil: there are pools of useful information within the avalanche of data, but finding them is the hard part. Let's face it, there is too much data for an army of humans of any size to be able to read and understand. In fact, the larger the team of people charged with reading, the more complex the task of maintaining some form of central intelligence about what has been discovered. Software running on scalable cloud computing technologies is the only way to consider data of this size and scope.

This analytics software is collectively referred to as "text analytics." Most of the products in this emerging market, however, still have limits in scalability and rely on a form of dictionary (taxonomy or ontology) to understand the meaning of words. These a priori dictionaries don't work well when the context is something new and unknown. Consider the phrase "That concert was sick." Was this a good concert or a bad one? The only way to know is to understand context, and these dictionaries can leave you lacking real understanding. An even more dramatic but common example is when a person is intentionally speaking in code. Did "wedding" mean two people getting married, or was it code for a bomb location? That would be an important nugget to get correct!
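The "sick" example can be made concrete with a small sketch. It compares a fixed dictionary lookup with a scorer that checks the surrounding words first. Both the dictionary and the cue lists are hypothetical and hand-written; the cue lists are only stand-ins for what a learning system would have to infer from data on its own.

```python
# Hypothetical fixed sentiment dictionary of the a priori kind described above.
SENTIMENT = {"sick": -1, "awful": -1, "great": 1}

# Hypothetical context cues; a real system would need to learn these from data.
POSITIVE_CUES = {"concert", "guitar", "solo", "encore"}
NEGATIVE_CUES = {"flu", "fever", "doctor", "hospital"}

def tokenize(sentence):
    """Lowercase the sentence, drop trailing punctuation, split on whitespace."""
    return sentence.lower().strip(".!?").split()

def dictionary_score(sentence):
    """Sum fixed per-word scores, ignoring everything around each word."""
    return sum(SENTIMENT.get(word, 0) for word in tokenize(sentence))

def context_score(sentence):
    """Score "sick" by the company it keeps, falling back to the dictionary."""
    words = set(tokenize(sentence))
    if "sick" in words:
        if words & POSITIVE_CUES:
            return 1
        if words & NEGATIVE_CUES:
            return -1
    return dictionary_score(sentence)

concert = "That concert was sick."
illness = "I stayed home sick with the flu."
print(dictionary_score(concert), context_score(concert))
print(dictionary_score(illness), context_score(illness))
```

The dictionary scores both sentences as negative, while the context-aware scorer separates them. The hand-written cues would of course fail on the next unfamiliar usage, which is the argument for software that learns context rather than looking it up.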

The Way Forward
The future of Big Data analytics is very exciting. As a society, we are standing at a door that is about to open: the information that seems to be drowning us today will be transformed into actionable information guiding our decisions and actions tomorrow. But these next-generation systems will require specific foundations in order to be successful.

Needless to say, the systems must be scalable. Even more important, these systems will require distributed intelligence. With the world's data growing at its current rate, the only solution is to distribute the intelligence and meet the data where it is - on its own terms.

These next-generation solutions must also be automated. The software must be capable of investigating with guidance but smart enough to retrieve the non-obvious bits of information that couldn't have been programmed in advance. We have all heard the phrase "you don't know what you don't know." This next generation of text analytics solutions will confront this all the time. This reduces the importance of human-driven search and raises the importance of data gathering that is guided by humans but largely automated.

Finally, this next generation of software must seek to understand data in context, the way humans do. The reason we learn that "sick" in the example above is "good" is through context. It is impossible to create dictionaries for every possible contextual situation for every word. Therefore software that learns to "read" and "understand" text without the need of taxonomies and dictionaries will have a distinct advantage.

There are a number of companies making exciting progress in this market characterized by an "automated understanding" of Big Data. In our recent analysis of social media, we were able to see the trouble brewing in Egypt months before it was exposed through protests in the street. Unfortunately, this was learned after the fact but, with increasing numbers of exercises like this, we know the potential is there. If "Big Data" can be seen as a cloud, it will be necessary for software to help us all make sense of all this data - lest we drown in it. We are already seeing solutions designed for health care, financial services, fraud detection, legal discovery, and government intelligence. These solutions hold great promise for our collective understanding of the world's vast repositories of data. As a leader of a company working to help evolve this market I definitely see the future as very bright.


Tim Estes serves as Chief Executive Officer at Digital Reasoning, the leader in unstructured data analytics at scale. His academic work at the University of Virginia focused in the areas of Philosophy of Language, Mathematical Logic, Semiotics, and related disciplines. Digital Reasoning was founded with a vision to create software-based intelligent learning systems. This led to the breakthrough technology and intellectual property which is at the core of Digital Reasoning.
