Finding Patterns in Social Data Is a Big Problem — the Cloud Can Help

The race to find relevance in the reams of social data that flow past us every day is never-ending. To take just two examples, Facebook is busy trying to filter the “likes” of half a billion users and turn the results into a usable search engine, while Twitter (along with a number of third-party services) is attempting to figure out who follows whom so it can make recommendations to them. The biggest problem for both is that analyzing that much unstructured data is extremely difficult. Now researchers say they have something that might help: software that can find complex patterns in billions of bits of data in a matter of seconds — using cloud computing.

The researchers, two from the University of Maryland and one from the University of Calabria in Italy, reported their results in a paper entitled “COSI: Cloud Oriented Subgraph Identification in Massive Social Networks,” which will be delivered at the Advances in Social Network Analysis and Mining conference to be held in Denmark in August. In the paper, they describe how the explosion of data from social networks has caused problems for services that want to find patterns in it:

A technical obstacle to all of these is the difficulty inherent in being able to find all parts of the social network that match a given query network pattern. This essential first step (called the “subgraph matching” step by computer scientists) is…enormously challenging and has long been known to be computationally very difficult, rising exponentially in complexity with the size of the network.

With that in mind, the three researchers developed an algorithm that could take such a problem and split it up into pieces, parcel out those pieces to a cloud computing platform such as Amazon’s EC2, search for patterns and then pull the data back together. According to the paper, the team managed to perform “subgraph pattern-matching queries” on real-world social network data with more than 750 million “edges,” or connections between individuals, in less than a second. More recent results have shown this is possible with databases that have more than a billion edges.
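The paper’s actual algorithm isn’t reproduced in the post, but the general shape it describes (split the query up, fan the pieces out to many machines, then merge the partial matches) can be sketched in a few lines. The Python below is a toy illustration of that split-and-merge pattern using a local process pool as a stand-in for a cloud platform like EC2; the graph, the partitioning scheme and the triangle query are all invented for the example and are not COSI’s implementation.

```python
# Toy illustration of the split / distribute / merge pattern described above.
# It matches one simple query pattern (a directed triangle: a -> b -> c -> a)
# by partitioning the seed nodes across a pool of local worker processes,
# which stand in for cloud instances. Not the COSI implementation.
from collections import defaultdict
from multiprocessing import Pool

# Tiny follower graph: an edge (a, b) means "a follows b".
EDGES = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
         ("bob", "dave"), ("dave", "bob")]

ADJ = defaultdict(set)
for src, dst in EDGES:
    ADJ[src].add(dst)

def triangles_from(seed_nodes):
    """Find directed triangles whose first node falls in this worker's
    partition. Each triangle shows up once per starting node."""
    matches = []
    for a in seed_nodes:
        for b in ADJ[a]:
            for c in ADJ[b]:
                if a in ADJ[c] and len({a, b, c}) == 3:
                    matches.append((a, b, c))
    return matches

if __name__ == "__main__":
    nodes = sorted(ADJ)
    partitions = [nodes[i::4] for i in range(4)]        # split the query work
    with Pool(4) as pool:                               # stand-in "cloud" of 4 workers
        partial = pool.map(triangles_from, partitions)  # search each piece in parallel
        results = [m for chunk in partial for m in chunk]  # pull the data back together
    print(results)
```

On a real social graph, each partition would live on its own instance and the matching would rely on far smarter indexes, but the divide, search and recombine flow is the same one the researchers describe.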

If you’ve ever looked at the Twitter follower graph of a high-profile user — one with hundreds of thousands or even millions of followers — try to imagine the complexity of that network when you move down a level and look at all the users who follow each one of that super-user’s followers, and then move another level down and look at all those followers, and so on. It’s easy to see how the number of relationships can become incredibly large. But the COSI research proves it’s possible to get meaningful data out of that giant pool of information, all thanks to the cloud.

Related content from GigaOM Pro (sub req’d): Big Data Marketplaces Put a Price on Finding Patterns

Post and thumbnail photos courtesy of Flickr user Argonne National Laboratory

Yahoo Launches New Secure, Smarter Hadoop

Yahoo is taking advantage of its annual Hadoop Summit today by rolling out some new features for its distribution of Hadoop, the open-source framework it helped create for storing and processing huge amounts of data. The new features tackle security and workflow management, two areas that Yahoo believes need to improve as Hadoop continues its proliferation among mainstream users. But will Yahoo’s features make it harder for startups like Cloudera and Karmasphere to earn a living?

On the security front, Yahoo has integrated the Kerberos authentication standard into its distribution, resulting in the aptly named Hadoop with Security. This lets users consolidate data from multiple applications onto the same Hadoop cluster, while limiting access to each class of data only to authorized users. This isn’t a mainstream problem yet, but because of its large Hadoop infrastructure — 34,000 servers and 170 petabytes of data spread across the globe — Shelton Shugar, SVP of cloud computing at Yahoo, thinks his company is “probably at the forefront of running into this [problem].” He adds that it will become a big issue for enterprises as their usage expands in scope beyond small development teams and single applications.

The other newly available download is a workflow-management tool called Oozie, which Shugar calls the “elephant tamer.” Oozie should be in high demand from users outside Yahoo because it lets them manage and maintain a variety of Hadoop job types and data dependencies without writing their own applications to do so. Shugar says it’s the de facto tool for extract, transform and load (ETL) processing at Yahoo.

Both of these Yahoo innovations raise the question of how the Hadoop market will play out. Cloudera offers its own commercial Hadoop distribution and support services, and plans to release proprietary products in the near future. Karmasphere offers a desktop-based product for building, deploying and managing Hadoop applications. Other startups, like Datameer, are incorporating Hadoop into the guts of business intelligence products without requiring the user to learn any Hadoop programming.

There currently is a market for value-added commercial Hadoop products (GigaOM Pro sub req’d), but one wonders whether first-time users are more likely to pay for Hadoop software or experiment with Yahoo’s growing set of free tools (which actually might end up in commercial distributions, too). Shugar says Yahoo is investing serious resources in balancing CPU and storage requirements to maximize infrastructure usage in the face of skyrocketing storage needs, and is also looking to improve internal programmer support to help get data in and out of Hadoop via metadata.

As more Yahoo software makes its way into the Apache Hadoop community, and big data analysis requirements grow, it might be difficult to justify paying for value-added solutions rather than just downloading the increasingly feature-packed Yahoo distribution and learning Hadoop development. Should the startups building their business around Hadoop worry?

Image courtesy of Flickr user Erik Eldridge

DemandTec Explains How It Keeps Big Data on Target

DemandTec, a retail-forecasting software provider, has convinced Target Corp. — an existing customer — to hand over even more of its shopping data in order to better set prices and forecast demand. Target is following a growing trend in retail that involves using more granular customer data to predict demand, thanks to advances in software and technology. But stores still need companies like DemandTec to help them turn the plentiful straw of digital data into predictive gold, a service that becomes more challenging as more data is introduced.

DemandTec has helped Target gather retail information at the store level for years, tracking how many cartons of Tropicana orange juice were sold in a single store, across a city or even statewide. It established causal relationships between that data and ads and promotions to forecast demand and the impact of pricing changes on buying habits. But now Target is using DemandTec’s Shopper Insights software to go another level deeper, tracking not just individual items but entire baskets of goods and the associated demographic information. Now, for example, Target will know that a person who bought Tropicana also bought a brand-name cereal and a lawn chair.
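At its simplest, basket-level tracking comes down to counting which items show up in the same transaction. The sketch below is a rough illustration of that counting step using made-up baskets and item names; DemandTec’s actual models, which fold in demographics, promotions and pricing, are far more involved.

```python
# Rough illustration of basket-level co-purchase counting, the kind of raw
# signal Shopper Insights-style analysis starts from. Baskets are made up.
from collections import Counter
from itertools import combinations

baskets = [
    {"tropicana_oj", "brand_cereal", "lawn_chair"},
    {"tropicana_oj", "brand_cereal"},
    {"tropicana_oj", "milk"},
]

co_counts = Counter()
for basket in baskets:
    # Count every pair of items that appeared in the same transaction.
    for pair in combinations(sorted(basket), 2):
        co_counts[pair] += 1

# Items most often bought alongside orange juice.
print([(pair, n) for pair, n in co_counts.most_common()
       if "tropicana_oj" in pair])
```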

Derek Smith, VP of Retail Industry Marketing at DemandTec, says that adding the data associated with tracking an entire basket of purchases caused an “exponential increase in the information” the company processed. To prepare, DemandTec purchased data-warehousing hardware from Netezza. But the hardware infrastructure for storing and processing this data isn’t as interesting as the statistics and knowledge that DemandTec says it has to apply in order to draw real conclusions for its customers. It’s not enough to have a firehose; the company has to figure out which streams affect purchase decisions and which are meaningless.

That job will soon get harder. Smith says the next level of data that retailers will want to incorporate will come from social-media sites and personalized ad campaigns delivered via mobile phones, as well as from more data streams covering phenomena with a proven impact on retail spending, such as weather information. Unlike weather data, much of the social information is unstructured, which means DemandTec will either look for partners to provide the structure it needs or have to build that structure itself. Already companies like Microsoft and Infochimps are looking at ways to provide both meaning and a market for data (GigaOM Pro sub req’d).

The ability to store and process terabytes of information so that Target can tweak a price by 3 cents and drive a 5-percent increase in purchases by its most profitable customers is now here. When it comes to the future, DemandTec is looking for services that help it get a handle on the massive amounts of unstructured data out there, such as which of the 65 million tweets a day has a direct relationship to what a person will buy. Sounds like a golden opportunity.

Here’s How The Web Reads Your Mind

Want to know how Apple’s Genius song recommendation system for iTunes works? Apple engineer Erik Goldman offered up some insights to users of answer service Quora in a post back in May. While Goldman’s post has since been deleted, Christopher Mims covered it in an MIT Technology Review story on Wednesday. Goldman’s answer on Quora offered a sneak peek into the way big data analytics and aggregated personal information combine to personalize song recommendations and create a custom long tail of content for iTunes customers. The Genius service boosts revenue for Apple, but insights into its inner workings could also benefit the web as a whole.

Recommendation engines are the key to shoving the entire web into small devices like mobile phones and to creating a hyperpersonalized surfing experience. For consumers, the web has opened up billions of opportunities to find content, with much of it contained in the so-called long tail made famous by Wired’s Chris Anderson. But mere mortals can’t filter through all the possibilities to discover what the heck they want to read, watch or listen to. Hence the popularity of recommendation engines and discovery services from companies like Amazon, Apple, Netflix and even Google.

Despite the fact that Goldman’s original answer mysteriously disappeared the day the Technology Review post drew attention to it (if you want to see what may be Goldman’s original answer, check the screenshot at the end of the post), Mims unpacked the mystery of how the recommendation service works. The heart of the Genius recommendation system is statistics applied against a mess of data. The initial goal is to take an individual’s playlist, measure the frequency of certain elements (such as the artist) and determine how significant each element might be in making a recommendation. To do that, the algorithms check the frequency of those elements in other Genius users’ playlists to see which ones occur widely and which ones don’t. This allows the system to compare playlists between people who like the same obscure bands rather than trying to draw conclusions based on the hundreds of millions of playlists that include Lady Gaga’s “Bad Romance.”
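The weighting Goldman describes is essentially the same trick text search engines use: an artist who appears in nearly every playlist carries little signal, while one shared by only a handful of users carries a lot. Here is a minimal sketch of that idea, based on my reading of the description rather than on Apple’s code; the playlists are invented.

```python
# Minimal sketch of frequency-based weighting over playlists, in the spirit
# of the description above (essentially TF-IDF). Not Apple's implementation.
import math
from collections import Counter

playlists = {
    "user_a": ["lady_gaga", "obscure_band", "lady_gaga"],
    "user_b": ["lady_gaga", "obscure_band"],
    "user_c": ["lady_gaga"],
}

num_users = len(playlists)
doc_freq = Counter()                     # how many playlists contain each artist
for tracks in playlists.values():
    doc_freq.update(set(tracks))

def weights(user):
    """Weight each artist by how often the user plays them, discounted by
    how common the artist is across all playlists."""
    counts = Counter(playlists[user])
    total = sum(counts.values())
    return {artist: (n / total) * math.log(num_users / doc_freq[artist])
            for artist, n in counts.items()}

print(weights("user_a"))  # "obscure_band" outweighs the ubiquitous "lady_gaga"
```

With weights like these in hand, two users look similar when their high-weight (that is, rare) artists overlap, which is exactly the obscure-bands comparison described above.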

The second element of figuring this out relies on assessing which rules the recommendation engine can apply to your playlist to reduce the amount of data it must parse through — the so-called “latent factors.” Christopher Mims writes:

Latent factors are what shakes out when you do a particular kind of statistical analysis, called a factor analysis, on a set of data, looking for the hidden, unseen variables that cause the variation in all the different variables you’re examining. Let’s say that the variability in a dozen different variables turns out to be caused by just four or five “hidden” variables — those are your latent factors. They cause many other variables to move in more or less lock-step. Discovering the hidden or “latent” factors in your data set is a handy way to reduce the size of the problem that you have to compute, and it works because humans are predictable: people who like Emo music are sad, and sad people also like the soundtracks to movie versions of vampire novels that are about yearning, etc. You might think of it as the mathematical expression of a stereotype — only it works.
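To make that concrete, here is a toy example of pulling a couple of latent factors out of a listener-by-song matrix with a truncated SVD, a close cousin of the factor analysis Mims describes and a common shortcut in recommendation work. The play counts are invented and the method is illustrative, not Genius’ actual machinery.

```python
# Toy latent-factor extraction with a truncated SVD. The play-count matrix is
# invented; real systems factor matrices with millions of rows.
import numpy as np

# rows = listeners, columns = songs, values = play counts
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

k = 2                                    # keep two "hidden" factors
U, S, Vt = np.linalg.svd(ratings, full_matrices=False)
user_factors = U[:, :k] * S[:k]          # each listener as a point in factor space
item_factors = Vt[:k, :].T               # each song as a point in the same space

# In the 2-D factor space, listeners 0 and 1 end up near each other, as do
# listeners 2 and 3: two "taste" clusters recovered from raw play counts.
print(np.round(user_factors, 2))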

These techniques aren’t rocket science — they’re statistics-based (which, given my performance in stats as opposed to physics, is much harder). To learn more about how latent factors are uncovered, Goldman recommended folks turn to the site operated by the recent winners of the Netflix recommendation prize. For laypersons, I can recommend Wired’s awesome story covering the race to win the Netflix prize, which shows how most of the people trying to improve recommendation engines are doing so in the open and piggybacking on each other’s efforts — something Apple doesn’t seem to be endorsing, given that Goldman’s post was deleted.

As the devices on which we consume our information become smaller, the need for better recommendations has moved beyond a nicety for discovering long-tail content into a necessity for displaying optimal results quickly over a mobile connection and on a small screen. I discussed this problem with Elizabeth Churchill, principal research scientist and manager of the Internet Experiences Group at Yahoo, a while back, and she emphasized that tailored recommendations are important for mobile users not only because the screen sizes are small, but also because mobile connections are slower and people don’t have the patience to wait for a lot of results to load.

My theory is that the ability to use compute clouds to access huge amounts of data, crunch that data into prescient recommendations and then deliver them in a format fit for mobile consumption will be a key stepping stone for the next generation of the web.

Microsoft Wants to Build Its Business With Data

Everyone likes to talk about big data, but few know how to make use of it. Thanks to cloud computing and the efforts of several companies, however, the ability to access and make sense of huge chunks of information is here. The question is whether there’s a business in providing intelligible data sets to information workers, application developers and analysts in a world where turn-by-turn directions and real-time financial quotes — which used to be expensive — are now free.

Microsoft is hoping there is, and to that end has built out a storefront, codenamed Project Dallas, for data sets that range from geolocation data to weather information. The project, which will become commercially available in the second half of the year, aims to provide access to data from information providers like InfoUSA, Zillow and Navteq so that developers can use it to build applications and information services. Other potential users of the information are researchers, analysts and information workers — from buyers at retail stores to competitive intelligence officers at big companies. Microsoft will take a cut of the fee charged by the information providers, but Dallas isn’t about profiting from data brokerage so much as it’s about showcasing Microsoft’s Azure cloud and making its Office products more compelling.

“The indirect monetization is potentially bigger than the direct monetization,” said Moe Khosravy, general product manager of Project Dallas, in a conversation last week. “That will cover some bandwidth and compute and the credit card surcharges for the transactions, but the real opportunity is that more developers will use Azure and Office because we’ve made it easy and will build support for Dallas into Office.”
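From a developer’s perspective, a hosted data service of this kind typically boils down to an authenticated HTTP request that returns structured records ready to drop into an application or a spreadsheet. The sketch below is purely hypothetical; the endpoint, account key and field names are invented placeholders rather than the actual Dallas API.

```python
# Purely hypothetical sketch of pulling rows from a hosted data marketplace
# over HTTP. The URL, account key and field names are invented placeholders,
# not the actual Project Dallas API.
import requests

ACCOUNT_KEY = "YOUR-ACCOUNT-KEY"                                   # issued by the marketplace
ENDPOINT = "https://example-data-market.invalid/datasets/crime-stats"  # placeholder

response = requests.get(
    ENDPOINT,
    params={"city": "Seattle", "year": 2009, "format": "json"},
    headers={"Authorization": ACCOUNT_KEY},
    timeout=30,
)
response.raise_for_status()

for record in response.json().get("results", []):
    print(record)
```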

I explore Microsoft’s efforts, as well as those of a startup called Infochimps, which is also building a data marketplace, in a research note over on GigaOM Pro (sub req’d) called Big Data Marketplaces Put a Price on Finding Patterns. In it, I lay out how the ability to host and process large data sets on compute clouds has changed the way people can access and profit from data.

And while I spend a lot of time in the research note talking about business models and how to charge for data by the slice, Infochimps and Microsoft will both provide some data for free, much like a startup called Bixo Labs is doing. Specifically, Khosravy said Microsoft may try to provide some municipal and federal data as a public service — or at least refrain from charging governments for hosting the data on Azure.

Figuring out how to get public information onto data marketplaces is difficult. Governments have access to a lot of data, but it’s generally on paper or in old databases that may not translate automatically to the cloud. There’s a clear public interest in providing that data in a clean format for developers and citizens, but the costs could quickly add up — and governments don’t tend to have a lot of taxpayer dollars floating around to transfer their data to the cloud. That’s why Microsoft’s volunteering to host “a percentage” of public data for free might help.

And the benefits of such easy accessibility and the ability to mash up different data sets could be huge. As an example, Microsoft is working with the City of Miami on a new 3-1-1 line that uses mapping data and inputs from the city’s existing 3-1-1 hotline to create a map of where potholes and other street problems are, so city officials can tackle the issues in an organized way.

As data marketplaces grow, questions about who owns the data and privacy issues will get resolved, because the financial incentive to address them is huge. Then folks can focus on what they can build using huge swaths of demographic, geographic, financial and even personal data. Read my full analysis.

Can Facebook or Twitter Spin Off the Next Hadoop?

Like most people, I suspect, I wasn’t too surprised to find out that Hadoop-focused startup Karmasphere has secured a $5 million initial funding round. After all, if Hadoop catches on like the evidence suggests it will, Karmasphere’s desktop-based Hadoop-management tools could pay off investors many times over. In some ways, though, the fact that Hadoop is mature enough to inspire commercial products means it’s yesterday’s news. Now, I’m wondering, which open-source, big-data-inspired product will be the next to launch a wave of startups and drive tens of millions in VC spending?

Big data has narrowed the gap between the needs of bleeding-edge web companies, their offspring and even traditional businesses. Hadoop has caught on across industry boundaries as an analytics tool for unstructured data sets, and it seems logical that other web-based tools will catch on in other parts of the data layer. In my weekly column over at GigaOM Pro (sub req’d) today, I took a look at the potential for Cassandra, which grew out of Facebook, and Gizzard, Twitter’s ill-named big-data baby.

Given its growing popularity and expanding functionality, Cassandra right now seems like a prime candidate. Rackspace has taken over its development reins, and it’s found varied applications within Digg, Twitter, Reddit, Cloudkick and Cisco, to name a few. This diversity illustrates Cassandra’s versatility; it’s not just for the social media crowd. Furthermore, Cassandra graduated to a top-level Apache project in February, signifying the quality of the work done on it thus far and, most likely, a groundswell of new developers.

Twitter’s newly open-sourced Gizzard tool seems to have promise, as well. By eliminating some pain from the often difficult sharding process, Gizzard makes it easier to build and manage distributed data stores that can handle ultra-high query volumes without getting bogged down. Like Google, Yahoo and Facebook before it, Twitter has played a role in evolving how we use the web, and software developed within its walls should be a hot commodity for present and future Twitter-inspired sites and products.
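Gizzard itself is Scala middleware that layers replication and migration on top of the basic routing problem, but the core question it answers, namely which physical store owns a given row, can be illustrated with a simple hash-based router. This is a generic sketch of the idea, not Gizzard’s actual forwarding logic:

```python
# Generic hash-based shard routing, to illustrate the problem Gizzard
# automates. Gizzard adds replication, migration and fault handling on top
# of this basic idea; shard names here are invented.
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    """Map a row key (say, a user ID) to one of the physical shards."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def put(key: str, value: dict, stores: dict) -> None:
    """Route a write to the shard that owns the key."""
    stores.setdefault(shard_for(key), {})[key] = value

stores = {}
for user_id in ("12", "7481", "990133"):
    put(user_id, {"followers": []}, stores)
print({shard: list(rows) for shard, rows in stores.items()})
```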

Which do you think will take off?

Read the full article here.

Photo courtesy Flickr user zzzack

What You Should Read Today

I was just in New York for a week of fun and learning. It was my chance to meet people outside the Silicon Valley hothouse and get their perspective on the world. Of course, it also gave me a chance to read Michael Lewis’ new book, The Big Short. More importantly, the trip allowed me to stay out of my team’s way while they peacefully launched new looks for our blogs, TheAppleBlog and Earth2Tech. Congratulations, folks!

The redesigns are part of a network-wide overhaul — NewTeeVee will soon get a facelift as well — meant to surface content from our various blogs for your reading pleasure. Speaking of reading recommendations, here are some of the best pieces worth your attention.

Cody Willard: Cars, PCs and transistors — how to trade off a repetitive history of industry cycles. Willard, one of my favorite stock gurus, makes an interesting link between the ever-shrinking number of car makers and PC companies.

Di-Ann Eisnor: Why the Future of Location-Based Advertising Looks Like Twitter. If you want to understand the challenges of the mobile advertising marketplace, look no further than this in-depth essay by the CEO of recently shuttered Platial.

Startup Company Lawyer: A comparison of various sets of seed financing documents, including those from popular startup schools such as Y Combinator and TechStars.

Social Media Kills the Database: A new web startup founder laments the relational database and describes how companies of his generation are looking to open-source software such as HBase and Hadoop to break its tyranny. It is a remarkably coherent and easy-to-understand essay and worth reading.

Paul Kedrosky: Hitchhiker’s Guide to Financial Regulation.