The first startup out of Stanford’s StartX student accelerator program, website form and content recognition engine Diffbot, is today announcing a $2 million round of funding from a select group of investors including Sky Dayton, founder of Earthlink and Boingo, Brad Garlinghouse, Andy Bechtolsheim, Joi Ito and other web-tech all-stars. Diffbot founder and CEO Michael Tung told BetaKit that the round was actually oversubscribed, and that the startup was careful to pick investors who were good at one thing in particular: scaling web-based companies in response to demand.
That’s because the Diffbot API is already seeing over 100 million calls per month, and that’s with only a fraction of Diffbot’s planned functionality available for public use. Diffbot combines visual robotics and natural language processing (NLP) to help apps quickly identify, on demand, what kind of content any given web page represents, and where on the page the parts most relevant to a given application are located.
“Essentially we have this software that can analyze a web page and determine what are the objects on the page, like ‘this is an article, this is a product, this is a review, this is an event, this is a location, etc.’,” Tung explained. “The idea is that by making this web page that was designed for humans machine-readable, we can then treat the web just like a big database of information.”
So far, Diffbot is able to identify two content types: front pages and article pages. Tung said that rolling out functionality slowly is part of the company’s plan to scale effectively, and to make sure the tech is really ready for each intended purpose before it goes out. Overall, the Diffbot team has identified around 20 distinct content types that cover most of the content available on the web, according to Tung.
“If you were to pull up an Israeli magazine and you don’t know Hebrew, or if you were to pull up a Japanese blog and you don’t know Japanese, you can still kind of tell how that page works, both editorially and in terms of navigation,” he said, describing how Diffbot’s optical recognition engine can be applied almost universally. “You can still tell ‘this is the headline, these are the comments and this is the picture that goes with it.’ That’s the level we’ve taught a computer how to do automatically.” Its content interpretation works across languages, too; the NLP side of Diffbot can currently handle about 250 different languages.
Because of its current focus on article content, Diffbot’s clients tend to be media companies. One noteworthy example is AOL Editions, which uses the tech to scan content and prepare it for formatting in its Flipboard-style iPad reading application. Tung said that much of what Diffbot provides for AOL is a way to get all of AOL’s own content from its various acquired properties, including TechCrunch, Engadget and the Huffington Post, into a single format that’s easy to work with, even though those properties may run on different backend content management systems.
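To make the "single format" idea concrete, here is a minimal Python sketch of mapping records from two differently structured backends onto one common article shape. Everything in it is an assumption made for illustration: the `Article` class, the `cms_a` and `cms_b` source names and their field names are hypothetical, and the hand-written field mapping only stands in for the visual analysis Diffbot is described as performing. It is not Diffbot's API or AOL's actual setup.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical common format shared by every source."""
    source: str
    headline: str
    body: str


# Hypothetical per-backend field names; real CMS exports will differ.
FIELD_MAPS = {
    "cms_a": {"headline": "title", "body": "content"},
    "cms_b": {"headline": "post_title", "body": "post_body"},
}


def normalize(source: str, record: dict) -> Article:
    """Map one backend-specific record onto the common Article format."""
    mapping = FIELD_MAPS[source]
    return Article(
        source=source,
        headline=record[mapping["headline"]].strip(),
        body=record[mapping["body"]].strip(),
    )


# Two records with different field names end up in the same shape.
articles = [
    normalize("cms_a", {"title": " Example headline ", "content": "Body text."}),
    normalize("cms_b", {"post_title": "Another headline", "post_body": "More text. "}),
]
```

Once every record is an `Article`, downstream code such as a reader app's layout step only has to handle one shape, regardless of which backend the content came from.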
Diffbot represents a new way of approaching content, one that can be taught to handle variation and is equipped to evolve with the web. Tung and his team are already seeing strong uptake, and they’ve only just begun; once they expand to other types of content, Diffbot is a product that should appeal to almost anyone working on the web, from student researchers to multinational corporations. This new round of funding, and the advisory talent it brings with it, should help the startup begin to realize that potential.