Have you ever wondered how social networks do URL previews so well when you share links? How do they know which images to grab, whom to cite as an author, or which tags to attach to the preview? Is it all crawling with complex regexes over source code? Actually, more often than not, it isn’t. Meta information defined in the source can be unreliable, and sites with less than stellar reputation often use them as keyword carriers, attempting to get search engines to rank them higher. Isn’t what we, the humans, see in front of us what matters anyway?
If you want to build a URL preview snippet or a news aggregator, there are many automatic crawlers available online, both proprietary and open source, but you seldom find something as niche as visual machine learning. This is exactly what Diffbot is - a “visual learning robot” which renders a URL you request in full and then visually extracts data, helping itself with some metadata from the page source as needed.
After covering some theory, in this post we’ll do a demo API call at one of SitePoint’s posts.
PHP Library
The PHP library for Diffbot is somewhat out of date, and as such we won’t be using it in this demo. We’ll be performing raw API calls, and in some future posts we’ll build our own library for API interaction.
If you’d like to take a look at the PHP library nonetheless, see here, and if you’re interested in libraries for other languages, Diffbot has a directory.
Continue reading %Diffbot: Crawling with Visual Machine Learning%
more
{ 0 comments... » Diffbot: Crawling with Visual Machine Learning read them below or add one }
Post a Comment