Diffbot is a suite of ML-based products that make it easy to structure web data.
Diffbot’s Extract API is a service that structures and normalizes data from web pages.
Unlike traditional web scraping tools, Diffbot Extract doesn’t require any rules to read the content on a page. It uses a computer vision model to classify a page into one of 20 possible types, and then transforms raw HTML markup into JSON. The resulting structured JSON follows a consistent type-based ontology, which makes it easy to extract data from multiple different web sources with the same schema.
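To illustrate why a shared ontology helps, here is a minimal sketch that reads two Extract-style responses of different page types with the same code. The sample payloads are hand-written for illustration, not real API output, and the field names (objects, type, title, pageUrl) are assumptions about the response shape. The helper summarize_objects is hypothetical.

```python
from typing import Any


def summarize_objects(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Pull the same three fields out of every object, whatever its page type."""
    rows = []
    for obj in response.get("objects", []):
        rows.append(
            {
                "type": obj.get("type"),
                "title": obj.get("title"),
                "url": obj.get("pageUrl"),
            }
        )
    return rows


# Hand-written stand-ins for Extract responses (assumed shape, not real output).
article_response = {
    "objects": [
        {
            "type": "article",
            "title": "Hello",
            "pageUrl": "https://example.com/post",
            "text": "Body of the post.",
        }
    ]
}
product_response = {
    "objects": [
        {
            "type": "product",
            "title": "Widget",
            "pageUrl": "https://example.com/widget",
            "offerPrice": "$9.99",
        }
    ]
}

# One code path handles both page types because the common fields line up.
rows = summarize_objects(article_response) + summarize_objects(product_response)
for row in rows:
    print(row)
```

Type-specific fields such as a product's price would still need their own handling. Only the fields shared across types can be read this uniformly.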
With the .load() method, you can see the documents loaded.
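A minimal sketch of loading and inspecting documents might look like the following. It assumes the DiffbotLoader class from langchain_community, constructed with urls and api_token, and assumes each returned document carries its URL under the source metadata key. The preview helper and the sample_docs stand-in are hypothetical, so the inspection step can be tried without a Diffbot token.

```python
from types import SimpleNamespace


def load_with_diffbot(urls, api_token):
    """Fetch each URL through Diffbot Extract and return LangChain documents."""
    # Imported lazily so the rest of this sketch runs without the package installed.
    # Assumption: DiffbotLoader accepts `urls` and `api_token` keyword arguments.
    from langchain_community.document_loaders import DiffbotLoader

    return DiffbotLoader(urls=urls, api_token=api_token).load()


def preview(docs, width=60):
    """Return one 'source: text' line per document, with whitespace collapsed."""
    lines = []
    for doc in docs:
        source = doc.metadata.get("source", "<unknown>")
        text = " ".join(doc.page_content.split())
        lines.append(f"{source}: {text[:width]}")
    return lines


# Stand-in object with the same attributes as a loaded document (illustrative only).
sample_docs = [
    SimpleNamespace(
        page_content="Example  Domain\n",
        metadata={"source": "https://example.com"},
    )
]

for line in preview(sample_docs):
    print(line)
```

With a real token, `preview(load_with_diffbot(["https://example.com"], token))` would print the same kind of summary for the fetched pages.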
Use DiffbotGraphTransformer to extract entities and relationships into a graph.
For more, see the DiffbotGraphTransformer guide.