Article Extraction

Request example

If you request an article extraction and the extraction succeeds, the article field is available in the query result:

from autoextract.sync import request_raw

query = [{
    'url': 'http://example.com/article?id=24',
    'pageType': 'article'
}]
results = request_raw(query, api_key='[api key]')
print(results[0]['article'])

Available fields

The following fields are available for article:

headline: string

Article headline or title.

datePublished: string

Publication date. ISO-formatted with ‘T’ separator, may contain a timezone. If the actual publication date is not found, the value of dateModified is used instead.

datePublishedRaw: string

Same date as datePublished, but before parsing/normalization, i.e. as it appears on the website.

dateModified: string

The date when the article was most recently modified. ISO-formatted with ‘T’ separator, may contain a timezone.

dateModifiedRaw: string

Same date as dateModified but before parsing/normalization, i.e. as it appears on the website.
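
Because datePublished and dateModified are ISO-formatted, they can be parsed with the Python standard library. The following is a minimal sketch, assuming Python 3.7 or later. The helper name parse_article_date is hypothetical, and the handling of a trailing "Z" is a precaution, not something this page states the API returns.

```python
from datetime import datetime


def parse_article_date(value):
    """Parse an ISO-formatted date string such as datePublished."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


naive = parse_article_date("2019-06-19T00:00:00")
aware = parse_article_date("2019-06-19T00:00:00+02:00")
print(naive.tzinfo)       # None: the string carried no timezone
print(aware.utcoffset())  # 2:00:00
```

Since the timezone is optional, the result may be a naive or an aware datetime; code that compares dates across articles should handle both cases.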

author: string

Author (or authors) of the article.

authorsList: list of strings

All authors of the article split into separate strings. For example,

  • if author is "Alice and Bob", authorsList would be ["Alice", "Bob"],

  • if author is "Alice Johnes" (a single author), authorsList would be ["Alice Johnes"].

inLanguage: string

Language of the article, as an ISO 639-1 language code. Example: "en".

Sometimes article language is not the same as the web page overall language; to get the detected web page languages, see General Web Page Information.

breadcrumbs: list of dictionaries with name and link optional string fields

A list of breadcrumbs (a specific navigation element) with optional name and URL. Example:

[
  {"name": "Foo", "link": "http://example.com/foo"},
  {"name": "Bar", "link": "http://example.com/foo/bar"},
  {"name": "Baz"}
]

mainImage: string

A URL or data URL value of the main image of the article. All URLs are absolute.

images: list of strings

A list of URL or data URL values of all images of the article (may include the main image). All URLs are absolute.

description: string

A short summary of the article. It can be either human-provided (if available), or auto-generated.

articleBody: string

Text of the article, including sub-headings, with newline separators.

articleBodyHtml: string

Simplified and standardized HTML of the article, including sub-headings, image captions and embedded content (videos, tweets, etc.). See the Format of articleBodyHtml field section for a detailed description.

articleBodyRaw: string

HTML of the article body as seen in the source page.

This field is sometimes large and is often not needed, as articleBodyHtml is preferable. The articleBodyRaw field can be turned off when making an API request: it will not be returned if you pass "articleBodyRaw": false as a query parameter (see Requests).
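
As a sketch of the option described above, the query from the request example can carry the extra parameter. The URL is the placeholder used earlier on this page; the json.dumps call is only there to show how Python's False is serialized.

```python
import json

# Same query as in the request example, with articleBodyRaw switched off.
query = [{
    'url': 'http://example.com/article?id=24',
    'pageType': 'article',
    'articleBodyRaw': False,
}]

# Python's False is serialized as JSON false.
print(json.dumps(query, indent=2))
```

The resulting query list can then be passed to request_raw in the same way as in the request example.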

videoUrls: list of strings

A list of URLs of all videos inside the article body.

audioUrls: list of strings

A list of URLs of all audio files inside the article body.

probability: float

Probability that this is a single article page.

This number is close to 1.0 when a requested page looks like an individual news article page, blog post, etc. Otherwise the number is low, closer to 0.0; for example, expect it to be low on pages with lists of news articles, on e-commerce pages, etc.
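
One way to use this field is to discard results that are unlikely to be single article pages. The sketch below works on hand-written sample data in the shape of the query result. The function name confident_articles and the 0.5 threshold are assumptions for illustration, not recommended values.

```python
def confident_articles(results, threshold=0.5):
    """Return the article dicts whose probability is at least `threshold`."""
    articles = []
    for result in results:
        article = result.get('article')
        if article and article['probability'] >= threshold:
            articles.append(article)
    return articles


# Hand-written sample data, not real API output.
results = [
    {'article': {'url': 'http://example.com/article?id=24', 'probability': 0.95}},
    {'article': {'url': 'http://example.com/news', 'probability': 0.08}},
]
for article in confident_articles(results):
    print(article['url'])
```

A suitable threshold depends on how costly false positives are for your application.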

canonicalUrl: string

Canonical URL of the article, if available.

url: string

URL of the page from which this article was extracted.

All fields are optional, except for url and probability. Fields without a valid value (null or empty array) are excluded from the extraction results.
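
Because fields without a value are left out, client code should not assume that optional keys exist. The following is a minimal sketch of defensive access using dict.get. The function name summarize and the shape of its return value are hypothetical.

```python
def summarize(article):
    """Build a small summary dict that tolerates missing optional fields."""
    # authorsList may be absent; fall back to the single author string.
    authors = article.get('authorsList')
    if authors is None:
        author = article.get('author')
        authors = [author] if author else []
    return {
        'url': article['url'],                  # always present
        'probability': article['probability'],  # always present
        'headline': article.get('headline'),    # None when excluded
        'authors': authors,
        'images': article.get('images', []),    # empty list when excluded
    }


# Hand-written sample with only the two required fields and an author.
sample = {
    'url': 'https://example.com/article?id=24',
    'probability': 0.95,
    'author': 'Alice Johnes',
}
print(summarize(sample))
```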

Response example

Below is an example response with all article fields present:

[
  {
    "article": {
      "headline": "Article headline",
      "datePublished": "2019-06-19T00:00:00",
      "datePublishedRaw": "June 19, 2019",
      "dateModified": "2019-06-21T00:00:00",
      "dateModifiedRaw": "June 21, 2019",
      "author": "Article author",
      "authorsList": [
        "Article author"
      ],
      "inLanguage": "en",
      "breadcrumbs": [
        {
          "name": "Level 1",
          "link": "http://example.com"
        }
      ],
      "mainImage": "http://example.com/image.png",
      "images": [
        "http://example.com/image.png"
      ],
      "description": "Article summary",
      "articleBody": "Article body ...",
      "articleBodyHtml": "<article><p>Article body ... </p> ... </article>",
      "articleBodyRaw": "<div id=\"an-article\">Article body ...",
      "videoUrls": [
        "https://example.com/video.mp4"
      ],
      "audioUrls": [
        "https://example.com/audio.mp3"
      ],
      "probability": 0.95,
      "canonicalUrl": "https://example.com/article/article-about-something",
      "url": "https://example.com/article?id=24"
    },
    "webPage": {
      "inLanguages": [
        {"code": "en"},
        {"code": "es"}
      ]
    },
    "query": {
      "id": "1564747029122-9e02a1868d70b7a3",
      "domain": "example.com",
      "userQuery": {
        "pageType": "article",
        "url": "http://example.com/article?id=24"
      }
    },
    "algorithmVersion": "20.8.1"
  }
]
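
The response contains both the article language (article.inLanguage) and the detected web page languages (webPage.inLanguages). A small sketch for reading both, using a trimmed copy of the response above; the function name language_info is hypothetical.

```python
def language_info(result):
    """Return (article language, list of web page language codes)."""
    article_lang = result.get('article', {}).get('inLanguage')
    page_langs = [
        entry['code']
        for entry in result.get('webPage', {}).get('inLanguages', [])
    ]
    return article_lang, page_langs


# Trimmed copy of the example response.
result = {
    'article': {
        'inLanguage': 'en',
        'probability': 0.95,
        'url': 'https://example.com/article?id=24',
    },
    'webPage': {'inLanguages': [{'code': 'en'}, {'code': 'es'}]},
}
print(language_info(result))  # ('en', ['en', 'es'])
```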

Format of articleBodyHtml field

The articleBodyHtml field in article extractions contains a normalized and simplified HTML version of the article body. It is easy to create your own CSS styles over this HTML so that the final look-and-feel is integrated with the rest of your app.

The normalized HTML also allows for automated HTML processing which is consistent across websites. For example:

  • To get all images with their captions, run the //figure XPath query and then ./img and ./figcaption on each match

  • h tags are normalized, making the article hierarchy easy to determine

  • Tables and lists can be extracted cleanly

  • Links are absolute

  • Only semantic HTML tags are returned - no generic divs/spans are included
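
The image and caption lookup from the list above can also be done without an XPath library. The sketch below uses html.parser from the Python standard library instead of XPath, which avoids extra dependencies and copes with unclosed img tags. The class and function names are hypothetical, and the sample HTML is hand-written.

```python
from html.parser import HTMLParser


class FigureCollector(HTMLParser):
    """Collect image URL and caption text from <figure> elements."""

    def __init__(self):
        super().__init__()
        self.figures = []         # finished figures
        self._current = None      # figure being parsed, or None
        self._in_caption = False  # True while inside <figcaption>

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._current = {"src": None, "caption": ""}
        elif self._current is not None:
            if tag == "img" and self._current["src"] is None:
                self._current["src"] = dict(attrs).get("src")
            elif tag == "figcaption":
                self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._current is not None:
            self._current["caption"] = self._current["caption"].strip()
            self.figures.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._in_caption and self._current is not None:
            self._current["caption"] += data


def extract_images(article_body_html):
    """Return the figures that contain an image, with their captions."""
    parser = FigureCollector()
    parser.feed(article_body_html)
    parser.close()
    return [figure for figure in parser.figures if figure["src"]]


sample_html = (
    '<article><p>Intro</p>'
    '<figure><img src="https://example.com/a.png" alt="A">'
    '<figcaption>First <em>image</em></figcaption></figure>'
    '<figure><iframe src="https://example.com/v"></iframe></figure>'
    '</article>'
)
print(extract_images(sample_html))
```

Figures that wrap other media, such as the iframe in the sample, are skipped here because only images are requested.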

The supported tags and attributes are normalized as follows:

| Content Type | Normalization | Supported Elements/Attributes |
|---|---|---|
| Sectioning | All content is enclosed in a root article tag. Headings are normalized so that they always start with h2. | article (root only), h2, h3, h4, h5, h6, aside |
| Text | Paragraphs are enclosed in a p tag. Tables, lists, definition lists and block quotes are supported. | p, table, tbody, thead, tfoot, th, tr, td, ul, ol, li, dl, dt, dd, blockquote |
| Inline text | b tag is translated to strong. i tag is translated to em. | a, br, strong, em, s, sup, sub, del, ins, u, cite |
| Pre-formatted text | None | pre, code |
| Multimedia elements | Multimedia elements are generally enclosed in a figure tag. Captions for these elements are included within the figcaption tag when available. If multimedia elements appear in the text as inline elements within paragraphs, they are kept as is (without enclosing them in a figure element). | figure, figcaption, img, video, audio, iframe, embed, object, source |
| Supported attributes | Tag attributes not in the supported list to the right are filtered out of the output. | data-*, alt, cite, colspan, datetime, dir, href, label, rowspan, src, srcset, sizes, start, title, type, value, vspace |
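
Since headings always start at h2, an article outline can be derived directly from the heading levels. The following is a minimal sketch using html.parser from the Python standard library. The class and function names are hypothetical, and the sample HTML is hand-written.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h2", "h3", "h4", "h5", "h6"}


class OutlineCollector(HTMLParser):
    """Collect (level, text) pairs for every heading in the document."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # level of the heading being read, or None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._text = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)


def article_outline(article_body_html):
    parser = OutlineCollector()
    parser.feed(article_body_html)
    parser.close()
    return parser.outline


sample_html = (
    "<article><h2>Intro</h2><p>Text</p>"
    "<h3>Details</h3><p>More</p><h2>Summary</h2></article>"
)
for level, title in article_outline(sample_html):
    # h2 is the top level, so indent relative to it.
    print("  " * (level - 2) + title)
```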

Social media content

Content from social media platforms (Twitter, etc.) is rendered properly only if the correct JavaScript files from the platform are included in the page. The currently supported platforms and the JavaScript file to include for each are as follows:

| Platform | Script file |
|---|---|
| Twitter | <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> |
| Instagram | <script async src="//www.instagram.com/embed.js"></script> |
| Facebook | <div id="fb-root"></div> <script async defer src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.2"></script> |
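
As an illustration of including these scripts, the sketch below wraps an articleBodyHtml value in a minimal HTML page. The function name render_page, the page skeleton, and the choice to place the scripts at the end of the body are assumptions; only the script tags themselves come from the list above.

```python
TWITTER_SCRIPT = (
    '<script async src="https://platform.twitter.com/widgets.js" '
    'charset="utf-8"></script>'
)
INSTAGRAM_SCRIPT = '<script async src="//www.instagram.com/embed.js"></script>'


def render_page(article_body_html, scripts=(TWITTER_SCRIPT, INSTAGRAM_SCRIPT)):
    """Wrap articleBodyHtml in a minimal page that loads the embed scripts."""
    return "\n".join([
        "<!DOCTYPE html>",
        "<html>",
        '<head><meta charset="utf-8"></head>',
        "<body>",
        article_body_html,
        *scripts,  # loaded after the article content
        "</body>",
        "</html>",
    ])


page = render_page("<article><p>Article body</p></article>")
print(page)
```

Pass only the scripts for the platforms that actually appear in your articles to avoid loading unused third-party code.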

Example articleBodyHtml response

<article>

<p>The range of use cases for web data extraction is rapidly increasing and with it the necessary investment. Plus the number of websites continues to grow rapidly and is expected to exceed 2 billion by 2020.</p>

<p>Presented by <a href="https://www.zyte.com/">Zyte</a> (formerly Scrapinghub), the first Web Data Extraction Summit will be held in Dublin, Ireland on 17th September 2019. This is the first-ever event dedicated to web data and extraction and will be graced by over 100 CEOs, Founders, Data Scientists and Engineers.</p>

<figure><iframe src="https://play.vidyard.com/7hJbbWtiNgipRiYHhTCDf6?v=4.2.13&amp;viral_sharing=0&amp;embed_button=0&amp;hide_playlist=1&amp;color=FFFFFF&amp;playlist_color=FFFFFF&amp;play_button_color=2A2A2A&amp;gdpr_enabled=1&amp;type=inline&amp;new_player_ui=1&amp;vydata%5Butk%5D=d057931dfb8520abe024ef4b2f68d0ad&amp;vydata%5Bportal_id%5D=4367560&amp;vydata%5Bcontent_type%5D=blog-post&amp;vydata%5Bcanonical_url%5D=https%3A%2F%2Fblog.scrapinghub.com%2Fthe-first-web-data-extraction-summit&amp;vydata%5Bpage_id%5D=12510333185&amp;vydata%5Bcontent_page_id%5D=12510333185&amp;vydata%5Blegacy_page_id%5D=12510333185&amp;vydata%5Bcontent_folder_id%5D=null&amp;vydata%5Bcontent_group_id%5D=5623735666&amp;vydata%5Bab_test_id%5D=null&amp;vydata%5Blanguage_code%5D=null&amp;disable_popouts=1" title="Video"></iframe></figure>

<p>With a promising line-up of talks and discussions accompanied by interesting conversations and networking sessions with fellow data enthusiasts, followed by food and drinks at the magnificent Guinness Storehouse, there are no reasons to miss this event. What’s more, we are also giving out free swag! You will get your own Extract Summit T-shirts on the day!</p>

<figure><img src="https://blog.scrapinghub.com/hubfs/Extract-Summit-Emails-images-tee-aug2019-v1.gif" alt="Extract-Summit-Emails-images-tee-aug2019-v1"></figure>

</article>