Scrape URLs with CSS Selectors and/or Regular Expressions - In Bulk

Cost: 0.10 credits per basic scrape, or 0.50 credits per scrape with proxy. The 500 free credits on signup cover 5,000 basic scrapes.
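
To budget a run before launching it, the rates listed on this page can be combined in a few lines. The sketch below is only an illustration: the function name, the mode names, and the assumption that each URL is billed at exactly one rate are hypothetical, not part of Datablist.

```javascript
// Rates in hundredths of a credit, copied from this page.
// Integers avoid floating-point rounding when summing.
const RATE_CENTICREDITS = {
  basic: 10, // 0.10 credits per basic scrape
  proxy: 50, // 0.50 credits per scrape with proxy
  rendered: 200, // 2 credits per scrape with Render HTML
  cached: 5, // 0.05 credits per cached URL
};

// counts: hypothetical object such as { basic: 800, proxy: 150 }
function estimateCredits(counts) {
  let total = 0;
  for (const [mode, count] of Object.entries(counts)) {
    if (!(mode in RATE_CENTICREDITS)) {
      throw new Error(`Unknown mode: ${mode}`);
    }
    total += RATE_CENTICREDITS[mode] * count;
  }
  return total / 100; // convert back to credits
}

console.log(estimateCredits({ basic: 5000 })); // 500
console.log(estimateCredits({ basic: 800, proxy: 150, cached: 50 })); // 157.5
```

The first call reproduces the figure above: 5,000 basic scrapes consume the 500 free credits.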

Scrape URLs with CSS Selectors and/or Regular Expressions. Use the proxy option to scrape protected webpages, and configure multiple selectors to extract several pieces of text from each page.

Step-by-step guide

Step 1: Load your CSV or Excel file on Datablist

Create a free account and import your data file. Datablist is a powerful CSV editor, well suited to opening large CSV or Excel files that contain a list of items.

Create a new collection and import your file.

Step 2: Select the "Bulk Scraper" enrichment

Click on the "Enrich" button, and search for "Bulk Scraper".

Configure CSS Selectors

The Bulk Scraper uses two ways to extract data from the HTML page: CSS Selectors and Regular Expressions.

CSS selectors allow you to target specific parts of an HTML document to extract information.

A CSS selector is defined with the following information:

CSS Selector - The CSS path to the HTML element. Read this guide to learn how to write CSS Selectors.
CSS Selector Content - Data to extract for the HTML element.
- InnerText - Extract the text inside the HTML element. If the HTML element contains nested HTML elements, their texts are also extracted.
- HTML - Extract the outer HTML code of the HTML element.
- Attribute - Extract the value of a specific attribute of the HTML element.
Selector Attribute - Available when the CSS Selector Content is set to Attribute. Define the attribute to extract. Example: href, rel, title.

Note: When several elements match a CSS selector, all the results are returned concatenated with a semicolon (;)
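
Because several matches end up in a single cell, you may want to split them again after the run. Here is a minimal sketch of that step. The cell value and the function name are hypothetical, and the approach assumes the scraped values do not contain semicolons themselves.

```javascript
// Hypothetical cell value produced when two elements match the selector.
const cell =
  "https://fr.linkedin.com/company/datablist;https://www.twitter.com/datablist";

// Split on the semicolon separator, trim whitespace, and drop empty pieces.
function splitMatches(value) {
  return value
    .split(";")
    .map((part) => part.trim())
    .filter((part) => part !== "");
}

const links = splitMatches(cell);
console.log(links.length); // 2
console.log(links[0]); // https://fr.linkedin.com/company/datablist
```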

Selector Attribute field available on CSS Selector Content: Attribute.

Examples of CSS Selectors

To learn how to write CSS Selector paths, please read this guide.

Getting the text of an HTML element.

<div class="section product-data">
    <div class="product-name">New Phone</div>
</div>

The CSS selector would be .section.product-data .product-name with the CSS Selector Content set to InnerText.

Getting the text of the first div inside an element identified by a custom HTML attribute.

<div data-testid="block-content">
    <div>Info To Scrape</div>
    <div>Useless Info</div>
    <div>Useless Info</div>
</div>

The CSS selector would be [data-testid="block-content"] > div:first-child with the CSS Selector Content set to InnerText.

Getting the URLs for links:

<div class="social-media">
    <a href="https://fr.linkedin.com/company/datablist">Linkedin</a>
    <a href="https://www.twitter.com/datablist">Twitter</a>
</div>

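The CSS selector would be .social-media a with the CSS Selector Content set to Attribute and the Attribute set to href. The selector must target the a elements, because the href attribute is on the links and not on the surrounding div.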

How to test CSS Selectors

An easy way to test your CSS Selectors before running them in bulk is to use your browser console.

To test for InnerText:

Array.from(document.querySelectorAll('{css-selector-path}')).map(elem => elem.textContent).join(';')

To test for HTML:

Array.from(document.querySelectorAll('{css-selector-path}')).map(elem => elem.outerHTML).join(';')

To test for the content of an Attribute:

Array.from(document.querySelectorAll('{css-selector-path}')).map(elem => elem.getAttribute('{attribute}')).join(';')
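
The three one-liners above can be wrapped in a single helper that you paste into the console once. This is only a convenience sketch: the function name testSelector and its parameters are hypothetical, and it approximates the scraper's output rather than reproducing its exact implementation.

```javascript
// root: any object exposing querySelectorAll (pass `document` in the browser console).
// content: "InnerText", "HTML", or "Attribute", mirroring the CSS Selector Content options.
// attribute: only used when content is "Attribute".
function testSelector(root, selector, content, attribute) {
  const elems = Array.from(root.querySelectorAll(selector));
  const values = elems.map((elem) => {
    if (content === "InnerText") return elem.textContent;
    if (content === "HTML") return elem.outerHTML;
    if (content === "Attribute") return elem.getAttribute(attribute);
    throw new Error(`Unknown content type: ${content}`);
  });
  // Several matches are joined with a semicolon, as described in the note above.
  return values.join(";");
}

// Usage in the browser console:
//   testSelector(document, ".product-name", "InnerText")
//   testSelector(document, "a", "Attribute", "href")
```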

If you need help writing your CSS Selectors, please contact us.

Configure Regular Expressions

The second way to scrape data from several URLs is to use regular expressions. The bulk scraper matches the RegEx against the HTML source code.

If the pattern contains capturing groups, the scraper returns the text of the groups. If it contains no groups, the scraper returns the strings matching the whole pattern.

Capturing groups or pattern-matching

When writing a Regex, you can add a capturing group using parentheses. When a capturing group is defined, the bulk scraper will return only the group text.

For example, suppose the HTML code contains the following line:

Shopify.country = "US";

To capture only the "US" text from the Shopify.country line, you would write:

Shopify\.country\s=\s"(\w+)";

Notice the parentheses in (\w+).

To capture the whole line, you would write:

Shopify\.country\s=\s"\w+";

Notice that the parentheses have been removed.
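
To see the difference between the two patterns before running them in bulk, you can try them locally. The sketch below is an approximation under stated assumptions: the HTML sample and the function scrapeWithRegex are hypothetical, and returning every group of every match is a guess at how the scraper handles multiple groups.

```javascript
// Hypothetical HTML source; only the Shopify.country line matters here.
const html = '<script>Shopify.locale = "en"; Shopify.country = "US";</script>';

// Returns the group text when the pattern has capturing groups,
// otherwise the text matching the whole pattern.
function scrapeWithRegex(source, pattern) {
  const regex = new RegExp(pattern, "gi"); // g: all matches, i: case insensitive
  const results = [];
  for (const match of source.matchAll(regex)) {
    if (match.length > 1) {
      // match[1..] hold the capturing groups; skip groups that did not participate.
      results.push(...match.slice(1).filter((group) => group !== undefined));
    } else {
      results.push(match[0]);
    }
  }
  return results;
}

const withGroup = scrapeWithRegex(html, String.raw`Shopify\.country\s=\s"(\w+)";`);
const wholeLine = scrapeWithRegex(html, String.raw`Shopify\.country\s=\s"\w+";`);

console.log(withGroup); // [ 'US' ]
console.log(wholeLine); // [ 'Shopify.country = "US";' ]
```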

Use Cases

Lead Generation

Bulk scraping allows you to enrich URLs from various sources, such as directories, social media platforms, forums, etc. Using CSS Selectors, bulk scraping lets you get structured information from HTML pages.

Price Monitoring

E-commerce businesses can use URL scraping to monitor competitor websites and track product prices, discounts, and promotions. This information can be used for competitive intelligence and pricing strategies to stay competitive in the market.

Job Board Scraping

Job boards often contain valuable job posting information. Scraping URLs from job boards allows businesses to aggregate job postings automatically, providing insights into hiring trends, job requirements, and competitor recruitment strategies.

Enrichment Reference

Settings

Use Proxy when needed (Boolean)

Use a proxy to access protected pages and to bypass rate limits, CAPTCHAs, etc. The enrichment cost is 0.50 credits per scrape with proxy. Disabled by default.

Render HTML (Boolean)

Enable this setting to render the page in a headless browser before scraping. Use it for scraping JavaScript-rendered URLs. A proxy is automatically applied to each request. Costs 2 credits per scrape. Disabled by default.

Proxy Policy (Text)

Defines when the proxy is used. Default: Proxy enabled on scraping error.

Disable cached data (Boolean)

Datablist caches the page content for 7 days. If the page is already in the cache, the cost is just 0.05 credits per cached URL (whether it comes from the proxy or not).

Css Selectors (MultipleValues)

Matches pattern query against HTML tree. See https://www.w3schools.com/cssref/css_selectors.asp . Test your CSS Selector on https://scrapfly.io/web-scraping-tools/css-xpath-tester

Extract using Regular Expressions (Boolean)

Matches Regular Expressions against HTML code.

Regular Expressions (MultipleValues)

Define one or more Regular Expressions. Regular Expressions are case insensitive. Test your Regex on https://regex101.com/.

Extract data from a JSON object inside the HTML code. (Boolean)

Some webpages have JSON available in their HTML code. Use this setting to define the path to the JSON data, and to define the JSONPaths to extract information.

Path to the JSON object (Text)

Use a CSS Selector to define the path to the <script> tag with the JSON data. Example: 'script#__NEXT_DATA__' to target Next.js data, or 'script#ng-state' for Angular data.

JSONPath expressions (MultipleValues)

Define one or more JSONPath expressions.
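
To illustrate how the JSON settings fit together, the sketch below locates a <script> tag, parses its content, and reads values from it. Everything in it is an assumption made for illustration: the page source, the function names, the regex-based tag lookup (the scraper uses a CSS Selector), and the tiny path resolver, which only supports dot keys and numeric indexes and is not a full JSONPath implementation.

```javascript
// Hypothetical page source with Next.js data embedded in a <script> tag.
const pageHtml =
  '<html><body><script id="__NEXT_DATA__" type="application/json">' +
  '{"props":{"pageProps":{"product":{"name":"New Phone","prices":[499,549]}}}}' +
  "</script></body></html>";

// Find the <script> tag by its id and parse its content as JSON.
// Assumes scriptId contains no regex special characters.
function extractJson(source, scriptId) {
  const regex = new RegExp(
    `<script[^>]*id="${scriptId}"[^>]*>([\\s\\S]*?)</script>`,
    "i"
  );
  const match = source.match(regex);
  return match ? JSON.parse(match[1]) : null;
}

// Resolve a small JSONPath subset such as "$.a.b[0]".
function resolvePath(data, path) {
  const tokens = path.replace(/^\$/, "").match(/[^.\[\]]+/g) || [];
  return tokens.reduce(
    (node, key) => (node == null ? undefined : node[key]),
    data
  );
}

const nextData = extractJson(pageHtml, "__NEXT_DATA__");
console.log(resolvePath(nextData, "$.props.pageProps.product.name")); // New Phone
console.log(resolvePath(nextData, "$.props.pageProps.product.prices[1]")); // 549
```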

Inputs

Url to parse

Outputs

URL Scraping Status (Text)

One of the following: success, unreachable, http_error (error returned from the server), invalid_url, or empty_url
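
After a run, a quick tally of this status column shows how many URLs need attention. The sketch below assumes the results were exported as a list of objects. The rows, the status property name, and the function name are hypothetical; only the five status values come from this page.

```javascript
// Status values listed above.
const STATUSES = ["success", "unreachable", "http_error", "invalid_url", "empty_url"];

// rows: hypothetical exported items; "status" stands in for the
// "URL Scraping Status" output column.
function countByStatus(rows) {
  const counts = Object.fromEntries(STATUSES.map((status) => [status, 0]));
  for (const row of rows) {
    if (!STATUSES.includes(row.status)) {
      throw new Error(`Unexpected status: ${row.status}`);
    }
    counts[row.status] += 1;
  }
  return counts;
}

const sampleRows = [
  { url: "https://example.com/a", status: "success" },
  { url: "https://example.com/b", status: "http_error" },
  { url: "", status: "empty_url" },
  { url: "https://example.com/c", status: "success" },
];

const statusCounts = countByStatus(sampleRows);
console.log(statusCounts.success); // 2
console.log(statusCounts.http_error); // 1
```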
