Extract emails, URLs, domains, mentions, tags, etc. from unstructured text

The Datablist Data Extractor extracts structured information from unstructured text.

For data cleaning, the Data Extractor helps you find entities in your CSV texts. When dealing with scraped data, this tool finds URLs and email addresses in scraped text to build your lead list.

The data extractor uses pattern recognition to find:

Email Addresses in texts
URLs in texts
Domains from email addresses
Domains from URLs
Mentions (ex: @name) in texts
Tags (ex: #tag) in texts

How to use the Data Extractor

Step 1: Open the Data Extractor

The Data Extractor chooses the items to process in this order of priority:

If you have selected items in your collection, it will process them
If you have a filter or a full-text search term, it will process the filtered items
Otherwise, it will process all your collection items
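
The priority order above can be sketched as a small function. This is only an illustration of the documented rule, not Datablist's code: the function name `items_to_process` and its parameters are hypothetical.

```python
def items_to_process(all_items, selected=None, filtered=None):
    """Pick the items to process, following the documented priority.

    `selected` and `filtered` are hypothetical inputs: the manually selected
    items, and the items matching an active filter or search term
    (None when no filter or search is active).
    """
    if selected:                 # 1. an explicit selection wins
        return list(selected)
    if filtered is not None:     # 2. then the filtered or searched items
        return list(filtered)
    return list(all_items)       # 3. otherwise the whole collection
```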

Datablist Data Extractor is available from the "Edit" menu. Just click on the "Extract url, email, tag, etc." menu item.

Step 2: Select a property with unstructured text

Then, select the property from your collection you want to extract data from.

Step 3: Run the Data Extractor

A dry run lets you get a preview of the extraction on the first 10 items.

If the preview results suit you, click "Extract data" to process your items.

Note: The dry run is reset each time the Data Extractor options change.

Once finished, a summary is displayed with the number of items processed, and your data table is updated with the results.
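
To make the difference between the dry run and the full run concrete, here is a minimal sketch. It assumes items are plain dictionaries and that `extract` is any function turning a text into a result; the names `dry_run`, `run`, `source`, and `target` are hypothetical and do not describe Datablist internals.

```python
PREVIEW_SIZE = 10  # the dry run previews the first 10 items


def dry_run(items, source, extract):
    """Return previews for the first items without modifying them."""
    return [extract(item.get(source, "")) for item in items[:PREVIEW_SIZE]]


def run(items, source, target, extract):
    """Write the result into every item and return the number processed."""
    for item in items:
        item[target] = extract(item.get(source, ""))
    return len(items)
```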

Dealing with multiple results

When multiple entities are found in your text, the Data Extractor returns all of them, separated by commas.

Extractor descriptions

Extract the domain from an email address

This extractor takes a property with email addresses and returns the domain with the extension.

Examples:

contact@datablist.com -> datablist.com
jean.bond@gmail.uk.co -> gmail.uk.co

Note: If the email address is invalid, it returns an empty value.
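
The same behavior can be approximated outside Datablist with a short function. This is a sketch under assumptions: the validity check below is deliberately simple, and the exact rules Datablist applies are not documented here.

```python
import re

# Assumption: an address is "valid" if it has one @, no spaces, and a dotted
# domain. Datablist's own validation may be stricter or looser.
_EMAIL_RE = re.compile(r"^[^@\s]+@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)$")


def email_domain(value: str) -> str:
    """Return the domain of an email address, or "" if it looks invalid."""
    match = _EMAIL_RE.match(value.strip())
    return match.group(1).lower() if match else ""
```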

Extract email addresses from a text

This extractor parses a text property to find one or several email addresses.

For example, from a text such as:

Please contact us at name@gmail.com for any inquiries about XXX.
Or use the following email address for customer support questions: support@xxx.com

The extractor returns: name@gmail.com,support@xxx.com
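
As an illustration of this kind of pattern matching, the following sketch finds email addresses with a regular expression and joins them with commas. The pattern is an assumption chosen for readability, not the one Datablist uses.

```python
import re

# Assumed pattern: a local part, an @, and a dotted domain.
_EMAIL_IN_TEXT = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
)


def extract_emails(text: str) -> str:
    """Return every email address found in the text, comma-separated."""
    return ",".join(_EMAIL_IN_TEXT.findall(text))
```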

Extract URLs from a text

This extractor parses a text property to find one or several URLs.

For example, from a text such as:

Visit our online documentation at https://docs.datablist.com

The extractor returns: https://docs.datablist.com

Note: To be valid, URLs must be absolute and include a scheme (https, http, ftp, etc.). A partial URL like doc.datablist.com won't be returned.
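
A rough equivalent of this rule, written as a sketch: only strings that start with a scheme followed by :// are kept. The pattern and the trailing-punctuation cleanup are assumptions, not Datablist's implementation.

```python
import re

# Assumed pattern: a scheme (https, http, ftp, ...), then ://, then
# everything up to the next whitespace or angle bracket.
_URL_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://[^\s<>]+")


def extract_urls(text: str) -> str:
    """Return every absolute URL found in the text, comma-separated."""
    # Drop punctuation that usually belongs to the sentence, not the URL.
    urls = [match.rstrip(".,;:!?)") for match in _URL_RE.findall(text)]
    return ",".join(urls)
```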

Extract the domain from a URL

This extractor takes a property with a URL and returns the domain with the extension.

Examples:

https://www.datablist.com -> datablist.com
https://www.google.io/test/path/string.html -> google.io

Note: If the URL is invalid, it returns an empty value.
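
The examples above can be reproduced with Python's standard `urllib.parse` module. This sketch assumes that only a leading "www." is removed from the host name; how Datablist treats other subdomains is not stated here.

```python
from urllib.parse import urlsplit


def url_domain(value: str) -> str:
    """Return the domain of an absolute URL, or "" if it looks invalid."""
    try:
        parts = urlsplit(value.strip())
        host = parts.hostname or ""
    except ValueError:  # raised for malformed URLs, e.g. a broken IPv6 host
        return ""
    if not parts.scheme or "." not in host:
        return ""
    # Assumption: strip only a leading "www.", keep any other subdomain.
    return host[4:] if host.startswith("www.") else host
```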

Extract mentions from a text

This extractor parses a text and returns the mentions in it. The @ character is also returned.

Example:

Mum, friend with @pseudo and married with @pseudo2

Returns @pseudo,@pseudo2
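
A minimal sketch of mention extraction with a regular expression. The pattern is an assumption: it treats an @ followed by letters, digits, or underscores as a mention, and skips an @ that sits inside a word so that email addresses are not picked up.

```python
import re

# Assumed pattern: "@name" not preceded by a word character or a dot.
_MENTION_RE = re.compile(r"(?<![\w.])@\w+")


def extract_mentions(text: str) -> str:
    """Return every mention found in the text, @ included, comma-separated."""
    return ",".join(_MENTION_RE.findall(text))
```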

Extract tags from a text

This extractor parses a text and returns the tags in it. The # character is also returned.

Example:

Live in #paris. #workhardplayhard

Returns #paris,#workhardplayhard
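
Tags can be found the same way. In this sketch, the pattern is an assumption: a # followed by letters, digits, or underscores, not glued to a preceding word.

```python
import re

# Assumed pattern: "#tag" not preceded by a word character.
_TAG_RE = re.compile(r"(?<!\w)#\w+")


def extract_tags(text: str) -> str:
    """Return every tag found in the text, # included, comma-separated."""
    return ",".join(_TAG_RE.findall(text))
```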
