Scraping Products From E-Commerce Websites With TexAu



Document image

Scraping eCommerce websites is very useful for spotting trends, comparing pricing strategies, or running competitive analysis.

Most of you may think TexAu is a scraping tool made only for lead generation.

But did you know you could also extract the FULL inventory of most eCommerce websites, like Shopify stores, with it?

It sounds crazy, but yes, you can!

The scraping process takes only minutes. You can scrape product prices, product links, product details, and product image links, and gather it all in a Google Sheet.

First, we need to find a trending eCommerce website to scrape. But how?

In this tutorial, we will focus on websites using the Shopify eCommerce platform. But, thankfully, you can apply the same logic to any shopping cart system like WooCommerce, Magento, PrestaShop, OpenCart, and the like.

This will work on most stores that have basic schema markup implemented on their website.

None of the methods presented in this tutorial require proxies or IP rotation, unlike massive eCommerce platforms such as Amazon or AliExpress.

How To Find Shopify Stores?

Method #1: Using Google Advanced Search

First, let's jump to Google Search and use the advanced search below:

Text
site:myshopify.com "shoes"

The above search narrows results down to Shopify stores hosted on the "myshopify.com" domain that rank for the keyword "shoes".

Document image

Note that the results will depend on your Google Search preferences. You can change the language and location here:

Text
https://www.google.com/preferences
Document image

The above search will give you the top-ranking Shopify stores selling shoes:

Document image

Here we see multiple results from one store, which redirects to nobullproject.com:

Document image

At first glance in Ahrefs, the site's organic traffic is growing steadily:

Document image

So it could be interesting to dig deeper and see which of that store's products rank in Google Search:

Document image

From there, you could scrape each product link from the SERPs using the TexAu Google Search spice:

Document image

Method #2: Using The Source Code Search Engine

Another way to find a website's technologies is to look at its pages' source code. For this, you can use the source code search engine:

Text
https://publicwww.com/

Here, type "myshopify.com", and this will return all the websites that have this URL in their source code:

Document image

From there, you can export the results as a CSV file. The number of results is limited on the free plan but is still usable.

Document image

Method #3: Using Nerdydata

Document image

Very similar to the source code search engine but with a better UI, Nerdydata lets you search the source code of website pages to find the technologies in use:

Text
https://www.nerdydata.com/
Document image

Method #4: Using MYIP.MS

Dropshippers have used this one for years to find trending Shopify stores: MYIP.MS.

Text
https://myip.ms/
Document image

First, enter one of those Shopify IP addresses in the search bar:

Text
23.227.38.32

That way, we will get a list of all the eCommerce sites using Shopify. Then, at the bottom of the page, click "View All Records":

Document image

To narrow the list down to top-ranking eCommerce sites, use the "Site Popular Rating" filter and set it between 1,000 and 10,000:

Document image

Finally, export the list of eCommerce sites; we will use it later in TexAu:

Document image

Method #5: Using BuiltWith

BuiltWith is one of the best tools out there to find technologies used by eCommerce sites.

From there, go to the "Reports" menu, then "Technology Report":

Document image

Here, you can select one of the auto-suggested results or get results for all sites. Then create a report:

Document image

Select the newly created report:

Document image

Finally, sort the results by technology spend, social media followers, or traffic. Then manually pick each eCommerce website you want to export in CSV format:

Document image

Method #6: Using TechTracker

Lastly, another great tool is TechTracker:

Document image
Document image
Text
|
  • Select Shopify in the "Technology Name" field.
  • Pick the websites you want to export.
  • Hit "Save my Lead Report".
  • Then go to "Saved Leads".
Document image

In "Saved Leads", click "View my Lead Report":

Document image

Finally, download your report as a CSV file:

Document image

Now that we have a bunch of interesting Shopify stores, how do we scrape their inventory?

How To Scrape An eCommerce Store Using Shopify?

Let's take a famous Shopify store: MATT & NAT, the leading vegan bags brand.

Document image

There are multiple web scraping techniques to extract data on a website:

  • CSS selectors
  • JavaScript variables
  • XHR
  • Microdata
  • JSON-LD

Let's illustrate these techniques using the Chrome developer console (Chrome DevTools). You can open it with F12 (or Fn + F12) on Windows, or Command + Option + I on macOS.

CSS selectors:

The most common way to scrape an eCommerce website is through its CSS selectors.

CSS selectors give the location of an HTML element on the page.

This method works well when dealing with static content like text, images, and videos.

But for JavaScript variables, AJAX requests, microformats, JSON-LD, or Microdata, it isn't effective.

Here is an example of a CSS class selector locating the product price on the page:

Document image
Text
|
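
If you want to reproduce this in code, here is a minimal Python sketch using requests and BeautifulSoup. Everything store-specific in it (the URL and the ".price" selector) is a hypothetical placeholder:

Python
# Minimal CSS-selector sketch using requests + BeautifulSoup.
# The ".price" selector is hypothetical: inspect the page in DevTools
# to find the real class name used by the store's theme.
import requests
from bs4 import BeautifulSoup

url = "https://example-store.com/products/some-product"  # placeholder URL
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

price_tag = soup.select_one(".price")  # CSS class selector
if price_tag:
    print(price_tag.get_text(strip=True))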

Microdata:

Microdata is a markup language designed to describe structured data about web pages. It was developed as part of the WHATWG HTML specification.

It provides a standard vocabulary for describing products, people, organizations, events, or places. It also includes the relationships among them and allows metadata nesting.

According to w3techs.com, here is a list of the most popular sites using Microdata:

  • Google.com
  • Facebook
  • Youtube.com
  • Amazon.com
  • Microsoft.com
  • Office.com
  • Ebay.com
  • Msn.com
  • Dropbox.com

A related example is Facebook OpenGraph, whose meta tags are placed in the <head> element of the site to describe the page in a machine-readable way:

Document image
HTML
|
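
To give an idea of how Microdata can be read programmatically, here is a minimal Python sketch that collects every itemprop attribute on a page (the URL is a placeholder):

Python
# Sketch: list every Microdata itemprop found on a page.
import requests
from bs4 import BeautifulSoup

url = "https://example-store.com/products/some-product"  # placeholder URL
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

for tag in soup.find_all(attrs={"itemprop": True}):
    # The value usually sits in a "content" attribute or in the tag text.
    value = tag.get("content") or tag.get_text(strip=True)
    print(tag["itemprop"], "=>", value)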

JavaScript variables:

JavaScript variables store information inside objects; they act as containers holding a value.

These variables can hold product information like stock levels, SKUs, or names, for instance.

Here is an example with the "Shopify" variable, which contains information such as the store's internal "myshopify" domain, its Shopify store ID, currency, language, and image CDN:

HTML
|
Document image
Document image
JSON
|
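
Since these variables live in inline script tags, you can often pull them out of the raw HTML with a regular expression. A small Python sketch, assuming the theme assigns Shopify.shop the way most storefronts do:

Python
# Sketch: pull the inline "Shopify" JavaScript variable out of the raw HTML.
# Assumes the theme assigns Shopify.shop = "..." the way most storefronts do.
import re
import requests

html = requests.get("https://example-store.com/", timeout=30).text  # placeholder URL

match = re.search(r'Shopify\.shop\s*=\s*"([^"]+)"', html)
if match:
    print("Internal myshopify domain:", match.group(1))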

XHRs:

XMLHttpRequest (XHR) is a JavaScript API used to send AJAX requests from a web browser to a server. XHRs let you scrape dynamic data without rendering the page, since you can replay the same request directly against the server's API.

A practical example is when a page loads products through infinite scroll, click-next, or "load more" buttons without changing the page's URL.
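
On Shopify specifically, many stores expose a public products.json endpoint that the storefront itself calls via XHR. Replaying that request gives you paginated product data without loading any page; a sketch, assuming the endpoint hasn't been disabled:

Python
# Sketch: replay the public products.json endpoint exposed by many Shopify
# stores (some disable it). Pagination caps at 250 products per page.
import requests

base = "https://example-store.com"  # placeholder store domain
page = 1
while True:
    resp = requests.get(f"{base}/products.json",
                        params={"limit": 250, "page": page}, timeout=30)
    products = resp.json().get("products", [])
    if not products:
        break
    for product in products:
        print(product["title"], product["handle"])
    page += 1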

JSON-LD:

Document image
JSON
|

We will use the JSON-LD method to scrape the entire eCommerce website inventory using TexAu.

Document image

What Is JSON-LD?

Document image

JSON-LD stands for JavaScript Object Notation for Linked Data.

It uses the schema markup vocabulary as defined by Schema.org.

The underlying Schema.org vocabulary was launched in the early 2010s by Google, Bing, and Yahoo! to promote a more descriptive way to expose structured data to search engines. By providing machine-readable data, it improves websites' indexability and user experience.

JSON-LD is an embeddable script usually placed in the <head> section of a page instead of wrapping HTML elements in the body.

Google recommends using JSON-LD over older formatting options like Microdata and RDFa to better inform search engines about linked data on a web page.

It also plays a crucial role in SEO, helping search engine bots better index your website.

In the case of our Shopify site, JSON-LD also contains real-time eCommerce data like stock quantity and product availability.

According to W3techs.com again, JSON-LD is used by more than 39% of websites today:

Document image

Advertising networks also use JSON-LD data to display product details taken from the website:

Document image
Document image

JSON-LD lists objects and their property entities. It also allows nesting and defines parent and child properties.

Document image

In the example above, "Product" is an entity. The "Offer" entity (a product variation) is a child of the "Product" entity (its parent). The same goes for the "price" property, a child of the "Offer" entity.

Here is an excerpt of the JSON-LD for the product above. As you can see, it contains all the information for the "BLACK" product variation on the page:

JSON
|
Document image
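
For reference, here is a minimal Python sketch of the extraction we are about to do with TexAu: grab every application/ld+json script on a product page and keep the Product entity (field names follow Schema.org, but the exact shape varies by theme):

Python
# Sketch: extract the JSON-LD blocks of a product page and read the
# Product entity. The exact shape varies by theme, hence the hedging.
import json
import requests
from bs4 import BeautifulSoup

url = "https://example-store.com/products/some-product"  # placeholder URL
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

for script in soup.find_all("script", type="application/ld+json"):
    try:
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue
    if isinstance(data, dict) and data.get("@type") == "Product":
        # "offers" may be a single object or a list of product variations.
        offers = data.get("offers", [])
        for offer in offers if isinstance(offers, list) else [offers]:
            print(data.get("name"), offer.get("sku"),
                  offer.get("price"), offer.get("availability"))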

OK, so that's what we will scrape on each product page. But how do we find ALL the product pages in the first place? By scraping the site's XML sitemap, of course!

What Is An XML Sitemap?

A sitemap index file is a collection of all the internal links within a single website. Sitemaps are often called "static resources" because they do not change frequently.

The purpose of a sitemap is to expose the website's navigation structure.

Sitemaps allow search engine crawlers to follow those links and discover new content.
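
If you prefer code over the point-and-click tools below, a short Python sketch can do the same job, assuming the store exposes the usual /sitemap.xml index:

Python
# Sketch: walk a Shopify sitemap index and keep only product page URLs.
# Assumes the usual /sitemap.xml index pointing to product sub-sitemaps.
import requests
import xml.etree.ElementTree as ET

LOC = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"
base = "https://example-store.com"  # placeholder store domain

def locs(url):
    """Return every <loc> URL listed in a sitemap file."""
    root = ET.fromstring(requests.get(url, timeout=30).content)
    return [el.text for el in root.iter(LOC)]

# The index points to sub-sitemaps (products, pages, collections, blogs...).
product_urls = [u for sub in locs(f"{base}/sitemap.xml") if "products" in sub
                for u in locs(sub) if u and "/products/" in u]
print(len(product_urls), "product URLs found")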

Method #1: Extract An XML Sitemap With XML-Sitemaps

XML-Sitemaps is a super handy tool to harvest the sitemap of a site. It also offers excellent extras like a sitemap generator to create a sitemap index file you can upload to your site and submit to Google Search Console and Bing Webmaster Tools.

There are also different types of sitemaps using the XML file format. RSS feeds, also called "news sitemaps", are one of them; they allow articles to be indexed by Google News, for instance.

XML-Sitemaps is free for up to 500 page URLs, which is enough for most medium-sized eCommerce websites.

Text
https://www.xml-sitemaps.com/

First, enter the website homepage URL in the search bar and hit the "Start" button to launch the crawl:

Document image

Wait 15-20 minutes for the crawl to complete:

Document image

Once finished, you can export the results to a CSV file:

Document image

Or better, scroll to the bottom of the results page and hit the "View HTML Sitemap" button. There, scroll down and copy all the product URLs located under the "products/" directory. For this, you can use a handy Chrome extension called "Copy Selected Links":

Document image
Document image
Text
|

Once you have copied the links to the clipboard, paste them into a Google Sheet column:

Document image

Method #2: Extract An XML Sitemap With Screaming Frog SEO Spider

Document image

Screaming Frog SEO Spider is the GOLD standard of technical SEO and site auditing.

It has many features under the hood, such as:

  • finding broken links
  • analyzing headers and metadata
  • discovering duplicate content
  • scraping via CSS selectors and XPath
  • generating sitemap files
  • and many more
Text
https://www.screamingfrog.co.uk/seo-spider/

Like XML-Sitemaps, Screaming Frog is free for up to 500 URLs. However, unlike XML-Sitemaps, Screaming Frog is a desktop application you have to install on your computer.

First, enter the website homepage URL in the search bar and press Enter to launch the crawl:

Text
https://mattandnat.com/
Document image

Wait 15-20 minutes for the crawl to complete. Once done, head over to the "Sitemaps" tab, and in the search filter field, enter this URL pattern:

Text
/products/

Most Shopify stores follow the same pattern, e.g., "storedomain.com/products/".

Here we will export to Google Sheets all the product links under the "*/products/" directory path:

Document image
Document image

Upon completion, click "Export":

Document image

Here, choose Google Sheets as the export method and give your sheet a name. Then click the "Manage" button to authorize your Google account, and hit "Save":

Document image
Document image

After that, the results will be sent to a new Google Sheet:

Document image

Once your Google Drive is authorized, name your export and click "OK". The selected results will be uploaded to Google Sheets:

Document image

How To Scrape A Shopify Store With TexAu?

Now that we have all the product links in a Google Sheet, we will use TexAu to extract the JSON-LD data from all these pages:

Document image

From the TexAu Spice Store, select the "Get The JSON LD Of A Website" spice:

Document image

Copy the link of the Google Sheet containing the product links, and make sure it's shared in "Viewer" mode so it's publicly accessible:

Document image

Then copy and paste the sheet link into the TexAu Google Sheet URL field, select the sheet column containing the product links, and launch the spice:

Document image

After 10-20 minutes, you will get all the JSON-LD data extracted from the sitemap URLs.

Document image

Head over to the TexAu results menu and download the data as a CSV file:

Document image

Finally, let's import the CSV file into a new Google Sheet:

Document image
Document image

Once you have imported the file, increase the first row's height to verify that all the JSON-LD data is there (otherwise, you won't see it):

Document image

Now, select and copy the whole column containing the data we want (except the header), then paste the results into a text file in your notepad:

Document image

Make sure the pasted data doesn't start and end with quotation marks (" "); each row should start and end with curly braces ({ }).

Document image

Now we are reaching the end of our tutorial, and it's time to make the magic happen 🤩. We will use an online JSON-to-CSV converter to decode all the JSON-LD data we collected.

Go to the website below:

Text
|

Copy and paste all the code from your notepad into the field below:

Document image

You will see a nice table preview with all the data in order. Then, download the results as a CSV file, upload it to a new Google Sheet, and enjoy!

Document image
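
And if you'd rather skip the online converter altogether, the same conversion can be done locally with a short Python script (file names are placeholders). It also handles the quote-wrapping issue mentioned above:

Python
# Sketch: convert the exported JSON-LD rows to a flat CSV locally.
# File names are placeholders; adjust to your own export.
import json
import pandas as pd

rows = []
with open("jsonld_export.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        # Google Sheets sometimes wraps cells in quotes and doubles inner quotes.
        if line.startswith('"') and line.endswith('"'):
            line = line[1:-1].replace('""', '"')
        if line.startswith("{"):
            rows.append(json.loads(line))

# json_normalize flattens nested entities like Product -> offers -> price.
pd.json_normalize(rows).to_csv("products.csv", index=False)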

You will have the store's full product inventory with:

  • product links
  • product names
  • product SKUs
  • product availability
  • product quantity
  • product featured images
  • product variations
Document image

AND VOILA!

I hope you liked this tutorial. Next, I'll show you how to use a similar method to extract emails and phone numbers from directories and collect them all in a clean table 😛.

As a side note, you will soon be able to do this natively in TexAu with our upcoming sitemap module. But until then 🤐.

Stay tuned.



Updated 11 May 2022