Scraping Products From E-Commerce Websites With TexAu

Scraping eCommerce websites is very useful for spotting trends, comparing pricing strategies, or running a competitive analysis.
Most of you may think TexAu is only a scraping tool made for lead generation.
But did you know you could also use it to extract the FULL inventory of most eCommerce websites, such as Shopify stores?
It sounds crazy, but yes, you can!
The scraping process takes only minutes. You can scrape product prices, product links, product details, and product image links, and gather it all in a Google Sheet.
First, we need to find a trending eCommerce website to scrape. But how?
In this tutorial, we will focus on websites using the Shopify eCommerce platform. But, thankfully, you can apply the same logic to any shopping cart system, from WooCommerce and Magento to PrestaShop, OpenCart, and the like.
This will work on most stores that have basic schema markup implemented on their website.
None of the methods presented in this tutorial require proxies or IP rotation, unlike scraping massive eCommerce platforms such as Amazon or AliExpress.
How To Find Shopify Stores?
Method #1: Using Google Advanced Search
First, let's jump to Google Search and use the advanced search below:
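```
site:myshopify.com "shoes"
```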
The above search narrows results down to Shopify sites hosted on the "myshopify.com" domain for the keyword "shoes".

Note that the results will depend on your Google Search preferences. You can change the language and location here:

The above search will give you the top-ranking Shopify stores selling shoes:

Here we see multiple results from one store, which redirects to nobullproject.com:

At first glance in Ahrefs, the site's organic traffic is growing steadily:

So it could be interesting to dig deeper and see which of the store's products are ranking in Google Search:

From there, you could scrape each product link from the SERPs using TexAu's Google Search spice:

Method #2: Using The Source Code Search Engine
Another way to find a website's technologies is to look at its pages' source code. For this, you can use a source code search engine:
Here, type "myshopify.com", and the search will return all the websites that have this URL in their source code:

From there, you can export the results as a CSV file. The number of results is limited in the free plan, but it is still plenty to work with.

Method #3: Using Nerdydata

Very similar to the source code search engine but with a better UI, Nerdydata lets you search the source code of web pages to find the technologies in use:

Method #4: Using MYIP.MS
Dropshippers have used this one for years to find trending Shopify stores: MYIP.MS.

First, enter one of Shopify's IP addresses in the search bar:
That way, we get a list of all the eCommerce sites using Shopify. Then, at the bottom of the page, click "View All Records":

To narrow down top-ranking eCommerce sites, use the "Site Popular Rating" filter and set it between 1,000 and 10,000:

Finally, export the list of eCommerce sites; we will use it later in TexAu:

Method #5: Using BuiltWith
BuiltWith is one of the best tools out there to find technologies used by eCommerce sites.
From there, go to the "Reports" menu, then "Technology Report":

Here, you can select any of the auto-suggested results or get results for all sites. Then create a report:

Select the obtained report:

Finally, sort the results by technology spending, social media followers, or traffic. Then manually pick each eCommerce website you want to export in CSV format:

Method #6: Using TechTracker
Last, another great tool is TechTracker:


- Select Shopify in the "Technology Name" field.
- Pick the websites you want to export.
- Hit "Save my Lead Report".
- Then go to "Saved Leads".

In "Saved Leads", click "View my Lead Report":

Finally, download your report as a CSV file:

Now that we have a bunch of interesting Shopify stores, how do we scrape their inventory?
How To Scrape An eCommerce Store Running On Shopify?
Let's take a famous Shopify store: MATT & NAT, the leading vegan bags brand.

There are multiple web scraping techniques to extract data on a website:
- CSS selectors
- JavaScript variables
- XHR
- Microdata
- JSON-LD
Let's illustrate these techniques using the Chrome developer console (Chrome DevTools). You can open it with F12 (or Fn + F12) on Windows, or Command + Option + I on macOS.
CSS selectors:
The most common way to scrape an eCommerce website is through its CSS selectors.
A CSS selector gives the location of an HTML element on the page.
This method works well when dealing with static content like text, images, and videos.
But for JavaScript-generated content, AJAX requests, microformats, JSON-LD, or Microdata, this method isn't efficient.
Here is an example of a CSS class selector locating the product price on the page:
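If you want to reproduce this outside the browser, here is a minimal sketch in Python using requests and BeautifulSoup. The product URL and the `.price-item` class are assumptions, so check the real selector in DevTools first:

```python
# A minimal sketch: locate a product price with a CSS class selector.
# The URL and the ".price-item" class are assumptions -- inspect the
# page in DevTools to find the real selector, which varies by theme.
import requests
from bs4 import BeautifulSoup

url = "https://example-store.com/products/some-product"  # placeholder
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(html, "html.parser")

price_tag = soup.select_one(".price-item")
if price_tag:
    print(price_tag.get_text(strip=True))
```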

Microdata:
Microdata is a markup language designed to describe structured data about web pages. It was introduced as part of the HTML5 specification effort.
It provides a standard vocabulary for describing products, people, organizations, events, or places. It also includes the relationships among them and allows metadata nesting.
According to w3techs.com, here is a list of the most popular sites using Microdata:
- Google.com
- Facebook.com
- YouTube.com
- Amazon.com
- Microsoft.com
- Office.com
- eBay.com
- MSN.com
- Dropbox.com
An example of such markup is Facebook's Open Graph, contained in meta tags placed in the <head> element of the site:
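To see how Microdata can be read programmatically, here is a minimal sketch with BeautifulSoup; the HTML snippet is illustrative, loosely following the schema.org Product vocabulary:

```python
# A minimal sketch: read Microdata "itemprop" values with BeautifulSoup.
# The HTML snippet is illustrative, loosely following schema.org/Product.
from bs4 import BeautifulSoup

html = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Vegan Tote Bag</span>
  <span itemprop="price" content="150.00">$150.00</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for tag in soup.select("[itemprop]"):
    # Prefer the machine-readable "content" attribute when present.
    print(tag["itemprop"], "=>", tag.get("content") or tag.get_text(strip=True))
```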

JavaScript variables:
JavaScript variables store information inside objects; they act as containers holding a value.
These variables can hold product information such as stock levels, SKUs, or product names.
Here is an example with the "Shopify" variable, containing information such as the store's internal "myshopify" domain, its Shopify store ID, currency, language, and image CDN:
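Outside the browser, one rough way to grab such a variable is a regex over the raw HTML; a sketch below, assuming the common `Shopify.shop = "..."` inline-script pattern (which varies by theme and Shopify version):

```python
# A rough sketch: extract the inline "Shopify" JavaScript variable from
# the raw HTML with a regex. The Shopify.shop = "..." pattern is an
# assumption -- the exact inline-script format varies between themes.
import re
import requests

html = requests.get("https://example-store.com",
                    headers={"User-Agent": "Mozilla/5.0"}).text

match = re.search(r'Shopify\.shop\s*=\s*"([^"]+)"', html)
if match:
    print("Internal myshopify domain:", match.group(1))
```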


XHRs:
XMLHttpRequest (XHR) is a JavaScript API that lets a web browser make AJAX requests to a server. XHRs allow scraping dynamic data without rendering the full page, since the data comes straight from the server's API.
A practical example is when a page loads products through an infinite scroll, a "click next", or a "load more" button without changing the page's URL.
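To illustrate, here is a minimal sketch that calls such an endpoint directly from Python; the endpoint URL and its `page` parameter are hypothetical, so find the real ones in the DevTools "Network > XHR" tab:

```python
# A minimal sketch: call the endpoint the page's XHR hits, directly from
# Python. The endpoint URL and the "page" parameter are hypothetical --
# find the real ones in the DevTools "Network > XHR" tab.
import requests

endpoint = "https://example-store.com/collections/shoes/products.json"
response = requests.get(endpoint, params={"page": 1},
                        headers={"User-Agent": "Mozilla/5.0"})
for product in response.json().get("products", []):
    print(product.get("title"))
```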
JSON-LD:

We will use the JSON-LD method to scrape the eCommerce website's entire inventory with TexAu.

What Is JSON-LD?

JSON-LD is the acronym for JavaScript Object Notation for Linked Data.
It uses the schema markup vocabulary as defined by Schema.org.
The format was pushed in the early 2010s by Google, Bing, and Yahoo! (through the Schema.org initiative) to promote a more descriptive way to expose structured data to search engines. By providing machine-readable data to search engines, it improves websites' indexability and user experience.
JSON-LD is an embeddable script, usually placed in the <head> section of a page instead of wrapping HTML elements in the body.
Google recommends using JSON-LD over older formatting options like Microdata and RDFa to better inform search engines about linked data on a web page.
It also plays a crucial role in SEO, helping search engine bots index your website better.
In the case of our Shopify site, JSON-LD also contains real-time eCommerce data like stock quantity and product availability.
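As a preview of roughly what TexAu will do for us later, here is a minimal sketch that pulls the JSON-LD blocks out of a single product page; the URL is a placeholder:

```python
# A minimal sketch: pull every JSON-LD block out of a product page and
# keep the schema.org "Product" ones. The URL is a placeholder.
import json
import requests
from bs4 import BeautifulSoup

url = "https://example-store.com/products/some-product"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(html, "html.parser")

for script in soup.find_all("script", type="application/ld+json"):
    try:
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue
    # Some sites wrap several objects in a list; normalize to a list.
    for item in (data if isinstance(data, list) else [data]):
        if isinstance(item, dict) and item.get("@type") == "Product":
            print(item.get("name"), "-", item.get("sku"))
```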
According to W3techs.com again, JSON-LD is used by more than 39% of websites today:

Advertising networks also use JSON-LD data to display product details taken from the website:


JSON-LD lists objects and their property entities. It also allows nesting, with parent and child properties.

In the example above, "Product" is a property entity. The "Offer" entity (a product variation) is a child of the "Product" entity (the parent). The same goes for "price", a child of the "Offer" entity.
Here is an excerpt of the JSON-LD for the above product. As you can see, it contains all the information for the "BLACK" product variation on the page:
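To make the nesting concrete, here is a simplified, made-up Product object in Python form; the values are invented, only the structure mirrors the schema.org shape:

```python
# A simplified, made-up Product object showing the Product -> Offer ->
# price nesting; values are invented, the structure follows schema.org.
product = {
    "@context": "https://schema.org/",
    "@type": "Product",
    "name": "Vegan Tote Bag - BLACK",
    "sku": "TOTE-BLK-001",
    "offers": [{
        "@type": "Offer",
        "price": "150.00",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    }],
}

# "price" is a child of "Offer", itself a child of "Product".
print(product["offers"][0]["price"])
```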

OK, that's what we will scrape on each product page. But how do we find ALL the product pages in the first place? By scraping the site's XML sitemap, of course!
What Is An XML Sitemap?
A sitemap is a file listing all the internal links (URLs) of a single website. We often call those "static resources" because they do not change frequently.
The purpose of a sitemap is to expose the website's navigation structure.
Sitemaps allow search engine crawlers to follow those links and discover new content.
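For the curious, here is a minimal sketch of what the tools below automate: fetch the sitemap index, follow the product sitemaps, and collect the product URLs. The `sitemap_products_N.xml` layout is typical of Shopify but should be treated as an assumption and verified in your browser:

```python
# A minimal sketch of what the tools below automate: read the sitemap
# index, follow the product sitemaps, and collect product URLs. The
# sitemap_products_N.xml layout is typical of Shopify but is an
# assumption -- verify it by opening /sitemap.xml in your browser.
import requests
import xml.etree.ElementTree as ET

LOC = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"
HEADERS = {"User-Agent": "Mozilla/5.0"}

def sitemap_locs(url):
    """Return every <loc> URL found in a sitemap file."""
    root = ET.fromstring(requests.get(url, headers=HEADERS).content)
    return [loc.text for loc in root.iter(LOC)]

product_urls = []
for child in sitemap_locs("https://example-store.com/sitemap.xml"):
    if "products" in child:
        product_urls.extend(u for u in sitemap_locs(child) if "/products/" in u)

print(len(product_urls), "product URLs found")
```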
Method #1: Extract An XML Sitemap With XML-Sitemaps
XML-Sitemaps is a super handy tool to harvest the sitemap of a site. It also offers excellent extras, like a sitemap generator that creates a sitemap index file you can upload to your site and submit to Google Search Console and Bing Webmaster Tools.
There are also other sitemap types built on the XML file format. RSS feeds, also used as "news sitemaps", are one of them; those allow articles to be indexed by Google News, for instance.
XML-Sitemaps is free for up to 500 page URLs, which is enough for most medium-sized eCommerce websites.
First, enter the website homepage URL in the search bar and hit the "Start" button to launch the crawl:

Wait 15-20 minutes for the crawl to complete:

Once finished, you can export the results to a CSV file:

Or better, scroll to the bottom of the results page and hit the "View HTML Sitemap" button. There, scroll down and copy all the product URLs located under the "products/" directory. For this, you can use the handy Chrome extension "Copy Selected Links":


Once you have copied the links to the clipboard, paste them into a Google Sheet column:

Method #2: Extract an XML-Sitemap With Screaming Frog SEO Spider

Screaming Frog SEO Spider is the GOLD standard of technical SEO and site auditing.
It has many features under the hood, such as:
- finding broken links
- analyzing headers and metadata
- discovering duplicate content
- scraping via CSS selectors and XPath
- generating sitemap files
- and many more
Like XML-Sitemaps, Screaming Frog is free for up to 500 URLs. Unlike XML-Sitemaps, however, it is a desktop application you have to install on your computer.
First, enter the website homepage URL in the search bar and press Enter to launch the crawl:

Wait 15-20 minutes for the crawl to complete. Once done, head over to the "Sitemaps" tab, and enter this path in the search filter field:
Most Shopify stores follow the same URL pattern, e.g., "storedomain.com/products/".
Here, we will export all the product links under the "*/products/" directory path to Google Sheets:


Upon completion, click "Export":

Here, choose Google Sheets as the export method and give your sheet a name. Then click the "Manage" button to authorize your Google account, and hit "Save":


After that, the results will be sent to a new Google Sheet:

Once your Google Drive is authorized, name your export and click "OK". The selected results will be uploaded to Google Sheets:

How To Scrape A Shopify Store With TexAu?
Now that we have all the product links in a Google Sheet, we will use TexAu to extract the JSON-LD data from all these pages:

From the TexAu Spice Store, select the "Get The JSON LD Of A Website" spice:

Copy the link of the Google Sheet containing the product links, and make sure it is shared in "Viewer" mode so it is publicly accessible:

Then paste the sheet link into the TexAu Google Sheet URL field, select the sheet column containing the product links, and launch the spice:

After 10-20 minutes, you will get all the JSON-LD data extracted from the sitemap URLs.

Head over to the TexAu results menu and download the data as a CSV file:

Finally, let's import the CSV file into a new Google Sheet:


Once you have imported the file, enlarge the first row's height to verify that all the JSON-LD data is there (otherwise, you won't see it):

Now, select and copy the whole column containing the data we want (except the header), then paste the results into a plain-text file in your text editor:

Make sure the pasted data doesn't start and end with quotation marks (" "); it should start and end with curly braces, like this: { }.

Now we are reaching the end of our tutorial, and it's time to make the magic happen 🤩. We will use an online JSON-to-CSV converter to decode all the JSON-LD data we collected.
Go to the website below:
Copy and paste all the code from your text file into the input field:

You will see a nice table preview with all the data in order. Then download the results as a CSV, upload the file to a new Google Sheet, and enjoy!
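If you'd rather skip the online converter, the same flattening can be done locally; a minimal sketch with pandas, assuming your extract is one schema.org Product object per line and each product carries a list of "offers":

```python
# A minimal sketch: flatten schema.org Product objects into a CSV, one
# row per Offer (product variation). The input format (one JSON object
# per line) and the field names are assumptions based on schema.org --
# adjust them to match what your extract actually contains.
import json
import pandas as pd

with open("jsonld_data.txt", encoding="utf-8") as f:
    products = [json.loads(line) for line in f if line.strip()]

df = pd.json_normalize(
    products,
    record_path="offers",            # one row per product variation
    meta=["name", "sku", "image"],   # parent Product fields to keep
    record_prefix="offer.",
    errors="ignore",                 # tolerate missing meta fields
)
df.to_csv("products.csv", index=False)
print(df.head())
```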

You will have the store's entire product inventory, with:
- product links
- product names
- product SKUs
- product availability
- product quantity
- product featured images
- product variations

AND VOILA!
I hope you liked this tutorial. Next, I'll show you how to use a similar method to extract emails and phone numbers from directories and collect them all in a clean table 😛.
As a side note, you will soon be able to do this natively in TexAu with our upcoming sitemap module. But until then 🤐.
Stay tuned.