DataDome is a real-time bot protection service. They are very proud of their bot detection technology, and their blog has tons of information for customers and scrapers alike.
I will share my viewpoint as someone who has been writing scraping scripts since I was a kid and has also worked on the security side of several projects.
Web scraping and bot protection are both arts, and two sides of the same coin: you need to collect data from others while protecting your own. It's a bit of a grey area if you ask me 😏.
NOTE: This is a post from a year ago, so it may not reflect the current state of DataDome or its bot protection.
Who is DataDome?
Big companies like TripAdvisor, Rakuten, Classmates, Celio, Fnac, etc. use DataDome to protect their websites. Anyone who has tried to create a bot for these websites has had a hard time playing cat and mouse with it.
The DataDome staff regularly read and analyse puppeteer-stealth and the various anti-bot-detection posts on the internet, then apply that knowledge to their system. A pretty effective way to deal with bots. 😎
They say they apply statistical and behavioral detection, can also detect Playwright, have implemented client-side detection, and so on 🔥.
It's very important to know your enemy and pick your tools carefully. But...
So I decided to put it to two little tests: one with normal means, and another with the latest web automation tools.
What's inside the test?
For the sake of the test, each tool will visit DataDome's own WordPress website at https://datadome.co.
Once a bot is detected, the site simply shows a page with a custom CAPTCHA that normal CAPTCHA-solving services cannot handle. No fancy reCAPTCHA or hCaptcha. Pretty impressive.
Example of blocked version
Example of Non-blocked version
The Test --- Part 1
I will just run some screenshot and page-speed testing services against the site. These services usually use headless browsers to collect their data, and some of the more advanced ones apply various techniques to avoid detection.
And the results were not that shocking:
9 out of 12 services were blocked by the DataDome protection.
Here are the sites that worked and the ones that did not.
Performance Tools
✅ KeyCDN
✅ Pingdom
❌ Google PageSpeed Insights
❌ GTmetrix
❌ WebPageTest
Website Screenshot Tools
✅ Site Shot
❌ Screenshot Machine
❌ Webcapture
❌ Capturefullpage
❌ Url2Png
❌ SmallSeoTools
❌ Page2Image
The protection is not without tradeoffs, though.
It is expensive: even the starter package is $1,190/mo, and you cannot protect your SPA or mobile app until you pay $5,990/mo. They target only big customers, which is understandable, but it's a no-go for small businesses.
The Test --- Part 2
The test will be very simple: we will write our own script where needed, or use a point-and-click solution where everything is handled behind the scenes.
😎 Bots are getting intelligent, and with a combination of residential IPs and stealth, they can usually get away with it.
😈 The only times a bot gets detected are when the IP/fingerprint is already blacklisted or when the scraping is done too aggressively. Whatever the case, getting aggressive is never good.
And the results were not that shocking for scrapers, but they might shock DataDome's customers.
2 out of 2 services bypassed the DataDome protection.
✅ ScraperAPI
🔗 Link 🕶 Failed once, passed all other times.
📜 ScraperAPI is pretty simple: you can use it as a proxy or as a regular scraping API. It returns the HTML source of the target page through residential IPs, can render JavaScript, and bypasses lots of simple bot detection services.
🤓 Developer Friendly. Use their ready-made API and toolkit.
A simple script for ScraperAPI
It can be done with NodeJS, curl, or the many other SDKs they provide. I'm doing the test with a curl request.
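Since ScraperAPI works as a simple HTTP endpoint, the request can also be sketched in Python with only the standard library. This is a hedged sketch: the endpoint `api.scraperapi.com` and the `api_key`, `url`, and `render` query parameters follow ScraperAPI's documented style at the time, but check their current docs, and note that `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_KEY = "YOUR_API_KEY"  # placeholder: put your real ScraperAPI key here

def scraperapi_url(target: str, render: bool = False) -> str:
    """Build a ScraperAPI request URL for the target page."""
    params = {"api_key": API_KEY, "url": target}
    if render:
        params["render"] = "true"  # ask ScraperAPI to execute JavaScript first
    return "http://api.scraperapi.com/?" + urlencode(params)

url = scraperapi_url("https://datadome.co", render=True)
# html = urlopen(url).read().decode("utf-8")  # needs a valid key and network
print(url)
```

This builds the same URL you would pass to curl, e.g. `curl "http://api.scraperapi.com/?api_key=...&url=https://datadome.co&render=true"`.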
Their website is around 1.4 MB without CSS and other assets, and the screenshot shows it downloaded everything, even if it was slow due to the proxies. DataDome could not detect it in either normal or render mode.
✅ Apify
🔗 Link 🕶 Failed once without stealth, passed all other times.
📜 Apify is a one-stop shop for all your web scraping, data extraction, and robotic process automation (RPA) needs. They provide ready-made tools, lots of libraries, and a developer-friendly toolkit.
🤓 Developer Friendly. Write the code yourself using its robust library.
I had to turn on both custom stealth and proxy mode, or it got blocked instantly.
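For reference, the stealth-plus-residential-proxy combination described above corresponds roughly to the actor input below, sketched as a Python dict. The field names (`useStealth`, `proxyConfiguration`, `apifyProxyGroups`) are assumptions based on how Apify's Puppeteer-based scrapers were configured around that time; check the actor's input schema before relying on them.

```python
import json

# Hypothetical input for an Apify Puppeteer-based scraper run.
# Field names are assumptions; verify against the actor's input schema.
run_input = {
    "startUrls": [{"url": "https://datadome.co"}],
    "useStealth": True,  # without stealth, the run was blocked instantly
    "proxyConfiguration": {
        "useApifyProxy": True,                # route traffic through Apify's proxy
        "apifyProxyGroups": ["RESIDENTIAL"],  # residential IPs, not datacenter
    },
}

print(json.dumps(run_input, indent=2))
```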
The screenshots showed a slight mismatch in the output, but it still works.