Aman Ruhela


Web Scraping with Python at Scale (Requests, BeautifulSoup, Splash & Tesseract)


With data at the heart of impactful decision making, web scraping becomes an indispensable tool, especially in the logistics space, where tracking consignments from different sources forms the backbone of many products. In this blog, I will discuss an efficient and scalable way to scrape data from different websites, with a special focus on the tools required.

Tools

Flask (web development framework for Python)
Requests (Python library for HTTP requests)
BeautifulSoup (HTML parsing)
Splash (JavaScript rendering engine)
Tesseract (OCR engine, used here to read text-based captchas)
PM2 (process manager, used here to manage Splash)
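To give a feel for how the first two scraping tools fit together, here is a minimal sketch rather than the exact code used in this blog: Requests fetches a page and BeautifulSoup pulls values out of the HTML. The function names and the `status` CSS class are hypothetical, chosen only for illustration; a real tracking page will have its own markup.

```python
from typing import List

import requests
from bs4 import BeautifulSoup


def extract_status(html: str) -> List[str]:
    """Return the text of every <td> carrying the (hypothetical) 'status' class."""
    soup = BeautifulSoup(html, "html.parser")
    return [cell.get_text(strip=True) for cell in soup.find_all("td", class_="status")]


def fetch_status(url: str, timeout: float = 10.0) -> List[str]:
    """Download a page with Requests and parse it with BeautifulSoup."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_status(response.text)


if __name__ == "__main__":
    # Parse an inline sample so the sketch runs without network access.
    sample = (
        "<table>"
        "<tr><td class='status'> In transit </td></tr>"
        "<tr><td class='status'>Delivered</td></tr>"
        "</table>"
    )
    print(extract_status(sample))
```

Keeping the parsing step separate from the network call makes it possible to test the parser against saved HTML, which helps when many different source websites have to be supported.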

Installation

To scale our application, it is important that other services or applications cannot affect it. Hence, let’s create a Dockerfile that contains all the necessary tools. The details are as follows:

Choose base image: We need to install several applications that are not specific to Python, so we selected Ubuntu 16.04.

 
