Web Scraping with Python at Scale (Requests, BeautifulSoup, Splash & Tesseract)
With data at the heart of impactful decision making, web scraping becomes an indispensable tool, especially in the logistics space, where tracking consignments from different sources forms the backbone of many products. In this blog, I will discuss an efficient and scalable way to scrape data from different websites, with a special focus on the tools required.
Flask (web development framework for Python)
Requests (Python library for HTTP requests)
BeautifulSoup (HTML parsing)
Tesseract (OCR for text-based CAPTCHAs)
PM2 (process manager used to run Splash)
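To give a feel for how Requests and BeautifulSoup fit together, here is a minimal sketch. It is not taken from any particular carrier site: the table id `tracking`, the column order, the sample HTML, and the function names `fetch_html` and `parse_tracking_events` are all assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_tracking_events(html):
    """Extract (date, status) pairs from a hypothetical tracking table."""
    soup = BeautifulSoup(html, "html.parser")
    events = []
    # The table id and the column order are assumptions for illustration only.
    for row in soup.select("table#tracking tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(cells) >= 2:  # skips the header row, which has only <th> cells
            events.append((cells[0], cells[1]))
    return events


# Made-up sample page so the parser can be tried without network access.
SAMPLE_HTML = """
<table id="tracking">
  <tr><th>Date</th><th>Status</th></tr>
  <tr><td>2019-01-05</td><td>Picked up</td></tr>
  <tr><td>2019-01-07</td><td>Delivered</td></tr>
</table>
"""

if __name__ == "__main__":
    print(parse_tracking_events(SAMPLE_HTML))
```

Keeping the download step and the parsing step in separate functions means the parser can be tested against saved HTML, which helps when a source website changes its layout.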
To scale our application, it is important that other services or applications on the same machine cannot affect it. Hence, let’s create a Dockerfile with all the necessary tools. The details are as follows:
Choose a base image: We need to install several applications that are not specific to Python, so we selected Ubuntu 16.04.