Background:
A client had developed an in-house data collection tool capable of gathering all visible data from web pages. However, the tool required manual operation and alignment, so an employee spent several hours each day running it.
Requirements:
The client needed an automated solution that could mimic human behavior, allowing the tool to run unattended and scrape websites politely. They also needed the ability to search for specific data points that met certain criteria.
Implementation:
To meet these requirements, I used Python and Selenium to build an automated tool. The tool navigated websites on its own, collected the target data, and inserted it into the client's SQL database. To avoid overburdening the target sites, it operated at a human-like pace. For ease of use, it was packaged as an .exe file that a scheduled script ran nightly.
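The client's site structure and database schema are not reproduced here, but a minimal sketch of the approach might look like the following, assuming a Chrome driver, a hypothetical listing page with ".item" elements, and a local SQLite database standing in for the client's SQL database. The randomized pauses approximate the human-like pace described above.

```python
import random
import sqlite3
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# Stand-in for the client's SQL database (hypothetical schema).
db = sqlite3.connect("scraped.db")
db.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, value TEXT)")

def human_pause(low=2.0, high=6.0):
    """Sleep for a randomized interval to keep the pace human-like."""
    time.sleep(random.uniform(low, high))

driver = webdriver.Chrome()
try:
    # Hypothetical listing URL; the real targets came from the client.
    driver.get("https://example.com/listings")
    human_pause()

    # ".item", ".title", and ".value" are illustrative selectors only.
    for element in driver.find_elements(By.CSS_SELECTOR, ".item"):
        title = element.find_element(By.CSS_SELECTOR, ".title").text
        value = element.find_element(By.CSS_SELECTOR, ".value").text
        db.execute("INSERT INTO items VALUES (?, ?)", (title, value))
        db.commit()
        human_pause()
finally:
    driver.quit()
    db.close()
```

Packaging a script like this into an .exe is commonly done with a tool such as PyInstaller; the nightly scheduled task then only needs to invoke that executable.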
Challenges:
One of the primary challenges was handling errors caused by variations in how sites displayed data and behaved. Robust error handling was built in to accommodate these variations, and the program saved its progress so it could resume where it left off after encountering an issue.
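The save-and-resume idea can be sketched roughly as follows, with a hypothetical scrape_page function and a simple JSON checkpoint file standing in for the tool's real persistence layer.

```python
import json
import os

CHECKPOINT = "progress.json"

def load_checkpoint():
    """Return the index of the next page to scrape, or 0 if starting fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next_page"]
    return 0

def save_checkpoint(next_page):
    with open(CHECKPOINT, "w") as f:
        json.dump({"next_page": next_page}, f)

def run(pages, scrape_page, max_retries=3):
    """Scrape each page, retrying transient failures and checkpointing progress."""
    start = load_checkpoint()
    for i, page in enumerate(pages[start:], start=start):
        for attempt in range(1, max_retries + 1):
            try:
                scrape_page(page)
                break
            except Exception as exc:  # in practice, narrower Selenium exceptions
                if attempt == max_retries:
                    raise  # give up; the checkpoint lets the next run resume here
                print(f"Retry {attempt} for {page}: {exc}")
        save_checkpoint(i + 1)
```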
The program also needed to locate specific data points despite the limited search and sorting capabilities of the target websites. To work around this, it applied its own sorting routines to the collected records and then ran a binary search over the sorted results, which made finding the desired data points far faster than a linear scan.
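The client's actual search criteria are not shown here, but the sort-then-binary-search step can be illustrated like this, assuming records with a hypothetical numeric "value" field and a target value to locate.

```python
def find_record(records, target, key=lambda r: r["value"]):
    """Sort records by the search key, then binary-search for the target value.

    Returns the matching record, or None if no record has that value.
    """
    ordered = sorted(records, key=key)
    lo, hi = 0, len(ordered) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate = key(ordered[mid])
        if candidate == target:
            return ordered[mid]
        if candidate < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

# Example usage with hypothetical data:
items = [{"name": "A", "value": 42}, {"name": "B", "value": 7}, {"name": "C", "value": 19}]
print(find_record(items, 19))  # -> {'name': 'C', 'value': 19}
```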
By addressing these challenges and delivering a reliable automated data collection tool, I enabled the client to gather the data they needed from multiple websites each night without tying up an employee's time.