WebHarvy is a web scraping software that can help you scrape data from a website easily, such as text, links, images. But when scraping data, one IP sent too many requests in a short time, you may trigger the anti-crawler mechanism of the website, here we need to use a proxy. YiLu Proxy with 90M+ dynamic residential IP addresses can help you. The article will show you how to use WebHarvy and YiLu Proxy to scrape a web page.
Download and install WebHarvy: https://www.webharvy.com/download.html
WebHarvy offers 15-day free evaluation version.
Download and install official yilu proxy:https://yilu.us
1. Click “Settings”
2. Set up proxy
YiLu Proxy Settings
a. Select rotating residential IP or datacenter IP.
b. Select “change IP per request”.
c. Select a country, state, and city.
d. Then click “Refresh( HTTP/Socks5)”.
e. Right-click an IP and click “Copy IP and Port”.
WebHarvy Proxy Settings
a. Click “Proxy Settings”.
b. Select "Enable network connection via Proxy Server".
c. Select proxy type: Socks5 or HTTP (YiLu Proxy dynamic IP supports Socks5 and HTTP).
d. Then select "Requires Authentication".
e. Paste IP, port, password, and username from copied YiLu Proxy dynamic IP information.
f. Then click "+" icon.
Then the proxy IP will appear in "Proxy List".
Click "Apply".
3. Enter a URL that you want to scrape data from.
Such as YiLu Proxy blog web page: https://yilu.us/proxies
4. Click "Start"
5. Select the data you want to scrape
Select the desired attribute: blog title
Set the Capture Text: name "title"
Then these titles of blogs on the web page appear on the bottom.
Select the desired attribute: blog date
Set the Capture Text: name "date"
These blog dates also appear on the bottom.
6. Click "Stop"
7. Click "Start-Mine"
Click "Start”
The scraped data is shown.
8. Click "Export" -"Export as File"
Select file format(here I select xlsx), directory(desktop), and save rule, then click “Export”.
9. Open the file
The file will appear in the desktop, and open it.
You can see the scraped data including book title and date is in the file.