The idea of implementing crawlers

A web crawler is a program that imitates a browser: it sends requests to a website, downloads the returned data to the local machine, extracts the parts it needs, and stores them for later use.

Crawler workflow

1. Determine the target website

2. Analyze how the target website structures its data

3. Have the program send HTTP requests, as a user's browser would, to obtain the data

4. Save the acquired data to local storage and extract the parts you need

5. Use the extracted data according to your needs
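
The steps above can be sketched roughly as follows. This is only a minimal illustration using the JDK's built-in java.net.http client (Java 11 or later), not the approach used later in this post. The class name CrawlerStepsSketch, the helper names fetch, save and extractLinks, and the inline sample page are all assumptions made for the example, and the regular expression stands in for a proper HTML parser.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class and method names, used only for illustration.
class CrawlerStepsSketch {
    // Matches href="..." values; a real crawler would normally use an HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Step 3: send an HTTP GET request and return the response body.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Step 4, first half: save the raw response to a local temporary file.
    static Path save(String html) throws IOException {
        Path file = Files.createTempFile("page-", ".html");
        Files.writeString(file, html);
        return file;
    }

    // Step 4, second half: select the data of interest (here: link targets).
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Steps 1-2: the target URL is passed as an argument;
        // without one, an inline sample page is used so the sketch runs offline.
        String html = args.length > 0
                ? fetch(args[0])
                : "<a href=\"/a\">A</a> <a href=\"/b\">B</a>";
        Path saved = save(html);
        // Step 5: use the extracted data (here it is simply printed).
        System.out.println("Saved to " + saved);
        extractLinks(Files.readString(saved)).forEach(System.out::println);
    }
}
```

Keeping fetching, saving and extraction in separate methods means the extraction logic can be tried against a saved page without contacting the site again.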

Notes

Crawlers generally set request headers explicitly.

User-Agent: If the request has no User-Agent header, the target website may treat it as coming from an illegitimate (non-browser) client and reject it.

Cookie: Cookies hold login information, so sending them lets the website recognize the crawler as a logged-in user.
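
As a rough illustration of the two headers above, the sketch below attaches them to a request with the JDK's java.net.http API (Java 11 or later). The class name RequestHeadersSketch, the method buildRequest, the example.com URL, the User-Agent string and the cookie value are all placeholders assumed for the example; a real crawler would use the values its target site expects.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical class name, used only for illustration.
class RequestHeadersSketch {
    // Builds a GET request carrying a browser-like User-Agent and a cookie.
    static HttpRequest buildRequest(String url, String userAgent, String cookie) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .header("Cookie", cookie)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values: substitute a real URL, browser string and session cookie.
        HttpRequest request = buildRequest(
                "https://example.com/",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
                "sessionid=PLACEHOLDER");
        // Print the headers that would be sent with the request.
        request.headers().map()
                .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

The request is only built here, not sent, so the headers can be inspected without touching the network.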

Crawler Practice

The following is a practical example of data collection with a web crawler. It launches Firefox through a proxy so that the crawler can browse the website as a user would and then collect, parse, and save data. The code is for reference only:

import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;
import org.openqa.selenium.firefox.FirefoxProfile;

public class FirefoxDriverProxyDemo
{
    // Proxy tunnel credentials
    final static String proxyUser = "username";
    final static String proxyPass = "password";

    // Proxy server
    final static String proxyHost = "t.16yun.cn";
    final static int proxyPort = 31111;

    final static String firefoxBin = "C:/Program Files/Mozilla Firefox/firefox.exe";

    public static void main(String[] args)
    {
        System.setProperty("webdriver.firefox.bin", firefoxBin);

        FirefoxProfile profile = new FirefoxProfile();

        // 1 = manual proxy configuration
        profile.setPreference("network.proxy.type", 1);

        // Send HTTP traffic through the proxy
        profile.setPreference("network.proxy.http", proxyHost);
        profile.setPreference("network.proxy.http_port", proxyPort);

        // Send HTTPS traffic through the same proxy
        profile.setPreference("network.proxy.ssl", proxyHost);
        profile.setPreference("network.proxy.ssl_port", proxyPort);

        // Note: "username" and "password" are not standard Firefox preference keys,
        // so these two lines may have no effect; proxy authentication may need to be
        // handled another way (for example an IP whitelist on the proxy side).
        profile.setPreference("username", proxyUser);
        profile.setPreference("password", proxyPass);

        // Use the same proxy settings for all protocols
        profile.setPreference("network.proxy.share_proxy_settings", true);

        // Bypass the proxy for local addresses
        profile.setPreference("network.proxy.no_proxies_on", "localhost");

        // Newer Selenium releases take the profile through FirefoxOptions
        // (older ones accepted new FirefoxDriver(profile) directly).
        FirefoxOptions options = new FirefoxOptions();
        options.setProfile(profile);
        FirefoxDriver driver = new FirefoxDriver(options);
        try
        {
            // Load and process pages here, e.g. with driver.get(url)
        }
        finally
        {
            // Close the browser and release the driver
            driver.quit();
        }
    }
}
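
For comparison, the same proxy idea can be sketched without a browser, using the JDK's java.net.http client (Java 11 or later). This is a swapped-in technique, not the Selenium setup shown above. The class name ProxyClientSketch and the method buildClient are assumptions made for the example; the host, port and credentials reuse the placeholder values from the post. Depending on JDK settings, Basic proxy authentication for HTTPS tunnels may be disabled by default, so treat this as a starting point only.

```java
import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.ProxySelector;
import java.net.http.HttpClient;

// Hypothetical class name, used only for illustration.
class ProxyClientSketch {
    // Builds an HttpClient that routes HTTP and HTTPS requests through the given proxy.
    static HttpClient buildClient(String host, int port, String user, String pass) {
        Authenticator auth = new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                // Answer only proxy challenges, not challenges from the target site.
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(user, pass.toCharArray());
                }
                return null;
            }
        };
        return HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress(host, port)))
                .authenticator(auth)
                .build();
    }

    public static void main(String[] args) {
        // Placeholder proxy settings taken from the post; replace with real values.
        HttpClient client = buildClient("t.16yun.cn", 31111, "username", "password");
        System.out.println("Proxy configured: " + client.proxy().isPresent());
    }
}
```

A client built this way suits pages that do not need JavaScript rendering; pages that do still call for a real browser such as the Firefox driver above.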

This post is from Programming Basics

Latest reply

I'll block you after a while. (Published on 2020-9-21 20:38)
 


Very good talk, I gained a lot from reading it, thank you very much


I'll block you after a while.
