A web crawler is a program that simulates a browser requesting a website: it downloads the data the site returns to the local machine, extracts the parts it needs, and stores them for later use.
Crawler workflow
1. Determine the target website
2. Analyze how the data on the target website is structured
3. Have the program simulate a user by sending HTTP requests to obtain the data
4. Save the acquired data to local storage and extract the relevant parts
5. Use the extracted data according to your needs
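The steps above can be sketched in a few lines of Java. This is a minimal illustration rather than a complete crawler: it assumes Java 11 or later (for java.net.http), the URL https://example.com/ and the output file title.txt are placeholders, and the class and method names are made up for this example. A regular expression is used only to keep the sketch short; a real crawler would normally use an HTML parser.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrawlerWorkflowSketch {

    // Matches the text between <title> and </title>, ignoring case and line breaks.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Step 3: send an HTTP GET request and return the response body as text.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Step 4 (select): pull the page title out of the raw HTML, if there is one.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    // Step 4 (save): write the extracted text to a local file.
    static void save(Path file, String data) throws IOException {
        Files.writeString(file, data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String url = "https://example.com/";           // Step 1: placeholder target site
        String html = fetch(url);                      // Step 3: request the page
        String title = extractTitle(html).orElse("");  // Step 4: select the needed data
        save(Path.of("title.txt"), title);             // Step 4: store it locally
        System.out.println("Saved title: " + title);   // Step 5: use the data
    }
}
```

Step 2 (analyzing the site) happens before any code is written: it decides which URL to request and which part of the response to extract, here the title element.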
Notes
Crawlers generally set request headers explicitly:
User-Agent: If the request has no User-Agent header, the target website may treat it as coming from an illegitimate client and refuse to serve it.
Cookie: Cookies carry login information, so sending them lets the crawler reach pages that require a logged-in session.
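As a rough sketch of how these two headers can be attached to a request, the following uses the java.net.http API from Java 11. The URL, the User-Agent string and the cookie value are placeholders chosen for illustration, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class RequestHeaderSketch {

    // Builds a GET request that carries a browser-like User-Agent and a session cookie.
    static HttpRequest buildRequest(String url, String userAgent, String cookie) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .header("Cookie", cookie)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest(
                "https://example.com/profile",   // placeholder URL
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
                "sessionid=abc123");             // placeholder cookie value
        // Print the headers that would be sent with the request.
        request.headers().map().forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

In practice the cookie value would come from a real login response rather than a hard-coded string.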
Crawler Practice
The following is a practical example of web crawler data collection. It launches Firefox through Selenium behind a proxy server, which is the starting point for simulating a user, collecting data, parsing it and saving it. The code is for reference only:
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;

public class FirefoxDriverProxyDemo
{
    // Proxy tunnel credentials
    final static String proxyUser = "username";
    final static String proxyPass = "password";

    // Proxy server
    final static String proxyHost = "t.16yun.cn";
    final static int proxyPort = 31111;

    // Path to the local Firefox executable
    final static String firefoxBin = "C:/Program Files/Mozilla Firefox/firefox.exe";

    public static void main(String[] args)
    {
        System.setProperty("webdriver.firefox.bin", firefoxBin);

        FirefoxProfile profile = new FirefoxProfile();

        // 1 = manual proxy configuration
        profile.setPreference("network.proxy.type", 1);

        // Route both HTTP and HTTPS traffic through the proxy
        profile.setPreference("network.proxy.http", proxyHost);
        profile.setPreference("network.proxy.http_port", proxyPort);
        profile.setPreference("network.proxy.ssl", proxyHost);
        profile.setPreference("network.proxy.ssl_port", proxyPort);

        // "username" and "password" are not standard Firefox preference names,
        // so these two lines may have no effect on proxy authentication
        profile.setPreference("username", proxyUser);
        profile.setPreference("password", proxyPass);

        profile.setPreference("network.proxy.share_proxy_settings", true);

        // Bypass the proxy for local addresses
        profile.setPreference("network.proxy.no_proxies_on", "localhost");

        // Older Selenium releases accept the profile directly; Selenium 4 expects
        // a FirefoxOptions object with options.setProfile(profile) instead
        FirefoxDriver driver = new FirefoxDriver(profile);
    }
}
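The Firefox example above passes the proxy user name and password as profile preferences. For a crawler that sends HTTP requests directly instead of driving a browser, proxy authentication can be sketched with the java.net.http client from Java 11, as below. This is an alternative illustration, not the approach of the original code: the class and method names are hypothetical, and the host, port and credentials are the same placeholders used above. Depending on the JDK configuration, Basic authentication for HTTPS tunnels may also need to be enabled through the jdk.http.auth.tunneling.disabledSchemes system property.

```java
import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.ProxySelector;
import java.net.http.HttpClient;

class ProxyClientSketch {

    // Builds an HttpClient that routes requests through the given proxy
    // and answers the proxy's authentication challenge with user/pass.
    static HttpClient newProxiedClient(String host, int port, String user, String pass) {
        Authenticator auth = new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                // Only answer challenges that come from a proxy, not from the target site.
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(user, pass.toCharArray());
                }
                return null;
            }
        };
        return HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress(host, port)))
                .authenticator(auth)
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values taken from the example above.
        HttpClient client = newProxiedClient("t.16yun.cn", 31111, "username", "password");
        System.out.println("Proxy configured: " + client.proxy().isPresent());
    }
}
```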