26 November 2015

之前说个数据抓取遇到的一个坎就是验证码,这次来说另外两个。我们知道web系统可以拿到客户请求信息,那么针对客户请求的频率,客户信息都会做限制。如果一个ip上的客户访问过于频繁,或者明显是用程序抓取,肯定是要禁止的。本文针对这两个问题说下解决方法。

其实针对上述两个问题,解决方法已经很成熟了,无非就是买代理和在http请求中加入头信息伪装为浏览器请求。本文说下具体操作

使用代理

  • 首先购买代理,这个网上卖代理的很多,自己搜索,而且价格也不贵。
  • 其次就是在程序中使用代理:

      HttpClient httpclient = new DefaultHttpClient();
      httpclient.getCredentialsProvider().setCredentials(
              new AuthScope("代理ip", "代理端口"),
              new UsernamePasswordCredentials("代理用户名","代理密码"));
    

http请求加入头信息

  • 同样在http请求中加入头信息也是很少代码搞定:

      HttpGet httpget = new HttpGet(url);
      // 加入头信息
      httpget.addHeader("Accept", "text/html");
      httpget.addHeader("Accept-Charset", "utf-8");
      httpget.addHeader("Accept-Encoding", "gzip");
      httpget.addHeader("Accept-Language", "zh-CN,zh");
      httpget.addHeader("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
        
      HttpResponse response = httpclient.execute(httpget);
    
  • post请求同样的方式:

      HttpPost httppost = new HttpPost(url);        
      List<NameValuePair> formparams = param; 
      UrlEncodedFormEntity uefEntity = new UrlEncodedFormEntity(formparams, reqEncoding);
      httppost.setEntity(uefEntity);
      // 加入头信息
      httppost.addHeader("Accept", "text/html");
      httppost.addHeader("Accept-Charset", "utf-8");
      httppost.addHeader("Accept-Encoding", "gzip");
      httppost.addHeader("Accept-Language", "en-US,en");
      httppost.addHeader("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
    
      HttpResponse response = httpclient.execute(httppost);
    




Fork me on GitHub