[Search Q&A Theater] brings you a comprehensive analysis of "Baidu Spider", helping you understand the role Baidu Spider plays in various scenarios and laying the foundation for the overall operation of your website.
A: Baiduspider, also known as Baidu Spider, is the automated crawler program of the Baidu search engine. It visits web pages on the Internet and builds an index database so that users can find a website's content through Baidu search.
A: There are two ways to identify Baidu Spider.
Method 1: View UA information
If the UA information is incorrect, you can immediately conclude that the visitor is not a Baidu search spider. Currently there are UAs for three application scenarios: mobile, PC, and mini programs. The UAs for these three channels are as follows:
Mobile UA:
Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 (KHTML,like Gecko)Version/5.1 Mobile Safari/10600.6.3 (compatible; Baiduspider/2.0;+http://www.baidu.com/search/spider.html)
or
Mozilla/5.0 (iPhone;CPU iPhone OS 9_1 like Mac OS
PC UA:
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
or
Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)
Mini Program UA:
Mozilla/5.0 (iPhone;CPU iPhone OS 9_1 like Mac OS
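As a rough illustration of Method 1, the check below looks for the Baiduspider product token that appears in the complete UA strings listed above. The function name claims_baiduspider and the regular expression are assumptions made for this sketch, not part of any Baidu tooling. Because a UA string is easy to forge, a match only means the visitor claims to be Baiduspider; Method 2 is still needed to confirm it.

```python
import re

# Product token seen in the complete UA strings above, e.g. "Baiduspider/2.0"
# or "Baiduspider-render/2.0". The pattern itself is an assumption of this sketch.
_BAIDU_TOKEN = re.compile(r"Baiduspider(-render)?/\d+\.\d+", re.IGNORECASE)


def claims_baiduspider(user_agent: str) -> bool:
    """Return True if the UA string contains a Baiduspider product token.

    A False result rules the visitor out; a True result is only a claim,
    since any client can send this header.
    """
    return bool(_BAIDU_TOKEN.search(user_agent or ""))
```

A request whose UA fails this check can be discarded as "not Baidu" right away; one that passes should still go through the DNS check before it is trusted.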
Method 2: Bidirectional DNS resolution authentication
Step 1: Reverse DNS lookup of the IP. Developers can determine whether a spider comes from the Baidu search engine by running a reverse DNS lookup on the visiting IP address recorded in the server log. Baiduspider hostnames follow the format *.baidu.com or *.baidu.jp; any other hostname indicates impersonation.
The verification method varies by platform. For example, on Linux, Windows, and Mac OS:
① On Linux, use the command host ip to run a reverse lookup on the IP address and determine whether it belongs to Baiduspider. The returned hostname should match *.baidu.com or *.baidu.jp.
② On Windows or IBM OS/2, use the command nslookup ip to run a reverse lookup on the IP address and determine whether the crawl came from Baiduspider.
③ On Mac OS, use the dig command (for example, dig -x ip) to run a reverse lookup on the IP address and determine whether the crawl came from Baiduspider.
Step 2: Forward DNS lookup of the domain name. Run a forward DNS lookup on the hostname returned in step 1 and check that it resolves to the same IP address that appears in your log. If the IP addresses match, the spider is confirmed to come from the Baidu search engine; if they do not match, it is an impersonation.
For more details, please refer to the document: "Two easy steps to teach you how to quickly identify Baidu spiders"
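The two steps of Method 2 can be sketched in code as follows. This is a minimal illustration using Python's standard socket module, not an official Baidu tool; the function name is_real_baiduspider and the injectable reverse/forward parameters are assumptions made so the logic can be exercised without network access. Note that socket.gethostbyname_ex only returns IPv4 addresses.

```python
import socket

# Hostname suffixes stated in step 1.
ALLOWED_SUFFIXES = (".baidu.com", ".baidu.jp")


def _reverse(ip):
    """Step 1: reverse DNS (PTR) lookup, returns the primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    """Step 2: forward DNS lookup, returns a list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_real_baiduspider(ip, reverse=_reverse, forward=_forward):
    """Return True only if both DNS directions agree for this IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record or resolver failure
        return False
    # The leading dot prevents "evilbaidu.com" from passing.
    if not hostname.endswith(ALLOWED_SUFFIXES):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

Calling is_real_baiduspider with an IP taken from the access log performs live DNS queries, so in practice the result is worth caching per IP rather than recomputing for every log line.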
A: Generally, as long as the website keeps producing new resources and updating content, the spider will keep crawling. Note that if you want Baidu Spider to crawl the website, do not block it. (For more information about blocking, see the sections below.)
In addition, you can check the website access log and verify which visitors are genuine Baidu spiders, so that nobody can maliciously impersonate Baidu Spider and crawl your website frequently.
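One way to review the access log is to count requests per IP address among visitors whose UA claims to be Baiduspider, then verify the busiest addresses by DNS. The sketch below assumes the log uses the common "combined" format of Apache or Nginx; the regular expression and the function name baiduspider_claims are illustrative assumptions and would need adjusting for other log layouts.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def baiduspider_claims(lines):
    """Count requests per IP whose UA mentions Baiduspider."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "baiduspider" in match.group("ua").lower():
            counts[match.group("ip")] += 1
    return counts
```

The returned counter only lists addresses that claim to be Baiduspider; each address still needs the two-way DNS check before it is treated as genuine.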
A: If you find that Baidu Spider crawls frequently, the possible reasons are:
① The website has newly produced resources or updated content that needs to be crawled and refreshed;
② Someone may be maliciously impersonating Baidu Spider. You can troubleshoot this by following the method in Q2 "How to identify normal Baidu Spider" above.
If the Baidu Spider crawling frequency is so high that it causes website service abnormalities, you can adjust the frequency through the [Crawling Frequency] tool on the search resource platform.
A: Baidu Spider complies with the Internet robots protocol. Webmasters can update the robots.txt file to state clearly which resources or directories Baidu Spider should not access, and then submit the robots file promptly through the [Robots] tool on the search resource platform.
Note that after the robots file is updated and submitted, the search engine needs some time to complete the update, so Baidu Spider will not stop crawling the affected pages immediately. Please wait patiently.
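Before submitting an updated robots.txt, it can help to check locally which paths the rules would close to Baiduspider. The sketch below uses Python's standard urllib.robotparser; the function name baiduspider_allowed, the sample rules, and the example.com URLs are hypothetical. Python's parser applies the first matching rule, so its verdict may differ from Baidu's own interpretation in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Baiduspider out of /private/ only.
SAMPLE_ROBOTS = """\
User-agent: Baiduspider
Disallow: /private/

User-agent: *
Disallow:
"""


def baiduspider_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets Baiduspider fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Baiduspider", url)
```

With the sample rules, a URL under /private/ is reported as blocked for Baiduspider while other paths remain open, which is also a quick way to spot an accidental block.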
A: The website has high-quality resources and no other violations, but one or more of the following situations occur:
① No Baidu Spider crawling records can be found.
② The website is not indexed or displayed in Baidu search.
③ Traffic to the website or a directory drops abnormally.
④ The summary in the search results is displayed as "robots blocked".
If you find any of the above, first check whether Baidu Spider is being blocked, lift the block promptly (refer to QA7 for how to lift it), and wait for recovery.
A: Common blocking behaviors include robots blocking, Baidu UA blocking, and Baidu IP blocking. You can check and resolve them one by one:
(1) Check the robots.txt file for any blocking rules. (Generally, the robots.txt file is placed in the root directory of the website.)
(2) If the robots file shows no abnormality, check next whether the Baidu UA has been blocked;
Solution 1: Execute curl --head --user-agent 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)' --request GET 'xxxxxxx'
Note: A return code of 200 is normal; any other result is abnormal.
Solution 2: Verify by changing the browser's UA;
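The same UA check can also be scripted. The sketch below sends a GET request carrying the PC Baiduspider UA with Python's standard urllib and returns the HTTP status code; the function names build_probe and status_as_baiduspider are assumptions for this illustration, and the URL is whatever page you would pass to curl. Note that urllib follows redirects, so the code returned is that of the final response.

```python
import urllib.error
import urllib.request

# PC UA string from Method 1.
BAIDU_PC_UA = (
    "Mozilla/5.0 (compatible; Baiduspider/2.0; "
    "+http://www.baidu.com/search/spider.html)"
)


def build_probe(url):
    """Build a GET request that presents the Baiduspider PC UA."""
    return urllib.request.Request(
        url, headers={"User-Agent": BAIDU_PC_UA}, method="GET"
    )


def status_as_baiduspider(url, timeout=10):
    """Return the HTTP status code the site gives to the Baiduspider UA."""
    try:
        with urllib.request.urlopen(build_probe(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise; the code is what we want to inspect.
        return err.code
```

If the page returns 200 for an ordinary browser UA but another code (for example 403) for this request, a UA-level block is the likely cause.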
(3) If none of the checks above reveals a problem, check next whether there is an IP-level block;
IP blocking usually comes from the website's firewall configuration. Check the firewall's management console to see whether any IP-level blocking rules apply to Baidu Spider.
For details, please refer to "How developers can unblock Baidu Spider".