A web crawler (also known as a web spider or web robot) is a program or script that automatically browses the Internet and searches websites in a methodical way.
Many applications, mostly search engines, crawl websites daily in order to keep their data up to date.
Most web crawlers save a copy of the pages they visit so that the pages can easily be indexed later. The rest crawl sites for one specific purpose only, such as harvesting e-mail addresses (for spam).
How does it work?
A crawler needs a starting point, which is an address, a URL.
To browse the Internet, we use the HTTP network protocol, which lets us talk to web servers and download data from them or upload data to them.
The crawler fetches this URL and looks for hyperlinks (the A tag in HTML).
Then the crawler follows these links and carries on in the same manner.
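The loop described above (fetch a URL, collect its hyperlinks, follow them) can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production crawler: the names `LinkParser`, `extract_links`, `fetch_page` and `crawl`, the `max_pages` limit and the UTF-8 decoding are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) URLs for the hyperlinks found in html."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop the "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def fetch_page(url):
    """Download a page over HTTP (assumes the body is UTF-8 text)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_page, max_pages=10):
    """Breadth-first crawl from start_url; returns {url: html}."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl("http://example.com/", max_pages=5)` would download at most five pages, starting from that address. The `fetch` parameter exists so the download step can be swapped out, for example with a function that reads from a dictionary of saved pages.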
Up to here, that was the basic idea. How we go on from here depends entirely on the purpose of the software itself.
If we only want to harvest e-mail addresses, we could search the text of every page (including its hyperlinks) and look for e-mail addresses. This kind of software is much easier to develop.
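As a rough sketch of that idea, the snippet below scans the raw text of a page for strings that look like e-mail addresses. The pattern `EMAIL_PATTERN` and the function `find_emails` are hypothetical names chosen for this example, and the regular expression is deliberately simple: it will miss some valid addresses and accept some invalid ones.

```python
import re

# Simplified pattern (an assumption, not a full address grammar):
# local part, "@", domain labels, and a final label of 2+ letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(page_text):
    """Return the unique e-mail-like strings in page_text, in order of appearance."""
    found = []
    for match in EMAIL_PATTERN.findall(page_text):
        if match not in found:
            found.append(match)
    return found
```

Because the function works on the raw page source, it also picks up addresses that appear only inside `mailto:` hyperlinks.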
Search engines are much more difficult to develop.
When building a search engine, we need to take care of a few other things.
1. Size - Some websites are very large and contain many files and directories. It can take a lot of time to collect all the data.
2. Change frequency - A website can change very often, even several times a day. Pages can be deleted and added every day. We have to decide when to revisit each site and each page.
3. How do we process the HTML? If we build a search engine, we want to understand the text rather than just treat it as plain text. We need to tell the difference between a caption and an ordinary sentence. We need to look at bold or italic text, colors, fonts, font sizes, numbers and tables.
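To illustrate the third point, here is one possible way to keep track of where a piece of text sits in the HTML instead of flattening the page to plain text. It is only a sketch: the class `WeightedTextParser`, the function `weighted_text` and especially the numbers in `TAG_WEIGHTS` are made up for the example and are not taken from any real search engine.

```python
from html.parser import HTMLParser

# Hypothetical weights: how much more a word counts inside each tag.
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "caption": 3,
               "b": 2, "strong": 2, "i": 2, "em": 2}
# Tags that never have a closing tag, so they are not tracked.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
# Tags whose content is code or styling, not readable text.
SKIPPED_TAGS = {"script", "style"}


class WeightedTextParser(HTMLParser):
    """Records each text fragment together with a weight for its context."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tags
        self.fragments = []   # list of (text, weight) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or SKIPPED_TAGS & set(self.open_tags):
            return
        # The fragment takes the highest weight among its enclosing tags.
        weight = max((TAG_WEIGHTS.get(t, 1) for t in self.open_tags), default=1)
        self.fragments.append((text, weight))


def weighted_text(html):
    """Return a list of (text, weight) pairs for the given HTML."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return parser.fragments
```

With weights like these, an indexer could count a word in a title or heading more heavily than the same word in an ordinary sentence. Colors, font sizes and tables would need extra handling, for example by reading tag attributes, which this sketch does not attempt.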
