Every time a page is fetched from the web, one of the following
functions is called (they are all in src/interf/output.cc); a
combined sketch of these callbacks follows their descriptions:
void loaded (html *page): called when the fetch ended successfully.
From the page object, you can:
- get the url of the page with the method getUrl()
- get the content of the page with the method getPage()
- get the list of its sons with the method getLinks() (if options.h
  includes "#define LINKS_INFO")
- get the http headers with the method getHeaders()
- get the tag with getUrl()->tag (if options.h includes "#define URL_TAGS")
For more details, see src/fetcher/file.h (for html), src/utils/url.h,
src/utils/string.h and src/utils/Vector.h.
void loadedInteresting (html *page): called, when specificSearch is
enabled, every time an interesting page is fetched. You can get the
same information as in loaded.
void fetchFailInteresting (url *u, FetchError reason): called when
the fetch ended with an error but the page has the right mime type
(only called with specificSearch). u describes the url of the page;
its class is described in src/utils/url.h. reason explains why the
fetch failed; enum FetchError is defined in src/types.h.
void fetchFail (url *u, FetchError reason): called when the fetch
ended with an error. u describes the url of the page; its class is
described in src/utils/url.h. reason explains why the fetch failed;
enum FetchError is defined in src/types.h.
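As an illustration, here is a minimal sketch of these callbacks in
src/interf/output.cc. Only the accessors listed above are taken from
this documentation; giveUrl() (assumed to return a newly allocated
string), the element type Vector<char> and the Vector accessors
getTab() and getLength() are assumptions to check against
src/utils/url.h and src/utils/Vector.h. The sketch also relies on the
includes already present in the stock output.cc:

    // sketch only: check the headers mentioned above for the exact
    // signatures in your version of larbin
    void loaded (html *page) {
      url *u = page->getUrl();
      char *s = u->giveUrl();            // assumed: newly allocated string
      printf("fetched %s\n", s);
      delete [] s;
      printf("%s", page->getHeaders());  // http headers of the answer
    #ifdef LINKS_INFO
      Vector<char> *links = page->getLinks();  // the sons of this page
      char **tab = links->getTab();      // assumed Vector accessors
      for (unsigned int i = 0; i < links->getLength(); i++)
        printf("  -> %s\n", tab[i]);
    #endif
    }

    void fetchFail (url *u, FetchError reason) {
      // the enumerators of FetchError are listed in src/types.h
      char *s = u->giveUrl();
      fprintf(stderr, "failed %s (FetchError %d)\n", s, (int) reason);
      delete [] s;
    }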
The basic configuration is done in larbin.conf. Here are the
different fields of this file (a sample configuration follows the list):
- From: YOUR mail address, sent with the http headers; very useful
  when someone wants to complain about the robot :-(
- UserAgent: name of the robot (sent with each request).
- httpPort: port on which the http statistics webserver is launched
  (see http://localhost:8081/ while larbin is running).
- inputPort: port on which you can submit urls to fetch. If this line
  does not exist or if the port is 0, no input is available.
- pagesConnexions: number of pages fetched in parallel (adapt it to
  your network speed). Decrease it if you get too many timeouts (see
  the stats): around 10% timeouts seems to be a maximum.
- dnsConnexions: number of dns calls made in parallel. 10 should be ok.
- depthInSite: how deep to go within a site.
- waitDuration: time in seconds between two calls to the same server.
  It should never be less than 30 s. Even with 60 s the crawler is
  hardly slowed down, and it is much better behaviour.
- proxy: set this if you want to connect through a proxy (host port).
- StartUrl: where the crawl starts. This choice matters little, as
  long as the page contains external urls.
- limitToDomain: with this option enabled, you only crawl pages of
  some specific domains (.fr and .dk for example).
- specificSearch: this option allows you to look for a specific kind
  of page (wap pages, xml pages, mp3s...).
- forbiddenExtensions: the extensions you do not want to fetch (write
  all of them and terminate your list with end).
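As an example, a larbin.conf using these fields could look like the
following (all values are illustrative; compare with the commented
larbin.conf shipped with the sources):

    From webmaster@example.com
    UserAgent larbin_test
    httpPort 8081
    inputPort 1976
    pagesConnexions 100
    dnsConnexions 5
    depthInSite 5
    waitDuration 60
    StartUrl http://www.example.com/
    limitToDomain .fr .dk end
    forbiddenExtensions
    .zip .ps .mp3
    end

With inputPort set, urls can then be submitted to the running
crawler, for instance with netcat (assuming the input system reads
urls line by line; check the input code for the exact protocol):

    echo "http://www.example.com/" | nc localhost 1976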
Compile-time options are defined in options.h; they change what
larbin does. Here are the different things you can define (you must
recompile larbin if you change any of them); an example excerpt
follows the list:
- LINKS_INFO: associates to each page the list of the links it
  contains. This information can be used in output.cc with
  page->getLinks().
- FOLLOW_LINKS: if this option is not set, html pages are not parsed
  and links are not followed. This can be useful when you feed larbin
  through the input system.
- SAVE: if this option is set, fetched pages are stored on disk.
- MIRROR_SITES: if this option is set, pages are stored respecting
  the directory structure of the sites they come from (one directory
  per site). This option is only relevant if SAVE is also set.
- SPECIFICSAVE: if this option is set, specific pages are stored on
  disk. Thanks to the option maxSpecSize in src/types.h, you can save
  files bigger than maxPageSize. Of course, this is only valid if you
  set specificSearch in larbin.conf. This option enables you to
  simulate a resizable buffer (see pageContent and deleteTemp in
  src/interf/output.cc), but all the memory management problems are
  handled by the kernel.
- NO_DUP: if this option is set, larbin does not return success when
  a page with the same content as a previously seen one is
  encountered.
- URL_TAGS: if this option is set, an int is associated with every
  url (0 by default). If you use the input system, you have to give
  an int and the url instead of just the url. When the page is
  fetched, you get this int back with it (redirections are followed).
- EXIT_AT_END: if this option is set, larbin exits when there are no
  more urls to fetch.
- MAXBANDWIDTH: this option is followed by an integer which indicates
  the maximum bandwidth larbin should use. Because of the way
  bandwidth is limited, larbin might use 10 to 20 percent more
  bandwidth than expected. If this option is not set, there is no
  bandwidth limitation.
- THREAD_OUTPUT: this option must be set if the code you add in
  output.cc can use blocking calls (read/write on a network file
  descriptor...). If it is not set, there is only one thread in the
  program, so no locking is needed.
- RELOAD: if this option is enabled, larbin restarts from where it
  last stopped when you launch it. This allows you to stop and
  restart larbin as needed (or to restart after a crash). If you want
  to restart from scratch, use the -scratch option.
- GRAPH: includes nice histograms in the real-time stats page.
- STATS: displays stats on stdout every 8 seconds.
- BIGSTATS: displays on stdout the name of every fetched page.
- NOSTATS: disables stats information in the webserver.
- NDEBUG: disables debugging information in the webserver.
- CRASH: should only be used for reporting terrible bugs.
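For instance, an options.h set up to save pages and stop when the
crawl is finished might contain lines like these (an illustrative
excerpt only; the real file lists every option with comments):

    // possible excerpt of options.h; recompile after any change
    #define FOLLOW_LINKS          // parse html pages and follow links
    #define SAVE                  // store fetched pages on disk
    //#define MIRROR_SITES        // one directory per site (needs SAVE)
    #define EXIT_AT_END           // exit when no url is left
    #define MAXBANDWIDTH 200000   // maximum bandwidth (example value)
    //#define THREAD_OUTPUT       // only if your output.cc can block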
If you want to tune larbin a little more, go and see src/types.h (it
is supposed to be commented well enough). Of course, for these
changes to take effect, you have to recompile larbin.
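For example, the size limits mentioned above live there; the excerpt
below is hypothetical, with the names maxPageSize and maxSpecSize
taken from this documentation and example values only:

    // possible excerpt of src/types.h (values are examples only)
    #define maxPageSize  100000   // normal limit on the size of a page
    #define maxSpecSize 5000000   // limit for specific pages (SPECIFICSAVE)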
More customizations
If you need something more, you'll have to go into the code (or ask me
to do so :-)).