Although Nutch was originally built as a web-scale search engine, that is no longer the case. Today, the main purpose of Nutch is large-scale crawling. What you do with the crawled data is then up to your requirements. By default, Nutch can send data to Solr. That is why you can run
crawl url crawl solraddress depth level
You can also omit the Solr URL parameter. In that case, Nutch will not send the crawled data to Solr, and without that step you will not be able to search the data. Crawling data and searching data are two different things, though closely related.
Generally, you will find the crawled data in crawl/segments, not crawl/crawldb. The crawldb folder stores information about the crawled URLs: their fetch status, the next fetch time, and other information useful for crawling. Nutch stores the actual crawled data in crawl/segments.
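To check what a crawl actually produced, a small script can list the segments and the parts each one contains. This is only a sketch: the helper name `summarize_crawl_dir` is hypothetical, and the list of segment subdirectories is an assumption based on the usual Nutch 1.x layout, so verify it against your own installation.

```python
from pathlib import Path

# Subdirectories a Nutch 1.x segment usually contains once it has been
# generated, fetched, and parsed (assumed layout; check your own install).
SEGMENT_PARTS = ("crawl_generate", "crawl_fetch", "content",
                 "parse_text", "parse_data", "crawl_parse")


def summarize_crawl_dir(crawl_dir):
    """Return {segment_name: [parts present]} for crawl_dir/segments."""
    segments_dir = Path(crawl_dir) / "segments"
    if not segments_dir.is_dir():
        return {}  # no segments directory: nothing was generated or fetched
    summary = {}
    for segment in sorted(p for p in segments_dir.iterdir() if p.is_dir()):
        summary[segment.name] = [part for part in SEGMENT_PARTS
                                 if (segment / part).is_dir()]
    return summary


if __name__ == "__main__":
    # "crawl" is the crawl directory used in the example command above.
    for name, parts in summarize_crawl_dir("crawl").items():
        print(name, "->", ", ".join(parts) or "(empty)")
```

An empty result means crawl/segments does not exist, which suggests the generate step never ran against that crawl directory.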
If you want an easy way to view the crawled data, you might try Nutch 2.x, as it can store its crawled data in several back ends such as MySQL, HBase, and Cassandra through the Gora component.
Otherwise, you can always push your data into different stores by adding indexer plugins. Currently, Nutch supports sending data to Solr and Elasticsearch. These indexer plugins send structured data such as title, text, author, and other metadata.
The following summarizes what happens in Nutch:
seed list -> crawldb -> fetching raw data (download site contents)
-> parsing the raw data -> structuring the parsed data into fields (title, text, anchor text, metadata and so on)
-> sending the structured data to storage for use (such as Elasticsearch and Solr).
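The flow above can be pictured with a toy model in Python. This is not Nutch code and uses none of its APIs: every name here (`CrawlEntry`, `inject`, `fetch`, `parse`, `index`, `fake_download`) is hypothetical, and a plain list stands in for a store such as Solr. It only illustrates how the crawldb (URL status) is kept separate from the fetched content.

```python
from dataclasses import dataclass


@dataclass
class CrawlEntry:
    """One simplified crawldb row: a URL plus its fetch status."""
    url: str
    status: str = "unfetched"


def inject(seed_urls):
    """seed list -> crawldb (URL bookkeeping only, no page content)."""
    return {url: CrawlEntry(url) for url in seed_urls}


def fetch(crawldb, download):
    """crawldb -> raw content of every unfetched URL (a toy 'segment')."""
    segment = {}
    for entry in crawldb.values():
        if entry.status == "unfetched":
            segment[entry.url] = download(entry.url)
            entry.status = "fetched"
    return segment


def parse(segment):
    """raw content -> structured fields (first line is the title here)."""
    docs = []
    for url, raw in segment.items():
        title, _, text = raw.partition("\n")
        docs.append({"url": url, "title": title.strip(), "text": text.strip()})
    return docs


def index(docs, store):
    """structured docs -> storage; returns how many docs were sent."""
    store.extend(docs)
    return len(docs)


def fake_download(url):
    """Stand-in for a real HTTP fetch so the sketch runs offline."""
    return f"Title of {url}\nBody text for {url}"


if __name__ == "__main__":
    crawldb = inject(["http://example.com/"])
    store = []
    index(parse(fetch(crawldb, fake_download)), store)
    print(crawldb)
    print(store)
```

Skipping the `index` call in this model mirrors leaving out the Solr URL: the data is still crawled and parsed, but nothing reaches a searchable store.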
Thanks for the reply. I am working with Nutch 1.9 using Cygwin on Windows. crawl/segments is not being created for me; only crawl/crawldb is. When I run the command ` ./crawl urls -crawl -depth 3 -topN 5`, I get the following error. How can I resolve it? Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/cygwin64/home/apache-nutch-1.9-bin/apache-nutch-1.9/bin/-crawl/segments/crawl_generate
Can you recommend a reliable tutorial for setting up Nutch on Windows? I am frustrated with this.
Honestly, I would not recommend running Nutch on Windows. It is just unpredictable. Try running it in a virtual machine.
I am using Cygwin. Isn't that the same as a Unix environment? So the steps I have to follow should be the same as on Unix, right?
Nutch uses Hadoop, and Hadoop does not work that well in Cygwin. I have not used Nutch in Cygwin, so I cannot help much here. By the way, could you accept the answer if it solves your original question?