How to: Configure Enterprise Search to crawl a web site

Posted Monday, July 20, 2009 2:49 PM by CoreyRoth

I had a few questions about this recently, so I thought I would write a quick post on it.  It’s actually pretty simple to set up, but I know people like to see instructions with pictures.  It’s really not much different than setting up any other content source such as a file share.  There are a few obvious things you have to take care of though.  First, you need to make sure that the index server has access to the site it’s crawling.  That means if you are behind a firewall, or you need to access your public facing web site using a different URL, you need to take that into consideration.  We’ll talk about that more in a minute.

To crawl your web site, first go to your content sources page and create a new content source.  Give your content source a name, select web site as the type, and type in the URL of the web site that you want to crawl.  When you choose this option, the crawler acts as a simple spider, following each link it can find and adding it to the index.  In this example, I am going to crawl DotNetMafia.com.

EnterpriseSearchWebSiteContentSource1

You also have the capability to set how many links the spider will follow when crawling and whether or not server hops are allowed.

EnterpriseSearchWebSiteContentSource2
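
If it helps to picture what those two settings do, here is a rough Python sketch of a spider with a page depth limit and a server hop limit.  This is not how Enterprise Search is implemented; it is only an illustration.  The in-memory PAGES dictionary, the example.com URLs, and the crawl function are all made up for this example, and I am assuming depth counts links followed from the start address and a hop is any link that lands on a different host.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URL maps to the links found on that page.
PAGES = {
    "http://www.example.com/": [
        "http://www.example.com/blogs/",
        "http://www.example.com/about.aspx",
    ],
    "http://www.example.com/blogs/": [
        "http://www.example.com/blogs/post1.aspx",
        "http://other.example.com/",
    ],
    "http://www.example.com/about.aspx": [],
    "http://www.example.com/blogs/post1.aspx": [],
    "http://other.example.com/": ["http://other.example.com/page.aspx"],
    "http://other.example.com/page.aspx": [],
}


def crawl(start_url, max_depth, max_hops):
    """Breadth-first spider; returns the crawled URLs in visit order."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0, 0)])  # (url, page depth, server hops)
    while queue:
        url, depth, hops = queue.popleft()
        order.append(url)
        for link in PAGES.get(url, []):
            if link in seen:
                continue
            new_depth = depth + 1
            # Assumption: a link to a different host counts as one server hop.
            changed_host = urlparse(link).netloc != urlparse(url).netloc
            new_hops = hops + (1 if changed_host else 0)
            if new_depth > max_depth or new_hops > max_hops:
                continue  # outside the configured limits, so skip the link
            seen.add(link)
            queue.append((link, new_depth, new_hops))
    return order


# With no server hops allowed, the spider never leaves www.example.com.
for crawled_url in crawl("http://www.example.com/", max_depth=2, max_hops=0):
    print(crawled_url)
```

Raising max_hops to 1 in this sketch lets the spider follow the link to other.example.com, which is the kind of thing you want to think about before allowing server hops on a real crawl.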

After you configure your content source, you can start a crawl.  When it completes, view the crawl log to see if there were any issues crawling your site.  This can also help you find broken links.

If you want to crawl a site that requires authentication, that can be done as well by creating a crawl rule.  You can specify credentials with a crawl rule from a variety of sources, such as a certificate, cookie, or even FBA.  I don’t have an example handy for that today, though, so I’ll cover it in a future post.

As I mentioned earlier, if the server has a different name internally than externally, that can be handled with a server name mapping.  A server name mapping takes the URL that was crawled and replaces it in the search results with a different URL (i.e. the external URL of the site).  Here is what that would look like.

EnterpriseSearchServerNameMapping
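
To show the idea behind a server name mapping, here is a small Python sketch that swaps the crawled server address for the one you want users to see.  It only illustrates the concept, not what SharePoint does internally.  The internal-web01 host name, the www.example.com address, and the apply_server_name_mapping function are all hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping: address used by the crawler -> address shown to users.
MAPPINGS = {
    "http://internal-web01": "http://www.example.com",
}


def apply_server_name_mapping(url, mappings):
    """Replace the scheme and host of a crawled URL if a mapping exists."""
    parts = urlsplit(url)
    crawled_address = f"{parts.scheme}://{parts.netloc}".lower()
    display_address = mappings.get(crawled_address)
    if display_address is None:
        return url  # no mapping for this server, so leave the URL alone
    display = urlsplit(display_address)
    # Keep the path, query string, and fragment; only the server part changes.
    return urlunsplit(
        (display.scheme, display.netloc, parts.path, parts.query, parts.fragment)
    )


print(apply_server_name_mapping(
    "http://internal-web01/products/default.aspx?id=3", MAPPINGS))
```

The path and query string stay the same, so a result that was crawled on the internal name still points at the right page on the external site.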

The last thing I will point out is that there isn’t a way to exclude a portion of the page from being included in the index (at least as far as I know).  What this means is that if you have common navigation on every page, the words in it will match on every page.  For example, if you have a link called Contact Us, all pages with the Contact Us link are going to show up as a hit in the search results.  Here’s an example of what I mean.  There are way too many results, which doesn’t help the user at all in this case.

EnterpriseSearchWebSiteResults1
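
If you want to see why this happens, here is a toy Python example.  It builds a very simple word index over three made-up pages that all share the same navigation text.  The page names, the text, and the build_index and search functions are invented for illustration, and a real search index is far more sophisticated than this.

```python
# Hypothetical site: every page repeats the same navigation links.
NAVIGATION = "Home Products Contact Us"
PAGES = {
    "/default.aspx": "Welcome to our company",
    "/products.aspx": "Widgets and gadgets for sale",
    "/contact.aspx": "Call or email our office",
}


def build_index(pages, navigation):
    """Map each lowercased word to the set of pages containing it."""
    index = {}
    for url, body in pages.items():
        # The crawler sees the whole page, navigation included.
        for word in f"{navigation} {body}".lower().split():
            index.setdefault(word, set()).add(url)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


index = build_index(PAGES, NAVIGATION)
print(search(index, "contact us"))  # every page matches because of the nav
print(search(index, "widgets"))     # only the products page matches
```

Because the navigation words get indexed along with the page body, a query for Contact Us matches every page, not just the contact page.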

As you can see, crawling web sites with Enterprise Search is pretty easy to set up.  You may have to deal with some issues like the one above, but it’s still not a bad solution.  This is a great way to index your public facing corporate site and bring those results into SharePoint.

Comments

# re: How to: Configure Enterprise Search to crawl a web site

Monday, August 23, 2010 3:12 PM by JenW

I know this is an old post but I have a related question.

Did you notice that when crawling a site outside of SharePoint, the date displayed in the search results is equal to the 'last crawled date'?  I am currently looking for a workaround for this and thought I would ask if/how you have dealt with this issue.

Thanks.

# re: How to: Configure Enterprise Search to crawl a web site

Monday, February 28, 2011 4:50 PM by Eric Xue

Good post and thanks for sharing. I guess it's time for you to upgrade to SharePoint 2010.

Cheers

# re: How to: Configure Enterprise Search to crawl a web site

Friday, September 5, 2014 9:36 AM by Sam

Hi,

In my SharePoint 2013 environment I have configured an external content source for a web site that contains a bunch of ASPX pages. A few of those pages are not being crawled, which is driving me nuts.

For example:

1) www.website.com/sd/servicedesk.aspx

2) www.website.com/sd/contentpages/hardware.aspx

From the above URL list, #2 is being crawled but #1 is not.

I do have a few crawl rules:

1) http://.*website.com/*.* (include)

2) https://.*website.com/*.* (include)

3) http://*.* (exclude)

When I test "www.website.com/sd" from the Crawl Rule page, rule #1 matches, but I don't see "www.website.com/sd/servicedesk.aspx" in the crawl log, not even under errors or warnings.

Please let me know if you have any idea. Thanks in advance for helping me.

Best,

Sam
