Born to Automate : Extending SharePoint 2013 search - Intro architecture and components

SharePoint 2013 introduces a new, improved version of search that differs from previous versions of SharePoint. SharePoint search and FAST Search have been combined into a single search platform. Instead of the several flavors of search found in earlier releases (WSS search, Foundation search and so on), 2013 has only Foundation search and SharePoint Server search. Along with this come many new components and topology changes in the search architecture.

The search architecture in SharePoint 2013 now includes components for crawling, indexing content, administration, and executing search queries.

The main components of SharePoint 2013 search are:

Admin Component
Crawl Component
Content Processing Component
Analytics Processing Component
Index Component
Query Processing Component

SharePoint 2013 search admin component

The admin component runs the system processes for search and provisions the other search components within the topology. Its main responsibilities are handling topology changes and search provisioning, managing the search admin database, and scheduling crawling and content processing.

Crawl component

Crawling is simply the process of gathering documents from various sources and repositories, making sure they obey the configured crawl rules, and sending them to the content processing component for further processing. The crawl component is responsible for crawling content sources in SharePoint 2013. The content sources can be SharePoint sites, Microsoft Exchange Server public folders, BCS external content sources, file shares, etc. During the crawl, the crawl component connects to each content source by invoking the appropriate indexing connector or protocol handler to retrieve the content, and passes the crawled items to the content processing component.
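
To make the idea of "invoking the appropriate connector" a bit more concrete, here is a minimal Python sketch. It is only a model of the dispatch step, not SharePoint code: the connector names and the choice to dispatch on the URL scheme of the start address are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical scheme-to-connector table; the connector names are made up
# for this sketch and are not real SharePoint class names.
CONNECTORS = {
    "http": "sharepoint_or_web_connector",
    "https": "sharepoint_or_web_connector",
    "file": "file_share_connector",
}


def pick_connector(start_address: str) -> str:
    """Return the name of the connector that would handle this address."""
    scheme = urlparse(start_address).scheme.lower()
    if scheme not in CONNECTORS:
        raise ValueError(f"no connector registered for scheme {scheme!r}")
    return CONNECTORS[scheme]
```

Calling pick_connector("file://server/share") returns the file share connector, while an address with an unregistered scheme raises an error instead of being crawled.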

SharePoint 2013 supports three different kinds of crawls:

Full: During a full crawl, the entire content source is indexed regardless of whether individual items have changed since the last crawl. In short, it crawls all content defined in the source every time the crawl runs.
Incremental: It crawls content that has been modified since the last crawl based on either a timestamp or a change log.

Both full and incremental crawls are sequential and dedicated to a content source. This means that once a crawl is launched, we can't launch a second crawl in parallel on the same content source, so changes in content have to wait until the running crawl completes before they are included in the index and become searchable.

Continuous: Continuous crawling is an option that can be used instead of incremental crawls when we want content to be crawled continuously. It gives maximum freshness of the search index, because continuous crawls can run in parallel and do not wait for the prior crawl to complete before a new one is launched.

Some important points to consider for continuous crawling are:

Continuous crawling can only be enabled on content sources of type SharePoint Sites.
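The default interval is 15 minutes and can only be changed through PowerShell, by setting the ContinuousCrawlInterval property on the Search service application. Continuous crawling itself can be enabled or disabled per content source with the cmdlet Set-SPEnterpriseSearchCrawlContentSource.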
Once started, a continuous crawl can't be paused or stopped; it can only be disabled.
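
The difference between the three crawl types can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how the crawler is implemented: the Item class and both function names are hypothetical, and the incremental case models only the timestamp check (not the change log).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Item:
    url: str
    modified: datetime


def select_items(items, crawl_type, last_crawl=None):
    """Pick the items a crawl of the given type would process."""
    if crawl_type == "full":
        # A full crawl takes everything, changed or not.
        return list(items)
    if crawl_type == "incremental":
        if last_crawl is None:
            # No previous crawl to compare against: behave like a full crawl.
            return list(items)
        return [i for i in items if i.modified > last_crawl]
    raise ValueError(f"unknown crawl type: {crawl_type!r}")


def continuous_start_times(first_start, interval_minutes=15, count=4):
    """Start times of continuous crawls: a new crawl begins every interval,
    whether or not the previous one has finished."""
    step = timedelta(minutes=interval_minutes)
    return [first_start + n * step for n in range(count)]
```

With the default of 15 minutes, continuous_start_times gives starts at 0, 15, 30 and 45 minutes after the first crawl, whereas a full or incremental crawl on the same content source would have to finish before the next one could begin.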

Content processing component

The content processing component receives crawled content from the crawl component, analyzes and processes it to prepare it for indexing, and sends it on to the index component. It takes crawled properties from the crawler as input and produces managed properties as output for the indexer. The component uses parsers to process the content. If it is unable to parse a file, the search index will include only the basic file properties.
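
The step from crawled properties to managed properties can be pictured as a lookup through a mapping table. The Python sketch below assumes a tiny, made-up mapping; the property names and the "first non-empty crawled property wins" rule are illustrative assumptions, and a real farm defines its mappings in the search schema.

```python
# Hypothetical mapping: managed property -> crawled properties, in priority order.
MAPPINGS = {
    "Title": ["ows_Title", "doc_title"],
    "Author": ["ows_Author", "doc_author"],
}


def to_managed(crawled: dict, mappings: dict = MAPPINGS) -> dict:
    """Build managed properties from crawled ones.

    For each managed property, the first mapped crawled property that has
    a non-empty value is used. Unmapped crawled properties are dropped.
    """
    managed = {}
    for managed_name, crawled_names in mappings.items():
        for name in crawled_names:
            value = crawled.get(name)
            if value:
                managed[managed_name] = value
                break
    return managed
```

For an item with crawled properties ows_Title and doc_author, the result holds Title and Author, and anything without a mapping never reaches the index.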

Analytics processing component

The analytics processing component performs search analytics and usage analytics to improve search relevance. Search analytics refers to extracting analytic information such as links and anchor text from the crawled content. The component also processes user-initiated events such as clicks per item, which is referred to as usage analytics. The output of both kinds of analysis is used to create search reports and to generate recommendations and deep links. The results of the analyses are added to the items in the search index. Additionally, results from usage analytics are stored in the analytics reporting database. It makes a lot of sense to put this under the search umbrella: after analytics processing, the analytic data is committed to the index and is used in a variety of ways, such as boosting the relevance of a search result or showing the number of clicks in the hover panel over a search result.
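
As a rough illustration of the usage analytics half, the Python sketch below counts click events per item and writes the count back onto the indexed items. The event and item shapes (plain dicts with "type", "url" and "clicks" keys) are assumptions for this example only.

```python
from collections import Counter


def click_counts(events):
    """Count click events per URL; other event types are ignored."""
    return Counter(e["url"] for e in events if e["type"] == "click")


def apply_usage(index_items, counts):
    """Return copies of the index items with a 'clicks' field added."""
    return [{**item, "clicks": counts.get(item["url"], 0)} for item in index_items]
```

Once the count sits on the item in the index, a result page could show it or a ranking step could use it as a boost, which is the kind of reuse the paragraph above describes.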

Index component

The index component is responsible for building the index file. The index file contains crawled properties from content sources, along with ACLs, which ensure that search results are displayed only to users who have the rights to view the content. The index component stores both crawled items and their associated properties. The component uses update groups to allow partial updates: when content changes, only the part of the index belonging to the associated update group is updated rather than everything stored for that content, which makes updates more efficient.
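
Two ideas from this section, ACL-based trimming and update groups, can be modeled with a very small in-memory index. The Python sketch below is a toy under stated assumptions: the TinyIndex class, its method names and the group names "content" and "usage" are all hypothetical.

```python
class TinyIndex:
    """Toy index: properties are stored per update group, ACLs per item."""

    def __init__(self):
        self.groups = {}  # group name -> {item id -> {property: value}}
        self.acl = {}     # item id -> set of principals allowed to see it

    def add(self, item_id, allowed, **props_by_group):
        """Store an item's ACL and its properties, split by update group."""
        self.acl[item_id] = set(allowed)
        for group, props in props_by_group.items():
            self.groups.setdefault(group, {})[item_id] = dict(props)

    def partial_update(self, item_id, group, props):
        """Touch only one update group; the other groups are left as they are."""
        self.groups.setdefault(group, {}).setdefault(item_id, {}).update(props)

    def visible_to(self, principals):
        """Security trimming: ids of items whose ACL overlaps the caller's principals."""
        caller = set(principals)
        return sorted(i for i, allowed in self.acl.items() if allowed & caller)
```

A usage example: after idx.add("doc1", ["hr"], content={"title": "Leave policy"}, usage={"clicks": 0}), a call to idx.partial_update("doc1", "usage", {"clicks": 5}) changes only the usage group, and idx.visible_to(["sales"]) returns an empty list because the caller is not on the item's ACL.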

Query processing component

The query processing component analyzes and processes queries and results to optimize precision, recall and relevance. It takes a user query that comes from a search front-end, performs linguistic processing such as word breaking and stemming, and submits the processed query to the index component. It routes incoming queries to index replicas, one from each index partition. The index returns a result set for the processed query to the query processing component, which in turn processes the result set before sending it back to the search front-end.
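
The routing rule, one replica from each index partition, followed by merging the partial results, can be sketched in Python as follows. This is a simplified model: the data layout (a partition as a list of replica dicts), the run_query name, the naive whitespace word breaking and the term-count scoring are all assumptions for illustration, not the real ranking or linguistics.

```python
import random


def run_query(term, partitions, top_n=3, rng=random):
    """Fan a one-word query out to one replica per partition and merge the hits.

    partitions: list of partitions; each partition is a list of replicas;
    each replica is a dict {doc_id: text} holding the same documents.
    """
    needle = term.lower()
    hits = []
    for replicas in partitions:
        replica = rng.choice(replicas)  # exactly one replica per partition
        for doc_id, text in replica.items():
            score = text.lower().split().count(needle)  # naive word breaking
            if score:
                hits.append((score, doc_id))
    # Merge step: highest score first, doc id as a stable tie-breaker.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [doc_id for _, doc_id in hits[:top_n]]
```

Because every replica of a partition holds the same documents, it does not matter which replica is picked; the choice only spreads the query load.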

That's all for the architecture introduction to SharePoint 2013 search. In future posts we'll look in more detail at extending the SharePoint search infrastructure.

Sunday, June 1, 2014