
Tuesday, December 8, 2009

SQL Full-text Search (SQL FTS) is an optional component of SQL Server 7 and later, which allows fast and efficient querying when you have large amounts of unstructured data. This is the first of a two-part article that explores the full-text language features that ship with SQL Server versions 7, 2000, and 2005, with particular focus on the new language features in SQL 2005.

Here, in part 1, we examine the index time language options: how words or tokens are broken out and stored in a full text index. In part 2, we will switch focus to the query time language options.

SQL FTS architecture

SQL FTS builds a full-text index through a process called population, which fills the index with words and the locations in which they occur in your tables and rows. The full text indexes are stored in catalogs. You can define multiple catalogs per database, but the catalog cannot span databases; it is database specific. Similarly a table can be full-text indexed in a single catalog; or to put it another way - a table's full-text index cannot span catalogs. However, in SQL 2005 you can full-text index views which may reside in different catalogs than the underlying base tables. This provides performance benefits and allows partitioning of tables.

There is a useful depiction of the SQL FTS architecture in BOL. I won't repeat it here, but I do encourage you to familiarize yourself with it and with how the various FTS components interact. Very briefly, during the population process the indexing engine (MSSearch in SQL 7 & 2000, MSFTESQL in SQL 2005) connects to SQL Server using an OLE-DB provider (PKM - Publishing and Knowledge Management) and extracts the textual content from your table on a row-by-row basis. MSSearch uses the services of COM components called iFilters to extract a text stream from the columns you are indexing.

iFilters

The iFilter used depends on the data type of the column in which the data is stored and on how you have configured your full-text index:

  • For columns of the CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, and NTEXT data types the indexing engine applies the text iFilter. You can't override this iFilter.
  • For the columns of the XML data type the indexing engine applies the XML iFilter. You can't override the use of this iFilter.
  • For columns of the IMAGE and VARBINARY data types, the indexing engine applies the iFilter that corresponds to the document extension this document would have if stored in the file system (e.g. for a Word document, this extension would be doc; for an Excel spreadsheet it would be xls). Please refer to the later section on Indexing BLOBs for more information.

NOTE: It is important to understand that an iFilter can launch a different language word breaker from the one specified as the default in the full-text language setting for your Server, or from the word breaker you set for the column in the table you are full-text indexing. I cover this in more detail in the Word Breaker section.

A complete list of built-in iFilters is accessible from: http://www.indexserverfaq.com/sql2005iFilters.htm. You can also obtain a list of the signed iFilters that are currently accessible on your SQL Server 2005 by issuing the following query in any database:

select document_type, path from sys.fulltext_document_types

The iFilters are listed in the path column, and the document types that are indexed by the iFilters are listed in the document_type column. Note that although there are iFilters for asp and aspx pages, these pages will not be rendered by the asp or aspx engines; rather, the HTML meta tags and the textual content between the body tags will be indexed and searchable.

NOTE: To see the text (and properties) that the iFilters emit from documents, you can run them through FiltDump, a utility available from the Platform SDK in the binn directory. It will show not only the text and properties the iFilters extract from the documents, but also the language tags applied to the text streams.

A note on iFilters and BLOBs

For SQL 2000, the correct iFilter for the Binary Large Object (BLOB) column you wish to index must be installed on your OS (use the filtreg utility, available from the Platform SDK in the binn folder, to tell which iFilters will be applied for a specific document type).

For SQL 2005, if you wish to index a document type for which a pre-installed iFilter does not exist, you must install the appropriate iFilter on the OS, and then load it using the following commands:

exec sp_fulltext_service 'load_os_resources', 1;
exec sp_fulltext_service 'verify_signature', 0;
go

So, for example, if you wish to store and index PDFs in columns of the IMAGE or VARBINARY(MAX) data type you must install the Adobe PDF iFilter and issue the two commands above. Otherwise you will get errors of the following form in the full-text gatherer log (found at C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Log):

2006-01-10 13:42:59.29 spid28s
Warning: Failed to get MSFTESQL indexer interface for full-text catalog
'ix_STS_ctcsppweb01_1' ('5') in database 'MyDatabase' ('11').
Error: 0x80000018.

SQL FTS Overview

In essence, in order to perform full text searching on a table you need to:

  1. Ensure that the table has a unique, not null column (e.g. primary key)
  2. Create a full text catalog in which to store full text indexes for a given table
  3. Create a full text index on the text column of interest
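As a hedged sketch (the table, column, catalog, and index names here are hypothetical, and the syntax shown is the SQL 2005 form), the three steps might look like this:

```sql
-- 1. A table with a unique, not-null key (the primary key serves this purpose)
CREATE TABLE Documents (
    DocId   INT NOT NULL CONSTRAINT PK_Documents PRIMARY KEY,
    DocText NVARCHAR(MAX)
);

-- 2. A full-text catalog in which to store the full-text index
CREATE FULLTEXT CATALOG DocumentsCatalog;

-- 3. A full-text index on the text column of interest, keyed on the unique index
CREATE FULLTEXT INDEX ON Documents (DocText)
    KEY INDEX PK_Documents
    ON DocumentsCatalog;
```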

For a brief tutorial on how to create a full text index on SQL 2005 please click on the CODE DOWNLOAD link in the box to the right of the article title.

You can build full-text indexes on textual data stored in CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, and NTEXT columns, as well as on IMAGE (SQL 200x), VARBINARY(MAX) (SQL 2005), and XML (SQL 2005) data type columns that contain formatted binary data (such as a Word document, for example).

Indexing Text

The sophisticated language features of full-text search then allow you to perform a range of advanced searches on your textual data, using the CONTAINS and FREETEXT T-SQL predicates (as well as the CONTAINSTABLE, and FREETEXTTABLE functions), returning a list of rows that contain a word or a phrase in the columns that you are searching on. You can perform:

  • Simple searches for specific words or phrases
  • Thesaurus searches for synonymous forms of a word – a search on IE might return hits to Internet Explorer and IE (thesaurus-based expansion search); a search on Bombay might also return hits to Mumbai (thesaurus-based replacement search)
  • Searches that will return all different linguistic forms of a word (called generations) – a search on bank would return hits to banking, banked, banks, banks' and bank's, etc. (all declensions and/or conjugations of the search term bank)
  • Accent insensitive searches – a search on café would return hits to cafe and café
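By way of illustration, these search types map onto the predicates roughly as follows (table and column names are hypothetical; thesaurus and generational expansion are requested with the FORMSOF clause of CONTAINS, while FREETEXT applies such expansions implicitly):

```sql
-- Simple phrase search
SELECT * FROM Docs WHERE CONTAINS(DocText, '"Internet Explorer"');

-- Thesaurus search (expansion or replacement, per the thesaurus file)
SELECT * FROM Docs WHERE CONTAINS(DocText, 'FORMSOF(THESAURUS, IE)');

-- Generational search (all inflectional forms of bank)
SELECT * FROM Docs WHERE CONTAINS(DocText, 'FORMSOF(INFLECTIONAL, bank)');

-- FREETEXT performs word breaking, stemming and thesaurus expansion automatically
SELECT * FROM Docs WHERE FREETEXT(DocText, 'bank');
```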

NOTE: there are other forms of searches too, which we don't cover in this article – for example weighted searches and proximity searches. See MSDN2 for further details.

In short, SQL FTS provides a very powerful text searching mechanism. It is many orders of magnitude faster than a LIKE operator; especially for larger tables. This is because when you use the LIKE operator it does a byte-by-byte search through the contents of a row looking for a match to the search phrase. SQL FTS references the full text index to instantly return a list of matches. SQL FTS also supports a large number of different languages and language characters. As well as accents, it also seamlessly handles: compound words (in German and Chinese) and compound characters which occur in Chinese.

You may have noticed that in some of the earlier CTPs (Community Technology Previews) for SQL Server 2005, 23 languages were supported; in the RTM version of SQL Server 2005 only 17 languages are supported. Please refer to http://support.microsoft.com/kb/908441/en-us for more information on how to enable the missing 6 languages.

Indexing BLOBs

With SQL 2000 and later, you also have the ability to full-text index BLOBs (Binary Large Objects or binary files) stored in the IMAGE (SQL 200x) and VARBINARY data type column (SQL 2005). SQL 7 did not have the capability to index BLOBs.

For BLOBs to be successfully full-text indexed you must ensure that the document extension is stored in an additional column (referred to as a document type column), and that you have configured your full-text index to use this document type column (i.e. the content of the document type column would be doc for rows containing Word documents, xls for rows containing Excel Spreadsheets, xml for rows containing XML documents, and so on).

Use VARCHAR(3) or VARCHAR(4) for the document type column because, in some of the earlier versions of SQL FTS, if your document extension was shorter than the length of the document extension column, the CHAR data type would pad the row contents with spaces, and consequently MSSearch would not launch the iFilter correctly. For example, if your content was a Word document with the doc file extension, stored in a CHAR(4) document type column, MSSearch would attempt to launch the iFilter corresponding to "doc " instead of "doc", and fail. The VARCHAR data type does not pad the data with white space, so this problem is circumvented.

You configure your full-text table to reference this document type column by using the sp_fulltext_column stored procedure (in SQL 2000) or by the CREATE FULL TEXT INDEX statement in SQL 2005.
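A minimal sketch of such a configuration in SQL 2005 (all object names are illustrative): the document type column holds the extension, and the TYPE COLUMN clause ties it to the BLOB column.

```sql
CREATE TABLE BlobDocs (
    DocId      INT NOT NULL CONSTRAINT PK_BlobDocs PRIMARY KEY,
    DocContent VARBINARY(MAX),   -- the BLOB itself
    DocExt     VARCHAR(4)        -- 'doc', 'xls', 'xml', ... (VARCHAR avoids space padding)
);

CREATE FULLTEXT CATALOG BlobCatalog;

-- TYPE COLUMN tells the indexing engine which iFilter to launch for each row
CREATE FULLTEXT INDEX ON BlobDocs (DocContent TYPE COLUMN DocExt)
    KEY INDEX PK_BlobDocs
    ON BlobCatalog;
```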

Script 1 in the download code for this article provides a worked example of indexing a BLOB.

Further information on how to index BLOBs in SQL 2000 and SQL 2005 can be found at http://www.indexserverfaq.com/Blobs.htm and at http://www.indexserverfaq.com/BlobsSQL2005.htm.

SQL FTS Language Features

The language features of SQL FTS come into play both at "Indexing Time" and "Query Time". At index time we are essentially building an index of words or tokens from the textual data. A token is a character string, delineated from other character strings by white space or punctuation, which may or may not have linguistic significance and is not necessarily what we would normally construe as a word, e.g. MSFT, XXX, PartNo 1231, 159-23-4364, or even http://www.microsoft.com.

During the indexing process, linguistic analysis is performed: word boundaries are located by applying language-specific components called word breakers to the text data, thus identifying our tokens. Basically, the word breakers control how the textual content is broken into words and how these words are stored in the full-text index. Word breakers may tag a word as currency or a date and, for some languages (especially German and Far East languages), will sometimes analyze the word and extract constituent words or characters.

At query time, the user supplies a query, such as the following:

SELECT * FROM TableName WHERE CONTAINS(ColumnName,'bank')

When the query is submitted, a language specific component called a stemmer is used to expand the search phrase, depending on language specific rules that are in place. These rules are controlled by SQL Server settings, or your SQL Full-text query arguments. So, in the above example, the search term, bank, would be expanded to incorporate variants such as banking, banked, banks, banks' and bank's, etc. (all declensions and/or conjugations of the search term bank). The index will then be searched for these words and the matching rows will be returned.

Word Breakers

The iFilters return a text stream to the indexing engine which then applies language specific word breakers to the text stream to break the stream into words or tokens typically at white space or punctuation boundaries.

Specifying the Word Breaker

Different language word breakers are loaded depending on how you configure SQL Server, or how you configure full-text indexing or, sometimes, even on the content being indexed.

Via Full-Text Index Settings

The word breaker used corresponds to the LCID (locale identifier) specified in sp_fulltext_column (for SQL 7 and 2000), or to the language specified in the CREATE FULLTEXT INDEX statement in SQL 2005, as illustrated below:

CREATE FULLTEXT INDEX ON FullTextTableImage 
(imageCol TYPE COLUMN DocumentTypeColumn LANGUAGE 'FRENCH')
KEY INDEX PK_FullTextTableImage -- the statement requires a key index; name illustrative

See Script 2 in the download code for a full worked example

For SQL 2005, you can get a list of supported full-text languages in SQL 2005 by issuing the following query:

select * from sys.fulltext_languages

In SQL 2000 you would query xp_msfulltext, which returns the language name and its LCID. If no language word breaker exists for an LCID, the neutral word breaker (LCID=0) is used.

Via SQL Server Settings

If no language is specified in sp_fulltext_column or the CREATE FULLTEXT INDEX statement, then the default full-text language setting for your SQL Server is used. You can check or modify the default full-text language setting using:

sp_configure 'default full-text language'

You may have to set advanced options to "on" to see or use this option.
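A sketch of the full sequence (1033, the LCID for US English, is just an example value):

```sql
-- advanced options must be visible before this option can be read or set
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- check the current default full-text language
EXEC sp_configure 'default full-text language';

-- change it, e.g. to US English (LCID 1033)
EXEC sp_configure 'default full-text language', 1033;
RECONFIGURE;
```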

Via language tags in the content being indexed

Some documents have language tags and, if the iFilter understands these language tags, it might (depending on the iFilter implementation) launch a different word breaker from the one specified by the full-text index or SQL Server settings.

The language-aware document formats that I know of are Word, XML, and HTML. These language tags will override all of the above settings. For Word, the entire document, or a portion of the document, can be tagged as having a different language from the default for the machine on which it was created, and there can be multiple different language tags in the same document (e.g. French, German, and Chinese). You can set the Word language tags in two ways:

  1. At the document level: open your document in Word and click Format; in the Show: drop-down box, select Custom; click Styles, Custom, and select Modify; then click Format and select Language.
  2. At the text level: highlight a portion of text, click Tools, then Language, then Set Language, and select the language with which you wish this portion of text to be tagged.

For HTML documents, you can use the ms.locale meta tag. For example, for Greek the tag would be:

<meta name="MS.LOCALE" content="el-gr" >

For XML documents, you can use the xml:lang tag. For example, for Ukrainian use:

<xml:lang="uk">

Note that the use of the Greek and Ukrainian languages is for illustrative purposes only. These languages are not currently supported in SQL Full-Text search.

Script 3 from the code download illustrates the use of language tags to set the word breaker.

How Words are stored in the index

Generally, the words broken at index time are stored in the catalogs exactly as broken by the word breakers. However, there are some exceptions. For example:

  • c# is indexed as c (if c is not in your noise word list; more on noise word lists later), but C# is indexed as C# (in SQL 2005, and in SQL 2000 running on Win2003, regardless of whether C or c is in your noise word list). It is not only C# that is stored as C#, but any capital letter followed by #. Conversely, c++ (and any other lower-case letter followed by ++) is indexed as c (regardless of whether c is in your noise word list).
  • Dates are stored in multiple formats for example 1/1/06 is stored as 1/1/06 and 2006/1/1 (and also 1906/01/01! – 1/1/2006 is stored correctly as 1/1/2006 and 2006/01/01).
  • For some languages the word is stored more than once in the catalog – the first time as it appears in the content, and then with alternative word forms of that word. For example, using most word breakers, L'University is stored as University and also as L'University.
  • German compound words are stored as a compound word and the constituent words (for example, wanderlust is stored as wanderlust, wander, and lust) when indexed using the German word breaker. Using other word breakers it is stored as wanderlust.
  • Chinese and other Far East language "words" are stored as the word, characters, and constituent characters.

You can determine how the word breaker will break a word in your content by running the word through the lrtest (language resource test) tool. Lrtest ships with the SharePoint Portal Server resource kit, which you can obtain from SharePoint Portal Server CD in the Support\Tools directory.

You can find a description of how to use this tool here: http://support.microsoft.com/kb/890613. For SQL 2000 the language resource location is:

HKLM\SYSTEM\CurrentControlSet\Control\ContentIndex\Language

The relevant language resource location for SQL 2005 is:

HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSearch\Language

Regrettably, lrtest does not work for most SQL 2005 word breakers except for Thai, Swedish, and Japanese, so the following examples use the SQL FTS 2000 word breakers. For the time being, you will have to resort to trial and error to determine if the behavior revealed using lrtest on the SQL 2000 word breakers still applies on SQL 2005.

For index time behavior the relevant key for your language is WBreakerClass. So, to see how C# is broken for US English you get the CLSID of the following key:

HKLM\SYSTEM\CurrentControlSet\Control\ContentIndex\Language\
English_US\WBreakerClass

And run the following:

lrtest /b /c:{80A3E9B0-A246-11D3-BB8C-0090272FA362} "C#"

The result (abbreviated) should look as follows:

IWordSink::PutWord: cwcSrcLen 2, cwcSrcPos 0, cwc 2, 'C#'

Trying c#:

lrtest /b /c:{80A3E9B0-A246-11D3-BB8C-0090272FA362} "c#"

The result (abbreviated) should look as follows:

IWordSink::PutWord: cwcSrcLen 2, cwcSrcPos 0, cwc 2, 'c'

Adding support for your own language word breaker

In SQL 2005 you can add support for your own language by adding a key to HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSearch\Language. For example in order to add support for Arabic, follow these steps:

  1. Download and install the Arabic word breaker from:
    http://www.microsoft.com/middleeast/arabicdev/beta/search/
  2. Run the installation program
  3. Install Complex Script components – open up Regional Settings, click on the Languages tab, and select Install files for complex script and right to left languages.
  4. Add the Arabic key HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSearch\Language\Arabic-EG with the following values:

     Locale (DWORD): 1025 (decimal)
     NoiseFile (String): c:\temp\Arabic.txt
     StemmerClass (String):
     TsaurusFile (String): c:\temp\ArabicThesaurus.txt
     WordBreakerClass (String): {3E0C67A6-38F8-43b6-BD88-3F3F3DAC9EC1}

  5. Add the key HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSearch\CLSID\{3E0C67A6-38F8-43b6-BD88-3F3F3DAC9EC1} with a string value of lrar32.dll.
  6. Copy lrar32.dll and lrar32.lex from C:\Program Files\SharePoint Portal Server\Bin\ to C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn.
  7. Issue a call to exec sp_fulltext_service 'update_languages' and restart SQL Server.

It is critical that you review your license, or contact your local Microsoft office, with regard to possible licensing issues surrounding use of the Arabic word breaker in your SQL Server search application. The Arabic word breaker seems to be in perpetual beta and it is not clear whether it is licensed for SQL Server 2005. It appears that there are no licensing restrictions on other third-party supplied SQL 2005 word breakers, but they are not supported by Microsoft.

This post on blogs.msdn.com provides more information on how to add custom language support for SQL 2005 FTS.

The following table summarizes how some of these tokens are broken on the different platforms:

Token                 SQL 2000 on Win2k (SP4)   SQL 2000 on Win2003   SQL 2005
AT&T                  AT, T                     AT, T                 AT&T
C#                    C                         C#                    C#
C++                   C                         C++                   C++
c#                    C                         C                     C
c++                   C                         C                     C
.Net                  NET                       NET                   NET
<Title>rats<Title>    <TITLE>R, ATS, <TITLE>    TITLE, RATS, TITLE    TITLE, RATS, TITLE

Note that the anomalies you see with the HTML tags, or the > and < characters, only occur when you are indexing text; they do not occur if you are using the HTML iFilter. Unless otherwise indicated, these tokens are all searchable using a placeholder in place of the non-alphanumeric character – so a search on AT&T will match AT&T, AT!T, AT*T, AT T, and so on. To correct this behavior you may need to replace the problem token in your content with a non-breaking version. For example, if you want searches on AT&T to be handled properly, you need to replace AT&T with ATANDT wherever it occurs in the tables you are full-text indexing, and then trap for search phrases which include AT&T and replace it with ATANDT. The replacement feature in the thesaurus option is ideal for this.
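As a sketch, such a replacement entry lives in the language-specific thesaurus XML file (for US English this is, to my knowledge, tsenu.xml in the FTData directory of your instance); note that the ampersand must be escaped in XML:

```xml
<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <replacement>
      <pat>AT&amp;T</pat>
      <sub>ATANDT</sub>
    </replacement>
  </thesaurus>
</XML>
```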

Chinese, Japanese, Korean and Thai

The Far East word breakers (Chinese, Japanese, Korean and Thai) break the text stream into individual characters, or even break the characters into constituent characters which correspond to words. The Chinese writing system is not phonetically based. Most ancient writing systems (Chinese, Sumerian, Egyptian, and Aztec for example) started as pictograms, whereby each character was represented by a picture. Due to the large number of characters required, most pictorial systems collapsed and/or evolved into phonetic systems. However, Chinese retained a pictogram component and then evolved to incorporate ideograms, whereby the graphical symbols (characters) represent ideas.

There are actually five types of Chinese characters. The simplest are pictograms and ideograms, but these two types comprise only a small percentage of the total number of characters. Most Chinese characters are semantic-phonetic compounds where part of the character indicates the sound of the word, and another part indicates its meaning. The other two types are compound ideograms (where two or more characters are compounded into one, and their individual meanings contribute to the meaning of the whole) and phonetic loans (in which a pre-existing character is borrowed to represent a word whose sound is similar to that of the word the character originally represented).

Consider that the character for tree in Chinese is (http://www.simple-talk.com/iwritefor/articlefiles/130-image003.jpg), forest is (http://www.simple-talk.com/iwritefor/articlefiles/130-image004.jpg), and (http://www.simple-talk.com/iwritefor/articlefiles/130-image005.jpg) is a wood. If you look closely you will see the character for tree is present in the characters for wood and forest; in wood you see two tree characters stuck together, in forest you see three stuck together.

In Japanese or Chinese a character string has to be broken into its component characters to determine what the word is about and to determine what it matches (as the phrase among Natural Language researchers goes – you can tell a word by its neighbors – similarly you can tell a character by its neighbors). So if you were doing a FreeText search on the Chinese character wood you would expect to get hits to rows containing wood and forest.

However, to index Chinese characters you need to derive their meaning, which requires further decomposition. Chinese characters consist of one or more strokes, termed radicals. There are 214 radicals used to build characters. By decomposing a character into its constituent characters, or radicals, you can derive its meaning. For more information on radicals consult: http://www.chinese-outpost.com/language/characters/char0035.asp.

The word breakers for Chinese, Japanese, Korean, and Thai will take the text stream, break at white space, and punctuation, and then make multiple passes of the tokens using context to extract the constituent characters. Indexing these languages is a CPU intensive process because of the multiple passes required in the word breaking process for these languages.

UK vs. US English at Index Time

SQL FTS supports two versions of English: UK English and US English. One of the questions I most frequently get asked is "What is the difference between the UK and US English word breakers?" There are four points to keep in mind here:

  • In actual fact, UK English does not refer to the Queen's English or the English used in the United Kingdom, but to International English: the English used in English-speaking countries other than the US.
  • They use the same word breaker.
  • Most frequently the UK word breaker is used when you want to use a different noise word list than the US one to implement a less precise search (we'll cover this in the section on noise words), or a different thesaurus.
  • Significant differences do occur at query time between the two language stemmers (we'll cover this in the next article, in the section on UK vs. US English at Query Time), but not in the word breakers. In fact, in SQL 2000 FTS the UK English and US English word breakers are the same; in SQL 2005 they have different CLSIDs. I did speak with a Microsoft developer who worked on many of the Microsoft Search products, and he speculated that the differences are there to handle different word endings in UK English, e.g. catalog versus catalogue; check versus cheque; bank versus banc or banque; orientated versus oriented.

German

As mentioned previously, the German language has a large number of compound words. When these words are indexed they are stored as the compound word as well as its constituent words. The German word breaker also handles umlauts (the dots that appear in Mötley Crüe). Umlauts denote plural forms of nouns and sometimes different verb forms (e.g. present tense). Words with umlauts are indexed both with the umlaut and as the variant of the word without the umlaut; for example, häuser (German for houses) is indexed as häuser and haeuser. Mötley Crüe is stored as Mötley Crüe and Moetley Cruee.

Noise Words

When computer scientists first started work in the field of information retrieval (the branch of computer science which involves search technology) hard disks were very expensive. Search scientists did anything they could to save on disk space. They recognized that there were a large number of words which occurred very frequently and which were not helpful in resolving searches. Such words are referred to as noise words. For example, words which linguists class as articles, such as the, a, and an, are for the most part irrelevant in a search. They add nothing to the search and don't help to resolve searches – for instance a search on University of California at Berkeley returns the same results as a search on University of California Berkeley. Similarly, pronouns like me, my and mine don't help in a search either. Articles and pronouns (and other similar parts of speech) occur frequently in bodies of text, and by not indexing them search scientists were able to save disk space and improve search speed (very slightly), with only a marginal decrease in accuracy. Google does not index noise words, but MSN Search does, to provide greater accuracy in searches!

Another example of a noise word is a word which may not be a part of speech, but occurs frequently in the content and does not help in resolving searches. For example, the word Microsoft occurs on every page of Microsoft.com. A search on Microsoft Windows XP is indistinguishable from a search on Windows XP. By adding such words to your noise word list your catalog size will decrease.

The problem with noise words is that they will break searches which contain them. So a search on the band "the the" will result in the following message on SQL 2005:

Informational: The full-text search condition contained noise word(s).

But results will be returned for matches to other search arguments, so a search on "the the albums" would return rows which contained the word albums in the full-text indexed columns. On SQL 2000 however, the message is:

Server: Msg 7619, Level 16, State 1, Line 1

Execution of a full-text operation failed. A clause of the query contained only ignored words.

And the query will not return any results. BOL states that this behavior can be fixed in SQL 2005 by using:

exec sp_configure 'transform noise words', 1
reconfigure

But as far as I can tell, this command has no effect.

Summary

In this article, we looked at the language options of SQL Full-Text Search that come into play at index time. We started off by looking at the SQL full-text architecture, with special emphasis on iFilters and word breakers. We also looked at how language-aware iFilters can override the server language settings or the word breaker you specify to break the text in these documents. We then discussed how words are stored in the full-text catalogs and looked at some of the language idiosyncrasies. In the next article in the series we will look at the language options at query time. Till next time, I hope you find what you are looking for!


Thursday, November 12, 2009

Many posts (e.g.: http://blog.deploymentengineering.com/2007/06/dealing-with-very-large-number-of-files.html, http://www.wintellect.com/cs/blogs/jrobbins/archive/2007/10/19/wix-the-pain-of-wix-part-2-of-3.aspx, http://blog.deploymentengineering.com/2007/06/burning-in-hell-with-heat.html) have been written about the problem of adding a large number of files to a WIX installer. This problem is most painful when you want to add content files that do not really have any special purpose, but just have to be there (e.g. code samples or source code packages).

I also struggled with this problem, and finally I found myself creating a small MsBuild tool (WixFolderInclude.targets) that you can include in your WIX project and use to generate file installers for entire folders on the disk. :-) I call it a tool, as I don’t have a better name for it, but it is not (only) an MsBuild target, nor is it a Task. Actually it is a WIX MsBuild extension, but WIX already has a “WIX extension” term, which is something else. So let’s stick to “tool”.

The WixFolderInclude tool

Let’s see how you can use this tool; it was tested with the latest WIX framework (v3.0.4220), but it probably works with older v3x versions as well. I’m assuming that you are more or less familiar with the WIX and MsBuild concepts. If not, you can grab the necessary information quickly from Gábor Deák Jahn's WiX Tutorial and the MSDN docs.

WIX projects (.wixproj) are MsBuild projects, and you can extend them with other MsBuild property definitions or targets. One option is to modify the wixproj file in a text editor… This is fine, but I like the WIX project to open in Visual Studio, and in this case modifying the project file is not easy. Instead, I usually start by creating a “Build.properties” file in the WIX project (it has type “Content”, so it does not affect the WIX compilation), where I can write my MsBuild extensions. I have to modify the project file only once, to include the Build.properties file. I usually include it directly before the wix.targets import:

</ItemGroup>
<Import Project="$(MSBuildProjectDirectory)\Build.properties" />
<Import Project="$(WixTargetsPath)" />

But you can directly write into the project file as well, if you don’t use the VS integration.

Let’s take a very simple example: I would like to include two code samples in the installer. They are located in some folder (C:\Temp\ConsoleApplication1 and C:\Temp\WebApplication1) and I would like to install them in a “Samples” folder inside the program files entry of my installed application. Of course both samples contain sub-folders that I also want to include.

To achieve that with my tool:

  • you define MsBuild project items that describe how these folders should be installed;
  • you define a couple of properties to hook the tool into the build;
  • during compilation, the tool generates temporary WIX fragment files (and includes them in the compilation) that contain the Directory/Component/File structure and a component group gathering the components generated for the files in the directory structure;
  • you reference the generated component groups from the installation features of your choice in your normal wxs files (e.g. Product.wxs).

So first, let’s create the folder descriptions for my sample. The tool searches for project items called “WixFolderInclude”, so we have to create such items for the folders we want to include:

<ItemGroup>
  <ConsoleApplication1Files Include="C:\temp\ConsoleApplication1\**" />
  <WixFolderInclude Include="ConsoleApplication1Folder">
    <SourceFiles>@(ConsoleApplication1Files)</SourceFiles>
    <RootPath>C:\temp\ConsoleApplication1</RootPath>
    <ParentDirectoryRef>Dir_Samples</ParentDirectoryRef>
  </WixFolderInclude>
</ItemGroup>

<ItemGroup>
  <WebApplication1Files Include="C:\temp\WebApplication1\**" Exclude="C:\temp\WebApplication1\**\*.scc" />
  <WixFolderInclude Include="WebApplication1Folder">
    <SourceFiles>@(WebApplication1Files)</SourceFiles>
    <RootPath>C:\temp\WebApplication1</RootPath>
    <ParentDirectoryRef>Dir_Samples</ParentDirectoryRef>
  </WixFolderInclude>
</ItemGroup>

As you can see, you define the set of files to include using standard MsBuild item syntax, so you can pull in deep folder structures, exclude files, or even list the files one by one. In the example here, I have excluded the source-control info files (*.scc) from the second sample.

In the WixFolderInclude items, note the following:

  • The main entry (ConsoleApplication1Folder and WebApplication1Folder) describes the name of the folder installation. The generated component group ID will be based on this name, so you can use any meaningful name here, not necessarily the folder name.
  • The “SourceFiles” metadata should contain the files to be included in this set (unfortunately, you cannot use wildcards here directly, so you have to create a separate item for them).
  • The “RootPath” metadata contains the folder root of the folder set to be included in the installer. This could also be derived from the source file set (by taking the common root folder), but I like to have it more explicit, like this.
  • The “ParentDirectoryRef” metadata specifies the ID of the <Directory>, where the folder should be included in the installer. Now I have created a directory (Dir_Samples) for the Samples folder in the program files, so I have specified that as parent.

With the definitions ready, the next step is to set up the tool. It is very simple; you just have to include the following lines in Build.properties (or the project file):

<Import Project="$(MSBuildProjectDirectory)\Microsoft.Sdc.Tasks\Microsoft.Sdc.Common.tasks" />

<PropertyGroup>
  <CustomAfterWixTargets>$(MSBuildProjectDirectory)\WixFolderInclude.targets</CustomAfterWixTargets>
</PropertyGroup>

The value of CustomAfterWixTargets should point to the tool's targets file. If you have it in the project folder, you can use the setting above directly. Also note that the tool uses the Microsoft.Sdc.Tasks library (http://www.codeplex.com/sdctasks). I have tested it with the latest version (2.1.3071.0), but it might work with older versions as well. You should import the Microsoft.Sdc.Common.tasks file only once, so if you have already imported it elsewhere in your project, you can skip that line.

Now we are done with the entries in Build.properties, so let’s include the folders in the installer itself. As I mentioned, the tool generates fragments that contain a component group for each included folder. The component group is named CG_<name of the WixFolderInclude item>; in our case, these are CG_ConsoleApplication1Folder and CG_WebApplication1Folder. So let’s include them in the main feature:

<Product ...>
  ...

  <!-- set up the folder structure -->
  <Directory Id="TARGETDIR" Name="SourceDir">
    <Directory Id="ProgramFilesFolder">
      <Directory Id="INSTALLLOCATION" Name="WixProject1">
        <Directory Id="Dir_Samples" Name="Samples">
        </Directory>
      </Directory>
    </Directory>
  </Directory>

  <!-- include the generated component groups in the main feature -->
  <Feature Id="ProductFeature" Title="WixProject1" Level="1">
    <ComponentGroupRef Id="CG_ConsoleApplication1Folder"/>
    <ComponentGroupRef Id="CG_WebApplication1Folder"/>
  </Feature>
</Product>

And that’s it. We are ready to compile!

Fine tuning

The tool supports some additional configuration options, mainly for debugging purposes: you can specify the folder where the temporary files are stored (by default, the value of the %TMP% environment variable) and whether it should keep the temp files (by default, it deletes them after compilation). These settings can be overridden by including the following lines in Build.properties:

<PropertyGroup>
  <WixFolderIncludeTempDir>C:\Temp</WixFolderIncludeTempDir>
  <WixFolderIncludeKeepTempFiles>true</WixFolderIncludeKeepTempFiles>
</PropertyGroup>

Possible problems

Of course, life is not that easy... so you might encounter problems when using this tool as well. One is that it breaks MsBuild’s up-to-date detection, so the project is recompiled even if nothing has changed. This could probably be solved by declaring appropriate outputs on the target, but it is not easy, and usually I want to be sure that the installer package is fully recompiled anyway.

The other – probably more painful – problem is that you cannot include additional files from WIX in a subfolder of an included directory. We hit this when we wanted to create a shortcut to the solution files of the installed samples. The problem is that since the IDs that the Sdc Fragment task generates are GUIDs, you have no way of knowing what a subfolder’s ID is.

I have extended WixFolderInclude.targets to support generating deterministic names for selected folders. The folders can be selected with the “DeterministicFolders” metadata tag of the WixFolderInclude item. The value should be a semicolon-separated list of folder paths relative to RootPath. Note that as these are folders, you cannot really use MsBuild’s wildcard support; you have to type the folder names manually. Let’s suppose that we have a Documentation folder inside the ConsoleApplication1 sample, which we want to be able to extend from WIX later. We define it as follows:

<ItemGroup>
  <ConsoleApplication1Files Include="C:\temp\ConsoleApplication1\**" />
  <WixFolderInclude Include="ConsoleApplication1Folder">
    <SourceFiles>@(ConsoleApplication1Files)</SourceFiles>
    <RootPath>C:\temp\ConsoleApplication1</RootPath>
    <ParentDirectoryRef>Dir_Samples</ParentDirectoryRef>
    <DeterministicFolders>Documentation</DeterministicFolders>
  </WixFolderInclude>
</ItemGroup>

As a result, the ID for the Documentation’s <Directory> element will be: Dir_ConsoleApplication1Folder_Documentation, so we can extend it from our Product.wxs:

<DirectoryRef Id="Dir_ConsoleApplication1Folder_Documentation">
  <Component Id="C_AdditionalFile" Guid="5D8142C1-...">
    <File Name="AdditionalFile.txt" Source="C:\Temp\AdditionalFile.txt" />
  </Component>
</DirectoryRef>
 

Attachment

In the attached ZIP file, you will find the WixFolderInclude.targets file, and also the sample that I have used here to demonstrate the features (without the silly ConsoleApplication1 and WebApplication1 folders). Feel free to use them!


Monday, November 9, 2009

Under IIS6, ASP.NET uses the maxRequestLength config setting in the system.web section to limit the maximum file upload size, with a default of 4 MB. IIS7 adds the maxAllowedContentLength config setting in the system.webServer section, with a default of about 28.6 MB. This is something to watch out for when migrating your web application from IIS6 to IIS7. Read on for more information on how I found out about this new config setting for IIS7….

I have migrated several sites from IIS6 to IIS7 without many problems. One gotcha that did catch me is the new IIS7 config section (system.webServer) and its settings for specifying the maximum size of files uploaded to the website. After migrating a particular web application from IIS6 to IIS7, everything appeared fine until a few customers began complaining about issues when uploading files to the website… in particular, these were large files of around 50MB.

In IIS 6.0, ASP.NET has a config setting (attribute) called maxRequestLength, located on the httpRuntime element under system.web, that specifies the maximum allowed request length (in other words, the maximum uploaded file size). The default is 4096, which is the number of kilobytes allowed… so a 4MB file upload is the default limit under IIS 6.0.

A 4MB limit is pretty small these days, so it is quite common to need to override the default with a larger value. For the web application I was migrating to IIS7, we had increased the maximum file size to 200MB (and told our customers 200MB was the max upload, too). This is what the httpRuntime section was set to:

<system.web>
    <httpRuntime maxRequestLength="204800" executionTimeout="7200"/>

So we migrated the web application to IIS7, tested some large file uploads (we tested with 20MB files… note this for later) and everything seemed great. After rolling the website out to our customers, a couple of weeks after release we got complaints from customers who could not upload files. Their files were about 50MB in size.

At first this was puzzling because we clearly had the config setting in place that indicated 200MB was the new limit (or so we thought) AND files larger than 4MB were allowed (we had tested 20MB files).  But we could easily reproduce the customer issue with their 50MB files failing.  So what was going on?

Eventually we tracked it down to IIS7 and the new config section called system.webServer. We knew that httpHandlers in IIS7 were now specified in the system.webServer/handlers section, but what we did not know (until our customers ran into it) was that the maximum request size for IIS7 is also governed by a setting in a new location. In IIS7 you specify the maximum size of a file to be uploaded to the website by using the maxAllowedContentLength setting (system.webServer/security/requestFiltering/requestLimits >> maxAllowedContentLength). Note that the ASP.NET maxRequestLength setting still applies as well, so both limits must be large enough to accept your uploads.

<system.webServer>
  <security>
    <requestFiltering>
      <requestLimits maxAllowedContentLength="209715200" />

Now why didn’t our tests with 20MB files show this?  Because in IIS7 the default value for the maxAllowedContentLength setting is 30000000, and that is in bytes: 30000000 bytes = 28.6 MB.  So the IIS7 default had increased to about 28.6 MB, and we did not notice the new limit because we were testing with only 20 MB files (while assuming the default was still 4MB).  In the end we got this resolved fairly quickly, and it exposed a gap in our testing: we really should have been testing with a 200MB file… the limit we tell our customers.
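The unit mismatch is easy to trip over: maxRequestLength is measured in kilobytes, while maxAllowedContentLength is measured in bytes. A quick sanity check of the arithmetic (a Python sketch, just to verify the numbers used above):

```python
# maxRequestLength (system.web) is measured in kilobytes,
# while maxAllowedContentLength (system.webServer) is measured in bytes.

# The IIS7 default for maxAllowedContentLength:
iis7_default_bytes = 30000000
print(iis7_default_bytes / 1024 / 1024)  # about 28.6 MB, which is why 20 MB test files passed

# The values needed for a 200 MB limit:
target_mb = 200
max_request_length = target_mb * 1024               # kilobytes, for system.web
max_allowed_content_length = target_mb * 1024 ** 2  # bytes, for system.webServer
print(max_request_length, max_allowed_content_length)
```

Running this prints 204800 and 209715200, the two values used in the config snippets above.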

 

The ASP.NET Membership system has a static method built in for this.  You can use the GeneratePassword static method of the Membership class to create a new password:

String strongPassword = System.Web.Security.Membership.GeneratePassword(8, 1);

From the MSDN documentation, the two parameters are:

  • length – Int32
    • The number of characters in the generated password. The length must be between 1 and 128 characters.
  • numberOfNonAlphanumericCharacters – Int32
    • The minimum number of punctuation characters in the generated password.

Also from the documentation: the generated password will contain alphanumeric characters and the following punctuation marks: !@#$%^&*()_-+=[{]};:<>|./?.

Also not included in the documentation is that the returned password will never be a “dangerous string”; in other words, it won’t look like a block of script.

Membership.GeneratePassword checks each newly generated password string using an internal method called CrossSiteScriptingValidation.IsDangerousString() and will not return a password that fails this test.  It simply loops, generating a new candidate, until the result is not considered a dangerous string.
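The regenerate-until-safe pattern is easy to picture. Here is a purely illustrative Python sketch of it; this is not the framework's actual code, and the "dangerous string" test below is a crude stand-in for the internal IsDangerousString check:

```python
import random
import string

# Punctuation set documented for Membership.GeneratePassword.
PUNCTUATION = "!@#$%^&*()_-+=[{]};:<>|./?"

def generate_password(length, number_of_non_alphanumeric):
    """Loop-until-safe generation: build a candidate password,
    then reject and retry if it could be mistaken for markup."""
    assert 1 <= length <= 128
    assert 0 <= number_of_non_alphanumeric <= length
    alphanumeric = string.ascii_letters + string.digits
    while True:
        chars = [random.choice(PUNCTUATION)
                 for _ in range(number_of_non_alphanumeric)]
        chars += [random.choice(alphanumeric)
                  for _ in range(length - number_of_non_alphanumeric)]
        random.shuffle(chars)
        candidate = "".join(chars)
        # Crude stand-in for CrossSiteScriptingValidation.IsDangerousString:
        # reject anything where "<" is immediately followed by a letter,
        # since that looks like the start of an HTML tag.
        if any(a == "<" and b.isalpha()
               for a, b in zip(candidate, candidate[1:])):
            continue  # could look like script; generate a new one
        return candidate

print(generate_password(8, 1))
```

The real method performs a more thorough check, but the structure is the same: generate, validate, and loop until the candidate passes.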

 

Monday, October 26, 2009

5 of the most common mistakes developers make when they manipulate the web.config file.

1. Custom Errors Disabled

When you disable custom errors as shown below, ASP.NET provides a detailed error message to clients by default.

Vulnerable configuration:

<configuration>
  <system.web>
    <customErrors mode="Off">

Secure configuration:

<configuration>
  <system.web>
    <customErrors mode="RemoteOnly">

In itself, knowing the source of an error may not seem like a risk to application security, but consider this: the more information a hacker can gather about a Web site, the more likely it is that he will be able to successfully attack it. An error message can be a gold mine of information to an attacker. A default ASP.NET error message lists the specific versions of ASP.NET and the .NET framework which are being used by the Web server, as well as the type of exception that was thrown. Just knowing which Web-based applications are used (in this case ASP.NET) compromises application security by telling the attacker that the server is running a relatively recent version of Microsoft Windows and that Microsoft Internet Information Server (IIS) 6.0 or later is being used as the Web server. The type of exception thrown may also help the attacker to profile Web-based applications; for example, if a "SqlException" is thrown, then the attacker knows that the application is using some version of Microsoft SQL Server.

You can build up application security to prevent such information leakage by modifying the mode attribute of the <customErrors> element to "On" or "RemoteOnly." This setting instructs Web-based applications to display a nondescript, generic error message when an unhandled exception is generated. Another way to circumvent this application security issue is to redirect the user to a new page when errors occur by setting the "defaultRedirect" attribute of the <customErrors> element. This approach can provide even better application security because the default generic error page still gives away too much information about the system (namely, that it's using a Web.config file, which reveals that the server is running ASP.NET).
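For example, the redirect approach described above is a one-attribute addition; the page name ~/GenericError.aspx here is just a placeholder for your own error page:

```xml
<configuration>
  <system.web>
    <customErrors mode="RemoteOnly" defaultRedirect="~/GenericError.aspx" />
  </system.web>
</configuration>
```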

2. Leaving Tracing Enabled in Web-Based Applications

The trace feature of ASP.NET is one of the most useful tools that you can use to ensure application security by debugging and profiling your Web-based applications. Unfortunately, it is also one of the most useful tools that a hacker can use to attack your Web-based applications if it is left enabled in a production environment.

Vulnerable configuration:

<configuration>
  <system.web>
    <trace enabled="true" localOnly="false">

Secure configuration:

<configuration>
  <system.web>
    <trace enabled="false" localOnly="true">

When the <trace> element is enabled for remote users of Web-based applications (localOnly="false"), any user can view an incredibly detailed list of recent requests to the application simply by browsing to the page "trace.axd." If a detailed exception message is like a gold mine to a hacker looking to circumvent application security, a trace log is like Fort Knox! A trace log presents a wealth of information: the .NET and ASP.NET versions that the server is running; a complete trace of all the page methods that the request caused, including their times of execution; the session state and application state keys; the request and response cookies; the complete set of request headers, form variables, and QueryString variables; and finally the complete set of server variables.

A hacker looking for a way around application security would obviously find the form variable histories useful because these might include email addresses that could be harvested and sold to spammers, IDs and passwords that could be used to impersonate the user, or credit card and bank account numbers. Even the most innocent-looking piece of data in the trace collection can be dangerous in the wrong hands. For example, the "APPL_PHYSICAL_PATH" server variable, which contains the physical path of Web-based applications on the server, could help an attacker perform directory traversal attacks against the system.

The best way to prevent a hacker from obtaining trace data from Web-based applications is to disable the trace viewer completely by setting the "enabled" attribute of the <trace> element to "false." If you have to have the trace viewer enabled, either to debug or to profile your application, then be sure to set the "localOnly" attribute of the <trace> element to "true." That allows users to access the trace viewer only from the Web server and disables viewing it from any remote machine, increasing your application security.

3. Debugging Enabled

Deploying Web-based applications in debug mode is a very common mistake. Virtually all Web-based applications require some debugging. Visual Studio 2005 will even automatically modify the Web.config file to allow debugging when you start to debug your application. And, since deploying ASP.NET applications is as simple as copying the files from the development folder into the deployment folder, it's easy to see how development configuration settings can accidentally make it into production, compromising application security.

Vulnerable configuration:

<configuration>
  <system.web>
    <compilation debug="true">

Secure configuration:

<configuration>
  <system.web>
    <compilation debug="false">

Like the first two application security vulnerabilities described in this list, leaving debugging enabled is dangerous because you are providing inside information to end users who shouldn't have access to it, and who may use it to attack your Web-based applications. For example, if you have enabled debugging and disabled custom errors in your application, then any error message displayed to an end user of your Web-based applications will include not only the server information, a detailed exception message, and a stack trace, but also the actual source code of the page where the error occurred.

Unfortunately, this configuration setting isn't the only way that source code might be displayed to the user. Here's a story that illustrates why developers shouldn't concentrate solely on one type of configuration setting to improve application security. In early versions of Microsoft's ASP.NET AJAX framework, some controls would return a stack trace with source code to the client browser whenever exceptions occurred. This behavior happened whenever debugging was enabled, regardless of the custom error setting in the configuration. So, even if you properly configured your Web-based applications to display non-descriptive messages when errors occurred, you could still have unexpectedly revealed your source code to your end users if you forgot to disable debugging.

To disable debugging, set the value of the "debug" attribute of the <compilation> element to "false." This is the default value of the setting, but as we will see in part two of this article, it's safer to explicitly set the desired value rather than relying on the defaults to protect application security.

4. Cookies Accessible through Client-Side Script

In Internet Explorer 6.0, Microsoft introduced a new cookie property called "HttpOnly". While you can set the property programmatically on a per-cookie basis, you also can set it globally in the site configuration.

Vulnerable configuration:

<configuration>
  <system.web>
    <httpCookies httpOnlyCookies="false">

Secure configuration:

<configuration>
  <system.web>
    <httpCookies httpOnlyCookies="true">

Any cookie marked with this property will be accessible only from server-side code, and not to any client-side scripting code like JavaScript or VBScript. This shielding of cookies from the client helps to protect Web-based applications from Cross-Site Scripting attacks. A hacker initiates a Cross-Site Scripting (also called CSS or XSS) attack by attempting to insert his own script code into the Web page to get around any application security in place. Any page that accepts input from a user and echoes that input back is potentially vulnerable. For example, a login page that prompts for a user name and password and then displays "Welcome back, <username>" on a successful login may be susceptible to an XSS attack.

Message boards, forums, and wikis are also often vulnerable to application security issues. In these sites, legitimate users post their thoughts or opinions, which are then visible to all other visitors to the site. But an attacker, rather than posting about the current topic, will instead post a message such as "<script>alert(document.cookie);</script>". The message board now includes the attacker's script code in its page code-and the browser then interprets and executes it for future site visitors. Usually attackers use such script code to try to obtain the user's authentication token (usually stored in a cookie), which they could then use to impersonate the user. When cookies are marked with the "HttpOnly" property, their values are hidden from the client, so this attack will fail.

As mentioned earlier, it is possible to enable "HttpOnly" programmatically on any individual cookie by setting the "HttpOnly" property of the "HttpCookie" object to "true." However, it is easier and more reliable to configure the application to automatically enable "HttpOnly" for all cookies. To do this, set the "httpOnlyCookies" attribute of the <httpCookies> element to "true."
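The per-cookie approach mentioned above looks like this; a minimal sketch inside an ASP.NET page or handler, where the cookie name and value are just examples:

```csharp
// Mark a cookie HttpOnly so client-side script (e.g. document.cookie)
// cannot read it; only server-side code sees its value.
HttpCookie authCookie = new HttpCookie("AuthToken", "some-opaque-value");
authCookie.HttpOnly = true;
Response.Cookies.Add(authCookie);
```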

5. Cookieless Session State Enabled

In the initial 1.0 release of ASP.NET, you had no choice about how to transmit the session token between requests when your Web application needed to maintain session state: it was always stored in a cookie. Unfortunately, this meant that users who would not accept cookies could not use your application. So, in ASP.NET 1.1, Microsoft added support for cookieless session tokens via use of the "cookieless" setting.

Vulnerable configuration:

<configuration>
  <system.web>
    <sessionState cookieless="UseUri">

Secure configuration:

<configuration>
  <system.web>
    <sessionState cookieless="UseCookies">

Web applications configured to use cookieless session state store the session token in the page URL rather than in a cookie. For example, the page URL might change from http://myserver/MyApplication/default.aspx to http://myserver/MyApplication/(123456789ABCDEFG)/default.aspx. In this case, "123456789ABCDEFG" represents the current user's session token. A different user browsing the site at the same time would receive a completely different session token, resulting in a different URL, such as http://myserver/MyApplication/(ZYXWVU987654321)/default.aspx.

While adding support for cookieless session state did improve the usability of ASP.NET Web applications for users who would not accept cookies, it also had the side effect of making those applications much more vulnerable to session hijacking attacks. Session hijacking is basically a form of identity theft wherein a hacker impersonates a legitimate user by stealing his session token. When the session token is transmitted in a cookie and the request is made on a secure channel (that is, over SSL), the token is secure. However, when the session token is included as part of the URL, it is much easier for a hacker to find and steal it. Using a network monitoring tool (also known as a "sniffer"), or simply by obtaining a recent request log, an attacker can recover a URL containing a stolen session token; hijacking the user's session then becomes a simple matter of browsing to that URL. The Web application has no way of knowing that this new request with session token "123456789ABCDEFG" is not coming from the original, legitimate user. It happily loads the corresponding session state and returns the response to the hacker, who has now effectively impersonated the user.

The most effective way to prevent these session hijacking attacks is to force your Web application to use cookies to store the session token. This is accomplished by setting the "cookieless" attribute of the <sessionState> element to "UseCookies" or "false." But what about the users who do not accept cookies? Do you have to choose between making your application available to all users and ensuring that it operates securely for all users? A compromise between the two is possible in ASP.NET 2.0. By setting the "cookieless" attribute to "AutoDetect", the application will store the session token in a cookie for users who accept them and in the URL for those who won't. This means that only the users with cookieless tokens remain vulnerable to session hijacking. That's often acceptable, given the alternative: users who deny cookies wouldn't be able to use the application at all. It is ironic that many users disable cookies out of privacy concerns, when doing so can actually make them more prone to attack.
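The AutoDetect compromise described above is a one-attribute change, in the same spot as the snippets above:

```xml
<configuration>
  <system.web>
    <sessionState cookieless="AutoDetect" />
  </system.web>
</configuration>
```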
