Stop Words

Imagine you have written search engine software. Now you want to speed up how fast your software searches the database. How do you do that? What are some trade-offs?

Let's say someone types this phrase in your search engine:

The rain falls mainly in the plain in Spain in the winter

Notice there are three instances of the word the in this phrase. What if you replaced the with an asterisk, like this:

* rain falls mainly in * plain in Spain in * winter

If you have 10 million records in your database used to provide search results, replacing the three characters of the word the with one character, an asterisk, could save you a lot of space in your database, plus make searches faster by having less data to parse.

Now notice this search phrase also uses the word in repeatedly. Let's replace that with an asterisk as well:

* rain falls mainly * * plain * Spain * * winter

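Clearly this chunk of data now has fewer words to parse, and the ones left are the distinctive ones: rain, falls, mainly, plain, Spain, winter. In theory, removing the and in will yield more accurate search results, because matches depend on these distinctive words rather than on words that appear in almost every record.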

To search your database, you might run several queries: one for the word rain, another for falls, another for mainly, another for plain, another for Spain, and another for winter. Each of these searches would be faster for not having to parse the words the and in.
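The steps so far can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the two-word stop list and the names STOP_WORDS, mask_stop_words, and search_terms are invented for this example.

```python
# Hypothetical stop list for this example only; real lists are much longer.
STOP_WORDS = {"the", "in"}


def mask_stop_words(phrase, placeholder="*"):
    """Replace each stop word with a placeholder, ignoring case."""
    return " ".join(
        placeholder if word.lower() in STOP_WORDS else word
        for word in phrase.split()
    )


def search_terms(phrase):
    """Return the words left over once stop words are dropped."""
    return [word for word in phrase.split() if word.lower() not in STOP_WORDS]


phrase = "The rain falls mainly in the plain in Spain in the winter"
masked = mask_stop_words(phrase)
terms = search_terms(phrase)

print(masked)  # * rain falls mainly * * plain * Spain * * winter
print(terms)   # ['rain', 'falls', 'mainly', 'plain', 'Spain', 'winter']
```

Each word in terms could then be sent to the database as its own query.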

Words like the, in, at, that, which, and on are called stop words. The term is credited to Hans Peter Luhn, an early pioneer of information retrieval techniques. Stop words are words so common they can be excluded from searches because they increase the work required by software to parse them while providing minimal benefit. People rarely search only for the word the, for example.

However, if you want to search for information about the band The Who, or for any phrase that includes a stop word, your search engine may or may not return accurate results. Removing stop words can accidentally prevent correct results. Removing the word which from your search database might not cause problems. Removing the word the probably will.

One clever solution might be to mark the occurrence and position of stop words while also removing them from a database. In our example above, you might replace the instances of the word the with the number 1 and instances of the word in with the number 2, like this:

1 rain falls mainly 2 1 plain 2 Spain 2 1 winter

This combines the benefit of keeping stop words with the speed gained from removing them from the database. In a later step in your search results processing, you could restore the words the and in by translating each 1 back to the and each 2 back to in. Instead of a dumb asterisk, you use a single character in a more subtle and meaningful way.
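Here is a minimal Python sketch of that numbering idea. The code table and the names STOP_CODES, CODE_WORDS, encode, and decode are assumptions made up for this example, and the approach is simplified on purpose.

```python
# Hypothetical code table from the example above: "the" -> 1, "in" -> 2.
STOP_CODES = {"the": "1", "in": "2"}
# Reverse table used to restore the original words later.
CODE_WORDS = {code: word for word, code in STOP_CODES.items()}


def encode(phrase):
    """Swap each stop word for its one-character code, ignoring case."""
    return " ".join(STOP_CODES.get(word.lower(), word) for word in phrase.split())


def decode(encoded):
    """Swap each code back to the stop word it stands for."""
    return " ".join(CODE_WORDS.get(token, token) for token in encoded.split())


phrase = "The rain falls mainly in the plain in Spain in the winter"
encoded = encode(phrase)
decoded = decode(encoded)

print(encoded)  # 1 rain falls mainly 2 1 plain 2 Spain 2 1 winter
print(decoded)  # the rain falls mainly in the plain in Spain in the winter
```

Two limits of this sketch are worth noticing. The restored phrase starts with a lowercase the, because the code 1 does not record capitalization. And a record that contains a real 1 or 2 would be decoded incorrectly, so a real system would need codes that cannot be confused with ordinary text.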

Another solution for handling stop words has to do with how search terms are entered. Using double quotes around a phrase tells the search engine to treat the phrase as a single block. Your search engine code could look for double quotes and treat the text between them as a single block. So this search phrase would return accurate results even though it uses a stop word:

"The Who" song lyrics

If you substituted instances of the word the in your database with the number 1, your search might look for "1 Who" as a single block, along with one search for song and another search for lyrics.
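A rough Python sketch of how a query with a quoted phrase might be split up is shown below. It assumes straight double quotes only, reuses the hypothetical 1 and 2 codes from earlier, and the names parse_query and encode are invented for illustration.

```python
import re

# Hypothetical code table: "the" -> 1, "in" -> 2.
STOP_CODES = {"the": "1", "in": "2"}


def encode(phrase):
    """Swap each stop word for its one-character code, ignoring case."""
    return " ".join(STOP_CODES.get(word.lower(), word) for word in phrase.split())


def parse_query(query):
    """Split a query into encoded quoted phrases and loose search terms."""
    # Text between straight double quotes is kept together as one block.
    phrases = re.findall(r'"([^"]*)"', query)
    # Everything outside the quotes is treated as separate words.
    remainder = re.sub(r'"[^"]*"', " ", query)
    terms = [word for word in remainder.split() if word.lower() not in STOP_CODES]
    return [encode(phrase) for phrase in phrases], terms


phrases, terms = parse_query('"The Who" song lyrics')

print(phrases)  # ['1 Who']
print(terms)    # ['song', 'lyrics']
```

The quoted phrase keeps its stop word, in coded form, while stop words outside the quotes are still dropped.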

As with all the examples and possibilities in this article, what actually is coded and how a search engine is designed and built is extremely complex and hard to predict. These details are generalizations to explain the concept of stop words and how they impact search engines.

What search engines leave in and out of their databases depends on the informed opinions and experience of the programmers who design and create the engine. As with many parts of computing, there is no 100% best way to solve the problem of providing accurate search results quickly. Removing stop words is simply one approach among many. Think about that the next time you type the or at into a search engine.

Learn More

Wikipedia: Stop Words

http://en.wikipedia.org/wiki/Stop_words