
Bob Zumbrunnen

03/10/05 1:13 AM

#51399 RE: NovoMira #51398

The database server was extremely busy for a while. Should be lots better now.

I finally figured out how I can eventually make search run much faster, not only here but also on SI, without segregating search tables by year.

There are a lot of words that're being indexed that aren't technically considered "noise" words (they're not in the file noise.eng), but that really should be.

I'd been looking all over the internet for some time for a way to peek under the hood of MSSearch, specifically to find out what words are indexed the most frequently, then take "noise" ones, add them to noise.eng, and rebuild the full-text catalogs.

A perfect example is "LOL". Nobody can conceivably want to find a post because it has the word "LOL" in it. Yet it's used so often that, even with search running a LOT faster now, a search on LOL will time out because so many messages contain it. Search for "ELN", and you get results in 2-3 seconds. Search for "LOL" and you get a timeout.

Anyway, I just figured out a way I can programmatically get a count of the number of times each word is indexed (well, more accurately, the number of messages that contain each word). Then I can scan through perhaps the top 1000 of them manually, identify "noise" words like "LOL", add them to noise.eng, and rebuild the catalogs. That'd probably give such an efficient system that even SI's 21MM+ messages can all be in one catalog rather than one for each year.
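
For anyone curious what that kind of count looks like, here's a rough sketch in Python. It's not how MSSearch exposes its index and not the method I'm actually using; it just counts, over raw message text, how many messages contain each word and lists the most common ones that aren't already noise words. The function names, the tokenizing regex, and the sample messages are all made up for illustration.

```python
import re
from collections import Counter

def document_frequencies(messages, noise_words=frozenset()):
    """Count how many messages contain each word (not total occurrences)."""
    counts = Counter()
    for text in messages:
        # A set means each word counts at most once per message.
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        # Skip words that are already treated as noise (assumed lowercase).
        counts.update(words - noise_words)
    return counts

def top_candidates(messages, noise_words=frozenset(), limit=1000):
    """Return the `limit` most widely used words for manual review."""
    return document_frequencies(messages, noise_words).most_common(limit)

if __name__ == "__main__":
    # Hypothetical sample data, not real posts.
    sample = ["LOL that was great", "lol, ELN is up", "LOL"]
    existing_noise = frozenset({"that", "was", "is"})
    for word, n_messages in top_candidates(sample, existing_noise, limit=5):
        print(word, n_messages)
```

The manual review step still matters: a word like "ELN" could rank high on a board that talks about it a lot, and it obviously shouldn't be treated as noise.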

That'll be a kind of "pet" project to do in my spare time long-term, though. The immediate project is to get all variations of Search working much better with fewer timeouts.