News Focus
Alias Born 05/24/2001

Re: NovoMira post# 51398

Thursday, 03/10/2005 1:13:16 AM

Post# of 222629
The database server was extremely busy for a while. Should be lots better now.

I finally figured out how to make search eventually run much faster, not only here but also on SI, without segregating search tables by year.

There are a lot of words being indexed that aren't technically considered "noise" words (they're not in the file noise.eng), but that really should be.

I'd been looking all over the internet for some time for a way to peek under the hood of MSSearch, specifically to find out what words are indexed the most frequently, then take "noise" ones, add them to noise.eng, and rebuild the full-text catalogs.

A perfect example is "LOL". Nobody can conceivably want to find a post because it has the word "LOL" in it. Yet it's used so often that, even with search running a LOT faster now, a search on LOL will time out because so many messages contain it. Search for "ELN", and you get results in 2-3 seconds. Search for "LOL" and you get a timeout.

Anyway, I just figured out a way to programmatically get a count of the number of times each word is indexed (well, more accurately, the number of messages that contain each word). Then I can scan through perhaps the top 1000 of them manually, identify "noise" words like "LOL", add them to noise.eng, and rebuild the catalogs. That would probably make the system efficient enough that even SI's 21MM+ messages can all be in one catalog rather than one for each year.
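
For anyone curious what that kind of count looks like, here is a rough sketch in Python. It is not the actual method used against MSSearch (the post doesn't describe that); it just illustrates the idea of counting how many messages contain each word and listing the most common ones that aren't already noise words. The function names, the tokenizing rule, and the sample messages are all made up for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, digits, or apostrophes.
# The real MSSearch word breaker will split text differently.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def document_frequency(messages):
    """Count how many messages contain each word (not total occurrences)."""
    counts = Counter()
    for text in messages:
        # set() means a word repeated within one message counts only once
        counts.update(set(TOKEN_RE.findall(text.lower())))
    return counts


def top_candidates(messages, existing_noise, limit=1000):
    """Return the most widely used words that are not already noise words."""
    counts = document_frequency(messages)
    candidates = [(word, n) for word, n in counts.most_common()
                  if word not in existing_noise]
    return candidates[:limit]


# Hypothetical sample data, standing in for the message table
messages = [
    "LOL that was great",
    "lol LOL lol",
    "ELN is up today",
    "The ELN news, LOL",
]
existing_noise = {"the", "is", "was", "that"}  # stand-in for noise.eng contents

print(top_candidates(messages, existing_noise, limit=2))
```

The list that comes out still needs a human pass: "LOL" and "ELN" can both rank high, but only one of them is something people actually search for.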

That'll be a kind of "pet" project to do in my spare time long-term, though. The immediate project is to get all variations of Search working much better with fewer timeouts.
