The method I was suggesting cannot miss text that is rendered as text rather than pre-converted into an image; in other words, it produces no false negatives for such text. The reason is that the very objective of your software, determining what text is being rendered, is accomplished before the text even hits the screen. If any program anywhere on the machine tries to display "MOSFET" using any draw-text primitive in the system, my method would catch it. So in fact, I would get a 100% hit rate on text that is rendered normally by the system.
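To make the in-line interception idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: `system_draw_text` stands in for whatever real draw-text primitive the GUI subsystem exposes (e.g. something like ExtTextOut on Windows), and `intercepted_draw_text` plays the role of the hook installed in front of it. The point is only that the hook sees the exact string the program asked to render, so matching is a plain substring test, with no fonts, colors, or OCR involved.

```python
# Hypothetical sketch of in-line text interception. All names here are
# invented for illustration; a real implementation would hook the actual
# GUI primitive (e.g. via API hooking on the platform in question).

WATCH_LIST = {"MOSFET"}   # the text the user declared should be sought
hits = []                 # interceptions recorded before anything is drawn

def system_draw_text(x, y, text):
    # Stand-in for the real draw-text primitive; in reality this would
    # rasterize the string to the screen.
    pass

def intercepted_draw_text(x, y, text):
    # The hook runs before the text hits the screen, so it works on the
    # original string, not on pixels.
    for term in WATCH_LIST:
        if term in text:
            hits.append((term, x, y))
    return system_draw_text(x, y, text)

# Any code path that renders through the primitive is caught:
intercepted_draw_text(10, 20, "The MOSFET datasheet says...")
```

The design point is that detection happens at the call site rather than on a snapshot, which is why the hit rate on normally rendered text is 100% regardless of font, pitch, or color.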
For text where the programmer first converted it to an image and told the GUI subsystem to render that image, my method would fail, leaving only OCR. But in that case the problem reverts to OCR for either approach anyway.
Now consider: we do not have an exhaustive list of the fonts in use, so your method would need one to approach a 90% hit rate without help from the user. Of course, if the user tells you what the font face is, and all of these other things, then yes, your software would approach 100%.
However, as mentioned, my gut feeling is that "in-line interception of text" is superior to "snapshot of graphics". One has to weigh the headache versus the percent effectiveness of each model.
Which would you rather have? A 100% hit rate on perhaps 95% of situations, achieved by simply declaring what text is to be sought, or a 98%+ hit rate on 98% of situations, with a painstaking determination of color, font face, pitch, and foreground/background color each time, not to mention the possibility of missing an "easy" true positive because you are taking snapshots?
-Le Chaud Lapin-