March 30th, 2004

Google & PDFs, partly a note to myself

When Google converts PDFs to HTML, it doesn't look like it can cope with some ligatures. A ligature is a special character that represents a particular combination of two letters: '&' is the most common example (it's 'et', the Latin for 'and') and 'æ' would be another. Google is OK with these.

But some programs use the ability in some fonts to do this for combinations of letters that otherwise clash visually, like fi, fl, ft, tt etc. And Google isn't OK with these.

So "This will be the best chance in fifty years to change things for the better" can become, in Google's eyes, "This will be the best chance in fi y years to change things for the be er"!
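To make the effect concrete, here is a rough Python sketch that mimics it. It assumes (and this is only my guess at the mechanism) that each ligature glyph the converter can't map back to letters ends up as a single space. The function name and the list of affected pairs are made up for illustration; only 'ft' and 'tt' are included because those are the two that went missing in the sentence above.

```python
# Hypothetical simulation: not Google's actual conversion code.
# Assumption: every unrecognised ligature glyph comes out as one space.
DROPPED_LIGATURES = ("ft", "tt")  # illustrative list, taken from the example sentence


def simulate_dropped_ligatures(text, ligatures=DROPPED_LIGATURES):
    """Replace each listed letter pair with a single space."""
    for pair in ligatures:
        text = text.replace(pair, " ")
    return text


original = ("This will be the best chance in fifty years "
            "to change things for the better")
print(simulate_dropped_ligatures(original))
```

Running the converted text of your own PDF through a check like this in reverse (looking for odd gaps inside words) might be a quick way to spot whether a document is affected.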

It certainly can't cope with images in PDFs, which is odd. And text at an angle produces some interesting effects...

How often do people use Google's HTML view of a PDF? Is it worth setting up the PDF to be usable in this way, or do people look at / print PDFs directly?
