Monday, May 4, 2009

Converting pdf files to Word or text format

I frequently have to convert pdf files to editable text, and it is usually a hassle. Acrobat has a tool to select and copy text, which you can then paste into a Word document, but one of the resulting problems is that every line of text ends with a forced return, and paragraphs are separated by two forced returns. Here is a tip to fix this, using Word's Find and Replace function.
(1) Type Ctrl-H to bring up the Find and Replace dialogue box. In Find, type ^l^l. (^l finds forced, or soft, returns, so you are asking Word to find every place where there are two forced returns in a row.) In Replace, type a placeholder that does not appear anywhere else in the document, such as ##. Choose "Replace all".
(2) In Find, type ^l. In Replace, delete the ## and type a single space. Choose "Replace all"; this strips out all the remaining soft returns, replacing each with a space.
(3) In Find, type ##, and in Replace type ^p (the code for a paragraph return). Choose "Replace all".
(4) Finally, in Find, type two spaces, and in Replace, type a single space. Choose "Replace all" and double spaces will be converted to singles. If the text contains runs of three or more spaces, repeat this step until no double spaces remain.
Voila! It's clumsy, I know, but at least it works. The same technique works for a document that has a paragraph return at the end of every line: use ^p in place of ^l in steps (1) and (2).
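For anyone who would rather work on a plain-text file than in Word, the same four replacements can be sketched in Python. This is only an illustration under a few assumptions: the copied text uses "\n" as its line break, the placeholder "##" does not occur in the text, and the function name unwrap_lines is made up for this example.

```python
import re


def unwrap_lines(text: str, placeholder: str = "##") -> str:
    """Join hard-wrapped lines into paragraphs, one paragraph per line.

    Assumes a single "\n" ends each wrapped line, a double "\n\n" separates
    paragraphs, and `placeholder` does not appear in `text`.
    """
    # Step 1: protect paragraph breaks by swapping them for the placeholder.
    text = text.replace("\n\n", placeholder)
    # Step 2: every remaining line break is a wrap, so turn it into a space.
    text = text.replace("\n", " ")
    # Step 3: restore the paragraph breaks as single line breaks.
    text = text.replace(placeholder, "\n")
    # Step 4: collapse any run of two or more spaces into one space.
    # (Unlike a single "Replace all" pass, the regex handles long runs at once.)
    return re.sub(r" {2,}", " ", text)


if __name__ == "__main__":
    sample = "first line\nof paragraph one\n\nsecond paragraph\nwraps here"
    print(unwrap_lines(sample))
```

The placeholder plays the same role as the ## in the Word steps: it keeps the real paragraph breaks safe while the single line breaks are being stripped out.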
Knowing there must be a better way to convert pdf documents, I have searched high and low for a freeware program, completely without success. (Yes, there are plenty of commercial products, but I am dedicated to open-source where possible.) The best I have come up with - discovered today - is an internet service, KoolWIRE, which is rather nifty. You email the service a copy of the pdf file, and it quickly emails you back a Word conversion (mine came back within two minutes, which is remarkable). The result was not perfect, but it was the closest to the original document of anything I have tested.

3 comments:

ExpertPDF said...

ExpertPDF has a PDF To Text utility: http://www.html-to-pdf.net/pdf-to-text.aspx

Html To Pdf said...

pdf to text

Hi from John. said...

Yes, there are a number that simply extract the text. I was also looking for a tool that would keep the formatting, which most of the freeware versions don't do. KoolWire seems to come closest to that.