I need to extract the font of each word in a PDF. I've been trying to extract the content of the PDF and categorize it by the font used. Can someone please help me with this? Thanks in advance.

I tried using pdftools, but the pdf_fonts function only lists the fonts used in the document. I want each font mapped to the words that use it.

    name                  type         embedded file
    <chr>                 <chr>        <lgl>    <chr>                          
  1 ABCDEE+Cambria        truetype     TRUE     ""                             
  2 ABCDEE+Calibri        cid_truetype TRUE     ""                             
  3 ABCDEE+Calibri        truetype     TRUE     ""                             
  4 ABCDEE+Cambria        cid_truetype TRUE     ""                             
  5 SymbolMT              cid_truetype TRUE     ""                             
  6 ArialMT               truetype     FALSE    "C:\\WINDOWS\\Fonts\\arial.ttf"
  7 ABCDEE+CourierNewPSMT truetype     TRUE     ""                             
  8 ABCDEE+Calibri-Bold   cid_truetype TRUE     ""                             
  9 ABCDEE+Calibri-Bold   truetype     TRUE     ""                     

What I would like to see is:

   word           Font
   The            ABCDEE+Cambria
   ground         ABCDEE+Cambria
   is             ABCDEE+Cambria
   shaking        ABCDEE+Calibri-Bold

That's not possible in general: a single word in a PDF file can be set in more than one font. However, one approach would be to convert the PDF to an easier format such as HTML and then parse that, with some rule for handling font changes in the middle of a word.

I don't know of any easily available free utilities that can do the conversion. I believe the professional version of Adobe Acrobat can do it (but I don't have a copy). Online, the website https://www.zamzar.com/ can do conversions, and it successfully converted a tiny PDF example to HTML for me.
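As a rough illustration of the parsing step, here is a minimal Python sketch using only the standard library. It rests on assumptions that may not match what a given converter produces: that each text run is wrapped in a `<span>` with an inline `font-family` style, and that any other tag marks a word boundary. The class name `FontWordParser`, the helper `words_with_fonts`, and the rule for mixed-font words (pick the font that covers the most characters) are all hypothetical choices for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: fonts appear as inline CSS, e.g. style="font-family: ABCDEE+Cambria"
FONT_RE = re.compile(r"font-family\s*:\s*([^;]+)", re.IGNORECASE)


class FontWordParser(HTMLParser):
    """Collects (word, font) pairs from HTML whose runs are styled <span>s."""

    def __init__(self):
        super().__init__()
        self.font_stack = []  # one entry per currently open <span>
        self.current = []     # (character, font) pairs of the word in progress
        self.words = []       # finished (word, font) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            self._flush()  # assumption: non-span tags separate words
            return
        style = dict(attrs).get("style") or ""
        match = FONT_RE.search(style)
        inherited = self.font_stack[-1] if self.font_stack else None
        font = match.group(1).strip().strip("'\"") if match else inherited
        self.font_stack.append(font)

    def handle_endtag(self, tag):
        if tag == "span":
            if self.font_stack:
                self.font_stack.pop()
        else:
            self._flush()

    def handle_data(self, data):
        font = self.font_stack[-1] if self.font_stack else None
        for ch in data:
            if ch.isspace():
                self._flush()  # whitespace ends the current word
            else:
                self.current.append((ch, font))

    def close(self):
        super().close()
        self._flush()  # emit a trailing word, if any

    def _flush(self):
        if not self.current:
            return
        word = "".join(ch for ch, _ in self.current)
        # Rule for mixed-font words: the font covering the most characters wins.
        font = Counter(f for _, f in self.current).most_common(1)[0][0]
        self.words.append((word, font))
        self.current = []


def words_with_fonts(html):
    parser = FontWordParser()
    parser.feed(html)
    parser.close()
    return parser.words


if __name__ == "__main__":
    sample = (
        '<p><span style="font-family: ABCDEE+Cambria">The ground is </span>'
        '<span style="font-family: ABCDEE+Calibri-Bold">shaking</span></p>'
    )
    for word, font in words_with_fonts(sample):
        print(word, font)
```

A word that is split across two spans with no whitespace between them is treated as one word, which is where the majority rule applies. If the converter expresses fonts through CSS classes rather than inline styles, the lookup in `handle_starttag` would need to be adapted.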

  • Thanks for the tip. What if a word has only one font? Do you think it is still impossible? – ap123 Jun 7 at 15:15
  • No, of course not. – user2554330 Jun 7 at 15:50
