This page last changed on Jul 22, 2008 by smaddox.

Users sometimes experience problems when Confluence indexes large MS Excel or MS PowerPoint documents, and reindexing may log Unknown Ptg warning messages, which are harmless. There is already a request to suppress the warnings raised when unreadable documents are re-indexed using the POI library.

The error is usually not serious, but it can cause problems with large attachments. In that case, you may want to disable indexing for a particular type of document.

To do this, you can use one of the methods described below.

Method 1: Via the Administration Console

Go to Administration -> Configuration -> Plugins -> Attachment Extractors and disable the relevant modules of the Attachment Extractors plugin listed there:

  • PDF Content Extractor — For PDF attachments
  • MS Word Content Extractor — For DOC attachments
  • MS Excel Content Extractor — For XLS attachments
  • MS PowerPoint Content Extractor — For PPT attachments

    Searches will ignore all attachments of the type corresponding to a disabled module.

Method 2: Via editing the attachment-extractors.xml file

Edit confluence\WEB-INF\classes\plugins\attachment-extractors.xml and comment out the extractor for the relevant file type. From Confluence 2.6, attachment-extractors.xml is packaged inside confluence-2.6.0.jar; see the instructions for Editing files within .jar archives if you're unfamiliar with the process.

The example below shows the pdfContentExtractor commented out, so PDF attachments are not indexed.

Once the ContentExtractor for a file type is disabled, all files of that type become unsearchable.

<atlassian-plugin name='Attachment Extractors' key='confluence.extractors.attachments'>
    <plugin-info>
        <description>This library extracts searchable text from various attachment types.</description>
        <vendor name="Atlassian Software Systems" url="http://www.atlassian.com"/>
        <version>1.4</version>
    </plugin-info>

    <!--
    <extractor name="PDF Content Extractor" key="pdfContentExtractor" class="com.atlassian.bonnie.search.extractor.PdfContentExtractor" priority="1100">
        <description>Indexes contents of PDF files</description>
    </extractor>
    -->
    <extractor name="MS Word Content Extractor" key="msWordContentExtractor" class="com.atlassian.bonnie.search.extractor.MsWordContentExtractor" priority="1100">
        <description>Indexes contents of Microsoft Word files</description>
    </extractor>

    <extractor name="MS Excel Content Extractor" key="msExcelContentExtractor" class="com.atlassian.bonnie.search.extractor.MsExcelContentExtractor" priority="1100">
        <description>Indexes contents of Microsoft Excel files</description>
    </extractor>

    <extractor name="MS PowerPoint Content Extractor" key="msPowerpointContentExtractor" class="com.atlassian.bonnie.search.extractor.MsPowerpointContentExtractor" priority="1100">
        <description>Indexes contents of Microsoft PowerPoint files</description>
    </extractor>
</atlassian-plugin>
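
The same approach should apply to the Excel and PowerPoint problems described at the top of this page. The fragment below is a sketch only, assuming your attachment-extractors.xml has the same layout as the listing above; the plugin-info block and the PDF and MS Word extractors are omitted for brevity and would stay unchanged in the real file.

```xml
<atlassian-plugin name='Attachment Extractors' key='confluence.extractors.attachments'>
    <!-- Sketch only: plugin-info, PDF and MS Word extractors omitted, leave them as shipped -->

    <!-- Disabled: XLS attachments will no longer be indexed
    <extractor name="MS Excel Content Extractor" key="msExcelContentExtractor" class="com.atlassian.bonnie.search.extractor.MsExcelContentExtractor" priority="1100">
        <description>Indexes contents of Microsoft Excel files</description>
    </extractor>
    -->

    <!-- Disabled: PPT attachments will no longer be indexed
    <extractor name="MS PowerPoint Content Extractor" key="msPowerpointContentExtractor" class="com.atlassian.bonnie.search.extractor.MsPowerpointContentExtractor" priority="1100">
        <description>Indexes contents of Microsoft PowerPoint files</description>
    </extractor>
    -->
</atlassian-plugin>
```

Each extractor is wrapped in its own comment so that either one can be re-enabled independently. XML comments cannot be nested and must not contain a double hyphen, so keep any notes inside them free of that sequence.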
Document generated by Confluence on Aug 07, 2008 19:08