OCR

This section describes how to configure OCR (Optical Character Recognition) in Sync-in, which extracts text from images contained in PDFs and makes it available to full-text search.

Configuration is done in environment.yaml; see the OCR section.


Prerequisites

  • Content indexing must be enabled globally: applications.files.contentIndexing.enabled (true by default).
  • Indexing must be enabled in the relevant space (enabled by default).
  • In offline mode (offline: true), language files must be present locally and readable by the server.

Configuration

Minimal example (online mode, automatic download of languages if needed):

applications:
  files:
    contentIndexing:
      enabled: true
      ocr:
        enabled: true
        languages: [ eng, fra ]
        offline: false

In online mode, language files are stored in the server/applications/files/assets/ocr-languages directory. The path may vary depending on whether the installation was done via NPM or Docker Compose.

Offline example with a dedicated local directory:

applications:
  files:
    contentIndexing:
      enabled: true
      ocr:
        enabled: true
        languages: [ eng, fra ]
        offline: true
        languagesPath: /app/ocr-lang

Offline mode

When offline: true, language files must be downloaded and provided locally.

Sync-in relies on Tesseract.js language data.

URL format:

https://cdn.jsdelivr.net/npm/@tesseract.js-data/<lang>@1.0.0/4.0.0_best_int/<lang>.traineddata.gz

Example for French and English:

mkdir -p /app/ocr-lang
curl -L -o /app/ocr-lang/fra.traineddata.gz \
https://cdn.jsdelivr.net/npm/@tesseract.js-data/fra@1.0.0/4.0.0_best_int/fra.traineddata.gz
curl -L -o /app/ocr-lang/eng.traineddata.gz \
https://cdn.jsdelivr.net/npm/@tesseract.js-data/eng@1.0.0/4.0.0_best_int/eng.traineddata.gz
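
When more than a couple of languages are needed, the same downloads can be scripted. The following is a minimal sketch, not part of Sync-in: the function names ocr_lang_url and download_ocr_langs are hypothetical, and it assumes that every language code follows the URL format shown above and that curl is available.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; these names are not provided by Sync-in.

# Build the download URL for one language code, following the URL format above.
ocr_lang_url() {
  printf 'https://cdn.jsdelivr.net/npm/@tesseract.js-data/%s@1.0.0/4.0.0_best_int/%s.traineddata.gz\n' "$1" "$1"
}

# Download every requested language into the target directory.
# Usage: download_ocr_langs <directory> <lang> [<lang> ...]
download_ocr_langs() {
  dir=$1
  shift
  mkdir -p "$dir" || return 1
  for lang in "$@"; do
    # -f makes curl fail on HTTP errors instead of saving an error page,
    # -L follows redirects, -o names the output file after the language code.
    curl -fL -o "$dir/$lang.traineddata.gz" "$(ocr_lang_url "$lang")" || return 1
  done
}
```

With these definitions loaded, download_ocr_langs /app/ocr-lang fra eng would fetch the same two files as the commands above and stop at the first failed download.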

Corresponding configuration:

applications:
  files:
    contentIndexing:
      ocr:
        languages: [ fra, eng ]
        offline: true
        languagesPath: /app/ocr-lang

File names must match the codes defined in languages (example: fra.traineddata.gz for fra).
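
Before starting the server in offline mode, it can help to confirm that each configured language has a matching file. The sketch below is an illustrative helper, not a Sync-in command: check_ocr_langs is a hypothetical name. Readability is tested for the user running the script, which may differ from the account the server runs under.

```shell
#!/bin/sh
# Hypothetical helper: report languages whose data file is missing or unreadable.
# Usage: check_ocr_langs <directory> <lang> [<lang> ...]
check_ocr_langs() {
  dir=$1
  shift
  missing=0
  for lang in "$@"; do
    # The expected file name is derived from the language code, e.g. fra.traineddata.gz.
    file="$dir/$lang.traineddata.gz"
    if [ ! -r "$file" ]; then
      echo "missing or unreadable: $file" >&2
      missing=1
    fi
  done
  # Exit status 0 when every file was found, 1 otherwise.
  return "$missing"
}
```

For the configuration above, check_ocr_langs /app/ocr-lang fra eng returns a non-zero status and lists the affected files on standard error if either file is absent.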


Best practices

  • Limit languages to those actually needed, to reduce indexing time.
  • Set offline: true in isolated environments or environments without Internet access.