OCR
This section describes OCR (Optical Character Recognition) configuration in Sync-in to extract text from images contained in PDFs and integrate it into full-text search.
Configuration is done in environment.yaml; see the OCR section.
Prerequisites
- Content indexing must be enabled globally: applications.files.contentIndexing.enabled (true by default).
- Indexing must be enabled in the relevant space (enabled by default).
- In offline mode (offline: true), language files must be present locally and readable by the server.
Configuration
Minimal example (online mode, automatic download of languages if needed):
applications:
  files:
    contentIndexing:
      enabled: true
      ocr:
        enabled: true
        languages: [ eng, fra ]
        offline: false
In online mode, language files are stored in the server/applications/files/assets/ocr-languages directory. The path may
vary depending on whether the installation was done via NPM or Docker Compose.
Offline example with a dedicated local directory:
applications:
  files:
    contentIndexing:
      enabled: true
      ocr:
        enabled: true
        languages: [ eng, fra ]
        offline: true
        languagesPath: /app/ocr-lang
Offline mode
When offline: true, language files must be downloaded and provided locally.
Sync-in relies on Tesseract.js language data.
URL format:
https://cdn.jsdelivr.net/npm/@tesseract.js-data/<lang>@1.0.0/4.0.0_best_int/<lang>.traineddata.gz
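The URL format above can be wrapped in a small shell helper when several languages are needed. This is a minimal sketch, not part of Sync-in: the function names ocr_lang_url and ocr_lang_fetch and the OCR_LANG_* variables are hypothetical, and the version segments are simply copied from the URL format shown above.

```shell
#!/bin/sh
# Values copied from the documented URL format (assumed stable for this sketch).
OCR_LANG_BASE_URL="https://cdn.jsdelivr.net/npm/@tesseract.js-data"
OCR_LANG_VERSION="1.0.0"
OCR_LANG_VARIANT="4.0.0_best_int"

# ocr_lang_url LANG (hypothetical helper)
# Prints the download URL of the language data file for one language code.
ocr_lang_url() {
  url_code="$1"
  printf '%s/%s@%s/%s/%s.traineddata.gz\n' \
    "$OCR_LANG_BASE_URL" "$url_code" "$OCR_LANG_VERSION" \
    "$OCR_LANG_VARIANT" "$url_code"
}

# ocr_lang_fetch DIR LANG... (hypothetical helper)
# Downloads each language file into DIR, named <lang>.traineddata.gz.
# Stops at the first failed download and returns non-zero.
ocr_lang_fetch() {
  fetch_dest="$1"
  shift
  mkdir -p "$fetch_dest" || return 1
  for fetch_code in "$@"; do
    curl -fL -o "$fetch_dest/$fetch_code.traineddata.gz" \
      "$(ocr_lang_url "$fetch_code")" || return 1
  done
}
```

Once the functions are loaded in a shell, `ocr_lang_fetch /app/ocr-lang fra eng` would fetch both files into the directory used in this section. The `-f` option makes curl return an error on an HTTP failure instead of saving the error page as a language file.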
Example for French and English:
mkdir -p /app/ocr-lang
curl -L -o /app/ocr-lang/fra.traineddata.gz \
  https://cdn.jsdelivr.net/npm/@tesseract.js-data/fra@1.0.0/4.0.0_best_int/fra.traineddata.gz
curl -L -o /app/ocr-lang/eng.traineddata.gz \
  https://cdn.jsdelivr.net/npm/@tesseract.js-data/eng@1.0.0/4.0.0_best_int/eng.traineddata.gz
Corresponding configuration:
applications:
  files:
    contentIndexing:
      ocr:
        languages: [ fra, eng ]
        offline: true
        languagesPath: /app/ocr-lang
File names must match the codes defined in languages (example: fra.traineddata.gz for fra).
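Before starting the server in offline mode, it can help to confirm that every configured language code has a matching file. The check below is a minimal sketch under stated assumptions: ocr_lang_check is a hypothetical helper, not a Sync-in command, and it tests readability for the user running the script, which may differ from the user the server runs as.

```shell
#!/bin/sh
# ocr_lang_check DIR LANG... (hypothetical helper)
# For each language code, expects a readable regular file named
# DIR/<lang>.traineddata.gz. Reports every problem on stderr and
# returns 1 if at least one file is missing or unreadable, else 0.
ocr_lang_check() {
  check_dir="$1"
  shift
  check_failed=0
  for check_code in "$@"; do
    check_file="$check_dir/$check_code.traineddata.gz"
    if [ ! -f "$check_file" ] || [ ! -r "$check_file" ]; then
      echo "missing or unreadable: $check_file" >&2
      check_failed=1
    fi
  done
  return "$check_failed"
}
```

With the configuration above, `ocr_lang_check /app/ocr-lang fra eng` would return a non-zero status if either file is absent or misnamed.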
Best practices
- Limit languages to the languages that are actually needed to reduce indexing time.
- Enable offline: true in isolated environments or environments without Internet access.