Enrichment Configuration

Several file enrichment modules in Nemesis are configurable, and a few must be explicitly enabled because of their performance impact. This document details all current enrichment configuration options.

PII Detection

The PII file enrichment module uses Microsoft Presidio to detect PII across all scanned plaintext. Because Presidio relies on an English spaCy machine learning model, scanning carries a noticeable performance cost, so PII scanning is disabled by default.

To enable PII scanning, run the following, uncomment the same value in compose.yaml, or set the value in your .env file:

export ENABLE_PII_DETECTION=true

The model takes the context around a match into account and emits a confidence score from 0.0 to 1.0. The default threshold is 0.7. To change the threshold, run the following, uncomment the same value in compose.yaml, or set the value in your .env file:

export PII_DETECTION_THRESHOLD=0.5

A higher threshold returns fewer false positives at the risk of more false negatives.
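
To illustrate the trade-off, here is a minimal sketch of how a confidence threshold filters candidate matches. The `Finding` type, the `filter_findings` helper, and the sample scores are hypothetical and are not taken from the Nemesis or Presidio source; only the 0.7 default comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical PII match: an entity type plus a confidence score in [0.0, 1.0]."""
    entity_type: str
    score: float


def filter_findings(findings, threshold=0.7):
    """Keep only findings whose score meets or exceeds the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [f for f in findings if f.score >= threshold]


# Illustrative scores only; real scores depend on the model and the surrounding context.
sample = [
    Finding("CREDIT_CARD", 0.95),
    Finding("US_SSN", 0.60),
    Finding("UK_NINO", 0.50),
]

strict = filter_findings(sample)        # default threshold of 0.7
lenient = filter_findings(sample, 0.5)  # lower threshold keeps more matches
print(len(strict), len(lenient))
```

With these sample scores, lowering the threshold from 0.7 to 0.5 keeps the two lower-confidence matches that the default would drop, which is where additional false positives can enter.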

Currently the PII module detects the following entity types: CREDIT_CARD, US_SSN, UK_NINO. To add or remove PII entity types (defined at https://microsoft.github.io/presidio/supported_entities/), modify the PII_ENTITY_CONFIG at the top of the PII file enrichment module.

Document Conversion

ENV Variables

The Document Conversion service has several ENV variables that can be passed through from the environment launching Nemesis or modified in compose.yaml:

| ENV Variable | Default Value | Description |
|---|---|---|
| DOCUMENTCONVERSION_MAX_PARALLEL_WORKFLOWS | 5 | Maximum number of parallel conversion workflows allowed |
| MAX_WORKFLOW_EXECUTION_TIME | 300 | Maximum time (in seconds) workflows can run before being killed |
| TIKA_USE_OCR | false | Set to true to enable OCR support via Tesseract |
| TIKA_OCR_LANGUAGES | eng | Tika/Tesseract OCR languages supported |

If you want to have additional language packs supported (see https://github.com/tesseract-ocr/tessdata for a full list), run something like this before launching Nemesis or set the value in your .env file:

export TIKA_OCR_LANGUAGES="eng chi_sim chi_tra jpn rus deu spa"

NOTE: Due to Docker's ENV variable substitution, setting TIKA_USE_OCR=false is interpreted as true. To disable OCR (the default), either remove TIKA_USE_OCR from your .env file or set TIKA_USE_OCR="". Enabling OCR significantly increases CPU usage, because it runs OCR on standalone images as well as on all images embedded in documents.
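
The behavior in the note is what you would see if the flag were wired through a "set and non-empty" check, such as Compose's `${VAR:+replacement}` interpolation form. The sketch below models that form in Python. The exact expression used in compose.yaml is not shown in this document, so both helper functions and the wiring are assumptions for illustration only.

```python
def substitute_if_set(env, name, replacement):
    """Model Compose-style ${NAME:+replacement}.

    Returns the replacement if NAME is set and non-empty, otherwise an empty string.
    """
    return replacement if env.get(name) else ""


def ocr_enabled(env):
    # Hypothetical wiring: OCR is on whenever the substituted value is non-empty.
    return substitute_if_set(env, "TIKA_USE_OCR", "true") == "true"


print(ocr_enabled({"TIKA_USE_OCR": "true"}))   # True
print(ocr_enabled({"TIKA_USE_OCR": "false"}))  # True: the string "false" is non-empty
print(ocr_enabled({"TIKA_USE_OCR": ""}))       # False
print(ocr_enabled({}))                         # False
```

Under this assumption, only the presence of a non-empty value matters, which is why removing the variable or setting it to an empty string is the reliable way to turn OCR off.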

Nosey Parker

ENV Variables

The Nosey Parker scanner service has several ENV variables that can be passed through from the environment launching Nemesis or modified in compose.yaml:

| ENV Variable | Default Value | Description |
|---|---|---|
| SNIPPET_LENGTH | 512 | Bytes of context around Nosey Parker matches to pull in for findings |
| MAX_CONCURRENT_FILES | 2 | Maximum number of files to scan concurrently (raising this increases the resources needed) |
| MAX_FILE_SIZE_MB | 200 | Maximum file size to scan (in megabytes) |
| DECOMPRESS_ZIPS | true | Whether to decompress and scan ZIPs |
| MAX_EXTRACT_SIZE_MB | 1000 | Maximum number of megabytes to extract from ZIPs (if decompressing) |
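
The file size limit and the extraction limit are separate because a small archive can expand to far more data than its size on disk suggests. The following sketch shows the idea with Python's standard `zipfile` module. It is not the scanner's implementation, and how the scanner actually accounts for extracted bytes is not documented here; `total_uncompressed_bytes`, `within_extract_budget`, and `make_zip` are hypothetical helpers.

```python
import io
import zipfile

MB = 1024 * 1024


def total_uncompressed_bytes(zip_bytes):
    """Sum the declared uncompressed sizes of all members in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sum(info.file_size for info in archive.infolist())


def within_extract_budget(zip_bytes, max_extract_size_mb=1000):
    """Return True if the archive's declared contents fit within the limit."""
    return total_uncompressed_bytes(zip_bytes) <= max_extract_size_mb * MB


def make_zip(payload_size):
    """Build an in-memory ZIP holding one zero-filled file (demo data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("blob.bin", b"\0" * payload_size)
    return buffer.getvalue()


demo = make_zip(2 * MB)
print(len(demo) < 2 * MB)              # the archive is smaller than its contents
print(within_extract_budget(demo))     # fits a 1000 MB limit
print(within_extract_budget(demo, 1))  # exceeds a 1 MB limit
```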

Custom Rules

Nemesis uses Nosey Parker, wrapped in a customized Dapr pub/sub scanner implementation.

A number of custom rules are specified in projects/noseyparker_scanner/custom_rules/rules.yaml:

rules:
  - name: sha256crypt Hash
    id: custom.sha256crypt
    pattern: '(\$5\$(?:rounds=\d+\$)?[\./A-Za-z0-9]{1,16}\$(?:(?:[\./A-Za-z0-9]{43})))'
    references:
      - https://akkadia.org/drepper/SHA-crypt.txt
      - https://hashcat.net/wiki/doku.php?id=example_hashes
    examples:
      - '$5$rounds=5000$GX7BopJZJxPc/KEK$le16UF8I2Anb.rOrn22AUPWvzUETDGefUmAV8AZkGcD'
      - '$5$B7RCoZun804NXFH3$PltCS6kymC/bJTQ21oQOMCLlItYP9uXvEaCV89jl5iB'
      - '$5$JzPB.C/yL0uBMMIK$/2Jr.LeQUg0Sgbm8UhF01d1X643/YHdmRzwlVmt3ut3'
      - '$5$rounds=80000$wnsT7Yr92oJoP28r$cKhJImk5mfuSKV9b3mumNzlbstFUplKtQXXMo4G6Ep5'
      - '$5$rounds=12345$q3hvJE5mn5jKRsW.$BbbYTFiaImz9rTy03GGi.Jf9YY5bmxN0LU3p3uI1iUB'

  - name: sha512crypt Hash
    id: custom.sha512crypt
    pattern: '(\$6\$(?:rounds=\d+\$)?[\./A-Za-z0-9]{1,16}\$(?:(?:[\./A-Za-z0-9]{43})))'
    references:
      - https://akkadia.org/drepper/SHA-crypt.txt
      - https://hashcat.net/wiki/doku.php?id=example_hashes
    examples:
      - '$6$52450745$k5ka2p8bFuSmoVT1tzOyyuaREkkKBcCNqoDKzYiJL9RaE8yMnPgh2XzzF0NDrUhgrcLwg78xs1w5pJiypEdFX/'
      - '$6$Blzt0pLMHZqPNTwR$jR4F0zo6hXipl/0Xs8do1YWRpr47mGcH49l.NCsJ6hH0VQdORfUP1K1HYar1a5XgH1/JFyTGnyrTPmKJBIoLx.'

...

To add rules, modify rules.yaml with the new rule (or add a new rules.yaml) and restart the noseyparker-scanner container.
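
Before adding a rule, it can help to confirm that its pattern matches each of its examples in full. The sketch below does this with Python's `re` module, using the sha256crypt rule shown above. Python's regex engine is not the one Nosey Parker uses, so treat a pass as a sanity check rather than a guarantee; `check_rule` is a hypothetical helper and not part of Nemesis or Nosey Parker.

```python
import re


def check_rule(pattern, examples):
    """Return the examples that the pattern does not match in full."""
    compiled = re.compile(pattern)
    failures = []
    for example in examples:
        match = compiled.search(example)
        # A partial match usually means a length or character class is wrong.
        if match is None or match.group(0) != example:
            failures.append(example)
    return failures


# Pattern and examples copied from the sha256crypt rule above.
SHA256CRYPT = r'(\$5\$(?:rounds=\d+\$)?[\./A-Za-z0-9]{1,16}\$(?:(?:[\./A-Za-z0-9]{43})))'
SHA256CRYPT_EXAMPLES = [
    '$5$rounds=5000$GX7BopJZJxPc/KEK$le16UF8I2Anb.rOrn22AUPWvzUETDGefUmAV8AZkGcD',
    '$5$B7RCoZun804NXFH3$PltCS6kymC/bJTQ21oQOMCLlItYP9uXvEaCV89jl5iB',
]

failures = check_rule(SHA256CRYPT, SHA256CRYPT_EXAMPLES)
print(failures)  # an empty list means every example matched in full
```

Checking for a full match rather than any match catches patterns that only cover a prefix of the secret, which would otherwise produce truncated findings.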

.NET Service

The .NET scanning service has a single ENV variable to configure.

ENV Variables

| ENV Variable | Default Value | Description |
|---|---|---|
| MAX_CONCURRENT_PROCESSING | 5 | Maximum number of concurrent files to process |