Enrichment Configuration
Several file enrichment modules in Nemesis are configurable, and a few must be explicitly enabled because of their performance impact. This document details all current enrichment configuration options.
PII Detection
The PII file enrichment module uses Microsoft Presidio for PII detection across all scanned plaintext. Because this relies on an English spaCy machine learning model, it carries a noticeable performance cost, so PII scanning is disabled by default.
To enable PII scanning, run the following, uncomment the same value in compose.yaml, or set the value in your .env:
export ENABLE_PII_DETECTION=true
The model takes into account the context around a match and emits a confidence score between 0.0 and 1.0. The default threshold is 0.7. To change this threshold, run the following, uncomment the same value in compose.yaml, or set the value in your .env:
export PII_DETECTION_THRESHOLD=0.5
A higher threshold returns fewer false positives at the risk of more false negatives.
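The effect of the threshold can be illustrated with a small sketch. The `pii_filter` helper below is hypothetical and not part of Nemesis; it also assumes a finding is kept when its score is greater than or equal to the threshold, which may not match the module's exact comparison.

```shell
# Hypothetical helper, not part of Nemesis: prints "keep" when a finding's
# confidence score meets the threshold and "drop" otherwise.
# Assumption: the comparison is inclusive (score >= threshold).
pii_filter() {
    score="$1"
    threshold="${2:-0.7}"   # 0.7 mirrors the documented default
    # awk performs the floating-point comparison that plain sh cannot
    awk -v s="$score" -v t="$threshold" \
        'BEGIN { if (s + 0 >= t + 0) { print "keep" } else { print "drop" } }'
}

pii_filter 0.6 0.7   # prints "drop"
pii_filter 0.6 0.5   # prints "keep"
```

A match scored at 0.6 is discarded at the default threshold but reported once the threshold is lowered to 0.5, which is where the extra false positives come from.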
Currently the PII module detects the following entity types: CREDIT_CARD, US_SSN, UK_NINO. To add or remove PII entity types (defined at https://microsoft.github.io/presidio/supported_entities/), modify the PII_ENTITY_CONFIG at the top of the PII file enrichment module.
Document Conversion
ENV Variables
The Document Conversion service has several ENV variables that can be passed through from the environment launching Nemesis, or modified in compose.yaml:
| ENV Variable | Default Value | Description |
|---|---|---|
| DOCUMENTCONVERSION_MAX_PARALLEL_WORKFLOWS | 5 | Maximum number of parallel conversion workflows allowed |
| MAX_WORKFLOW_EXECUTION_TIME | 300 | Maximum time (in seconds) workflows can run before being killed |
| TIKA_USE_OCR | false | Set to true to enable OCR support via Tesseract |
| TIKA_OCR_LANGUAGES | eng | Tika/Tesseract OCR languages supported |
If you want to have additional language packs supported (see https://github.com/tesseract-ocr/tessdata for a full list), run something like this before launching Nemesis or set the value in your .env file:
export TIKA_OCR_LANGUAGES="eng chi_sim chi_tra jpn rus deu spa"
NOTE: due to Docker's ENV variable substitution, setting TIKA_USE_OCR=false will be interpreted as true. To disable OCR (the default), either remove TIKA_USE_OCR from your .env file or set TIKA_USE_OCR="". Enabling OCR significantly increases CPU usage, because it will OCR standalone images as well as all images embedded in documents.
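The behavior in the note is consistent with a substitution that only checks whether the variable is non-empty. As a sketch, assuming (this document does not confirm it) that the value passes through something like the shell-style `${VAR:+true}` expansion, which Docker Compose interpolation also accepts, the hypothetical `ocr_flag` helper below reproduces the effect:

```shell
# Hypothetical stand-in for the substitution step; the actual compose.yaml
# entry may differ. ${1:+true} yields "true" for ANY non-empty value.
ocr_flag() {
    printf '%s\n' "${1:+true}"
}

ocr_flag "false"   # prints "true": the string "false" is non-empty
ocr_flag ""        # prints an empty line: OCR stays disabled
```

Under that assumption the content of the variable never matters, only whether it is empty, which is why clearing or removing it is the way to turn OCR off.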
Nosey Parker
ENV Variables
The Nosey Parker scanner service has several ENV variables that can be passed through from the environment launching Nemesis, or modified in compose.yaml:
| ENV Variable | Default Value | Description |
|---|---|---|
| SNIPPET_LENGTH | 512 | Bytes of context length around Nosey Parker matches to pull in for findings |
| MAX_CONCURRENT_FILES | 2 | Maximum number of concurrent files to scan (raising increases resources needed) |
| MAX_FILE_SIZE_MB | 200 | Maximum file size to scan (in megabytes) |
| DECOMPRESS_ZIPS | true | Whether to decompress+scan zips |
| MAX_EXTRACT_SIZE_MB | 1000 | Maximum number of megabytes to extract from ZIPs (if decompressing) |
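As with the other services, these can be exported before launching Nemesis. The values below are illustrative only, not recommendations:

```shell
# Example overrides; choose limits that suit the host running Nemesis.
export SNIPPET_LENGTH=1024       # pull in more context around each match
export MAX_CONCURRENT_FILES=4    # scan more files at once (needs more resources)
export MAX_FILE_SIZE_MB=500      # scan files up to 500 MB
```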
Custom Rules
Nemesis uses Nosey Parker wrapped through a customized Dapr pub/sub scanner implementation.
There are a number of custom rules that are specified at projects/noseyparker_scanner/custom_rules/rules.yaml.
rules:
- name: sha256crypt Hash
  id: custom.sha256crypt
  pattern: '(\$5\$(?:rounds=\d+\$)?[\./A-Za-z0-9]{1,16}\$(?:(?:[\./A-Za-z0-9]{43})))'
  references:
  - https://akkadia.org/drepper/SHA-crypt.txt
  - https://hashcat.net/wiki/doku.php?id=example_hashes
  examples:
  - '$5$rounds=5000$GX7BopJZJxPc/KEK$le16UF8I2Anb.rOrn22AUPWvzUETDGefUmAV8AZkGcD'
  - '$5$B7RCoZun804NXFH3$PltCS6kymC/bJTQ21oQOMCLlItYP9uXvEaCV89jl5iB'
  - '$5$JzPB.C/yL0uBMMIK$/2Jr.LeQUg0Sgbm8UhF01d1X643/YHdmRzwlVmt3ut3'
  - '$5$rounds=80000$wnsT7Yr92oJoP28r$cKhJImk5mfuSKV9b3mumNzlbstFUplKtQXXMo4G6Ep5'
  - '$5$rounds=12345$q3hvJE5mn5jKRsW.$BbbYTFiaImz9rTy03GGi.Jf9YY5bmxN0LU3p3uI1iUB'
- name: sha512crypt Hash
  id: custom.sha512crypt
  pattern: '(\$6\$(?:rounds=\d+\$)?[\./A-Za-z0-9]{1,16}\$(?:(?:[\./A-Za-z0-9]{43})))'
  references:
  - https://akkadia.org/drepper/SHA-crypt.txt
  - https://hashcat.net/wiki/doku.php?id=example_hashes
  examples:
  - '$6$52450745$k5ka2p8bFuSmoVT1tzOyyuaREkkKBcCNqoDKzYiJL9RaE8yMnPgh2XzzF0NDrUhgrcLwg78xs1w5pJiypEdFX/'
  - '$6$Blzt0pLMHZqPNTwR$jR4F0zo6hXipl/0Xs8do1YWRpr47mGcH49l.NCsJ6hH0VQdORfUP1K1HYar1a5XgH1/JFyTGnyrTPmKJBIoLx.'
...
If you want to add additional rules, just modify rules.yaml with the new rule (or add a new rules.yaml) and restart the noseyparker-scanner container.
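Before restarting the container, it can help to confirm that a pattern matches its own examples. The sketch below does this for the sha256crypt rule with `grep -E`. The `check_sha256crypt` helper is hypothetical, and the pattern is hand-translated to POSIX extended syntax (`\d` becomes `[0-9]`, non-capturing groups become plain groups), so it approximates rather than reproduces how Nosey Parker evaluates the rule.

```shell
# Hypothetical helper: checks a candidate string against an ERE translation
# of the custom.sha256crypt pattern and prints "match" or "no match".
check_sha256crypt() {
    pattern='\$5\$(rounds=[0-9]+\$)?[./A-Za-z0-9]{1,16}\$[./A-Za-z0-9]{43}'
    if printf '%s\n' "$1" | grep -Eq -- "$pattern"; then
        echo "match"
    else
        echo "no match"
    fi
}

# First example taken from the rule's own examples list
check_sha256crypt '$5$rounds=5000$GX7BopJZJxPc/KEK$le16UF8I2Anb.rOrn22AUPWvzUETDGefUmAV8AZkGcD'   # prints "match"
check_sha256crypt 'password123'   # prints "no match"
```

Single quotes around the test strings matter here, since the shell would otherwise try to expand the `$5`, `$rounds` and similar fragments.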
.NET Service
The .NET scanning service has a single ENV variable to configure.
ENV Variables
| ENV Variable | Default Value | Description |
|---|---|---|
| MAX_CONCURRENT_PROCESSING | 5 | Maximum number of concurrent files to process |