The TRACE Function
How to retrieve AI explainability via the Aiceberg dashboard.
Every logged input/output analysis can be traced, inspected, and explained.
1) Navigate to a monitoring page (Monitoring / Cannon / Playground / Bookmarks)
2) Click on a log entry to reveal the log detail view.
3) Click the Trace icon at the top of the detail view

This will reveal the log entry's Trace details:

Trace sections:
Prompt / Response: Switch Trace telemetry between input / output.
Signals: Displays the signals that "fired" across the input (or output). The trailing number indicates the chunk in which the issue was found - in this case Chunk #1.
Chunks: Aiceberg's models are optimized for inputs of up to 80 tokens. When an input is larger than that, we semantically chunk it into multiple parts, and all parts (chunks) are processed / analyzed in parallel. "Semantic" chunking means that an input will never be divided mid-sentence or mid-paragraph. Because of this, the actual chunk size will be between 65 and 80 tokens (depending on sentence and paragraph boundaries). In the example below, a large input was divided into 12 chunks, with chunk 11 containing adversarial language (Jailbreak and Instruction Override in this example) while Sentiment was derived from chunk 9.
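For intuition, the chunking behavior can be sketched roughly as follows. This is an illustrative sketch only: the whitespace tokenizer, the naive sentence splitter, and the `semantic_chunks` helper are assumptions, not Aiceberg's actual implementation.

```python
import re

# Illustrative sketch only: whitespace "tokens" and a naive sentence splitter
# stand in for Aiceberg's internal tokenizer and boundary detection.
MAX_TOKENS = 80  # per-chunk upper bound the models are optimized for

def semantic_chunks(text: str) -> list[str]:
    """Group whole sentences into chunks of at most ~80 tokens each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n_tokens = len(sentence.split())  # assumption: whitespace token count
        if current and current_len + n_tokens > MAX_TOKENS:
            # Close the chunk at a sentence boundary; never split mid-sentence.
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks only close at sentence boundaries, realized chunk sizes naturally land in the 65-80 token range described above, and each chunk can then be analyzed in parallel with signals reported per chunk index.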


Samples:
The sample section shows the actual samples across all models used for classification. Using the first prompt example, let's start with "Illegality - Drug related crimes".
We select "MTVS" as the model in the left panel and inspect the samples / neighbors used for classification.

Samples used to classify this input (chunk) are sorted by importance - the model weights each sample's importance based on its relevance relative to the input. The first sample actually contains parts of the input and therefore has a very small distance (0.41) and high relevance, making it the most important sample for classification. The label below the sample represents the sample's class attribution (note: a sample can have multiple class attributions - for example, a sample can be "racist" and "hate speech" at the same time).
All subsequent samples have a relatively short distance / high relevance, resulting in a 100% probability. Note: all models are threshold-optimized - MTVS in this example has a threshold of 60%, so any probability above 60% will trigger the signal.
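To make the mechanics concrete, the way neighbor distance, relevance weighting, and the per-model threshold combine can be sketched as below. The `Sample` class, the `classify` helper, the relevance formula, and the example neighbors are illustrative assumptions; only the ideas that closer samples carry more weight and that a probability above the model's threshold (60% for MTVS) fires the signal come from the description above.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    text: str
    labels: set[str]   # a sample can carry multiple class attributions
    distance: float    # smaller distance = closer to the input chunk

def classify(neighbors: list[Sample], label: str, threshold: float = 0.60) -> tuple[float, bool]:
    """Return (probability, fired) for one class from a set of retrieved samples."""
    # Assumption: relevance decays with distance, so closer samples dominate the vote.
    weights = [1.0 / (1.0 + s.distance) for s in neighbors]
    votes = [w for s, w in zip(neighbors, weights) if label in s.labels]
    probability = sum(votes) / sum(weights)
    return probability, probability > threshold  # the signal fires above the threshold

neighbors = [
    Sample("text that overlaps the input chunk", {"drug_related_crimes"}, 0.41),
    Sample("a close paraphrase of similar content", {"drug_related_crimes"}, 0.55),
    Sample("an unrelated, permitted sample", {"generic"}, 1.20),
]
prob, fired = classify(neighbors, "drug_related_crimes")
print(f"probability={prob:.2f}, signal fired: {fired}")  # probability=0.75, signal fired: True
```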

Trace output for model "IOR" - Instruction Override

Explainability for True Negatives
In the examples shown so far, explainability is provided as to why an input was classified as "Illegal" or "Adversarial". The opposite - explaining why an input is not toxic, not adversarial, etc. - is equally possible.
The example below shows:
a) "No sub-modules found": none of the samples in Toxicity, Illegality, or Jailbreak, for example, were anywhere close to being considered.
b) All neighbors in both prompt and response are of the "Generic" type - in other words, permitted.
PROMPT:

RESPONSE:

Roadmap:
Easy identification of which signal is provided by which model
Display of % values for weights versus only showing Relevance / Distance
Text and code formatting in the Prompt / Response windows (maximized) to make longer inputs, as well as mixed inputs (text and code), easier to read.