The Story of Videntifier Nexus, Content Identification for Law Enforcement
A recent Police Foundation report warned that the volume of online child sexual abuse offenses is now so great that it has “simply overwhelmed the ability of law enforcement agencies to respond”. Police investigations are notoriously challenging. When digital content is seized to check for Child Sexual Abuse Material (CSAM), the amount of data that needs to be reviewed is often staggering. It is not uncommon for investigators to review content on multiple devices, such as hard drives and USB sticks, where each device can contain thousands, or even millions, of files.
Although digital forensic tools used by investigators may promise to expedite the review process, the fact is that officers spend a great deal of time manually reviewing videos and images. This is not only time-consuming, but also potentially traumatic when dealing with sensitive or disturbing material. This can lead to stress and burnout among investigators, which further hampers the review process.
This case study describes how a European law enforcement agency is tackling this issue using Videntifier’s visual content identification technology, the Videntifier Nexus Platform for Law Enforcement (LE).
About Videntifier Nexus
The Videntifier Nexus Platform is a platform for identification of known illegal visual content using hash databases from various sources. The Nexus LE edition contains several additional components to assist law enforcement in identifying illegal visual content on air-gapped systems. It can dramatically speed up case processing by automatically detecting known CSAM content, both images and videos, as well as identifying duplicates.
The law enforcement agency's identification tools before Videntifier Nexus
The law enforcement agency’s original operating environment and typical case workflow are as follows:
Three workstations running Griffeye Analyze DI, all of which are air-gapped.
When digital devices are seized, a forensic image file using the E01 format is created for each device. In some instances, more than one forensic image file needs to be created for a device. Each case being worked on can contain any number of forensic image files.
Each forensic image file is imported into Griffeye Analyze DI. Most of the time, each forensic file is processed individually.
After processing an image file within Griffeye Analyze DI, a report is generated on the content found. Once all the image files for the case have been processed, a final case report is created.
Challenges with the previous system
Typically, a single case requires multiple devices to be investigated, and a recent case involved more than 80TB of content spread across multiple devices. Reviewing this volume of data takes a long time, and processing all the digital content can take weeks or even months.
The agency uses hash-matching, with the Griffeye Analyze DI tool, to help automate and speed up the review process for visual content. However, a great deal of manual work is still required. The following challenges were identified in the investigation workflow:
1. Incomplete hash lists
The hash lists currently used by the agency are incomplete; there are no hashes for a large number of previously known items of CSAM, leading to missed opportunities to automatically detect the type of CSAM which should be easiest to find.
2. Manual review
Hash lists for content already known to be not relevant, such as operating system files, are also limited. So, much of this content needs to be manually reviewed, which is a clear waste of investigators’ time.
3. Flawed hash types
While the hash type used to identify known CSAM images (PhotoDNA) works reasonably well, the agency uses hash types for video identification which are known to have serious flaws; cryptographic hashes such as MD5 and SHA-1 can only be used to identify videos which are exact copies. This means that a previously known CSAM video will escape detection if it has trivial alterations to its content, such as using a different encoding method.
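The limitation of cryptographic hashes described above can be shown in a few lines. The following is a minimal sketch and not part of any tool mentioned in this case study: it uses Python's standard hashlib module, a made-up byte string standing in for a video file, and a single flipped byte as a stand-in for the much larger changes a real re-encoding would produce.

```python
import hashlib


def digests(data: bytes) -> dict:
    """Return the MD5 and SHA-1 hex digests of a byte string."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


# Hypothetical stand-in for the bytes of a video file.
original = b"example video payload " * 1000

# An exact, byte-for-byte copy of the file.
exact_copy = bytes(original)

# A "trivially altered" version: one bit flipped in one byte.
# (Assumption: this models re-encoding, which in practice changes far more.)
altered_buffer = bytearray(original)
altered_buffer[100] ^= 0x01
altered = bytes(altered_buffer)

original_digests = digests(original)
copy_digests = digests(exact_copy)
altered_digests = digests(altered)

# The exact copy matches on both hash types.
print(copy_digests == original_digests)                    # True
# The altered file matches on neither.
print(altered_digests["md5"] == original_digests["md5"])   # False
print(altered_digests["sha1"] == original_digests["sha1"]) # False
```

Because a cryptographic hash is designed so that any change to the input produces an unrelated digest, a hash list built this way can only ever flag exact copies.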
4. Lack of capable identification functions
It is not possible to perform partial video matches, so a video containing scenes from previously identified CSAM will also escape detection.
Additionally, there is no way to identify duplicate videos, which significantly adds to investigators’ workload, since duplicates can make up a significant proportion of the content needing to be reviewed. Duplicate images can be detected using PhotoDNA, but this process is not perfect. Due to the volumes that need to be processed, content review is sometimes performed in multiple batches, where each media device (such as a hard disk or USB stick) is processed individually. This limits the identification of duplicated media even further, as duplicated content is not identified across different devices.
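The effect of processing each device in its own batch can be illustrated with a small sketch. It is a simplified model under stated assumptions: it detects only exact duplicates using SHA-256 from Python's standard library (not the visual fingerprinting discussed in this case study), and the device names, file names, and the find_duplicates helper are all hypothetical.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files_by_device):
    """Group files by SHA-256 digest across the given devices.

    files_by_device maps a device name to a dict of {path: file bytes}.
    Returns {digest: [(device, path), ...]} for digests seen more than once.
    """
    groups = defaultdict(list)
    for device, files in files_by_device.items():
        for path, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            groups[digest].append((device, path))
    return {d: locs for d, locs in groups.items() if len(locs) > 1}


# Hypothetical case: the same clip is stored on two different devices.
case = {
    "hdd-1": {"a.mp4": b"clip A", "b.mp4": b"clip B"},
    "usb-1": {"copy_of_a.mp4": b"clip A", "c.mp4": b"clip C"},
}

# Batch processing: each device is examined on its own.
per_device = {dev: find_duplicates({dev: files}) for dev, files in case.items()}

# Case-level processing: all devices are examined together.
case_level = find_duplicates(case)

print(per_device)                 # no duplicates found on either device alone
print(list(case_level.values()))  # the cross-device pair is found
```

In this model the per-device runs report nothing, while the case-level run finds the pair that spans both devices, which is the gap described above.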
5. No in-house hash lists
There are no in-house hash lists for content that has been identified during the processing of cases, so investigators may have to process the same content over again if it is found in a new case. This is true for both CSAM and non-relevant content.
6. Needlessly complex database access
As each workstation is air-gapped, accessing external hash databases is extremely complex.
Improving automatic content identification with Videntifier Nexus LE
While the Videntifier Nexus Platform includes the features that law enforcement needs to compare images and videos accurately and at scale, it was clear that to address the specific challenges of working with non-networked computers and identifying case-specific content, some additional components would be needed. To address these needs, Videntifier has created Videntifier Nexus LE.
Videntifier Nexus LE consists of the following components:
Videntifier Nano. A Hash Generation Tool with a graphical user interface that can scan disk volumes and generate Nexus Query Files for all visual content on the volumes.
Local Nexus LE server. A server that processes Nexus Query Files. It connects to the in-house hash databases and can be connected to the Nexus Core platform to query external hash databases. When processing Nexus Query Files, it uses the connected hash databases to check for previously identified content, and it can identify duplicated visual content within the file. The server has a web interface.
Local CSAM database. This database contains information about CSAM that has previously been processed by the law enforcement agency. Content can be added to this database by creating a Nexus Case Export file using a Griffeye plug-in.
Local whitelist database. This database contains information about non-relevant (whitelisted) content that has previously been processed by the law enforcement agency. Content can be added to this database by creating a Nexus Case Export file using a Griffeye plug-in.
Case database. This database contains all the Nexus Query Files for a case and is used when querying hash databases for known content and to identify duplicate files.
Griffeye import plug-in. This plug-in is used to import Nexus Result Files that have been processed by the Nexus LE server. Nexus Result Files contain information about previously identified content (both CSAM and non-relevant) as well as information about duplicate files in the case.
Griffeye export plug-in. This plug-in is used to create Nexus Case Export files that are used to add content to the local CSAM and local whitelist databases.
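The way the case database, the local CSAM database, and the local whitelist database work together can be pictured as a simple triage step. The sketch below is a conceptual illustration only: it does not reflect the actual Nexus file formats, hash types, or server API, and every name in it (Label, triage, the example hash strings) is hypothetical.

```python
from enum import Enum


class Label(Enum):
    KNOWN_ILLEGAL = "known_illegal"  # matched the local database of known illegal content
    NON_RELEVANT = "non_relevant"    # matched the local whitelist database
    NEEDS_REVIEW = "needs_review"    # no match; still requires manual review


def triage(case_hashes, known_illegal, non_relevant):
    """Label each file in a case by looking its hash up in two local hash sets.

    case_hashes maps a file path to its hash string. A match against the
    known-illegal set takes precedence over a whitelist match.
    """
    labels = {}
    for path, file_hash in case_hashes.items():
        if file_hash in known_illegal:
            labels[path] = Label.KNOWN_ILLEGAL
        elif file_hash in non_relevant:
            labels[path] = Label.NON_RELEVANT
        else:
            labels[path] = Label.NEEDS_REVIEW
    return labels


# Hypothetical hash values; real systems would use visual fingerprints.
known_illegal = {"h-001", "h-002"}
non_relevant = {"h-100", "h-101"}
case_hashes = {
    "img_1.jpg": "h-001",
    "system_icon.png": "h-100",
    "holiday.mp4": "h-999",
}

labels = triage(case_hashes, known_illegal, non_relevant)
to_review = [path for path, label in labels.items() if label is Label.NEEDS_REVIEW]
print(to_review)  # ['holiday.mp4']
```

Only files that match neither database remain in the manual review queue, which is where the reduction in investigator workload comes from in this simplified model.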
The following diagram shows the main components of the Videntifier Nexus LE system and the updated workflow when using Videntifier Nexus LE:
To help identify and exclude non-relevant files, Videntifier will build a database containing hashes for known operating system files and make it available to system users.
Although the current solution integrates with Griffeye Analyze DI, integrations with other forensic tools will be made available.
Benefits of using Videntifier Nexus LE
There are clear advantages to enhancing the automation of content identification. With Videntifier Nexus’ significantly upgraded video identification features, coupled with access to extensive hash databases for previously known CSAM and the ability to identify duplicate content at the case level, the automation of known content identification reaches a new milestone:
Content can be accurately identified even if it has been altered, so fewer opportunities to detect known CSAM are missed.
Less time is spent reviewing irrelevant content and duplicates, resulting in faster case processing.
Less exposure to harmful content greatly improves staff welfare and morale.
Preliminary findings can be acquired within hours, a critical success factor during the evidence gathering stage of an investigation.
Using Nexus LE also enables an agency’s local database of previously identified CSAM hashes to be safely shared with other law enforcement bodies. The Nexus system architecture supports various configurations, such as allowing agencies in different districts to query each other's databases, thereby opening up new avenues for collaboration and process improvement.