Machine learning (ML) and artificial intelligence (AI) are becoming increasingly useful in cybersecurity. However, before you can model, validate, and visualize security data in a useful way, you need to prepare that data properly for input. This can be a difficult and complicated process, one that data scientists wrestle with often. Beyond traditional data preparation, such as cleansing and de-duplication, ML algorithms generally require numerical rather than plain-text input. The challenge is finding an efficient and accurate way to convert your data to numerical values that an ML model or algorithm can consume.
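To make the text-to-numbers step concrete, here is a minimal sketch of one very simple conversion: counting hashed character n-grams. This is not how Tripod works (it learns its representations); the function name `hashed_ngram_vector` and its parameters are made up for illustration only.

```python
import hashlib
import math


def hashed_ngram_vector(text: str, n: int = 3, dims: int = 16) -> list[float]:
    """Map a string to a fixed-length, L2-normalised vector of hashed n-gram counts."""
    counts = [0.0] * dims
    # Slide a window of n characters over the text (at least one window, even if short).
    for i in range(max(len(text) - n + 1, 1)):
        gram = text[i:i + n]
        # Use a stable hash; the built-in hash() is salted per process for strings.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dims
        counts[bucket] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts


# Hypothetical input: a short, obfuscated-looking JavaScript snippet.
vec = hashed_ngram_vector("eval(unescape('%75%6e'))")
print(len(vec))  # prints 16
```

Every input, whatever its length, comes out as a list of 16 floats, which is the kind of fixed-size numerical input most ML algorithms expect. A learned representation aims to capture far more structure than this counting trick does.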
The Security Intelligence Team within Adobe’s Security Coordination Center uses ML/AI to help more quickly recognize and identify potential threats to Adobe’s infrastructure. In order to keep up with new methods of attack and find potential “needles in the haystack,” we continually run new data models. We also need to make sure that the data prep process is not a hindrance to our efforts.
Tripod is a tool and model for computing latent representations for large sequences. It can be used for several potential applications including:
- Malicious code detection
- Sentiment analysis
- Information/code indexing and retrieval
- Anomaly detection/unsupervised learning
Tripod automatically computes latent representations of data in code and logs that ML/AI algorithms can use. By implementing three different methodologies—self-attention, global style tokens (GST), and “memory-based” representations—Tripod can more quickly turn traditional text-based data into numerical input that ML/AI algorithms can ingest.
Here are a few examples of how we use Tripod to help in our efforts here at Adobe.
By feeding both “good” and “bad” JavaScript code into Tripod, we get a dataset from which we can determine a single “classifier” – an attribute that distinguishes the malicious code – which we use to “train” the ML algorithm. Without Tripod, this process can be time-consuming and tedious.
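As a rough illustration of what training on vectorized code can look like, the sketch below fits a nearest-centroid classifier on a handful of labelled vectors. The three-dimensional vectors are invented stand-ins for real embeddings, the helper names are hypothetical, and nearest-centroid is simply an easy-to-read choice, not necessarily the algorithm we use.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train(labelled):
    """Turn {label: [vectors]} into {label: centroid}."""
    return {label: centroid(vecs) for label, vecs in labelled.items()}


def predict(model, vector):
    """Return the label whose centroid is closest to the vector."""
    return min(model, key=lambda label: euclidean(model[label], vector))


# Hypothetical embeddings for known-good and known-bad JavaScript samples.
training = {
    "benign": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "malicious": [[0.1, 0.9, 0.7], [0.2, 0.8, 0.9]],
}
model = train(training)
label = predict(model, [0.15, 0.85, 0.8])
print(label)  # prints malicious
```

Once every sample is a vector, swapping in a stronger classifier is a small change, because the input format stays the same.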
Tripod can also help uncover anomalous code, which is one of the most common ways malicious attacks infiltrate systems. After running raw log data through Tripod, we get a vectorized dataset that is easily ingested by the ML algorithm.
Tripod can also assist with log analysis. From a logging perspective, anomalies are events that occur very rarely in a dataset. Malicious events are similar – if infrastructure is well-secured, they are infrequent when compared to mainstream events. The direct approach is to identify the anomalies in a dataset and then search for malicious activities in this subset of data. For log sources that can generate up to several million events each hour, Tripod can help more quickly identify the subset of events that may be worth investigation.
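The "find the rare events first" idea can be sketched in a few lines once log events are vectors. The example below ranks events by their distance from the dataset mean and keeps the farthest ones for review. The two-dimensional vectors are made up, `rank_by_rarity` is a hypothetical helper, and distance from the mean is only one simple stand-in for an anomaly score.

```python
import math


def rank_by_rarity(vectors, top_k):
    """Return the indices of the top_k vectors farthest from the dataset mean."""
    dims = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

    def distance(v):
        return math.sqrt(sum((x - m) ** 2 for x, m in zip(v, mean)))

    # Sort event indices from most to least unusual.
    order = sorted(range(len(vectors)), key=lambda i: distance(vectors[i]), reverse=True)
    return order[:top_k]


# Hypothetical vectorized log events: five similar ones and one outlier.
events = [
    [1.0, 0.0], [0.9, 0.1], [1.1, -0.1], [1.0, 0.1], [0.95, 0.05],
    [-3.0, 4.0],  # index 5: unlike the rest
]
suspects = rank_by_rarity(events, top_k=1)
print(suspects)  # prints [5]
```

The returned subset is only a starting point: an analyst, or a second model, still has to decide which of the rare events are actually malicious.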
Data is critical to machine learning and the Tripod project can help you get more useful information quickly from complex datasets. Machine learning through tools like Tripod can help you find the answers you need to respond to potential issues and threats more quickly. You can download Tripod for yourself today from Adobe’s GitHub repository.
Tiberiu Boros
Data Scientist & Machine Learning Engineer
Andrei Cotaie
Sr. Security Engineer