Wednesday, March 19, 2014

Uber Data Source: Holy Grail or Final Fantasy?

In August 2011, I gave a talk at the GFIRST conference entitled "Uber Data Source: Holy Grail or Final Fantasy?". In this talk, I proposed that, given the volume and complexity that a large number of highly specialized data sources brings to security operations, it makes sense to think about moving towards a smaller number of more generalized data sources. One could also imagine taking this concept further, ultimately resulting in an "uber data source". I would like to discuss this concept in more detail in this blog post. For the purposes of this post, I am working within the context of network traffic data sources. I consider host (e.g., AV logs), system (e.g., Linux syslogs), and application-level (e.g., web server logs) data sources to be outside the scope of this post.

Let's begin by first looking at the current state of security operations in most organizations, specifically as it relates to network traffic data log collection. In most organizations, a large number of highly specialized network traffic data sources are collected. This creates a complex ecosystem of logs that clouds the operational workflow. In my experience, the first question asked by an analyst when developing new alerting content or performing incident response is "To which data source or data sources do I go to find the data I need?". I would suggest, based on my experience, that this wastes precious resources and time. Rather, the analyst's first question should be "What questions do I need to ask of the data in order to accomplish what I have set out to do?". This necessitates a "go to" data source -- the "uber data source".

Additionally, it is helpful here to highlight the difference between data value and data volume. Each data source that an organization collects will have a certain value, relevance, and usefulness to security operations. Similarly, each data source will also produce a certain volume of data when collected and warehoused. Data value and data volume do not necessarily correlate. For example, firewall logs often consume 80% of an organization's log storage resources, but actually prove quite difficult to work with when developing alerting content or performing incident response. Conversely, as an illustrative example, DHCP logs provide valuable insight to security operations, but are relatively low volume.

There is also another angle to the data value vs. data volume point. As you can imagine, collecting a large volume of less valuable logs creates two issues, among others:
  • Storage is consumed more quickly, thus reducing the retention period (this can have a detrimental effect on security operations when performing incident response, particularly around intrusions that have been present on the network for quite some time)
  • Queries return more slowly due to the larger volume of data (this can have a detrimental effect on security operations when performing incident response, since answers to important questions come more slowly)
Those who disagree with me will argue: "I can't articulate what it is, but I know that when it comes time to perform incident response, I will need something from those other data sources." To those people, I would ask this question: If you're collecting so much data, irrespective of its value to security operations, that your retention period is cut to less than 30 days and your queries take hours or days to run, are you really able to use that data you've collected for incident response? I would think not.
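
To make the retention trade-off concrete, here is a rough back-of-the-envelope sketch in Python. Every figure in it is hypothetical (the 3,000 GB of storage, the per-source daily volumes, and the source names are assumptions chosen only so that firewall logs account for 80% of the volume, as in the example above). It assumes a constant daily volume and ignores compression and indexing overhead.

```python
from typing import Mapping


def retention_days(capacity_gb: float, daily_volume_gb: Mapping[str, float]) -> float:
    """Days of logs that fit in storage, assuming a constant daily volume."""
    total = sum(daily_volume_gb.values())
    if total <= 0:
        raise ValueError("total daily volume must be positive")
    return capacity_gb / total


# Hypothetical environment: firewall logs are 80% of the daily volume.
capacity_gb = 3000.0
daily_volume_gb = {"firewall": 80.0, "proxy": 12.0, "dns": 6.0, "dhcp": 2.0}

# Retention when collecting everything.
everything = retention_days(capacity_gb, daily_volume_gb)

# Retention if the highest-volume source were no longer collected as-is.
without_firewall = {k: v for k, v in daily_volume_gb.items() if k != "firewall"}
trimmed = retention_days(capacity_gb, without_firewall)

print(f"collect everything: {everything:.0f} days")
print(f"without firewall logs: {trimmed:.0f} days")
```

With these made-up numbers, the same storage holds 30 days of data when everything is collected and 150 days once the single largest source is removed. The exact figures will differ in any real environment; the point is only that retention scales inversely with daily volume.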

If we take a step back, we see that the current "give me everything" approach to log collection involves collecting a large number of highly specialized data sources. This is for a variety of reasons, but history and lack of understanding regarding each data source's value to security operations are among them. If we think about what these data sources are conceptually, we see that they are essentially meta-data from layer 4 of the OSI model (the transport layer) enriched with specific data from layer 7 of the OSI model (the application layer) suiting the purpose of that particular data source. For example, DNS logs are essentially meta-data from layer 4 of the OSI model enriched with additional contextual information regarding DNS queries and responses found in layer 7 of the OSI model. I would assert that there is a better way to operate without adversely affecting network visibility.
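
One way to picture "layer 4 meta-data enriched with layer 7 data" is as a single record type with an optional application-layer section. The Python sketch below is only an illustration of that idea: the `FlowRecord` class, its field names, and the `dns_view` helper are all hypothetical and do not correspond to any particular product or log format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FlowRecord:
    """Layer 4 meta-data common to all network traffic (hypothetical schema)."""
    timestamp: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str  # transport protocol, e.g. "tcp" or "udp"
    # Optional layer 7 enrichment, keyed by application protocol name.
    app: Dict[str, Dict[str, str]] = field(default_factory=dict)


def dns_view(records: List[FlowRecord]) -> List[Dict[str, str]]:
    """Return the DNS enrichment of every record that carries one."""
    return [r.app["dns"] for r in records if "dns" in r.app]


# Two made-up records: one DNS lookup and one HTTP request.
records = [
    FlowRecord(1395187200.0, "10.0.0.5", 53211, "10.0.0.1", 53, "udp",
               app={"dns": {"query": "example.com", "answer": "93.184.216.34"}}),
    FlowRecord(1395187201.0, "10.0.0.5", 49152, "93.184.216.34", 80, "tcp",
               app={"http": {"method": "GET", "host": "example.com"}}),
]

for entry in dns_view(records):
    print(entry["query"], "->", entry["answer"])
```

In this sketch, the information a specialized DNS log would carry is simply a filtered view of the generalized records, so the analyst queries one source instead of deciding which of several to consult.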

The question I asked back in 2011 was "Why not generalize this?". For example, why collect DNS logs as a specialized data source when the same visibility can be provided as part of a more generalized data source of higher value to security operations? In fact, this has been happening steadily over the last few years. It is now possible to architect network instrumentation to collect fewer data sources of higher value to security operations. This has several benefits:
  • Less redundancy and wastefulness across data sources
  • Less confusion surrounding where to go to get the required data
  • Reduced storage cost or increased retention period at the same storage cost
  • Improved query performance
The "uber data source" is a concept that I believe the security world is coming around to and moving towards. It may be a little uncomfortable to move away from the "give me everything" approach to log collection, but if you think about it, that's really the only way forward in the era of big data. Uber me, baby.
