An Analytical Approach: March 2014

Monday, March 31, 2014

Throw me a Bone

When I attend conferences, I'm always amazed at how many of the talks hype up the challenges and problems facing the information security community. Some of these talks remind me more of a Las Vegas show than a serious security talk. In my experience, most security professionals are already well aware of many of the challenges, as they face them head-on daily. Granted, there is a place for raising awareness, both within and outside the security community. For example, business executives may not be aware of the risks, dangers, and challenges facing their organizations from a security perspective. In my opinion, there are better forums than a security conference to educate those audiences. Similarly, within the security community, there are always new topics about which we need to be educated. Unfortunately, I'm not seeing a lot of that these days, but rather, a lot of the same material presented over and over again. In my opinion, part of the reason this occurs is that people don't have a lot of great answers, and so it's just easier to discuss the hype. Unfortunately, that won't solve any operational security problems for us.

Year after year, the consensus from the operational side seems to be "throw me a bone." In other words, enough of the hype -- give me some practical, sensible, tangible advice and insights that I can evaluate and consider implementing. This blog, and other blogs, forums, and publications like it, strive to provide that practical, sensible, and tangible advice that operational users want to hear. But sometimes, it is difficult for the "hands-on" information to be heard above the marketing, noise, and hype that pervades our profession.

FUD, marketing, and entertainment, unfortunately, will probably always get the press and the plaudits. Fortunately, news readers, strong peer networks, and trusted information sharing communities provide us good tools that we can use to share and consume the information we really need. My hope is that talks will become less hype and more hands-on in the coming years, but either way, we'll likely have to keep throwing each other those bones.

Friday, March 28, 2014

DNS Blind Spot

A recent DarkReading article discussed the subject of the DNS "blind spot". That is a topic that has always interested me, and I would like to discuss it further in this post. The DarkReading article, for reference, can be found here: http://www.darkreading.com/analytics/attacks-rise-on-network-blind-spot/d/d-id/1141552?

Essentially, as the article discusses, DNS is often an under-monitored or a completely unmonitored application protocol within an organization. As such, it's not surprising that attackers leverage it for command and control, data exfiltration, and other purposes. Attackers are always looking for ways to persist and avoid getting caught, and unmonitored application protocols provide them a great way to do that. I have worked with malicious code that uses DNS to move binary files in and out of a network. The malware accomplished this through a series of Base64-encoded strings that were sent via DNS TXT records. Pretty scary stuff.
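
As a rough illustration of what monitoring for that kind of activity might look like, here is a minimal sketch that flags TXT record strings that look like Base64 payloads. The 40-character minimum and the function name are my own assumptions for illustration, not values taken from any particular product or from the malware described above. A heuristic like this will also flag some legitimate long alphanumeric tokens, so treat a hit as a starting point for analysis rather than a verdict.

```python
import base64
import binascii
import re

# Hypothetical threshold: short TXT strings are ignored to cut down on noise.
MIN_LENGTH = 40
# Base64 alphabet with optional trailing padding.
BASE64_PATTERN = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def looks_like_base64(txt_value: str, min_length: int = MIN_LENGTH) -> bool:
    """Return True if a TXT record string is plausibly a Base64 payload."""
    candidate = txt_value.strip()
    # Padded Base64 always has a length that is a multiple of 4.
    if len(candidate) < min_length or len(candidate) % 4 != 0:
        return False
    if not BASE64_PATTERN.fullmatch(candidate):
        return False
    try:
        base64.b64decode(candidate, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True
```

Ordinary TXT content such as an SPF policy contains spaces and punctuation outside the Base64 alphabet, so it fails the pattern check.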

There are a number of angles one could take on this subject, but I would like to share a few reasons, in my experience, why organizations struggle with monitoring DNS. These reasons include:

Logging challenges: Some DNS implementations do not support logging very well. For example, some implementations log DNS requests, but not their corresponding responses. Others may log both requests and responses, but may not "match up" requests with their corresponding responses. Yet others may not support logging at all. All of these situations present challenges to an organization, as they leave the organization with an incomplete or non-existent data set that is extremely difficult to monitor from a security operations perspective.
Decentralized implementation: Many organizations have a diverse, scattered DNS implementation. In these organizations, end users are not forced through a centralized DNS infrastructure. Before the organization can even entertain a discussion on monitoring DNS, that organization needs to identify and collect logs from all of the various DNS servers. This can quickly become an overwhelming challenge that usually results in DNS remaining unmonitored.
Retention issues: DNS is an integral part of network communication, and thus, DNS logs can be quite voluminous. Often, this results in an organization making the decision not to collect DNS logs, even though they provide high value to security operations when implemented properly.
Lack of awareness: Some people are simply not aware of the risk that unmonitored DNS presents and the value to security operations that monitoring DNS presents. Without this awareness, organizations are missing the initial "spark" necessary to infuse their security operations program with DNS monitoring.
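
To make the "match up" problem from the logging challenges above a bit more concrete, here is a minimal sketch of pairing requests with responses after the fact. It assumes each log event is a dictionary with hypothetical fields `ts`, `client`, `txid` (the DNS transaction ID), `qname`, and `kind`; real DNS logs will use different field names and may not record all of these.

```python
from collections import defaultdict, deque


def pair_dns_events(events):
    """Pair each response with the oldest unanswered query sharing its key.

    Returns (pairs, unanswered_queries, orphan_responses).
    Any event whose kind is not "query" is treated as a response.
    """
    pending = defaultdict(deque)
    pairs, orphans = [], []
    for event in sorted(events, key=lambda e: e["ts"]):
        # Query names are case-insensitive, so normalize before matching.
        key = (event["client"], event["txid"], event["qname"].lower())
        if event["kind"] == "query":
            pending[key].append(event)
        elif pending[key]:
            pairs.append((pending[key].popleft(), event))
        else:
            orphans.append(event)
    unanswered = [q for queue in pending.values() for q in queue]
    return pairs, unanswered, orphans
```

Queries that never receive a response and responses with no matching query are both worth a look, which is why the sketch returns them separately instead of dropping them.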

These challenges may seem overwhelming, but there are some ways an organization can work around them. One way forward is a way that I have discussed previously on this blog, namely, the philosophy of collecting fewer, more generalized data sources of higher value to security operations. For example, a network forensics solution can bring us all the DNS logging we need, along with logging of a number of other application protocols we are likely interested in. This presents a centralized, easier to manage approach to monitoring DNS logs and eliminating that DNS blind spot. Whatever the solution, eliminating the DNS blind spot is critical.

Thursday, March 27, 2014

Don't Let Your Security Program Get Hit by a Bus

Some people are tinkerers by nature. Whether it be cars, electronics, or security technology, some people just love to "roll their own". While rolling your own can serve a purpose as a way to meet certain niche operational needs, it does present certain challenges from a business perspective. It is certainly possible that operational needs may exist that necessitate home grown solutions, but it is important to consider a number of factors before going that route:

Continuity: Continuity of operations is an important part of security operations. Stringing together a variety of home grown solutions may work well in the moment, but what if the one guy who knows how they work and how to use them gets hit by a bus or resigns? What if operational requirements change beyond what the home grown solution was designed to do? Further, what if there is a prolonged outage that causes a network visibility "blind spot"? It's all fun and games until you can't see what's actually going across your network when you need to.
Documentation: Documentation is not particularly fun to put together, but it is absolutely critical for effective security operations. Documentation needs to cover all aspects of security operations and incident response, including technologies that support the mission. If a home grown solution is necessary, it also needs to be documented in detail. Unfortunately, this is seldom the case in my experience.
TCO: Total cost of ownership is another important factor to consider. Labor is a sly, hidden cost that stealthily eats away at the efficiency of an organization. I once sat through a presentation on building a $100 flow sensor. What the presenter strategically omitted from the talk was that it took him six months to build that sensor. With benefits and overhead, that's more like a $150,000 flow sensor. And that's before we include any O&M costs. Not so cheap after all. If scarce analyst resources are spending time tinkering instead of performing incident response, that comes at a large cost to the organization.
O&M: I've previously blogged about how people often forget to include Operations and Maintenance (O&M) in their costing models. Every technology solution needs to be operated and maintained. It's merely a question of by whom and at what resource level. Solutions that include O&M by others or minimal O&M internally would seem to be the clear winners here.
Scale: If I can "MacGyver" a solution to a problem in a lab or as a proof-of-concept, great. But what about scaling that out to an enterprise-wide deployment? Probably a bit more difficult and resource intensive of an undertaking.
Loss of Focus: In security operations and incident response, skilled, qualified resources are incredibly scarce. Shouldn't those resources be focused on security operations and incident response rather than endeavors that distract from that?
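
The arithmetic behind the TCO point above is simple enough to sketch. The $300,000 fully loaded annual labor cost below is an assumption I chose so that six months of labor works out to roughly the $150,000 figure mentioned; substitute your own organization's number.

```python
def total_build_cost(hardware_cost, months_of_labor, loaded_annual_cost):
    """Hardware cost plus the labor spent building, before any O&M."""
    labor_cost = loaded_annual_cost * months_of_labor / 12
    return hardware_cost + labor_cost


# Assumed inputs: $100 of parts, six months of one person's time,
# and a hypothetical $300,000 per year for salary, benefits, and overhead.
print(total_build_cost(100, 6, 300_000))  # 150100.0
```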

Technology solutions cannot meet every operational requirement, but they can meet many. Before rolling your own solution, it is important to consider all of the variables factoring into the decision, including those that may be less tangible or somewhat hidden.

Wednesday, March 26, 2014

Crime Does Pay

When I was a child, I learned the slogan "crime doesn't pay" in school. This statement was part of a campaign to dissuade children from entering a life of crime. As I've gotten older though, I've realized that this statement is, in fact, wrong. Perhaps a more accurate statement would be "crime does pay, but you have to be prepared for the consequences". In essence, there is a risk/reward ratio at play here. Putting aside morals and ethics for a moment, if an individual is intent on committing a crime and calculates that the reward outweighs the risk, the individual will decide that committing the crime is a good business decision.

In the physical world, the risk/reward ratio is relatively straightforward to understand. For example, if I rob a bank, there is a very good chance I will get caught. If I do get caught, not only will I not get to keep the money I stole, but I will also go to jail for a long time. In that scenario, the risk is high, and while the reward potential is also high, it could very well be zero.

Unfortunately, in the on-line world, the risk/reward ratio breaks down completely, or more accurately, tips very much in favor of the criminal. It is very difficult to catch those who commit on-line crimes, for a variety of reasons. At the same time, it is extremely easy to commit on-line crimes, and the potential for reward is enormous. When people ask why criminal miscreants are so intent on intruding into business networks, they need only look at the calculation from the attacker's perspective to fully understand: High reward and low risk. It's the perfect storm of mathematics that fuels much of the intrusion activity we see today.

Because of this, we now find ourselves accepting the realization that breaches are going to happen routinely and regularly. As a community, we are moving towards devoting more resources to the practice of Continuous Security Monitoring (CSM) because of this realization. The game is less about "how can I stop the next attack" and more about "how can I detect, analyze, and contain the next attack rapidly". Of course, we should still ensure that our organizations are as secure and protected as possible. No matter how thorough we are though, the attackers will still find a way in. It's a question of when, not if.

One of the key priorities inside a security organization should be ensuring that the organization practices CSM and is prepared to perform incident response. Many organizations are making good progress with this, but some still lag behind. If your organization is not yet practicing CSM, now is a great time to start. It's only a matter of time until your organization suffers a breach. That is, of course, if there isn't already a breach inside your organization that we aren't aware of....

Tuesday, March 25, 2014

Jumping Off Points

The incident handling/incident response life cycle consists of the detection, analysis, containment, remediation, recovery, and lessons learned stages. These stages have been discussed at length elsewhere, including in previous posts on this blog. I've also discussed the subject of workflow as it supports this life cycle, as well as what goes into producing quality alerting and detection. One question people often ask me is one that complements these topics nicely -- "What is the best way to enter into the incident response process in an efficient and focused manner?". This is an excellent question. I have worked with many good analysts during the course of my career, but very few of them have been able to work efficiently without being led into the incident response process in some manner.

I believe that the answer to this question lies in the creation of "jumping off points". I have been guiding organizations in this direction for a little over a decade, and good results from a diverse array of organizations indicate to me that this is a winning approach.

The jumping off points approach assumes that due to the velocity, volume, and variety of data found on a large network, knowing the ground truth regarding all of the traffic on the network is essentially impossible. Instead, the approach seeks to identify incisive, targeted questions to ask of the data. These questions are designed to surgically extract behavior and activity of concern based on business needs, operational needs, organizational risk, management priorities, threat assessment, and other factors. The answers produced by these questions form the basis of alerting. There is no specific number of questions that an organization should aim to have. Rather, the organization should strive to produce reliable, high fidelity, actionable alerting at a reasonable enough volume that each alert can be reviewed by an analyst. Further, the work here is never "done". All of the factors mentioned above likely change continually. Existing questions should be revised, and new questions should be created as appropriate given the changing landscape.

The result of the jumping off points approach is a steady stream of reliable, high fidelity, actionable alerts. This stream has the intended consequence of providing analysts an efficient and focused manner in which to enter into the incident handling/incident response life cycle.
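
One way to picture the approach is to treat each question as a named test applied to every record, with each match becoming an alert. The sketch below is only an illustration: the record fields (`direction`, `bytes_out`, `app`, `dns_type`, `bytes_in`) and the two example questions with their thresholds are hypothetical, and a real set of questions would come out of the business, risk, and threat factors described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Question:
    name: str
    matches: Callable[[Dict], bool]


def generate_alerts(records: Iterable[Dict], questions: List[Question]) -> List[Dict]:
    """Ask every question of every record; each match becomes one alert."""
    alerts = []
    for record in records:
        for question in questions:
            if question.matches(record):
                alerts.append({"question": question.name, "record": record})
    return alerts


# Hypothetical questions; the thresholds are placeholders, not recommendations.
QUESTIONS = [
    Question(
        "large outbound transfer",
        lambda r: r.get("direction") == "outbound"
        and r.get("bytes_out", 0) > 500_000_000,
    ),
    Question(
        "unusually large DNS TXT response",
        lambda r: r.get("app") == "dns"
        and r.get("dns_type") == "TXT"
        and r.get("bytes_in", 0) > 200,
    ),
]
```

Keeping the questions in a plain list makes it easy to revise or retire them as the landscape changes, and to see at a glance how many alerts each one contributes to the work queue.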

Jumping off points have proven to be a good approach for the largest enterprises and government agencies. If you aren't already using this approach, are you ready to jump in?

Monday, March 24, 2014

Thoughts on Intelligence Retention

My previous post entitled "Measuring Security Intelligence Value" was quite popular, and I'm glad that I was able to put together a blog post that interested so many people. Recently, I was asked for my thoughts on intelligence retention. This catalyzed me to put together this post.

It is well known that intelligence is an important component of a successful security operations program. Ideally we would like to retain our intelligence forever, but what if that is not possible? In some organizations, it may be necessary to discard or "age-off" intelligence after some amount of time. In this post, I am writing from the perspective of an organization functioning as a consumer of intelligence. I am also assuming that the reader understands the difference between intelligence and information. The intelligence vs. information discussion is beyond the scope of this post, and I will therefore assume that the reader is acutely aware of the difference.

Before I discuss different approaches for aging-off intelligence, I would like to briefly discuss the concept of vetting. As I discussed in the "Measuring Security Intelligence Value" post and in earlier posts, in my experience, quality of intelligence is more important than quantity of intelligence. In other words, properly vetting intelligence before it is added to the organization's security repository has a number of benefits. Aside from the improved signal-to-noise ratio (ratio of true positives to false positives) resulting from improved intelligence, it also helps with the retention issue. When there is less "garbage" consuming retention resources, it allows us to retain our intelligence longer using the same amount of retention resources.

If it becomes necessary to discard intelligence, there are a number of different approaches one could employ. While not an exhaustive list, I have listed a few approaches here:

Simple time-based: In the simple time-based approach, intelligence is discarded after N days of retention (e.g., 180 days). This is probably the simplest approach to implement, but does not account for any of the other dimensions of each piece or source of intelligence.
Fidelity-based: In the fidelity-based approach, when it becomes necessary to discard intelligence, it is discarded from lowest fidelity to highest fidelity after the minimal retention period. For example, say we measure fidelity on a 1 to 10 scale, with a value of 1 indicating that the indicator is not very reliable/not of high fidelity, and a value of 10 indicating that the indicator is extremely reliable/of extremely high fidelity. In this approach, after the minimal retention period, we begin by discarding intelligence of fidelity 1, then 2, and so on until it is no longer necessary to discard intelligence. Using this approach enables us to retain our highest fidelity intelligence for longer than our lowest fidelity intelligence.
Source-based: In the source-based approach, when it becomes necessary to discard intelligence, it is discarded from the least reliable source to the most reliable source after the minimal retention period. For example, say we measure source reliability on a 1 to 10 scale, with a value of 1 indicating that the source is not very reliable, and a value of 10 indicating that the source is extremely reliable. In this approach, after the minimal retention period, intelligence is discarded from sources of reliability 1, then 2, and so on until it is no longer necessary to discard intelligence. Using this approach enables us to retain intelligence from our best sources for longer than intelligence from our not so great sources.
Attack stage-based: In the attack stage-based approach, we look at discarding intelligence based on the particular attack stage it is relevant to. For example, we may value intelligence related to command and control (C2) sites more than we value intelligence related to exploit sites. As such, we can build a prioritized list of attack stages per the needs of our organization. After the minimal retention period, we can discard intelligence from the attack stage of least priority, followed by the attack stage of second least priority, and so on, until it is no longer necessary to discard intelligence. Proceeding in this manner allows us to retain intelligence related to the attack stages we are most concerned with for longer than intelligence related to the attack stages we are least concerned with.
Type-based: In the type-based approach, we look at discarding intelligence based on the type of intelligence it is. For example, we may value URL patterns more than we value IP addresses. As such, we can make a prioritized list of intelligence types. After the minimal retention period, we can discard intelligence from the type of least priority, followed by the type of second least priority, and so on, until it is no longer necessary to discard intelligence. This approach enables us to retain certain types of intelligence longer than others per our organizational needs.
Flag-based: In the flag-based approach, we "flag" specific intelligence that is of high value to us. After the minimal retention period, we discard intelligence that is not flagged. This allows us to retain specific pieces of intelligence that are of high value to us beyond the minimal retention period.

As I'm sure you've realized by now, it is also possible to use combinations of the above approaches as suits the organization's needs and goals. It is possible to combine various approaches to create a retention approach that leaves an organization with more of the intelligence it values most and less of the intelligence it values least.
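
As one possible combination, the sketch below mixes the flag-based and fidelity-based approaches with a minimal retention period. The `Indicator` fields and the tie-break (oldest first within the same fidelity) are my own assumptions for illustration; the same pattern works with source reliability, attack stage, or type priority as the sort key.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Indicator:
    value: str
    age_days: int
    fidelity: int          # 1 (least reliable) to 10 (most reliable)
    flagged: bool = False  # flagged intelligence is never discarded here


def select_for_discard(
    indicators: List[Indicator], min_retention_days: int, count_to_free: int
) -> List[Indicator]:
    """Pick what to age off: unflagged, past minimal retention, lowest fidelity first."""
    eligible = [
        i for i in indicators
        if i.age_days > min_retention_days and not i.flagged
    ]
    # Lowest fidelity goes first; within the same fidelity, oldest goes first.
    eligible.sort(key=lambda i: (i.fidelity, -i.age_days))
    return eligible[:count_to_free]
```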

Intelligence is not a linear undertaking. So why should intelligence retention be approached only linearly?

Friday, March 21, 2014

Answers at the Speed of Business

Imagine yourself as the lead incident responder during a breach response. If you've been in this position, as I have, you know that it can feel a bit like being in the hot seat. During the breach response, key stakeholders will have important, time-sensitive questions they want answered. Those questions will be aimed directly at you, and you will be expected to provide answers quickly -- answers at the speed of business. The stakeholders don't just need answers -- they need them now -- or better yet, make that yesterday. These stakeholders may include executives, legal, privacy, public relations, clients, partners, and others. The questions they will ask are designed to quickly assess damage and risk to the organization, as well as what follow-on actions need to be taken from a legal, privacy, and/or public relations standpoint.

There are many questions these stakeholders might pose, but a few of the more common ones are:

How did this happen?
When did this begin?
Is this activity still occurring?
How many systems/brands/products have been affected?
What sensitive, proprietary, and/or confidential/private data has been taken?
What can be done to stop this activity/prevent it from happening again?

Performing network forensics allows us to query, interrogate, and study the data to obtain accurate answers to important stakeholder questions. As you can imagine, every moment is critical during this process. Given this, it always frustrated me that I seemed to spend a majority of my time waiting for queries to return or "munging" data (due to tool limitations), rather than actually doing analysis. I could never understand why a) vendors sold technology that didn't meet the needs of incident responders, b) the organizations I was supporting bought that technology, and c) I was expected to use something that was not properly designed for the purposes I was being forced to use it for. It always seemed like the technologies I was using were fighting me, rather than enabling and empowering me.

I've been in the hot seat enough times to know that enough is enough. The time has come for network forensics technology that meets the needs of incident responders. Anything less simply fails them. With the stakes as high as they are today, failure is not an option.

Thursday, March 20, 2014

Ask a Stupid Question....

As the saying goes, ask a stupid question, get a stupid answer. Security professionals know that in order to properly run security operations and perform incident response, we need to be able to ask intelligent questions of our data. We need to be able to issue precise, targeted, incisive queries to home in on the most relevant data, while minimizing or eliminating time spent with data that is irrelevant. With the velocity, volume, and variety of data confronting us, this concept is more central than ever to effective security operations and incident response. Given this, I am often surprised at how few technologies truly empower the analyst to ask those intelligent questions. If your technologies only allow you to ask stupid questions, what kind of answers do you think you'll get?

Wednesday, March 19, 2014

Uber Data Source: Holy Grail or Final Fantasy?

In August 2011, I gave a talk at the GFIRST conference entitled "Uber Data Source: Holy Grail or Final Fantasy?". In this talk, I proposed that given the volume and complexity that a larger number of highly specialized data sources brings to security operations, it makes sense to think about moving towards a smaller number of more generalized data sources. One could also imagine taking this concept further, ultimately resulting in an "uber data source". I would like to discuss this concept in more detail in this blog post. For the purposes of this post, I am working within the context of network traffic data sources. I consider host (e.g., AV logs), system (e.g., Linux syslogs), and/or application level (e.g., web server logs) data sources beyond the context of this blog posting.

Let's begin by first looking at the current state of security operations in most organizations, specifically as it relates to network traffic data log collection. In most organizations, a large number of highly specialized network traffic data sources are collected. This creates a complex ecosystem of logs that clouds the operational workflow. In my experience, the first question asked by an analyst when developing new alerting content or performing incident response is "To which data source or data sources do I go to find the data I need?". I would suggest, based on my experience, that this wastes precious resources and time. Rather, the analyst's first question should be "What questions do I need to ask of the data in order to accomplish what I have set out to do?". This necessitates a "go to" data source -- the "uber data source".

Additionally, it is helpful here to highlight the difference between data value and data volume. Each data source that an organization collects will have a certain value, relevance, and usefulness to security operations. Similarly, each data source will also produce a certain volume of data when collected and warehoused. Data value and data volume do not necessarily correlate. For example, firewall logs often consume 80% of an organization's log storage resources, but actually prove quite difficult to work with when developing alerting content or performing incident response. Conversely, as an illustrative example, DHCP logs provide valuable insight to security operations, but are relatively low volume.

There is also another angle to the data value vs. data volume point. As you can imagine, collecting a large volume of less valuable logs creates two issues, among others:

Storage is consumed more quickly, thus reducing the retention period (this can have a detrimental effect on security operations when performing incident response, particularly around intrusions that have been present on the network for quite some time)
Queries return more slowly due to the larger volume of data (this can have a detrimental effect on security operations when performing incident response, since answers to important questions come more slowly)

Those who disagree with me will argue: "I can't articulate what it is, but I know that when it comes time to perform incident response, I will need something from those other data sources." To those people, I would ask this question: If you're collecting so much data, irrespective of its value to security operations, that your retention period is cut to less than 30 days and your queries take hours or days to run, are you really able to use that data you've collected for incident response? I would think not.
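
The effect of volume on retention is easy to estimate. In the sketch below, the storage capacity and per-source daily volumes are made-up numbers, chosen so that firewall logs account for 80% of the daily volume as in the earlier example.

```python
def retention_days(capacity_gb, daily_volume_gb_by_source):
    """Days of history that fit in storage at the current collection rate."""
    total_daily_gb = sum(daily_volume_gb_by_source.values())
    return capacity_gb / total_daily_gb


# Hypothetical environment: 100 TB of log storage and three data sources.
sources = {"firewall": 3200, "proxy": 500, "dns": 300}
print(retention_days(100_000, sources))  # 25.0

# The same storage without the highest-volume source.
without_firewall = {k: v for k, v in sources.items() if k != "firewall"}
print(retention_days(100_000, without_firewall))  # 125.0
```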

If we take a step back, we see that the current "give me everything" approach to log collection involves collecting a large number of highly specialized data sources. This is for a variety of reasons, but history and lack of understanding regarding each data source's value to security operations are among them. If we think about what these data sources are conceptually, we see that they are essentially meta-data from layer 4 of the OSI model (the transport layer) enriched with specific data from layer 7 of the OSI model (the application layer) suiting the purpose of that particular data source. For example, DNS logs are essentially meta-data from layer 4 of the OSI model enriched with additional contextual information regarding DNS queries and responses found in layer 7 of the OSI model. I would assert that there is a better way to operate without adversely affecting network visibility.
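
To illustrate the "layer 4 meta-data enriched with layer 7 data" idea, here is a minimal sketch of what a single generalized record might look like, with a DNS log recovered as nothing more than a filtered view of it. The class name, field names, and the `dns_view` helper are all hypothetical; they do not describe any specific product's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SessionRecord:
    # Layer 4 (transport) meta-data common to every record.
    start_time: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    bytes_sent: int
    bytes_received: int
    # Layer 7 (application) enrichment; left empty when nothing was parsed.
    app_protocol: str = ""
    app_fields: Dict[str, str] = field(default_factory=dict)


def dns_view(records: List[SessionRecord]) -> List[Dict[str, str]]:
    """Recover a DNS-log-like view from the generalized records."""
    return [
        {"client": r.src_ip, "server": r.dst_ip, **r.app_fields}
        for r in records
        if r.app_protocol == "dns"
    ]
```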

The question I asked back in 2011 was "Why not generalize this?". For example, why collect DNS logs as a specialized data source when the same visibility can be provided as part of a more generalized data source of higher value to security operations? In fact, this has been happening steadily over the last few years. It is now possible to architect network instrumentation to collect fewer data sources of higher value to security operations. This has several benefits:

Less redundancy and wastefulness across data sources
Less confusion surrounding where to go to get the required data
Reduced storage cost or increased retention period at the same storage cost
Improved query performance

The "uber data source" is a concept that I believe the security world is coming around to and moving towards. It may be a little uncomfortable to move away from the "give me everything" approach to log collection, but if you think about it, that's really the only way forward in the era of big data. Uber me, baby.

Tuesday, March 18, 2014

The Question

When I speak at conferences or in private meetings, I inevitably get "the question" immediately after presenting:

"How do you understand our pain so well?"

The answer is simple -- I lived that pain for over a decade on the operational side before moving over to the vendor side. I've seen what enables, empowers, and facilitates security operations and incident response and what doesn't. I've seen how vendors struggle with fitting their technology into the operational workflow, rather than forcing the operational workflow to fit their technology. I've also seen where vendors typically fall short of the needs of the analysts and incident responders.

All of that pain and experience influence my professional world view, which in turn, results in a better, more operationally useful product. The best vendors I worked with while on the operational side were those that came from an operational background. Those were the vendors that best understood operational issues, gaps, and needs and sought to address them.

If you are working with vendors that don't approach your challenges from the perspective of an operational background, how can you be certain that they will truly understand your pain and deliver solutions that meet your operational needs? I'd suggest that this is something important to think about as you evaluate different technologies. I'm sure you'd prefer that your vendors were educated previously on somebody else's dime, rather than your own.

Monday, March 17, 2014

Signal-to-Noise Ratio

Recent media reports discussing the Target and Neiman Marcus breaches have indicated that, in both cases, numerous alerts fired as a result of the intrusion activity. In both cases, the alerts were not properly handled, causing the breaches to remain undetected. I'm sure there are many angles from which these reports can be dissected. Rather than play the blame game, I would like to discuss a subject that remains a challenge for our profession as a whole: the signal-to-noise ratio.

Wikipedia defines the signal-to-noise ratio as "a measure used in science and engineering that compares the level of a desired signal to the level of background noise." In other words, the more you have of what you want, and the less you have of what you don't want, the easier it is to measure something. Let's illustrate this concept by imagining a conversation between two people in a noisy cafe. If I record that conversation from the next table, upon playback, it will be very difficult for me to truly understand what was discussed. Conversely, if I record that conversation in a quiet room, it will be much easier to understand what was discussed upon playback. The signal-to-noise ratio in the second scenario is much higher than in the first scenario.

The same concept applies to security operations and incident response. In security operations, true positives are the signal, and false positives are the noise. Consider the case of two different Security Operations Centers (SOCs), SOC A and SOC B. In SOC A, the daily work queue contains approximately 100 reliable, high fidelity, actionable alerts. Each alert is reviewed by an analyst. If incident response is necessary for a given alert, it is performed. In SOC B, the daily work queue contains approximately 100,000 alerts, almost all of which are false positives. Analysts attempt to review the alerts of the highest priority. Because of the large volume of even the highest priority alerts, analysts are not able to successfully review all of the highest priority alerts. Additionally, because of the large number of false positives, SOC B's analysts become desensitized to alerts and do not take them particularly seriously.

One day, 10 additional alerts relating to payment card stealing malware fire within a few minutes of each other.

In SOC A, where every alert is reviewed by an analyst, where the signal-to-noise ratio is high, and where 10 additional alerts seems like a lot, analysts successfully identify the breach less than 24 hours after it occurs. SOC A's team is able to perform analysis, containment, and remediation within the first 24 hours of the breach. The team is able to stop the bleeding before any payment card data is exfiltrated. Although there has been some damage, it can be controlled. The organization can assess the damage, respond appropriately, and return to normal business operations.

In SOC B, where an extremely small percentage of the alerts are reviewed by an analyst, where the signal-to-noise ratio is low, and where 10 additional alerts doesn't even raise an eyebrow, the breach remains undetected. Months later, SOC B will learn of the breach from a third party. The damage will be extensive, and it will take the organization months or years to fully recover.

Unfortunately, in my experience, there are a lot more SOC B's out there than there are SOC A's. It is relatively straightforward to turn a SOC B into a SOC A, but it does require experienced professionals, organizational will, and focus. How do I know? I've turned SOC B's into SOC A's several times during my career.

We are fortunate to have some great technology choices these days that we can leverage to improve our security operations and incident response functions. These technology choices can enable us to learn of and respond to breaches soon after they occur. Before purchasing any technology intended to produce alerts destined for the work queue, we should ensure that it supports the ability to ask very precise, targeted, incisive questions of the data. This enables us to home in on the activity we want to identify (the true positives/the signal), while minimizing the activity we do not want to identify (the false positives/the noise). As always, these technologies are tools that need to be properly leveraged as part of a larger people, process, and technology picture.

What is your signal-to-noise ratio? Is it high enough to detect the next breach, or could it stand to be strengthened? I would posit that the ratio of true positives to false positives (the signal-to-noise ratio) is an important metric that all organizations should review. Not doing so could have dire consequences.
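
To make the metric concrete, here is a minimal sketch of how it might be computed. It assumes that every alert closed in the work queue carries a disposition label assigned by the analyst; the label names, the function name, and the sample numbers are all hypothetical and only loosely modeled on SOC A and SOC B above.

```python
from collections import Counter


def signal_to_noise(dispositions):
    """Return (ratio, fraction_true) for a collection of alert dispositions.

    dispositions: iterable of strings, each either "true_positive" or
    "false_positive" (hypothetical labels set when an alert is closed).
    """
    counts = Counter(dispositions)
    tp = counts["true_positive"]
    fp = counts["false_positive"]
    # Ratio of signal to noise; avoid dividing by zero when there is no noise.
    if fp:
        ratio = tp / fp
    else:
        ratio = float("inf") if tp else 0.0
    total = tp + fp
    # Fraction of all alerts that turned out to be worth the analyst's time.
    fraction_true = tp / total if total else 0.0
    return ratio, fraction_true


# Illustrative numbers only, not measurements from any real SOC.
soc_a = ["true_positive"] * 90 + ["false_positive"] * 10
soc_b = ["true_positive"] * 100 + ["false_positive"] * 99_900

print(signal_to_noise(soc_a))
print(signal_to_noise(soc_b))
```

Tracked week over week, a number like this would show whether tuning efforts are actually moving a queue from the SOC B end of the spectrum toward the SOC A end.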

Friday, March 14, 2014

Year of the Data Breach or Year of the Cloud?

Some people have been calling 2014 the year of the data breach. It's not difficult to understand why -- it seems that there is another breach in the news weekly, if not more often than that. People often ask me why there are so many breaches in the news of late. I can't say for sure, but I suspect it is some combination of these factors, among others:

Crime does pay (attackers profit by compromising organizations)
Difficulty in tracking down and prosecuting the attackers (for a variety of reasons)
Better detection techniques
Better information sharing
Decrease in stigma for owning up to a compromise
Greater security awareness among business leaders and executives

My thought is that 2014 will actually be remembered as the year of the cloud. Time will tell for sure, but I am already seeing a few indications that this may be the case:

Small and medium-sized businesses are becoming more acutely concerned by the risks and threat landscape, causing them to seek economically viable security solutions for the SMB market (reference earlier "Security as a Line Item" blog posting).
Tightening budgets inside enterprises and governments, causing those organizations to seek economies of scale for security solutions
Shortage of qualified analytical talent, causing organizations to consider de facto analyst "time-sharing" arrangements
Movement towards a "SOC Center of Excellence" model, allowing organizations to focus on their primary business (which is most often not security)
Vastly increased interest in publications and blogs discussing the cloud

If anything, I would argue that the recent press on breaches has helped to accelerate the move to the cloud that was already underway. Each new breach that comes to light likely causes several organizations to move from the thought stage to the action stage. Perhaps the year of the cloud is upon us?

Thursday, March 13, 2014

New TLDs

Recently, ICANN has delegated 100 new top level domains (TLDs). For example, it is now possible to register and use domains ending in .best, .fish, .vacations, and many others. Additional TLDs are on the way in the near future as well. The complete list of domains that have been delegated, and to whom they have been delegated, can be found here: http://newgtlds.icann.org/en/program-status/delegated-strings.

There are many reasons why the list of TLDs was expanded. Instead of discussing the reasons behind TLD expansion, I would like to discuss the implications of this TLD expansion to security operations.

For starters, TLD expansion means that it is now even easier than it already was for attackers to register and use malicious domains to carry out attacks against organizations. For example, there are now an even greater number of options for registering exploit, payload delivery, callback, update, and drop site domains. Previously, we had seen attackers leverage the "user-friendly" .cc and .ms TLDs (among others) extensively because of this. I'm sure that the list of "user-friendly" domains has now been expanded considerably.

So what can an organization do to try to stay ahead of, or at least current with, the threat? Fortunately, network traffic data can be used to provide us an analytical approach to tackling this challenge. Let's take a look at some steps we might be able to take proactively to assess which TLDs are required for business operations and for which TLDs we can consider putting controls in place:

Begin by running an aggregate query over several weeks or one month of network traffic data and aggregating by TLD with count. The idea here is to cover a large enough period of time so as to get as complete a picture as possible regarding normal business operations.
Note all TLDs that do not appear in the query results but do appear in the TLD expansion list referenced above (i.e., there is no network traffic data to those TLDs). For example, we might not see .best, .fish, or .vacations in the query results. Because it does not appear that these TLDs are necessary for business operations, controls can be put in place to block/deny traffic to and from these TLDs.
Note all TLDs that do appear in the query results and have a high count (a large amount of traffic) to them (e.g., .com, .org, .net, etc.). A large amount of traffic indicates that the TLDs are important for business operations and should be left untouched. Note that I am only talking about controls at the TLD level here -- specific known malicious domains can and should still be blocked.
Note all TLDs that appear in the query results and have a low count (a small amount of traffic) to them. Drill down into this traffic and analyze it more deeply. Determine whether the traffic is legitimate (i.e., necessary for business operations), recreational, suspicious, or malicious. If the traffic is not required for business operations, consider putting controls in place to block/deny traffic to and from these TLDs.
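
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the destination hostnames have already been extracted from several weeks of DNS or proxy logs (with raw IP addresses filtered out), the list of new TLDs has been copied from the ICANN page referenced above, and the cutoff between "low" and "high" counts is a hypothetical threshold that each organization would need to choose for itself. The function names are made up for this example.

```python
from collections import Counter


def tld_of(hostname):
    """Return the last DNS label of a hostname, lowercased (e.g. 'com')."""
    return hostname.rstrip(".").rsplit(".", 1)[-1].lower()


def triage_tlds(hostnames, new_tlds, low_count_threshold=10):
    """Bucket TLDs according to the steps described above.

    hostnames: destination hostnames observed in network traffic data.
    new_tlds: newly delegated TLDs, with or without a leading dot.
    low_count_threshold: hypothetical cutoff between low and high counts.
    """
    # Step 1: aggregate by TLD with count.
    counts = Counter(tld_of(h) for h in hostnames)
    new_tlds = {t.lower().lstrip(".") for t in new_tlds}
    # Step 2: new TLDs with no observed traffic are candidates for blocking.
    unseen = sorted(new_tlds - set(counts))
    # Step 3: high-count TLDs matter to the business and are left untouched.
    high = sorted(t for t, n in counts.items() if n >= low_count_threshold)
    # Step 4: low-count TLDs need a deeper look before any control is applied.
    low = sorted(t for t, n in counts.items() if n < low_count_threshold)
    return {"block_candidates": unseen, "leave_alone": high, "review": low}


# Made-up sample traffic for illustration.
observed = (["www.example.com"] * 50
            + ["mail.example.org"] * 20
            + ["promo.example.cc"] * 2)
print(triage_tlds(observed, [".best", ".fish", ".vacations"]))
```

In practice the aggregation would run inside whatever platform holds the network traffic data; the point of the sketch is only that the triage logic itself is small once the counts exist.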

The threat landscape is continuously evolving. As security professionals, we continually seek opportunities to proactively protect the enterprise. In the case of the new TLDs, we can use the network traffic data and our analytical skills to allow the data to guide us towards better controls that protect our organizations without negatively impacting business operations. The data is your friend. Use it.

Wednesday, March 12, 2014

Dialogue

Information sharing through trusted, vetted channels is an integral part of a successful security operations program. For the purpose of this blog posting, let's assume that an organization already has in place the ability to leverage their host and network forensics infrastructure to both identify information worth sharing and capitalize upon information they receive through trusted, vetted channels. Even with this in place, it can still be difficult for an organization to share information. What could be limiting the sharing? There may be many factors, but one such factor I've seen repeatedly is not a technical limitation, but rather, an organizational limitation.

Legal and privacy professionals have an obligation to protect the organizations and data they represent. Most legal and privacy professionals come from rigorous legal and/or regulatory backgrounds, but they are not necessarily technical, and they don't usually have an operational background in security. Thus, when security professionals within an organization try to gain approval for an information sharing program, a game of telephone often ensues. Allow me to explain:

As security professionals, we might say "we would like to share lists of domain names we have observed engaged in malicious activity". Legal and privacy professionals might hear "they want to share lists that may include our clients' or partners' domain names". Or, we might say "we would like to share lists of email addresses we have observed sending phishing emails into the enterprise". Legal and privacy professionals might hear "they want to share lists of internal email addresses and potentially contents of email".

And so on -- there is no shortage of examples that I could bring here. As you can see, each party comes from their respective angle, and each party has difficulty understanding where the other party is coming from. This can easily lead to impasse, frustration, and deadlock within an organization, to the detriment of security operations. What can be done to remedy this? As security professionals, it is our duty to engage legal and privacy professionals in a dialogue. Will we have to educate them? Yes, absolutely. Will we have to be educated on certain issues ourselves and possibly change some of our policies and procedures? Of course. Will we reach a mutual understanding in the end that leads to better security operations and reduced risk for the enterprise? I truly believe so, and in fact, I have seen this with my own eyes. Because of this, it is incumbent upon us as security professionals to engage legal and privacy professionals in a dialogue. It may not come as naturally to us as other aspects of our jobs, but the stakes are too high for us not to.

Tuesday, March 11, 2014

100 a Day

One of the goals of an incident response team should be to handle no more than 100 alerts a day. At first, this may sound like a ridiculous assertion. However, I think that if we examine this more closely, you will agree that it makes sense. Let's take an analytical approach and go to the numbers.

As previously discussed on this blog and elsewhere, one hour from detection to containment should be the goal in incident response. Put another way, one hour should be the time allotted to work an alert, perform all required analysis, forensics, and investigation, and take any necessary containment actions. Let's say we have each of our analysts working an eight hour shift. Assuming 100% productivity for each analyst, that allows each analyst to work approximately eight alerts per day. Let's assume that we want to work 96 alerts properly each day (since 100 is not divisible by eight). That works out to a requirement to have 12 analysts on shift (or spread across multiple shifts) to give proper attention to each alert. What happens if analyst cycles are taken away from incident response and lent to other tasks? The numbers look worse. What happens if the necessary analysis, forensics, and investigation take more than an hour (due to technology, process, or other limitations)? The numbers look worse still.
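
The arithmetic above is easy to rerun with your own numbers. The following is a minimal sketch; the function name and the `productivity` parameter (the fraction of a shift actually spent on incident response) are assumptions introduced for illustration.

```python
import math


def analysts_required(alerts_per_day, hours_per_alert=1.0,
                      shift_hours=8.0, productivity=1.0):
    """Return the number of analysts needed to work every alert properly."""
    # Total analyst-hours of work the daily queue represents.
    hours_needed = alerts_per_day * hours_per_alert
    # Hours one analyst can actually devote to alerts in a shift.
    hours_per_analyst = shift_hours * productivity
    return math.ceil(hours_needed / hours_per_analyst)


print(analysts_required(96))                       # 12
print(analysts_required(100))                      # 13
print(analysts_required(96, productivity=0.75))    # 16
print(analysts_required(96, hours_per_alert=1.5))  # 18
```

The last two calls show how quickly the staffing requirement grows once analysts are pulled onto other tasks or an alert takes longer than an hour to work.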

So, if you're the type of enterprise that has 500 analysts sitting in your SOC or Incident Response Center, you can probably stop reading this blog post and get back to your daily routine. What's that you say? The analyst is the scarcest resource, and you don't have enough of them? Yes, of course. I know.

Let's face it -- the numbers are sobering. Even a large enterprise with a large incident response team can realistically handle no more than 100-200 alerts in a given day. Sometimes I meet people who tell me that "we handle 5,000 incidents per day". I don't believe that for a second (putting aside, for now, the fact that incidents, events, and alerts are not the same thing). Either that organization is not paying each alert the attention it deserves, or the alerts are of such low value to security operations that it wouldn't make much difference whether they fired or not. One need only look to the recent Neiman Marcus intrusion to see the devastating effects of having too large a volume of noisy, low fidelity, false-positive prone alerts that drown out any activity of true concern (http://www.businessweek.com/articles/2014-02-21/neiman-marcus-hackers-set-off-60-000-alerts-while-bagging-credit-card-data).

Clearly, the challenge becomes populating the alerting queue with reliable, high fidelity, actionable alerts for analysts to review in priority order (priority will be the subject of an upcoming blog post). This process is sometimes referred to as content development and can be outlined at a high level as follows:

Collect the data of highest value and relevance to security operations and incident response. As previously discussed on this blog, the goal is fewer data sources that provide higher value at lower volume/size, while still maintaining the required visibility.
Identify goals and priorities for detection and alerting in line with business needs, security needs, management/executive priorities, risk/exposure, and the threat landscape. Use cases can be particularly helpful here.
Craft human language logic designed to extract only the events relevant to the goals and priorities identified in the previous step.
Convert the human language logic into precise, incisive, targeted queries designed to surgically extract reliable, high fidelity, actionable alerts with few to no false positives.
Continually iterate through this process, identifying new goals and priorities, developing new content, and adjusting existing content based on feedback obtained through the incident response process.
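
As a small illustration of the step from human language logic to a precise query, consider the made-up statement "alert when a workstation sends an HTTP POST to a vetted callback domain." The sketch below expresses that statement as a filter over events. The event fields, the segment names, and the indicator list are all hypothetical; a real deployment would express the same logic in the query language of its own collection and analysis platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A simplified network event (hypothetical fields for illustration)."""
    src_segment: str   # e.g. "workstation" or "server"
    dest_domain: str
    http_method: str


# Hypothetical list of vetted, high fidelity callback domains.
CALLBACK_DOMAINS = {"callback.example"}


def is_alert(event):
    """Human language logic: a workstation POSTs to a vetted callback domain."""
    return (event.src_segment == "workstation"
            and event.http_method == "POST"
            and event.dest_domain in CALLBACK_DOMAINS)


def build_queue(events):
    """Return only the events that deserve a place in the work queue."""
    return [e for e in events if is_alert(e)]


sample_events = [
    Event("workstation", "www.example.com", "GET"),
    Event("workstation", "callback.example", "POST"),
    Event("server", "callback.example", "GET"),
]
print(build_queue(sample_events))
```

Each condition in `is_alert` narrows the result set, which is what keeps the queue small; feedback from incident response would then drive adjustments to the conditions and to the indicator list.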

Resources are limited. Every alert counts. Make every alert worth the analyst's attention.

Monday, March 10, 2014

Buyer Beware

A couple of weeks ago, I attended the RSA conference in San Francisco. I always enjoy attending RSA, as it provides a unique opportunity to engage many different aspects of the larger security community at the same time. The conference is attended by vendors, practitioners/enterprises, researchers, industry analysts, journalists, investors, and others. I was fortunate enough to take part in several interesting and engaging discussions during the week. I would like to discuss one observation I made during the conference in this posting.

I took some time during the week to walk the vendor expo two or three times. What I saw there inspired this blog, though it didn't necessarily surprise me. Not every vendor on the floor was guilty of this, but many, many vendors proffered a technology or solution for "big data", "security analytics", and/or "big data security analytics". In other words, many (though not all) vendors said they provided a solution for the same "space". Since I spent over a decade on the enterprise/operational side, I can sympathize with the confusion this can bring to the enterprise audience. Leaders in the enterprise have many responsibilities, and it is difficult for them to keep track of the large number of vendors and what each vendor's specialty is.

Marketing is unlikely to change in the near future, and as such, it appears that the words "buyer beware" are important words for the enterprise. Many enterprises want to be doing "big data" and "security analytics", and thus, it's not particularly surprising that many vendors are offering "big data" and "security analytics" solutions. But what does it actually mean to do "big data" and "security analytics"? I think it's helpful to take a step back and think a level deeper about this in order to better understand it.

At a high level, "big data" and "security analytics" are about the two very different, but equally important concepts of collection and analysis. Allow me to explain. Before it is possible to run analytics, one needs the right data upon which to run those analytics. Before "big data" emerged as a buzzword, this was called "collection" or "instrumentation of the network". Further, in order to run analytics, one also needs a high performance platform upon which to issue the precise, targeted, incisive queries required by analytics. Before "security analytics" emerged as a buzzword, this was sometimes called analysis or forensics, among other terms. Collection and analysis, at enterprise speeds, are both equally important. If you think about it, you can't really have one without the other. Or, to put it another way, what good does the greatest collection capability provide without a way to analyze that data in a timely and accurate manner? Similarly, what good does the greatest analytical capability provide without the underlying data to support it?

As I walked around the expo floor, two families of "big data security analytics" products jumped out at me:

1) Analysis platforms that struggle with collection/consumption of data
2) Collection platforms that struggle with the analysis component (either because of performance, analytical capability, or both)

So, what about a platform that can do both collection and analysis at enterprise speeds? That's what I call a real "big data security analytics" platform -- one that lives up to the intent and spirit of the marketing buzzwords. Think about the ramifications of a single platform that provides excellent collection and excellent analysis. That's a great way to bring "big data security analytics" to your organization with reduced complexity and at a reduced cost.

If you're going to do "big data", it's worth thinking about how to do it right.

Friday, March 7, 2014

Measuring Security Intelligence Value

There was recently a discussion around measuring the value of a security intelligence program in the Twitterverse. Several well-known security experts took part in the discussion, and it was quite interesting to see everyone's thoughts on the subject. Further, this is a discussion that I am hearing more and more in the security operations space, and rightfully so. Security intelligence is a complex topic that requires more elaboration than Twitter's character limit allows for, and in fact, it requires more elaboration than I can realistically put into a blog posting. That being said, I will give it a shot. I have a large amount of operational experience in this area, so if you'll indulge me, I'll provide some thoughts on the topic here in this blog posting.

Generating security intelligence data (i.e., functioning as a source of intelligence) is an interesting topic, but not one that I will discuss in this post. Threat assessment is another fascinating topic, but also not one that I will discuss in this post. Additionally, intelligence sharing is also a hot topic, but again, not one that I will discuss in this post. Instead, I will focus on security intelligence as it relates to defending a large network in the context of a broader security operations program (i.e., functioning as a consumer of intelligence). In this context, at a high level, security intelligence involves consuming a piece of information (e.g., domain name, URL pattern, file name, MD5 hash, etc.), along with some context (e.g., exploit site, callback domain, drop site, malicious attachment MD5, etc.), and subsequently leveraging that information, in the right context, against host data and/or network traffic data.

Over the course of my career, I have seen security intelligence programs that work well, along with those that do not work as well. Before I get into a discussion of metrics around security intelligence programs, here are a few observations relating to challenges that organizations often encounter when implementing a security intelligence program:

Information lacks context: The best information in the world is useless unless we know in which context to use it -- context is key. Remember, only information with the proper context can qualify as intelligence.
Confusion of quantity with quality: If we have 5,000 "malicious" domain names, but 4,995 of them generate almost entirely false positives, that provides far less value than 10 reliable, high fidelity malicious domain names. Not only does the first example detect fewer true positives, but the volume of false positives overwhelms the work queue to the detriment of security operations.
Lack of indicator reliability and fidelity: It is important to vet indicators and sources before they are introduced into the alerting queue and workflow. Failure to do this properly can result in an overwhelming volume of false positives that dominate the work queue and squander valuable analyst cycles.
Lack of appropriate data: The best intelligence in the world is useless if we can't rapidly search for it across large quantities of host and network data covering long periods of time.
Improper tracking of intelligence and sources: This leads well into the metrics discussion -- it is extremely important to warehouse and track intelligence and its sources with enough granularity to enable metrics and measurement.
Lack of integration with the workflow: If leveraging security intelligence is a pain, analysts will do it less, or they won't do it at all. This is a workflow/efficiency issue that can come at a great cost to an organization's overall security posture.

Measuring the value of a security intelligence program can be a difficult task. I'm sure there are many ways to approach the challenge. As mentioned above, proper warehousing and tracking of intelligence and its sources is a necessary precursor. Assuming that is in place, here are a few measurement approaches that I have found helpful over the course of my career:

Overall number of incidents/percentage of incidents identified via security intelligence vs. identified via other means.
Percentage of false positives per intelligence source.
Percentage of overall false positives resulting from security intelligence.
Percentage of incidents identified through security intelligence per intelligence source (also known as percentage of true positives per intelligence source).
Percentage of overall true positives resulting from security intelligence.
Mean time to detection (should decrease as security intelligence program matures).
Number of long-time (i.e., long undetected) intrusions uncovered via security intelligence.
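
Several of the percentages above fall out of a simple tally once each alert is tagged with the intelligence source that produced it and the analyst's final disposition. The following is a minimal sketch under that assumption; the source names, record layout, and function name are hypothetical.

```python
from collections import defaultdict


def intel_source_metrics(alerts):
    """Compute true/false positive percentages per intelligence source.

    alerts: iterable of (source, is_true_positive) tuples, where source is
    a hypothetical feed name and is_true_positive is the analyst's verdict.
    """
    totals = defaultdict(lambda: {"tp": 0, "fp": 0})
    for source, is_true_positive in alerts:
        totals[source]["tp" if is_true_positive else "fp"] += 1

    report = {}
    for source, c in totals.items():
        n = c["tp"] + c["fp"]
        report[source] = {
            "alerts": n,
            "true_positive_pct": 100.0 * c["tp"] / n,
            "false_positive_pct": 100.0 * c["fp"] / n,
        }
    return report


# Made-up dispositions for two hypothetical feeds.
sample = ([("feed_a", True)] * 8 + [("feed_a", False)] * 2
          + [("feed_b", True)] * 1 + [("feed_b", False)] * 99)
for source, metrics in sorted(intel_source_metrics(sample).items()):
    print(source, metrics)
```

A report like this makes the quantity-versus-quality problem visible: in the made-up data, the second feed generates ten times the alert volume of the first while contributing a single true positive.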

This is not an exhaustive list, but it does provide a few metrics that I have found helpful for measuring and showing the value of a security intelligence program. Although it's implied here, it's perhaps worth stating explicitly that proper instrumentation of the network, for both collection and analysis, is critical here. If an organization does not have total visibility into the network traffic and host data, along with the ability to incisively query that data rapidly, that organization will not be particularly successful in building a security intelligence program or in measuring it.

A security intelligence program is a great thing, and every large organization should have one. It pays to consider how to make your security intelligence program the best one it can be.

Thursday, March 6, 2014

It's All About the Workflow

In a previous blog post entitled "The Scarcest Resource", I discussed how, of all the resources necessary for security operations and incident response, human analyst cycles are the most scarce. Recently, HP echoed the same sentiment in a report entitled "State of Security Operations" (https://ssl.www8.hp.com/ww/en/secure/pdf/4aa5-0501enw.pdf). The following quote from that report is particularly poignant:

"In SOCs, this results in minimal investment in the most expensive CPU in the room: the analyst."

The issue is clear, but what can an organization do to address it? There are many possible approaches one could take here, but I would like to discuss one of my favorites: workflow. Workflow is a concept that, in my experience, has the greatest return on investment for security operations when implemented correctly. With the volume, velocity, and variety of data coming at an analyst these days, it's more important than ever to focus the analyst via a single, unified work queue containing actionable, high fidelity items. Further, it's crucial that the analyst be able to perform all necessary analysis, investigation, and pivots and work each item to resolution from within the workflow. Let's have a look at what this workflow might look like and how each step of it corresponds to the incident response process:

On a continual basis, intelligent alerting content is developed across all sensing and instrumentation platforms using incisive, precise, targeted, finely-tuned queries designed to extract reliable, actionable, high fidelity events from the vast quantity of data. These events are the items that populate the work queue. This corresponds to the detection stage of the incident response process.
Working through the items in the work queue, analysts investigate each one, pivoting into and out of relevant platforms as appropriate to support the investigation. All investigation is documented within the work queue, and once analysis is complete, the analyst draws a conclusion about what has occurred. This corresponds to the analysis stage of the incident response process.
The analyst then proceeds through the containment, remediation, and recovery stages of the incident response process, pivoting into and out of relevant supporting systems as necessary. The stages are guided by the conclusions drawn during the analysis stage.
Lessons learned are gathered and documented, and detection techniques are improved accordingly. This completes the incident response process and provides a virtuous feedback loop as an added bonus.

It's interesting to note that this workflow is incredibly reliant on the population of the work queue with a sensible volume of reliable, actionable, high fidelity events. This requires sensing and instrumentation platforms designed to support incisive, precise, targeted, finely-tuned queries to extract the most relevant events, while minimizing false positives. I can't emphasize enough how critical this is to the operational workflow as a whole.

It's not easy to master your organization's workflow, but in my experience, it is the single greatest return on investment one can gain organizationally. How do you workflow?

Wednesday, March 5, 2014

The Forgotten Servers

In the enterprise, there is often a separation between network segments containing endpoint/workstation systems (e.g., laptops), network segments containing internal-facing servers (e.g., Exchange), and network segments containing external-facing servers (e.g., web servers). Fundamentally, this makes a lot of sense. Each of these segments serves a very different purpose, and as such, we would expect the traffic transiting each segment to behave differently. Further, each segment should have its own controls that permit traffic necessary for business operations, while denying traffic not befitting of that particular segment.

Although enterprises separate the various types of assets reasonably well, detection and alerting are predominantly focused on network segments containing endpoint/workstation systems. This is for several reasons, but primary among them are:

Identifying compromised/infected endpoint/workstation systems is relatively well understood and fairly mature, while identifying compromised/infected server systems is less well understood and not particularly mature.
Network segments containing servers, and particularly external-facing servers, are generally less well instrumented than network segments containing endpoint/workstation systems.

It is true that server compromises happen less often than endpoint/workstation compromises. But, it is also true that when server compromises do happen, they are often far more serious and consume far more incident response resources than endpoint/workstation compromises. Server compromises have the potential to lead to additional intrusions, data loss, theft of intellectual property, fraudulent activity, and other malicious activity. Furthermore, server compromises tend to go undetected for long periods of time, mainly because of the two reasons I outlined above.

So, given the risk, and the continued evidence that server compromises lead to bad things, it's a wonder enterprises don't study their server network data more closely. I would recommend two initial steps here, based on my own experience monitoring server networks:

Ensure the server network segments are properly instrumented, as it is difficult to monitor network segments for which data collection is incomplete/inadequate.
Dedicate some well-trained, highly-skilled analyst cycles to study the traffic on the server network segments. When reliable, high fidelity approaches are discovered, they can be automated as appropriate.

On server network segments, the stakes are high. So why is it that enterprises almost never pay them the attention they are due?

Tuesday, March 4, 2014

Security as a Line Item

The world of security operations and incident response has traditionally been the bailiwick of governments and large enterprises. The reasons for this are fairly straightforward. Security operations and incident response are relatively resource-intensive undertakings, and large organizations have the ability to bring the necessary people, process, and technology to the table. Many small and medium-sized businesses understand the threat and see the need to perform security operations and incident response, but they do not have the necessary resources available to do so.

As we all know, attackers do not limit themselves to governments and large enterprises. While it may be true that the most prized targets are located within large organizations, small and medium-sized businesses also offer a lucrative bounty for the attacker. But how can small and medium-sized businesses practice security operations and incident response given their resource limitations? I believe that the move to the cloud plays a critical role in the solution.

Small and medium-sized businesses often outsource HR, benefits, IT, and other critical business functions to benefit from the economies of scale afforded by outsourcing. Those same organizations can also outsource security operations and incident response to leverage the same economies of scale. In other words, for certain organizations, security can be thought of as a line item on the menu of services they purchase from the cloud. Small and medium-sized businesses cannot dedicate their own people, process, and technology to security functions, but they can purchase access to a cloud provider's people, process, and technology to meet their business needs and security goals. In fact, this is already starting to happen, and the model seems to be a good one.

For cloud providers looking to sell their people, process, and technology, it is important to think about how you will differentiate yourselves and persuade your customers to choose you over another provider. Are your people adequately trained, do they have the necessary skills, and are they trustworthy? Is your process organized, well-documented, timely, accurate, and does it follow industry best-practices and guidance? Does your technology support your operational workflow, does it scale to modern speeds and data volumes, and does it enable you to exploit the value of the data you possess?

For small and medium-sized businesses looking to improve security via a line item, it is important to understand what you are buying. Ask to meet the people who will be reviewing your data. Ask them questions based on your priorities and business needs to understand how they think and what their world view is. Ask to review the provider's processes and understand how they will respond when an incident hits. Ask the provider what technology they use, how it scales under load and volume, and what unique capabilities that technology brings them over their competitors. Be a tough customer -- after all, it is important to remember that you can manage risk, but you cannot eliminate it.

Security as a line item is coming, and in fact, it is already here. Those that understand the value of the cloud to small and medium-sized businesses will be able to capitalize on this, while at the same time protecting a segment of the market that has traditionally been under-served. Likewise, small and medium-sized businesses that are choosy about where they outsource will do better than those that are not.

Do you see the clouds forming on the horizon?