• Dejan Nenov

Semantic Technology for Cyber Security Asset Classification

Before discussing the applicability of Semantic Technology (i.e., the interesting part of what I would like to share), I will take a moment to frame the problem of asset classification in the cyber-security context that I care about today.

Asset classification is a cornerstone of managing cybersecurity in any non-trivial-sized organization. It is a conceptually simple, common-sense requirement: we must understand the IT assets that we use in order to secure them.

Asset classification starts with asset discovery - the process of assembling the list of the organization's physical, virtual and cloud-hosted servers, desktops, laptops, mobile devices, networking equipment, IoT, OT (Operational Technology) systems, and industrial control devices. We then classify each asset by the value and sensitivity of the information it contains, by the location, line of business, and department that owns it, by the technical department that is responsible for operating and maintaining it, and possibly by the cost and revenue accounts it impacts. Most importantly, we want to know if an asset must be considered in the context of one or more regulatory or industry compliance frameworks like PCI, NERC CIP, SOX, SOC-2, etc. The ultimate purpose of the asset classification exercise is to acquire a quantitative measure of the inherent risk associated with each asset.

The reason we do asset classification is that it helps us choose where to apply limited resources for maximum impact. It is first and foremost a business process that tries to limit the scope and cost of the work we need to perform in order to operate the IT and OT technology that the business runs on. In reality, it is not possible to make every system we operate perfectly secure. The complexity of the software we use and the deployments we operate is such that we cannot anticipate all potential weaknesses and vulnerabilities, or fully account for the human factors involved. Even if we tried to do so, the cost and time required would be untenable. A useful analogy is to say that instead of trying to build a "fireproof vault big enough to fit the whole house inside", we have to pick the right size vault, for the right price, and choose a room to put it in. Asset classification helps us make sure that we do not lock up the janitorial supplies while leaving the keys to the delivery vans hanging on the wall by the front door.

I do not know of a single organization that has a unified "master" system to provide a truly complete and current database of assets. This is not because such software does not exist or is not deployed, but because the data it needs has a bad case of GIGO (Garbage-In = Garbage-Out). Some companies do a great job managing and supporting Windows desktop and server infrastructure, but it is rarely the case that they are equally adept or motivated to handle macOS, Linux, legacy Unix, SCO, Novell, and the older and more obscure OSes. The management of network infrastructure is often segregated in its own department, making routers, switches, access points, and telecom equipment their own fiefdom. It is even harder to get a handle on often air-gapped OT and industrial control systems. In real-world environments, asset discovery relies on multiple, asynchronous, and overlapping scan and discovery processes, each of which delivers a partial view and a subset of the information we want. Recently, as part of a proof of concept, we were asked to make use of data from more than 10 different systems - multiple Qualys vulnerability scans, CMDB, LANDesk and ServiceNow databases, PI Historian, a Tufin firewall management installation, and an RSA Archer deployment.
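
To make the "partial view" problem concrete, here is a minimal sketch of folding overlapping records from several discovery tools into one record per device. The source names, field names, and sample values are invented for illustration, and keying on a normalized MAC address is an assumption; it is not a description of how any of the products above export their data.

```python
from collections import defaultdict

# Hypothetical partial records, as three different tools might report them.
# Field names and values are invented for illustration.
SCAN_RECORDS = [
    {"source": "vuln_scan", "mac": "00:1A:2B:3C:4D:5E", "ip": "10.1.4.20", "os": "Windows Server 2016"},
    {"source": "cmdb", "mac": "00-1a-2b-3c-4d-5e", "location": "Orlando", "line_of_business": None},
    {"source": "network", "mac": "00:1a:2b:3c:4d:5f", "ip": "10.1.4.21"},
]


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC address and use ':' separators so sources agree on the key."""
    return mac.lower().replace("-", ":")


def merge_records(records):
    """Fold partial records into one dict per MAC; the first non-empty value for a field wins."""
    assets = defaultdict(dict)
    for record in records:
        asset = assets[normalize_mac(record["mac"])]
        for field, value in record.items():
            # Skip bookkeeping fields and empty values so they never mask real data.
            if field in ("source", "mac") or value in (None, ""):
                continue
            asset.setdefault(field, value)
    return dict(assets)


merged = merge_records(SCAN_RECORDS)
for mac, fields in sorted(merged.items()):
    print(mac, fields)
```

Even in this toy case the merged view stays uneven: one device ends up with IP, OS, and location, while the other has nothing but an IP address.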

After ingesting data from all of these, we still did not have all the data we wanted - we had to add a NeDi installation to handle non-firewall networking equipment, as well as get access to Active Directory and LDAP in order to discover devices that come in and out of the network (laptops and mobile devices) and ones that turn on and off (manually or through the power and sleep settings). We created a great big set of somewhat standardized and fairly comprehensive data, but with very little actionable information. We had created a data lake of asset records - some with lots of fields telling us hardware, OS, location, and user information, others with only IP and MAC addresses.

What we were actually asked to do was provide an across-the-organization view of assets by Location, Line of Business, Operating System, Technical Support Department, Data Center, Public and Hybrid Cloud, Device Type, Virtualization, User Group, Department, Subsidiary, Business System, NERC-CIP classification and a few other odds and ends. Ultimately, the goal of the exercise was to feed this data into a threat simulation model in order to derive an inherent risk valuation of the assets and tell the organization what should be patched first and which vulnerabilities should be remediated as a top priority. It appeared as if we had all the data we needed, yet every time we produced a report, our largest group-by value, regardless of data dimension, ended up being "Unknown" or "Unavailable". Only half of our asset records identified a Line of Business, 30% had no OS information, only 10% were associated with a subsidiary, etc.
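
The reporting problem is easy to reproduce on a small scale. The sketch below groups a handful of hypothetical asset records by one dimension at a time and measures how many records carry a usable value. The records, field names, and the `group_by` and `coverage` helpers are all assumptions made for illustration; they are not the project's actual data or tooling.

```python
from collections import Counter

# Hypothetical merged asset records; missing keys stand for fields no source supplied.
ASSETS = [
    {"ip": "10.1.4.20", "os": "Windows Server 2016", "line_of_business": "Retail"},
    {"ip": "10.1.4.21", "os": "Linux"},
    {"ip": "10.1.4.22"},
    {"ip": "10.1.4.23", "line_of_business": "Retail"},
]


def group_by(assets, dimension):
    """Count assets per value of one dimension, bucketing missing values as 'Unknown'."""
    return Counter(asset.get(dimension) or "Unknown" for asset in assets)


def coverage(assets, dimension):
    """Fraction of assets that carry a usable value for the dimension."""
    known = sum(1 for asset in assets if asset.get(dimension))
    return known / len(assets) if assets else 0.0


for dim in ("os", "line_of_business", "subsidiary"):
    counts = group_by(ASSETS, dim)
    print(f"{dim}: coverage={coverage(ASSETS, dim):.0%} groups={dict(counts)}")
```

A per-dimension coverage number like this is a cheap way to see which classifications need help before any risk model is fed with the data.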

At the same time, we received comments and questions like:

  • "These (computers) are obviously at the Orlando site - just look at the location of the switch they are connected to.",

  • "What do you mean you do not know what subsidiary these belong to - only XYZ Pty Ltd has a Tidewater Power Generation line of business.",

  • "Why can't these be automatically classified as in scope for NERC-CIP - they are connected to the devices listed in the GE Historian data - just look at the subnets.",

  • "Of course these servers are CRM related - look at the user accounts on them - every user in the sales and marketing groups in AD has an account on them!",

  • and "Of course these are part of the AWS hybrid cloud - just look at the ASN Number!"

What is required here is a system that can infer information from other known facts - and the inference capabilities of semantic technology are a perfect fit to address this problem. This is not a novel concept. Ontologies describing cybersecurity and asset management domains do exist. For example, see AURUM.
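
As a rough illustration of the kind of inference the comments above call for, here is a minimal sketch in plain Python. It is not an ontology reasoner and does not use AURUM or any semantic-web library; it stores facts as subject-predicate-object triples and applies two hand-written rules until nothing new can be derived. Every identifier, predicate name, and rule is hypothetical. Note also that the "only one subsidiary has this line of business" rule treats the known facts as complete, which is an assumption a real deployment would have to justify.

```python
# Facts are (subject, predicate, object) triples. All identifiers are hypothetical.
FACTS = {
    ("switch-7", "locatedAt", "Orlando"),
    ("pc-101", "connectedTo", "switch-7"),
    ("pc-102", "connectedTo", "switch-7"),
    ("pc-101", "lineOfBusiness", "Tidewater Power Generation"),
    ("XYZ Pty Ltd", "operates", "Tidewater Power Generation"),
}


def rule_location_from_switch(facts):
    """A device connected to a switch is located where the switch is."""
    for device, p1, switch in facts:
        if p1 != "connectedTo":
            continue
        for s, p2, site in facts:
            if s == switch and p2 == "locatedAt":
                yield (device, "locatedAt", site)


def rule_subsidiary_from_lob(facts):
    """If exactly one subsidiary operates a line of business, its assets belong to it."""
    operators = {}
    for subsidiary, p, lob in facts:
        if p == "operates":
            operators.setdefault(lob, set()).add(subsidiary)
    for asset, p, lob in facts:
        if p == "lineOfBusiness" and len(operators.get(lob, ())) == 1:
            (subsidiary,) = operators[lob]
            yield (asset, "belongsTo", subsidiary)


def infer(facts, rules):
    """Apply every rule repeatedly until no new triples appear (naive forward chaining)."""
    known = set(facts)
    while True:
        new = {t for rule in rules for t in rule(known)} - known
        if not new:
            return known
        known |= new


closure = infer(FACTS, [rule_location_from_switch, rule_subsidiary_from_lob])
for triple in sorted(closure - FACTS):
    print(triple)
```

With these sample facts, both PCs pick up the Orlando location from their switch, and pc-101 is assigned to XYZ Pty Ltd through its line of business. An ontology with a proper reasoner expresses such rules declaratively rather than as code, which is the appeal of the semantic approach.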