
Implementing perimeter security scans on cloud infrastructure

Vytautas Paulauskas

Uncover cloud infrastructure security issues with perimeter scans

At Devbridge, we have multiple engineering teams developing and deploying various applications. The majority of the applications are cloud-based and use Microsoft Azure or Amazon Web Services (AWS). To enable our teams' autonomy and ability to deliver fast, the DevOps team, when requested, creates a new resource group (RG) on the corresponding cloud platform and grants ownership of the resource group to the requesting team. By enforcing unified naming rules and tagging, the DevOps team is then able to centrally monitor cloud costs, identify RG owners, modify resources, and delete obsolete or unused RGs.

On the other hand, giving a team full autonomy within a resource group means that the team might introduce inadvertent security risks. For example, a misconfigured PaaS service may accept connections from an outside network when that was not the intent. Depending on the type of platform service (cache, document DB, blob storage, or similar), data exposure carries different risks. Furthermore, the web applications or APIs themselves may have security-related errors, configuration flaws, or an insufficient (or superfluous) logging and monitoring setup.

Each team is aware of secure development guidelines, has static code analysis tools in place, and every application eventually goes through an internal penetration testing stage. However, there is still a chance that a vulnerable application is deployed and an undisclosed flaw goes unattended for several days or weeks. For this reason, we implemented perimeter security scans on our cloud infrastructure. This blog article describes our approach for perimeter scans, defines which open-source tools we use, and covers the benefits of these tools and tactics. The information outlined helps familiarize devs with common open-source tools and shows how to create a cloud infrastructure security scan pipeline.

The requirements

We had the following requirements for perimeter scans:

  • Run scans on a custom schedule.
  • Dynamically resolve service instances deployed on AWS EC2, Azure VMs, and Azure Cloud Services.
  • Upload reports to a selected vulnerability management tool.
  • Use instant notifications for new issues, preferably via Slack.
  • Manually remediate or mark any false positive issues in the vulnerability management tool UI.
  • Set custom remediation SLAs (e.g., 7 days for critical issues, 30 days for high issues, 90 days for medium issues).
  • Ensure that issues are unique by automatically resolving duplicates.
  • Use open-source tools.
  • As the scans themselves run on cloud instances, they need to be cost-effective with resource usage.

To further illustrate how perimeter scans help organizations keep their infrastructure secure, let’s look at an example of a typical business-as-usual (BAU) week with perimeter scans in place.

Assume the baseline (initial) scans do not find any issues and the current infrastructure on AWS and Azure is secure.

July 2nd: A perimeter scan discovers several new issues on AWS.

  • Some of these issues are caused by a misconfigured logs directory belonging to one of the development teams, on both the DEV and QA environments.
  • Two more issues related to the internal PostgreSQL management panel are also reported: the scan finds that the Strict-Transport-Security header is not present on pgAdmin and that the SSH port is open.
  • As the development team receives immediate feedback after the scan, they fix the logs directory permissions and redeploy to both environments. The internal IT Operations team investigates issue #3 and finds that the PostgreSQL management panel doesn’t have SSL enabled, which is an easy fix. Once the configuration change is applied to this particular EC2 instance, the management panel web page is served over SSL, and the browser receives the Strict-Transport-Security header.

July 3rd: Internal IT Ops investigates issue #4.

  • For obvious reasons, SSH access to the PostgreSQL management panel's Linux box needs to remain intact, so the team restricts access to several internal IP addresses.
  • They mark issue #4 as "false positive" to prevent it from being reported again.

July 4th: The infrastructure returns to a green state!

July 5th: Another development team deploys a prototype website on Azure.

  • Due to the default settings of the ASP.NET MVC template, the site does not have correct Cross-Site Request Forgery (CSRF) protection on an anonymous "contact us" form.
  • The HTTP TRACE verb exposes the IIS version, which should be avoided on public websites.
  • The development team gets notified of these issues and fixes them within an agreed SLA of 7 days.

The tooling

Python is our tool of choice for scripts that enumerate resources on the AWS and Azure cloud platforms. We use the boto3 pip package to enumerate public IPs on AWS. The following code sample illustrates our approach:

"""Collect public IPs exposed by AWS subscription."""

import os
import sys
import boto3

from botocore.exceptions import ClientError

aws_credentials = {
    "aws_key": os.environ["ENV_AWS_KEY"],
    "aws_secret": os.environ["ENV_AWS_SECRET"],
}

aws_subscription = os.environ["ENV_AWS_SUBSCRIPTION"]

def err_out(err_msg, err_code=1):
    """Print an error message and exit with the given code."""
    if err_code == 1:
        print('ERROR: {}'.format(err_msg), file=sys.stderr)
    sys.exit(err_code)

def initialize_boto3_client(aws_credentials):
    """Initialize boto3 client with given credentials."""

    try:
        client = boto3.client(
            'ec2',
            aws_access_key_id=aws_credentials["aws_key"],
            aws_secret_access_key=aws_credentials["aws_secret"],
            region_name='us-east-1',
        )

        return client

    except ClientError:
        err_out("failed to initialize boto3 client")

def get_aws_public_ips(b3_client, aws_subscription):
    """Get AWS public ips."""

    try:
        pips = []

        regions = [region['RegionName'] for region in b3_client.describe_regions()['Regions']]

        for region in regions:
            ec2 = boto3.client(
                'ec2',
                aws_access_key_id=aws_credentials["aws_key"],
                aws_secret_access_key=aws_credentials["aws_secret"],
                region_name=region,
            )

            try:
                elasticIps = ec2.describe_addresses()

            except ClientError as ce:
                err_out("{subscription} -> {region} -> failed to get network interface data: {exception}".format(subscription=aws_subscription, region=region, exception=ce))

            if len(elasticIps["Addresses"]) > 0:
                for elasticIp in elasticIps["Addresses"]:
                    pips.append(elasticIp["PublicIp"])

        return pips

    except ClientError as ce:
        err_out("failed to get network interface data: {exception}".format(exception=ce))

if __name__ == "__main__":
    pips = []
    b3_client = initialize_boto3_client(aws_credentials)
    pips = get_aws_public_ips(b3_client, aws_subscription)
    print('\n'.join(pips))
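
The script prints one public IP per line, so its output can simply be redirected to a file that the later Nmap stage consumes. A hypothetical invocation is shown below; the script name, output path, and credential values are placeholders, and in practice the credentials are injected by the CI server's secret store:

# Credentials come from the CI secret store in practice
export ENV_AWS_KEY="..." ENV_AWS_SECRET="..." ENV_AWS_SUBSCRIPTION="..."
python collect_aws_ips.py > targets/aws_ips.txt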

For Microsoft Azure, we developed two Python scripts. The first one interacts with the Azure Classic ServiceManagementService via the azure-servicemanagement-legacy package:

"""Collect public IPs from Azure Classic VMs."""

import os

from azure.servicemanagement import ServiceManagementService

azure_cert_file = os.environ["ENV_AZURE_CERT_FILE"]
azure_subscription = os.environ["ENV_AZURE_SUBSCRIPTION"]

sms = ServiceManagementService(azure_subscription, azure_cert_file)
services = sms.list_hosted_services()
for service in services:
    props = sms.get_hosted_service_properties(service.service_name, True)
    for d in props.deployments:
        for ip in d.virtual_ips.virtual_ips:
            print(ip.address)

The second Python script interacts with the modern ResourceManagementClient to list all running websites and their custom domain names, if applicable. The script depends on the azure-common, azure-mgmt-network, azure-mgmt-resource, and azure-mgmt-web pip packages.

"""Collect public IPs and web app hostnames from Azure Resource Manager."""

import os
import sys
import msrestazure

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.web import WebSiteManagementClient

# One or more subscription IDs (comma-separated)
subscriptions = os.environ["ENV_AZURE_SUBSCRIPTION"].split(",")

credentials = ServicePrincipalCredentials(
    client_id=os.environ["ENV_AZURE_APPID"],
    secret=os.environ["ENV_AZURE_CLIENTSECRET"],
    tenant=os.environ["ENV_AZURE_TENANTID"]
)

pips = []

for sid in sorted(subscriptions):
    # Hostnames of all running web apps
    web_client = WebSiteManagementClient(credentials, sid)
    web_apps = web_client.web_apps.list()
    for webapp in filter(lambda app: app.state == 'Running', web_apps):
        pips.append(webapp.default_host_name)

    network_client = NetworkManagementClient(credentials, sid)
    resource_client = ResourceManagementClient(credentials, sid)

    # Public IP addresses from every resource group
    try:
        for resource_group in sorted(resource_client.resource_groups.list(), key=lambda x: x.name):
            for pip in sorted(network_client.public_ip_addresses.list(resource_group.name), key=lambda x: x.name):
                if pip.ip_address:
                    pips.append(pip.ip_address)

    except msrestazure.azure_exceptions.CloudError:
        print('Skipping subscription {}'.format(sid), file=sys.stderr)
        continue

print('\n'.join(pips))

You can read more about the two deployment models of Microsoft Azure in the Azure documentation. If you do not have any resources in the "classic deployment model," you may only need the second Python script, which interacts with ResourceManagementClient.

After obtaining the IP addresses and hostnames to check from the Python scripts, we used Nmap to scan these resources for open ports. The port scan results served two purposes:

  1. Any non-standard open port could be reported as an issue.
  2. Open ports 80 and 443 indicate that the particular resource has a web interface and needs further testing with the Arachni web application scanner (see below).

Our Bash script invoked an Nmap scan using standard Nmap command-line options and piped the output through grep to filter the scan results for further processing.
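
The article does not reproduce the exact Nmap invocation; the following is only a minimal sketch of what such a step could look like. The target and output file names (aws_ips.txt, http_targets.txt, https_targets.txt) are hypothetical, and the chosen flags are illustrative rather than the exact options from our pipeline:

#!/bin/sh
# Grepable TCP scan of every collected address: -Pn skips host discovery,
# --open reports only open ports, -oG - writes grepable output to stdout.
nmap -Pn -T4 --open -p 1-65535 -iL targets/aws_ips.txt -oG - \
	| grep "/open/" > nmap_results.txt

# Hosts with a web interface (port 80 or 443) go to separate files
# that feed the Arachni stage below.
grep " 80/open"  nmap_results.txt | awk '{print $2}' > http_targets.txt
grep " 443/open" nmap_results.txt | awk '{print $2}' > https_targets.txt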

After the Nmap scan results were redirected to separate files, we invoked the Arachni scanner on them via the following Bash script:

#!/bin/sh
current_arachni_path=${HOME}/arachni/bin
reports=$3

doScan() {
	targetName="$line:$1"
	target="http://$line"
	if [ "$1" -eq "443" ]
	then
		target="https://$line"
	fi
	
	$current_arachni_path/arachni $target --browser-cluster-pool-size=5 --timeout=0:10:00 \
		--http-request-concurrency=10 --scope-include-subdomains --scope-directory-depth-limit=2 \
		--scope-auto-redundant=2 --report-save-path="${reports}/${targetName}_anonymous.afr" >~/log
}

reportScan() {
	${current_arachni_path}/arachni_reporter "${reports}/${targetName}_anonymous.afr" \
		--reporter="json:outfile=${reports}/${targetName}_anonymous.json" >> ~/log
}

# Taking target addresses from files
for line in `cat $1`; do
	echo "Starting Arachni scan for: $line:80"
	(doScan 80 && reportScan)
done

for line in `cat $2`; do
	echo "Starting Arachni scan for: $line:443"
	(doScan 443 && reportScan)
done

Finally, the Arachni scan results were imported into our selected vulnerability management tool, DefectDojo.

DefectDojo exposes an API and provides a Python wrapper pip package, so creating another Python script and invoking it after the Arachni scan finishes was rather straightforward. The script was a slightly modified version of the CI/CD example on the defectdojo_api GitHub page (a rough sketch of the upload flow follows the list below), which:

  • initiated a connection to the DefectDojo API via a supplied username and API key.
  • navigated to a predefined product ID and checked whether an engagement called Perimeter Scan yyyy-MM-dd had already been created. If not, the engagement was created.
  • iterated over the ${reports} result directory from the previous Bash script and uploaded each Arachni result file.
  • closed the engagement via the API once the uploads completed.
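
The upload script itself is not included in the article. Purely as an illustration of the flow, here is a minimal sketch that talks to DefectDojo's v2 REST API directly with requests instead of the defectdojo_api wrapper we actually used; the environment variable names are hypothetical, and filter and field names may differ slightly between DefectDojo versions:

"""Sketch: upload Arachni JSON reports to a daily DefectDojo engagement."""

import datetime
import os

import requests

DOJO_URL = os.environ["ENV_DOJO_URL"]          # e.g., https://defectdojo.example.com
DOJO_TOKEN = os.environ["ENV_DOJO_APIKEY"]     # API key of a service account
PRODUCT_ID = int(os.environ["ENV_DOJO_PRODUCT"])
REPORTS_DIR = os.environ.get("ENV_REPORTS_DIR", "reports")

headers = {"Authorization": "Token {}".format(DOJO_TOKEN)}
today = datetime.date.today().isoformat()
engagement_name = "Perimeter Scan {}".format(today)

# 1. Reuse today's engagement if it already exists, otherwise create it
#    (assumes the 'name' filter is enabled on the engagements endpoint).
resp = requests.get("{}/api/v2/engagements/".format(DOJO_URL), headers=headers,
                    params={"product": PRODUCT_ID, "name": engagement_name})
resp.raise_for_status()
existing = resp.json()["results"]
if existing:
    engagement_id = existing[0]["id"]
else:
    resp = requests.post("{}/api/v2/engagements/".format(DOJO_URL), headers=headers,
                         data={"name": engagement_name, "product": PRODUCT_ID,
                               "status": "In Progress",
                               "target_start": today, "target_end": today})
    resp.raise_for_status()
    engagement_id = resp.json()["id"]

# 2. Upload every Arachni JSON report produced by the Bash script.
for report in os.listdir(REPORTS_DIR):
    if not report.endswith(".json"):
        continue
    with open(os.path.join(REPORTS_DIR, report), "rb") as fh:
        requests.post("{}/api/v2/import-scan/".format(DOJO_URL), headers=headers,
                      data={"engagement": engagement_id, "scan_type": "Arachni Scan",
                            "scan_date": today, "active": "true", "verified": "false"},
                      files={"file": fh}).raise_for_status()

# 3. Close the engagement once all uploads are done.
requests.patch("{}/api/v2/engagements/{}/".format(DOJO_URL, engagement_id),
               headers=headers, data={"status": "Completed"}).raise_for_status()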

Here is how the DefectDojo UI looked after several perimeter scans were completed. Each day's scan could be seen in the "Closed Engagements" section. The total number of findings (including medium and low findings) and the number of affected endpoints were shown on the tab headers.

A drill-down into any "Perimeter Scan" engagement revealed multiple Arachni scan uploads. Each upload corresponded to a particular IP address or hostname. The findings and duplicates counts acted as a quick way to assess each resource's health.

Clicking on a particular test (i.e., Arachni scan result) link showed unique issues and descriptions.

The DevSecOps approach: Integrating with CI/CD

Initially, our infrastructure perimeter scan existed just as an assorted collection of Python scripts using a dedicated Arachni service that was hosted internally. A Python script submitted a scan job via the REST API and then polled Arachni for scan completion to obtain the results. When we decided to automate the scan via our CI/CD server, Jenkins, we modified the approach to spin up a temporary EC2 instance, with Arachni and Nmap installed each time before the scan. Finally, we ended up creating a custom Docker image, which enabled us to quickly spin up a container to perform the requested scans.

Here is the Dockerfile:

FROM ruby:2.6-slim

# Add nmap and configure it
RUN apt-get update \
	&& apt-get install -y nmap libcap2-bin \
	&& setcap cap_net_raw+ep $(which nmap)

RUN apt-get install -y curl

# Use latest development build from Arachni nightlies
ENV ARACHNI_VERSION=arachni-2.0dev-1.0dev
RUN curl -sLO http://downloads.arachni-scanner.com/nightlies/${ARACHNI_VERSION}-linux-x86_64.tar.gz
RUN curl -sLO http://downloads.arachni-scanner.com/nightlies/${ARACHNI_VERSION}-linux-x86_64.tar.gz.sha512
RUN truncate -s -1 ${ARACHNI_VERSION}-linux-x86_64.tar.gz.sha512 \
	&& echo " ${ARACHNI_VERSION}-linux-x86_64.tar.gz" >> ${ARACHNI_VERSION}-linux-x86_64.tar.gz.sha512 \
	&& sha512sum -c "${ARACHNI_VERSION}-linux-x86_64.tar.gz.sha512"

RUN tar zxf ${ARACHNI_VERSION}-linux-x86_64.tar.gz
RUN mv ${ARACHNI_VERSION} arachni
ENV PATH $HOME/arachni/bin:$PATH

# Cleanup
RUN rm ${ARACHNI_VERSION}*.tar.*
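
For reference, a container built from this Dockerfile can be exercised roughly as follows; the image tag, mounted paths, and file names below are placeholders rather than the ones from our pipeline:

# Build the scanner image
docker build -t perimeter-scan .

# Run an Nmap port scan against the collected targets, mounting the
# target list and a reports directory from the host
docker run --rm \
	-v "$(pwd)/targets:/targets" \
	-v "$(pwd)/reports:/reports" \
	perimeter-scan \
	nmap -Pn --open -iL /targets/aws_ips.txt -oG /reports/nmap.gnmap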

The Jenkins pipeline file had the following six steps (a simplified sketch of the pipeline follows the list):

  1. Set up Python virtualenv.
  2. Run Python scripts to dynamically collect IPs from Azure Classic, Azure RM, and AWS subscriptions.
  3. Spin up the Docker image to perform the Nmap scan.
  4. Perform Arachni scans on any instances that have port 80 or 443 open.
  5. Collect the results. Alert in Slack if necessary.
  6. Publish artifact.
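
The actual Jenkinsfile is not reproduced here; a simplified declarative sketch of those six stages might look like the following, where the cron schedule, script names, and image tag are illustrative assumptions rather than our real configuration:

pipeline {
    agent any
    triggers { cron('H 2 * * *') }  // run the perimeter scan on a nightly schedule

    stages {
        stage('Set up Python virtualenv') {
            steps { sh 'python3 -m venv .venv && .venv/bin/pip install -r requirements.txt' }
        }
        stage('Collect IPs from Azure Classic, Azure RM, and AWS') {
            steps { sh '.venv/bin/python collect_aws_ips.py > targets/aws_ips.txt' }
        }
        stage('Nmap scan in Docker') {
            steps { sh 'docker run --rm -v "$PWD:/work" perimeter-scan /work/run_nmap.sh' }
        }
        stage('Arachni scans on ports 80/443') {
            steps { sh 'docker run --rm -v "$PWD:/work" perimeter-scan /work/run_arachni.sh http_targets.txt https_targets.txt reports' }
        }
        stage('Collect results and alert in Slack') {
            steps { sh '.venv/bin/python upload_to_defectdojo.py' }
        }
        stage('Publish artifacts') {
            steps { archiveArtifacts artifacts: 'reports/**', fingerprint: true }
        }
    }
}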

A screenshot of our pipeline execution can be seen below:

Note that in this example, we were still using EC2 instead of a dedicated Docker image.

Successful (green) and warning (yellow) Slack alerts can be seen below. We also set up git hooks to notify of new commits in the perimeter scans repository. These messages and repository access rights were, of course, limited to a selected group of maintainers.
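
Slack alerts of this kind are typically driven by an incoming webhook or a Jenkins Slack plugin; purely for illustration, a webhook call from a pipeline step could look like this, with a placeholder webhook URL and message text:

# Post a scan summary to the security channel via a Slack incoming webhook
curl -X POST -H 'Content-type: application/json' \
	--data '{"text": "Perimeter scan: 2 new findings uploaded to DefectDojo"}' \
	"https://hooks.slack.com/services/T0000/B0000/XXXXXXXX"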

The benefits

There are several benefits to running perimeter scans. A few notable ones are:

  • Increased visibility: As we're able to run a perimeter scan script on CI/CD for multiple environments, we now have an aggregate view of all issues collected in one place.
  • Immediate feedback: Slack alerts and the ability to drill down into particular issues via CI/CD, since Arachni scan results are also published as build artifacts.
  • Operational efficiency: Issue correlation (e.g., first seen on July 2nd) and SLA management allow us to define priorities and decide which issues need to be resolved first.

Sticking to open-source tools and stitching the whole process together with simple Bash and Python scripts provides an added benefit: we can swap out individual components in the future. For example, we know that Arachni is no longer maintained, but switching to an alternative scanner like Nikto (or any other scanner supported by DefectDojo) requires only trivial changes in one of the Python scripts. The current solution is also open to improvements, such as introducing new scanners to check for misconfigured SSH or database access vulnerabilities.

Some of the improvements likely coming in the future are to:

  • Implement naming convention checks for hostnames retrieved from Azure. The team would define a regex rule and report non-conforming hostnames to DefectDojo via the API.
  • Provide descriptive names for test uploads. The team would change the default "Arachni Scan" name to also show the hostname or IP address.
  • Speed up issue remediation. The team is experimenting with the reimport_scan DefectDojo API method to automatically close issues that have already been remediated.

Address cloud infrastructure security risks easily

At Devbridge, we empower engineering teams to manage cloud resource groups. At the same time, DevOps and Security teams want to be alerted of new risks in any of the resource group environments, such as development, staging, UAT, or production. Although some commercial tools are available, we utilize Python, Nmap, Arachni, DefectDojo, Docker, and Jenkins to build a lightweight infrastructure scan pipeline ourselves.

The custom infrastructure scan increases visibility and responsiveness. Additionally, the vulnerability management tool (DefectDojo) makes it easy to manage security-related issues in one place. The solution has gone through multiple iterations, and we plan to keep making minor improvements to scanning and visualization when needed. The whole concept is modular: adjustments can be made by introducing new components (e.g., a new scanner) or replacing an obsolete tool.

