Practice what you Preach

Last week I was in London, England for SeleniumConf where I gave a talk on test infrastructure. The feedback from the people I talked to seemed good, but I personally was uncomfortable on stage and felt it might have been my worst performance. I felt good about the content the morning of and had a few jokes, etc. planned, but once I got going the switching of windows (1xFirefox, 1xChrome, 1xKeynote, 1xTerminal, 1xSublime) completely threw me off my game and it was a downward spiral from there. It likely was too ambitious for a 40-minute slot and better suited to more of a hands-on, all-day workshop. Here is the sorta commentary track with extra links and such.

Most of my decks have a visual theme to them, but I couldn’t come up with anything. At one point I tried Rocky and Bullwinkle because I could say ‘Nothing up my sleeve’ and show Bullwinkle as a magician when showing an empty AWS account. But then I thought of ‘Practice what you Preach(er)’ and tried to shoehorn that in. But it’s /really/ hard to find appropriate things to include, so I ripped most of it out. And most of the talk was supposed to be code and/or architecture diagrams, so really any visual theme was a stretch. In the end, I left a couple Preacher things in, but it wasn’t an obvious thing and was lost on most, so I likely should have ripped them all out.

These really are the ‘rules’ of presenting. In general, your talk will be improved if you stick to them. The logic being: the audience knows who you are from the bio in the program, they want you to succeed so don’t sow the idea of failure into their minds, and things will inevitably go wrong (best to anticipate it and have a video of you doing a thing instead of doing it live).

“So let’s break some rules.”

Fell flat. And the Preacher reference felt forced. (It was.)

Breaking Rule 1. I really don’t care about scripts anymore. There has been tragically little innovation in the actual script creation and maintenance space. But what people don’t talk about is the business risk around where those scripts run. I don’t have any data to substantiate this claim, but my gut is that too many people are just spinning up EC2 or ECS instances to run their scripts without knowing the surrounding tooling needed to run them securely and efficiently.

Breaking Rule 2. I had such big plans for this talk, but I have been battling burnout for a year now. It’s been especially bad the last couple months, which is exactly when I needed to be prepping things for success. Which didn’t help, as burnout feeds off of burnout.

Burnout isn’t technically a clinical diagnosis, but I like this definition.

Thankfully there are organizations now specifically chartered to help tech people deal with their traitor brains. Such as https://osmihelp.org.

This is a ‘simple’ view of what a modern, self-hosted Selenium infrastructure in AWS could look like. I’m likely missing a few things, but it really hasn’t changed in the last 5 or 6 years. Selenium Grid 4.0 could make some interesting changes at scale as the Hub can be broken into 4 different services. Oh, and I don’t include Docker in here because I don’t believe in running scripts in environments your customers are not in. You are of course more than welcome to if enough of your customers are using headless browsers or Linux desktops. I’m also not current on how to set up Sauce Labs in a hybrid scenario (or even if they support that configuration anymore) with their tunnel product adding their cloud as available browser nodes into the Hub, which I always thought was a sweet spot for them.

Here is the conclusion of the ‘Do not start with an apology’ rule and the origin of the name. In Austin I rambled (even for me) about infrastructure and just threw a tonne of ideas and topics at the audience. In Chicago I used the Well-Architected Framework from AWS (https://aws.amazon.com/architecture/well-architected/) to organize all those ideas. It is by AWS and uses AWS product offerings as examples, but it really is cloud neutral at its core. There was a tweet a month or so ago about a team that used it to cut their Azure bill by something like 40% by applying its principles to their cloud infrastructure. So the plan for this talk was to open up a terminal, run terraform apply, do the rest of the talk, then have the full diagram from the previous slide created and run a test.

Yaaaaaa. About that. Remember burnout? Ends up I got maybe 1/5 of it done. And couldn’t run a test. So rather than the pitched ‘All code’ there was ‘Some code’.

Now we’re starting to get into Rule 3 territory about not running a live demo. And the tools I’m recommending these days to do it are the HashiCorp ones. I used to suggest using the cloud provider’s native ones (such as CloudFormation for AWS) but for the above reasons have switched. I of course reserve the right to change my mind again in the future.

Almost ready to build some infrastructure, live on stage, but first I have to talk about pre-conditions.

The ‘Root Account’ is the all powerful account with every permission possible. It is also the most dangerous one. The only things you should do with it are enable hardware MFA (https://aws.amazon.com/iam/features/mfa/?audit=2019q1 has links to purchase), create an ‘administrators’ group that has the ‘AdministratorAccess’ managed policy, and create an ‘IAM User’ in that group.

The ‘IAM User’ will have a virtual MFA token and access keys.
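
A rough boto3 sketch of that bootstrap, for reference (the user name is made up, and the hardware MFA on the root account itself is easier to finish in the console):

import boto3

iam = boto3.client('iam')

# administrators group with the AWS-managed AdministratorAccess policy
iam.create_group(GroupName='administrators')
iam.attach_group_policy(
    GroupName='administrators',
    PolicyArn='arn:aws:iam::aws:policy/AdministratorAccess',
)

# day-to-day IAM User that lives in that group instead of using root
iam.create_user(UserName='adam')  # hypothetical user name
iam.add_user_to_group(GroupName='administrators', UserName='adam')

# access keys for CLI / Terraform use; the virtual MFA device is simplest to
# enable in the console since it needs two consecutive token codes
keys = iam.create_access_key(UserName='adam')['AccessKey']
print(keys['AccessKeyId'])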

This is where juggling windows went crazy. Escape out of Keynote presenter mode to Firefox, which had the Root Account window, then to Safari, which had the IAM User window, then to the terminal to start Terraform and watch it scroll a bit until it applied the ‘everything must have MFA’ policy (as described at https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_iam_mfa-selfmanage.html), then things fail, I demo getting a token in the shell (by switching between the terminal and Sublime, which had the commands I wanted), and I finish running Terraform.
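
The ‘getting a token in the shell’ part boils down to a single STS call; a minimal sketch (the MFA serial number and token code are placeholders):

import boto3

sts = boto3.client('sts')

# trade an MFA token code for temporary credentials that satisfy the
# 'everything must have MFA' policy
creds = sts.get_session_token(
    DurationSeconds=3600,
    SerialNumber='arn:aws:iam::123456789012:mfa/adam',
    TokenCode='123456',
)['Credentials']

# export these (or drop them into a profile) before re-running terraform apply
print(creds['AccessKeyId'], creds['SecretAccessKey'], creds['SessionToken'])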

The network held, and things applied without a problem. But it was here that I realized window switching wasn’t going to work, so I had to adjust on the fly.

One of the first pieces of infrastructure to be created needs to be the networking layer itself. I strongly believe that AWS’ VPC Scenario 2 (https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenarios.html) is the right one for almost all Selenium deployments. Basically, everything of importance is in the ‘private’ (blue) subnet and is not reachable from the internet, and the only things that can reach it are a bastion host or a VPN server sitting in the ‘public’ (green) subnet. You still have to keep things in the private subnet patched, etc., but there is a level of risk mitigation achieved when the bad guys (and gals) cannot even access the machines.
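
If you prefer API calls to diagrams, the bones of Scenario 2 look roughly like this in boto3 (CIDRs are made up, and a real script would wait for the NAT gateway to become available before adding its route):

import boto3

ec2 = boto3.client('ec2')

# VPC with one public and one private subnet
vpc_id = ec2.create_vpc(CidrBlock='10.0.0.0/16')['Vpc']['VpcId']
public_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock='10.0.0.0/24')['Subnet']['SubnetId']
private_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock='10.0.1.0/24')['Subnet']['SubnetId']

# internet gateway so the public subnet can reach (and be reached from) the internet
igw_id = ec2.create_internet_gateway()['InternetGateway']['InternetGatewayId']
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

# NAT gateway in the public subnet gives the private subnet outbound-only access
eip = ec2.allocate_address(Domain='vpc')
nat_id = ec2.create_nat_gateway(SubnetId=public_id, AllocationId=eip['AllocationId'])['NatGateway']['NatGatewayId']

# public route table sends 0.0.0.0/0 to the internet gateway
public_rt = ec2.create_route_table(VpcId=vpc_id)['RouteTable']['RouteTableId']
ec2.create_route(RouteTableId=public_rt, DestinationCidrBlock='0.0.0.0/0', GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=public_rt, SubnetId=public_id)

# private route table sends 0.0.0.0/0 to the NAT gateway (wait for it to be 'available' first)
private_rt = ec2.create_route_table(VpcId=vpc_id)['RouteTable']['RouteTableId']
ec2.create_route(RouteTableId=private_rt, DestinationCidrBlock='0.0.0.0/0', NatGatewayId=nat_id)
ec2.associate_route_table(RouteTableId=private_rt, SubnetId=private_id)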

I gave the bastion host an Elastic IP (static), likely should have also registered it with Route 53 (AWS’ DNS service), and tried to SSH into it. But it didn’t work — on purpose. That is because I created an empty security group with Terraform and attached it to the bastion host. Using the AWS CLI I added just my IP as an ingress rule, so now one of the instances I created was accessible from the internet, but only from a specific IP and certificate. It also let me demonstrate how to have Terraform ignore certain changes to prevent it from rebuilding parts of your infrastructure and kicking you out mid-change. (There’s a lesson learned the hard way…)
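
The same ingress rule with boto3 instead of the AWS CLI would look something like this (the security group id is a placeholder, and checkip.amazonaws.com is just one way to find your current IP):

import urllib.request

import boto3

ec2 = boto3.client('ec2')

# whatever this laptop's public IP happens to be right now
my_ip = urllib.request.urlopen('https://checkip.amazonaws.com').read().decode().strip()

# open SSH on the otherwise-empty bastion security group to just that /32
ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',  # hypothetical bastion security group id
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 22,
        'ToPort': 22,
        'IpRanges': [{'CidrIp': my_ip + '/32', 'Description': 'conference wifi'}],
    }],
)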

This slide talks about how the bastion (and other) instances were produced by another part of the HashiCorp suite, Packer. No demo, but the idea of ‘Cattle vs Pets’ for your instances was brought up again. My current Packer setup uses the puppet-masterless (https://www.packer.io/docs/provisioners/puppet-masterless.html) provisioner, but I would consider switching to Ansible in the future as AWS just announced that Systems Manager can run Ansible playbooks directly from S3 or GitHub, which is kinda game changing to me. puppet-masterless relies on SSH, and ideally the last step of provisioning should be to remove all the keys from the box and deal with things strictly through Systems Manager. Again, if everything is in a private subnet that doesn’t allow access into the boxes, it is another level of security.
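
Once the keys are gone, ‘dealing with things strictly through Systems Manager’ is just an API call; a sketch (the instance id and the command are placeholders):

import boto3

ssm = boto3.client('ssm')

# run a one-off command on the box without SSH, via the SSM agent
ssm.send_command(
    InstanceIds=['i-0123456789abcdef0'],
    DocumentName='AWS-RunShellScript',
    Parameters={'commands': ['rm -f /home/ubuntu/.ssh/authorized_keys']},
)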

I also suggested using something like Secrets Manager or Vault to store passwords and other secure things rather than putting them right in your Terraform manifests.
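
The Secrets Manager side of that is basically a one-liner with boto3 (the secret name is made up; Vault has an equivalent API):

import boto3

secrets = boto3.client('secretsmanager')

# fetch the secret at apply / deploy time instead of hardcoding it in a manifest
db_password = secrets.get_secret_value(SecretId='selenium/hub/db-password')['SecretString']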

Which dovetailed into me copy-and-pasting a private key into the bastion host, and then showing a security group that allows access into the Hub (which Terraform had brought up) only from the bastion.

Since we’re in AWS (and a lot of others are as well) we have to talk about security. And one of the most important parts of that in AWS is its API logging tool, CloudTrail. The Terraform scripts configured a single CloudTrail trail across all regions, storing the logs in S3. Be careful about doing this if you are in multiple regions, as you pay for traffic and this can silently add to your bill if you are not careful.
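
The boto3 version of that trail, for the curious (the names are placeholders, and the S3 bucket needs the usual CloudTrail bucket policy):

import boto3

cloudtrail = boto3.client('cloudtrail')

# one trail, all regions, logs delivered to S3
cloudtrail.create_trail(
    Name='org-trail',
    S3BucketName='my-cloudtrail-logs',
    IsMultiRegionTrail=True,
)
cloudtrail.start_logging(Name='org-trail')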

One trick AWS suggests is you have CloudTrail monitor itself and automatically re-enable itself if it is disabled. This is what is on this slide and is described in more detail on https://aws.amazon.com/blogs/mt/monitor-changes-and-auto-enable-logging-in-aws-cloudtrail/

One thing anyone building out infrastructure needs to be aware of is how much their stuff is costing at any particular moment in time, and to be warned when something spirals out of control. This is where billing alerts come in, along with using tags on everything that supports them so you can see where your money is going. AWS billing is a black box, with entire consulting organizations existing to try and get a handle on it. Terraform created a billing alert for $40 CDN.
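
Outside of Terraform, that alert is a single CloudWatch call. A sketch (billing metrics only exist in us-east-1 and are reported in USD, so the threshold is a rough equivalent of $40 CDN; the SNS topic is a placeholder):

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch.put_metric_alarm(
    AlarmName='monthly-spend',
    Namespace='AWS/Billing',
    MetricName='EstimatedCharges',
    Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
    Statistic='Maximum',
    Period=21600,  # billing metrics only update every few hours
    EvaluationPeriods=1,
    Threshold=30.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:billing-alerts'],
)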

Out of the Terraform and into the theoretical.

I believe you should run your scripts in the environment your users will. This means Windows. Not headless or Linux. So using Packer you create a Windows based AMI. I started with https://github.com/joefitzgerald/packer-windows for these demo purposes. In a proper grid, actual licenses will be required.

Your nodes should:

  • Be in Auto Scaling Groups, even if you are not doing the ‘auto’ part. This is useful as you can intentionally scale them to 0 if you know you never run scripts overnight. But also think of a scenario where the Hub notifies AWS that it has used 90% of its available nodes and asks it to spin up another 2 or 3, then removes capacity when it has more than x spare. (A minimal scaling sketch follows this list.)
  • Use ‘Spot Instances’, AWS’ marketplace for spare compute capacity sold at a steep discount, to recoup some of your costs. You should never pay more for a Spot instance than you would were it On Demand.
  • Have access to the Node instances restricted to only the Hub via a Security Group
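
The scaling piece of that, as a boto3 sketch (the group name is made up; in practice the 90% trigger would come from a CloudWatch alarm or the Hub itself):

import boto3

autoscaling = boto3.client('autoscaling')

# park the Windows node group overnight
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName='selenium-windows-nodes',
    MinSize=0,
    DesiredCapacity=0,
)

# ...and bring a few nodes back when the Hub reports it is running low
autoscaling.set_desired_capacity(
    AutoScalingGroupName='selenium-windows-nodes',
    DesiredCapacity=3,
    HonorCooldown=False,
)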

One best practice we had a decade ago that has been forgotten is always running your scripts through a scriptable proxy. This lets you blackhole unnecessary scripts which slow down your tests, intercept HTTP codes, and control how much bandwidth is simulated. (Having spent almost a week in a hotel with pretty crap internet, it’s amazing how much of the internet assumes functioning bandwidth.)
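
A sketch of what that looks like with the BrowserMob Proxy Python client (the binary path and the blackholed domain are assumptions, and Chrome is just the example browser):

from browsermobproxy import Server
from selenium import webdriver

# assumes the BrowserMob Proxy binary is installed locally
server = Server('/opt/browsermob-proxy/bin/browsermob-proxy')
server.start()
proxy = server.create_proxy()

# blackhole third-party scripts that only slow the test down
proxy.blacklist('https?://.*\\.doubleclick\\.net/.*', 200)

# simulate hotel-grade internet
proxy.limits({'downstream_kbps': 512, 'upstream_kbps': 256, 'latency': 200})

# point the browser at the proxy and run the test as usual
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server={}'.format(proxy.proxy))
driver = webdriver.Chrome(options=options)
driver.get('https://example.com')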

Access into this proxy should only be from the Node instances and wherever your scripts are being run from (such as CodeBuild) to configure it.

Some of this functionality is starting to be built into the browsers, with bridges to WebDriver through the JavaScript Executor and the Chrome DevTools protocol. This of course assumes you are only running scripts in Chrome. It’s a far better idea to just run a proxy to get greater functionality and cross-browser capability.

Another reason for Terraform over something like CloudFormation is that you can run things in external cloud providers such as MacStadium, which uses VMware as the base of their cloud. So using the same tool that configures your Linux Selenium Hub and Windows Selenium Nodes, you can also create Mac Nodes.

Because it is external to your private subnet where everything is, and in fact external to your VPC, a Load Balancer needs to be created in the public subnet to allow communication from MacStadium into the Hub for registration.

Selenium 4.0 is coming. And it will change this diagram a bit. As mentioned above, the Hub itself can be broken into 4 separate services which can be independently configured and scaled. A ‘Hub’ comprised of 2 AWS Lambda functions, an AWS SQS queue and an AWS ElastiCache Redis instance is going to be the scalable model of the future, I think.

But before that happens, there are a couple of things that need to be done.

Communication between all parts of the Selenium infrastructure needs to be securable. Currently everything is HTTP but it needs to be HTTPS (if not by default, then at least configurable.) If anyone wants to do that, patches are welcome and would save me the work of doing it.

Similarly, there needs to be some way of authorizing Nodes into the Hub. Right now, any Node can register itself with the Hub and start getting traffic. It’s an interesting attack vector to think about: you discover someone launching a Hub in a public subnet, light up a Node, attach it, and now you are seeing a company’s next version of their app because they are sending it to you. The vector gets even more interesting when taking into consideration the work being done to allow communication back to the Hub from the Node. If I can overflow a buffer and somehow run arbitrary commands in the shell, your network is now fully compromised. Again, feel free to submit a patch along the lines of https://www.elastic.co/guide/en/beats/filebeat/current/configuring-ssl-logstash.html so I don’t have to do it.

And that was the talk. Next steps with it are unknown. I’m seriously considering turning it into a video series and maybe offering it as a workshop at future SeleniumConfs.

Is your Automation Infrastructure ‘Well Architected’? – SeConfChicago edition

This week I was in Chicago to get back onto my soapbox about how automation patterns have been largely created, so the risk has shifted to the infrastructure the scripts run on. There is too much content for an hour, which is the length I thought I had until the day before, when I realized I had 45 minutes. And then on stage the counter said 40 minutes.

Anyhow, this talk is intended to overwhelm. The idea being: here is a whole bunch of things you can think about on your own time with the assistance of the actual Well-Architected Framework (which, again, is cloud neutral if you swap out cloud provider component names).

See you next month in Malmö where I’m giving it again at Øredev.

(I’ll embed the recording once it’s available.)

Laravel and Logstash

As we get larger clients, our need to not be cowboying our monitoring / alerting is increasing. In our scenario we are ingesting logs via Logstash and sending them all to an AWS Elasticsearch instance, and if an event is of severity ERROR we send it to AWS Simple Notification Service (which people or services can subscribe to) as well as to PagerDuty.

Input
For each of our services we have an input config which basically says ‘consume this file pattern, call it a laravel file, and add its stack name to the event.’

input {
  file {
    path => "<%= scope['profiles::tether::www_root'] %>/storage/logs/laravel-*.log"
    start_position => "beginning"
    type => "laravel"
    codec => multiline {
      pattern => "^\[%{TIMESTAMP_ISO8601}\] "
      negate => true
      what => previous
      auto_flush_interval => 10
    }
    add_field => {"stack" => "tether"}
  }
}

Filter
Since it’s a laravel-type file, we pull out the environment it’s running in and the log severity, plus grab the IP of the instance, build the SNS message subject, and make sure the event timestamp is the one in the log, not the time Logstash touched the event. (Without that last step, you end up with > 1MM entries for a single day the first time you run things.)

filter {
  # Laravel log files
  if [type] == "laravel" {
    grok {
      match => { "message" => "\[%{TIMESTAMP_ISO8601:timestamp}\] %{DATA:env}\.%{DATA:severity}: %{GREEDYDATA:message}" }
    }
    ruby {
      code => "event.set('ip', `ip a s eth0 | awk \'/inet / {print$2}\'`)"
    }
    mutate {
      add_field => { "sns_subject" => "%{stack} Alert (%{env} - %{ip})" }
    }
    date {
      match => [ "timestamp", "yyyy-MM-dd HH:mm:ss" ]
      target => "@timestamp"
    }
  }    
}

Output
And then we pump it around where it needs to be.

If you are upgrading ES from 5.x to 6.x you need to have the template_overwrite setting, else the new schema doesn’t get imported, and there were some important changes made. The scope stuff is for Puppet to do replacements. And there is a bug in 6.4.0 of the amazon_es plugin around template_overwrite…

output {
  amazon_es {
    hosts => ["<%= scope['profiles::laravel::es_host'] %>"]
    region => "us-west-2"
    index => "logstash-<%= scope['environment'] %>-%{+YYYY.MM.dd}"
    template => "/etc/logstash/templates/elasticsearch-template-es6x.json"
    template_overwrite => true
  }
}
output {
  if [severity] == "ERROR" { 
    sns {
      arn => "arn:aws:sns:us-west-2:xxxxxxxxx:<%= scope['environment'] %>-errors"
      region => 'us-west-2'
    }
  }
}

I’m not quite happy with our PagerDuty setup as the de-duping is running at an instance level right now. Ideally, it would have the reason for the exception as well, but that’s a task for another day.

output {
  if [severity] == "ERROR" { 
      pagerduty {
        event_type => "trigger"
        description => "%{stack} - %{ip}"
        details => {
          timestamp => "%{@timestamp}"
          message => "%{message}"
        }
        service_key => "<%= scope['profiles::laravel::pagerduty'] %>"
        incident_key => "logstash/%{stack}/%{ip}"
      }
  }
}

For the really curious, here is my Puppet stuff for all this. Every machine which has Laravel services has the first manifest, but there are some environments which have multiple services on them which is why the input file lives at the service level.

modules/profiles/manifests/laravel.pp

  class { 'logstash':
    version => '1:6.3.2-1',
  }
  $es_host = hiera('elasticsearch')
  logstash::configfile { 'filter_laravel':
    template => 'logstash/filter_laravel.erb'
  }
  logstash::configfile { 'output_es':
    template => 'logstash/output_es_cluster.erb'
  }
  if $environment == 'sales' or $environment == 'production' {
    logstash::configfile { 'output_sns':
      template => 'logstash/output_sns.erb'
    }

    $pagerduty = lookup('pagerduty')
    logstash::configfile { 'output_pagerduty':
      template => 'logstash/output_pagerduty.erb'
    }
  }
  unless $environment == 'development' {
    file { [ '/etc/logstash/templates' ]:
      ensure => 'directory',
      group  => 'root',
      owner  => 'root',
      mode   => 'u=rwx,go+rx'
    }

    file { [ '/etc/logstash/templates/elasticsearch-template-es6x.json' ]:
      ensure => 'present',
      group  => 'root',
      owner  => 'root',
      mode   => 'u=rwx,go+rx',
      source => 'puppet:///modules/logstash/elasticsearch-template-es6x.json',
      require => Class['Logstash']
    }

    logstash::plugin { 'logstash-output-amazon_es': 
      source => 'puppet:///modules/logstash/logstash-output-amazon_es-6.4.1-java.gem',
      ensure => '6.4.1'
    }
  }

modules/profiles/manifests/.pp

  logstash::configfile { 'input_tether':
    template => 'logstash/input_tether.erb'
  }

The next thing I need to work on is consuming the ES data back into our app so we don’t have to log into Kibana or the individual machines to see the log information. I think every view-your-logs solution I’ve seen for Laravel has been based around reading the actual logs on disk which doesn’t work in a clustered environment or where you have multiple services controlled by a hub one.

Structured Logs in Laravel (Part 2)

The previous post showed you how to tweak Laravel’s default logging setup to output in JSON, which is a key part of creating structured logs. With structured logs you can move yourself towards a more Observable future by tacking on a bunch of extra stuff to your logs which can then be parsed and acted upon by your various systems.

And Laravel supports this out of the box — it just isn’t called out that obviously in the docs. (It is the ‘contextual information’ section on the Logging doc page, or ‘Errors and Logging’ for pre-5.6 docs.) Basically, you create an array and pass it as the second argument to your logging call and it gets written out in the ‘context’ part of the log.

ubuntu@default:/var/www/tether$ sudo php artisan tinker
Psy Shell v0.9.7 (PHP 7.1.20-1+ubuntu16.04.1+deb.sury.org+1 — cli) by Justin Hileman
>>> use Ramsey\Uuid\Uuid;
>>> $observationId = Uuid::uuid4()->toString();
=> "daed8173-5bd0-4065-9696-85b83f167ead"
>>> $structure = ['id' => $observationId, 'person' => 'abc123', 'client' => 'def456', 'entry' => 'ghi789'];
=> [
     "id" => "daed8173-5bd0-4065-9696-85b83f167ead",
     "person" => "abc123",
     "client" => "def456",
     "entry" => "ghi789",
   ]
>>> \Log::debug('some debug message here', $structure);
=> null

which gets output like this

{"message":"some debug message here","context":{"id":"daed8173-5bd0-4065-9696-85b83f167ead","person":"abc123","client":"def456","entry":"ghi789"},"level":100,"level_name":"DEBUG","channel":"development","datetime":{"date":"2018-09-03 18:31:31.079921","timezone_type":3,"timezone":"UTC"},"extra":[]}

Of course there is no ‘standard’ for structured logs (nor should there be as they really are context sensitive), but most of the examples I’ve seen all have some sort of id for giving context for tracing things around.

Note: The id in this case is solely for dealing with logs message output. This is not for application request tracing which I think is also really interesting but have not delved into yet.

Structured Logs in Laravel

I’ve been following the likes of Charity Majors on the twitters and one of her big things around Observability is producing logs in a ‘structured’ format. (Loosely defined as ‘something that a machine can easily read and make decisions on’.)

Out of the box, Laravel ships with a logging system that uses Monolog and its LineFormatter which is Apache-esque.

const SIMPLE_FORMAT = "[%datetime%] %channel%.%level_name%: %message% %context% %extra%\n";

Which means regexes to parse, etc. but they are designed more for human consumption than machine consumption.

The hints of how to change the format to a structured (JSON) one are right in the docs but, as the expression goes, ‘an example would be handy here’. So here you go.

// add these with the other use statements at the top of bootstrap/app.php
use Monolog\Formatter\JsonFormatter;
use Monolog\Formatter\LineFormatter;
use Monolog\Handler\RotatingFileHandler;

/*
|--------------------------------------------------------------------------
| Logging Changes
|--------------------------------------------------------------------------
|
| Structured logs ftw
|
*/
$app->configureMonologUsing(function ($monolog) {
    $days = config('app.log_max_files', 5);
 
    // default 
    $path = storage_path() . '/logs/laravel.log';
    $handler = new RotatingFileHandler($path, $days);
    $handler->setFormatter(new LineFormatter(null, null, true, true));
    $monolog->pushHandler($handler);
 
    // structured
    $path = storage_path() . '/logs/laravel.json';
    $handler = new RotatingFileHandler($path, $days);
    $handler->setFormatter(new JsonFormatter());
    $monolog->pushHandler($handler);
});

Drop this right before the ‘return $app;’ in bootstrap/app.php (with the use statements up with the rest of your imports) and you’ll have two logs, one the default way and one the new structured way. I’m including both at the moment because we have a bunch of log capture / manipulation stuff around the default structure I haven’t changed yet. Once that’s all updated I’ll get rid of the default section.

There’s been a bunch of noise in the Laravel world around ‘supporting the enterprise’. Adding a ‘log format’ is one of those small enterprise support things that adds huge value.

(And yes, I know, pull requests are welcome, but until then, here is a blog post.)

mobilexco/laravel-scout-elastic: an AWS Elasticsearch driver for Laravel Scout

A large piece of what we’ll be doing the last half of this year is improving the support workflows inside Tether (our MarTech platform) and that includes Search. Being a Laravel shop, it makes sense to start with Scout to see if that gets us close, if not completely over the line.

We use Elasticsearch for other things in Tether so it made sense to use that as the Scout backend through the super helpful ErickTamayo/laravel-scout-elastic package. And it worked as advertised right out of the box for local development (inside Vagrant with a local Elasticsearch server). But as soon as we moved the code to a cloud environment that used an AWS Elasticsearch instance it was throwing all sorts of wacky errors. Turns out, AWS Elasticsearch is different from Elastic’s Elasticsearch — not completely, just in how communication is sent over the wire.

No problem; we’re clearly not the first to discover this, and sure enough there are a number of forks of the package that add this in. But they commit one of the following sins:

  • Required AWS credentials to be checked into your repo in a .env file
  • Used env() to fetch configuration settings which breaks if you are caching configs (and you really should be)
  • Required manual intervention with Elasticsearch while deploying.

These are the result of Laravel being in the awkward teenage years, and all very solvable. I just wish it didn’t seem that things I need require these fixes…

Anyhow, mobilexco/laravel-scout-elastic uses the defaultProvider(), which means it will work with IAM roles on your AWS infrastructure to authenticate with Elasticsearch. This is the official AWS-recommended approach and does not require the presence of keys on the server (and all the pain around rotation, etc. that comes with using keys).

It also publishes conf/laravel-scout-elastic.php for setting the flags it needs to decide whether to use the Elastic or AWS implementation of Elasticsearch, rather than using env(), so config:cache works. (This should likely be better called out in the Laravel docs for creating packages…)

The package also includes a new Artisan command (scout:create-index) which can be called via something like AWS CodeDeploy (in the AfterInstall hook) to ensure the index Scout will be using gets created. Which is useful if you are in an environment where Elasticsearch access is restricted to only the boxes that need it and those boxes don’t have SSH installed on them. (Artisan commands are run through either CodeDeploy or SSM.)

Hopefully this saves someone the 12 hours of distracted development it took to come up with this solution.

Client-specific domains with CloudFormation for clients that use Google as email provider

A number of our clients want vanity domains for their experiences, which adds a layer (or two) of operations overhead beyond just having a line item in the invoice. In the spirit of ‘infrastructure as code’-all-the-things, this is now my process for registering new domains for our clients.

  1. Register the domain through Route 53
  2. Delete the hosted zone that is automatically created. (It would be nice if there was an option when getting the domain to not automatically create the hosted zone.)
  3. Login to Google Apps and add the domain as an alias. When prompted to verify, choose Gandi as the provider and get the TXT record that is needed
  4. Create a CloudFormation stack with this template. Some interesting bits;
    • Tags in the AWS Tag Editor are case sensitive so ‘client’ and ‘Client’ are not equivalent
    • I think all my stacks will include the ‘CreationDateParameter’ parameter from now on, which gets added as a tag to the Resource[s] that can accept them. This is part of the ‘timebombing’ of resources to make things more resilient. In theory I can also use AWS Config to find Resources that are not tagged and therefore presumably not under CloudFormation control.
    • Same thing for the ‘client’ tag. Though I’m still not keen on that name, or billing_client or such.
    {
      "AWSTemplateFormatVersion": "2010-09-09",
      "Parameters": {
        "ClientNameParameter": {
          "Type": "String",
          "Description": "Which client this domain is for"
        },
        "DomainNameParameter": {
          "Type": "String",
          "Description": "The domain to add a HostedZone for"
        },
        "GoogleSiteVerificationParameter": {
          "Type": "String",
          "Description": "The Google Site Verification TXT value"
        },
        "CreationDateParameter" : {
          "Description" : "Date",
          "Type" : "String",
          "Default" : "2017-08-27 00:00:00",
          "AllowedPattern" : "^\\d{4}(-\\d{2}){2} (\\d{2}:){2}\\d{2}$",
          "ConstraintDescription" : "Date and time of creation"
        }
      },
      "Resources": {
        "clienthostedzone": {
          "Type": "AWS::Route53::HostedZone",
          "Properties": {
            "Name": {"Fn::Join": [".", [{"Ref": "DomainNameParameter"}]]},
            "HostedZoneTags": [
              {
                "Key": "client",
                "Value": {"Ref": "ClientNameParameter"}
              },
              {
                "Key": "CloudFormation",
                "Value": { "Ref" : "CreationDateParameter" }
              }
            ]
          }
        },
        "dnsclienthostedzone": {
          "Type": "AWS::Route53::RecordSetGroup",
          "Properties": {
            "HostedZoneId": {
              "Ref": "clienthostedzone"
            },
            "RecordSets": [
              {
                "Name": {"Fn::Join": [".", [{"Ref": "DomainNameParameter"}]]},
                "Type": "TXT",
                "TTL": "900",
                "ResourceRecords": [
                  {"Fn::Sub": "\"google-site-verification=${GoogleSiteVerificationParameter}\""}
                ]
              },
              {
                "Name": {"Fn::Join": [".", [{"Ref": "DomainNameParameter"}]]},
                "Type": "MX",
                "TTL": "900",
                "ResourceRecords": [
                  "1 ASPMX.L.GOOGLE.COM",
                  "5 ALT1.ASPMX.L.GOOGLE.COM",
                  "5 ALT2.ASPMX.L.GOOGLE.COM",
                  "10 ALT3.ASPMX.L.GOOGLE.COM",
                  "10 ALT4.ASPMX.L.GOOGLE.COM"
                ]
              }
            ]
          }
        },
      }
    }
  5. Update the domain’s nameservers to the ones in our newly created Hosted Zone. I suspect this could be done via a Lambda-backed custom resource, but that’s a couple steps too complicated for me right now. If I have to do this more than once every couple weeks it’ll be worth the learning time. (A boto3 sketch of the two API calls involved follows this list.)
  6. Validate the domain with Google.
  7. Manually create a certificate for ${DomainNameParameter} and *.${DomainNameParameter}. (For reals, this should be an automatic thing for domains registered in Route 53 and hosted within Route 53.)
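
For the curious, here is roughly what that Lambda (or a local script) would do; the hosted zone id and domain name are placeholders:

import boto3

route53 = boto3.client('route53')
domains = boto3.client('route53domains', region_name='us-east-1')  # Route 53 Domains only lives in us-east-1

# name servers of the hosted zone CloudFormation just created
zone = route53.get_hosted_zone(Id='Z0123456789ABCDEFGHIJ')
nameservers = zone['DelegationSet']['NameServers']

# point the registered domain at them
domains.update_domain_nameservers(
    DomainName='client-vanity-domain.com',
    Nameservers=[{'Name': ns} for ns in nameservers],
)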

And then I need to create an ALB for the domain and point it at the right service. But that’s getting rather yak shave-y. The ALB needs to be added to the ASG for the service, but those are not under CloudFormation control, so I need to get them under control first.

HubSpot in an AWS World

We recently moved our corporate website from WPEngine to HubSpot and as part of that, you have to do some DNS trickery. HubSpot helpfully provides instructions for various DNS providers, but not Route 53. Reading the ones they do provide, though, gives a good idea of what is needed:

  1. Add a CNAME for your HubSpot domain as the www record
  2. Add an S3 hosting bucket to redirect everything to www.yourdomain.com
  3. Add a CloudFront distribution to point to your bucket

Now, this is likely 5 minutes of clicking, but AWS should be done with minimal clicking, in favour of using CloudFormation (or Terraform or such). As such, it took about 10 hours…

Lesson 1 – Don’t create your Hosted Zones by hand.

Currently, all our Hosted Zones in Route 53 were either created by hand because the domains were registered somewhere else, or created at registration time by Route 53. This is a challenge as CloudFormation cannot edit (to add or update) records in Hosted Zones that were not created by CloudFormation. This meant I needed to use CloudFormation to create a duplicate Hosted Zone, let that propagate through the internets, and then delete the existing one.

Here’s the CloudFormation template for doing that — minus 70+ individual records. Future iterations likely would have Parameters and Outputs sections, but because this was a clone of what was already there I just hardcoded things.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "zonemobilexcocom": {
      "Type": "AWS::Route53::HostedZone",
      "Properties": {
        "Name": "mobilexco.com."
      }
    },
    "dnsmobilexcocom": {
      "Type": "AWS::Route53::RecordSetGroup",
      "Properties": {
        "HostedZoneId": {
          "Ref": "zonemobilexcocom"
        },
        "RecordSets": [
          {
            "Name": "mobilexco.com.",
            "Type": "MX",
            "TTL": "900",
            "ResourceRecords": [
              "1 ASPMX.L.GOOGLE.COM",
              "5 ALT1.ASPMX.L.GOOGLE.COM",
              "5 ALT2.ASPMX.L.GOOGLE.COM",
              "10 ALT3.ASPMX.L.GOOGLE.COM",
              "10 ALT4.ASPMX.L.GOOGLE.COM"
            ]
          }
        ]
      }
    },
    "dns80808mobilexcocom": {
      "Type": "AWS::Route53::RecordSetGroup",
      "Properties": {
        "HostedZoneId": {
          "Ref": "zonemobilexcocom"
        },
        "RecordSets": [
          {
            "Name": "80808.mobilexco.com.",
            "Type": "A",
            "TTL": "900",
            "ResourceRecords": [
              "45.33.43.207"
            ]
          }
        ]
      }
    }
  }
}

Lesson 2 – Don’t forget that DNS is all about caching. You could clone a domain and forget to include the MX record because you blindly trusted the output of CloudFormer, only to realize you had stopped incoming mail overnight; it kept working for you because you had things cached…

Lesson 3 – Even though you are using an S3 Hosted Website to do the redirection, you are not actually using an S3 Hosted Website in the eyes of CloudFront.

This cost me the most grief as it led me to try and create an S3OriginPolicy, an Origin Access Identity, etc. that I didn’t need.

Note: in order to make this template work, you need to first have issued a certificate for your domain through ACM. Which is kinda a pain. My current top ‘AWS Wishlist’ item is auto-provisioning of certificates for domains that are both registered and hosted within your account.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Parameters": {
    "DomainNameParameter": {
      "Type": "String",
      "Description": "The domain to connect to hubspot (don't include the www.)"
    },
    "HubspotCNameParameter": {
      "Type": "String",
      "Description": "The CName for your hubspot site"
    },
    "AcmCertificateArnParameter": {
      "Type": "String",
      "Description": "ARN of certificate to use in ACM"
    }
  },
  "Resources": {
    "s3mobilexcocom": {
      "Type": "AWS::S3::Bucket",
      "Properties": {
        "BucketName": {"Ref": "DomainNameParameter"},
        "AccessControl": "Private",
        "WebsiteConfiguration": {
          "RedirectAllRequestsTo": {
            "HostName": {"Fn::Join": ["", ["www.", {"Ref": "DomainNameParameter"}]]},
            "Protocol": "https"
          }
        }
      }
    },
    "dnswwwmobilexcocom": {
      "Type": "AWS::Route53::RecordSetGroup",
      "Properties": {
        "HostedZoneId": {
          "Fn::ImportValue" : "hosted-zone-mobilexco-com:HostedZoneId"
        },
        "RecordSets": [
          {
            "Name": {"Fn::Join": ["", ["www.", {"Ref": "DomainNameParameter"}, "."]]},
            "Type": "CNAME",
            "TTL": "900",
            "ResourceRecords": [
              {"Ref": "HubspotCNameParameter"}
            ]
          }
        ]
      }
    },
    "dnsmobilexcocom": {
      "Type": "AWS::Route53::RecordSetGroup",
      "Properties": {
        "HostedZoneId": {
          "Fn::ImportValue" : "hosted-zone-mobilexco-com:HostedZoneId"
        },
        "RecordSets": [
          {
            "Name": {"Fn::Join": ["", [{"Ref": "DomainNameParameter"}, "."]]},
            "Type": "A",
            "AliasTarget": {
              "DNSName": {"Fn::GetAtt": ["httpsDistribution", "DomainName"]},
              "HostedZoneId": "Z2FDTNDATAQYW2"
            }
          }
        ]
      }
    },
    "httpsDistribution" : {
      "Type" : "AWS::CloudFront::Distribution",
      "Properties" : {
        "DistributionConfig": {
          "Aliases": [
            "mobilexco.com"
          ],
          "Origins": [{
            "DomainName": {"Fn::Join": ["", [{"Ref": "DomainNameParameter"}, ".s3-website-", {"Ref": "AWS::Region"}, ".amazonaws.com"]]},
            "Id": "bucketOriginId",
            "CustomOriginConfig": {
              "HTTPPort": 80,
              "HTTPSPort": 443,
              "OriginProtocolPolicy": "http-only"
            }
          }],
          "Enabled": "true",
          "DefaultCacheBehavior": {
            "ForwardedValues": {
              "QueryString": "false"
            },
            "TargetOriginId": "bucketOriginId",
            "ViewerProtocolPolicy": "allow-all"
          },
          "ViewerCertificate": {
            "AcmCertificateArn": {"Ref": "AcmCertificateArnParameter"},
            "SslSupportMethod": "sni-only"
          },
          "PriceClass": "PriceClass_100"
        }
      }
    }
  }
}

Lesson 4 – Naming conventions are a thing. Use them.

Naming conventions matter as soon as you start doing ImportValue or AWS::CloudFormation::Stack. In theory the ImportValue lines could use DomainNameParameter with Fn::Sub to switch the . to a - and this would be an entirely generic template, but this is working well enough for me. And of course, your naming convention could be (and likely is) different.

Harmonizing Maintenance Windows

At the moment we are only using RDS and ElastiCache within AWS, but the more services we use, the more maintenance windows are going to come up. Rather than have them at random places around the week and clock, I figure it would be useful to have just a single window that we can subsequently work into our SLAs, etc. Now, I really like the management consoles AWS has, but it’s a lot of clicks to track things, especially if I start using something like CloudFormation and Auto Scaling to make things happen magically.

Scripting to the rescue.

Our applications are PHP based, but at heart I’m a Python guy, so I whipped one up. And aside from the fear of modifying running items, it appears to have worked well.

import boto3
 
maintenance_window = 'sun:09:35-sun:10:35'
 
# rds can have maintenance windows
update_rds = False
rds = boto3.client('rds')
paginator = rds.get_paginator('describe_db_instances')
for response_iterator in paginator.paginate():
    print('Current RDS Maintenance Windows')
    for instance in response_iterator['DBInstances']:
        print('%s: %s UTC' % (instance['DBInstanceIdentifier'], instance['PreferredMaintenanceWindow']))
        if instance['PreferredMaintenanceWindow'].lower() != maintenance_window.lower():
            update_rds = True
 
if update_rds:
    paginator = rds.get_paginator('describe_db_instances')
    for response_iterator in paginator.paginate():
        for instance in response_iterator['DBInstances']:
            if instance['PreferredMaintenanceWindow'].lower() != maintenance_window.lower():
                rds.modify_db_instance(
                    DBInstanceIdentifier=instance['DBInstanceIdentifier'],
                    PreferredMaintenanceWindow=maintenance_window
                )
    paginator = rds.get_paginator('describe_db_instances')
    for response_iterator in paginator.paginate():
        print('Adjusted RDS Maintenance Windows')
        for instance in response_iterator['DBInstances']:
            print('%s: %s UTC' % (instance['DBInstanceIdentifier'], instance['PreferredMaintenanceWindow']))
 
# elasticache can have maintenance windows
update_ec = False
ec = boto3.client('elasticache')
paginator = ec.get_paginator('describe_cache_clusters')
for response_iterator in paginator.paginate():
    print('Current ElastiCache Maintenance Windows')
    for instance in response_iterator['CacheClusters']:
        print('%s: %s UTC' % (instance['CacheClusterId'], instance['PreferredMaintenanceWindow']))
        if instance['PreferredMaintenanceWindow'].lower() != maintenance_window.lower():
            update_ec = True
 
if update_ec:
    paginator = ec.get_paginator('describe_cache_clusters')
    for response_iterator in paginator.paginate():
        for instance in response_iterator['CacheClusters']:
            if instance['PreferredMaintenanceWindow'].lower() != maintenance_window.lower():
                ec.modify_cache_cluster(
                    CacheClusterId=instance['CacheClusterId'],
                    PreferredMaintenanceWindow=maintenance_window
                )
 
    paginator = ec.get_paginator('describe_cache_clusters')
    for response_iterator in paginator.paginate():
        print('Adjusted ElastiCache Maintenance Windows')
        for instance in response_iterator['CacheClusters']:
            print('%s: %s UTC' % (instance['CacheClusterId'], instance['PreferredMaintenanceWindow']))

It’s always a Security Group problem…

I’ve got a number of private subnets within my AWS VPC that are all nice and segregated from each other. But every time I light up a new Ubuntu instance and tell it to ‘apt-get update’ it times out. Now, since these are private subnets I can get away with opening ports wide open, but AWS is always cranky at me for doing so. I feel slightly vindicated that the same behaviour is asked about on Stack Overflow often too, but anyways, I figured it out this week. Finally. And as usual with anything wonky network-wise in AWS, it was a Security Group problem.

  1. First thing, read the docs carefully.
  2. Read it again, more carefully this time
  3. Set up the routing. I actually created 2 custom routing tables rather than modify the Main one; explicit is better than implicit (thanks Python!)
  4. Create an ‘apt’ Security Group to be applied to the NAT instance, with inbound rules from your private VPC address space for HTTP (80), HTTPS (443) and HKP (11371). HTTP is the default protocol for apt, but if you are adding new repos the key is delivered via HTTPS and then validated against the central key servers via HKP. You’ll need outbound rules for those ports too, per the docs. (A boto3 sketch of the inbound rules follows this list.)
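
Something like this, assuming the group already exists (the group id and VPC CIDR are placeholders):

import boto3

ec2 = boto3.client('ec2')

# inbound HTTP, HTTPS and HKP from the private address space to the NAT instance
ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',
    IpPermissions=[
        {'IpProtocol': 'tcp', 'FromPort': port, 'ToPort': port,
         'IpRanges': [{'CidrIp': '10.0.0.0/16'}]}
        for port in (80, 443, 11371)
    ],
)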

And now you should be able to lock down your servers a bit more.