Scrum is an anti-pattern for Continuous Delivery

I’ve been saying that ‘Scrum is an anti-pattern for Continuous Delivery’ for a while, including in last week’s post, which got a ‘huh?’, so here is my beef with Scrum.

Actually my complaint isn’t with Scrum itself, but with Sprints, and if you remove those then the whole house of cards falls down. (This is similar to my stance on Java, which I do dislike, but I loathe Eclipse so Java is tolerable in something other than Eclipse. Barely.)

The whole point of Continuous Delivery, to me, is to ‘deliver’ improvements to whatever it is you do, to your customers ‘continuously.’ Where continuously means, well, continuously. Not ‘at the end of an arbitrary time period which usually is about 2 – 3 weeks in length.’ This is why ‘Mean Time To Production’ is such an important metric to me and drives all other changes to the delivery pipeline.

“But Adam, how will we plan what we do if we don’t get to play Planning Poker every week?” Easy. Your customers will tell you. And that ‘customer’ could be internal. If something is important, you will know. If something is more important than something else, then it will bump that down the queue. This isn’t to say discussing things and figuring out how to slice them into smaller and smaller units isn’t necessary. It absolutely is. And learning how to do this is perhaps one of the hardest problems in software. Which leads to…

“But Adam, this is a big story that will take the full sprint to complete.” Slice it smaller, hide the work-in-progress behind feature flags and still push your local changes daily. (You should be using them anyways to separate feature launch from availability.)
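
As a minimal sketch of hiding work-in-progress behind a flag, here is one way a flag check with a percentage rollout could look in Python. The flag names, the `rollout_percent` field and the in-memory dict standing in for a flag store are all assumptions for illustration, not a description of any particular tool.

```python
import hashlib

# Hypothetical flag store; in practice this would live somewhere you can
# change at runtime without a deploy.
FLAGS = {
    "new-checkout": {"enabled": True, "rollout_percent": 25},
    "half-built-report": {"enabled": False, "rollout_percent": 0},
}


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str) -> bool:
    """True if the flag is on and this user falls inside the rollout slice."""
    flag = FLAGS.get(flag_name)
    if flag is None or not flag["enabled"]:
        return False  # unknown or switched-off flags keep the work hidden
    return bucket(flag_name, user_id) < flag["rollout_percent"]
```

Hashing the flag name together with the user id keeps each user's answer stable between requests, while different flags slice the user base differently.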

“But Adam, we could deploy at any point — we just do it once a sprint.” Why? You are actively doing a disservice to your customers and your company by holding back things that could improve their experience and make you more money. Disclaimer: this becomes a more real argument when deploying to IoT or other hardware. I don’t want my thermostat to get updated 20 times a day. But if the vendor could do it, I’ll accept that.

“But Adam, we are in a regulated environment and have to do Scrum.” That’s a strawman argument against working with your auditors. See Dave’s recent Continuous Compliance article.

“But Adam, how will we know if we are getting better at estimating?” The same way you do with Scrum or anything else, which is to collect data. This is a bulk food type problem. When you go to buy, say, peanut butter from the bulk food store, you take in your container and they weigh it before you scoop your peanut-y deliciousness into it, and after. They then do the math to know the weight of just the peanut butter. The same thing can be done here. If you know how long your deploys take, you can measure the time from when code was started to when it was available in production, and then subtract the fixed time of deployments to get the actual length of time something took. In its entirety, not just ‘in development’. (I don’t actually track this metric right now. Things take the length of time they take. But I think this is sound theory.)
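
The peanut butter math is small enough to sketch in Python. The timestamps and the 20 minute deploy below are made-up numbers, and the function names are only for illustration.

```python
from datetime import datetime, timedelta


def time_to_production(started: datetime, live: datetime) -> timedelta:
    """Total elapsed time from starting the code to it being in production."""
    return live - started


def time_excluding_deploy(started: datetime, live: datetime,
                          deploy_duration: timedelta) -> timedelta:
    """Subtract the known, fixed deploy time (the weight of the container)."""
    return time_to_production(started, live) - deploy_duration


started = datetime(2019, 3, 4, 9, 0)    # hypothetical: work began
live = datetime(2019, 3, 5, 15, 30)     # hypothetical: available in production
deploy = timedelta(minutes=20)          # hypothetical: measured deploy time

print(time_excluding_deploy(started, live, deploy))
```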

“But Adam, where do all our manual testers fit in this world?” They are just part of the process. This is a key difference between Continuous Deployment and Continuous Delivery. If your process says humans touch it, then humans touch it. But there also needs to be a way to short-circuit around them in the case of an emergency.

“But Adam, our database is so archaic and fragile that deployments are a huge risk and sprints minimize that.” That’s a good place to start to change things. A local company still does releases weekly overnight on Wednesdays after 5 years because of this. I’m pretty sure it stopped being a tech problem and became a people problem a couple years ago.

So if not Scrum, then what? The ‘easy’ answer is Kanban. The harder answer is of course ‘it depends’ and likely looks like a tailored version of Kanban that solves your team’s problems. I really like the notion of a work item flowing across a board, but also dislike enforcing WIP limits and the artificial moving of things left to make room for something else because the tooling requires it.

Let me know what other “But Adam’s” I missed in the comments.

Oh, I’ve got one more.

“But Adam, that is hard.” Yes. Yes it is. (It’s also super fun.)

‘So what would you do?’

Another ‘free consulting is content’ post. The context here is a 10 year old company a friend of mine is the VP of Engineering at whose delivery pipeline worked … but there were some horrible manual steps (as compared to manually pushing a button steps which are perfectly acceptable, if not desirable) and the things were too custom and black box-y. Oh, and the deploy from CircleCI was just flat out broken right now. The gist of the conversation was ‘if you helped us out, what would it look like.’

What’s interesting is that this, and other conversations like this that I have had in the last month have really distilled my thoughts around pipelines which leads to a playbook of sorts, but that’s beyond scope of this. Aside from this looking a lot like what the playbook looks like.

Anyhow, here is the ‘only slightly edited’ bit of free consulting I gave.

  1. Check that things that should already be done are done

Root account has a hardware MFA token that is somewhere secure, CloudTrail is enabled and has the fun Lambda script to automatically re-enable it if it is disabled, deletion protection is turned on, etc.

  1. CodeDeploy

Since deploying from CircleCI is busted anyways, get it producing CodeDeploy packages and manually install the agent on all the boxes

  1. Packerize all images

Standardize on a Linux distro (anything other than Amazon Linux 2 is silly). Create base AMIs with CodeBuild triggered off of GitHub webhooks to the $company-Packer repo. Again, it doesn’t matter which configuration management tool Packer uses, as long as they can justify the choice. And as I mentioned, AWS has given a credible reason to use Ansible with the integration of running playbooks from Systems Manager.

  1. Replace CircleCI with CodePipeline (orchestration) and CodeBuild (build, test and package) — since deploy is already done via CodeDeploy
  2. Feature Flags

Managed via an admin screen into the database (not file-based) to dark launch features to cohorts and/or percentages before full availability.

  1. Airplane Development

‘Can you do development on an airplane without internet access’ — so no shared databases, needing to reach out to the internet for js or fonts or icons, etc. Look at developer onboarding at this point too. Vagrant is great. Docker is the hotness. But Vagrant means you can literally have the same configuration locally as you do in production. Docker can too of course if you are going Fargate/ECS.

  1. Health

Monitoring (all the layers, reactive and proactive; all but one of my major outages could have been predicted if I was watching the right things), Logging (centralized, retention), Paging (when and why and fix the broken windows), Testing-in-Production (it’s the only environment that counts), Health Radiator (there should be a big screen with health indicators, both system and business, in your area), etc.

  1. Autoscaling

Up and Down, at all the layers. Driven by monitoring and logging.

  1. Bring everything under Terraform control

Yes, only at this point. It ‘works’ now, just not the way you want it to. Everything above doesn’t ‘work’. Again, I’d use Terraform over CloudFormation, but for ‘all in on AWS’ CloudFormation is certainly an option. Now if only CloudFormation was considered a first class citizen inside AWS and supported new features before competitors like Terraform do. CloudFormation still doesn’t have Route 53 Delegation Sets the last time I checked.

  1. Disaster Recovery

‘Can you take the last backup and your Terraform scripts and light up $company in a net new AWS account, and be down for a maximum of how long it takes to copy RDS snapshots, or lose only data from the last in-flight backup?’

  1. Move to Aurora

Just because I like the idea of having the ability to have the database trigger Lambda functions

  1. Observability

Slightly different than Health — basically I would use Honeycomb because Charity, etc. are far too smart.

  1. Chaos Engineering

Self healing, multi-region, etc. If Facebook can cut the power to their London datacenter and no one notices, $company can do something less dramatic with equal effect.

And then it’s ‘just’ keeping the ship sailing the way you want, making slight corrections in the course along the way.

We need a priest (QA) to bless (test) all our work

A friend of mine pinged me during his commute this morning about my thoughts on weaning a team off of thinking they need ‘a priest (QA) to bless (test) all our work’. ‘Free’ consulting means it gets to be content. :D

Obviously, this is a ‘people problem’. So the approach will vary place-to-place and even within the place. Regardless though, need to start by expunging ‘QA as Quality Assurance’ from the organization. They don’t actually ‘Assure’ anything. You, or a half dozen other people could override. So ‘Quality Assistance’ is a nicer reframing. Or better still, ‘Testing’.

Then you need to play detective and find out what the inciting event was that caused a) the first ‘QA’ person to be hired, and b) how they got anointed as priests. A smooth transition away from that requires that you know those two things.

Organizationally, I would be interested in;

  • how many things are found by the testers
  • what the categorization is (because those are developer blind spots)
  • how many things that are found actually hold up the build until fixed
  • and of those, how many could have shipped
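
Those four numbers are easy to tally once findings are recorded somewhere. A minimal Python sketch, assuming a hypothetical `Finding` record with made-up category labels (neither comes from any real tracker):

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Finding:
    category: str             # e.g. "validation", "layout"; labels are made up
    blocked_build: bool       # did it hold up the build until fixed?
    could_have_shipped: bool  # in hindsight, was blocking necessary?


def summarize(findings):
    """Answer the four questions above for a list of findings."""
    blockers = [f for f in findings if f.blocked_build]
    return {
        "found": len(findings),
        "by_category": Counter(f.category for f in findings),
        "blocked_build": len(blockers),
        "blocked_but_shippable": sum(f.could_have_shipped for f in blockers),
    }
```

The `by_category` counts are the interesting part, since they point at the developer blind spots.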

From a purely technical perspective, some practices that address this;

  • dark launches via feature flags and have new stuff rolled out slowly to user slices
  • acknowledge that production is different than any other environment and is the only environment that matters. To quote Charity; ‘I test in production, and so do you.’
  • the only metric that matters in today’s world is ‘mean time to production’. Something isn’t ‘done’ unless it is in production being used by the target customer. Everything you want to do hinges on that. Put on your wall a whiteboard with ‘number of deploys today’, ‘number of deploys this week’, ‘number of deploys this month’ which you increment each time it goes to production
  • if you think your feature stories are small enough, you need to slice them more
  • not to overload the term, but increase the observability of the application in the more traditional way, not the Honeycomb way. If you are pushing to production fast and often, you need to know whether it’s behaving fast and often. Number of logins per 5 minutes, number of registrations per 5 minutes, number of searches per 5 minutes, etc. Every new feature / fix needs to have a measure to know if it is working. (It will take a long time to get to here.)
  • move to trunk based development. Everyone should be pushing code at least once every 2 days. Feature branches allow people to get sloppy.
  • Obviously, TDD is huge in this. (or TAD — I don’t care, just slow down and write some damn tests before committing)
  • Steal from Etsy’s playbook and have your pipeline such that day 1 at <redacted> is pushing to production, and day 2 is paperwork / onboarding. Forces you to get your development environment in shape so you can onboard someone from bare machine to productive in an hour and also it breaks the feeling of sanctity around production and creates shared ownership. I believe everyone at etsy did this, not just developers. (Though obviously non-developers had a borrowed environment and were hand-held)
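
The ‘number of X per 5 minutes’ measures from the list above boil down to bucketing events by time window. A rough Python sketch, assuming events arrive as hypothetical `(timestamp, name)` pairs:

```python
from collections import Counter
from datetime import datetime


def window_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5 minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def counts_per_window(events):
    """events: iterable of (timestamp, name) pairs.

    Returns a Counter keyed by (window start, event name).
    """
    return Counter((window_start(ts), name) for ts, name in events)
```

A sudden drop in a window's count right after a deploy is the kind of signal this is meant to surface.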

MTTP reduction is the whole purpose of building out a Continuous Delivery pipeline. ‘QA Priest’ doesn’t fit time-wise for that. (It’s also why Scrum is a Continuous Delivery anti-pattern.)

But again, this is a Culture thing. To quote Jerry; ‘things are the way they are because that is the way they got there’ — figure that out and you can change the culture.

Practice what you Preach

Last week I was in London, England for SeleniumConf where I gave a talk on test infrastructure. The feedback seemed to be good from the people I talked to, but I personally was uncomfortable on stage and felt it might have been my worst performance. I felt good about the content the morning of and had a few jokes, etc. planned, but when I got going the switching of windows (1xFirefox, 1xChrome, 1xKeynote, 1xTerminal, 1xSublime) completely threw me off my game and was a downward spiral from there. It likely was too ambitious for a 40 minute slot and better suited to more of a hands-on, all-day workshop. Here is the sorta Commentary Track with extra links and such.

Most of my decks have a visual theme to them, but I couldn’t come up with anything. At one point I tried Rocky and Bullwinkle because I could say ‘Nothing up my sleeve’ and show Bullwinkle as a magician when showing an empty AWS account. But then I thought of ‘Practice what you Preach(er)’ and tried to shoehorn that in. But it’s /really/ hard to find appropriate things to include so I ripped most of it out. And, most of the talk was supposed to be code and/or architecture diagrams so really any visual theme was a stretch. In the end, I left a couple Preacher things in, but it wasn’t an obvious thing and was lost on most, so I likely should have ripped them all out.

These really are the ‘rules’ of presenting. In general, your talk will be improved if you avoid these. The logic being; the audience knows who you are from the bio in the program, they want you to succeed so don’t sow the idea of failure into their minds, and things will inevitably go wrong (best to anticipate it and have a video of you doing a thing instead of doing it.)

“So let’s break some rules.”

Fell flat. And the Preacher reference felt forced. (It was.)

Breaking Rule 1. I really don’t care about scripts anymore. There has been tragically little innovation in the actual script creation and maintenance space. But what people don’t talk about is the business risk around where those scripts run. I don’t have any data to substantiate this claim, but my gut is that too many people are just spinning up EC2 or ECS instances to run their scripts without knowledge of the tooling around it in order to run them securely and efficiently.

Breaking Rule 2. I had such big plans for this talk, but have been battling burnout for a year now. It’s been especially bad the last couple months which is exactly when I needed to be prepping things for success. Which didn’t help things as burnout feeds off of burnout.

Burnout isn’t technically a clinical diagnosis, but I like this definition.

Thankfully there are organizations now specifically chartered to help tech people deal with their traitor brains. Such as https://osmihelp.org.

This is a ‘simple’ view of what a modern, self-hosted Selenium infrastructure in AWS could look like. I’m likely missing a few things, but it really hasn’t changed in the last 5 or 6 years. Selenium Grid 4.0 could make some interesting changes at scale as the Hub can be broken into 4 different services. Oh, and I don’t include Docker in here because I don’t believe in running scripts on environments your customers are not. You are of course more than welcome to if enough of your customers are using headless browsers or Linux desktops. I’m also not current on how to set up Sauce Labs in a hybrid scenario (or even if they support that configuration anymore) with their tunnel product adding their cloud as available browser nodes into the Hub, which I always thought was a sweet spot for them.

Here is the conclusion of the ‘Do not start with an apology’ rule and the origin of the name. In Austin I rambled (even for me) about infrastructure and just threw a tonne of ideas and topics at the audience. In Chicago I used the Well-Architected Framework from AWS (https://aws.amazon.com/architecture/well-architected/) to organize all those ideas. It is by AWS and uses AWS product offerings as examples, but it really is cloud neutral at its core. There was a tweet a month or so ago about a team that used it to cut their Azure bill by something like 40% by applying the principles in it to their cloud infrastructure. So the plan then for this was to open up a terminal, run terraform apply, do the rest of the talk, then have that full diagram in the previous slide created and run a test.

Yaaaaaa. About that. Remember burnout? Ends up I got maybe 1/5 of it done. And couldn’t run a test. So rather than the pitched ‘All code’ there was ‘Some code’.

Now we’re starting to get into Rule 3 territory about not running a live demo. And the tools I’m recommending these days to do it are the Hashicorp ones. I used to suggest using the cloud provider’s native ones (such as CloudFormation for AWS) but for the above reasons have switched. I of course reserve the right to change my mind again in the future.

Almost ready to build some infrastructure, live on stage, but first have to talk about pre-conditions.

The ‘Root Account’ is the all powerful account with every permission possible. It is also the most dangerous one. The only thing you should do with it is enable hardware MFA (https://aws.amazon.com/iam/features/mfa/?audit=2019q1 has links to purchase), create an ‘administrators’ group that has the ‘AdministratorAccess’ managed policy and create an ‘IAM User’ in that group.

The ‘IAM User’ will have a virtual MFA token and access keys.

This is where juggling windows went crazy. Escape out of Keynote presenter mode to Firefox which had the Root Account window, then to Safari which had the IAM User window, then to the terminal to start Terraform and watch it scroll a bit until it applies the ‘Everything must have MFA’ policy (as described at https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_iam_mfa-selfmanage.html) until things fail, and demo getting a token in the shell (by switching between the terminal and Sublime which had the commands I wanted) and finish running Terraform.

The network held, and things applied without a problem. But it was here that I realized window switching wasn’t going to work so had to adjust on the fly.

One of the first pieces of infrastructure to be created needs to be the networking layer itself. I strongly believe that AWS’ VPC Scenario 2 (https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenarios.html) is the right one in almost all Selenium deployments. Basically, everything of importance is in the ‘private’ (blue) subnet and is not reachable from the internet, and the only thing that can reach it is either a bastion host in the ‘public’ (green) subnet or a VPN server that sits in the public subnet. You still have to keep things in the private subnet patched, etc. but there is a level of risk mitigation achieved when the bad guys (and gals) cannot even access the machines.

I gave the bastion host an Elastic IP (static) and likely should have also registered it with Route 53 (AWS’ DNS service) and tried to SSH into it. But it didn’t work, on purpose. That is because I created an empty security group with Terraform and attached it to the bastion host. Using the AWS CLI I added just my IP as an ingress rule so now one of the instances I created was accessible from the internet, but only from a specific IP and certificate. It was also a way to demonstrate in Terraform how to have it ignore certain changes to prevent it from rebuilding parts of your infrastructure and kicking you out mid change. (There’s a lesson learned the hard way…)
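
The talk used the AWS CLI for this step. As a sketch of the same idea in Python, the function below only builds the rule for a single address and does not call AWS. The function name and the example IP (from the documentation range) are made up; the dict shape is the `IpPermissions` structure that boto3's `authorize_security_group_ingress` accepts.

```python
import ipaddress


def ssh_ingress_for(ip: str, description: str = "temporary bastion access") -> list:
    """Build an IpPermissions list allowing SSH from exactly one IPv4 address."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    return [{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        # /32 means this one address and nothing else
        "IpRanges": [{"CidrIp": f"{address}/32", "Description": description}],
    }]


rule = ssh_ingress_for("203.0.113.10")
```

Validating the address first means a typo fails loudly instead of opening the bastion to a wider range than intended.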

This slide talks about how the bastion (and other) instances were produced by another part of the Hashicorp suite — Packer. No demo, but the idea of ‘Cattle vs Pets’ for your instances was brought up again. My current Packer setup uses the puppet-masterless (https://www.packer.io/docs/provisioners/puppet-masterless.html) provisioner but I would consider switching to Ansible in the future as AWS just announced that Systems Manager can run Ansible playbooks directly from S3 or Github which is kinda game changing to me. puppet-masterless relies on SSH and ideally the last step of provisioning should be to remove all the keys from the box and deal with things strictly through Systems Manager. Again, if everything is in a private subnet that doesn’t allow for access into the boxes, it is another level of security.

I also suggested using something like Secrets Manager or Vault to store passwords and other secure things rather than putting them right in your Terraform manifests.

Which dovetailed to me copy-and-pasting a private key into the bastion host. And then showing a security group that allows access into the Hub that was brought up from only the bastion.

Since we’re in AWS (and a lot of others are as well) we have to talk about security. And one of the most important parts around that in AWS is its API logging tool CloudTrail. The Terraform scripts configured a single CloudTrail trail across all regions that stores the logs in S3. Be careful about doing this though if you are in multiple regions as you pay for traffic and this can silently add to your bill if you are not careful.

One trick AWS suggests is you have CloudTrail monitor itself and automatically re-enable itself if it is disabled. This is what is on this slide and is described in more detail on https://aws.amazon.com/blogs/mt/monitor-changes-and-auto-enable-logging-in-aws-cloudtrail/

One thing anyone building out infrastructure needs to be aware of is how much their stuff is costing at any particular moment in time. And to be warned when something spirals out of control. This is where billing alerts come in, along with using tags on everything that supports them to be able to see where your money is going. AWS billing is a black box with entire consulting organizations existing to try and get a handle on it. Terraform created a billing alert for $40 CDN.

Out of the Terraform and into the theoretical.

I believe you should run your scripts in the environment your users will. This means Windows. Not headless or Linux. So using Packer you create a Windows based AMI. I started with https://github.com/joefitzgerald/packer-windows for these demo purposes. In a proper grid, actual licenses will be required.

Your nodes should be;

  • In Auto Scaling Groups even if you are not doing the ‘auto’ part. This is useful as you can intentionally scale them to 0 if you know you never run scripts overnight. But also think of a scenario when the Hub notifies AWS that it has used 90% of its available nodes and to spin up another 2 or 3, and then remove capacity when it has more than x spare.
  • Use ‘Spot Instances’, which are spare EC2 capacity that AWS sells at a discount and can reclaim at short notice. You should never pay more for a Spot instance than you would were it On Demand.
  • Have access to their Node instance restricted to only the Hub via a Security Group
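
The Hub-driven scaling scenario in the first bullet can be sketched as a small decision function. This is only an illustration in Python; the function name, the step of 2 and the `max_spare` value standing in for ‘x’ are assumptions, and nothing here talks to AWS or the Hub.

```python
def desired_nodes(total: int, busy: int, *, scale_up_at: float = 0.9,
                  step: int = 2, max_spare: int = 4, floor: int = 0) -> int:
    """Return the node count the Auto Scaling Group should move to."""
    if total == 0:
        return floor                     # intentionally scaled to nothing
    spare = total - busy
    if busy / total >= scale_up_at:
        return total + step              # Hub is nearly full: add capacity
    if spare > max_spare:
        # Too many idle nodes: trim back until only max_spare are idle
        return max(floor, total - (spare - max_spare))
    return total
```

With these defaults, 9 busy nodes out of 10 asks for 12, while 2 busy out of 10 trims the group to 6.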

One best practice we had a decade ago that has been forgotten is always running your scripts through a scriptable proxy. This lets you blackhole unnecessary scripts which slow down your tests, intercept HTTP codes and control how much bandwidth is simulated. (Having spent almost a week in a hotel with pretty crap internet, it’s amazing how much of the internet assumes functioning bandwidth.)

Access into this proxy should only be from the Node instances and wherever your scripts are being run from (such as CodeBuild) to configure it.

Some of this functionality is starting to be built into the browsers with bridges to WebDriver through the Javascript Executor and Google Developer Tools. This of course assumes you are only running scripts in Chrome. It’s a far better idea to just run a proxy to get greater functionality and cross-browser capability.

Another reason for Terraform over something like CloudFormation is you can run things in external cloud providers such as MacStadium which uses VMWare as the base of their cloud. So using the same tool to configure your Linux Selenium Hub and Windows Selenium Nodes you can also create Mac Nodes.

Because it is external to your private subnet where everything is, and in fact external to your VPC, a Load Balancer needs to be created in the public subnet to allow communication from MacStadium into the Hub for registration.

Selenium 4.0 is coming. And it will change this diagram a bit. As mentioned above, the Hub itself can be broken into 4 separate services which can be independently configured and scaled. A ‘Hub’ comprised of 2 AWS Lambda functions, an AWS SQS queue and an AWS ElastiCache Redis instance is going to be the scalable model of the future I think.

But before that happens, there are a couple of things that need to happen.

Communication between all parts of the Selenium infrastructure needs to be securable. Currently everything is HTTP but it needs to be HTTPS (if not by default, then at least configurable.) If anyone wants to do that, patches are welcome and would save me the work of doing it.

Similarly, there needs to be some way of authorizing Nodes into the Hub. Right now, any Node can register itself with the Hub and start getting traffic. It’s an interesting attack vector to think about: you discover someone launching a Hub in a public subnet, light up a Node and attach it, and are now seeing a company’s next version of their app because they are sending it to you. The vector gets even more interesting when taking into consideration there is work being done to allow communication back to the Hub from the Node. If I can overflow a buffer and run arbitrary commands on the shell somehow, your network is now fully compromised. Again, feel free to submit a patch along the lines of https://www.elastic.co/guide/en/beats/filebeat/current/configuring-ssl-logstash.html so I don’t have to do it.

And that was the talk. Next steps with it are unknown. I’m seriously considering turning it into a video series and maybe offering it as a workshop at future SeleniumConfs.

Is your Automation Infrastructure ‘Well Architected’? – SeConfChicago edition

This week I was in Chicago to get back onto my soapbox around how automation patterns have been largely created so the risk has shifted to the infrastructure the scripts run on. There is too much content for an hour, which is the length I thought I had until the day before, when I realized I had 45 minutes. And then on stage the counter said 40 minutes.

Anyhow, this talk is supposed to overwhelm by intention. The idea being here is a whole bunch of things you can think about on your own time with the assistance of the actual well architected framework (which, again, is cloud neutral if you swap out cloud provider component names.)

See you next month in Malmo where I’m giving it again at Øredev.

(I’ll embed the recording once it’s available.)

Laravel and Logstash

As we get larger clients, our need to not be cowboying our monitoring / alerting is increasing. In our scenario we are ingesting logs via Logstash and sending them all to an AWS Elasticsearch instance, and if it is of severity ERROR we send it to AWS Simple Notification Service (which people or services can subscribe to) as well as send them to PagerDuty.

Input
For each of our services we have an input config which basically says ‘consume this file pattern, call it a laravel file, and add its stack name to the event.’

input {
  file {
    path => "<%= scope['profiles::tether::www_root'] %>/storage/logs/laravel-*.log"
    start_position => "beginning"
    type => "laravel"
    codec => multiline {
      pattern => "^\[%{TIMESTAMP_ISO8601}\] "
      negate => true
      what => previous
      auto_flush_interval => 10
    }
    add_field => {"stack" => "tether"}
  }
}
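
To make the multiline codec settings concrete: any line that does not start with a bracketed timestamp is glued onto the previous event, which is what keeps a stack trace attached to its error. A rough Python equivalent follows. The regex is only an approximation of grok's TIMESTAMP_ISO8601, and the function name is made up.

```python
import re

# Rough stand-in for "^\[%{TIMESTAMP_ISO8601}\] " as Laravel writes it.
STARTS_EVENT = re.compile(r"^\[\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}[^\]]*\] ")


def group_events(lines):
    """Fold continuation lines (stack traces) into the preceding event."""
    events = []
    for line in lines:
        line = line.rstrip("\n")
        if STARTS_EVENT.match(line) or not events:
            events.append(line)          # a new event begins here
        else:
            events[-1] += "\n" + line    # negate => true, what => previous
    return events
```

This leaves out `auto_flush_interval`, which in Logstash emits a pending event after 10 seconds even if no new timestamped line has arrived.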

Filter
Since it’s a type laravel file, we pull out the environment it’s running in, and log severity, plus grab the ip of the instance, build the SNS message subject and make sure the event timestamp is the one in the log, not the time logstash touched the event. (Without that last step, you end up with > 1MM entries for a single day the first time you run things.)

filter {
  # Laravel log files
  if [type] == "laravel" {
    grok {
      match => { "message" => "\[%{TIMESTAMP_ISO8601:timestamp}\] %{DATA:env}\.%{DATA:severity}: %{GREEDYDATA:message}" }
    }
    ruby {
      code => "event.set('ip', `ip a s eth0 | awk \'/inet / {print$2}\'`)"
    }
    mutate {
      add_field => { "sns_subject" => "%{stack} Alert (%{env} - %{ip})" }
    }
    date {
      match => [ "timestamp", "yyyy-MM-dd HH:mm:ss" ]
      target => "@timestamp"
    }
  }    
}
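
For anyone who does not read grok fluently, here is roughly what the `grok` and `date` steps pull out of a Laravel log line, sketched in Python. The regex is an approximation of the grok pattern above, and the function name is made up.

```python
import re
from datetime import datetime

# DATA is a lazy ".*?" and GREEDYDATA is ".*" in grok terms.
LARAVEL_LINE = re.compile(
    r"\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<env>.*?)\.(?P<severity>.*?): (?P<message>.*)"
)


def parse(line: str):
    """Split a log line into timestamp, env, severity and message."""
    match = LARAVEL_LINE.match(line)
    if match is None:
        return None                      # grok would tag this _grokparsefailure
    event = match.groupdict()
    # Same job as the date filter: use the time in the log, not "now".
    event["@timestamp"] = datetime.strptime(event["timestamp"], "%Y-%m-%d %H:%M:%S")
    return event
```

The `severity` field is what the ERROR checks in the output section key off.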

Output
And then we pump it around where it needs to be.

If you are upgrading ES from 5.x to 6.x you need to have the template_overwrite setting else the new schema doesn’t get imported and there were some important changes that were made. The scope stuff is for Puppet to do replacements. And there is a bug in 6.4.0 of the amazon_es plugin around template_overwrite…

output {
  amazon_es {
    hosts => ["<%= scope['profiles::laravel::es_host'] %>"]
    region => "us-west-2"
    index => "logstash-<%= scope['environment'] %>-%{+YYYY.MM.dd}"
    template => "/etc/logstash/templates/elasticsearch-template-es6x.json"
    template_overwrite => true
  }
}
output {
  if [severity] == "ERROR" { 
    sns {
      arn => "arn:aws:sns:us-west-2:xxxxxxxxx:<%= scope['environment'] %>-errors"
      region => 'us-west-2'
    }
  }
}

I’m not quite happy with our PagerDuty setup as the de-duping is running at an instance level right now. Ideally, it would have the reason for the exception as well but that’s a task for another day.

output {
  if [severity] == "ERROR" { 
      pagerduty {
        event_type => "trigger"
        description => "%{stack} - %{ip}"
        details => {
          timestamp => "%{@timestamp}"
          message => "%{message}"
        }
        service_key => "<%= scope['profiles::laravel::pagerduty'] %>"
        incident_key => "logstash/%{stack}/%{ip}"
      }
  }
}

For the really curious, here is my Puppet stuff for all this. Every machine which has Laravel services has the first manifest, but there are some environments which have multiple services on them which is why the input file lives at the service level.

modules/profiles/manifests/laravel.pp

  class { 'logstash':
    version => '1:6.3.2-1',
  }
  $es_host = hiera('elasticsearch')
  logstash::configfile { 'filter_laravel':
    template => 'logstash/filter_laravel.erb'
  }
  logstash::configfile { 'output_es':
    template => 'logstash/output_es_cluster.erb'
  }
  if $environment == 'sales' or $environment == 'production' {
    logstash::configfile { 'output_sns':
      template => 'logstash/output_sns.erb'
    }

    $pagerduty = lookup('pagerduty')
    logstash::configfile { 'output_pagerduty':
      template => 'logstash/output_pagerduty.erb'
    }
  }
  unless $environment == 'development' {
    file { [ '/etc/logstash/templates' ]:
      ensure => 'directory',
      group  => 'root',
      owner  => 'root',
      mode   => 'u=rwx,go+rx'
    }

    file { [ '/etc/logstash/templates/elasticsearch-template-es6x.json' ]:
      ensure => 'present',
      group  => 'root',
      owner  => 'root',
      mode   => 'u=rwx,go+rx',
      source => 'puppet:///modules/logstash/elasticsearch-template-es6x.json',
      require => Class['Logstash']
    }

    logstash::plugin { 'logstash-output-amazon_es': 
      source => 'puppet:///modules/logstash/logstash-output-amazon_es-6.4.1-java.gem',
      ensure => '6.4.1'
    }
  }

modules/profiles/manifests/.pp

  logstash::configfile { 'input_tether':
    template => 'logstash/input_tether.erb'
  }

The next thing I need to work on is consuming the ES data back into our app so we don’t have to log into Kibana or the individual machines to see the log information. I think every view-your-logs solution I’ve seen for Laravel has been based around reading the actual logs on disk which doesn’t work in a clustered environment or where you have multiple services controlled by a hub one.

Structured Logs in Laravel (Part 2)

The previous post showed you how to tweak Laravel’s default logging setup to output in json which is a key part of creating structured logs. With structured logs you can move yourself towards a more Observable future by tacking on a bunch of extra stuff to your logs which can then be parsed and acted upon by your various systems.

And Laravel supports this out of the box; it just isn’t called out that obviously in the docs. (It is the ‘contextual information’ section on the Logging doc page, or ‘Errors and Logging’ for pre-5.6 docs.) Basically, you create an array and pass it as the second argument to your logging call and it gets written out in the ‘context’ part of the log.

ubuntu@default:/var/www/tether$ sudo php artisan tinker
Psy Shell v0.9.7 (PHP 7.1.20-1+ubuntu16.04.1+deb.sury.org+1 — cli) by Justin Hileman
>>> use Ramsey\Uuid\Uuid;
>>> $observationId = Uuid::uuid4()->toString();
=> "daed8173-5bd0-4065-9696-85b83f167ead"
>>> $structure = ['id' => $observationId, 'person' => 'abc123', 'client' => 'def456', 'entry' => 'ghi789'];
=> [
     "id" => "daed8173-5bd0-4065-9696-85b83f167ead",
     "person" => "abc123",
     "client" => "def456",
     "entry" => "ghi789",
   ]
>>> \Log::debug('some debug message here', $structure);
=> null

which gets output like this

{"message":"some debug message here","context":{"id":"daed8173-5bd0-4065-9696-85b83f167ead","person":"abc123","client":"def456","entry":"ghi789"},"level":100,"level_name":"DEBUG","channel":"development","datetime":{"date":"2018-09-03 18:31:31.079921","timezone_type":3,"timezone":"UTC"},"extra":[]}

Of course there is no ‘standard’ for structured logs (nor should there be as they really are context sensitive), but most of the examples I’ve seen all have some sort of id for giving context for tracing things around.
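To show why that id is handy, here is a minimal sketch of pulling every entry for one id out of the JSON log. It is in Python only because that is a convenient scripting language for this; it is not part of our Laravel stack. The first sample line is the output shown above, the second is a made-up neighbour, and entries_for_id is a hypothetical helper name.

```python
import json

# First line: the output shown above. Second line: invented for this
# example, with a different id so the filter has something to exclude.
LOG_LINES = [
    '{"message":"some debug message here","context":{"id":"daed8173-5bd0-4065-9696-85b83f167ead","person":"abc123","client":"def456","entry":"ghi789"},"level":100,"level_name":"DEBUG","channel":"development","datetime":{"date":"2018-09-03 18:31:31.079921","timezone_type":3,"timezone":"UTC"},"extra":[]}',
    '{"message":"unrelated message","context":{"id":"00000000-0000-4000-8000-000000000000"},"level":200,"level_name":"INFO","channel":"development","datetime":{"date":"2018-09-03 18:31:32.000000","timezone_type":3,"timezone":"UTC"},"extra":[]}',
]


def entries_for_id(lines, observation_id):
    """Return the parsed log entries whose context id matches observation_id."""
    matches = []
    for line in lines:
        entry = json.loads(line)
        context = entry.get("context")
        # An empty PHP array is encoded as [] (see "extra" above), so only
        # look inside the context when it is actually a JSON object.
        if isinstance(context, dict) and context.get("id") == observation_id:
            matches.append(entry)
    return matches


for entry in entries_for_id(LOG_LINES, "daed8173-5bd0-4065-9696-85b83f167ead"):
    print(entry["level_name"], entry["message"])
```

No regexes, no guessing where the message ends; the id is just a key lookup.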

Note: The id in this case is solely for dealing with log message output. This is not for application request tracing which I think is also really interesting but have not delved into yet.

Structured Logs in Laravel

I’ve been following the likes of Charity Majors on the twitters and one of her big things around Observability is producing logs in a ‘structured’ format. (Loosely defined as ‘something that a machine can easily read and make decisions on’.)

Out of the box, Laravel ships with a logging system that uses Monolog and its LineFormatter which is Apache-esque.

const SIMPLE_FORMAT = "[%datetime%] %channel%.%level_name%: %message% %context% %extra%\n";

Which means writing regexes to parse it, because the format is designed more for human consumption than machine consumption.
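To make the regex pain concrete, here is a rough sketch of what parsing that format looks like. It is in Python rather than PHP purely for illustration, the sample line is hand-written to follow SIMPLE_FORMAT rather than captured from a real log, and parse_line is a hypothetical helper name.

```python
import json
import re

# Hand-written sample that follows SIMPLE_FORMAT; not a real captured line.
SAMPLE = '[2018-09-03 18:31:31] development.DEBUG: some debug message here {"id":"abc123"} []'

# One named group per placeholder in SIMPLE_FORMAT. The message group is
# lazy, so it stops at the first thing that looks like the context.
LINE_RE = re.compile(
    r"^\[(?P<datetime>[^\]]+)\] "
    r"(?P<channel>[^.\s]+)\.(?P<level_name>[A-Z]+): "
    r"(?P<message>.*?) "
    r"(?P<context>\{.*\}|\[\]) "
    r"(?P<extra>\{.*\}|\[\])$"
)


def parse_line(line):
    """Split one line-format entry into its fields, or return None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Context and extra are embedded JSON, so they need a second parse.
    fields["context"] = json.loads(fields["context"])
    fields["extra"] = json.loads(fields["extra"])
    return fields


print(parse_line(SAMPLE))
```

And this is the easy case. A message that itself contains a brace can trip up the split between message and context, which is exactly the kind of guesswork a structured format removes.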

The hints of how to change the format to a structured (json) one are right in the docs but as the expression goes, ‘an example would be handy here’. So here you go.

/*
|--------------------------------------------------------------------------
| Logging Changes
|--------------------------------------------------------------------------
|
| Structured logs ftw
|
*/
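// Fully qualified class names so this can be pasted in without adding
// `use` statements at the top of bootstrap/app.php.
$app->configureMonologUsing(function ($monolog) {
    $days = config('app.log_max_files', 5);

    // default: the usual human-readable line format
    $path = storage_path() . '/logs/laravel.log';
    $handler = new \Monolog\Handler\RotatingFileHandler($path, $days);
    $handler->setFormatter(new \Monolog\Formatter\LineFormatter(null, null, true, true));
    $monolog->pushHandler($handler);

    // structured: one JSON object per log entry
    $path = storage_path() . '/logs/laravel.json';
    $handler = new \Monolog\Handler\RotatingFileHandler($path, $days);
    $handler->setFormatter(new \Monolog\Formatter\JsonFormatter());
    $monolog->pushHandler($handler);
});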

Drop this right before the ‘return $app;’ in bootstrap/app.php and you’ll have two logs, one the default way, and one the new structured way. I’m including both at the moment because we have a bunch of log capture / manipulation stuff around the default structure I haven’t changed yet. Once that’s all updated I’ll get rid of the default section.

There’s been a bunch of noise in the Laravel world around ‘supporting the enterprise’. Adding a ‘log format’ is one of those small enterprise support things that adds huge value.

(And yes, I know, pull requests are welcome, but until then, here is a blog post.)

mobilexco/laravel-scout-elastic; an AWS Elasticsearch driver for Laravel Scout

A large piece of what we’ll be doing the last half of this year is improving the support workflows inside Tether (our MarTech platform) and that includes Search. Being a Laravel shop, it makes sense to start with Scout to see if that gets us close, if not completely over the line.

We use Elasticsearch for other things in Tether so it made sense to use that as the Scout backend through the super helpful ErickTamayo/laravel-scout-elastic package. And it worked as advertised right out of the box for local development (inside Vagrant with a local Elasticsearch server). But as soon as we moved the code to a cloud environment that used an AWS Elasticsearch instance it was throwing all sorts of wacky errors. Turns out, AWS Elasticsearch is different than Elastic Elasticsearch — not completely, just how communication is sent over the wire.

No problem. We’re clearly not the first ones to discover this problem, and sure enough there are a number of forks of the package that add this in. But they commit one of the following sins:

  • Required AWS credentials to be checked into your repo in a .env file
  • Used env() to fetch configuration settings which breaks if you are caching configs (and you really should be)
  • Required manual intervention with Elasticsearch while deploying

These are the result of Laravel being in the awkward teenage years, and all very solvable. I just wish it didn’t seem that things I need require these fixes…

Anyhow, mobilexco/laravel-scout-elastic uses the defaultProvider() which means it will work with IAM roles on your AWS infrastructure to authenticate with Elasticsearch. This is the official AWS recommended approach and does not require the presence of keys on the server (and all the pain around rotation, etc. that comes with using keys).

It also publishes conf/laravel-scout-elastic.php for setting flags it needs to decide whether to use Elastic or AWS implementations of Elasticsearch rather than env() so config:cache works. (This should likely be better called out in the Laravel docs for creating packages…)

The package also includes a new Artisan command (scout:create-index) which can be called via something like AWS CodeDeploy (in the AfterInstall hook) to ensure the index that Scout will be using is created. Which is useful if you are in an environment where your Elasticsearch access is restricted to only the boxes that need access and those boxes don’t have ssh installed on them. (Artisan commands are run via either CodeDeploy or SSM.)

Hopefully this saves someone the 12 hours of distracted development it took to come up with this solution.

Client-specific domains with CloudFormation for clients that use Google as email provider

A number of our clients want vanity domains for their experiences, which adds a layer (or two) of operations overhead beyond just having a line item in the invoice. In the spirit of ‘infrastructure as code’-all-the-things, this is now my process for registering new domains for our clients:

  1. Register the domain through Route 53
  2. Delete the hosted zone that is automatically created. (It would be nice if there was an option when getting the domain to not automatically create the hosted zone.)
  3. Log in to Google Apps and add the domain as an alias. When prompted to verify, choose Gandi as the provider and get the TXT record that is needed
  4. Create a CloudFormation stack with this template. Some interesting bits:
    • Tags in the AWS Tag Editor are case sensitive so ‘client’ and ‘Client’ are not equivalent
    • I think all my stacks will include the ‘CreationDateParameter’ parameter from now on which gets added as a tag to the Resource[s] that can accept them. This is part of the ‘timebombing’ of resources to make things more resilient. In theory I can also use AWS Config to find Resources that are not tagged and therefore presumably not under CloudFormation control.
    • Same thing for the ‘client’ tag. Though still not keen on that name or billing_client or such.
    {
      "AWSTemplateFormatVersion": "2010-09-09",
      "Parameters": {
        "ClientNameParameter": {
          "Type": "String",
          "Description": "Which client this domain is for"
        },
        "DomainNameParameter": {
          "Type": "String",
          "Description": "The domain to add a HostedZone for"
        },
        "GoogleSiteVerificationParameter": {
          "Type": "String",
          "Description": "The Google Site Verification TXT value"
        },
        "CreationDateParameter" : {
          "Description" : "Date",
          "Type" : "String",
          "Default" : "2017-08-27 00:00:00",
          "AllowedPattern" : "^\\d{4}(-\\d{2}){2} (\\d{2}:){2}\\d{2}$",
          "ConstraintDescription" : "Date and time of creation"
        }
      },
      "Resources": {
        "clienthostedzone": {
          "Type": "AWS::Route53::HostedZone",
          "Properties": {
            "Name": {"Fn::Join": [".", [{"Ref": "DomainNameParameter"}]]},
            "HostedZoneTags": [
              {
                "Key": "client",
                "Value": {"Ref": "ClientNameParameter"}
              },
              {
                "Key": "CloudFormation",
                "Value": { "Ref" : "CreationDateParameter" }
              }
            ]
          }
        },
        "dnsclienthostedzone": {
          "Type": "AWS::Route53::RecordSetGroup",
          "Properties": {
            "HostedZoneId": {
              "Ref": "clienthostedzone"
            },
            "RecordSets": [
              {
                "Name": {"Fn::Join": [".", [{"Ref": "DomainNameParameter"}]]},
                "Type": "TXT",
                "TTL": "900",
                "ResourceRecords": [
                  {"Fn::Sub": "\"google-site-verification=${GoogleSiteVerificationParameter}\""}
                ]
              },
              {
                "Name": {"Fn::Join": [".", [{"Ref": "DomainNameParameter"}]]},
                "Type": "MX",
                "TTL": "900",
                "ResourceRecords": [
                  "1 ASPMX.L.GOOGLE.COM",
                  "5 ALT1.ASPMX.L.GOOGLE.COM",
                  "5 ALT2.ASPMX.L.GOOGLE.COM",
                  "10 ALT3.ASPMX.L.GOOGLE.COM",
                  "10 ALT4.ASPMX.L.GOOGLE.COM"
                ]
              }
            ]
          }
        }
      }
    }
  5. Update the domain’s nameservers to the ones in our newly created Hosted Zone. I suspect this could be done via a Lambda backed custom resource, but that’s a couple steps too complicated for me right now. If I have to do this more than once every couple weeks it’ll be worth the learning time.
  6. Validate the domain with Google.
  7. Manually create a certificate for ${DomainNameParameter} and *.${DomainNameParameter}. (For reals, this should be an automatic thing for domains registered in Route 53 and hosted within Route 53.)
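
If step 4 ever gets scripted, the stack parameters could be assembled something like this. This is a sketch in Python, not what I actually run: stack_parameters is a hypothetical helper, and the client, domain and verification values are placeholders. The list it returns is in the ParameterKey / ParameterValue shape that CloudFormation’s CreateStack call (for example boto3’s create_stack) expects for Parameters.

```python
import re
from datetime import datetime, timezone

# Same pattern as CreationDateParameter's AllowedPattern in the template.
CREATION_DATE_PATTERN = re.compile(r"^\d{4}(-\d{2}){2} (\d{2}:){2}\d{2}$")


def stack_parameters(client, domain, verification, created=None):
    """Build the Parameters list for the hosted zone template above."""
    created = created or datetime.now(timezone.utc)
    creation_date = created.strftime("%Y-%m-%d %H:%M:%S")
    # Check locally so a bad value fails here rather than as a stack error.
    if not CREATION_DATE_PATTERN.match(creation_date):
        raise ValueError("creation date does not match AllowedPattern")
    values = {
        "ClientNameParameter": client,
        "DomainNameParameter": domain,
        "GoogleSiteVerificationParameter": verification,
        "CreationDateParameter": creation_date,
    }
    return [
        {"ParameterKey": key, "ParameterValue": value}
        for key, value in values.items()
    ]


# Placeholder values, not a real client or verification token.
for parameter in stack_parameters("acme", "example.com", "abc123"):
    print(parameter)
```

Generating the creation date here rather than relying on the template default means the tag reflects when the stack was actually created.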

And then I need to create an ALB for the domain and point it at the right service. But that’s getting rather yak shave-y. The ALB needs to be added to the ASG for the service but those are not under CloudFormation control so I need to get them under control.