Secure Node Registration in Selenium Grid 4

This is another in the small series of ‘things that have changed in Selenium Grid but I have not yet added to the official docs’ posts.

When Selenium Grid was first created, the expectation was that your Grid Hub was nicely secured inside your network, along with your Nodes, so everything could be trusted to communicate. But now that we live in a cloud environment, that assumption isn’t quite as tight as it once was. You could have everything tightly locked down in your AWS account, but if someone who can create instances gets their access key compromised, well, it’s a problem. I know that if I were a bad guy, I would be scanning for open Grid Hubs and then figuring out how to register with them. There is a wealth of information to be had: competitive intelligence on new features not available in the wild, account details for production testing, etc.

Last night I pushed a change that prevents rogue Grid Nodes from registering with your Grid Hub (so it will be available in the next alpha, or you can build it yourself now). I don’t know if this has ever happened in the wild, but the fact that it could is enough that it needed to be closed down.

In order to secure Node registration, you need to supply the new --registration-secret argument in a couple of different places. If the secrets do not match, the Node is not registered. This secret should be treated like any other password in your infrastructure, which is to say, not checked into a repo or otherwise exposed. Instead it should be kept in something like HashiCorp Vault or AWS Secrets Manager and only accessed (via automated means) when needed.
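
As a rough illustration of that 'automated means', here is a sketch of pulling the secret out of AWS Secrets Manager with boto3 at Node start-up. The secret name, region, and wrapper script are all hypothetical, not something Selenium ships.

# Sketch: fetch the registration secret from AWS Secrets Manager at start-up
# instead of hard-coding it anywhere. "selenium/registration-secret" is a
# made-up secret name.
import subprocess

import boto3

secrets = boto3.client("secretsmanager", region_name="us-west-2")
registration_secret = secrets.get_secret_value(
    SecretId="selenium/registration-secret"
)["SecretString"]

# Hand the secret to the Node process without ever writing it to disk.
subprocess.run([
    "java", "-jar", "selenium.jar", "node",
    "--detect-drivers",
    "--registration-secret", registration_secret,
], check=True)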

Standalone Server

When running your Hub as a single instance, there is only one process, so only one place needs the secret handed to it.

  java -jar selenium.jar \
       hub \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem \
       --registration-secret cheese

Distributed Server

When running your Hub in a distributed configuration, the Distributor and Router servers need to have it.

  java -jar selenium.jar \
       distributor \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem \
       -s https://sessions.grid.com:5556 \
       --registration-secret cheese
  java -jar selenium.jar \
       router \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem \
       -s https://sessions.grid.com:5556 \
       -d https://distributor.grid.com:5553 \
       --registration-secret cheese

Node

Regardless of your approach to running the Server, the Node needs it too. (Obviously.)

  java -jar selenium.jar \
       node \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem \
       --detect-drivers \
       --registration-secret cheese

Detection

When a Node fails to register, two things happen;

  1. A log entry is created at the ERROR level saying a Node did not register correctly. Your Selenium infrastructure needs the same attention to its logging as your production infrastructure, so this should trip an alert to someone, in whatever manner a potential security problem would in any other environment.
  2. An event is dropped onto the bus. The Selenium Server ships with a ZeroMQ bus built in, but when deploying in a real environment I would suggest using something like AWS SQS (or your cloud’s equivalent) as your queuing system, and then having something like AWS Lambda watch for these events and trigger actions accordingly.

It should be noted that these all happen on the side of the Server, not the Node. The rogue Node is not given any indication that secrets are configured or that the secret it sent was incorrect.
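
If you do route those events to SQS and watch them with Lambda, a minimal handler might look something like the sketch below. The message body shape and the SNS topic are assumptions; inspect what your bus-to-SQS bridge actually publishes before relying on it.

# Sketch of a Lambda function triggered by an SQS queue carrying
# node-rejected events. The payload format is assumed, not the actual
# Selenium event shape -- adjust to whatever your bridge publishes.
import json

import boto3

sns = boto3.client("sns")
ALERT_TOPIC = "arn:aws:sns:us-west-2:123456789012:grid-security-alerts"  # hypothetical


def handler(event, context):
    for record in event["Records"]:          # standard SQS -> Lambda batch shape
        body = json.loads(record["body"])
        sns.publish(
            TopicArn=ALERT_TOPIC,
            Subject="Selenium Grid: node registration rejected",
            Message=json.dumps(body, indent=2),
        )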

I was on the ‘Test Guild Automation Podcast’

Woke up this morning to a note from Joe that my episode of the Test Guild Automation Podcast is now live. It should come as no surprise that I talk about Selenium infrastructure with him. I felt like I was rambling, but I’m pretty sure Joe kept pulling me back on topic — but since I don’t like how my voice sounds on recordings, you’ll have to let me know how it turned out.

Secure Communications with Selenium Grid 4

For the last couple of years, my schtick has been that I don’t care about your scripts, just your infrastructure. I’m pretty sure that in my talk at SeConf London I mused that it was bonkers that we had gotten away with communicating to the Se Server via HTTP. (I have to deal with vendor audits at work and they get really antsy at any mention of HTTP.) At SeConf Chicago I crashed the Se Grid workshop and asked (knowingly) if I was correct that communication was only via HTTP, hoping someone would fix it for me. Alas, no one took the bait, so at SeConf in London I was describing the problem to Simon, who happened to be creating a ticket (actually, a couple) as I talked, and then I got an alert saying it was assigned to me. The squeaky wheel applies its own grease, it seems.

There are a couple of catch-22s in place before I can update the official Selenium Documentation (have you seen the new doc site? It’s great!), so in lieu of that, here is a quick how-to on something that will be in Selenium 4 Alpha 2 (or now, if you build it yourself).

What is below is the output of ‘info security’ on the new server. (The ‘info’ command is also new and as of yet undocumented.)


Selenium Grid by default communicates over HTTP. This is fine for a lot of use cases, especially if everything is contained within the firewall and run against test sites with test data. However, if your server is exposed to the Internet or is being used in environments with production data (or data that includes PII), then you should secure it.

Standalone

In order to run the server using HTTPS instead of HTTP you need to start it with the --https-private-key and --https-certificate flags to provide it the certificate and private key (as a PKCS8 file).

  java -jar selenium.jar \
       hub \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem

Distributed

Alternatively, if you are starting the pieces individually, you also specify HTTPS when telling each one where to find the others.

  java -jar selenium.jar \
       sessions \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem
  java -jar selenium.jar \
       distributor \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem \
       -s https://sessions.grid.com:5556
  java -jar selenium.jar \
       router \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem \
       -s https://sessions.grid.com:5556 \
       -d https://distributor.grid.com:5553

Certificates

Selenium Grid will not operate with self-signed certificates; as a result, you will need to have certificates provisioned for you by a Certificate Authority of some sort. For experimentation purposes you can use MiniCA to create and sign your certificates.

  minica --domains sessions.grid.com,distributor.grid.com,router.grid.com

This will create minica.pem and minica.key in the current directory, as well as cert.pem and key.pem in a directory called sessions.grid.com, which will have both distributor.grid.com and router.grid.com as alternative names. Because Selenium Grid requires the key to be in PKCS8 format, you have to convert it.

  openssl pkcs8 \
    -in sessions.grid.com/key.pem \
    -topk8 \
    -out sessions.grid.com/key.pkcs8 \
    -nocrypt

And since we are using a non-standard CA, we have to teach Java about it. To do that, you add it to the cacerts truststore, which by default is $JAVA_HOME/jre/lib/security/cacerts.

  sudo keytool \
      -import \
      -file /path/to/minica.pem \
      -alias minica \
      -keystore $JAVA_HOME/jre/lib/security/cacerts \
      -storepass changeit

Clients

None of the official clients have been updated yet to support this, but if you are using a CA that the system knows about you can just use an HTTPS Command Executor and everything will work. If you are using a non-standard one (like MiniCA) you will probably have to jump through a hoop or two, similar to the Python below, which basically says “Yes, yes, I know you don’t know about the CA but I do, so just continue along anyways.”

from selenium import webdriver

import urllib3
urllib3.disable_warnings()

options = webdriver.FirefoxOptions()
driver = webdriver.Remote(
    command_executor='https://router.grid.com:4444',
    options = options
)

driver.close()

Scrum is an anti-pattern for Continuous Delivery

I’ve been saying that ‘Scrum is an anti-pattern for Continuous Delivery’ for a while, including in last week’s post, which got a ‘huh?’, so here is my beef with Scrum.

Actually, my complaint isn’t with Scrum itself but with Sprints, and if you remove those then the whole house of cards falls down. (This is similar to my stance on Java, which I do dislike, but I loathe Eclipse, so Java is tolerable in something other than Eclipse. Barely.)

The whole point of Continuous Delivery, to me, is to ‘deliver’ improvements to whatever it is you do to your customers ‘continuously.’ Where continuously means, well, continuously. Not ‘at the end of an arbitrary time period which is usually about 2 – 3 weeks in length.’ This is why ‘Mean Time To Production’ is such an important metric to me and drives all other changes to the delivery pipeline.

“But Adam, how will we plan what we do if we don’t get to play Planning Poker every week?” Easy. Your customers will tell you. And that ‘customer’ could be internal. If something is important, you will know. If something is more important than something else, then it will bump that down the queue. This isn’t to say discussing things and figuring out how to slice them into smaller and smaller units isn’t necessary. It absolutely is. And learning how to do this is perhaps one of the hardest problems in software. Which leads to…

“But Adam, this is a big story that will take the full sprint to complete.” Slice it smaller, hide the work-in-progress behind feature flags and still push your local changes daily. (You should be using them anyways to separate feature launch from availability.)

“But Adam, we could deploy at any point — we just do it once a sprint.” Why? You are actively doing a disservice to your customers and your company by holding back things that could improve their experience and make you more money. Disclaimer: this becomes a more real argument when deploying to IoT or other hardware. I don’t want my thermostat to get updated 20 times a day. But if the vendor could do it, I’ll accept that.

“But Adam, we are in a regulated environment and have to do Scrum.” That’s a strawman argument against working with your auditors. See Dave’s recent Continuous Compliance article.

“But Adam, how will we know if we are getting better at estimating?” The same way you do with Scrum or anything else, which is to collect data. This is a bulk food type problem. When you go to buy, say, peanut butter from the bulk food store, you take in your container and they weigh it before you scoop your peanut-y deliciousness into it, and after. They then do the math to know the weight of just the peanut butter. The same thing can be done here. If you know how long your deploys take, you can measure from the time work on the code started to the time it was available in production, and then remove the fixed time of deployments to get the actual length of time something took. In its entirety, not just ‘in development’. (I don’t actually track this metric right now. Things take the length of time they take. But I think this is sound theory.)
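
For what it’s worth, the ‘weigh the container’ math is just subtraction; a toy version:

# Toy example: total elapsed time minus the fixed cost of the pipeline
# gives the time the work itself actually took. Dates are made up.
from datetime import datetime, timedelta

work_started = datetime(2019, 5, 6, 9, 15)
live_in_production = datetime(2019, 5, 8, 16, 40)
fixed_deploy_time = timedelta(minutes=25)   # measured from your own pipeline

actual_work_time = (live_in_production - work_started) - fixed_deploy_time
print(actual_work_time)   # 2 days, 7:00:00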

“But Adam, where do all our manual testers fit in this world?” They are just part of the process. This is a key difference between Continuous Deployment and Continuous Delivery. If your process says humans touch it, then humans touch it. But there also needs to be a way to short-circuit around them in the case of an emergency.

“But Adam, our database is so archaic and fragile that deployments are a huge risk and sprints minimize that.” That’s a good place to start to change things. A local company still does releases weekly overnight on Wednesdays after 5 years because of this. I’m pretty sure it stopped being a tech problem and became a people problem a couple of years ago.

So if not Scrum, then what? The ‘easy’ answer is Kanban. The harder answer is of course ‘it depends’ and likely looks like a tailored version of Kanban that solves your team’s problems. I really like the notion of a work item flowing across a board, but I also dislike enforcing WIP limits and artificially moving things left to make room for something else because the tooling requires it.

Let me know what other “But Adam’s” I missed in the comments.

Oh, I’ve got one more.

“But Adam, that is hard.” Yes. Yes it is. (It’s also super fun.)

‘So what would you do?’

Another ‘free consulting is content’ post. The context here is a 10-year-old company where a friend of mine is the VP of Engineering. Their delivery pipeline worked … but there were some horrible manual steps (as compared to manually-pushing-a-button steps, which are perfectly acceptable, if not desirable), things were too custom and black-box-y, and the deploy from CircleCI was just flat out broken right now. The gist of the conversation was ‘if you helped us out, what would it look like.’

What’s interesting is that this, and other conversations like it that I have had in the last month, have really distilled my thoughts around pipelines into a playbook of sorts, but that’s beyond the scope of this post. Aside from this looking a lot like what the playbook looks like.

Anyhow, here is the ‘only slightly edited’ bit of free consulting I gave.

  1. Check that things that should already be done are done

The root account has a hardware MFA token that is somewhere secure, CloudTrail is enabled and has the fun Lambda script to auto re-enable it if it is disabled, deletion protection is turned on, etc.

  2. CodeDeploy

Since deploying from CircleCI is busted anyways, get it producing CodeDeploy packages and manually install the agent on all the boxes.

  3. Packerize all images

Standardize on a Linux distro (anything other than Amazon Linux 2 is silly). Create base AMIs with CodeBuild triggered off of Github webhooks to the $company-Packer repo. Again, it doesn’t matter which configuration management tool Packer uses — as long as they can justify the choice. And as I mentioned, AWS has given a credible reason to use Ansible with the integration of running playbooks from Systems Manager.

  4. Replace CircleCI with CodePipeline (orchestration) and CodeBuild (build, test and package) — since deploy is already done via CodeDeploy
  5. Feature Flags

Managed via an admin screen into the database (not file-based) to dark launch features to cohorts and/or percentages before full availability.
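
A language-agnostic sketch (in Python, with an invented feature_flags table) of what that database-backed check with a percentage rollout could look like:

# Sketch of a database-backed feature flag check with a percentage rollout.
# The "feature_flags" table and its columns are hypothetical.
import hashlib
import sqlite3


def flag_enabled(db, flag_name, user_id):
    row = db.execute(
        "SELECT enabled, rollout_percent FROM feature_flags WHERE name = ?",
        (flag_name,),
    ).fetchone()
    if row is None or not row[0]:
        return False
    # Hash the user id so each user lands in a stable bucket from 0-99.
    bucket = int(hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < row[1]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE feature_flags (name TEXT, enabled INTEGER, rollout_percent INTEGER)")
db.execute("INSERT INTO feature_flags VALUES ('new-checkout', 1, 10)")  # dark launch to 10%
print(flag_enabled(db, "new-checkout", "user-42"))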

  6. Airplane Development

‘Can you do development on an airplane without internet access?’ — so no shared databases, no needing to reach out to the internet for JS or fonts or icons, etc. Look at developer onboarding at this point too. Vagrant is great. Docker is the hotness. But Vagrant means you can literally have the same configuration locally as you do in production. Docker can too, of course, if you are going Fargate/ECS.

  7. Health

Monitoring (all the layers, reactive and proactive — all but one of my major outages could have been predicted if I had been watching the right things), Logging (centralized, retention), Paging (when and why, and fix the broken windows), Testing-in-Production (it’s the only environment that counts), a Health Radiator (there should be a big screen with health indicators, both system and business, in your area), etc.

  8. Autoscaling

Up and down, at all the layers. Driven by monitoring and logging.

  9. Bring everything under Terraform control

Yes, only at this point. It ‘works’ now — just not the way you want it to. Everything above doesn’t ‘work’. Again, I’d use Terraform over CloudFormation, but for ‘all in on AWS’ CloudFormation is certainly an option. Now if only CloudFormation were considered a first-class citizen inside AWS and supported new features before competitors like Terraform do. CloudFormation still doesn’t have Route 53 Delegation Sets, as of the last time I checked.

  10. Disaster Recovery

‘Can you take the last backup and your Terraform scripts, light up $company in a net-new AWS account, and be down only for as long as it takes to copy RDS snapshots, losing only the data from the last in-flight backup?’

  11. Move to Aurora

Just because I like the idea of having the ability to have the database trigger Lambda functions.

  12. Observability

Slightly different than Health — basically, I would use Honeycomb because Charity, etc. are far too smart.

  13. Chaos Engineering

Self-healing, multi-region, etc. If Facebook can cut the power to their London datacenter and no one notices, $company can do something less dramatic with equal effect.

And then it’s ‘just’ keeping the ship sailing the way you want, making slight corrections in the course along the way.

We need a priest (QA) to bless (test) all our work

A friend of mine pinged me during his commute this morning about my thoughts on weaning a team off of thinking they need ‘a priest (QA) to bless (test) all our work’. ‘Free’ consulting means it gets to be content. :D

Obviously, this is a ‘people problem’, so the approach will vary place to place and even within a place. Regardless, you need to start by expunging ‘QA as Quality Assurance’ from the organization. They don’t actually ‘Assure’ anything. You, or a half dozen other people, could override them. So ‘Quality Assistance’ is a nicer reframing. Or better still, ‘Testing’.

Then you need to play detective and find out a) what the inciting event was that caused the first ‘QA’ person to be hired, and b) how they got anointed as priests. A smooth transition away from that requires knowing those two things.

Organizationally, I would be interested in;

  • how many things are found by the testers
  • what the categorization is (because those are developer blind spots)
  • how many things that are found actually hold up the build until fixed
  • and of those, how many could have shipped

From a purely technical perspective, some practices that address this;

  • dark launches via feature flags and have new stuff rolled out slowly to user slices
  • acknowledge that production is different than any other environment and is the only environment that matters. To quote Charity; ‘I test in production, and so do you.’
  • the only metric that matters in today’s world is ‘mean time to production’. Something isn’t ‘done’ unless it is in production being used by the target customer. Everything you want to do hinges on that. Put on your wall a whiteboard with ‘number of deploys today’, ‘number of deploys this week’, ‘number of deploys this month’ which you increment each time it goes to production
  • if you think your feature stories are small enough, you need to slice them more
  • not to overload the term, but increase the observability of the application in the more traditional way, not the Honeycomb way. If you are pushing to production fast and often, you need to know whether it’s behaving or not, fast and often. Number of logins per 5 minutes, number of registrations per 5 minutes, number of searches per 5 minutes, etc. Every new feature / fix needs to have a measure to know if it is working; a sketch of emitting such a measure follows this list. (It will take a long time to get to here.)
  • move to trunk based development. Everyone should be pushing code at least once every 2 days. Feature branches allow people to get sloppy.
  • Obviously, TDD is huge in this. (or TAD — I don’t care, just slow down and write some damn tests before committing)
  • Steal from Etsy’s playbook and have your pipeline such that day 1 at <redacted> is pushing to production, and day 2 is paperwork / onboarding. It forces you to get your development environment in shape so you can onboard someone from a bare machine to productive in an hour, and it also breaks the feeling of sanctity around production and creates shared ownership. I believe everyone at Etsy did this, not just developers. (Though obviously non-developers had a borrowed environment and were hand-held.)
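
Here is the kind of per-feature measure mentioned above, sketched as a CloudWatch custom metric. The namespace and metric name are made up; your monitoring stack may well be something else entirely.

# Sketch: emit a business-level metric every time a login succeeds so you can
# alarm on "logins per 5 minutes" dropping after a deploy.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")


def record_login():
    cloudwatch.put_metric_data(
        Namespace="MyApp/Business",      # hypothetical namespace
        MetricData=[{
            "MetricName": "SuccessfulLogins",
            "Value": 1,
            "Unit": "Count",
        }],
    )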

MTTP reduction is the whole purpose of building out a Continuous Delivery pipeline. A ‘QA Priest’ doesn’t fit, time-wise, with that. (It’s also why Scrum is a Continuous Delivery anti-pattern.)

But again, this is a Culture thing. To quote Jerry; ‘things are the way they are because that is the way they got there’ — figure that out and you can change the culture.

Practice what you Preach

Last week I was in London, England for SeleniumConf, where I gave a talk on test infrastructure. The feedback seemed to be good from the people I talked to, but I personally was uncomfortable on stage and felt it might have been my worst performance. I felt good about the content the morning of, and had a few jokes, etc. planned, but when I got going, the switching of windows (1x Firefox, 1x Chrome, 1x Keynote, 1x Terminal, 1x Sublime) completely threw me off my game and it was a downward spiral from there. It was likely too ambitious for a 40-minute slot and better suited to more of a hands-on, all-day workshop. Here is the sorta Commentary Track with extra links and such.

Most of my decks have a visual theme to them, but I couldn’t come up with anything. At one point I tried Rocky and Bullwinkle because I could say ‘Nothing up my sleeve’ and show Bullwinkle as a magician when showing an empty AWS account. But then I thought of ‘Practice what you Preach(er)’ and tried to shoehorn that in. But it’s really hard to find appropriate things to include, so I ripped most of it out. And, most of the talk was supposed to be code and/or architecture diagrams, so really any visual theme was a stretch. In the end, I left a couple of Preacher things in, but it wasn’t an obvious thing and was lost on most — so I likely should have ripped them all out.

These really are the ‘rules’ of presenting. In general, your talk will be improved if you avoid these things. The logic being: the audience knows who you are from the bio in the program, they want you to succeed so don’t sow the idea of failure into their minds, and things will inevitably go wrong (best to anticipate it and have a video of you doing a thing instead of doing it live).

“So let’s break some rules.”

It fell flat. And the Preacher reference felt forced. (It was.)

Breaking Rule 1. I really don’t care about scripts anymore. There has been tragically little innovation in the actual script creation and maintenance space. But what people don’t talk about is the business risk around where those scripts run. I don’t have any data to substantiate this claim, but my gut is that too many people are just spinning up EC2 or ECS instances to run their scripts without knowledge of the tooling around them needed to run them securely and efficiently.

Breaking Rule 2. I had such big plans for this talk, but have been battling burnout for a year now. It’s been especially bad the last couple months which is exactly when I needed to be prepping things for success. Which didn’t help things as burnout feeds off of burnout.

Burnout isn’t technically a clinical diagnosis, but I like this definition.

Thankfully there are organizations now specifically chartered to help tech people deal with their traitor brains. Such as https://osmihelp.org.

This is a ‘simple’ view of what a modern, self-hosted Selenium infrastructure in AWS could look like. I’m likely missing a few things, but it really hasn’t changed in the last 5 or 6 years. Selenium Grid 4.0 could make some interesting changes at scale as the Hub can be broken into 4 different services. Oh, and I don’t include Docker in here because I don’t believe in running scripts on environments your customers are not on. You are of course more than welcome to if enough of your customers are using headless browsers or Linux desktops. I’m also not current on how to set up Sauce Labs in a hybrid scenario (or even whether they support that configuration anymore) with their tunnel product adding their cloud as available browser nodes into the Hub — which I always thought was a sweet spot for them.

Here is the conclusion of the ‘Do not start with an apology’ rule and the origin of the name. In Austin I rambled (even for me) about infrastructure and just threw a tonne of ideas and topics at the audience. In Chicago I used the Well-Architected Framework from AWS (https://aws.amazon.com/architecture/well-architected/) to organize all those ideas. It is by AWS and uses AWS product offerings as examples, but it really is cloud neutral at its core. There was a tweet a month or so ago about a team that used it to cut their Azure bill by something like 40% by applying the principles in it to their cloud infrastructure. So the plan for this talk was to open up a terminal, run terraform apply, do the rest of the talk, then have the full diagram from the previous slide created, and run a test.

Yaaaaaa. About that. Remember burnout? Ends up I got maybe 1/5 of it done. And couldn’t run a test. So rather than the pitched ‘All code’ there was ‘Some code’.

Now we’re starting to get into Rule 3 territory about not running a live demo. And the tools I’m recommending these days to do it are the Hashicorp ones. I used to suggest using the cloud provider’s native ones (such as CloudFormation for AWS) but for the above reasons have switched. I of course reserve the right to change my mind again in the future.

Almost ready to build some infrastructure, live on stage, but first I have to talk about pre-conditions.

The ‘Root Account’ is the all powerful account with every permission possible. It is also the most dangerous one. The only thing you should do with it is enable hardware MFA (https://aws.amazon.com/iam/features/mfa/?audit=2019q1 has links to purchase), create an ‘administrators’ group that has the ‘AdministratorAccess’ managed policy and create an ‘IAM User’ in that group.

The ‘IAM User’ will have a virtual MFA token and access keys.

This is where juggling windows went crazy. Escape out of Keynote presenter mode to Firefox, which had the Root Account window, then to Safari, which had the IAM User window, then to the terminal to start Terraform and watch it scroll a bit until it applies the ‘Everything must have MFA’ policy (as described at https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_iam_mfa-selfmanage.html), then once things fail, demo getting a token in the shell (by switching between the terminal and Sublime, which had the commands I wanted) and finish running Terraform.

The network held, and things applied without a problem. But it was here that I realized window switching wasn’t going to work so had to adjust on the fly.

One of the first pieces of infrastructure to be created needs to be the networking layer itself. I strongly believe that AWS’ VPC Scenario 2 (https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenarios.html) is the right one for almost all Selenium deployments. Basically, everything of importance is in the ‘private’ (blue) subnet and is not reachable from the internet, and the only things that can reach it are either a bastion host or a VPN server sitting in the ‘public’ (green) subnet. You still have to keep things in the private subnet patched, etc., but there is a level of risk mitigation achieved when the bad guys (and gals) cannot even access the machines.

I gave the bastion host an Elastic IP (static) and likely should have also registered it with Route 53 (AWS’ DNS service), then tried to SSH into it. But it didn’t work — on purpose. That is because I created an empty security group with Terraform and attached it to the bastion host. Using the AWS CLI I added just my IP as an ingress rule, so now one of the instances I created was accessible from the internet, but only from a specific IP and with a specific certificate. It was also a chance to demonstrate in Terraform how to have it ignore certain changes, to prevent it from rebuilding parts of your infrastructure and kicking you out mid change. (There’s a lesson learned the hard way…)
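
The ‘add just my IP’ step, done with boto3 instead of the raw CLI, looks roughly like this; the security group id and address are placeholders:

# Sketch: open SSH on the bastion's (otherwise empty) security group to a
# single IP. Group id and CIDR are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.45/32", "Description": "my current IP"}],
    }],
)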

This slide talks about how the bastion (and other) instances were produced by another part of the Hashicorp suite — Packer. No demo, but the idea of ‘Cattle vs Pets’ for your instances was brought up again. My current Packer setup uses the puppet-masterless (https://www.packer.io/docs/provisioners/puppet-masterless.html) provisioner but I would consider switching to Ansible in the future as AWS just announced that Systems Manager can run Ansible playbooks directly from S3 or Github which is kinda game changing to me. puppet-masterless relies on SSH and ideally the last step of provisioning should be to remove all the keys from the box and deal with things strictly through Systems Manager. Again, if everything is in a private subnet that doesn’t allow for access into the boxes, it is another level of security.

I also suggested using something like Secrets Manager or Vault to store passwords and other secure things rather than putting them right in your Terraform manifests.

Which dovetailed into me copy-and-pasting a private key into the bastion host, and then showing a security group that allows access into the Hub only from the bastion.

Since we’re in AWS (and a lot of others are as well) we have to talk about security. And one of the most important parts of that in AWS is its API logging tool, CloudTrail. The Terraform scripts configured a single CloudTrail trail across all regions and store the logs in S3. Be careful about doing this though if you are in multiple regions, as you pay for traffic and this can silently add to your bill if you are not careful.

One trick AWS suggests is you have CloudTrail monitor itself and automatically re-enable itself if it is disabled. This is what is on this slide and is described in more detail on https://aws.amazon.com/blogs/mt/monitor-changes-and-auto-enable-logging-in-aws-cloudtrail/
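
The linked post wires this up through CloudWatch Events; the re-enabling piece itself is tiny. A sketch of the Lambda body, with a placeholder trail name:

# Sketch: Lambda that turns CloudTrail logging back on if someone calls
# StopLogging. Trigger it from a CloudWatch Events / EventBridge rule on
# the StopLogging API call. Trail name is a placeholder.
import boto3

cloudtrail = boto3.client("cloudtrail")


def handler(event, context):
    cloudtrail.start_logging(Name="my-org-trail")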

One thing anyone building out infrastructure needs to be aware of is how much their stuff is costing at any particular moment in time, and to be warned when something spirals out of control. This is where billing alerts come in, along with using tags on everything that supports them so you can see where your money is going. AWS billing is a black box, with entire consulting organizations existing to try and get a handle on it. Terraform created a billing alert for $40 CDN.

Out of the Terraform and into the theoretical.

I believe you should run your scripts in the environment your users will. This means Windows. Not headless or Linux. So using Packer you create a Windows based AMI. I started with https://github.com/joefitzgerald/packer-windows for these demo purposes. In a proper grid, actual licenses will be required.

Your nodes should be;

  • In Auto Scaling Groups, even if you are not doing the ‘auto’ part. This is useful as you can intentionally scale them to 0 if you know you never run scripts overnight (see the sketch after this list). But also think of a scenario where the Hub notifies AWS that it has used 90% of its available Nodes and to spin up another 2 or 3, and then removes capacity when it has more than x spare.
  • Use ‘Spot Instances’, which is a marketplace for companies who have bought Reserved Instances (pay by the year) but are not using them, lending out their compute time to recoup some of their investment. You should never pay more for a Spot instance than you would were it On Demand.
  • Have access to the Node instances restricted to only the Hub via a Security Group
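
Scaling the Node group to zero overnight, as mentioned in the first bullet, is a single API call; a sketch with a placeholder group name:

# Sketch: park the Node Auto Scaling Group at zero instances overnight.
# Run the reverse (a sensible DesiredCapacity) before the workday starts.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")
autoscaling.set_desired_capacity(
    AutoScalingGroupName="selenium-windows-nodes",   # placeholder name
    DesiredCapacity=0,
    HonorCooldown=False,
)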

One best practice we had a decade ago that has been forgotten is always running your scripts through a scriptable proxy. This lets you blackhole unnecessary scripts which slow down your tests, intercept HTTP codes, and control how much bandwidth is simulated. (Having spent almost a week in a hotel with pretty crap internet, it’s amazing how much of the internet assumes functioning bandwidth.)
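
Assuming the BrowserMob Proxy Python bindings (one of the scriptable proxies from that era), the blackholing and bandwidth shaping might be wired up something like this; the paths, regexes, and numbers are illustrative only:

# Sketch using the browsermob-proxy Python bindings: blackhole third-party
# scripts and throttle bandwidth before pointing the browser at the proxy.
from browsermobproxy import Server
from selenium import webdriver

server = Server("/path/to/browsermob-proxy/bin/browsermob-proxy")
server.start()
proxy = server.create_proxy()

# Return an empty 200 for anything served from a (hypothetical) analytics domain.
proxy.blacklist("https?://.*\\.analytics-vendor\\.example/.*", 200)

# Simulate a mediocre hotel connection.
proxy.limits({"downstream_kbps": 512, "upstream_kbps": 256, "latency": 150})

options = webdriver.ChromeOptions()
options.add_argument(f"--proxy-server={proxy.proxy}")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
driver.quit()

server.stop()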

Access into this proxy should only be from the Node instances and wherever your scripts are being run from (such as CodeBuild) to configure it.

Some of this functionality is starting to be built into the browsers with bridges to WebDriver through the Javascript Executor and Google Developer Tools. This of course assumes you are only running scripts in Chrome. It’s a far better idea to just run a proxy to get greater functionality and cross-browser capability.

Another reason for Terraform over something like CloudFormation is you can run things in external cloud providers such as MacStadium which uses VMWare as the base of their cloud. So using the same tool to configure your Linux Selenium Hub and Windows Selenium Nodes you can also create Mac Nodes.

Because it is external to your private subnet where everything is, and in fact external to your VPC, a Load Balancer needs to be created in the public subnet to allow communication from MacStadium into the Hub for registration.

Selenium 4.0 is coming. And it will change this diagram a bit. As mentioned above, the Hub itself can be broken into 4 separate services which can be independently configured and scaled. A ‘Hub’ comprised of 2 AWS Lambda functions, an AWS SQS queue, and an AWS ElastiCache Redis instance is going to be the scalable model of the future, I think.

But before that happens, there are a couple of things that need to happen.

Communication between all parts of the Selenium infrastructure needs to be securable. Currently everything is HTTP but it needs to be HTTPS (if not by default, then at least configurable.) If anyone wants to do that, patches are welcome and would save me the work of doing it.

Similarly, there needs to be some way of authorizing Nodes into the Hub. Right now, any Node can register itself with the Hub and start getting traffic. It’s an interesting attack vector to think about: you discover someone launching a Hub in a public subnet, you light up a Node and attach it, and now you are seeing a company’s next version of their app because they are sending it to you. The vector gets even more interesting when taking into consideration that there is work being done to allow communication back to the Hub from the Node. If I can overflow a buffer and somehow run arbitrary commands on the shell, your network is now fully compromised. Again, feel free to submit a patch along the lines of https://www.elastic.co/guide/en/beats/filebeat/current/configuring-ssl-logstash.html so I don’t have to do it.

And that was the talk. Next steps with it are unknown. I’m seriously considering turning it into a video series and maybe offering it as a workshop at future SeleniumConfs.

Is your Automation Infrastructure ‘Well Architected’? – SeConfChicago edition

This week I was in Chicago to get back onto my soapbox about how automation patterns have largely been created, so the risk has shifted to the infrastructure the scripts run on. There is too much content for an hour, which is the length I thought I had until the day before, when I realized I had 45 minutes. And then on stage the counter said 40 minutes.

Anyhow, this talk is supposed to overwhelm, by intention. The idea being: here is a whole bunch of things you can think about on your own time with the assistance of the actual Well-Architected Framework (which, again, is cloud neutral if you swap out cloud provider component names).

See you next month in Malmo, where I’m giving it again at Øredev.

(I’ll embed the recording once it’s available.)

Laravel and Logstash

As we get larger clients, our need to not be cowboying our monitoring / alerting is increasing. In our scenario we are ingesting logs via Logstash and sending them all to an AWS Elasticsearch instance, and if an entry is of severity ERROR we send it to AWS Simple Notification Service (which people or services can subscribe to) as well as to PagerDuty.

Input
For each of our services we have an input config which basically says ‘consume this file pattern, call it a laravel file, and add its stack name to the event.’

input {
  file {
    path => "<%= scope['profiles::tether::www_root'] %>/storage/logs/laravel-*.log"
    start_position => "beginning"
    type => "laravel"
    codec => multiline {
      pattern => "^\[%{TIMESTAMP_ISO8601}\] "
      negate => true
      what => previous
      auto_flush_interval => 10
    }
    add_field => {"stack" => "tether"}
  }
}

Filter
Since it’s a laravel-type file, we pull out the environment it’s running in and the log severity, plus grab the IP of the instance, build the SNS message subject, and make sure the event timestamp is the one in the log, not the time Logstash touched the event. (Without that last step, you end up with > 1MM entries for a single day the first time you run things.)

filter {
  # Laravel log files
  if [type] == "laravel" {
    grok {
      match => { "message" => "\[%{TIMESTAMP_ISO8601:timestamp}\] %{DATA:env}\.%{DATA:severity}: %{GREEDYDATA:message}" }
    }
    ruby {
      code => "event.set('ip', `ip a s eth0 | awk \'/inet / {print$2}\'`)"
    }
    mutate {
      add_field => { "sns_subject" => "%{stack} Alert (%{env} - %{ip})" }
    }
    date {
      match => [ "timestamp", "yyyy-MM-dd HH:mm:ss" ]
      target => "@timestamp"
    }
  }    
}

Output
And then we pump it around where it needs to be.

If you are upgrading ES from 5.x to 6.x you need to have the template_overwrite setting, or else the new schema doesn’t get imported, and there were some important changes made to it. The scope stuff is for Puppet to do replacements. And there is a bug in 6.4.0 of the amazon_es plugin around template_overwrite…

output {
  amazon_es {
    hosts => ["<%= scope['profiles::laravel::es_host'] %>"]
    region => "us-west-2"
    index => "logstash-<%= scope['environment'] %>-%{+YYYY.MM.dd}"
    template => "/etc/logstash/templates/elasticsearch-template-es6x.json"
    template_overwrite => true
  }
}
output {
  if [severity] == "ERROR" { 
    sns {
      arn => "arn:aws:sns:us-west-2:xxxxxxxxx:<%= scope['environment'] %>-errors"
      region => 'us-west-2'
    }
  }
}

I’m not quite happy with our PagerDuty setup, as the de-duping is running at an instance level right now. Ideally, it would have the reason for the exception as well, but that’s a task for another day.

output {
  if [severity] == "ERROR" { 
      pagerduty {
        event_type => "trigger"
        description => "%{stack} - %{ip}"
        details => {
          timestamp => "%{@timestamp}"
          message => "%{message}"
        }
        service_key => "<%= scope['profiles::laravel::pagerduty'] %>"
        incident_key => "logstash/%{stack}/%{ip}"
      }
  }
}

For the really curious, here is my Puppet stuff for all this. Every machine which has Laravel services has the first manifest, but there are some environments which have multiple services on them which is why the input file lives at the service level.

modules/profiles/manifests/laravel.pp

  class { 'logstash':
    version => '1:6.3.2-1',
  }
  $es_host = hiera('elasticsearch')
  logstash::configfile { 'filter_laravel':
    template => 'logstash/filter_laravel.erb'
  }
  logstash::configfile { 'output_es':
    template => 'logstash/output_es_cluster.erb'
  }
  if $environment == 'sales' or $environment == 'production' {
    logstash::configfile { 'output_sns':
      template => 'logstash/output_sns.erb'
    }

    $pagerduty = lookup('pagerduty')
    logstash::configfile { 'output_pagerduty':
      template => 'logstash/output_pagerduty.erb'
    }
  }
  unless $environment == 'development' {
    file { [ '/etc/logstash/templates' ]:
      ensure => 'directory',
      group  => 'root',
      owner  => 'root',
      mode   => 'u=rwx,go+rx'
    }

    file { [ '/etc/logstash/templates/elasticsearch-template-es6x.json' ]:
      ensure => 'present',
      group  => 'root',
      owner  => 'root',
      mode   => 'u=rwx,go+rx',
      source => 'puppet:///modules/logstash/elasticsearch-template-es6x.json',
      require => Class['Logstash']
    }

    logstash::plugin { 'logstash-output-amazon_es': 
      source => 'puppet:///modules/logstash/logstash-output-amazon_es-6.4.1-java.gem',
      ensure => '6.4.1'
    }
  }

modules/profiles/manifests/.pp

  logstash::configfile { 'input_tether':
    template => 'logstash/input_tether.erb'
  }

The next thing I need to work on is consuming the ES data back into our app so we don’t have to log into Kibana or the individual machines to see the log information. I think every view-your-logs solution I’ve seen for Laravel has been based around reading the actual logs on disk which doesn’t work in a clustered environment or where you have multiple services controlled by a hub one.

Structured Logs in Laravel (Part 2)

The previous post showed you how to tweak Laravel’s default logging setup to output in json which is a key part of creating structured logs. With structured logs you can move yourself towards a more Observable future by tacking on a bunch of extra stuff to your logs which can then be parsed and acted upon by your various systems.

And Laravel supports this out of the box — it just isn’t called out that obviously in the docs. (It is the ‘contextual information’ section on the Logging doc page, or ‘Errors and Logging’ for pre-5.6 docs.) Basically, you create an array and pass it as the second argument to your logging call and it gets written out in the ‘context’ part of the log.

ubuntu@default:/var/www/tether$ sudo php artisan tinker
Psy Shell v0.9.7 (PHP 7.1.20-1+ubuntu16.04.1+deb.sury.org+1 — cli) by Justin Hileman
>>> use Ramsey\Uuid\Uuid;
>>> $observationId = Uuid::uuid4()->toString();
=> "daed8173-5bd0-4065-9696-85b83f167ead"
>>> $structure = ['id' => $observationId, 'person' => 'abc123', 'client' => 'def456', 'entry' => 'ghi789'];
=> [
     "id" => "daed8173-5bd0-4065-9696-85b83f167ead",
     "person" => "abc123",
     "client" => "def456",
     "entry" => "ghi789",
   ]
>>> \Log::debug('some debug message here', $structure);
=> null

which gets output like this

{"message":"some debug message here","context":{"id":"daed8173-5bd0-4065-9696-85b83f167ead","person":"abc123","client":"def456","entry":"ghi789"},"level":100,"level_name":"DEBUG","channel":"development","datetime":{"date":"2018-09-03 18:31:31.079921","timezone_type":3,"timezone":"UTC"},"extra":[]}

Of course there is no ‘standard’ for structured logs (nor should there be as they really are context sensitive), but most of the examples I’ve seen all have some sort of id for giving context for tracing things around.

Note: The id in this case is solely for dealing with logs message output. This is not for application request tracing which I think is also really interesting but have not delved into yet.