Using Elasticache Read Replicas with Laravel

One of the nice things about using Redis via Elasticache is that it gives you a Primary endpoint and a Reader endpoint, which means we can split our load more easily (in addition to the redundancy it provides). Unfortunately, Laravel doesn’t support this split out of the box, which is a paper cut I keep running into. (My only guess is that it’s because it really is an AWS thing and, by and large, Laravel stays pretty provider-agnostic. Going all-in on AWS is a rant for another day.)

Anyhow, we’re re-examining our caching strategy and this finally bothered me enough to do something about it.

Our main way of interacting with the cache is currently through remember() and rememberForever() … which use only the Primary endpoint. Now, these functions just wrap a get(), and if nothing is returned, run the closure and put() its output. So there is our roadmap.
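For reference, remember() boils down to roughly this (paraphrased from memory of Illuminate\Cache\Repository, so treat it as a sketch rather than the verbatim source):

public function remember($key, $ttl, Closure $callback)
{
    // try the cache first
    $value = $this->get($key);

    if (! is_null($value)) {
        return $value;
    }

    // cache miss: run the closure and store its result
    $value = $callback();

    $this->put($key, $value, $ttl);

    return $value;
}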

First, we need to do some setup.

In config/database.php I copied the cache block and created cache_read and cache_write versions of it. I left the original in place as a transition aid … though let’s be honest, I’m never going to delete it.

    'cache_read' => [
        'url' => env('REDIS_URL'),
        'host' => env('REDIS_HOST_READ', '127.0.0.1'),
        'username' => env('REDIS_USERNAME'),
        'password' => env('REDIS_PASSWORD'),
        'port' => env('REDIS_PORT', '6379'),
        'database' => env('REDIS_CACHE_DB', '1'),
    ],

    'cache_write' => [
        'url' => env('REDIS_URL'),
        'host' => env('REDIS_HOST_WRITE', '127.0.0.1'),
        'username' => env('REDIS_USERNAME'),
        'password' => env('REDIS_PASSWORD'),
        'port' => env('REDIS_PORT', '6379'),
        'database' => env('REDIS_CACHE_DB', '1'),
    ],

You’ll also notice that the only thing that changed in there is that host is populated from a different environment variable, so you’ll need to update your various .env processes as well.
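For example (the hostnames are placeholders; use the Primary and Reader endpoints from your Elasticache console):

    REDIS_HOST_WRITE=your-primary-endpoint.cache.amazonaws.com
    REDIS_HOST_READ=your-reader-endpoint.cache.amazonaws.com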

Now that the connections are configured, we need to teach the caching system how to access these new stores in config/cache.php by, again, copying the existing redis block and updating the connection each store uses.

    'redis_read' => [
        'driver' => 'redis',
        'connection' => 'cache_read',
        'lock_connection' => 'default',
    ],

    'redis_write' => [
        'driver' => 'redis',
        'connection' => 'cache_write',
        'lock_connection' => 'default',
    ],

All that is left now is to start pulling our code towards acting-like-remember-but-not.

$value = Cache::store('redis_read')->get('key');
if (is_null($value)) {
    $value = 'foo'; // or whatever you would do in the remember closure
    Cache::store('redis_write')->put('key', $value);
}
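To avoid sprinkling that pattern everywhere, you could wrap it in a small helper. This is just a sketch: remember_replicated() is a name I made up, not something Laravel ships with.

use Closure;
use Illuminate\Support\Facades\Cache;

// hypothetical helper that mimics remember(), but reads from the replica
// and writes to the primary
function remember_replicated(string $key, int $seconds, Closure $callback)
{
    $value = Cache::store('redis_read')->get($key);

    // mirror remember(): only a true null counts as a miss, so cached
    // falsy values (0, '', false) are not recomputed
    if (! is_null($value)) {
        return $value;
    }

    $value = $callback();

    Cache::store('redis_write')->put($key, $value, $seconds);

    return $value;
}

Call sites then read almost exactly like they did with remember().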

And just like that, your hit percentage starts to go up on your Replica.

Though I really do think Laravel should be smart enough that if you give a Redis connection both read and write hosts, things automatically route accordingly…

OOM with Laravel Excel

One of the nice things about PHP is you do not have to manage your process’s memory … except when you do.

We have a job that runs once an hour and dumps a csv generated by Laravel Excel to a customer’s SFTP. Because it pulls information from multiple models, we used map(). Easy-peasy. Except that we also don’t have a 1:1 model-found-via-query()-to-row-in-file output. In some cases it’s 1:1000 or more. Further complicating things is that we eager load a bunch of stuff to be a little more gentle on the database.

Here’s the stripped down version of the class’s map().

public function map($contractor): array
{
    $contractor->load([
        'employees' => function ($query) {
            // some conditions and selecting only fields we need
        },
        'employees.user:id,uuid,first_name,last_name',
    ]);

    $employees = $contractor->employees->sortBy('user.last_name', SORT_NATURAL | SORT_FLAG_CASE);

    $rows = [];

    foreach ($employees as $employee) {
        $employee->load([
            'positions' => function ($query) {
                // some conditions and selecting only fields we need
            },
            'positions.requirements' => function ($query) {
                // some conditions and selecting only fields we need
            },
        ]);

        // a bunch of secret sauce logic

        $rows[] = [
            // fields we want
        ];
    }

    return $rows;
}

This all worked fine until two weeks ago when the job started to run out of memory. Yay for hitting unknown performance cliffs.

[2023-01-01 03:33:14] production.ERROR: Allowed memory size of 2147483648 bytes exhausted (tried to allocate 4096 bytes) {"exception":"[object] (Symfony\\Component\\ErrorHandler\\Error\\FatalError(code: 0): Allowed memory size of 2147483648 bytes exhausted (tried to allocate 4096 bytes) at /path/vendor/phpoffice/phpspreadsheet/src/PhpSpreadsheet/Cell/Coordinate.php:34)
[stacktrace]
#0 {main}
"}

At first I started to refactor the whole thing to use the new approach we have for the secret sauce section of the export, but then I started to actually process what the error was saying.

We can usually ignore memory usage when working with PHP because the processes are short-lived and the memory is returned at the end. But this export ran as a not-fast job (~6 minutes), so suddenly we have to worry about such things. In this case, all the eager loaded models just kept stacking up on top of each other beyond their usefulness.
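If you want to confirm this is what is happening before reaching for a fix, a cheap trick is to log memory usage as the job chews through its models. A rough debugging sketch, not something from our actual job:

use Illuminate\Support\Facades\Log;

// drop this inside the loop (or at the top of map()) while investigating
Log::debug('export memory', [
    'contractor_id' => $contractor->id,
    'current_mb' => round(memory_get_usage(true) / 1024 / 1024, 1),
    'peak_mb' => round(memory_get_peak_usage(true) / 1024 / 1024, 1),
]);

Watching those numbers climb on every iteration instead of plateauing is a pretty strong hint that something is being kept alive longer than it needs to be.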

Thankfully, the Model class has this, which is essentially the opposite of load():

    /**
     * Unset an attribute on the model.
     *
     * @param  string  $key
     * @return void
     */
    public function __unset($key)
    {
        $this->offsetUnset($key);
    }
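(For what it’s worth, the same thing is exposed as an explicit method on the model in current Laravel versions, so the two lines below should be equivalent, but verify against the version you are running.)

    unset($contractor->employees);            // via __unset()
    $contractor->unsetRelation('employees');  // the explicit spelling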

Armed with this, I was able to get the job running again and make it more memory efficient at the same time by explicitly freeing the memory that was no longer needed.

public function map($contractor): array
{
    $contractor->load([
        'employees' => function ($query) {
            // some conditions and selecting only fields we need
        },
        'employees.user:id,uuid,first_name,last_name',
    ]);

    $employees = $contractor->employees->sortBy('user.last_name', SORT_NATURAL | SORT_FLAG_CASE);

    $rows = [];

    foreach ($employees as $employee) {
        $employee->load([
            'positions' => function ($query) {
                // some conditions and selecting only fields we need
            },
            'positions.requirements' => function ($query) {
                // some conditions and selecting only fields we need
            },
        ]);

        // a bunch of secret sauce logic

        $rows[] = [
            // fields we want
        ];

        // free the eager-loaded memory for this employee
        unset($employee->positions);
    }

    // free the eager-loaded memory for this contractor
    unset($contractor->employees);

    return $rows;
}

TL;DR: when eager loading in a long-running process (especially when eager loading in nested loops), be sure to clean up after yourself as you go along.

Eloquent Pagination and Map — not the bestest of friends

Eloquent can make things really easy. But sometimes that thing is shooting yourself in the foot with a performance problem.

We recently released an API endpoint to a customer that was paginated. The first version of the code looked something like this.

return $model
    ->relationship
    ->map(function ($relation) {
        // tonne of heavy lifting
    })
    ->paginate();

It is nothing crazy — just manipulate the objects in the relationship and send it on its way back to the caller. Tests all passed and things were great.

Until it was deployed into production and it took almost 3 minutes to run.

What was happening was that it was doing the ‘tonne of heavy lifting’ against all 3000 models and then returning just the page that was requested. That’s not exactly efficient.

The solution is to flip things around: paginate first and then map.

$items = $model
    ->relationship()   // call the relation as a method so pagination happens in the query
    ->paginate();

$mapped = $items
    ->getCollection()
    ->map(function ($relation) {
        // a tonne of heavy lifting
    });

$items->setCollection($mapped);

return $items;

By mapping only the things in the requested pagination page, the time dropped by around 85%.
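Depending on your Laravel version, the paginator may also have a through() method that does the getCollection()/setCollection() dance for you; worth checking before hand-rolling it:

return $model
    ->relationship()
    ->paginate()
    ->through(function ($relation) {
        // a tonne of heavy lifting
    });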

Needless to say, we’ve stopped using the map-then-paginate pattern in our application … and you likely should as well.

Laravel News Catchup

Next up in my ‘eventually I’ll catch up on email’ queue is Laravel News, which anyone who deals with Laravel should be signed up to. Now, this folder has 180 things in it … but most were already read, so I’m not going to read them again. Since Halloween, this is what I find interesting…

  • A bit of a deep dive into how email validation works and can be extended
  • Laravel Meta Fields lets you attach random amounts of random metadata to models. The last two applications I’ve been responsible for didn’t do this in nearly as elegant a manner
  • Laravel Log Route Statistics seems like it could be an interesting way to either determine which parts of your application have nice test coverage and/or use production data to guide analysis and refactorings. But it also logs to the database, which could get very noisy in a large scale application.
  • Laravel Request Logger is from the same guy and is interesting, with the same caveats I suspect.
  • I’m rebuilding VMs for two companies right now and part of that will be integrating Horizon. How to get notified when Laravel Horizon stops running seems like a useful thing to keep in mind, though I’m not sure I’m keen on it being an artisan command. And things will be running in a cluster, sooo, yay, more things to worry about?
  • Hiding Sensitive Arguments in PHP Stack Traces is always a handy thing to keep in mind. Doing this properly means you can de-scope your log files from things like Right To Delete and such
  • NPS gives me hives, but Laravel NPS seems like a straight-forward way of requesting and storing it
  • I have to think about our pagination strategy over the next couple weeks, so Efficient Pagination Using Deferred Joins is rather timely

ArchTech Newsletter

On the upside, my mail is getting nicely sorted. On the downside, it’s now sorted /and/ neglected. So this is one of, hopefully, many posts where I catch up on things. First up is the ArchTech Newsletter, which delivers some highlights from their weekly twitter thread to your inbox, plus some other bits around products and packages they are working on. Subscribe here.

  • LazilyRefreshDatabase looks like a nice cleanup for tests.
  • Sidecar has a tonne of potential. It feels like it farms your queue workers out to AWS Lambda — and so to any language Lambda supports
  • The Road to PHP: Static Analysis is an email drip course on, well, static analysis
  • Laravel 8.x and newer projects shouldn’t be using Guzzle (or heaven forbid, curl) but should be using the built-in HTTP client. Getting it to throw errors is a useful thing to know how to do.
  • Laravel SEO looks like it could reduce some code on some of my projects.
  • Mail Intercept lets you test mail in Laravel by, well, intercepting it rather than Faking it. I think I like this concept. We’ll be testing a tonne of mail for i18n reasons by the end of February so might make use of this.
  • I don’t think you should ever be storing sensitive data as properties of jobs, but if you insist, then you should be using ShouldBeEncrypted on those jobs

Experimenting where to put MySQL

At MobileXCo, our ‘app’ consists of 5 Laravel apps plus their supporting services like MySQL, Redis, and ElasticSearch, and we have them all in a single Vagrant instance. VMs aren’t ‘cool’ anymore, but our production infrastructure is VM based (EC2) and onboarding developers is pretty easy as it’s just ‘vagrant up’.

That said, as an experiment I moved where MySQL lived to a couple of different places to see if I could simplify the VM by not having it in there. After all, we use RDS to host our database in production, so why not externalize it from development as well?

First, the baseline of having everything in the VM

Time: 19.69 seconds, Memory: 94.25 MB 

This maths out to about .75s per test. Not too too shabby.

Next up was reaching out of the VM to the Host (macOS 10.15.4).

Time: 28.1 seconds, Memory: 94.25 MB

Which works out to 1.08s per test, putting running the server on the Host on the border of consideration — if setup weren’t as trivial as it is in the VM. (Which is fully configured via the Puppet Provisioner using the same scripts as production. Well, with a couple of minor tweaks through environment checks.)

Lastly, I spent a couple of hours teaching myself about docker-compose and ran MySQL in a container using the mysql:5.7 image. (And if I’m going to be honest, this really was an excuse to do said learning.) Port 33306 on the Host is forwarded to port 3306 on the Container, so really this is Guest -> Host -> Container, but still:

Time: 1.02 minutes, Memory: 94.25 MB

That’s … erm, not awesome at 2.38s per test.

This can’t be a unique configuration, and I find it hard to believe that such a performance discrepancy would not have been addressed, which makes me think there are some network tuning options I don’t know about. If anyone has any ideas on how to tweak things, let me know and I’ll re-run the experiment.
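In case anyone wants to reproduce it, the compose file was along these lines (reconstructed from memory, so treat the credentials and names as placeholders):

  # sketch of a docker-compose.yml; values are placeholders
  version: "3"
  services:
    mysql:
      image: mysql:5.7
      ports:
        - "33306:3306"   # Host 33306 -> Container 3306
      environment:
        MYSQL_ROOT_PASSWORD: secret
        MYSQL_DATABASE: app
      volumes:
        - mysql-data:/var/lib/mysql
  volumes:
    mysql-data: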

Secure Node Registration in Selenium Grid 4

This is another in the small series of ‘things that have changed in Selenium Grid but I have not yet added to the official docs’ posts.

When Selenium Grid was first created, the expectation was that your Grid Hub was nicely secured inside your network along with your Nodes, so everything could be trusted to communicate. But now that we live in a cloud environment, that assumption isn’t quite as tight as it once was. You could have everything tightly locked down in your AWS account, but if someone who can create instances gets their access key compromised, well, it’s a problem. I know if I were a bad guy, I would be scanning for open Grid Hubs and then figuring out how to register with them. There is a wealth of information to be had: competitive intelligence on new features not available in the wild, account details for production testing, etc.

Last night I pushed a change that prevents rogue Grid Nodes from registering with your Grid Hub (so it will be available in the next alpha, or you can build it yourself now). I don’t know if this has ever happened in the wild, but the fact that it could is enough that it needed to be closed down.

In order to secure Node registration, you need to supply the new --registration-secret argument in a couple of different places. If the secrets do not match, the Node is not registered. This secret should be treated like any other password in your infrastructure, which is to say, not checked into a repo or similar. Instead it should be kept in something like Hashicorp Vault or AWS Secrets Manager and only accessed (via automated means) when needed.

Standalone Server

When running your Hub as a single instance, there is only one process, so only one place that needs the secret handed to it:

  java -jar selenium.jar \
       hub \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem \
       --registration-secret cheese

Distributed Server

When running your Hub in a distributed configuration, the Distributor and Router servers need to have it.

  java -jar selenium.jar \
       distributor \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem \
       -s https://sessions.grid.com:5556 \
       --registration-secret cheese
  java -jar selenium.jar \
       router \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem \
       -s https://sessions.grid.com:5556 \
       -d https://distributor.grid.com:5553 \
       --registration-secret cheese

Node

Regardless of your approach to running the Server, the Node needs it too. (Obviously.)

  java -jar selenium.jar \
       node \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem \
       --detect-drivers \
       --registration-secret cheese

Detection

When a Node fails to register, two things happen:

  1. A log entry is created at ERROR level saying a Node did not register correctly. Your Selenium infrastructure needs the same attention to its logging as your production infrastructure, so this should trip an alert to someone in whatever manner a potential security problem would in any other environment.
  2. An event is dropped onto the bus. The Selenium Server ships with a 0mq bus built in, but when deploying in a real environment I would suggest using something like AWS SQS (or your cloud’s equivalent) as your queuing system, which you can then have something like AWS Lambda watch for these events and trigger actions accordingly.

It should be noted further that these all happen on the side of the Server, not the Node. The rogue Node is not given any indication that secrets are configured or that the secret it sent was incorrect.

I was on the ‘Test Guild Automation Podcast’

Woke up this morning to a note from Joe that my episode of the Test Guild Automation Podcast is now live. It should come as no surprise that I talk about Selenium infrastructure with him. I felt like I was rambling, but I’m pretty sure Joe kept pulling me back on topic — but since I don’t like how my voice sounds on recordings, you’ll have to let me know how it turned out.

Secure Communications with Selenium Grid 4

For the last couple of years, my schtick has been that I don’t care about your scripts, just your infrastructure. I’m pretty sure in my talk at SeConf London I mused that it was bonkers that we had gotten away with communicating with the Se Server via HTTP. (I have to deal with vendor audits at work and they get really antsy at any mention of HTTP.) At SeConf Chicago I crashed the Se Grid workshop and asked (knowingly) if I was correct that communication was only via HTTP, hoping someone would fix it for me. Alas, no one took the bait, so at SeConf in London I was describing the problem to Simon, who happened to be creating a ticket (actually, a couple) as I talked, and then I got an alert saying it was assigned to me. The squeaky wheel applies its own grease, it seems.

There are a couple of catch-22s in place before I can update the official Selenium Documentation (have you seen the new doc site? It’s great!), so in lieu of that, here is a quick how-to on something that will be in Selenium 4 Alpha 2 (or now, if you build it yourself).

What is below is the output of ‘info security’ on the new server. (The ‘info’ command is also new and as yet undocumented.)


Selenium Grid by default communicates over HTTP. This is fine for a lot of use cases, especially if everything is contained within the firewall and against test sites with testing data. However, if your server is exposed to the Internet or is being used in environments with production data (or that which has PII) then you should secure it.

Standalone

In order to run the server using HTTPS instead of HTTP you need to start it with the --https-private-key and --https-certificate flags to provide it the certificate and private key (as a PKCS8 file).

  java -jar selenium.jar \
       hub \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem

Distributed

Alternatively, if you are starting things individually, you would also specify HTTPS when telling each piece where to find the others.

  java -jar selenium.jar \
       sessions \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem
  java -jar selenium.jar \
       distributor \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem \
       -s https://sessions.grid.com:5556
  java -jar selenium.jar \
       router \
       --https-private-key /path/to/key.pkcs8 \
       --https-certificate /path/to/cert.pem \
       -s https://sessions.grid.com:5556 \
       -d https://distributor.grid.com:5553

Certificates

The Selenium Grid will not operate with self-signed certificates; as a result, you will need to have some provisioned from a Certificate Authority of some sort. For experimentation purposes you can use MiniCA to create and sign your certificates.

  minica --domains sessions.grid.com,distributor.grid.com,router.grid.com

This will create minica.pem and minica.key in the current directory, as well as cert.pem and key.pem in a directory called sessions.grid.com, which will have both distributor.grid.com and router.grid.com as alternative names. Because Selenium Grid requires the key to be in PKCS8 format, you have to convert it.

  openssl pkcs8 \
    -in sessions.grid.com/key.pem \
    -topk8 \
    -out sessions.grid.com/key.pkcs8 \
    -nocrypt

And since we are using a non-standard CA, we have to teach Java about it. To do that you add it to the cacerts truststore, which by default is $JAVA_HOME/jre/lib/security/cacerts.

  sudo keytool \
      -import \
      -file /path/to/minica.pem \
      -alias minica \
      -keystore $JAVA_HOME/jre/lib/security/cacerts \
      -storepass changeit

Clients

None of the official clients have been updated yet to support this, but if you are using a CA that the system knows about, you can just use an HTTPS Command Executor and everything will work. If you are using a non-standard one (like MiniCA) you will probably have to jump through a hoop or two, similar to the following in Python, which basically says “Yes, yes, I know you don’t know about the CA but I do, so just continue along anyways.”

from selenium import webdriver

import urllib3
urllib3.disable_warnings()

options = webdriver.FirefoxOptions()
driver = webdriver.Remote(
    command_executor='https://router.grid.com:4444',
    options=options
)

driver.close()

Scrum is an anti-pattern for Continuous Delivery

I’ve been saying that ‘Scrum is an anti-pattern for Continuous Delivery’ for a while, including in last week’s post, which got a ‘huh?’, so here is my beef with Scrum.

Actually, my complaint isn’t with Scrum itself, but with Sprints; if you remove those, the whole house of cards falls down. (This is similar to my stance on Java, which I do dislike, but I loathe Eclipse, so Java is tolerable in something other than Eclipse. Barely.)

The whole point of Continuous Delivery, to me, is to ‘deliver’ improvements to whatever it is you do, to your customers, ‘continuously.’ Where continuously means, well, continuously. Not ‘at the end of an arbitrary time period which is usually about 2 – 3 weeks in length.’ This is why ‘Mean Time To Production’ is such an important metric to me and drives all other changes to the delivery pipeline.

“But Adam, how will we plan what we do if we don’t get to play Planning Poker every week?” Easy. Your customers will tell you. And that ‘customer’ could be internal. If something is important, you will know. If something is more important than something else, then it will bump that down the queue. This isn’t to say discussing things and figuring out how to slice them into smaller and smaller units isn’t necessary. It absolutely is. And learning how to do this is perhaps one of the hardest problems in software. Which leads to…

“But Adam, this is a big story that will take the full sprint to complete.” Slice it smaller, hide the work-in-progress behind feature flags and still push your local changes daily. (You should be using them anyways to separate feature launch from availability.)

“But Adam, we could deploy at any point — we just do it once a sprint.” Why? You are actively doing a disservice to your customers and your company by holding back things that could improve their experience and make you more money. Disclaimer: this becomes a more real argument when deploying to IoT or other hardware. I don’t want my thermostat to get updated 20 times a day. But if the vendor could do it, I’ll accept that.

“But Adam, we are in a regulated environment and have to do Scrum.” That’s a strawman argument against working with your auditors. See Dave’s recent Continuous Compliance article.

“But Adam, how will we know if we are getting better at estimating?” The same way you do with Scrum or anything else, which is to collect data. This is a bulk food type of problem. When you go to buy, say, peanut butter from the bulk food store, you take in your container and they weigh it before you scoop your peanut-y deliciousness into it, and after. They then do the math to know the weight of just the peanut butter. The same thing can be done here. If you know how long your deploys take, you can do the math on the time between when code was started and when it was available in production, and then remove the fixed time of deployments to get the actual length of time something took. In its entirety, not just ‘in development’. (I don’t actually track this metric right now. Things take the length of time they take. But I think this is sound theory.)

“But Adam, where do all our manual testers fit in this world?” They are just part of the process. This is a key difference between Continuous Deployment and Continuous Delivery. If your process says humans touch it, then humans touch it. But there also needs to be a way to short-circuit around them in the case of an emergency.

“But Adam, our database is so archaic and fragile that deployments are a huge risk and sprints minimize that.” That’s a good place to start changing things. A local company still does releases weekly, overnight on Wednesdays, after 5 years because of this. I’m pretty sure it stopped being a tech problem and became a people problem a couple of years ago.

So if not Scrum, then what? The ‘easy’ answer is Kanban. The harder answer is, of course, ‘it depends’ and likely looks like a tailored version of Kanban that solves your team’s problems. I really like the notion of a work item flowing across a board, but I also dislike enforcing WIP limits and the artificial moving of things left to make room for something else because the tooling requires it.

Let me know what other “But Adam’s” I missed in the comments.

Oh, I’ve got one more.

“But Adam, that is hard.” Yes. Yes it is. (It’s also super fun.)