AWS giveth with its right hand and breaketh with its left

Earlier this month, AWS ended standard support for PostgreSQL 13 on RDS. Customers who want to stay on a supported database — as AWS is actively encouraging them to do — need to upgrade to PostgreSQL 14 or later.

This makes sense, as PostgreSQL (pronounced POST-gruh-SQUEAL if, like me, you want to annoy the living hell out of everyone within earshot) 13 reached its community end of life late last year.

PostgreSQL 14, which shipped in 2021, defaults to a more secure password authentication scheme (SCRAM-SHA-256, for any nerds who have read this far without diving for their keyboards to correct my previous parenthetical). It also just so happens to break AWS Glue, the company's managed ETL (extract-transform-load) service, which cannot handle that authentication scheme. If you upgrade your RDS database to follow AWS's own security guidance, AWS's own data pipeline tooling responds with "Authentication type 10 is not supported" and stops working.
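For the curious, that "type 10" is not arbitrary. In the PostgreSQL wire protocol, the server's authentication request carries a numeric code: 5 asks for the older MD5 exchange, and 10 is AuthenticationSASL, which is how SCRAM-SHA-256 gets negotiated. What follows is a minimal Python sketch of a client that only knows the older codes. It is not Glue's actual driver code. The message is hand-built rather than captured from a real server, the function names are hypothetical, and the error wording is simply modeled on the one quoted above.

```python
import struct

# Authentication request codes from the PostgreSQL frontend/backend protocol.
AUTH_OK = 0
AUTH_CLEARTEXT = 3
AUTH_MD5 = 5
AUTH_SASL = 10  # the request type used to negotiate SCRAM-SHA-256

# Hypothetical "old driver": it predates SASL support.
OLD_DRIVER_SUPPORTED = {AUTH_OK, AUTH_CLEARTEXT, AUTH_MD5}


def parse_auth_request(message: bytes):
    """Return (code, mechanisms) from a single 'R' authentication message."""
    # Layout: 1-byte tag, 4-byte length (counts itself, not the tag), 4-byte code.
    tag, length, code = struct.unpack("!cII", message[:9])
    if tag != b"R":
        raise ValueError("not an authentication message")
    mechanisms = []
    if code == AUTH_SASL:
        # Body: NUL-terminated mechanism names, closed by an empty name.
        body = message[9:1 + length]
        mechanisms = [m.decode("ascii") for m in body.split(b"\x00") if m]
    return code, mechanisms


def old_driver_handshake(message: bytes) -> str:
    """Illustrative only: reject any request type the driver has never heard of."""
    code, _ = parse_auth_request(message)
    if code not in OLD_DRIVER_SUPPORTED:
        return f"Authentication type {code} is not supported"
    return "ok"  # the driver could carry on with the exchange


# Hand-built example of what a SCRAM-enabled server would send.
body = b"SCRAM-SHA-256\x00\x00"
sasl_request = b"R" + struct.pack("!II", 4 + 4 + len(body), AUTH_SASL) + body

print(parse_auth_request(sasl_request))
print(old_driver_handshake(sasl_request))
```

The point of the sketch is that nothing about the database is broken: the server is politely asking for SCRAM, and a driver that has never heard of code 10 has no branch to handle the request.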

Given that both of these services tend to hang out in the environment that most companies call "production," this is not terrific!

The deprecation didn't create this problem. It just removed the ability to avoid a problem that has existed for five years, unless you take on an additional maintenance burden or pay the Extended Support tax.

Here's the technical shape of the Catch-22, stripped to what matters: when you move to a newer PostgreSQL on RDS, Glue's connection-testing infrastructure uses an internal driver that predates the newer authentication support. The "Test Connection" button — the thing you'd click to verify that your setup works before trusting it with production data — simply doesn't. A community expert on AWS's support forum acknowledged three years ago that "the tester is pending a driver upgrade," and assured users that crawlers use their own drivers and should work fine. Users in the same thread reported back that the crawlers also fail. Running Glue against RDS PostgreSQL is a bread-and-butter data engineering pattern, not an edge case — this is a well-paved path that AWS has let fall into disrepair.

The incompatibility has been known since PostgreSQL 14 shipped in 2021. The deprecation timeline for PG13 was announced in advance. Both teams, RDS and Glue, presumably track industry developments. Neither, apparently, bothered to track each other.

The charitable read on how this happens is also the correct one: AWS has tens of thousands of engineers organized into hundreds of semi-autonomous service teams. The RDS team ships deprecations on the RDS lifecycle, the Glue team maintains driver dependencies on the Glue roadmap, and nobody explicitly owns the gap between them. The customer discovers the incompatibility in production, usually at an inconvenient hour.

This is not a conspiracy, as AWS lacks the internal cohesion needed to pull one of those off. This is also not a carefully-constructed revenue-enhancement mechanism, because the Extended Support revenue is almost certainly a rounding error on AWS's balance sheet compared to the customer ill-will it generates. Instead, this is simply organizational complexity doing what organizational complexity does. It's the same reason your company's internal tools don't talk to each other; AWS is just doing it at a scale where the blast radius is someone else's production database. Integration testing across service boundaries is genuinely hard when those boundaries span multiple billion-dollar businesses that happen to share a parent company. Nobody woke up and decided to break Glue. It came that way from the factory.

I want to be clear that I genuinely believe this, because the alternative I'm about to describe isn't about intent.

The problem with the charitable read is that it doesn't matter

If you're staring at a broken pipeline in your environment at 2 am, the reason is academic. You need a fix. AWS has provided three of them, and they all suck. You can downgrade password encryption on your database to the older, less secure standard: the one you just upgraded away from, per AWS's own recommendations. You can bring your own JDBC driver, which disables connection testing and may not support all the features you want. Or you can rewrite your ETL workflows as Python shell jobs.

Every exit means giving up the entire value proposition of a managed service — presumably why you're in this mess to begin with — or walking back the security improvement you were just told to make.

For customers who stayed on PG13 to avoid this specific problem, Extended Support is now running automatically unless you opted out at cluster creation time, a detail that's easy to miss. That's $0.10 per vCPU-hour for the first two years, doubling in year three. A 16-vCPU Multi-AZ instance works out to nearly $30,000 per year in Extended Support fees alone. It's not a shakedown. But it is a number that appears on a bill, from a company that also controls the timeline for fixing the problem, and all of the customer response options are bad.
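If you'd like to check that math against your own fleet, here is a back-of-the-envelope sketch in Python. It assumes a Multi-AZ deployment is billed for the vCPUs on both the primary and the standby, and that the instance runs around the clock. The function name is made up for illustration, and the rates are the ones quoted above rather than anything pulled from a current price list.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; ignores leap years


def extended_support_annual_cost(vcpus: int, multi_az: bool,
                                 rate_per_vcpu_hour: float) -> float:
    """Rough yearly Extended Support fee for one always-on instance."""
    # Assumption: Multi-AZ doubles the billed vCPUs (primary plus standby).
    billed_vcpus = vcpus * (2 if multi_az else 1)
    return billed_vcpus * rate_per_vcpu_hour * HOURS_PER_YEAR


# The 16-vCPU Multi-AZ example from the text.
years_one_and_two = extended_support_annual_cost(16, True, 0.10)
year_three = extended_support_annual_cost(16, True, 0.20)  # rate doubles

print(f"Years 1-2: ${years_one_and_two:,.0f} per year")
print(f"Year 3:    ${year_three:,.0f} per year")
```

Under those assumptions the first two years come to roughly $28,000 apiece, which is where "nearly $30,000" comes from, and year three lands at about twice that.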

AWS doesn't need to be running a shakedown. They just need to be large enough that the result is indistinguishable from one.

This pattern isn't unique to AWS, and it isn't going away. Every major cloud provider (indeed, every major technology provider) is a portfolio of semi-autonomous teams whose roadmaps occasionally collide in their customers' environments. It will happen again, with different services and different authentication protocols and different billing line items. The question isn't whether the org chart will produce another gap like this. It will. The question is what happens after the gap appears: does the response look like accountability (acknowledging the incompatibility before the deprecation deadline, not after) or does it look like a shrug and three paid alternatives?

Never attribute to malice what can be adequately explained by one very large org chart. Just don't forget to check the invoice. ®

Source: The Register
