This episode with Joachim Hill-Grannec asks: How do platforms bloat, and how do you keep them simple and fast with trunk-based dev and small batches? Which metrics prove it works—cycle time, uptime, or developer experience? Can security act as a partner that speeds delivery instead of a gate?
We are always happy to answer any questions, hear suggestions for new episodes, or hear from you, our listeners.
DevSecOps Talks podcast LinkedIn page
DevSecOps Talks podcast website
DevSecOps Talks podcast YouTube channel
Summary

In this episode of DevSecOps Talks, Mattias speaks with Joachim Hill-Grannec, co-founder of Peltek, a boutique consulting firm specializing in high-availability, cloud-native infrastructure. Following up on a previous episode where Steve discussed cleaning up bloated platforms, Mattias and Joachim dig into why platforms get bloated in the first place and how platform teams should think when building from scratch. Their conversation spans cloud provider preferences, the primacy of cycle time, the danger of adding process in response to failure, and a strong argument for treating security and quality as enablers rather than gatekeepers.
Key Topics

Platform Teams Should Serve Delivery Teams

Joachim frames the core question of platform engineering around who the platform is actually for. His answer is clear: the delivery teams are the client. Platform engineers should focus on making it easier for developers to ship products, not on making their own work more convenient.
He connects this directly to platform bloat. In his experience, many platforms grow uncontrollably because platform engineers keep adding tools that help the platform team itself: "Look, I spent this week to make my job this much faster." But Joachim pushes back on this instinct — the platform team is an amplifier for the organization, and every addition should be evaluated by whether it helps a product get to production faster and gives developers better visibility into what they are working on.
Choosing a Cloud Provider: Preferences vs. Reality

The conversation briefly explores cloud provider choices. Joachim says GCP is his personal favorite from a developer perspective because of cleaner APIs and faster response times, though he acknowledges Google's tendency to discontinue services unexpectedly. He describes AWS as the market workhorse — mature, solid, and widely adopted, comparing it to "the Java of the land." Azure gets the coldest reception; both acknowledge it has improved over time, but Joachim says he still struggles whenever he is forced to use it.
They observe that cloud choices are frequently made outside engineering. Finance teams, investors, and existing enterprise agreements often drive the decision more than technical fit. Joachim notes a common pairing: organizations using Google Workspace for productivity but AWS for cloud infrastructure, partly because the Entra ID (formerly Azure AD) integration with AWS Identity Center works more smoothly via SCIM than the equivalent Google Workspace setup, which requires a Lambda function to sync groups.
Measuring Platform Success: Cycle Time Above All

When Mattias asks how a team can tell whether a platform is actually successful, Joachim separates subjective and objective measures.
On the subjective side, he points to developer happiness and developer experience (DX). Feedback from delivery teams matters, even if surveys are imperfect.
On the objective side, his favorite metric is cycle time — specifically, the time from when code is ready to when it reaches production. He also mentions uptime and availability, but keeps returning to cycle time as the clearest indicator that a platform is helping teams deliver faster. This aligns with DORA research, which has consistently shown that deployment frequency and lead time for changes are strong predictors of overall software delivery performance.
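A minimal sketch of how a team might compute this metric, assuming "ready" means merged to trunk (the records and helper names here are illustrative, not from the episode):

```python
from datetime import datetime, timedelta
from statistics import median

def cycle_time(ready_at: datetime, deployed_at: datetime) -> timedelta:
    """Time from code being ready (e.g. merged to trunk) to running in production."""
    return deployed_at - ready_at

# Hypothetical deployment records: (merged_at, deployed_at) pairs.
records = [
    (datetime(2024, 5, 1, 9, 0),  datetime(2024, 5, 1, 9, 40)),
    (datetime(2024, 5, 1, 13, 0), datetime(2024, 5, 2, 10, 0)),
    (datetime(2024, 5, 2, 11, 0), datetime(2024, 5, 2, 11, 25)),
]
times = [cycle_time(ready, deployed) for ready, deployed in records]
print(median(times))  # → 0:40:00; the median is less skewed by one slow release than the mean
```

Tracking the trend of this number over weeks, rather than a single snapshot, is what shows whether the platform is actually speeding teams up.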
Start With a Highway to Production

A major theme of the episode is that platforms should begin with the shortest possible route to production. Mattias calls this a "highway to production," and Joachim strongly agrees.
For greenfield projects, Joachim favors extremely fast delivery at first — commit goes to production, commit goes to production — even with minimal process. As usage and risk increase, teams can gradually add automation, testing, and safeguards. The critical thing is to keep the flow and then ask "how do we make those steps faster?" as you add them, rather than letting each new step slow down the pipeline unchallenged.
He also makes a strong case for tags and promotions over branch-based deployment, noting his instinctive reaction when someone asks "which branch are we deploying from?" is: "No branches — tags and promotions."
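The distinction can be modeled in a few lines. In this sketch (a toy model, not a real registry or CD API), an artifact is built exactly once and "promotion" only moves an environment pointer to the same immutable digest, whereas branch-based deployment typically rebuilds per environment:

```python
import hashlib

def build(source: str) -> str:
    """Build once from trunk; the digest identifies the immutable artifact."""
    return "sha256:" + hashlib.sha256(source.encode()).hexdigest()[:12]

# Each environment points at a digest; promotion just moves the pointer.
environments = {"dev": None, "staging": None, "prod": None}

def promote(digest: str, env: str) -> None:
    environments[env] = digest  # pointer update only; the artifact never changes

digest = build("app@commit-abc123")
promote(digest, "dev")
promote(digest, "staging")
promote(digest, "prod")

# Every environment runs the exact bits that were verified earlier in the flow.
assert environments["dev"] == environments["staging"] == environments["prod"]
```

The design point is that what reaches production is byte-for-byte what was tested, which a per-branch rebuild cannot guarantee.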
The Trap of Slowing Down After Failure

Joachim warns about a common and dangerous pattern: when a bug reaches production, the natural organizational reaction is not to fix the pipeline, but to add gates. A QA team does a full pass, a security audit is inserted, a manual review step appears. Each gate slows delivery, which leads to larger batches, which increases risk, which triggers even more controls.
He sees this as a vicious cycle. Organizations that respond to incidents by slowing delivery actually get worse security, worse quality, and worse throughput over time. He references a study — likely the research behind the book Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim — showing that faster delivery correlates with better security and quality outcomes. The organizations adding Engineering Review Boards (ERBs) and Architecture Review Boards (ARBs) in the name of safety often do not measure the actual impact, so they never see that the controls are making things worse.
Mattias connects this to AI-assisted development, where developers can now produce changes faster than ever. If the pipeline cannot keep up, the pile of unreleased changes grows, making each release riskier.
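The batch-size argument can be made concrete with a toy calculation. Assuming each change independently carries a small chance of a defect (the 2% figure is an illustrative assumption, not from the episode), the probability a release ships clean falls quickly as unreleased changes pile up:

```python
# If each change has an independent probability p of introducing a defect,
# a release bundling n changes is defect-free with probability (1 - p) ** n.
def clean_release_probability(batch_size: int, p: float = 0.02) -> float:
    return (1 - p) ** batch_size

for n in (1, 10, 50):
    print(f"{n:>2} changes -> {clean_release_probability(n):.1%} chance of a clean release")
# → 1 change 98.0%, 10 changes 81.7%, 50 changes 36.4%
```

Small batches also make the inverse problem easier: when something does break, a one-change release leaves little doubt about which change caused it.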
Getting Buy-In: Start With Small Experiments

Joachim does not recommend that a slow, process-heavy organization throw everything out overnight. Instead, he suggests starting with small experiments. Code promotions are a good entry point: teams can start producing artifacts more rapidly without changing how those artifacts are deployed. Once that works, the conversation shifts to delivering those artifacts faster.
He finds starting on the artifact pipeline side produces quicker wins and more organizational buy-in than starting with the platform deployment side, which tends to be more intertwined and higher-risk to change.
Guiding Principles Over a Rigid Golden Path

Mattias questions the idea of a single "golden path," saying the term implies one rigid way of working. Joachim leans toward guiding principles instead.
His strongest principle is simplicity — specifically, simplicity to understand, not necessarily simplicity to create. He references Rich Hickey's influential talk Simple Made Easy (from Strange Loop 2011), which distinguishes between things that are simple (not intertwined) and things that are easy (familiar or close at hand). Creating simple systems is hard work, but the payoff is systems that are easy to reason about, easy to change, and easy to secure.
His second guiding principle is replaceability. When evaluating any tool in the platform, he asks: "How hard would it be to yank this out and replace it?" If swapping a component would be extremely difficult, that is a smell — it means the system has become too intertwined. Even with a tool as established as Argo CD, his team thinks about what it would look like to switch it out.
Tooling Choices and Platform Foundations

Joachim outlines the patterns his team typically uses when building platforms, organized into two paths:
Delivery pipeline (artifact creation):
- Trunk-based development over GitFlow
- Release tags and promotions rather than branch-based deployment
- Containerization early in the pipeline
- Release Please for automated release management and changelogs
- Renovate for dependency updates, including promoting Helm chart and container image versions into production environments
Platform side (environment management):
- Kubernetes-heavy, typically EKS on AWS
- Karpenter for node scaling
- AWS Load Balancer Controller only as a backing service for a separate ingress controller (not using ALB Ingress directly, due to its rough edges)
- Argo CD for GitOps synchronization and deployment
- Argo CD Image Updater in lower environments to pull the latest images automatically
- Helm for packaging, despite its learning curve
He notes that the community-maintained NGINX ingress controller (ingress-nginx) is being retired, so teams need to evaluate alternatives for their ingress layer.
Developers Should Not Be Fully Shielded From Operations

One of the more nuanced parts of the conversation is how much operational responsibility developers should have. Joachim rejects both extremes. He does not think every developer needs to know everything about infrastructure, but he has seen too many cases where developers completely isolated from runtime concerns make poor decisions — missing simple code changes that would make a system dramatically easier to deploy and operate.
He advocates for transparency and collaboration. Platform repos should be open for anyone on the dev team to submit pull requests. When the platform team makes a change, they should pull in developers to work alongside them. This way, the delivery team gradually builds a deeper understanding of how the whole system works.
Joachim loves the open-source maintainer model applied inside organizations: platform teams are maintainers of their areas, but anyone in the organization should be able to introduce change. He warns against building custom CLIs or heavy abstractions that create dependencies — if a developer wants to do something the CLI does not support, the platform team becomes a bottleneck.
Mattias adds that opening up the platform to contributions also exposes assumptions. What feels easy to the person who built it may not be easy at all; it is just familiar. Outside contributors reveal where the system is actually hard to understand.
Designers, Not Artists: Detaching Ego From Code

Joachim shares an analogy he prefers over the common "developers as artists" framing. He sees developers more like designers than artists, because an artist's work is tied to their identity — they want it to endure. A designer, by contrast, creates something to serve a purpose and expects it to be replaced when something better comes along.
He applies this to platforms and infrastructure: "I want my thing to get wiped out. If I build something, I want it to get removed eventually and have something better replace it." Organizations where ego is tied to specific systems or tools tend to resist change, which leads to the kind of dysfunction that keeps platforms bloated and brittle.
Complexity Is the Enemy of Security

Mattias raises the difficulty of maintaining complex security setups over time, especially when the original experts leave. Joachim responds firmly: complexity is anti-security.
If people cannot comprehend a system, they cannot secure it well. He acknowledges that some problems are genuinely hard, but argues that much of the complexity engineers create is unnecessary — driven by ego rather than need. "The really smart people are the ones that create simple things," he says, wishing the industry would redirect its narrative from admiring complicated systems to admiring simple ones.
Security and QA as Internal Consulting, Not Gatekeeping

Joachim draws a parallel between security and QA. He dislikes calling a team "the quality team," preferring "verification" — they are one component of quality, not the entirety of it. Similarly, security is not one team's responsibility; it spans product design, development practices, tooling, and operations.
His ideal model is for security and QA teams to operate as internal consultants whose goal is to reduce risk and improve the overall system — not to catch every possible issue at any cost. The framing matters: if a security team's mandate is simply "block all security issues," the logical conclusion is to stop shipping or delete the product entirely. That may be technically secure, but it is useless.
He frames security as risk management: "Security is a risk management process, not just security for the sake of security. You're managing the risk to the business." The goal should be to deliver faster and more securely — an "and," not an "or."
Mattias recalls a PCI DSS consultant joking over drinks that a system being down is perfectly compliant — no one can steal card numbers if the system is unavailable. The joke lands because it exposes exactly the broken incentive Joachim describes.
Business Value as the Unifying Frame

The episode closes by tying everything back to business outcomes. Joachim argues that speed and security are not opposites; both contribute to business value. Fast delivery creates value directly, while security reduces business risk — and risk management is itself a business operation.
He explains why focusing on the highest-impact business bottleneck first builds trust. When you hit the big items first, you earn credibility, and subsequent changes become easier to justify. For example, one of his clients has a security group that is the slowest part of their organization. Speeding up that security process would have a massive impact on business delivery — more than optimizing the artifact pipeline.
Mattias reflects that he used to see platform work as separate from business concerns — "I don't care about the business, I'm here to build a platform for developers." Looking back, he would reframe that: using business impact as the measure of platform success does not mean abandoning the focus on developers, it means having a clearer way to prioritize and demonstrate value.
Summary

In this episode of DevSecOps Talks, Mattias and Paulina speak with Steve Wade, founder of Platform Fix, about why so many Kubernetes and platform initiatives become overcomplicated, expensive, and painful for developers. Steve has helped simplify over 50 cloud-native platforms and estimates he has removed around $100 million in complexity waste. The conversation covers how to spot a bloated platform, why "free" tools are never really free, how to systematically delete what you don't need, and why the best platform engineering is often about subtraction rather than addition.
Key Topics

Steve's Background: From Complexity Creator to Strategic Deleter

Steve introduces himself as the founder of Platform Fix — the person companies call when their Kubernetes migration is 18 months in, millions over budget, and their best engineers are leaving. He has done this over 50 times, and he is candid about why it matters so much to him: he used to be this problem.
Years ago, Steve led a migration that was supposed to take six months. Eighteen months later, the team had 70 microservices, three service meshes (they kept starting new ones without finishing the old), and monitoring tools that needed their own monitoring. Two senior engineers quit. The VP of Engineering gave Steve 90 days or the team would be replaced.
Those 90 days changed everything. The team deleted roughly 50 of the 70 services, ripped out all the service meshes, and cut deployment time from three weeks of chaos to three days, consistently. Six months later, one of the engineers who had left came back. That experience became the foundation for Platform Fix.
As Steve puts it: "While everyone's collecting cloud native tools like Pokemon cards, I'm trying to help teams figure out which ones to throw away and which ones to keep."
Why Platform Complexity Happens

Steve explains that organizations fall into a complexity trap by continuously adding tools without questioning whether they are actually needed. He describes walking into companies where the platform team spends 65–70% of their time explaining their own platform to the people using it. His verdict: "That's not a team, that's a help desk with infrastructure access."
People inside the complexity normalize it. They cannot see the problem because they have been living in it for months or years. Steve identifies several drivers: conference-fueled recency bias (someone sees a shiny tool at KubeCon and adopts it without evaluating the need), resume-driven architecture (engineers choosing tools to pad their CVs), and a culture where everyone is trained to add but nobody asks "what if we remove something instead?"
He illustrates the resume-driven pattern with a story from a 200-person fintech. A senior hire — "Mark" — proposed a full stack: Kubernetes, Istio, Argo, Crossplane, Backstage, Vault, Prometheus, Loki, Tempo, and more. The CTO approved it because "Spotify uses it, so it must be best practice." Eighteen months and $2.3 million later, six engineers were needed just to keep it running, developers waited weeks to deploy, and Mark left — with "led Kubernetes migration" on his CV. When Steve asked what Istio was actually solving, nobody could answer. It was costing around $250,000 to run, for a problem that could have been fixed with network policies.
He also highlights a telling sign: he asked three people in the same company how many Kubernetes clusters they needed and got three completely different answers. "That's not a technical disagreement. That's a sign that nobody's aligned on what the platform is actually for."
The AI Layer: Tool Fatigue Gets Worse

Paulina observes that the same tool-sprawl pattern is now being repeated with AI tooling — an additional layer of fatigue on top of what already exists in the cloud-native space. Steve agrees and adds three dimensions to the AI complexity problem: choosing which LLM to use, learning how to write effective prompts, and figuring out who is accountable when AI-written code does not work as expected. Mattias notes that AI also enables anyone to build custom tools for their specific needs, which further expands the toolbox and potential for sprawl.
How Leaders Can Spot a Bloated PlatformOne of the most practical segments is Steve's framework for helping leaders who are not hands-on with engineering identify platform bloat. He gives them three things to watch for:
Steve uses a memorable analogy: many platforms are like the Sagrada Familia in Barcelona — they look incredibly impressive and intricate, but they are never actually finished. The question leaders should ask is: what does an MVP platform look like, what tools does it need, and how do we start delivering business value to the developers who use it? Because, as Steve says, "if we're not building any business value, we're just messing around."
Who the Platform Is Really ForMattias asks the fundamental question: who is the platform actually for? Steve's answer is direct — the platform's customers are the developers deploying workloads to it. A platform without applications running on it is useless.
He distinguishes three stages: - Vanilla Kubernetes: the out-of-the-box cluster - Platform Kubernetes: the foundational workloads the platform needs to function (secret management, observability, perhaps a service mesh) - The actual platform: only real once applications are being deployed and business value is delivered
The hosts discuss how some teams build platforms for themselves rather than for application developers or the business, which is a fast track to unnecessary complexity.
Kubernetes: Standard Tool or Premature Choice?The episode explores when Kubernetes is the right answer and when it is overkill. Steve emphasizes that he loves Kubernetes — he has contributed to the Flux project and other CNCF projects — but only when it is earned. He gives an example of a startup with three microservices, ten users, and five engineers that chose Kubernetes because "Google uses it" and the CTO went to KubeCon. Six months later, they had infrastructure that could handle ten million users while serving about 97.
"Google needs Kubernetes, but your Series B startup needs to ship features."
Steve also shares a recent on-site engagement where he ran the unit economics on day two: the proposed architecture needed four times the CPU and double the RAM for identical features. One spreadsheet saved the company from a migration that would have destroyed the business model. "That's the question nobody asks before a Kubernetes migration — does the maths actually work?"
Mattias pushes back slightly, noting that a small Kubernetes cluster can still provide real benefits if the team already has the knowledge and tooling. Paulina adds an important caveat: even if a consultant can deploy and maintain Kubernetes, the question is whether the customer's own team can realistically support it afterward. The entry skill set for Kubernetes is significantly higher than, say, managed Docker or ECS.
Managed Services and "Boring Is Beautiful"Steve's recommendation for many teams is straightforward: managed platforms, managed databases, CI/CD that just works, deploy on push, and go home at 5 p.m. "Boring is beautiful, especially when you call me at 3 a.m."
He illustrates this with a company that spent 18 months and roughly $850,000 in engineering time building a custom deployment system using well-known CNCF tools. The result was about 80–90% as good as GitHub Actions. The migration to GitHub Actions cost around $30,000, and the ongoing maintenance cost was zero.
Paulina adds that managed services are not completely zero maintenance either, but the operational burden is orders of magnitude less than self-managed infrastructure, and the cloud provider takes on a share of the responsibility.
The New Tool Tax: Why "Free" Tools Are Never FreeA central theme is that open-source tools carry hidden costs far exceeding their license fee. Steve introduces the new tool tax framework with four components, using Vault (at a $40,000 license) as an example:
Total year-one cost: roughly $243,000 — a 6x multiplier over the $40,000 budget. And as Steve points out, most teams never present this full picture to leadership.
Mattias extends the point to tool documentation complexity, noting that anyone who has worked with Envoy's configuration knows how complicated it can be. Steve adds that Envoy is written in C — "How many C developers do you have in your organization? Probably zero." — yet teams adopt it because it offers 15 to 20 features that may or may not be useful.
This is the same total cost of ownership concept the industry has used for on-premises hardware, but applied to the seemingly "free" cloud-native landscape. The tools are free to install, but they are not free to manage and maintain.
Why Service Meshes Are Often the First to GoWhen Mattias asks which tool type Steve most often deletes, the answer is service meshes. Steve does not name a specific product but says six or seven times out of ten, service meshes exist because someone thought they were cool, not because the team genuinely needed mutual TLS, rate limiting, or canary deploys at the mesh level.
Mattias agrees: in his experience, he has never seen an environment that truly required a service mesh. The demos at KubeCon are always compelling, but the implementation reality is different. Steve adds a self-deprecating note — this was him in the past, running three service meshes simultaneously because none of them worked perfectly and he kept starting new ones in test mode.
A Framework for Deleting ToolsSteve outlines three frameworks he uses to systematically simplify platforms.
The Simplicity Test is a diagnostic that scores platform complexity across ten dimensions on a scale of 0 to 50: tool sprawl, deployment complexity, cognitive load, operational burden, documentation debt, knowledge silos, incident frequency, time to production, self-service capability, and team satisfaction. A score of 0–15 is sustainable, 16–25 is manageable, 26–35 is a warning, and 36–50 is crisis. Over 400 engineers have taken it; the average score is around 34. Companies that call Steve typically score 38 to 45.
The Four Buckets categorize every tool: Essential (keep it), Redundant (duplicates something else — delete immediately), Over-engineered (solves a real problem but is too complicated — simplify it), or Premature (future-scale you don't have yet — delete for now).
From one engagement with 47 tools: 12 were essential, 19 redundant, 11 over-engineered, and 5 premature — meaning 35 were deletable.
He then prioritizes by impact versus risk, tackling high-impact, low-risk items first. For example, a large customer had Datadog, Prometheus, and New Relic running simultaneously with no clear rationale. Deleting New Relic took three hours, saved $30,000, and nobody noticed. Seventeen abandoned databases with zero connections in 30 days were deprecated by email, then deleted — zero responses, zero impact.
The security angle matters here too: one of those abandoned databases was an unpatched attack surface sitting in production with no one monitoring it. Paulina adds a related example — her team once found a Flyway instance that had gone unpatched for seven or eight years because each team assumed the other was maintaining it. As she puts it, lack of ownership creates the same kind of hidden risk.
The 30-Day Cleanup SprintSteve structures platform simplification as a focused 30-day effort:
He illustrates this with a company whose VP of Engineering — "Sarah" — told him: "This isn't a technical problem anymore. This is a people problem." Two senior engineers had quit on the same day with the same exit interview: "I'm tired of fighting the platform." One said he had not had dinner with his kids on a weekend in six months. The team's morale score was 3.2 out of 10.
The critical insight: the team already knew what was wrong. They had known for months. But nobody had been given permission to delete anything. "That's not a cultural problem and it's not a knowledge problem. It's a permissions problem. And I gave them the permission."
Results: complexity score dropped from 42 to 26, monthly costs fell from $150,000 to $80,000 (roughly $840,000 in annual savings), and deployment time improved from two weeks to one day.
But Steve emphasizes the human outcome. A developer told him afterward: "Steve, I went home at 5 p.m. yesterday. It's the first time in eight months. And my daughter said, 'Daddy, you're home.'" That, Steve says, is what this work is really about.
Golden Paths, Guardrails, and Developer ExperienceMattias says he wants the platform he builds to compete with the easiest external options — Vercel, Netlify, and the like. If developers would rather go elsewhere, the internal platform has failed.
Steve agrees and describes a pattern he sees constantly: developers do not complain when the platform is painful — they route around it. He gives an example from a fintech where a lead developer ("James") needed a test environment for a Friday customer demo. The official process required a JIRA ticket, a two-day wait, YAML files, and a pipeline. Instead, James spun up a Render instance on his personal credit card: 12 minutes, deployed, did the demo, got the deal. Nobody knew for three months, until finance found the charges.
Steve's view: that is not shadow IT or irresponsibility — it is a rational response to poor platform usability. "The fastest path to business value went around the platform, not through it."
The solution is what Steve calls the golden path — or, as he reframes it using a bowling alley analogy, golden guardrails. Like the bumpers that keep the ball heading toward the pins regardless of how it is thrown, the guardrails keep developers on a safe path without dictating exactly how they get there. The goal is hitting the pins — delivering business value.
Mattias extends the guardrails concept to security: the easiest path should also be the most secure and compliant one. If security is harder than the workaround, the workaround wins every time. He aims to make the platform so seamless that developers do not have to think separately about security — it is built into the default experience.
Measuring Outcomes, Not FeaturesSteve argues that platform teams should measure developer outcomes, not platform features: time to first deploy, time to fix a broken deployment, overall developer satisfaction, and how secure and compliant the default deployment paths are.
He recommends monthly platform retrospectives where developers can openly share feedback. In these sessions, Steve goes around the room and insists that each person share their own experience rather than echoing the previous speaker. This builds a backlog of improvements directly tied to real developer pain.
Paulina agrees that feedback is essential but notes a practical challenge: in many organizations, only a handful of more active developers provide feedback, while the majority say they do not have time and just want to write code. Collecting representative feedback requires deliberate effort.
She also raises the business and management perspective. In her consulting experience, she has seen assessments include a third dimension beyond the platform team and developers: business leadership, who focus on compliance, security, and cost. Sometimes the platform enables fast development, but management processes still block frequent deployment to production — a mindset gap, not a technical one. Steve agrees and points to value stream mapping as a technique for surfacing these bottlenecks with data.
Translating Engineering Work Into Business ValueSteve makes a forceful case that engineering leaders must express technical work in business terms. "The uncomfortable truth is that engineering is a cost center. We exist to support profit centers. The moment we forget that, we optimize for architectural elegance instead of business outcomes — and we lose the room."
He illustrates this with a story: a CFO asked seven engineering leaders one question — "How long to rebuild production if we lost everything tomorrow?" Five seconds of silence. Ninety-four years of combined experience, and nobody could answer. "That's where engineering careers die."
The translation matters at every level. Saying "we deleted a Jenkins server" means nothing to a CFO. Saying "we removed $40,000 in annual costs and cut deployment failures by 60%" gets attention.
Steve challenges listeners to take their last three technical achievements and rewrite each one with a currency figure, a percentage, and a timeframe. "If you can't, you're speaking engineering, not business."
Closing Advice: Start Deleting This WeekSteve's parting advice is concrete: pick one tool you suspect nobody is using, check the logs, and if nothing has happened in 30 days, deprecate it. In 60 days, delete it. He also offers the simplicity test for free — it takes eight minutes, produces a 0-to-50 score with specific recommendations, and is available by reaching out to him directly.
"Your platform's biggest risk isn't technical — it's political. Platforms die when the CFO asks you a question you can't answer, when your best engineer leaves, or when the team builds for their CV instead of the business."
Highlights
We are always happy to answer any questions, hear suggestions for new episodes, or hear from you, our listeners.
DevSecOps Talks podcast LinkedIn page
DevSecOps Talks podcast website
DevSecOps Talks podcast YouTube channel
Summary
In this episode of DevSecOps Talks, Mattias and Paulina speak with Steve Wade, founder of Platform Fix, about why so many Kubernetes and platform initiatives become overcomplicated, expensive, and painful for developers. Steve has helped simplify over 50 cloud-native platforms and estimates he has removed around $100 million in complexity waste. The conversation covers how to spot a bloated platform, why "free" tools are never really free, how to systematically delete what you don't need, and why the best platform engineering is often about subtraction rather than addition.
Key Topics

Steve's Background: From Complexity Creator to Strategic Deleter
Steve introduces himself as the founder of Platform Fix — the person companies call when their Kubernetes migration is 18 months in, millions over budget, and their best engineers are leaving. He has done this over 50 times, and he is candid about why it matters so much to him: he used to be this problem.
Years ago, Steve led a migration that was supposed to take six months. Eighteen months later, the team had 70 microservices, three service meshes (they kept starting new ones without finishing the old), and monitoring tools that needed their own monitoring. Two senior engineers quit. The VP of Engineering gave Steve 90 days or the team would be replaced.
Those 90 days changed everything. The team deleted roughly 50 of the 70 services, ripped out all the service meshes, and cut deployment time from three weeks of chaos to three days, consistently. Six months later, one of the engineers who had left came back. That experience became the foundation for Platform Fix.
As Steve puts it: "While everyone's collecting cloud native tools like Pokemon cards, I'm trying to help teams figure out which ones to throw away and which ones to keep."
Why Platform Complexity Happens
Steve explains that organizations fall into a complexity trap by continuously adding tools without questioning whether they are actually needed. He describes walking into companies where the platform team spends 65–70% of their time explaining their own platform to the people using it. His verdict: "That's not a team, that's a help desk with infrastructure access."
People inside the complexity normalize it. They cannot see the problem because they have been living in it for months or years. Steve identifies several drivers: conference-fueled recency bias (someone sees a shiny tool at KubeCon and adopts it without evaluating the need), resume-driven architecture (engineers choosing tools to pad their CVs), and a culture where everyone is trained to add but nobody asks "what if we remove something instead?"
He illustrates the resume-driven pattern with a story from a 200-person fintech. A senior hire — "Mark" — proposed a full stack: Kubernetes, Istio, Argo, Crossplane, Backstage, Vault, Prometheus, Loki, Tempo, and more. The CTO approved it because "Spotify uses it, so it must be best practice." Eighteen months and $2.3 million later, six engineers were needed just to keep it running, developers waited weeks to deploy, and Mark left — with "led Kubernetes migration" on his CV. When Steve asked what Istio was actually solving, nobody could answer. It was costing around $250,000 to run, for a problem that could have been fixed with network policies.
He also highlights a telling sign: he asked three people in the same company how many Kubernetes clusters they needed and got three completely different answers. "That's not a technical disagreement. That's a sign that nobody's aligned on what the platform is actually for."
The AI Layer: Tool Fatigue Gets Worse
Paulina observes that the same tool-sprawl pattern is now being repeated with AI tooling — an additional layer of fatigue on top of what already exists in the cloud-native space. Steve agrees and adds three dimensions to the AI complexity problem: choosing which LLM to use, learning how to write effective prompts, and figuring out who is accountable when AI-written code does not work as expected. Mattias notes that AI also enables anyone to build custom tools for their specific needs, which further expands the toolbox and potential for sprawl.
How Leaders Can Spot a Bloated Platform
One of the most practical segments is Steve's framework for helping leaders who are not hands-on with engineering identify platform bloat. He gives them three things to watch for:
Steve uses a memorable analogy: many platforms are like the Sagrada Familia in Barcelona — they look incredibly impressive and intricate, but they are never actually finished. The question leaders should ask is: what does an MVP platform look like, what tools does it need, and how do we start delivering business value to the developers who use it? Because, as Steve says, "if we're not building any business value, we're just messing around."
Who the Platform Is Really For
Mattias asks the fundamental question: who is the platform actually for? Steve's answer is direct — the platform's customers are the developers deploying workloads to it. A platform without applications running on it is useless.
He distinguishes three stages:
- Vanilla Kubernetes: the out-of-the-box cluster
- Platform Kubernetes: the foundational workloads the platform needs to function (secret management, observability, perhaps a service mesh)
- The actual platform: only real once applications are being deployed and business value is delivered
The hosts discuss how some teams build platforms for themselves rather than for application developers or the business, which is a fast track to unnecessary complexity.
Kubernetes: Standard Tool or Premature Choice?
The episode explores when Kubernetes is the right answer and when it is overkill. Steve emphasizes that he loves Kubernetes — he has contributed to the Flux project and other CNCF projects — but only when it is earned. He gives an example of a startup with three microservices, ten users, and five engineers that chose Kubernetes because "Google uses it" and the CTO went to KubeCon. Six months later, they had infrastructure that could handle ten million users while serving about 97 users.
"Google needs Kubernetes, but your Series B startup needs to ship features."
Steve also shares a recent on-site engagement where he ran the unit economics on day two: the proposed architecture needed four times the CPU and double the RAM for identical features. One spreadsheet saved the company from a migration that would have destroyed the business model. "That's the question nobody asks before a Kubernetes migration — does the maths actually work?"
Mattias pushes back slightly, noting that a small Kubernetes cluster can still provide real benefits if the team already has the knowledge and tooling. Paulina adds an important caveat: even if a consultant can deploy and maintain Kubernetes, the question is whether the customer's own team can realistically support it afterward. The entry skill set for Kubernetes is significantly higher than, say, managed Docker or ECS.
Managed Services and "Boring Is Beautiful"
Steve's recommendation for many teams is straightforward: managed platforms, managed databases, CI/CD that just works, deploy on push, and go home at 5 p.m. "Boring is beautiful, especially when you call me at 3 a.m."
He illustrates this with a company that spent 18 months and roughly $850,000 in engineering time building a custom deployment system using well-known CNCF tools. The result was about 80–90% as good as GitHub Actions. The migration to GitHub Actions cost around $30,000, and the ongoing maintenance cost was zero.
Paulina adds that managed services are not completely zero maintenance either, but the operational burden is orders of magnitude less than self-managed infrastructure, and the cloud provider takes on a share of the responsibility.
The New Tool Tax: Why "Free" Tools Are Never Free
A central theme is that open-source tools carry hidden costs far exceeding their license fee. Steve introduces the new tool tax framework with four components, using Vault (at a $40,000 license) as an example:
Total year-one cost: roughly $243,000 — a 6x multiplier over the $40,000 budget. And as Steve points out, most teams never present this full picture to leadership.
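The multiplier arithmetic can be sketched in a few lines. The hidden-cost category names below are illustrative assumptions, not the episode's actual breakdown; only the $40,000 license and the roughly $243,000 total come from the conversation.

```python
def year_one_cost(license_fee, hidden_costs):
    """Total first-year cost of a tool and its multiplier over the visible budget."""
    total = license_fee + sum(hidden_costs.values())
    return total, total / license_fee

# Category names and their split are invented for illustration; the point is
# that the license fee is only a fraction of what the tool actually costs.
hidden = {
    "integration_engineering": 90_000,
    "training_and_onboarding": 40_000,
    "operations_and_upgrades": 53_000,
    "incident_and_support_time": 20_000,
}
total, multiplier = year_one_cost(40_000, hidden)
print(total, round(multiplier, 1))  # 243000 6.1
```

Presenting this full figure, rather than just the license line item, is exactly the picture Steve says most teams never show leadership.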
Mattias extends the point to tool documentation complexity, noting that anyone who has worked with Envoy's configuration knows how complicated it can be. Steve adds that Envoy is written in C++ — "How many C++ developers do you have in your organization? Probably zero." — yet teams adopt it because it offers 15 to 20 features that may or may not be useful.
This is the same total cost of ownership concept the industry has used for on-premises hardware, but applied to the seemingly "free" cloud-native landscape. The tools are free to install, but they are not free to manage and maintain.
Why Service Meshes Are Often the First to Go
When Mattias asks which tool type Steve most often deletes, the answer is service meshes. Steve does not name a specific product but says six or seven times out of ten, service meshes exist because someone thought they were cool, not because the team genuinely needed mutual TLS, rate limiting, or canary deploys at the mesh level.
Mattias agrees: in his experience, he has never seen an environment that truly required a service mesh. The demos at KubeCon are always compelling, but the implementation reality is different. Steve adds a self-deprecating note — this was him in the past, running three service meshes simultaneously because none of them worked perfectly and he kept starting new ones in test mode.
A Framework for Deleting Tools
Steve outlines three frameworks he uses to systematically simplify platforms.
The Simplicity Test is a diagnostic that scores platform complexity across ten dimensions on a scale of 0 to 50: tool sprawl, deployment complexity, cognitive load, operational burden, documentation debt, knowledge silos, incident frequency, time to production, self-service capability, and team satisfaction. A score of 0–15 is sustainable, 16–25 is manageable, 26–35 is a warning, and 36–50 is crisis. Over 400 engineers have taken it; the average score is around 34. Companies that call Steve typically score 38 to 45.
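The banding logic can be sketched as a small function. One assumption is mine: that each of the ten dimensions is rated 0–5 and the ratings are summed, which is inferred from the 0–50 total rather than stated in the episode.

```python
def simplicity_band(score):
    """Map a 0-50 simplicity-test score to the episode's four bands."""
    if not 0 <= score <= 50:
        raise ValueError("score must be between 0 and 50")
    if score <= 15:
        return "sustainable"
    if score <= 25:
        return "manageable"
    if score <= 35:
        return "warning"
    return "crisis"

# Assumption: ten dimensions, each rated 0-5, summed into the overall score.
# These example ratings land on the reported average of around 34.
dimensions = [4, 3, 4, 3, 4, 3, 3, 4, 3, 3]
score = sum(dimensions)
print(score, simplicity_band(score))  # 34 warning
```

Note that the reported average of 34 sits at the top of the warning band, and the 38–45 range typical of Steve's clients is already in or near crisis.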
The Four Buckets categorize every tool: Essential (keep it), Redundant (duplicates something else — delete immediately), Over-engineered (solves a real problem but is too complicated — simplify it), or Premature (built for future scale you don't have yet — delete for now).
From one engagement with 47 tools: 12 were essential, 19 redundant, 11 over-engineered, and 5 premature — meaning 35 were deletable.
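The tally from that engagement works out as follows; the per-bucket counts come from the episode, while the tool names themselves are omitted here.

```python
# Bucket counts from the engagement described above. A real audit would list
# each tool by name under its bucket; only the totals are reproduced here.
buckets = {"essential": 12, "redundant": 19, "over_engineered": 11, "premature": 5}

total = sum(buckets.values())
# Everything outside the essential bucket is a deletion or simplification candidate.
deletable = total - buckets["essential"]
print(total, deletable)  # 47 35
```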
He then prioritizes by impact versus risk, tackling high-impact, low-risk items first. For example, a large customer had Datadog, Prometheus, and New Relic running simultaneously with no clear rationale. Deleting New Relic took three hours, saved $30,000, and nobody noticed. Seventeen abandoned databases with zero connections in 30 days were deprecated by email, then deleted — zero responses, zero impact.
The security angle matters here too: one of those abandoned databases was an unpatched attack surface sitting in production with no one monitoring it. Paulina adds a related example — her team once found a Flyway instance that had gone unpatched for seven or eight years because each team assumed the other was maintaining it. As she puts it, lack of ownership creates the same kind of hidden risk.
The 30-Day Cleanup Sprint
Steve structures platform simplification as a focused 30-day effort:
He illustrates this with a company whose VP of Engineering — "Sarah" — told him: "This isn't a technical problem anymore. This is a people problem." Two senior engineers had quit on the same day with the same exit interview: "I'm tired of fighting the platform." One said he had not had dinner with his kids on a weekend in six months. The team's morale score was 3.2 out of 10.
The critical insight: the team already knew what was wrong. They had known for months. But nobody had been given permission to delete anything. "That's not a cultural problem and it's not a knowledge problem. It's a permissions problem. And I gave them the permission."
Results: complexity score dropped from 42 to 26, monthly costs fell from $150,000 to $80,000 (roughly $840,000 in annual savings), and deployment time improved from two weeks to one day.
But Steve emphasizes the human outcome. A developer told him afterward: "Steve, I went home at 5 p.m. yesterday. It's the first time in eight months. And my daughter said, 'Daddy, you're home.'" That, Steve says, is what this work is really about.
Golden Paths, Guardrails, and Developer Experience
Mattias says he wants the platform he builds to compete with the easiest external options — Vercel, Netlify, and the like. If developers would rather go elsewhere, the internal platform has failed.
Steve agrees and describes a pattern he sees constantly: developers do not complain when the platform is painful — they route around it. He gives an example from a fintech where a lead developer ("James") needed a test environment for a Friday customer demo. The official process required a JIRA ticket, a two-day wait, YAML files, and a pipeline. Instead, James spun up a Render instance on his personal credit card: 12 minutes, deployed, did the demo, got the deal. Nobody knew for three months, until finance found the charges.
Steve's view: that is not shadow IT or irresponsibility — it is a rational response to poor platform usability. "The fastest path to business value went around the platform, not through it."
The solution is what Steve calls the golden path — or, as he reframes it using a bowling alley analogy, golden guardrails. Like the bumpers that keep the ball heading toward the pins regardless of how it is thrown, the guardrails keep developers on a safe path without dictating exactly how they get there. The goal is hitting the pins — delivering business value.
Mattias extends the guardrails concept to security: the easiest path should also be the most secure and compliant one. If security is harder than the workaround, the workaround wins every time. He aims to make the platform so seamless that developers do not have to think separately about security — it is built into the default experience.
Measuring Outcomes, Not Features
Steve argues that platform teams should measure developer outcomes, not platform features: time to first deploy, time to fix a broken deployment, overall developer satisfaction, and how secure and compliant the default deployment paths are.
He recommends monthly platform retrospectives where developers can openly share feedback. In these sessions, Steve goes around the room and insists that each person share their own experience rather than echoing the previous speaker. This builds a backlog of improvements directly tied to real developer pain.
Paulina agrees that feedback is essential but notes a practical challenge: in many organizations, only a handful of more active developers provide feedback, while the majority say they do not have time and just want to write code. Collecting representative feedback requires deliberate effort.
She also raises the business and management perspective. In her consulting experience, she has seen assessments include a third dimension beyond the platform team and developers: business leadership, who focus on compliance, security, and cost. Sometimes the platform enables fast development, but management processes still block frequent deployment to production — a mindset gap, not a technical one. Steve agrees and points to value stream mapping as a technique for surfacing these bottlenecks with data.
Translating Engineering Work Into Business Value
Steve makes a forceful case that engineering leaders must express technical work in business terms. "The uncomfortable truth is that engineering is a cost center. We exist to support profit centers. The moment we forget that, we optimize for architectural elegance instead of business outcomes — and we lose the room."
He illustrates this with a story: a CFO asked seven engineering leaders one question — "How long to rebuild production if we lost everything tomorrow?" Five seconds of silence. Ninety-four years of combined experience, and nobody could answer. "That's where engineering careers die."
The translation matters at every level. Saying "we deleted a Jenkins server" means nothing to a CFO. Saying "we removed $40,000 in annual costs and cut deployment failures by 60%" gets attention.
Steve challenges listeners to take their last three technical achievements and rewrite each one with a currency figure, a percentage, and a timeframe. "If you can't, you're speaking engineering, not business."
Closing Advice: Start Deleting This Week
Steve's parting advice is concrete: pick one tool you suspect nobody is using, check the logs, and if nothing has happened in 30 days, deprecate it. In 60 days, delete it. He also offers the simplicity test for free — it takes eight minutes, produces a 0-to-50 score with specific recommendations, and is available by reaching out to him directly.
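That 30-day log check can be sketched as a small script, assuming you can export a last-activity timestamp per tool from your logging system. The tool names and timestamps here are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

def stale_tools(last_activity, now, idle_days=30):
    """Return tools with no recorded activity in the last `idle_days` days.

    `last_activity` maps tool name -> datetime of the most recent log entry;
    in practice these timestamps would come from your logging system.
    """
    cutoff = now - timedelta(days=idle_days)
    return sorted(name for name, ts in last_activity.items() if ts < cutoff)

# Illustrative data only: names and dates are made up.
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
activity = {
    "ci-pipeline": datetime(2025, 5, 30, tzinfo=timezone.utc),
    "legacy-dashboard": datetime(2025, 3, 2, tzinfo=timezone.utc),
    "old-metrics-db": datetime(2024, 11, 15, tzinfo=timezone.utc),
}
print(stale_tools(activity, now))  # ['legacy-dashboard', 'old-metrics-db']
```

Anything the script flags goes into the deprecation email; per Steve's advice, deletion follows 30 days later if nobody objects.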
"Your platform's biggest risk isn't technical — it's political. Platforms die when the CFO asks you a question you can't answer, when your best engineer leaves, or when the team builds for their CV instead of the business."
Highlights
We are always happy to answer any questions, hear suggestions for new episodes, or hear from you, our listeners.
DevSecOps Talks podcast LinkedIn page
DevSecOps Talks podcast website
DevSecOps Talks podcast YouTube channel
Summary
In this episode, Mattias, Andre, and Paulina welcome back returning guest Paul from System Initiative to continue a conversation that started in the previous episode about their project Swamp. The discussion digs into how AI-assisted software development has changed over the past year, and why the real shift is not "AI writes code" but humans orchestrating multiple specialized agents with strong guardrails. Paul walks through the practical workflows, multi-layered testing, architecture-first thinking, cost discipline, and security practices his team has adopted — while the hosts push on how this applies across enterprise environments, mentoring newcomers, and the uncomfortable question of who is responsible when AI-built software fails.
Key Topics

The industry crossroads: layoffs, fear, and a new reality
Before diving into technical specifics, Paul acknowledges that the industry is at "a real crazy crossroads." He references Block (formerly Square) cutting roughly 40% of their workforce, citing uncertainty about what AI means for their teams. He wants to be transparent that System Initiative also shrank — but clarifies the company did not cut people because of AI. The decision to reduce headcount came before they even knew what they were going to build next, let alone how they would build it. AI entered the picture only after they started prototyping the next version of their product.
Block's February 2026 layoffs, announced by CEO Jack Dorsey, eliminated over 4,000 positions. The move was framed as an AI-driven restructuring, making it one of the most visible examples of AI anxiety playing out in real corporate decisions.
From GenAI hype to agentic collaboration
Paul explains that AI coding quality shifted significantly around October–November of the previous year. Before that, results were inconsistent — sometimes impressive, often garbage. Then the models improved dramatically in both reasoning and code generation.
But the bigger breakthrough, in his view, was not the models themselves. It was the industry's shift from "Gen AI" — one-shot prompting where you hand over a spec and accept whatever comes back — to agentic AI, where the model acts more like a pair programmer. In that setup, the human stays in the loop, challenges the plan, adds constraints, and steers the result toward something that fits the codebase.
He gives a concrete early example: System Initiative had a CLI written in Deno (a TypeScript runtime). Because the models were well-trained on TypeScript libraries and the Deno ecosystem, they started producing decent code. Not beautiful, not perfectly architected — but functional. When Paul began feeding the agent patterns, conventions, and existing code to follow, the output became coherent with their codebase.
This led to a workflow where Paul would open six Claude Code sessions at once in separate Git worktrees — isolated copies of the repository on different branches — each building a small feature in parallel, feeding them bug reports and data, and continuously interacting with the results rather than one-shotting them.
Git worktrees let you check out multiple branches of the same repository simultaneously in separate directories. Each worktree is independent, so you can work on several features at once and merge them back via pull requests.
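The worktree setup described above can be sketched from TypeScript by shelling out to the git CLI — a minimal illustration, assuming git is on PATH; the branch names and throwaway temp-directory layout are invented for the example:

```typescript
// Hypothetical sketch of the parallel-worktree setup, driven from TypeScript by
// shelling out to git (assumed on PATH). Branch names are invented.
import { execFileSync } from "node:child_process";
import { mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const git = (cwd: string, ...args: string[]): string =>
  execFileSync("git", args, { cwd, encoding: "utf8" });

// Throwaway repo to demonstrate against.
const repo = mkdtempSync(join(tmpdir(), "wt-demo-"));
git(repo, "init", "-b", "main");
git(repo, "-c", "user.email=demo@example.com", "-c", "user.name=demo",
    "commit", "--allow-empty", "-m", "init");

// One isolated checkout per agent session, each on its own new branch.
const worktreeBase = mkdtempSync(join(tmpdir(), "wt-sessions-"));
for (const branch of ["feature-a", "feature-b", "feature-c"]) {
  git(repo, "worktree", "add", "-b", branch, join(worktreeBase, branch));
}

// The main checkout plus three session worktrees, all independent directories.
console.log(git(repo, "worktree", "list").trim().split("\n").length); // 4
```

Each directory can then host its own agent session, and the branches merge back through ordinary pull requests.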
He later expanded this by running longer tasks on a Mac Mini accessible via Tailscale (a mesh VPN), while handling shorter tasks on his laptop — effectively distributing AI workloads across machines.
Why architecture matters more than ever
One of Paul's strongest themes is that AI shifts engineering attention away from syntax and back toward architecture. He argues that AI can generate plenty of code, but without design principles and boundaries it will produce spaghetti on top of existing spaghetti.
He introduces the idea of "the first thousand lines" — an anecdote he read recently claiming that the first thousand lines of code an agent helps write determine its path forward. If those lines are well-structured and follow clear design principles, the agent will build coherently on top of them. If they are messy and unprincipled, everything after will compound the mess.
Paul breaks software development into three layers:
He argues the industry spent the last decade obsessing over "taste" while often mocking "ivory tower architects" — the people who designed systems but didn't write code. In an AI-driven world, those architectural concerns become critical again because the agent needs clear boundaries, domain structure, and intent to produce coherent output.
Paulina agrees and observes that this trend may also blur traditional specialization lines, pushing engineers toward becoming more general "software people" rather than narrowly front-end, back-end, or DevOps specialists.
Encoding design docs, rules, and constraints into the repo
Paul describes how his team makes architecture actionable for AI by encoding system knowledge directly into the repository. Their approach has several layers:
Design documents — Detailed docs covering the model layer (the actual objects, their purposes, how they relate), workflow construction (how models connect and pass data), and expression language behavior. These live in a /design folder in the open-source repo and describe the intent of every part of the system.
Architectural rules — The agent is explicitly told to follow Domain-Driven Design: proper separation between domains, infrastructure, repositories, and output layers. The DDD skill is loaded so the agent understands and maintains bounded contexts.
Code standards — TypeScript strict mode, no any types, named exports, passing lint and format checks. License compliance is also enforced: because the project is AGPL v3, the agent cannot pull in dependencies with incompatible licenses.
Skills — A newer mechanism for lazy-loading contextual information into the AI agent. Rather than stuffing everything into one enormous prompt, skills are loaded on demand when the agent encounters a specific type of task. This keeps context windows lean and focused.
AGPL v3 (GNU Affero General Public License) is a copyleft license that requires anyone who runs modified software over a network to make the source code available. This creates strict constraints on what dependencies can be used.
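The skills mechanism described above — context loaded only when a task needs it — can be sketched roughly like this; the skill names, trigger patterns, and payloads are all invented for illustration and are not Claude Code's actual skill API:

```typescript
// Hedged sketch of lazy-loaded skills: context enters the prompt only when a
// task matches a skill's trigger. Names and triggers are illustrative only.
interface Skill {
  trigger: RegExp;
  load: () => string; // in reality this would read a skill file on demand
}

const skills: Record<string, Skill> = {
  ddd: { trigger: /domain|repository|bounded context/i, load: () => "DDD layering rules..." },
  license: { trigger: /dependency|package|install/i, load: () => "AGPL v3 compatibility rules..." },
};

function contextFor(task: string): string[] {
  // Only matching skills load, keeping the context window lean and focused.
  return Object.values(skills)
    .filter((s) => s.trigger.test(task))
    .map((s) => s.load());
}

console.log(contextFor("refactor the invoice domain").length); // 1: only the DDD skill loads
console.log(contextFor("write a haiku").length); // 0: nothing loads
```

The point of the pattern is the filter step: irrelevant rules never consume context-window tokens.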
Multi-agent development: the full chain
A major part of the discussion centers on how Paul's team works with multiple specialized AI agents rather than a single all-knowing assistant. The chain looks like this:
Issue triage agent — When a user opens a GitHub issue, an agent evaluates whether it is a legitimate feature request or bug report. The agent's summary is posted back to the issue immediately, creating context for later stages.
Planning agent — If the issue is legitimate, the system enters plan mode. A specification is generated and posted for the user to review. Users can push back ("that's not how I think it should work"), and the plan is revised until everyone agrees.
Implementation agent — The code is written based on the approved plan, with all the design docs, architectural rules, and skills loaded as context.
Happy-path reviewer — A separate agent reviews the code against standards, checking that it loads correctly and appears to function.
Adversarial reviewer — Added just days before the recording, this agent is told: "You are a grumpy DevOps engineer and I want you to pull this code apart." It looks for security injection points, failure modes, and anything the happy-path reviewer might miss.
Both review agents write their findings as comments on the pull request, creating a visible audit trail. The PR only merges when both agents approve. If the adversarial agent flags a security vulnerability, the implementation goes back for changes.
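The two-reviewer gate amounts to a simple all-approvals check. A minimal sketch — the Review shape is invented for illustration; in practice the gate is the PR's required approvals:

```typescript
// Minimal sketch of the dual-review merge gate. The Review shape is invented
// for illustration; the real gate lives in the PR's required approvals.
type ReviewAgent = "happy-path" | "adversarial";
interface Review {
  agent: ReviewAgent;
  approved: boolean;
  findings: string[];
}

function canMerge(reviews: Review[]): boolean {
  const required: ReviewAgent[] = ["happy-path", "adversarial"];
  // Every required reviewer must be present and approving.
  return required.every((agent) =>
    reviews.some((r) => r.agent === agent && r.approved),
  );
}

console.log(canMerge([
  { agent: "happy-path", approved: true, findings: [] },
  { agent: "adversarial", approved: false, findings: ["possible path traversal"] },
])); // false: the adversarial reviewer blocks the merge
```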
Paul says this "Jekyll and Hyde" review setup caught a path traversal bug in their CLI during its first week. While the CLI runs locally and the risk was limited, it proved the value of adversarial review.
Path traversal is a vulnerability where an attacker can access files outside the intended directory by manipulating file paths (e.g., using ../ sequences). Even in CLI tools, this can expose sensitive files on a user's machine.
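A typical guard against this class of bug — not Swamp's actual code — resolves the requested path and verifies it stays inside the allowed base directory:

```typescript
// Illustrative path-traversal guard (not Swamp's actual code): resolve the
// requested path and only accept it if it stays inside the base directory.
import { resolve, sep } from "node:path";

function resolveSafe(baseDir: string, userPath: string): string | null {
  const base = resolve(baseDir);
  const target = resolve(base, userPath);
  // Safe = the base itself, or strictly inside it. The trailing-separator
  // check stops "/srv/data-evil" from passing as inside "/srv/data".
  return target === base || target.startsWith(base + sep) ? target : null;
}

console.log(resolveSafe("/srv/data", "reports/q1.txt")); // stays inside the base
console.log(resolveSafe("/srv/data", "../../etc/passwd")); // null: escapes the base
```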
Mattias compares the overall process to a modernized CI/CD pipeline — the same stages exist (commit, test, review, promote, release), but AI replaces some of the manual implementation steps while humans stay focused on architecture, review, and acceptance.
Why external pull requests are disabled
One of the more provocative decisions Paul describes: the open-source Swamp project does not accept external pull requests. GitHub recently added a feature to disable PR creation from non-collaborators entirely, and the team turned it on immediately.
The reasoning is supply chain control. Because the project's code is 100% AI-generated within a tightly controlled context — design docs, architectural rules, skills, adversarial review — they want to ensure that all code entering the system passes through the same pipeline. External PRs would bypass that chain of custody.
Contributors are instead directed to open issues. The team will work through the design collaboratively, plan it together, and then have their agents implement it. Paul frames this not as rejecting collaboration but as controlling the process: "We love contributions, but in the AI world, we cannot control where that code is from or what that code is doing."
Self-reporting bugs: AI filing its own issues
The team built a skill into Swamp itself so that when the tool encounters a bug during use, it can check out the version of the source code the binary was built against, analyze the problem, and open a GitHub issue automatically with detailed context.
This creates high-quality bug reports that already contain the information needed to reason about a fix. When the implementation agent later picks up that issue, it has precise context — where the bug is, what triggered it, and what the expected behavior should be. Paul says the quality of issues generated this way is significantly higher than typical user-filed bugs.
Testing: the favorite part
Although the conversation starts with code generation, Paul says testing is actually his favorite part of the workflow. The team runs multiple layers:
Product-level tests:
- Unit and integration tests — standard code-level verification
- Architectural fitness tests — contract tests, property tests, and DDD boundary checks that verify the domain doesn't leak and the agent followed its instructions
Architectural fitness tests are automated checks that verify a system's structure conforms to its intended architecture. In DDD, this means ensuring bounded contexts don't leak dependencies across domain boundaries.
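A crude sketch of such a fitness check, here just pattern-matching import statements in domain-layer source for references to an assumed infrastructure layer (real implementations typically walk the module graph rather than grep lines):

```typescript
// Crude architectural fitness check: flag import lines in domain-layer source
// that reach into infrastructure or output layers. Layer names are assumed.
const forbiddenImports = [
  /from\s+["'][^"']*\/infrastructure\//,
  /from\s+["'][^"']*\/output\//,
];

function layerViolations(source: string): string[] {
  return source
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => forbiddenImports.some((re) => re.test(line)));
}

const domainFile = `
import { Invoice } from "./invoice";
import { PgInvoiceRepo } from "../infrastructure/pg-invoice-repo";
`;
console.log(layerViolations(domainFile)); // flags only the infrastructure import
```

Run as part of CI, a check like this catches the agent (or a human) quietly breaking a bounded context.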
User-level tests (separate repo):
- User flow tests — written from the user's perspective against compiled binaries, not source code. These live in a different repository specifically so they are not influenced by how the code is written. They test scenarios like: create a repository, extend the system, create a workflow, run a model, handle wrong inputs.
Adversarial tests (multiple tiers):
1. Security boundary tests — path traversal, environment variable exposure, supply chain attack vectors. Paul references the recent Trivy incident, where a bot stole an API key and used it to delete all of Trivy's GitHub releases and publish a poisoned VS Code extension.
2. State corruption — what happens when someone tampers with the state layer
3. Concurrency — multiple writes, lock failures, race conditions
4. Resource exhaustion — handling pathological inputs like a 100MB stdout message injected into a workflow
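The resource-exhaustion tier can be illustrated with a simple capture cap — a sketch, with an assumed 1MB limit, of how a workflow might refuse to retain a pathological 100MB stdout burst:

```typescript
// Sketch of a resource-exhaustion guard: cap how much of a step's stdout the
// workflow retains. The 1MB limit is an assumption for illustration.
const MAX_CAPTURE_BYTES = 1024 * 1024;

function capCapture(chunks: string[], maxBytes = MAX_CAPTURE_BYTES): string {
  let out = "";
  for (const chunk of chunks) {
    if (out.length + chunk.length > maxBytes) {
      // Keep only what fits, and mark that output was dropped.
      return out + chunk.slice(0, maxBytes - out.length) + "\n[truncated]";
    }
    out += chunk;
  }
  return out;
}

const burst = "x".repeat(100 * 1024 * 1024); // the pathological 100MB message
console.log(capCapture([burst]).length); // the 1MB cap plus the truncation marker
```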
Only after all these layers pass does a build get promoted from nightly to stable. Paul can download and manually test any nightly build that maps back to a specific commit.
Paulina points out that if AI is a force multiplier, there is now even less excuse not to write tests. Paul agrees: "We were scraping the barrel before at coming up with reasons why there shouldn't be any tests. Now that's eliminated."
Plan mode as a safety rail
Paul repeatedly emphasizes "plan mode," particularly in Claude Code. Before the agent changes anything, it produces a detailed plan describing what it intends to do and why, and waits for human approval.
The hosts immediately draw a parallel to terraform plan — the value is not just automation, but the chance to inspect intended changes before applying them. Paul says this was one of the biggest improvements in AI-assisted development because it reduces horror-story scenarios where an agent goes off and deletes a database or rewrites an application.
He notes that other tools are starting to adopt plan mode because it produces better results across the board. But he also warns that plan mode only helps if people actually read the plan — just like Terraform, the safeguard depends on human discipline. "If there's a big line in the middle that says 'I'm going to delete a database' and you haven't read it — it's the same thing."
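The safeguard boils down to: surface destructive steps, then refuse to apply without explicit approval. A toy sketch — the destructive-action pattern and plan shape are invented for illustration:

```typescript
// Toy sketch of the plan-mode safeguard: list the plan, surface destructive
// steps, and refuse to apply without explicit approval. Pattern and plan
// shape are invented for illustration.
interface PlanStep {
  description: string;
}

const DESTRUCTIVE = /\b(delete|drop|destroy)\b/i;

function planWarnings(steps: PlanStep[]): string[] {
  return steps.filter((s) => DESTRUCTIVE.test(s.description)).map((s) => s.description);
}

function apply(steps: PlanStep[], approved: boolean): string {
  if (!approved) return "blocked: plan not approved";
  return `applied ${steps.length} steps`;
}

const plan: PlanStep[] = [
  { description: "add index to invoices table" },
  { description: "drop table old_logs" }, // the line you'd better actually read
];
console.log(planWarnings(plan)); // ["drop table old_logs"]
console.log(apply(plan, false)); // "blocked: plan not approved"
```

As Paul's warning implies, the machinery only helps if the human reads the warnings before passing `approved = true`.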
Practical lessons for getting good results
Paul shares several tactical lessons:
He also notes that the generated TypeScript is not always how a human would write it — but that matters less if the result is well-tested, secure, and respects domain boundaries. "The actual syntax of the code itself can change 12 times a day. It doesn't really matter as long as it adheres to the product."
Human oversight at every stage
Despite all the automation, Paul is adamant that humans remain involved at every stage. Plans are reviewed, implementations are questioned, pull request comments are inspected, binaries are tested before reaching stable release. He describes it as "continually interacting with Claude Code" rather than just letting things happen.
When Paulina pushes on whether a human still checks things before production, Paul makes clear: yes, always. The release pipeline goes from commit to nightly to manual verification to stable promotion. "I will always download something before it goes to stable."
The context-switching tax
Paul acknowledges that running multiple agents in parallel is not for everyone. Context switching has always been expensive for engineers, and commanding multiple agents simultaneously is a new form of it. His advice: if you work best focusing on a single task, don't force the multi-agent style. "It'll be such a context switching killer and it'll cause you to lose focus."
The key shift is that instead of writing code, you are "commanding architecture and commanding design." But that still requires focus and judgment.
AI as a force multiplier, not a replacement
Paulina captures the dynamic bluntly: "It's a multiplier. If there is a good thing, you'll get a lot of good thing. If it's a shit, you're going to get a lot of shit."
Paul argues that experienced software and operations people are still essential because they understand architecture, security, constraints, and tradeoffs. AI amplifies whatever is already there — good engineering or bad engineering alike.
He believes engineers who learn to use these tools well become "even more important to your company than you already are." But he also acknowledges that some people will not want to work this way, and that friction between AI-forward and AI-resistant teams is already happening in organizations.
The challenge for juniors and newcomers
Paulina raises this personally — she was recently asked to mentor someone entering IT and struggled with how to approach it. She doesn't have a formal IT education (she has an engineering background) and learned on the go. The skills she built through manual work — understanding when code needs refactoring as scale changes, knowing how to structure projects at different sizes — are hard to teach when AI handles so much of the writing.
Paul agrees this is an open question and says the industry is still figuring out the patterns. He believes teaching principles, architecture, and core engineering fundamentals becomes even more important, because tool-specific syntax is increasingly handled by AI. "Do you need to know how to write a Terraform module? Do you need to know how to write a Pulumi provider?" — these are becoming less essential as individual skills, while understanding how systems fit together matters more.
He frames this as an opportunity: "We are now in control of helping shape how this moves forward in the industry." As innovators and early adopters, current practitioners can set the patterns. If they don't, someone else will.
Security, responsibility, and the risk of low-code AI
Paulina raises a concrete example from Poland: someone built an app using AI to upload receipts to a centralized accounting system, released it publicly, and exposed all their customers' data.
This leads to a deeper question from Mattias about responsibility: if someone with no engineering background builds an insecure app using an AI tool, who is accountable? The user? The platform? The model provider? The episode doesn't settle this, but Paul argues it reinforces why skilled engineers remain essential. The AI doesn't know the security boundary unless someone explicitly teaches it — "it probably wasn't fed that information that it had to think about the security context."
He expects more specialized skills and agents focused on security, accessibility, and compliance to emerge — calling out the example of loading a security skill and an accessibility skill when you know an app will be public-facing. But he says the ecosystem is not fully there yet.
Cost discipline: structure beats vibe coding
Paul addresses economics directly. His five-person team at System Initiative all use Claude Max Pro at $200 per person per month. They do not exceed that cost for the full AI workflow — code generation, reviews, planning, and adversarial testing.
In contrast, he has seen other organizations spend $10,000–$12,000 per month per developer on AI tokens because they let tools roam with huge context windows and vague instructions. His conclusion: tightly scoped tasks are not just better for quality — they are far cheaper.
This maps directly to classic engineering wisdom. Tightly defined stories and tasks were always more efficient to push through a system than "go rebuild this thing and I'll see you in six months." The same principle applies to AI agents.
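The gap is stark when written out — a back-of-envelope comparison using the figures quoted in the episode:

```typescript
// Back-of-envelope comparison of the two spending models from the episode.
const teamSize = 5;
const flatPlanPerSeat = 200;      // USD/month per engineer on the flat plan
const tokenSpendPerDev = 10_000;  // USD/month, low end of the quoted $10k-$12k range

const flatTotal = teamSize * flatPlanPerSeat;   // 5 * 200 = 1000
const tokenTotal = teamSize * tokenSpendPerDev; // 5 * 10000 = 50000

// At the low end, unscoped token spending costs 50x the flat-plan team.
console.log(flatTotal, tokenTotal, tokenTotal / flatTotal); // 1000 50000 50
```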
How to introduce AI in cautious organizations
For teams in companies that ban or restrict AI, Paul suggests a pragmatic entry point: use agents to analyze, not to write code.
He describes a conversation with someone in London who asked how to get started. Paul's advice: if you already know roughly where a bug lives, ask the agent to analyze the same bug report. If it identifies the same area and the same root cause, you have evidence that the tool can accelerate diagnosis. Show your CTO: "I'm diagnosing bugs 50% faster with this agent. It's not writing code — it's helping me understand where the issue is."
Similar analysis-first use cases work for accessibility reviews, security scans, or code quality assessments. The point is to build trust before expanding scope. Paul notes this approach works faster in the private sector than the public sector, where technology adoption has always been slower.
The pace of change is accelerating
Paul believes the conversation has shifted dramatically in the past six months — from AI horror stories and commiserating over drinks to genuine success stories and conferences forming around agentic engineering practices. He points to two upcoming events:
His prediction: the pace is not linear. "We're honestly exponential at this moment in time." He sidesteps the ethics of AI companies (referencing tensions between Anthropic and OpenAI) to focus on the practical reality that models, reasoning, and tooling are all improving at a compounding rate.
Highlights
We are always happy to answer any questions, hear suggestions for new episodes, or hear from you, our listeners.
DevSecOps Talks podcast LinkedIn page
DevSecOps Talks podcast website
DevSecOps Talks podcast YouTube channel
Paul agrees this is an open question and says the industry is still figuring out the patterns. He believes teaching principles, architecture, and core engineering fundamentals becomes even more important, because tool-specific syntax is increasingly handled by AI. "Do you need to know how to write a Terraform module? Do you need to know how to write a Pulumi provider?" — these are becoming less essential as individual skills, while understanding how systems fit together matters more.
He frames this as an opportunity: "We are now in control of helping shape how this moves forward in the industry." As innovators and early adopters, current practitioners can set the patterns. If they don't, someone else will.
Security, responsibility, and the risk of low-code AIPaulina raises a concrete example from Poland: someone built an app using AI to upload receipts to a centralized accounting system, released it publicly, and exposed all their customers' data.
This leads to a deeper question from Mattias about responsibility: if someone with no engineering background builds an insecure app using an AI tool, who is accountable? The user? The platform? The model provider? The episode doesn't settle this, but Paul argues it reinforces why skilled engineers remain essential. The AI doesn't know the security boundary unless someone explicitly teaches it — "it probably wasn't fed that information that it had to think about the security context."
He expects more specialized skills and agents focused on security, accessibility, and compliance to emerge — calling out the example of loading a security skill and an accessibility skill when you know an app will be public-facing. But he says the ecosystem is not fully there yet.
Cost discipline: structure beats vibe codingPaul addresses economics directly. His five-person team at System Initiative all use Claude Max Pro at $200 per person per month. They do not exceed that cost for the full AI workflow — code generation, reviews, planning, and adversarial testing.
In contrast, he has seen other organizations spend $10,000–$12,000 per month per developer on AI tokens because they let tools roam with huge context windows and vague instructions. His conclusion: tightly scoped tasks are not just better for quality — they are far cheaper.
This maps directly to classic engineering wisdom. Tightly defined stories and tasks were always more efficient to push through a system than "go rebuild this thing and I'll see you in six months." The same principle applies to AI agents.
How to introduce AI in cautious organizationsFor teams in companies that ban or restrict AI, Paul suggests a pragmatic entry point: use agents to analyze, not to write code.
He describes a conversation with someone in London who asked how to get started. Paul's advice: if you already know roughly where a bug lives, ask the agent to analyze the same bug report. If it identifies the same area and the same root cause, you have evidence that the tool can accelerate diagnosis. Show your CTO: "I'm diagnosing bugs 50% faster with this agent. It's not writing code — it's helping me understand where the issue is."
Similar analysis-first use cases work for accessibility reviews, security scans, or code quality assessments. The point is to build trust before expanding scope. Paul notes this approach works faster in the private sector than the public sector, where technology adoption has always been slower.
The pace of change is acceleratingPaul believes the conversation has shifted dramatically in the past six months — from AI horror stories and commiserating over drinks to genuine success stories and conferences forming around agentic engineering practices. He points to two upcoming events:
His prediction: the pace is not linear. "We're honestly exponential at this moment in time." He sidesteps the ethics of AI companies (referencing tensions between Anthropic and OpenAI) to focus on the practical reality that models, reasoning, and tooling are all improving at a compounding rate.
Highlights
Andrey and Mattias share a fast re:Invent roundup focused on AWS security. What do VPC Encryption Controls, post-quantum TLS, and org-level S3 block public access change for you? Which features should you switch on now, like ECR image signing, JWT checks at ALB, and air-gapped AWS Backup? Want simple wins you can use today?
We are always happy to answer any questions, hear suggestions for new episodes, or hear from you, our listeners.
DevSecOps Talks podcast LinkedIn page
DevSecOps Talks podcast website
DevSecOps Talks podcast YouTube channel
SummaryIn this episode, Andrey and Mattias deliver a security-heavy recap of AWS re:Invent 2025 announcements, while noting that Paulina is absent and wishing her a speedy recovery. Out of the 500+ releases surrounding re:Invent, they narrow the list down to roughly 20 features that security-conscious teams can act on today — covering encryption, access control, detection, backups, container security, and organization-wide guardrails. Along the way, Andrey reveals a new AI-powered product called Boris that watches the AWS release firehose so you don't have to.
Key Topics AWS re:Invent Through a Security LensThe hosts frame the episode as the DevSecOps Talks version of a re:Invent recap, complementing a FivexL webinar held the previous month. Despite the podcast's name covering development, security, and operations, the selected announcements lean heavily toward security. Andrey is upfront about it: if security is your thing, stay tuned; otherwise, manage your expectations.
At the FivexL webinar, attendees were asked to prioritize areas of interest across compute, security, and networking. AI dominated the conversation, and people were also curious about Amazon S3 Vectors — a new S3 storage class purpose-built for vector embeddings used in RAG (Retrieval-Augmented Generation) architectures that power LLM applications. It is cost-efficient but lacks hybrid search at this stage.
VPC Encryption and Post-Quantum ReadinessOne of the first and most praised announcements is VPC Encryption Control for Amazon VPC, a pre-re:Invent release that lets teams audit and enforce encryption in transit within and across VPCs. The hosts highlight how painful it used to be to verify internal traffic encryption — typically requiring traffic mirroring, spinning up instances, and inspecting packets with tools like Wireshark. This feature offers two modes: monitor mode to audit encryption status via VPC flow logs, and enforce mode to block unencrypted resources from attaching to the VPC.
Mattias adds that compliance expectations are expanding. It used to be enough to encrypt traffic over public endpoints, but the bar is moving toward encryption everywhere, including inside the VPC. The hosts also call out a common pattern: offloading SSL at the load balancer and leaving traffic to targets unencrypted. VPC encryption control helps catch exactly this kind of blind spot.
The discussion then shifts to post-quantum cryptography (PQC) support rolling out across AWS services including S3, ALB, NLB, AWS Private CA, KMS, ACM, and Secrets Manager. AWS now supports ML-KEM (Module Lattice-Based Key Encapsulation Mechanism), a NIST-standardized post-quantum algorithm, along with ML-DSA (Module Lattice-Based Digital Signature Algorithm) for Private CA certificates.
The rationale: state-level actors are already recording encrypted traffic today in a "harvest now, decrypt later" strategy, betting that future quantum computers will crack current encryption. Andrey notes that operational quantum computing feels closer than ever, making it worthwhile to enable post-quantum protections now — especially for sensitive data traversing public networks.
S3 Security Controls and Access ManagementSeveral S3-related updates stand out. Attribute-Based Access Control (ABAC) for S3 allows access decisions based on resource tags rather than only enumerating specific actions in policies. This is a powerful way to scope permissions — for example, granting access to all buckets tagged with a specific project — though it must be enabled on a per-bucket basis, which the hosts note is a drawback even if necessary to avoid breaking existing security models.
The bigger crowd-pleaser is S3 Block Public Access at the organization level. Previously available at the bucket and account level, this control can now be applied across an entire AWS Organization. The hosts call it well overdue and present it as the ultimate "turn it on and forget it" control: in 2026, there is no good reason to have a public S3 bucket.
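The existing account-level control already has a public API (`PutPublicAccessBlock` in S3 Control); the new feature applies the same four settings across an AWS Organization in one step. A minimal sketch of the account-level version, assuming boto3 and valid credentials:

```python
# All four settings must be on for full protection; the new org-level
# control applies the same block across every account at once.
FULL_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

def block_account(account_id: str) -> None:
    """Apply the account-level block via the existing S3 Control API."""
    import boto3  # imported lazily so the sketch stays self-contained
    boto3.client("s3control").put_public_access_block(
        AccountId=account_id,
        PublicAccessBlockConfiguration=FULL_BLOCK,
    )
```

With the organization-level control, you set this once at the root instead of looping `block_account` over every member account.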
Container Image SigningAmazon ECR Managed Image Signing is a welcome addition. ECR now provides a managed service for signing container images, leveraging AWS Signer for key management and certificate lifecycle. Once configured with a signing rule, ECR automatically signs images as they are pushed. This eliminates the operational overhead of setting up and maintaining container image signing infrastructure — previously a significant barrier for teams wanting to verify image provenance in their supply chains.
Backups, Air-Gapping, and Ransomware ResilienceAWS Backup gets significant attention. The hosts discuss air-gapped AWS Backup Vault support as a primary backup target, positioning it as especially relevant for teams where ransomware is on the threat list. These logically air-gapped vaults live in an Amazon-owned account and are locked by default with a compliance vault lock to ensure immutability.
The strong recommendation: enable AWS Backup for any important data, and keep backups isolated in a separate account from your workloads. If an attacker compromises your production account, they should not be able to reach your recovery copies. Related updates include KMS customer-managed key support for air-gapped vaults for better encryption flexibility, and GuardDuty Malware Protection for AWS Backup, which can scan backup artifacts for malware before restoration.
Data Protection in DatabasesDynamic data masking in Aurora PostgreSQL draws praise from both hosts. Using the new pg_columnmask extension, teams can configure column-level masking policies so that queries return masked data instead of actual values — for example, replacing credit card numbers with wildcards. The data in the database remains unmodified; masking happens at query time based on user roles.
Mattias compares it to capabilities already present in databases like Snowflake and highlights how useful it is when sharing data with external partners or other teams. When the idea of using masked production data for testing comes up, the hosts gently push back — don't do that — but both agree that masking at the database layer is a strong control because it reduces the risk of accidental data exposure through APIs or front-end applications.
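To make the behavior concrete — this is a plain-Python illustration of what query-time masking does, not the pg_columnmask API, and the "auditor" role is a hypothetical example:

```python
def mask_card(value: str) -> str:
    """Replace all but the last four digits with '*', keeping separators."""
    digits = [c for c in value if c.isdigit()]
    remaining = len(digits) - 4  # digits to hide
    out = []
    for c in value:
        if c.isdigit() and remaining > 0:
            out.append("*")
            remaining -= 1
        else:
            out.append(c)
    return "".join(out)

def select_card(role: str, stored: str) -> str:
    # Data at rest is unmodified; masking happens per-role at query time,
    # which is the key property of the pg_columnmask approach.
    return stored if role == "auditor" else mask_card(stored)
```

So `select_card("analyst", "4111 1111 1111 1234")` yields `"**** **** **** 1234"`, while the stored value is untouched — which is why a leak through an API or front end exposes only the masked form.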
Identity, IAM, and Federation ImprovementsThe episode covers several IAM-related features. AWS IAM Outbound Identity Federation allows federating AWS identities to external services via JWT, effectively letting you use AWS identity as a platform for authenticating to third-party services — similar to how you connect GitHub or other services to AWS today, but in the other direction.
The AWS Login CLI command provides short-lived credentials for IAM users who don't have AWS IAM Identity Center (SSO) configured. The hosts see it as a better alternative than storing static IAM credentials locally, but also question whether teams should still be relying on IAM users at all — their recommendation is to set up IAM Identity Center and move on.
The AWS Source VPC ARN condition key gets particular enthusiasm. It allows IAM policies to check which VPC a request originated from, enabling conditions like "allow this action only if the request comes from this VPC." For teams doing attribute-based access control in IAM, this is a significant addition.
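A policy using the new condition would look roughly like the sketch below. Note the condition key name `aws:SourceVpcArn` is an assumption based on the announcement's wording — check the IAM global condition key reference for the exact name — and the VPC and account identifiers are placeholders:

```python
import json

# ASSUMPTION: key name inferred from the announcement; verify against
# the IAM global condition context keys reference before using.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/*",
        "Condition": {
            "StringEquals": {
                "aws:SourceVpcArn": "arn:aws:ec2:eu-north-1:111122223333:vpc/vpc-0abc123"
            }
        },
    }],
}

policy_json = json.dumps(policy, indent=2)
```

The appeal for ABAC-style setups is that the ARN carries the account and region, so one condition can pin access to a specific VPC in a specific account rather than just a VPC ID.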
AWS Secrets Manager Managed External Secrets is another useful feature that removes a common operational burden. Previously, rotating third-party SaaS credentials required writing and maintaining custom Lambda functions. Managed external secrets provides built-in rotation for partner integrations — Salesforce, BigID, and Snowflake at launch — with no Lambda functions needed.
Better Security at the Network and Service LayerJWT verification in AWS Application Load Balancer simplifies machine-to-machine and service-to-service authentication. Teams previously had to roll their own Lambda-based JWT verification; now it is supported out of the box. The recommendation is straightforward: drop the Lambda and use the built-in capability.
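To appreciate what the built-in capability replaces, here is the shape of the DIY verification teams used to maintain in a Lambda. Real machine-to-machine setups typically use RS256 with a JWKS endpoint; this self-contained sketch uses HS256 with a shared secret purely to show the moving parts:

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def b64url_decode(data: str) -> bytes:
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))

def verify_hs256(token: str, secret: bytes) -> dict:
    """Minimal DIY verification: check signature and expiry, return claims."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = b64url(hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                               hashlib.sha256).digest())
    if not hmac.compare_digest(expected, sig_b64.encode()):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(payload_b64))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims

# Build a token the same way to demonstrate the round trip.
secret = b"shared-secret"
header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode()).decode()
payload = b64url(json.dumps({"sub": "service-a", "exp": int(time.time()) + 60}).encode()).decode()
sig = b64url(hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()).decode()
claims = verify_hs256(f"{header}.{payload}.{sig}", secret)
```

Every line of this — key distribution, expiry handling, constant-time comparison — is code someone had to own and patch; moving verification into the load balancer deletes that maintenance burden.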
AWS Network Firewall Proxy is in public preview. While the hosts have not explored it deeply, their read is that it could help with more advanced network inspection scenarios — not just outgoing internet traffic through NAT gateways, but potentially also traffic heading toward internal corporate data centers.
Developer-Oriented: REST API StreamingAlthough the episode is mainly security-focused, the hosts include REST API streaming in Amazon API Gateway as a nod to developers. This enables progressive response payload streaming, which is especially relevant for LLM use cases where streaming tokens to clients is the expected interaction pattern. Mattias notes that applications are moving beyond small JSON payloads — streaming is becoming table stakes as data volumes grow.
Centralized Observability and DetectionCloudWatch unified management for operational, security, and compliance data promises cross-account visibility from a single pane of glass, without requiring custom log aggregation pipelines built from Lambdas and glue code. The hosts are optimistic but immediately flag the cost: CloudWatch data ingest pricing can escalate quickly when dealing with high-volume sources like access logs. Deep pockets may be required.
Detection is a recurring theme throughout the episode. The hosts discuss CloudTrail Insights for data events (useful if you are already logging data-plane events — another deep-pockets feature), extended threat detection for EC2 and ECS in GuardDuty using AI-powered analysis to correlate security signals across network activity, runtime behavior, and API calls, and the public preview of AWS Security Agent for automated security investigation.
On GuardDuty specifically, the recommendation is clear: if you don't have it enabled, go enable it — it gives you a good baseline that works out of the box across your services with minimal setup. You can always graduate to more sophisticated tooling later, but GuardDuty is the stopgap you start with.
Mattias drives the broader point home: incidents are inevitable, and what you can control is how fast you detect and respond. AWS is clearly investing heavily in the detection side, and teams should enable these capabilities as fast as possible.
Control Tower, Organizations, and Guardrails at ScaleSeveral updates make governance easier to adopt at scale:
- Dedicated controls for AWS Control Tower without requiring a full Control Tower deployment — you can now use Control Tower guardrails à la carte.
- Automatic enrollment in Control Tower — a feature the hosts feel should have existed already.
- Required tags in Organizations stack policies — enforcing tagging standards at the organization level.
- Amazon Inspector organization-wide management — centralized vulnerability scanning across all accounts.
- Billing transfer for AWS Organizations — useful for AWS resellers managing multiple organizations.
- Delete protection for CloudWatch Log Groups — a small but important safeguard.
Mattias says plainly: everyone should use Control Tower.
MCP Servers and AWS's Evolving AI ApproachThe conversation shifts to the public preview of AWS MCP (Model Context Protocol) servers. Unlike traditional locally-hosted MCP servers that proxy LLM requests to API calls, AWS is taking a different approach with remote, fully managed MCP servers hosted on AWS infrastructure. These allow AI agents and AI-native IDEs to interact with AWS services over HTTPS without running anything locally.
AWS launched four managed MCP servers — AWS, EKS, ECS, and SageMaker — that consolidate capabilities like AWS documentation access, API execution across 15,000+ AWS APIs, and pre-built agent workflows. However, the IAM model is still being worked out: you currently need separate permissions to call the MCP server and to perform the underlying AWS actions it invokes. The hosts treat this as interesting but still evolving.
Boris: AI for AWS Change AwarenessToward the end of the episode, Andrey reveals a personal project: Boris (getboris.ai), an AI-powered DevOps teammate he has been building. Boris connects to the systems an engineering team already uses and provides evidence-based answers and operational automation.
The specific feature Andrey has been working on takes the AWS RSS feed — where new announcements land daily — and cross-references it against what a customer actually has running in their AWS Organization. Instead of manually sifting through hundreds of releases, Boris sends a digest highlighting only the announcements relevant to your environment and explaining how you would benefit.
Mattias immediately connects this to the same problem in security: teams are overwhelmed by the constant flow of feature updates and vulnerability news. Having an AI that filters and contextualizes that information is, in his words, "brilliant."
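The cross-referencing idea reduces to a small filtering step. The sketch below is illustrative only — sample feed items and the service inventory are invented, and Boris's actual implementation is not public — but it shows the core move of matching announcements against what an organization actually runs:

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the shape of the AWS "What's New" RSS feed
# (illustrative items, not real announcements).
SAMPLE_FEED = """<rss><channel>
  <item><title>Amazon GuardDuty adds example capability</title></item>
  <item><title>Amazon SageMaker example update</title></item>
  <item><title>AWS Backup example feature</title></item>
</channel></rss>"""

# Hypothetical set of services deployed in the customer's AWS Organization.
inventory = {"GuardDuty", "Backup", "ECS"}

def relevant_items(feed_xml: str, services: set) -> list:
    """Keep only announcements that mention a service the org runs."""
    titles = [i.findtext("title") for i in ET.fromstring(feed_xml).iter("item")]
    return [t for t in titles if any(s in t for s in services)]

digest = relevant_items(SAMPLE_FEED, inventory)
```

Here the SageMaker item is dropped because nothing in the inventory matches it — the digest only carries the GuardDuty and Backup announcements, which is the "only what's relevant to your environment" behavior Andrey describes.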
Andrey also announces that Boris has been accepted into the Tehnopol AI Accelerator in Tallinn, Estonia — a program run by the Tehnopol Science and Business Park that supports early-stage AI startups — selected from more than 100 companies.
Highlights
We are always happy to answer any questions, hear suggestions for new episodes, or hear from you, our listeners.
DevSecOps Talks podcast LinkedIn page
DevSecOps Talks podcast website
DevSecOps Talks podcast YouTube channel
SummarySystem Initiative has undergone a dramatic transformation: from a visual SaaS infrastructure platform with 17 employees to Swamp, a fully open-source CLI built for AI agents, maintained by a five-person team whose initials literally spell the product name. Paul Stack returns for his third appearance on the show to explain why the old model failed — and why handing an AI agent raw CLI access to your cloud is, as Andrey puts it, just "console-clicking in the terminal." The conversation gets sharp when the hosts push on what problem Swamp actually solves, whether ops teams are becoming the next bottleneck in AI-era delivery, and why Paul believes the right move is not replacing Terraform but giving AI a structured system it can reason about. Paul also drops a parting bombshell: he hasn't written a single line of code in four weeks.
Key Topics System Initiative's pivot from visual editor to AI-first CLIPaul Stack explains that System Initiative spent over five years iterating on a visual infrastructure tool where users could drag, drop, and connect systems. Despite the ambition, the team eventually concluded that visual composition was too slow, too cumbersome, and too alien for practitioners accustomed to code, artifacts, and reviewable changes.
The shift started in summer 2025 when Paul spiked a public OpenAPI-spec API. A customer then built an early MCP (Model Context Protocol) server on top of it — a prototype that worked but had no thought given to token usage or tool abstraction. System Initiative responded by building its own official MCP server and pairing it with a CLI. The results were dramatically better: customers could iterate easily from the command line or through AI coding tools like Claude Code.
By Christmas 2025 the writing was on the wall. The CLI-plus-agent approach was producing better outcomes, while the company was still carrying hundreds of thousands of lines of code for a distributed SaaS platform built for a previous product direction. In mid-January 2026, the company made the call to rethink everything from first principles.
The team behind the nameThe restructuring was painful. System Initiative went from 17 people to five. Paul explains the reasoning candidly: when you don't know what the tool is going to be, keeping a large team around is unfair to them, bad for their careers, and expensive. The five who stayed were the CEO, VP of Business, COO, Paul (who ran product), and Nick Steinmetz, the head of infrastructure — who also happened to be System Initiative's most active internal user, having used the platform to build System Initiative itself.
Those five people's initials spell SWAMP. The name was unintentional but stuck — and Paul notes with a grin that if they ever remove the "P," it becomes "SWAM," so he's safe even if he leaves. Beyond the joke, the name fits: Swamp stores operational data in a local .swamp/ directory — not a neatly formatted data lake, but a structured store that AI agents can pull from to reason about infrastructure state and history.
Why raw AI agent access to infrastructure is dangerousA major theme in the conversation is that letting an AI agent operate infrastructure directly — through the AWS CLI or raw API calls — is fundamentally unreliable. Andrey lays out the problem clearly: this kind of interaction is equivalent to clicking around the cloud console, just automated through a terminal. It is not repeatable, not reviewable, and inherits the non-deterministic behavior of LLMs. If the agent's context window fills up, it starts to forget earlier decisions and improvises — a terrifying prospect for production infrastructure.
What made System Initiative's earlier MCP-based direction compelling, in Andrey's view, was the combination of guardrails, repeatability, and human review. The agent generates a structured specification, a human reviews it, and only then is it applied. Paul agrees and calls this the "agentic loop with the human loop" — the strongest pattern they found.
Token costs and the case for local-first architecturePaul shares a hard-won lesson from building MCP integrations: a poorly designed MCP server burns enormous amounts of tokens and creates unnecessary costs for users. He spent three weeks in December reworking the server to use progressive context reveal rather than flooding the model with data. Even so, the fundamental problem with a SaaS-first architecture remained — constantly transmitting context between a central API and the user's agent was expensive regardless of optimization.
That experience pushed the team toward a local-first design. Swamp keeps data on the user's machine, close to where the agent operates, giving AI the context it needs without the round-trip overhead and cost of a remote service.
What Swamp actually isSwamp is a general-purpose, open-source CLI automation tool — not just another infrastructure-as-code framework. Its core building blocks are models, workflows, and the local .swamp/ data directory that agents read from.
Critically, Swamp ships with zero built-in models — no pre-packaged AWS EC2, VPC, or GCP resource definitions. Instead, the AI agent uses installed skills to generate models on the fly. Paul describes a user who joined the Discord that very morning, asked Swamp to create a schema for managing Let's Encrypt certificates, and it worked on the first attempt without writing any code.
Nick Steinmetz provides another example: he manages his homelab Proxmox hypervisor entirely through Swamp — creating and starting VMs, inspecting hypervisor state, and monitoring utilization. He recently connected it to Discord so friends can run commands like @swamp create vm to spin up Minecraft and gaming servers on demand.
How Swamp fits with AI coding toolsThe hosts spend significant time pinning down where Swamp sits relative to tools like Claude Code, bash access, and existing automation. Paul is clear: Swamp is not an AI wrapper or chatbot. It is a structured runtime that gives agents guardrails and reusable patterns.
Mattias works through several analogies to help frame it — is it like n8n or Zapier for the CLI? A CLI-based Jenkins where jobs are agents? Paul settles on this: it is a workflow engine driven by typed models, where data can be chained between steps using CEL (Common Expression Language) expressions — the same dot-notation referencing used in Kubernetes API declarations. A simple example: create a VPC in step one, then reference VPC.resource.attributes.vpcid as input to a subnet model in step two.
In Paul's personal workflow, he uses Claude Code to generate models and workflows, checks them into Git for peer review, and then runs them manually or through CI at a time of his choosing. He has explicitly configured Claude with a permission deny on workflow run — the agent helps build automation but never executes it. The same CLI works whether a person or an agent runs it; the difference is timing and approval.
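Claude Code supports deny rules in its settings file, which is one way to implement the "build but never execute" split Paul describes. A hedged sketch: the `permissions.deny` mechanism is real Claude Code configuration, but the exact rule string for blocking Swamp runs is an assumption based on his description:

```json
{
  "permissions": {
    "deny": [
      "Bash(swamp workflow run:*)"
    ]
  }
}
```

With a rule like this in `.claude/settings.json`, the agent can still generate and edit workflow files, but any attempt to execute them from the terminal is refused, leaving execution to a human or CI.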
Reusability, composition, and Terraform interopSwamp workflows are parameterized and reusable across environments. If they grow unwieldy, workflows can orchestrate other workflows, collect outputs, and manage success conditions — similar to GitHub Actions calling other actions.
Paul also demonstrates that Swamp can sit alongside existing tooling rather than replacing it. In a live Discord session, he built infrastructure models in Swamp and then asked the AI agent to generate the equivalent Terraform configuration. Because the agent had typed models with explicit relationships, it produced correct Terraform with proper resource dependencies. This positions Swamp less as a replacement mandate and more as a reasoning and control layer that can output to whatever format teams already use.
When one of the hosts compares Swamp to general build systems like Gradle, Paul draws a key distinction: traditional tools were designed for humans to write, review, and debate. Swamp is designed for AI agents to inspect and operate within. He references Anton Babenko's widely-used terraform-aws-vpc module — with its 237+ input variables — as an example of a human-centric design that agents struggle with due to version dependencies, module structure complexity, and stylistic decisions baked in over years. Swamp instead provides the agent with structured context, explicit typing, and historical artifacts it can query.
Open source, AGPL v3, and monetizationPaulina asks the natural question: if Swamp is fully open source under AGPL v3, how does the company make money?
Paul is candid that monetization is not the immediate priority — the focus is building a tool that resonates with users first. But he outlines a potential model: a marketplace-style ecosystem where users can publish their own models and workflows, while System Initiative offers supported, maintained, and paid-for versions of commonly needed building blocks. He draws a loose comparison to Docker Hub's model of community images alongside official ones.
The deeper argument is strategic: Paul believes there is no longer a durable moat in software. If users dislike a tool today, AI makes it increasingly feasible to build their own. Rather than trying to control all schemas and code, the team wants to make Swamp so extensible that users build on top of it rather than walking away from it.
Are ops teams becoming the next bottleneck?Paul argues that software development productivity is accelerating so fast with AI that ops teams risk becoming the next bottleneck — echoing earlier industry transitions from physical servers to cloud and from manual provisioning to infrastructure as code. Development teams can now move at a pace that traditional infrastructure workflows cannot match.
Andrey agrees with the premise but pushes back on where the bottleneck actually sits today. In his experience — spending "day and night burning tokens" on AI-assisted development — the real constraint is testing, not deployment. He describes pipelines that can go from idea to pull request automatically, but stall without a strong test harness and end-to-end validation. Without sufficient tests, you never even reach the deployment phase.
Paul accepts the framing and says the goal of Swamp is to strip away lower-value friction — fighting with file layouts, naming conventions, writing boilerplate models — so teams can invest their time where engineering rigor still matters most: testing, validation, and production safety.
Swamp as an addition, not a forced replacementPaul closes with an important positioning point: Swamp does not require teams to discard their Terraform, Pulumi, or existing infrastructure investments. It can be introduced alongside current tooling to interrogate infrastructure, validate what existing IaC does, and extend automation in AI-native ways. The extensibility is the point — users control when things run, what models to build, and how to integrate with their existing stack.
Highlights "Giving an agent raw CLI access to your cloud is basically console-clicking in the terminal." — AndreyAndrey challenges the assumption that AI-driven infrastructure is automatically safer. If an agent is just shelling out to the AWS CLI, the result may be fast — but it is non-deterministic, non-repeatable, and forget-prone once the context window fills up.
The future of infra automation needs guardrails before it needs speed. Listen to hear why structured workflows beat flashy demos.
"The best loop was the agentic loop with the human loop." — Paul StackThe breakthrough was not autonomous infrastructure execution. It was letting the AI generate structured specs while humans stay in charge of review and execution. Paul even blocks Claude Code from running workflows directly on his machine.
If "human in the loop" sounds conservative, this episode makes the case that it is the only production-safe pattern we have. Listen for the full argument.
"There is no longer a moat in software." — Paul StackPaul argues that AI has changed the economics of building software so fundamentally that no team can rely on implementation complexity as a competitive advantage. If users dislike your tool, they can build their own — faster than ever before.
That belief is why Swamp is open source, extensible, and ships with zero built-in models. Listen for a candid take on product strategy when anyone can clone your work.
"Ops teams are going to become the bottlenecks that we once were." — Paul StackAs development velocity explodes with AI, Paul warns that infrastructure teams risk slowing everything down — the same pattern that played out in the shifts from physical servers to cloud and from cloud to IaC.
Andrey fires back: the real bottleneck today is testing, not deployment. Listen for a sharp debate on where delivery pipelines are actually stuck.
"I haven't written a single line of code in four weeks." — Paul StackPaul reveals that the entire Swamp repository is AI-generated, with four machines running in parallel to churn out plans and implementations — including customer feature requests. The team teases a future episode to compare notes on AI-driven development workflows.
If that claim doesn't make you want to hear the follow-up, nothing will.
ResourcesSwamp CLI on GitHub — The open-source, AGPL v3 licensed CLI tool discussed in the episode. Models, workflows, and a local .swamp/ data directory designed for AI agent interaction.
System Initiative — The company behind Swamp, originally known for its visual infrastructure platform, now pivoted to AI-native CLI automation.
Model Context Protocol (MCP) — Anthropic's open protocol for connecting AI models to external tools and data sources. Paul discusses the challenges of building MCP servers that are token-efficient.
Claude Code — Anthropic's agentic coding tool that runs in the terminal. Used throughout the episode as the primary AI agent interface for Swamp workflows.
CEL — Common Expression Language — The expression language Swamp uses for chaining data between workflow steps, similar to how Kubernetes uses it for API declarations and validation policies.
Proxmox Virtual Environment — The open-source hypervisor platform that Nick Steinmetz manages entirely through Swamp in his homelab, including Discord-driven VM creation.
terraform-aws-modules/vpc — Anton Babenko's widely-used Terraform VPC module, referenced by Paul as an example of human-centric IaC design with 237+ inputs that agents struggle to navigate.
We are always happy to answer any questions, hear suggestions for new episodes, or hear from you, our listeners.
DevSecOps Talks podcast LinkedIn page
DevSecOps Talks podcast website
DevSecOps Talks podcast YouTube channel
It’s been a while since OpenTofu was released to the public, so we wanted to check in on where it stands today. How is the community adopting it? What’s the public sentiment? And how does it differ from Terraform in terms of features?
This time we’re joined by Cole Bittel, an experienced SRE, platform engineer, and contributor to OpenTofu. He shares his hands-on experience migrating to OpenTofu, and we look into the problems teams face with infrastructure as code and how both Terraform and OpenTofu approach solving them.
Still pasting tokens into Slack? What types of secrets are at risk, and which tools fit which consumer—humans, CI/CD, or workloads? Where do most teams stumble, and how do you fix it fast? Hear our no-nonsense checklist.
Connect with us on LinkedIn or X (see info at https://devsecops.fm/about/). We are happy to answer any questions, hear suggestions for new episodes, or hear from you, our listeners.
The video version of this episode is available on our YouTube channel
LinkedIn page of the DevSecOps Talks team is here
Passkeys are gaining attention as a new way to log in without passwords. How do they work, and how do they compare to traditional multi-factor authentication (MFA)? In this episode, we explore the history of passwords, the strengths and weaknesses of common MFA methods, and the potential of passkeys to enhance security. What threats do passkeys mitigate, and what still remain?
In this guest episode, we chat with Davlet Dzhakishev, co-founder of Cloudgeni, who’s working on an AI-powered approach to fixing compliance issues in IaC. What’s the state of tools in this space? Where does his idea fit in? And how should we think about the relationship between compliance and security?
We are looking into recently announced AWS Resource Control Policies. What are they? How are they different from Service Control Policies? What is a Data Perimeter? Tune in to find out!
Andrey has been exploring GitHub Actions and has some insights to share. How have CI/CD solutions transformed over time, and what innovations do GitHub Actions bring to the table? Julien drops a few tools that could be useful for GitHub Actions users.
Welcome to the first DevSecOps Talks episode of the new year! It's been a whole year since ChatGPT hit the scene – but how has AI adoption shaped our world since then? Join Julien, Mattias, and Andrey as they dive into the impact of AI on their workflows. How have their daily tech tools and practices evolved with AI integration? Plus, Julien gives us an insider's look at running models locally. Are these AI tools enhancing our efficiency? Tune in to find out how these advancements are reshaping the landscape of DevSecOps.
Is the grass greener outside the cloud? This episode dives into the trend of companies (notably Hey and Dropbox) migrating away from cloud services. Why are they leaving, and who would benefit from such a move? We also scrutinize the common belief that public clouds are overly expensive. Join us as we dissect various cloud cost optimization tools and techniques.
You know our fondness for Terraform, but we are also open to exploring other tools. This episode is no different. We are joined by Igor Soroka, an expert in AWS serverless technology whose tool of choice is AWS CDK, but at the same time, he is no stranger to Terraform. We ask him practical questions about the tool and get answers based on his experience applying it to real-life projects. If you have been curious about CDK, how it functions, and if it's appropriate for you, then tune in to learn more.
In this episode, Mattias is joined by Ben Goodman, the founder of dragondrop.cloud, a platform that offers Terraform Best Practices as a Pull Request. They discuss the best workflows for Terraform, open-source tools that can be used in conjunction with Terraform, the most effective best practices, and common pitfalls to avoid when implementing infrastructure as code using Terraform.
In this episode of DevSecOps Talks, join Andrey, Julien, and Mattias as they dive into the world of Backstage, the notable internal development platform. Mattias is keen to peel back the layers and understand what makes people think of Backstage as a must-have in modern DevOps toolchains. Andrey highlights the platform's core feature: a comprehensive registry that keeps track of every software service running within a company. Could this signify a revival of IT change management, but with a twist? We've moved on from the days of cumbersome ticketing systems to streamlined internal development platforms. The team also ponders the future role of infrastructure engineers as they navigate the rising tides of AI—will AI become the new face behind these developer portals? Tune in to find out!
Our dialogue with Paul Stack resumes on DevSecOps Talks, almost two years after our initial podcast about his work on Pulumi (episode 25). As a warm-up, we talk about what prompted his move from Pulumi and his take on the OpenTF drama. The main topic of the episode is Paul's current focus, System Initiative; we probe into its purpose, the progress so far, and the promise it holds for redefining how we think of doing Infrastructure as Code and DevSecOps workflows in general.
In this episode of DevSecOps Talks, we dive deep into HashiCorp's recent shift to the Business Source License and its implications. Join Andrey, Julien, and Mattias as they unpack what this means for practitioners and explore the timeline of the OpenTF initiative. Stay informed about what lies ahead with our latest discussion. Tune in!
We had the opportunity to talk with Neatsun Ziv, one of the founders of Ox Security, about the Open Source Software Supply Chain Attack Reference Framework (https://pbom.dev). We delved deeper into possible attack vectors and explored ways to mitigate some of them. During our discussions, we also had a couple of unusual takes on supply chain security. If you are looking to understand the Open Source Software Supply Chain, then this episode is perfect for you.
This time we got to talk about Lingon, an open-source project developed by Julien and Jacob, a frequent podcast guest. Discover the motivations behind Lingon's creation and how it bridges the gap between Terraform and Kubernetes. Learn how Lingon simplifies infrastructure management, tackles frustrations with YAML and HCL, and offers greater control and automation.
Diving into the world of bare-metal servers, Mattias takes the helm solo for this episode. He's accompanied by special guests Michael Wagner and Ian Evans from Metify, the company that powers Mojo - a leading platform for bare-metal provisioning automation.
While we often chat about the big cloud service providers, this time we're switching gears. If you've been curious about how real-world, physical servers are set up and managed, this episode is just for you. Join Mattias, Michael, and Ian as they dive into the nuts and bolts of setting up servers - a topic that Mattias is super passionate about.
In this episode, we discuss the evolution of AWS networking capabilities from EC2-classic to VPC and advanced networking features. Andrey highlights that while many companies only use VPC and VPC peerings, there are lesser-known features that can significantly change how we approach networking setups on AWS.
This is a mixed-bag episode: we chat about all sorts of digital tools and security practices that we use in our day-to-day lives. We start by talking about password managers and why Julien is still using LastPass after the recent LastPass data breach. Julien gives us the lowdown on his personal approach to handling passwords and two-factor authentication (2FA) tokens, showing us why strong security measures matter.
Julien also shares his favorite email alias service and we discuss services for sharing sensitive information to keep mail inboxes cleaner and more private.
We also spoke about ChatGPT, an AI language model from OpenAI - will it replace jobs? should we be using it? And how?
Just a heads up, we aren't sponsored by companies we mention in this episode. We're just sharing our personal experiences and the stuff we like to use.
Julien has extensive experience building data platforms for data engineering, so we got him talking and sharing. If infra for data engineering is your cup of tea, then this episode is for you.
We discussed tracing before but never got around to explaining details such as fundamentals, terminology, etc. This time Julien goes into detail about what tracing is, what the benefits are, the basic terms you need to understand, and where to start. Great episode for those who are considering adding tracing capabilities to their systems.
We are happy to welcome back Jacob Lärfors, CEO and Senior Consultant at Verifa, to talk about software supply chain attacks. It feels important to raise this topic since these attacks are being used more and more often by sophisticated adversaries. At the same time, software supply chain security is something that companies often overlook. We as practitioners have so many things to consider and do that, in most cases, we do not have much cognitive capacity left for looking into our library sources. What are the things we need to be aware of, and what are the low-hanging fruits we could use to help developers do their job securely?
Have you heard any recent news from Docker? We haven't. That is why we decided to check up on Docker to see how it is doing and go through the tool's history and adoption. Clueless about the difference between Docker, containerd, and CRI-O? We've got you covered. We also highlight a couple of handy capabilities added recently.
We are excited about the new breed of tools coming to the market. We have often had to cobble tools together to find out what was in production and what broke it. Your monitoring tools only go as far as telling you that something isn't working as expected, not why, and then you have to scramble to figure out what versions of services are in production, whether there were any recent deploys, and so on, so you can understand what has changed and narrow down possible causes. Our good friend Mike and his team are building a tool to answer exactly such questions, so we thought you might be interested in hearing him out.
We discuss what has happened in the Terraform world since the 1.0 release last year, whether there are new features worth mentioning, trends in Terraform development, and more. We also recap the road to 1.0 and how long it took us to get there.
If you follow the CloudNative hype wave, you might feel that Prometheus is the must-use monitoring tool for everything CloudNative. Plus, almost everything nowadays has a Prometheus exporter. Just get that Helm chart installed, and there you go - the metrics question is sorted out. Want to monitor endpoints? Here is the Blackbox exporter for you. Want to get notifications? Alertmanager has you covered. And so on and so on. But is it all rainbows and unicorns? You probably guessed: it depends. This time, Semyon joins us to air his grievances with Prometheus and share insights on how to cook it if you decide to go down this route.
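As a taste of the wiring involved, here is a minimal, illustrative Prometheus scrape config that probes an endpoint through the Blackbox exporter; the target URL, module name, and exporter address are placeholders you would adapt to your setup:

```yaml
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]        # Blackbox module: expect an HTTP 2xx response
    static_configs:
      - targets:
          - https://example.com # endpoint to probe (placeholder)
    relabel_configs:
      # Pass the original target as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed endpoint as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Actually scrape the Blackbox exporter, not the target itself
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

The relabeling dance is the part people most often get wrong: Prometheus scrapes the exporter, which in turn probes the real target.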
Communication in co-located teams is quite often complicated. It is even more complex, and at the same time more important, in distributed teams. Have you ever gotten an issue report that just says "this thing is failing"? No logs, no explanation of context, nothing. Pretty sure we've all been in such situations. How do you step up your communication game? This episode of DevSecOps Talks is about great communication tips for DevSecOps practitioners in distributed (and not only distributed) teams.
Connect with us on LinkedIn or Twitter https://devsecops.fm/about/ and tell us about your questions, and we will answer them in the show.
web3 has gotten a lot of attention lately; thus, it is time for us to separate facts from the hype. In this episode, we are trying to understand its implications for us as DevSecOps practitioners.
Andrey feels frustrated that he has to develop a way to configure environments for every customer. Think about it: you arrive at a new project or company, it is day one, and you need to get the right tools as well as the correct environment configuration. During this episode, we try to figure out how companies solve this. Is there a standard solution? What are the options?
"us-east-1 will never go down, and if it did, half of the internet would go down with it." That is what people used to say. Well, us-east-1 went down big time. What does it mean for us as practitioners? What should we consider going forward? In this episode, we talk through the incident and the disaster recovery strategies you can consider to keep your company up.
We have had Git around for more than 15 years, and during that time it has become the de facto standard for sharing code and tracking code changes. While Git is a superior version control system compared to most of what came before, it has been 15 years since the first release. Should we be looking for new approaches to version control systems? Is the time right for the next generation of tools in this area?
Our first episode was about Infrastructure as Code, and we feel it is time to revisit the topic after almost two years. Another reason is the release of the second edition of the Infrastructure as Code book by Kief Morris. Thus, in this episode, we revisit the definition of Infrastructure as Code and try to summarize what has changed over the years. We hope you like it!
Julien gives his impressions of Google Cloud Next 2021, and Andrey recaps HashiConf Global 2021 and gives his take, with a twist, on why we might need HashiCorp Waypoint.
Everyone seems to be talking about service mesh. Mattias, Julien, and Andrey try to separate the hype from the real value. Most importantly, they dig into when the right time is for an organization to embrace a service mesh, and what the prerequisites are.
As a follow-up to the [last episode about hiring an infrastructure automation person](https://devsecops.fm/episodes/31-hiring/), we decided to reverse the view and talk about how you get hired as an infrastructure automation person. This episode is full of career advice, both for people fresh out of university and for people who already have experience in the industry.
Have you ever conducted an interview to hire an infrastructure automation person? What would you ask? How do you check their skills? And what skills are essential? Tune in for our tips on hiring and finding the right person for your team!
Logs, metrics, and traces are the three pillars of observability. Where should you start? What are the common mistakes to avoid? And if you are to pick one - which one should you do?
This time we are talking unikernels! Ian Eyberg from NanoVMs joins us to discuss how far this technology is from prime time. It turns out that you don't have to be a kernel developer to take advantage of unikernels. Today, there are tools available to package, distribute, and run them locally as well as in the public cloud. While talking to Ian, it felt that the state of the technology is very similar to that of Linux containers in the early 2010s, just before Docker made Linux containers available for everyone.
The real cloud lock-in is security! Every service and cloud provider has its own levels of granularity for resources. Cloud engineering is mainly about compute, storage, and networking, and how to make them scale. Scaling security is often left out, as it is hard to measure on so many levels.
We think that is a myth: we can measure how many steps it takes to add, modify, or remove access rights. It all starts with monitoring - knowing what is in your cloud infrastructure is a very good first step. By making access rights easy to see and manage, we make it easier for ourselves to keep resources secure.
Visit https://devsecops.fm to see show notes and https://gitter.im/devsecopstalks/community to join a discussion.
AWS released Bottlerocket OS in March 2020, and version 1.0.0 was released in August 2020. What is it? Should you be using it? What are the benefits? Is it ready for prime time? We answer all of those questions in this episode of DevSecOps Talks. Tune in!
Johan Abildskov (@RandomSort, see episode 6) is back, and we are talking branching strategies! In particular, why you shouldn't be doing git-flow, and what the other options out there are. This conversation takes us down memory lane to a broader discussion about version control systems, mono-repositories, and continuous integration and delivery. We hope you will like it!
This time we are joined by Paul Stack (@stack72, Pulumi developer, former Terraform developer) and podcast friend Jacob Lärfors to talk about:
- What is Pulumi?
- The difference between Pulumi and Terraform (and whether we should compare them at all)
- What is hard about Pulumi?
- What do people ask the most? What are the common points of confusion?
- Cross-language infra libraries? How is that even possible?!
- Is there a possibility of a supply chain attack via a Pulumi library?
Last week (week 6, 2021), seven data breaches were announced. In this episode, we discuss possible scenarios for preventing attackers from getting hold of your data, whether private or company data, and share tips on how to mitigate the consequences of data leaks when you have no control over data management (think of a breach of a 3rd-party service).
How do you run Kubernetes in the cloud? Still using Kops? Or is it time to jump to the managed offerings? We go through the list of things you might be missing out on if you are not yet using a managed solution. Also in this episode: what do you always configure in a k8s cluster? CNI, Ingress, IAM, and even more!
It's been almost a year since we started the podcast, but we never took the time to explain who we are and what problems we solve for our customers and employers. So in this episode, you will find more details about us and, as usual, references to useful tools, talks, and techniques.
AWS had a severe incident at the end of November. Kinesis in us-east-1 went dark for quite some time, and a ripple effect caused degradation of other services like CloudWatch, ECS, and others. As a cloud engineering practitioner, how do you get yourself and your organization ready for such a turn of events?
Andrey wants monitoring to be more magical - or does he want the wrong thing? What are the sane defaults? And why do we have to set up boilerplate monitoring again and again? Mattias shares what he does for monitoring security events. Julien explains why using logs to debug a microservices architecture is costly and inefficient.
How do you decommission resources from your cloud environment to keep it clean?
What do you do when a resource is created without being in the infrastructure code?
Andrey goes through a checklist he uses to delete resources and the utility serverless functions he wrote.
ArgoCD is a project that does GitOps and automatically deletes resources in Kubernetes namespaces if they are not defined.
We talked about the different layers of abstraction for infrastructure as code and where it makes sense to have a Terraform controller in a Kubernetes cluster to manage application dependencies.
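To illustrate the ArgoCD behavior mentioned above, here is a minimal, hypothetical Application manifest with automated sync and pruning enabled; the repo URL, paths, and names are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app            # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/repo.git  # placeholder repo
    targetRevision: main
    path: manifests/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true     # delete cluster resources that are no longer defined in Git
      selfHeal: true  # revert manual changes made directly in the cluster
```

With `prune: true`, removing a manifest from Git is enough to decommission the corresponding resource, which is exactly the cleanup behavior discussed in the episode.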
Initially, we planned this episode as a discussion about HashiCorp Nomad and invited Jacob Lärfors. He recently published a great article about his experience working with Nomad (see the link in the show notes). However, because of a few postponements, and with HashiConf having happened just a week ago, we decided to extend the episode's scope to go over all of the announcements HashiCorp made during the conference. So here it is - a HashiConf special: all you need to know about everything HashiCorp announced during the conference, plus a discussion about Nomad!
This is the first episode in the new format - short and crisp 30-minute episodes, i.e., less water and fewer side discussions, focused on the topic, with a duration under (well, almost under) 30 minutes. We hope you like it!
The topic of this episode is building Docker images: automation, security, and best practices. Some of the information overlaps with episode #3 (https://devsecops.fm/episodes/docker-secure-build/) but greatly extends what was provided before.
In this episode, we discuss options for splitting your deployment stages. We hear people coming up with all possible types of environments - dev, test/QA, integration, stage, prod, etc. How many do you actually need? What is the reason for having all those stages? Maybe you need fewer? Why not deploy directly to production using some fancy technique?
To put it simply: to stage or not to stage?
Visit https://devsecops.fm to see show notes and https://gitter.im/devsecopstalks/community to join a discussion
Let's talk about security in the era of remote work. Most of us have experienced a flaky VPN connection. What are the alternatives? SSH certificates? YubiKeys?
We discussed various topics around security both inside and outside the cluster.
Visit https://devsecops.fm to see show notes and https://gitter.im/devsecopstalks/community to join a discussion
This time, we are joined by Henrik Høegh, who shares his unique perspective on applying the Theory of Constraints to IT transformation, as well as how it applies in the world of Cloud Native. We go back to the origin of DevOps, discussing the various problems companies face when transforming their organizations and adopting cultural changes.
Visit https://devsecops.fm to see show notes and https://gitter.im/devsecopstalks/community to join a discussion
Mattias wants to set up HashiCorp Vault and quizzes Andrey on how to do it.
We cover a lot of ground - from basic Vault concepts to setting it up and hardening it.
Julien and Andrey got together to discuss ways to automate the scaling of your infrastructure in response to changes in load patterns. What are the prerequisites for implementing scaling? What are cool-down, warm-up, horizontal and vertical scaling, scale-up, and scale-in? Which metrics could be useful for making scaling decisions? And last but not least, the very unexpected spin that Julien gives to the conversation.
Visit https://devsecops.fm to see show notes and https://gitter.im/devsecopstalks/community to join a discussion
This time we are discussing the white paper by Summit Route - AWS Security Maturity Roadmap 2020. Tune in to learn more about the white paper and the recommendations we pile on top of it. To view show notes, visit https://devsecops.fm. Chat with the hosts and suggest topics for upcoming episodes in our Gitter channel: https://gitter.im/devsecopstalks/community
Our guest speaker is Anton Babenko - a DevSecOps Talks podcast fan, AWS Community Hero, Terraform fanatic, HashiCorp Ambassador, and prolific open-source contributor. After listening to episode #9 (Terraform in CI) and episode #1 (Infrastructure as Code), Anton decided that enough is enough and volunteered to give his point of view on Terragrunt, since he thought we were missing a few important points. In this episode, we discuss the use cases of Terragrunt, a wrapper around Terraform for working with multiple environments and modules.
How do you start to implement a CI pipeline when dealing with infrastructure as code implemented via Terraform? What are the security concerns when the credentials to the whole kingdom are used in an automated process? In this episode, we discuss the various security and feasibility aspects of using Terraform in a CI pipeline.
We start the episode by catching up with what we’ve been working on. Feel free to skip to 11:52 if you want to go directly to the topic. Having an automated process to deploy and manage infrastructure has advantages such as fast feedback and collaboration. The code for the infrastructure is treated like an application that is versioned, tested, and deployed.
Show notes are available at https://devsecops.fm/episodes/terraform-in-ci/
Andrey tells us the story of how DevOps came into existence and took over the market. We discuss the marketing around it and its relationship with DevSecOps, and we try to shed light on what is merely a marketing strategy versus actually implementing DevOps in an organization. We also compare DevOps to SRE (Site Reliability Engineering).
In this episode, Mattias, Julien, and Andrey share tips and tricks on how to stay on top of what is going on in the industry, and the resources they use for continuous learning. Make sure to visit devsecops.fm to check out the show notes, which contain references to the resources mentioned during the discussion and more.
This time Johan Abildskov, a Senior Consultant with Praqma/Eficode, joins us to talk about SemVer (Semantic Versioning), and we finally get to hear what Julien has to say about it. We get to explore different options regarding versioning and how it helps humans communicate. At the end of the podcast, everyone gets to share their approach and recommendations for versioning things.
We had a couple of possible topics for this episode, but before getting started with them we decided to discuss which technological problems we had been solving during the last two weeks. Well, it turns out there was quite a lot to discuss. Tune in for tips on SSH session logging on the server side, preventing downloads from AWS S3 even if you have read access, credentials in a Git repository 🤦, why you should (or should not) do K8s, and more.
SummaryIn this free-form early episode of DevSecOps Talks, a casual "what have you been up to" catch-up turns into a sharp exchange on the gap between security in theory and security in practice. One host discovers plaintext service account keys, database passwords, and a production SSH tunnel all committed straight into a Git repository — and the team walks through how to unwind that without breaking delivery. Julien Bisconti argues that security tooling is fundamentally failing developers because it is too hard to use under real delivery pressure. The episode also delivers strong opinions on why teams should not default to Kubernetes, the hidden complexity of S3 encryption with KMS keys, and why Google's BeyondCorp model makes VPNs look like a relic.
Key Topics SSH session logging, bastion hosts, and compliance visibilityThe episode opens with a deep dive into SSH session logging for bastion hosts in AWS. One of the hosts explains how AWS Systems Manager Session Manager can be used to access instances without VPNs or direct inbound connectivity — the SSM agent on each instance calls home to AWS, and AWS proxies the connection back. That model is attractive for hybrid and on-prem environments because it removes networking complexity around NAT, port forwarding, and VPN setup. It also provides session logging, IAM-based access control, and command output recording.
But the drawbacks surface quickly. Session Manager logs users in as a generic SSM agent user with /usr/bin as the working directory. Documentation is sparse, and Bash is launched in shell mode to support color interpretation, which pollutes session logs with escape characters. A bigger concern is that access control rests entirely on IAM credentials — in an environment with fully dynamic, short-lived credentials that is manageable, but it becomes risky anywhere static keys exist.
The host describes trying to map Session Manager logins to individual users, only to find that it requires static IAM identities with specially named tags containing usernames — a non-starter for environments where everything is dynamic.
That leads into alternative approaches. An AWS blog post describes forcing SSH connections through the Unix script utility to record sessions, then uploading logs to S3. But even that is fragile: logs are owned by the user, so technically the user can delete or overwrite them. A more robust path is tlog, a terminal I/O logger that writes session data in JSON format to the systemd journal, where it cannot be easily tampered with. From there, the CloudWatch agent can export journal data to S3 for long-term storage.
The broader point is that command logging sounds simple in compliance conversations, but in practice it becomes a deep rabbit hole full of bypasses, noise, and design tradeoffs.
Monitoring user activity without drowning in logsThe hosts compare notes on monitoring shell activity. One host mentions using auditd to track user actions on bastion hosts in a previous environment, but the log volume was overwhelming — even Elasticsearch struggled to keep up with the ingestion rate.
That sparks a discussion around anomaly detection and heuristics. The real challenge is not collecting logs but determining what is unusual and worth investigating. Failed SSH login alerts are mentioned as a useful signal, though another host pushes back: "Should you have SSH with the password at all? You should have a key." The point stands — without careful tuning, even sensible alerts generate noise faster than teams can act on them.
The exchange captures a recurring DevSecOps reality: collecting telemetry is the easy part; turning it into something actionable is where most teams get stuck.
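The tuning problem above can be sketched as a toy heuristic: count failed-login events per source and alert only once a threshold is crossed, rather than on every event. This is a minimal illustration, not the hosts' setup; the log format and threshold are assumptions.

```python
import re
from collections import Counter

# Illustrative sshd log format; real auth.log lines vary by distro.
FAILED_RE = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+)")

def noisy_sources(log_lines, threshold=3):
    """Return source IPs whose failed-login count reaches the threshold.

    The point from the discussion: raw failures are mostly noise, so
    alert only on repeat offenders instead of every single event.
    """
    counts = Counter()
    for line in log_lines:
        match = FAILED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}

sample = [
    "sshd[811]: Failed password for root from 203.0.113.9 port 4711 ssh2",
    "sshd[812]: Failed password for invalid user admin from 203.0.113.9 port 4712 ssh2",
    "sshd[813]: Failed password for root from 203.0.113.9 port 4713 ssh2",
    "sshd[814]: Failed password for alice from 198.51.100.7 port 4714 ssh2",
]
print(noisy_sources(sample))  # only the repeat offender crosses the threshold
```

Even this toy version shows why tuning matters: lower the threshold and every stray typo from a colleague becomes an alert.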
S3 bucket security, public access controls, and KMS encryption surprisesThe conversation shifts to AWS S3 security. Public buckets remain a common source of breaches, but AWS now offers S3 Block Public Access — account- and bucket-level settings that prevent public access regardless of individual object ACLs. In Terraform, this is a dedicated resource block.
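In Terraform that resource is aws_s3_bucket_public_access_block; the same four settings can also be applied through boto3's put_public_access_block. A minimal sketch (the bucket name and client wiring here are illustrative, and the call itself needs AWS credentials):

```python
# The four flags behind S3 Block Public Access, in the shape that boto3's
# put_public_access_block expects. The helper takes the client as an
# argument so it can be exercised without touching AWS.
BLOCK_ALL = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

def apply_block_public_access(s3_client, bucket):
    """Apply the strictest Block Public Access settings to one bucket."""
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=BLOCK_ALL,
    )
```

Applied at the account level instead of per bucket, the same flags prevent anyone from accidentally creating a public bucket in the first place.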
The more nuanced insight is about encryption. The host explains the difference between S3 server-side encryption with the default AWS-managed key (SSE-S3) and encryption with a customer-managed KMS key (SSE-KMS). With SSE-S3, S3 decrypts objects transparently for any client with read access to the bucket. With a customer-managed KMS key, S3 cannot decrypt the object unless the requester also has kms:Decrypt permission on that specific key.
This became a real problem in a cross-account, cross-region workflow involving Go Lambda binaries. Go Lambdas require the deployment artifact to reside in the same region as the function. The team was copying artifacts between accounts and regions, had granted S3 read permissions, but downloads kept failing. CloudTrail logs revealed the real culprit: "I cannot decrypt." The consumers lacked KMS key access. In that case, the fix was switching to SSE-S3 since the artifacts did not require the stronger protection of a customer-managed key.
The host is careful to note that AWS documentation on cross-account S3 access does not prominently flag this encryption interaction — a gap that can cost teams hours of debugging.
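The SSE-S3 versus SSE-KMS distinction shows up concretely in the upload parameters. As a hedged sketch, here is how a deploy script might choose between them with boto3 (the parameter names match boto3's ExtraArgs; the selection logic is our own illustration):

```python
def upload_extra_args(kms_key_id=None):
    """Build boto3 ExtraArgs for an S3 upload.

    Without a key ID, objects get SSE-S3 (AES256): S3 decrypts
    transparently for any principal with read access to the bucket.
    With a customer-managed key, every reader additionally needs
    kms:Decrypt on that key - the cross-account pitfall described above.
    """
    if kms_key_id is None:
        return {"ServerSideEncryption": "AES256"}
    return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": kms_key_id}

# Typical use (client setup omitted):
# s3.upload_file("lambda.zip", bucket, key, ExtraArgs=upload_extra_args())
```

The debugging lesson from the episode maps directly onto this: if consumers in other accounts start failing with access errors, check which branch the artifact was uploaded with before blaming the bucket policy.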
Plaintext secrets in Git: a frighteningly common anti-patternOne of the most memorable segments comes when a host describes reviewing an application stack and finding service account keys committed in cleartext in the repository root. The repository also contained a large configuration file with usernames, passwords, API credentials for mail services, login providers, and multiple environments (dev, prod) — all in plain text.
But the worst part: for local development, the team SSH-tunneled into the production SQL server, mapping remote port 3306 to local port 3307. An SSH key providing direct access to the production database was sitting right there in the repo.
The reaction is immediate — this is exactly the kind of setup that accumulates when convenience wins over security for too long. But rather than proposing a risky teardown, the host outlines an incremental migration plan:
Andrey pushes the thinking further: injecting secrets at build time is still risky because anyone who gets the Docker image gets the secrets. The better model is runtime secret retrieval — workloads authenticate dynamically at startup and fetch only the secrets they need. HashiCorp Vault is the concrete example: in a Kubernetes environment, a pod uses its Kubernetes service account to authenticate to Vault, obtains a short-lived token, and retrieves static or dynamic secrets. If someone steals the image and runs it outside the cluster, they cannot authenticate and get nothing.
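The flow Andrey describes can be sketched roughly as follows. The endpoint paths follow Vault's Kubernetes auth API; the HTTP calls are injected as functions so the sequence is visible without a running Vault, and the role and secret path are illustrative assumptions.

```python
import json

# Inside a pod, the service account JWT lives at this well-known path.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

def fetch_secret(http_post, http_get, vault_addr, role, secret_path, sa_jwt):
    """Exchange a Kubernetes service account JWT for a short-lived Vault
    token, then read a secret with it.

    Outside the cluster there is no valid JWT to present, so a stolen
    container image authenticates to nothing and gets nothing.
    """
    # Step 1: log in via Vault's Kubernetes auth method.
    login = http_post(
        f"{vault_addr}/v1/auth/kubernetes/login",
        json.dumps({"jwt": sa_jwt, "role": role}),
    )
    client_token = login["auth"]["client_token"]
    # Step 2: read the secret using the short-lived token.
    resp = http_get(
        f"{vault_addr}/v1/{secret_path}",
        headers={"X-Vault-Token": client_token},
    )
    return resp["data"]
```

The key property is that nothing sensitive is baked into the image: the only credential is the pod's own identity, which exists only inside the cluster.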
Vault versus cloud-native secret managementThe secrets discussion expands into a broader comparison. Andrey, who has been doing public speaking about Vault and fielding consulting requests around it, frames the choice pragmatically.
For hybrid-cloud or multi-cloud environments, Vault is likely the best option because it provides a unified interface for secret management, dynamic credentials, and synchronization across providers.
For single-cloud commitments — say, all-in on AWS — native services can cover many of the same use cases: AWS STS for temporary credentials, RDS IAM authentication for database logins, AWS Secrets Manager (which may even be running Vault underneath, as one host speculates), and AWS Certificate Manager for TLS certificates. If the organization is not going multi-cloud, the overhead of running Vault may not be justified.
The recommendation is not ideological. It depends on architecture, portability needs, and operational complexity.
When Vault works technically but fails organizationallyJulien Bisconti adds an important caveat from experience. He describes deploying Vault in a multi-availability-zone setup with full redundancy — technically solid. But the project "went to a halt completely" when it hit governance questions: who should access what, under which rules, and who owns the policies. It became a political war, and the entire deployment had to be rolled back.
The lesson: security tools are good at automating technical workflows, but if the underlying organizational process is broken, you automate a broken process. Security, monitoring, deployment, and access control are deeply entangled, and tooling alone cannot untangle them.
Security tooling fails because developers cannot use itJulien brings the strongest developer-empathy argument of the episode. Developers do not ignore security because they are careless — they bypass it because secure workflows are too awkward under delivery pressure. A manager does not understand why the developer is blocked, pressure mounts, and the result is // just hardcode that here, I don't care, it works.
Even simple tasks illustrate the problem. Julien asks: can you generate an SSL certificate with OpenSSL from memory right now? Most engineers cannot — it is something they do every few months and have to look up each time. He references the famous XKCD comic about entering the correct tar command with ten seconds left.
This evolves into a philosophical observation. One host identifies as a "tool builder" rather than a "product builder" — someone who enjoys building mechanisms but does not always think deeply about end-user experience. That mindset, common among infrastructure and security engineers, may explain why so many DevSecOps tools are powerful but painful to adopt. The gap is not in capability but in usability.
VPNs, zero trust, and the BeyondCorp modelJulien argues that VPNs are an increasingly painful abstraction. Even Cisco — the company that essentially built enterprise VPN technology — had to raise capacity limits during the COVID-19 pandemic because their own infrastructure could not handle the load. Split tunneling introduces its own vulnerabilities, and full-tunnel VPN creates a bottleneck for everything.
He points to Google's BeyondCorp model, published in 2014, which established the principle that network location should not determine access. The analogy: do you build a castle with walls where anyone inside has full access, or do you put a guard in every room checking credentials? The latter — zero trust — is harder to implement, but it limits blast radius and removes the binary "in or out" problem.
Andrey connects this to the emerging service mesh ecosystem. Technologies like Consul Connect implement zero-trust networking at the application level with mutual TLS and identity-based authorization. The hosts note that the service mesh space is still fragmented — just as there was a "war of orchestrators" before Kubernetes emerged as the default, there is now a "war of service meshes" still playing out.
Kubernetes hype versus simpler orchestrationA significant portion of the episode is a productive debate about orchestration choices. Andrey argues strongly against defaulting to Kubernetes. He describes a hybrid-cloud project in Africa running the full HashiCorp stack: Consul for service discovery, configuration, and networking; Nomad for workload scheduling. A team member with relatively little experience got the stack up and running in days.
Andrey outlines the operational weight of Kubernetes: cluster version upgrades where in-place upgrades may skip new security defaults (making full cluster recreation the recommended path), autoscaler configuration layers (pod autoscaler, cluster autoscaler, resource limits), ingress management, YAML sprawl from Helm charts, and a platform that evolves so rapidly it demands continuous learning. He especially warns against running databases in Kubernetes — the statefulness adds pain.
For single-cloud AWS, he argues that ECS is often the better choice: the control plane is free (or nearly so), the per-node overhead is minimal compared to Kubernetes, and AWS handles the operational burden.
Mattias pushes back with a practical counterpoint. Kubernetes provides a consistent platform for diverse workloads — containers, databases, monitoring, custom jobs — all managed through the same interface. Helm charts for common components like nginx-ingress, cert-manager, and external-dns make the ecosystem approachable. The value is in standardization and adaptability.
The hosts also note GKE's pricing evolution: Google introduced a per-cluster management fee (roughly $0.10/hour per control plane) to discourage sprawl and encourage consolidation — a signal that even managed Kubernetes has real costs.
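A quick back-of-envelope on that fee (using the $0.10/hour figure quoted in the episode; current GCP pricing may differ):

```python
FEE_PER_CLUSTER_HOUR = 0.10  # USD; the figure quoted in the episode
HOURS_PER_MONTH = 730        # common billing approximation

def monthly_control_plane_fees(clusters):
    """Monthly management fee for a fleet of GKE clusters."""
    return clusters * FEE_PER_CLUSTER_HOUR * HOURS_PER_MONTH

print(monthly_control_plane_fees(1))   # roughly $73/month for one cluster
print(monthly_control_plane_fees(20))  # why per-team cluster sprawl adds up
```

One cluster is cheap; a cluster per team quickly is not, which is exactly the consolidation incentive the hosts describe.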
The disagreement is honest but constructive. The shared conclusion: start with what the business needs, then pick the simplest tool that gets you there. "The best battle is the battle you don't fight." And as Julien notes, teams that avoid the Kubernetes default often demonstrate deeper architectural thinking — choosing based on the hype is an insurance policy, but it is not the same as choosing based on needs.
Slack bots, workflow automation, and the security surfaceNear the end, Mattias raises the topic of Slack bots for operational tasks — deployment reporting, status checks, and interactive queries. Andrey reframes the conversation around security: if Slack becomes part of a privileged control plane — for example, a bot that handles privilege escalation by requesting approvals through Slack messages — then request spoofing, account compromise, and weak isolation become serious concerns.
The idea of a privilege-escalation bot is interesting (request access via Slack, get approval from designated approvers, receive time-limited credentials with full audit logging), but the attack surface is real. Slack provides a powerful collaboration platform for building workflows without custom UIs, but once it handles access decisions, security design matters as much as convenience.
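One concrete piece of that security design is request authentication. Slack signs each incoming request with a shared signing secret using its documented v0 HMAC-SHA256 scheme, and a bot that makes access decisions should verify that signature and reject stale timestamps before running any approval logic. A minimal sketch:

```python
import hashlib
import hmac

def verify_slack_request(signing_secret, timestamp, body, signature,
                         now, max_age=300):
    """Verify a Slack request signature (Slack's documented v0 scheme).

    A bot that grants privileged access must authenticate every request
    and reject replays before any approval logic runs.
    """
    # Reject stale requests to prevent replay attacks.
    if abs(now - int(timestamp)) > max_age:
        return False
    basestring = f"v0:{timestamp}:{body}"
    expected = "v0=" + hmac.new(
        signing_secret.encode(), basestring.encode(), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(expected, signature)
```

Signature verification is necessary but not sufficient: account compromise and approver impersonation, as raised in the episode, still require separate controls such as time-limited credentials and audit logging.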
Highlights "All the service account keys were in clear text. In the repo."A host describes opening up a client's application stack and finding cloud service keys, usernames, passwords, API credentials, and an SSH key that tunnels directly into the production SQL server — all committed to Git in plain text. It is the kind of discovery that instantly explains years of hidden risk.
How do you unwind that without breaking delivery? The hosts walk through an incremental migration plan in this episode of DevSecOps Talks.
"Security tooling is actually not that usable."Julien Bisconti delivers a sharp truth: developers do not bypass security because they are careless. They do it because secure workflows are too slow, too confusing, and too far removed from how they actually work. When the pressure comes from a manager who does not understand the blocker, the shortcut wins every time.
A candid take on why hardcoded secrets keep showing up in real codebases. Listen to the full discussion on DevSecOps Talks.
"I really applaud people who don't choose Kubernetes — that means they actually know what they're doing."One of the spicier platform takes of the episode. The argument is not that Kubernetes is bad, but that defaulting to it without analyzing your actual needs is a sign of hype-driven architecture. If a simpler stack solves the problem, picking the biggest platform just creates more operational burden.
Hear the full Kubernetes-versus-Nomad-versus-ECS debate on DevSecOps Talks.
"If your process is not good, you're going to automate a bad process."Julien recounts deploying Vault with full HA and multi-AZ redundancy, only to have the project grind to a halt over organizational politics — who should access what, and who decides. The tooling worked perfectly. The organization did not.
A reminder that DevSecOps maturity is not just about picking better tools. Catch the full story on DevSecOps Talks.
"Once somebody is inside, they have the keys to the kingdom."The VPN and zero-trust discussion delivers one of the strongest security arguments of the episode. Julien explains why broad network access — the castle-and-moat model — is the wrong abstraction for modern systems, and why identity-based, fine-grained access control is worth the implementation cost.
If the old perimeter model still shapes how your team thinks about infrastructure security, this part of the episode will resonate. Listen on DevSecOps Talks.
ResourcesAWS Systems Manager Session Manager — AWS documentation for Session Manager, which provides secure instance access without SSH keys, open ports, or bastion hosts, with built-in session logging.
tlog — Terminal I/O Logger — Open-source terminal session recording tool that logs to systemd journal in JSON format, making sessions searchable and tamper-resistant. Discussed in the episode as a more robust alternative to the Unix script command.
AWS S3 Block Public Access — AWS documentation on account- and bucket-level settings to prevent public access to S3 resources, regardless of individual object ACLs or bucket policies.
Troubleshooting Cross-Account Access to KMS-Encrypted S3 Buckets — AWS guidance on the exact issue discussed in the episode: S3 downloads failing because the requester lacks KMS key permissions, even when bucket-level access is granted.
BeyondCorp: A New Approach to Enterprise Security — Google's foundational 2014 paper on zero-trust networking, which established the principle that network location should not determine access. Referenced by Julien in the VPN discussion.
HashiCorp Nomad — A lightweight workload orchestrator with native Consul and Vault integrations. Discussed as a simpler alternative to Kubernetes, especially for hybrid-cloud and small-team environments.
Consul Service Mesh (Consul Connect) — HashiCorp's service mesh solution providing zero-trust networking through mutual TLS and identity-based authorization. Mentioned as the networking layer in the Africa hybrid-cloud project.
XKCD 1168: tar — The comic Julien references about the impossibility of remembering command-line flags — a humorous illustration of why security tooling needs better usability.
We had a couple of possible topics for this episode but before getting started with them we decided to discuss what technological problems we were solving during the last two weeks. Well, turns out there was quite a lot to discuss. Tune in for tips on ssh session logging on the ssh server, preventing downloads from AWS S3 even if you got read access, credentials in Git repository 🤦, why you should (or should not) do K8S and more.
SummaryIn this free-form early episode of DevSecOps Talks, a casual "what have you been up to" catch-up turns into a sharp exchange on the gap between security in theory and security in practice. One host discovers plaintext service account keys, database passwords, and a production SSH tunnel all committed straight into a Git repository — and the team walks through how to unwind that without breaking delivery. Julien Bisconti argues that security tooling is fundamentally failing developers because it is too hard to use under real delivery pressure. The episode also delivers strong opinions on why teams should not default to Kubernetes, the hidden complexity of S3 encryption with KMS keys, and why Google's BeyondCorp model makes VPNs look like a relic.
Key Topics SSH session logging, bastion hosts, and compliance visibilityThe episode opens with a deep dive into SSH session logging for bastion hosts in AWS. One of the hosts explains how AWS Systems Manager Session Manager can be used to access instances without VPNs or direct inbound connectivity — the SSM agent on each instance calls home to AWS, and AWS proxies the connection back. That model is attractive for hybrid and on-prem environments because it removes networking complexity around NAT, port forwarding, and VPN setup. It also provides session logging, IAM-based access control, and command output recording.
But the drawbacks surface quickly. Session Manager logs users in as a generic SSM agent user with /usr/bin as the working directory. Documentation is sparse, and Bash is launched in shell mode to support color interpretation, which pollutes session logs with escape characters. A bigger concern is that access control rests entirely on IAM credentials — in an environment with fully dynamic, short-lived credentials that is manageable, but it becomes risky anywhere static keys exist.
The host describes trying to map Session Manager logins to individual users, only to find that it requires static IAM identities with specially named tags containing usernames — a non-starter for environments where everything is dynamic.
That leads into alternative approaches. An AWS blog post describes forcing SSH connections through the Unix script utility to record sessions, then uploading logs to S3. But even that is fragile: logs are owned by the user, so technically the user can delete or overwrite them. A more robust path is tlog, a terminal I/O logger that writes session data in JSON format to the systemd journal, where it cannot be easily tampered with. From there, the CloudWatch agent can export journal data to S3 for long-term storage.
The broader point is that command logging sounds simple in compliance conversations, but in practice it becomes a deep rabbit hole full of bypasses, noise, and design tradeoffs.
Monitoring user activity without drowning in logsThe hosts compare notes on monitoring shell activity. One host mentions using auditd to track user actions on bastion hosts in a previous environment, but the log volume was overwhelming — even Elasticsearch struggled to keep up with the ingestion rate.
That sparks a discussion around anomaly detection and heuristics. The real challenge is not collecting logs but determining what is unusual and worth investigating. Failed SSH login alerts are mentioned as a useful signal, though another host pushes back: "Should you have SSH with the password at all? You should have a key." The point stands — without careful tuning, even sensible alerts generate noise faster than teams can act on them.
The exchange captures a recurring DevSecOps reality: collecting telemetry is the easy part; turning it into something actionable is where most teams get stuck.
S3 bucket security, public access controls, and KMS encryption surprisesThe conversation shifts to AWS S3 security. Public buckets remain a common source of breaches, but AWS now offers S3 Block Public Access — account- and bucket-level settings that prevent public access regardless of individual object ACLs. In Terraform, this is a dedicated resource block.
The more nuanced insight is about encryption. The host explains the difference between S3 server-side encryption with the default AWS-managed key (SSE-S3) and encryption with a customer-managed KMS key (SSE-KMS). With SSE-S3, S3 decrypts objects transparently for any client with read access to the bucket. With a customer-managed KMS key, S3 cannot decrypt the object unless the requester also has kms:Decrypt permission on that specific key.
This became a real problem in a cross-account, cross-region workflow involving Go Lambda binaries. Go Lambdas require the deployment artifact to reside in the same region as the function. The team was copying artifacts between accounts and regions, had granted S3 read permissions, but downloads kept failing. CloudTrail logs revealed the real culprit: "I cannot decrypt." The consumers lacked KMS key access. In that case, the fix was switching to SSE-S3 since the artifacts did not require the stronger protection of a customer-managed key.
The host is careful to note that AWS documentation on cross-account S3 access does not prominently flag this encryption interaction — a gap that can cost teams hours of debugging.
Plaintext secrets in Git: a frighteningly common anti-patternOne of the most memorable segments comes when a host describes reviewing an application stack and finding service account keys committed in cleartext in the repository root. The repository also contained a large configuration file with usernames, passwords, API credentials for mail services, login providers, and multiple environments (dev, prod) — all in plain text.
But the worst part: for local development, the team SSH-tunneled into the production SQL server, mapping remote port 3306 to local port 3307. An SSH key providing direct access to the production database was sitting right there in the repo.
The reaction is immediate — this is exactly the kind of setup that accumulates when convenience wins over security for too long. But rather than proposing a risky teardown, the host outlines an incremental migration plan.
Andrey pushes the thinking further: injecting secrets at build time is still risky because anyone who gets the Docker image gets the secrets. The better model is runtime secret retrieval — workloads authenticate dynamically at startup and fetch only the secrets they need. HashiCorp Vault is the concrete example: in a Kubernetes environment, a pod uses its Kubernetes service account to authenticate to Vault, obtains a short-lived token, and retrieves static or dynamic secrets. If someone steals the image and runs it outside the cluster, they cannot authenticate and get nothing.
Vault versus cloud-native secret managementThe secrets discussion expands into a broader comparison. Andrey, who has been doing public speaking about Vault and fielding consulting requests around it, frames the choice pragmatically.
For hybrid-cloud or multi-cloud environments, Vault is likely the best option because it provides a unified interface for secret management, dynamic credentials, and synchronization across providers.
For single-cloud commitments — say, all-in on AWS — native services can cover many of the same use cases: AWS STS for temporary credentials, RDS IAM authentication for database logins, AWS Secrets Manager (which may even be running Vault underneath, as one host speculates), and AWS Certificate Manager for TLS certificates. If the organization is not going multi-cloud, the overhead of running Vault may not be justified.
The recommendation is not ideological. It depends on architecture, portability needs, and operational complexity.
When Vault works technically but fails organizationallyJulien Bisconti adds an important caveat from experience. He describes deploying Vault in a multi-availability-zone setup with full redundancy — technically solid. But the project "went to a halt completely" when it hit governance questions: who should access what, under which rules, and who owns the policies. It became a political war, and the entire deployment had to be rolled back.
The lesson: security tools are good at automating technical workflows, but if the underlying organizational process is broken, you automate a broken process. Security, monitoring, deployment, and access control are deeply entangled, and tooling alone cannot untangle them.
Security tooling fails because developers cannot use itJulien brings the strongest developer-empathy argument of the episode. Developers do not ignore security because they are careless — they bypass it because secure workflows are too awkward under delivery pressure. A manager does not understand why the developer is blocked, pressure mounts, and the result is // just hardcode that here, I don't care, it works.
Even simple tasks illustrate the problem. Julien asks: can you generate an SSL certificate with OpenSSL from memory right now? Most engineers cannot — it is something they do every few months and have to look up each time. He references the famous XKCD comic about entering the correct tar command with ten seconds left.
This evolves into a philosophical observation. One host identifies as a "tool builder" rather than a "product builder" — someone who enjoys building mechanisms but does not always think deeply about end-user experience. That mindset, common among infrastructure and security engineers, may explain why so many DevSecOps tools are powerful but painful to adopt. The gap is not in capability but in usability.
VPNs, zero trust, and the BeyondCorp modelJulien argues that VPNs are an increasingly painful abstraction. Even Cisco — the company that essentially built enterprise VPN technology — had to raise capacity limits during the COVID-19 pandemic because their own infrastructure could not handle the load. Split tunneling introduces its own vulnerabilities, and full-tunnel VPN creates a bottleneck for everything.
He points to Google's BeyondCorp model, published in 2014, which established the principle that network location should not determine access. The analogy: do you build a castle with walls where anyone inside has full access, or do you put a guard in every room checking credentials? The latter — zero trust — is harder to implement, but it limits blast radius and removes the binary "in or out" problem.
Andrey connects this to the emerging service mesh ecosystem. Technologies like Consul Connect implement zero-trust networking at the application level with mutual TLS and identity-based authorization. The hosts note that the service mesh space is still fragmented — just as there was a "war of orchestrators" before Kubernetes emerged as the default, there is now a "war of service meshes" still playing out.
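At the transport layer, the "guard in every room" idea boils down to mutual TLS: the server verifies the client's certificate instead of trusting anything that reaches it on the network. A minimal sketch with Python's standard `ssl` module (the CA file path is an assumption; a service mesh automates exactly this, plus certificate issuance and rotation):

```python
# Hedged sketch: a server-side TLS context that REQUIRES a client
# certificate (mutual TLS) rather than trusting network location.
# The CA bundle path is a placeholder assumption.
import ssl

def mtls_server_context(ca_file=None):
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED   # no client cert, no connection
    if ca_file:
        # Trust only identities signed by our own CA.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```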
Kubernetes hype versus simpler orchestrationA significant portion of the episode is a productive debate about orchestration choices. Andrey argues strongly against defaulting to Kubernetes. He describes a hybrid-cloud project in Africa running the full HashiCorp stack: Consul for service discovery, configuration, and networking; Nomad for workload scheduling. A team member with relatively little experience got the stack up and running in days.
Andrey outlines the operational weight of Kubernetes: cluster version upgrades where in-place upgrades may skip new security defaults (making full cluster recreation the recommended path), autoscaler configuration layers (pod autoscaler, cluster autoscaler, resource limits), ingress management, YAML sprawl from Helm charts, and a platform that evolves so rapidly it demands continuous learning. He especially warns against running databases in Kubernetes — the statefulness adds pain.
For single-cloud AWS, he argues that ECS is often the better choice: the control plane is free (or nearly so), the per-node overhead is minimal compared to Kubernetes, and AWS handles the operational burden.
Mattias pushes back with a practical counterpoint. Kubernetes provides a consistent platform for diverse workloads — containers, databases, monitoring, custom jobs — all managed through the same interface. Helm charts for common components like nginx-ingress, cert-manager, and external-dns make the ecosystem approachable. The value is in standardization and adaptability.
The hosts also note GKE's pricing evolution: Google introduced a per-cluster management fee (roughly $0.10/hour per control plane) to discourage sprawl and encourage consolidation — a signal that even managed Kubernetes has real costs.
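That per-cluster fee compounds quickly with sprawl, which is presumably the point. A quick back-of-the-envelope check — the hourly rate is the approximate figure from the episode (check current GKE pricing), and the cluster count is hypothetical:

```python
# Rough arithmetic on the management fee mentioned above.
# Rate is the episode's approximate figure; cluster count is hypothetical.
HOURLY_FEE = 0.10          # USD per control plane per hour (approximate)
clusters = 12              # e.g. one small cluster per team
monthly_cost = HOURLY_FEE * 24 * 30 * clusters
print(f"${monthly_cost:,.2f}/month")   # → $864.00/month
```

A dozen per-team clusters costs real money before a single node runs, which nudges organizations toward consolidation — exactly the behavior the fee was designed to encourage.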
The disagreement is honest but constructive. The shared conclusion: start with what the business needs, then pick the simplest tool that gets you there. "The best battle is the battle you don't fight." And as Julien notes, teams that avoid the Kubernetes default often demonstrate deeper architectural thinking: picking the hyped option can feel like an insurance policy, but it is not the same as choosing based on needs.
Slack bots, workflow automation, and the security surfaceNear the end, Mattias raises the topic of Slack bots for operational tasks — deployment reporting, status checks, and interactive queries. Andrey reframes the conversation around security: if Slack becomes part of a privileged control plane — for example, a bot that handles privilege escalation by requesting approvals through Slack messages — then request spoofing, account compromise, and weak isolation become serious concerns.
The idea of a privilege-escalation bot is interesting (request access via Slack, get approval from designated approvers, receive time-limited credentials with full audit logging), but the attack surface is real. Slack provides a powerful collaboration platform for building workflows without custom UIs, but once it handles access decisions, security design matters as much as convenience.
Highlights "All the service account keys were in clear text. In the repo."A host describes opening up a client's application stack and finding cloud service keys, usernames, passwords, API credentials, and an SSH key that tunnels directly into the production SQL server — all committed to Git in plain text. It is the kind of discovery that instantly explains years of hidden risk.
How do you unwind that without breaking delivery? The hosts walk through an incremental migration plan in this episode of DevSecOps Talks.
"Security tooling is actually not that usable."Julien Bisconti delivers a sharp truth: developers do not bypass security because they are careless. They do it because secure workflows are too slow, too confusing, and too far removed from how they actually work. When the pressure comes from a manager who does not understand the blocker, the shortcut wins every time.
A candid take on why hardcoded secrets keep showing up in real codebases. Listen to the full discussion on DevSecOps Talks.
"I really applaud people who don't choose Kubernetes — that means they actually know what they're doing."One of the spicier platform takes of the episode. The argument is not that Kubernetes is bad, but that defaulting to it without analyzing your actual needs is a sign of hype-driven architecture. If a simpler stack solves the problem, picking the biggest platform just creates more operational burden.
Hear the full Kubernetes-versus-Nomad-versus-ECS debate on DevSecOps Talks.
"If your process is not good, you're going to automate a bad process."Julien recounts deploying Vault with full HA and multi-AZ redundancy, only to have the project grind to a halt over organizational politics — who should access what, and who decides. The tooling worked perfectly. The organization did not.
A reminder that DevSecOps maturity is not just about picking better tools. Catch the full story on DevSecOps Talks.
"Once somebody is inside, they have the keys to the kingdom."The VPN and zero-trust discussion delivers one of the strongest security arguments of the episode. Julien explains why broad network access — the castle-and-moat model — is the wrong abstraction for modern systems, and why identity-based, fine-grained access control is worth the implementation cost.
If the old perimeter model still shapes how your team thinks about infrastructure security, this part of the episode will resonate. Listen on DevSecOps Talks.
ResourcesAWS Systems Manager Session Manager — AWS documentation for Session Manager, which provides secure instance access without SSH keys, open ports, or bastion hosts, with built-in session logging.
tlog — Terminal I/O Logger — Open-source terminal session recording tool that logs to systemd journal in JSON format, making sessions searchable and tamper-resistant. Discussed in the episode as a more robust alternative to the Unix script command.
AWS S3 Block Public Access — AWS documentation on account- and bucket-level settings to prevent public access to S3 resources, regardless of individual object ACLs or bucket policies.
Troubleshooting Cross-Account Access to KMS-Encrypted S3 Buckets — AWS guidance on the exact issue discussed in the episode: S3 downloads failing because the requester lacks KMS key permissions, even when bucket-level access is granted.
BeyondCorp: A New Approach to Enterprise Security — Google's foundational 2014 paper on zero-trust networking, which established the principle that network location should not determine access. Referenced by Julien in the VPN discussion.
HashiCorp Nomad — A lightweight workload orchestrator with native Consul and Vault integrations. Discussed as a simpler alternative to Kubernetes, especially for hybrid-cloud and small-team environments.
Consul Service Mesh (Consul Connect) — HashiCorp's service mesh solution providing zero-trust networking through mutual TLS and identity-based authorization. Mentioned as the networking layer in the Africa hybrid-cloud project.
XKCD 1168: tar — The comic Julien references about the impossibility of remembering command-line flags — a humorous illustration of why security tooling needs better usability.
In this episode, Mattias tries to convince his co-hosts that running Docker in Kubernetes is more secure than VMs. Did he succeed? Listen and find out.
SummaryMattias makes a bold claim: Docker containers are more secure than virtual machines. Andrey and Julien push back hard — and by the end, the three hosts explicitly agree to disagree. Along the way, they dig into why container breakouts are harder than people assume, how Lambda micro VMs can be exploited through warm TMP folders, why "containers do not contain" without extra kernel controls, and whether good monitoring matters more for security than any isolation technology. Recorded during COVID-19 lockdowns in 2020, the debate captures a moment when the container-vs-VM argument was far from settled.
Key Topics Docker vs. VM security: technology vs. ways of workingMattias opens the main debate by arguing that Docker containers are more secure than VMs in practice. His reasoning: containers are smaller, more focused, and more ephemeral than traditional virtual machines, which reduces attack surface. In a typical VM, you find mail agents, host-based intrusion detection, syslog, monitoring tools, and other services all coexisting with the application. In a container, you ideally run only the application itself.
Andrey pushes back immediately. He argues Mattias is comparing operational models, not technology. A well-run VM can also be immutable and minimal — you redeploy from a new image the same way you replace a container. Likewise, a badly built container can be long-lived, bloated, and full of unnecessary tools. Andrey has seen enterprises that run containers for months, SSH into them, and treat them like VMs.
Mattias concedes the point but maintains that the standard approach differs: VMs are typically kept running longer with more tools, while the standard approach for containers in Kubernetes is to rotate them and keep a smaller footprint. Andrey counters that most Docker images run as root by default, giving attackers more privilege than they would have on a typical VM where processes run under limited service accounts. This is one of the sharpest exchanges in the episode — better tooling does not fix insecure defaults.
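The root-by-default problem has a small, standard fix in the image itself: create and switch to an unprivileged user. A hedged Dockerfile sketch — base image, user name, and file names are illustrative assumptions:

```dockerfile
# Hedged sketch: drop root inside the image so the container does not
# inherit Docker's root-by-default behavior. Names are placeholders.
FROM python:3.12-slim
RUN useradd --create-home appuser
USER appuser
COPY app.py /home/appuser/app.py
CMD ["python", "/home/appuser/app.py"]
```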
The hosts eventually agree that both technologies can be secured well, but do not reach consensus on which is easier. Andrey summarizes it cleanly: containers make it "a little bit easier" to do the right thing because they narrow the focus to the application rather than the entire operating system, but it is absolutely possible to reach the same security level with VMs.
Why container breakout is not as trivial as people implyMattias challenges the common assumption that containers are unsafe because "you can break out of them." He points out that every container breakout CVE he has reviewed requires significant preconditions: either running an attacker-controlled image or running in privileged mode. You cannot take a standard Ubuntu container image, run a single command, and escape. The threat is real but requires chained attacks, not a single exploit.
Julien and Andrey accept the premise but note that the comparison matters. VM isolation is fundamentally stronger at the hypervisor level. Container breakout may be hard, but it is architecturally easier than VM escape. The discussion reframes the question: runtime security is less about one isolation boundary and more about how many obstacles an attacker must pass through.
Micro VMs, Firecracker, and Lambda attack vectorsAndrey brings up an important middle ground between containers and VMs: micro VMs. AWS Lambda runs on Firecracker, an open-source micro VM monitor. Lambdas are ephemeral, have read-only file systems, minimal tooling, and no access to source code or settings — making them quite secure by design.
But Andrey describes a real attack path researchers have demonstrated. The /tmp directory in Lambda is writable. If an attacker exploits a vulnerability to get code execution within the Lambda, and the Lambda is kept warm (invoked within 15 minutes so it stays in memory), the /tmp folder persists between invocations. An attacker can download tools incrementally across multiple Lambda runs, building up capability over time. From there, they can explore IAM permissions, exfiltrate data by encoding it in resource tags, or even override the Lambda function itself.
The point is that even well-designed ephemeral environments have attack paths when defenders are not paying attention. Security depends on hardening and monitoring, not just on the isolation primitive.
Containers do not contain: AppArmor, Seccomp, and policy controlsJulien delivers the episode's sharpest technical point: "Containers do not contain." They are primarily Linux namespace isolation and need additional kernel controls — AppArmor profiles and Seccomp filters — to properly restrict what applications can do at runtime. Without those extra layers, a container running as root is effectively root on the host machine, and a container with host network access is the same as running directly on the server.
This shifts security responsibility in uncomfortable ways. In VM environments, operations and security teams traditionally handle access controls. In containerized environments, developers are often expected to define security profiles for their workloads — but they may not know which system calls or privileges their applications need. Julien describes this as a fundamental organizational gap: the people writing the workload and the people securing the workload are rarely working hand in hand.
Mattias suggests that platform teams can solve this by enforcing policies centrally. He references tools like Open Policy Agent to set standards for what gets deployed into a cluster, rather than relying on every developer to configure security correctly.
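Centrally enforced policy of the kind Mattias describes might look like this in OPA's Rego language — a hedged sketch of an admission rule rejecting pods that do not declare a non-root security context (package name and field paths follow common Kubernetes admission-control examples, not any specific production setup):

```rego
# Hedged sketch: reject pods whose containers don't opt out of root.
# Package name and input shape follow common OPA admission examples.
package kubernetes.admission

deny[msg] {
  input.request.kind.kind == "Pod"
  some i
  container := input.request.object.spec.containers[i]
  not container.securityContext.runAsNonRoot
  msg := sprintf("container %s must set runAsNonRoot", [container.name])
}
```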
Kubernetes makes monitoring and response easierMattias makes a strong case for container platforms as detection and response environments. He describes working with Falco, a runtime security tool, and highlights a powerful capability: if someone opens a shell inside a container, Falco can detect that behavior and the container is killed automatically. That kind of automated response is natural in an environment built around disposable workloads. On a VM, shells are a normal part of operations, making the same detection much harder to act on.
Julien extends this into a broader argument about monitoring and security being inseparable. He argues that when monitoring is poor, access control becomes chaotic — developers need broad production access just to debug issues. But with strong observability, teams can use feature flags, targeted routing, and centralized logging instead of SSH-ing into production. Good monitoring reduces the need for risky access patterns.
Julien offers a practical example: instead of blocking developers from opening shells in containers, observe that they are doing it and ask why. If they need logs, build a secure log access API. If they need to debug, improve the observability tooling. Monitoring turns security violations into product requirements.
Minimizing container imagesJulien mentions using DockerSlim (now SlimToolkit) to strip unnecessary components from container images, reducing attack surface without requiring deep knowledge of every dependency. It is not a complete security solution, but it is an easy first step that removes much of the bloat containers inherit from their base images.
For organizations with compliance requirements, Julien notes that third-party security vendors provide validated runtime solutions — useful for audit purposes where you need a third party to confirm that the running workload matches what was built internally.
Bundling dependencies with the applicationMattias raises a concern about how containerization changes dependency management. In older models, operations maintained the web server (Apache, Nginx) separately from the application. In containers, the web server, runtime, and application are bundled together. That means patching the web server requires rebuilding and redeploying the entire container, even when the application code has not changed.
Andrey reframes this as a different packaging model, not a new problem. With Java WAR files deployed to Tomcat, you already had dependency coupling — you just managed it differently. Containers actually improve the situation in one way: each application owns its own dependency lifecycle instead of sharing an application server. One application can upgrade independently without affecting others on the same host.
Both hosts note that dedicated application servers are fading. Modern applications in Go, Python, and Node.js often handle HTTP directly, removing the need for a separate web server entirely. The ingress controller in Kubernetes handles routing at the cluster level, which is a separate concern from the application.
The hosts agree to disagreeThe episode ends without consensus. Mattias remains firmly convinced that containers, run properly in Kubernetes, are more secure than VMs. Julien's final position: "Containers can be as secure as VMs, but they need more work to get there." Andrey advocates for a layered approach — use both VMs and containers, with container security focused on application concerns and VM security focused on operational and resource isolation. He also notes that CoreOS, once the go-to minimal container OS, had recently been discontinued by IBM, leaving teams to find alternatives like Fedora CoreOS.
Highlights "Containers do not contain."Julien delivers the episode's most quotable line, reminding listeners that containers are mostly Linux namespacing — not real isolation boundaries. Without AppArmor, Seccomp, and careful configuration, a container is far less restrictive than people assume. A sharp reality check for anyone treating "containerized" as synonymous with "secure." Listen to the full episode on DevSecOps Talks to hear why container security is never just about packaging.
"If somebody pops a shell in a container, that container is killed."Mattias describes working with Falco and highlights a capability that captures the strongest pro-container argument: disposable workloads change the incident response model entirely. On a VM, a shell is normal. In a container, it is an alarm — and the platform can act on it automatically. Listen to the episode to hear how the hosts connect runtime detection, monitoring, and automated response.
"Most of the Docker images out there are running as root."Just when the debate leans in Docker's favor, Andrey brings it crashing back. On VMs, running as root is rare. In containers, it is the default. Better tooling does not fix insecure defaults — and this remains one of the most practical risks in container environments. Hear the full back-and-forth on DevSecOps Talks.
"We have to separate apples from bananas — the technology from the ways of working."Mattias draws a sharp line that reframes the entire debate. Are containers actually more secure, or are teams comparing modern container practices against outdated VM operations? A useful reminder that architecture arguments often hide workflow arguments underneath. Listen to the full conversation for the spirited disagreement that follows.
"Monitoring very much goes hand in hand with security."Julien makes the case that bad observability leads directly to bad access control. When developers cannot see what is happening in production safely, they need more privileges, more access, and more risky workarounds. Fix the monitoring, and many security problems solve themselves. Listen to the episode on DevSecOps Talks to hear why observability might be the most underrated security control.
"Containers can be as secure as VMs, but they need more work."Julien's final verdict — delivered over Mattias's loud objections — perfectly captures the episode's unresolved tension. The hosts explicitly agree to disagree, making this one of the more honest security debates you will hear on a podcast. Catch the full exchange on DevSecOps Talks.
ResourcesFalco — CNCF-graduated runtime security tool that detects anomalous behavior in containers and Kubernetes using eBPF. Mentioned by Mattias for its ability to automatically kill containers when suspicious activity like shell access is detected.
Firecracker — Open-source micro VM monitor built by AWS, powering Lambda and Fargate. Discussed by Andrey as an example of ephemeral, hardened execution environments and their attack surfaces.
SlimToolkit (formerly DockerSlim) — Tool for analyzing and minimizing container images, automatically generating AppArmor and Seccomp profiles. Mentioned by Julien as a practical way to reduce attack surface without deep security expertise.
Open Policy Agent (OPA) — General-purpose policy engine for enforcing security and operational policies across Kubernetes clusters. Referenced by Mattias for centrally enforcing deployment standards.
AppArmor — Linux kernel security module that restricts application capabilities through mandatory access control profiles. Discussed by Julien as an essential add-on for meaningful container isolation.
Seccomp (Secure Computing Mode) — Linux kernel facility that restricts which system calls a process can make. Used by Docker and Kubernetes to reduce the container attack surface by blocking unnecessary syscalls.
Fedora CoreOS — Successor to CoreOS Container Linux (discontinued 2020), a minimal, auto-updating operating system designed for running containerized workloads. Relevant context for Andrey's mention of CoreOS being killed by IBM.
In this episode Mattias is trying to convince that running docker in k8s is more security then VM. Did he success ? listen and find out.
SummaryMattias makes a bold claim: Docker containers are more secure than virtual machines. Andrey and Julien push back hard — and by the end, the three hosts explicitly agree to disagree. Along the way, they dig into why container breakouts are harder than people assume, how Lambda micro VMs can be exploited through warm TMP folders, why "containers do not contain" without extra kernel controls, and whether good monitoring matters more for security than any isolation technology. Recorded during COVID-19 lockdowns in 2020, the debate captures a moment when the container-vs-VM argument was far from settled.
Key Topics Docker vs. VM security: technology vs. ways of workingMattias opens the main debate by arguing that Docker containers are more secure than VMs in practice. His reasoning: containers are smaller, more focused, and more ephemeral than traditional virtual machines, which reduces attack surface. In a typical VM, you find mail agents, host-based intrusion detection, syslog, monitoring tools, and other services all coexisting with the application. In a container, you ideally run only the application itself.
Andrey pushes back immediately. He argues Mattias is comparing operational models, not technology. A well-run VM can also be immutable and minimal — you redeploy from a new image the same way you replace a container. Likewise, a badly built container can be long-lived, bloated, and full of unnecessary tools. Andrey has seen enterprises that run containers for months, SSH into them, and treat them like VMs.
Mattias concedes the point but maintains that the standard approach differs: VMs are typically kept running longer with more tools, while the standard approach for containers in Kubernetes is to rotate them and keep a smaller footprint. Andrey counters that most Docker images run as root by default, giving attackers more privilege than they would have on a typical VM where processes run under limited service accounts. This is one of the sharpest exchanges in the episode — better tooling does not fix insecure defaults.
The hosts eventually agree that both technologies can be secured well, but do not reach consensus on which is easier. Andrey summarizes it cleanly: containers make it "a little bit easier" to do the right thing because they narrow the focus to the application rather than the entire operating system, but it is absolutely possible to reach the same security level with VMs.
Why container breakout is not as trivial as people implyMattias challenges the common assumption that containers are unsafe because "you can break out of them." He points out that every container breakout CVE he has reviewed requires significant preconditions: either running an attacker-controlled image or running in privileged mode. You cannot take a standard Ubuntu container image, run a single command, and escape. The threat is real but requires chained attacks, not a single exploit.
Julien and Andrey accept the premise but note that the comparison matters. VM isolation is fundamentally stronger at the hypervisor level. Container breakout may be hard, but it is architecturally easier than VM escape. The discussion reframes the question: runtime security is less about one isolation boundary and more about how many obstacles an attacker must pass through.
Micro VMs, Firecracker, and Lambda attack vectorsAndrey brings up an important middle ground between containers and VMs: micro VMs. AWS Lambda runs on Firecracker, an open-source micro VM monitor. Lambdas are ephemeral, have read-only file systems, minimal tooling, and no access to source code or settings — making them quite secure by design.
But Andrey describes a real attack path researchers have demonstrated. The /tmp directory in Lambda is writable. If an attacker exploits a vulnerability to get code execution within the Lambda, and the Lambda is kept warm (invoked within 15 minutes so it stays in memory), the /tmp folder persists between invocations. An attacker can download tools incrementally across multiple Lambda runs, building up capability over time. From there, they can explore IAM permissions, exfiltrate data by encoding it in resource tags, or even override the Lambda function itself.
The point is that even well-designed ephemeral environments have attack paths when defenders are not paying attention. Security depends on hardening and monitoring, not just on the isolation primitive.
Containers do not contain: AppArmor, Seccomp, and policy controlsJulien delivers the episode's sharpest technical point: "Containers do not contain." They are primarily Linux namespace isolation and need additional kernel controls — AppArmor profiles and Seccomp filters — to properly restrict what applications can do at runtime. Without those extra layers, a container running as root is effectively root on the host machine, and a container with host network access is the same as running directly on the server.
This shifts security responsibility in uncomfortable ways. In VM environments, operations and security teams traditionally handle access controls. In containerized environments, developers are often expected to define security profiles for their workloads — but they may not know which system calls or privileges their applications need. Julien describes this as a fundamental organizational gap: the people writing the workload and the people securing the workload are rarely working hand in hand.
Mattias suggests that platform teams can solve this by enforcing policies centrally. He references tools like Open Policy Agent to set standards for what gets deployed into a cluster, rather than relying on every developer to configure security correctly.
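As a sketch of what such a centrally enforced policy might look like (Rego syntax; the package name and rule body are illustrative, not taken from the episode), an admission policy could reject any workload that does not explicitly refuse to run as root:

```rego
package kubernetes.admission

# Illustrative OPA policy: deny any container in an incoming pod spec
# that does not declare runAsNonRoot in its security context.
deny[msg] {
    container := input.request.object.spec.containers[_]
    not container.securityContext.runAsNonRoot
    msg := sprintf("container %q must set securityContext.runAsNonRoot", [container.name])
}
```

The point of the pattern is that one policy, maintained by the platform team, covers every deployment — no individual developer has to remember to configure it.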
Kubernetes makes monitoring and response easier
Mattias makes a strong case for container platforms as detection and response environments. He describes working with Falco, a runtime security tool, and highlights a powerful capability: if someone opens a shell inside a container, Falco can detect that behavior and the container is killed automatically. That kind of automated response is natural in an environment built around disposable workloads. On a VM, shells are a normal part of operations, making the same detection much harder to act on.
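Falco ships a built-in rule along these lines; the sketch below is a simplified, illustrative version showing the shape of such a detection (Falco's bundled "Terminal shell in container" rule is more complete):

```yaml
# Simplified, illustrative Falco-style rule: alert when an interactive
# shell starts inside a container. `spawned_process` and `container`
# are macros provided by Falco's default rule set.
- rule: Shell spawned in container
  desc: An interactive shell was started inside a container
  condition: >
    spawned_process and container and proc.name in (bash, sh, zsh, ash)
  output: Shell in container (user=%user.name container=%container.name proc=%proc.name)
  priority: WARNING
```

Paired with a response engine that deletes the offending pod, this is the "shell opens, container dies" workflow Mattias describes.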
Julien extends this into a broader argument about monitoring and security being inseparable. He argues that when monitoring is poor, access control becomes chaotic — developers need broad production access just to debug issues. But with strong observability, teams can use feature flags, targeted routing, and centralized logging instead of SSH-ing into production. Good monitoring reduces the need for risky access patterns.
Julien offers a practical example: instead of blocking developers from opening shells in containers, observe that they are doing it and ask why. If they need logs, build a secure log access API. If they need to debug, improve the observability tooling. Monitoring turns security violations into product requirements.
Minimizing container images
Julien mentions using DockerSlim (now SlimToolkit) to strip unnecessary components from container images, reducing attack surface without requiring deep knowledge of every dependency. It is not a complete security solution, but it is an easy first step that removes much of the bloat containers inherit from their base images.
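Assuming a recent SlimToolkit release (image names here are placeholders), the minification step is a single command:

```shell
# Analyze the image's runtime behavior and emit a minimal variant.
# Older releases use the `docker-slim` binary name instead of `slim`.
slim build --target my-app:latest --tag my-app:slim
```

The tool observes what the application actually uses at runtime and discards the rest, which is why it works without deep knowledge of every dependency.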
For organizations with compliance requirements, Julien notes that third-party security vendors provide validated runtime solutions — useful for audit purposes where you need a third party to confirm that the running workload matches what was built internally.
Bundling dependencies with the application
Mattias raises a concern about how containerization changes dependency management. In older models, operations maintained the web server (Apache, Nginx) separately from the application. In containers, the web server, runtime, and application are bundled together. That means patching the web server requires rebuilding and redeploying the entire container, even when the application code has not changed.
Andrey reframes this as a different packaging model, not a new problem. With Java WAR files deployed to Tomcat, you already had dependency coupling — you just managed it differently. Containers actually improve the situation in one way: each application owns its own dependency lifecycle instead of sharing an application server. One application can upgrade independently without affecting others on the same host.
Both hosts note that dedicated application servers are fading. Modern applications in Go, Python, and Node.js often handle HTTP directly, removing the need for a separate web server entirely. The ingress controller in Kubernetes handles routing at the cluster level, which is a separate concern from the application.
The hosts agree to disagree
The episode ends without consensus. Mattias remains firmly convinced that containers, run properly in Kubernetes, are more secure than VMs. Julien's final position: "Containers can be as secure as VMs, but they need more work to get there." Andrey advocates for a layered approach — use both VMs and containers, with container security focused on application concerns and VM security focused on operational and resource isolation. He also notes that CoreOS, once the go-to minimal container OS, had recently been discontinued by IBM, leaving teams to find alternatives like Fedora CoreOS.
Highlights

"Containers do not contain."
Julien delivers the episode's most quotable line, reminding listeners that containers are mostly Linux namespacing — not real isolation boundaries. Without AppArmor, Seccomp, and careful configuration, a container is far less restrictive than people assume. A sharp reality check for anyone treating "containerized" as synonymous with "secure." Listen to the full episode on DevSecOps Talks to hear why container security is never just about packaging.

"If somebody pops a shell in a container, that container is killed."
Mattias describes working with Falco and highlights a capability that captures the strongest pro-container argument: disposable workloads change the incident response model entirely. On a VM, a shell is normal. In a container, it is an alarm — and the platform can act on it automatically. Listen to the episode to hear how the hosts connect runtime detection, monitoring, and automated response.

"Most of the Docker images out there are running as root."
Just when the debate leans in Docker's favor, Mattias himself brings it crashing back. On VMs, running as root is rare. In containers, it is the default. Better tooling does not fix insecure defaults — and this remains one of the most practical risks in container environments. Hear the full back-and-forth on DevSecOps Talks.

"We have to separate apples from bananas — the technology from the ways of working."
Mattias draws a sharp line that reframes the entire debate. Are containers actually more secure, or are teams comparing modern container practices against outdated VM operations? A useful reminder that architecture arguments often hide workflow arguments underneath. Listen to the full conversation for the spirited disagreement that follows.

"Monitoring very much goes hand in hand with security."
Julien makes the case that bad observability leads directly to bad access control. When developers cannot see what is happening in production safely, they need more privileges, more access, and more risky workarounds. Fix the monitoring, and many security problems solve themselves. Listen to the episode on DevSecOps Talks to hear why observability might be the most underrated security control.

"Containers can be as secure as VMs, but they need more work."
Julien's final verdict — delivered over Mattias's loud objections — perfectly captures the episode's unresolved tension. The hosts explicitly agree to disagree, making this one of the more honest security debates you will hear on a podcast. Catch the full exchange on DevSecOps Talks.
Resources
Falco — CNCF-graduated runtime security tool that detects anomalous behavior in containers and Kubernetes using eBPF. Mentioned by Mattias for its ability to automatically kill containers when suspicious activity like shell access is detected.
Firecracker — Open-source micro VM monitor built by AWS, powering Lambda and Fargate. Discussed by Andrey as an example of ephemeral, hardened execution environments and their attack surfaces.
SlimToolkit (formerly DockerSlim) — Tool for analyzing and minimizing container images, automatically generating AppArmor and Seccomp profiles. Mentioned by Julien as a practical way to reduce attack surface without deep security expertise.
Open Policy Agent (OPA) — General-purpose policy engine for enforcing security and operational policies across Kubernetes clusters. Referenced by Mattias for centrally enforcing deployment standards.
AppArmor — Linux kernel security module that restricts application capabilities through mandatory access control profiles. Discussed by Julien as an essential add-on for meaningful container isolation.
Seccomp (Secure Computing Mode) — Linux kernel facility that restricts which system calls a process can make. Used by Docker and Kubernetes to reduce the container attack surface by blocking unnecessary syscalls.
Fedora CoreOS — Successor to CoreOS Container Linux (discontinued 2020), a minimal, auto-updating operating system designed for running containerized workloads. Relevant context for Andrey's mention of CoreOS being killed by IBM.
Your Docker images and builds are becoming the base for your platform. But are they secure? In this episode we talk about how you can secure your Docker images.
Summary
In this early DevSecOps Talks episode, Mattias, Andrey, and Julien dig into Docker security as a supply chain problem — and quickly dismantle the assumption that a signed container means you know what is inside. Julien pushes back sharply: signing only gives a "semantic guarantee" that an image is what it claims to be, not that it is safe. Mattias argues that containers were designed to be convenient, not secure by default, while Andrey points out that containerization has fundamentally changed the patching game — once the OS, web server, and application are packaged together, every security fix becomes a rebuild-and-redeploy exercise. The hosts make the case for layered scanning, slim runtime images, multi-stage builds, and continuous rebuilding as the practical path to running containers safely in production.
Key Topics

Container images vs. running containers
The conversation starts by separating two distinct parts of container security: the image and the running container.
Mattias explains that a container image can be treated much like any other file or archive — a zip or tar file sitting on disk. Because of that, teams can sign images cryptographically to verify origin and integrity, similar to how Node.js developers sign releases with their private keys. That gives consumers confidence that the image came from a known source and has not been tampered with.
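Docker's built-in mechanism for this is Content Trust. As a brief sketch (the registry and image names are placeholders), enabling it makes pulls fail unless the tag carries a valid signature:

```shell
# With Content Trust enabled, docker pull/push verify Notary signatures
# before accepting an image.
export DOCKER_CONTENT_TRUST=1
docker pull registry.example.com/my-app:1.0   # rejected if the tag is unsigned
```

This enforces authenticity at the client — which, as the next point makes clear, is necessary but not sufficient.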
But Julien pushes back on a common misunderstanding: signing does not mean the contents are inherently safe. As he puts it, you get a "semantic guarantee that this image is what it's pretending to be" — but not proof that everything inside is secure. Authenticity is not the same as security.
The hosts frame this as a trust problem. In a production cluster, teams often want to prevent engineers or workloads from pulling arbitrary images and running them without controls. Signed images and curated registries help, but they do not eliminate the need for careful validation.
Trust, Docker Hub, and the container supply chain
A major part of the episode focuses on how much trust teams should place in public images, including those from Docker Hub.
Andrey raises the practical reality: if you are running four different languages, you cannot build and maintain base images for all of them. It is much easier to grab the latest Node.js, Python, Ruby, or Java images from Docker Hub and build from there. Julien and Mattias acknowledge that reality, but caution against treating "official" or branded images as automatically secure.
Julien walks through the different trust levels on Docker Hub.
That leads into a broader discussion of supply chain attacks. Julien references real examples where Node.js libraries on npm were taken over by malicious parties after the original maintainer walked away. The same risk applies to container images.
Julien points out that large organizations sometimes go as far as rebuilding all dependencies from source — he mentions having heard of teams that do not pull jar files from Maven Central but build their own from source to verify exactly what they are shipping. While that is not feasible for every team, the principle stands: reduce blind trust and increase verification where the environment demands it.
Why container security is not just image signing
The discussion then shifts from image authenticity to runtime security.
Mattias explains that containers rely on Linux kernel primitives — namespaces for process isolation, along with controls for networking, memory, and disk. These low-level APIs are useful for resource sharing and scaling, but they were not originally designed as strong security boundaries. As he puts it, "the container does not contain things, it's just an abstraction." Container breakout vulnerabilities matter because an attacker who can exploit the runtime or host interface may reach beyond the container itself.
This leads to one of the episode's sharpest observations from Mattias: containers became popular because they are efficient and convenient to operate — you can bin-pack them on the same hardware and run far more applications per server. But from a security perspective, "it was not designed to be secure by default, it was designed to be convenient." That gap between convenience and security is what teams must actively address through scanning, hardening, and runtime controls.
CVE scanning: registries, dependencies, and source code
The hosts spend a good amount of time discussing scanning tools and where each fits in the security pipeline.
Mattias notes that most container registries now offer built-in vulnerability scanning, sometimes called container analysis APIs. Julien suggests a practical AWS-based pattern: if you do not want to pay for Docker Hub premium but still want to use public images, you can pull from Docker Hub, push into AWS Elastic Container Registry (ECR), and take advantage of its built-in CVE scanning. Then you restrict your production orchestrators to pull only from ECR.
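The pattern Julien describes can be sketched as follows (account ID, region, and image are placeholders; `scanOnPush` is a real ECR repository setting):

```shell
# Create an ECR repository whose built-in CVE scan runs on every push.
aws ecr create-repository --repository-name node \
    --image-scanning-configuration scanOnPush=true

# Mirror the public image into ECR.
docker pull node:20-slim
docker tag node:20-slim <account-id>.dkr.ecr.<region>.amazonaws.com/node:20-slim
docker push <account-id>.dkr.ecr.<region>.amazonaws.com/node:20-slim

# Production clusters are then restricted to pull from ECR only.
```

The scan results surface in the ECR console or API, giving the team a CVE report on public images without a paid Docker Hub plan.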
Julien draws an important distinction between the different types of scanning.
Julien initially states that registry scans do not cover source code, then corrects himself to clarify the distinction more precisely: registries scan installed OS packages, while separate tools scan programming language dependencies. Neither deeply analyzes your own custom code. That leaves an unknown component in the stack that teams need to address through other means — code review, testing, and secure development practices.
Andrey also mentions using Anchore, which he describes as the foundation for many of these CVE scanning capabilities.
The shift from OS patching to image rebuilding
One of the most practical insights comes from Andrey, who compares containers to older operational models.
In traditional environments, teams could patch the operating system or update components like Nginx independently of the application. With containers, those layers are packaged together. If a new Nginx vulnerability is disclosed, the team needs to rebuild and redeploy the entire image that contains both the web server and the application code.
This changes patching from an infrastructure task into an application delivery task. Security updates are no longer something ops handles in isolation — they flow through the same build-and-deploy pipeline as feature code.
The hosts argue that this is why security must be a concern from the earliest stages. As Andrey puts it, referencing Julien's earlier point: security belongs in the first commit, because that is when it is cheapest and easiest to get right. A green build today does not guarantee a safe deployment tomorrow if new CVEs are published against the packages already running in production.
Slim images, distroless approaches, and DockerSlim
Mattias argues strongly for reducing container contents to the bare minimum. He highlights DockerSlim (now SlimToolkit), a project he uses frequently that strips images down to only the components essential for the application. In his example, a Maven-based application image dropped from roughly 600 MB to 140 MB — with no bash shell or other standard OS tooling left in the result.
Julien reinforces the security rationale: "the less code you have, the less vulnerability you have, and that's what you want in production." He mentions Alpine Linux and Google's distroless images as complementary approaches that aim for the same goal — minimal OS footprint in production containers.
The common theme is that production containers should not carry build tools, shells, package managers, or debugging utilities. Every unnecessary binary is a potential attack surface. The best production image is not the one easiest to build, but the one that contains the least unnecessary code.
Multi-stage builds and separate build vs. runtime images
The hosts spend considerable time on one of the most practical Docker security patterns: multi-stage builds.
Julien explains the concept of build stages — using an intermediate container with all build dependencies to compile the application, then copying only the final artifact into a much smaller production image. This separation means the production image does not need compilers, package managers, or the full dependency tree.
Andrey confirms this maps directly to Docker's multi-stage build feature: "You just build your Docker build in one stage and then just copy build results to the next stage." He also points out the developer experience benefit — since the build environment is defined inside the Dockerfile itself, developers do not need to set up different language toolchains on their local machines when working across multiple microservices.
Julien adds a performance angle: pulling a pre-built container image with cached dependencies is often much faster than resolving and fetching all dependencies from scratch. He has seen Maven builds that took 20 minutes purely because they had to re-fetch all artifacts every time. Pre-building and caching the dependency layer can dramatically improve total build-to-production time.
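A minimal Dockerfile sketch of the pattern (image tags and paths are illustrative) covers both points — a separate build stage, and the dependency layer cached independently of the source code:

```dockerfile
# Stage 1: full toolchain, used only to build the artifact.
FROM maven:3.9-eclipse-temurin-21 AS build
WORKDIR /src
COPY pom.xml .
RUN mvn -q dependency:go-offline      # dependencies cached as their own layer
COPY src ./src
RUN mvn -q package -DskipTests

# Stage 2: slim runtime image; no compilers, package managers, or build tools.
FROM eclipse-temurin:21-jre
COPY --from=build /src/target/app.jar /app.jar
ENTRYPOINT ["java", "-jar", "/app.jar"]
```

Because `pom.xml` is copied before the source, the dependency layer is only rebuilt when dependencies actually change — the caching win Julien describes.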
Continuous rebuilding and reducing attacker persistence
Andrey recommends reducing the lifetime of deployed images by rebuilding base images and all derived containers regularly — potentially every week — pulling in the latest patches each time. While this adds operational overhead, it shortens the window of exposure and makes it significantly harder for attackers to maintain persistence in stale environments.
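In a CI system this is typically a scheduled pipeline; sketched here in GitHub Actions syntax (assumed — the episode does not name a CI tool, and the image name is a placeholder):

```yaml
# Rebuild and republish images weekly so base-image patches flow through
# even when no application code has changed.
on:
  schedule:
    - cron: "0 3 * * 1"   # every Monday, 03:00 UTC
jobs:
  rebuild:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build --pull -t my-app:latest .   # --pull refreshes the base image
      - run: docker push my-app:latest                # registry auth omitted for brevity
```

The `--pull` flag is the key detail: without it, a cached stale base image can quietly defeat the whole exercise.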
Julien frames this as a recurring maintenance budget that every engineering team must accept. As he puts it, "if you don't spend at least one day per week updating the stuff, it's going to accumulate over a year or something. And then you have to spend two weeks fixing all that." The compound interest on security debt is steep.
Tags, digests, signing, and private registries
Near the end of the episode, Mattias raises a practical deployment question: how should teams store and reference images securely? He contrasts mutable tags (which can be overwritten on Docker Hub) with immutable SHA-based digests, image signing, and private registries — and admits there are so many options it is hard to know where to start.
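The tag-versus-digest difference is easy to show on the command line (the image name is illustrative): a tag can be re-pointed at new content at any time, while a digest pins the exact bytes:

```shell
# Look up the immutable digest behind a mutable tag...
docker inspect --format '{{index .RepoDigests 0}}' nginx:latest

# ...then reference the image by digest in deployment manifests:
#   nginx@sha256:<digest>
# Pulls by digest always resolve to the same content; pulls by tag may not.
```

Pinning by digest is one of the cheapest controls on Mattias's list, which is why it is often a sensible starting layer.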
Julien recommends implementing all of these controls, but not all at once. He advocates for an incremental approach: define your security objectives, then build toward them layer by layer. Start with what gives the most immediate protection and expand from there.
The hosts do not present a single silver bullet. Instead, they emphasize defense in depth: scanning at every level (code dependencies, container base images, production images), signing for authenticity, private registries for access control, and infrastructure-level enforcement.
Build pipeline security and handling secrets
The episode closes by touching on a problem the hosts agree deserves its own dedicated discussion: securing the build system itself.
Mattias points out that the build server has access to source code, credentials, signing keys, registries, and deployment systems. If an attacker compromises it, they can inject malicious code during the build process — effectively poisoning everything downstream.
The hosts then discuss the challenge of passing credentials into container builds for private dependencies. Andrey notes that recent Docker versions support passing SSH agents and secrets more safely during builds. He recommends using short-lived credentials (like AWS STS tokens with 15-minute expiration) so that even if credentials leak into image layers, they are already expired by the time anyone could exploit them. He also mentions using IMG, a daemonless image builder, as an alternative to Docker that avoids the need for a Docker daemon during builds.
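The safer mechanisms Andrey alludes to are BuildKit's secret and SSH mounts; a hedged sketch (the file paths, secret IDs, and repository URL are illustrative):

```shell
# BuildKit mounts secrets only for the duration of a single RUN step,
# so they never end up baked into an image layer. In the Dockerfile:
#   RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm ci
#   RUN --mount=type=ssh git clone git@github.com:example/private-dep.git
DOCKER_BUILDKIT=1 docker build \
    --secret id=npmrc,src=$HOME/.npmrc \
    --ssh default \
    -t my-app:latest .
```

Combined with short-lived credentials, even a secret that did leak would be useless by the time an attacker found it.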
Julien takes a different approach to runtime secrets: encrypting them with KMS and storing them in a cloud bucket, then fetching them only at container startup. He observes that the real cloud vendor lock-in is never the runtime — "it's always the IAM" — because authorization and access control mechanisms are deeply cloud-specific and difficult to migrate.
Julien adds that handling build secrets often becomes an awkward "dance" of fetching credentials, granting temporary access, and cleaning up afterward. It works, but it remains operationally clumsy.
The hosts agree that two topics deserve future episodes of their own: hardening the build server, and the connection between security and cost management — natural partners, Julien briefly notes, since understanding who has access to what benefits both.
Highlights

"You don't know what's inside — you only have a semantic guarantee."
Julien cuts through a common assumption in container security: signing an image proves origin, not safety. That distinction shapes the entire episode, as the hosts explore why authenticity, trust, and actual security are three separate problems. Listen to this episode of DevSecOps Talks for a grounded discussion on what image signing can — and cannot — guarantee.

"Containers were designed to be convenient, not secure by default."
Mattias makes one of the sharpest points of the episode: containers became popular because they are efficient and easy to operate, not because they provide strong isolation. The container "does not contain things, it's just an abstraction." That is why runtime hardening and vulnerability management still matter so much. Listen to DevSecOps Talks to hear why container adoption created as many security questions as it solved.

"Official on Docker Hub doesn't mean secure — scan a Jenkins image and you'd be surprised."
Julien challenges the idea that a branded or official image should be trusted blindly. Even well-known organization-backed images can contain a surprising number of CVEs, and reputable sources can still introduce malicious changes — intentionally or by mistake. Listen to this DevSecOps Talks episode for a practical conversation about defining trust in your container supply chain.

"The less code you have, the less vulnerability you have."
Julien sums up a recurring theme: smaller runtime images are not just cleaner — they are fundamentally safer. From DockerSlim shrinking a 600 MB Maven image to 140 MB, to Alpine and distroless approaches, the hosts argue for removing everything production does not absolutely need. Listen to DevSecOps Talks to hear why image size and security are more connected than many teams realize.

"Nginx gets a CVE? Now you have to rebuild your entire app."
Andrey highlights how containerization merged the OS patching cycle with the application delivery cycle. In the old world, ops could patch Nginx without touching the app. In the container world, every security update means a full image rebuild and redeploy — making security an application delivery concern, not just an infrastructure one. Listen to this DevSecOps Talks episode for a practical take on why modern patching must flow through the CI/CD pipeline.

"If you don't spend one day a week updating, you'll spend two weeks fixing it later."
Julien describes dependency and image maintenance as a non-negotiable recurring budget. Skip the updates and the security debt compounds fast — turning routine maintenance into an emergency remediation project. Listen to DevSecOps Talks for an honest take on the operational cost of staying secure in containerized environments.

"The real lock-in is never the runtime — it's always the IAM."
In a brief but pointed aside about handling secrets in containers, Julien observes that authorization and access control are the truly cloud-specific parts of any architecture. Runtime workloads can move; IAM policies cannot. Listen to this DevSecOps Talks episode for a candid discussion on where the real complexity lies in cloud-native security.
Resources
SlimToolkit (formerly DockerSlim) — Open-source tool that minifies container images by removing non-essential components, reducing image size and attack surface without code changes. Mentioned by Mattias in the episode.
Google Distroless Container Images — Minimal container base images from Google that contain only the application and its runtime dependencies, stripping out shells, package managers, and OS utilities.
Docker Multi-Stage Builds — Official Docker documentation on using multiple build stages to produce smaller, cleaner production images by separating the build environment from the runtime image.
Docker Content Trust — Docker's built-in mechanism for cryptographic signing and verification of image integrity and publisher identity using Notary.
Amazon ECR Image Scanning — AWS documentation on scanning container images for OS and language package vulnerabilities in Elastic Container Registry, mentioned by Julien as a practical alternative to paid Docker Hub scanning.
Snyk Container — Developer security tool for scanning container images and application dependencies for known vulnerabilities, with remediation guidance and base image upgrade recommendations.
Anchore Container Scanning — SBOM-powered container vulnerability scanning platform, referenced by Andrey as the engine behind many registry-level CVE scanning capabilities.
Alpine Linux Docker Image — Minimal 5 MB base image built on musl libc and BusyBox, widely used as a lightweight, security-conscious alternative to full Linux distribution base images.
Andrey confirms this maps directly to Docker's multi-stage build feature: "You just build your Docker build in one stage and then just copy build results to the next stage." He also points out the developer experience benefit — since the build environment is defined inside the Dockerfile itself, developers do not need to set up different language toolchains on their local machines when working across multiple microservices.
Julien adds a performance angle: pulling a pre-built container image with cached dependencies is often much faster than resolving and fetching all dependencies from scratch. He has seen Maven builds that took 20 minutes purely because they had to re-fetch all artifacts every time. Pre-building and caching the dependency layer can dramatically improve total build-to-production time.
Continuous rebuilding and reducing attacker persistenceAndrey recommends reducing the lifetime of deployed images by rebuilding base images and all derived containers regularly — potentially every week — pulling in the latest patches each time. While this adds operational overhead, it shortens the window of exposure and makes it significantly harder for attackers to maintain persistence in stale environments.
Julien frames this as a recurring maintenance budget that every engineering team must accept. As he puts it, "if you don't spend at least one day per week updating the stuff, it's going to accumulate over a year or something. And then you have to spend two weeks fixing all that." The compound interest on security debt is steep.
Tags, digests, signing, and private registriesNear the end of the episode, Mattias raises a practical deployment question: how should teams store and reference images securely? He contrasts mutable tags (which can be overwritten on Docker Hub) with immutable SHA-based digests, image signing, and private registries — and admits there are so many options it is hard to know where to start.
Julien recommends implementing all of these controls, but not all at once. He advocates for an incremental approach: define your security objectives, then build toward them layer by layer. Start with what gives the most immediate protection and expand from there.
The hosts do not present a single silver bullet. Instead, they emphasize defense in depth: scanning at every level (code dependencies, container base images, production images), signing for authenticity, private registries for access control, and infrastructure-level enforcement.
Build pipeline security and handling secretsThe episode closes by touching on a problem the hosts agree deserves its own dedicated discussion: securing the build system itself.
Mattias points out that the build server has access to source code, credentials, signing keys, registries, and deployment systems. If an attacker compromises it, they can inject malicious code during the build process — effectively poisoning everything downstream.
The hosts then discuss the challenge of passing credentials into container builds for private dependencies. Andrey notes that recent Docker versions support passing SSH agents and secrets more safely during builds. He recommends using short-lived credentials (like AWS STS tokens with 15-minute expiration) so that even if credentials leak into image layers, they are already expired by the time anyone could exploit them. He also mentions using IMG, a daemonless image builder, as an alternative to Docker that avoids the need for a Docker daemon during builds.
Julien takes a different approach to runtime secrets: encrypting them with KMS and storing them in a cloud bucket, then fetching them only at container startup. He observes that the real cloud vendor lock-in is never the runtime — "it's always the IAM" — because authorization and access control mechanisms are deeply cloud-specific and difficult to migrate.
Julien adds that handling build secrets often becomes an awkward "dance" of fetching credentials, granting temporary access, and cleaning up afterward. It works, but it remains operationally clumsy.
The hosts agree that build server hardening and the connection between security and cost management (which Julien briefly mentions as natural partners, since understanding who has access to what benefits both) are topics worthy of their own future episodes.
Highlights "You don't know what's inside — you only have a semantic guarantee."Julien cuts through a common assumption in container security: signing an image proves origin, not safety. That distinction shapes the entire episode, as the hosts explore why authenticity, trust, and actual security are three separate problems. Listen to this episode of DevSecOps Talks for a grounded discussion on what image signing can — and cannot — guarantee.
"Containers were designed to be convenient, not secure by default."Mattias makes one of the sharpest points of the episode: containers became popular because they are efficient and easy to operate, not because they provide strong isolation. The container "does not contain things, it's just an abstraction." That is why runtime hardening and vulnerability management still matter so much. Listen to DevSecOps Talks to hear why container adoption created as many security questions as it solved.
"Official on Docker Hub doesn't mean secure — scan a Jenkins image and you'd be surprised."Julien challenges the idea that a branded or official image should be trusted blindly. Even well-known organization-backed images can contain a surprising number of CVEs, and reputable sources can still introduce malicious changes — intentionally or by mistake. Listen to this DevSecOps Talks episode for a practical conversation about defining trust in your container supply chain.
"The less code you have, the less vulnerability you have."Julien sums up a recurring theme: smaller runtime images are not just cleaner — they are fundamentally safer. From DockerSlim shrinking a 600 MB Maven image to 140 MB, to Alpine and distroless approaches, the hosts argue for removing everything production does not absolutely need. Listen to DevSecOps Talks to hear why image size and security are more connected than many teams realize.
"Nginx gets a CVE? Now you have to rebuild your entire app."Andrey highlights how containerization merged the OS patching cycle with the application delivery cycle. In the old world, ops could patch Nginx without touching the app. In the container world, every security update means a full image rebuild and redeploy — making security an application delivery concern, not just an infrastructure one. Listen to this DevSecOps Talks episode for a practical take on why modern patching must flow through the CI/CD pipeline.
"If you don't spend one day a week updating, you'll spend two weeks fixing it later."Julien describes dependency and image maintenance as a non-negotiable recurring budget. Skip the updates and the security debt compounds fast — turning routine maintenance into an emergency remediation project. Listen to DevSecOps Talks for an honest take on the operational cost of staying secure in containerized environments.
"The real lock-in is never the runtime — it's always the IAM."In a brief but pointed aside about handling secrets in containers, Julien observes that authorization and access control are the truly cloud-specific parts of any architecture. Runtime workloads can move; IAM policies cannot. Listen to this DevSecOps Talks episode for a candid discussion on where the real complexity lies in cloud-native security.
ResourcesSlimToolkit (formerly DockerSlim) — Open-source tool that minifies container images by removing non-essential components, reducing image size and attack surface without code changes. Mentioned by Mattias in the episode.
Google Distroless Container Images — Minimal container base images from Google that contain only the application and its runtime dependencies, stripping out shells, package managers, and OS utilities.
Docker Multi-Stage Builds — Official Docker documentation on using multiple build stages to produce smaller, cleaner production images by separating the build environment from the runtime image.
Docker Content Trust — Docker's built-in mechanism for cryptographic signing and verification of image integrity and publisher identity using Notary.
Amazon ECR Image Scanning — AWS documentation on scanning container images for OS and language package vulnerabilities in Elastic Container Registry, mentioned by Julien as a practical alternative to paid Docker Hub scanning.
Snyk Container — Developer security tool for scanning container images and application dependencies for known vulnerabilities, with remediation guidance and base image upgrade recommendations.
Anchore Container Scanning — SBOM-powered container vulnerability scanning platform, referenced by Andrey as the engine behind many registry-level CVE scanning capabilities.
Alpine Linux Docker Image — Minimal 5 MB base image built on musl libc and BusyBox, widely used as a lightweight, security-conscious alternative to full Linux distribution base images.
GitOps: a new concept in DevOps. What is it, and how can you use it when deploying and setting up your Kubernetes cluster?
SummaryGitOps sounds simple — put Kubernetes manifests in Git and let the cluster pull changes — but the episode quickly reveals the real debate is not about Git at all. Andrey argues the only genuinely novel thing about GitOps is the pull-based model where an in-cluster agent reconciles state, while Julien questions whether GitOps is good for day-2 operations or just for bootstrapping clusters. The spiciest moment: Andrey declares "life is too short to do pull requests" and advocates pushing straight to master with strong CI/CD guardrails instead.
Key Topics What GitOps actually is — and what it is notAndrey frames the discussion by separating what is genuinely new about GitOps from what teams have already been doing for years. Storing deployment specifications in Git, he argues, is just version control — teams have done that for a decade. The meaningful difference is the deployment model: instead of an external CI/CD server pushing changes into Kubernetes by calling the cluster API, GitOps places an agent inside the cluster that either receives a webhook or polls a Git repository, pulls in the desired state, and applies it from within.
That pull-based model is what Andrey identifies as the core innovation. It eliminates the need to expose the Kubernetes API externally — a real concern when using hosted CI services like CircleCI, which would otherwise need network access to the cluster. As Andrey puts it, exposing the API externally is risky "unless you want someone mining bitcoin on your cluster."
He references the tooling landscape at the time: Weaveworks (the company that coined the term "GitOps" and created WeaveNet, a Kubernetes CNI driver), Flux, Argo, and Jenkins X. He notes that Flux and Argo were joining forces at the time of recording. He also mentions Jenkins X as a potential GitOps tool, since it runs CI/CD jobs natively in Kubernetes, but expresses skepticism about using Kubernetes for build workloads — Kubernetes is declarative about desired state, but "you cannot declare my build is successful because you have no idea how your build gonna go."
Editor's note: Weaveworks, the company that originated the term "GitOps," shut down in February 2024. Flux continues as a CNCF graduated project. The GitOps principles have since been formalized by the OpenGitOps project under the CNCF.
The Weaveworks definition, read straight from the sourceAndrey reads Weaveworks' concise GitOps definition from their blog and walks through its key points:
Andrey also raises a nuance about Helm: since Helm templates can produce different output depending on input variables, true GitOps implies committing not only the Helm charts but also the rendered manifests — because the generated output is what actually represents the declarative desired state.
He draws a comparison to GitHub's earlier promotion of ChatOps, noting that many of the same ideas — observable, verifiable changes driven through a central workflow — were already part of GitHub's operational philosophy, just with a different interface.
Two layers: infrastructure-as-code and in-cluster GitOpsJulien offers a more practical framing, splitting the problem into two distinct layers:
In Julien's model, a Git repository becomes the authoritative inventory of everything that should exist in the cluster. He describes the ideal: "if anything else is running here, alert me or kill it." That gives teams confidence that the observed cluster state matches the intended one, and helps prevent configuration drift — a problem the hosts discussed in their earlier infrastructure-as-code episode.
Day-2 operations: where the model gets testedWhile Julien appreciates GitOps for defining and bootstrapping cluster state, he is openly skeptical about its effectiveness for long-running operations. He distinguishes between two very different challenges: "setting up things" versus "running things for a long time — they're not the same."
Real environments drift. People intervene manually during incidents. Urgent fixes happen outside the normal workflow. The clean desired-state model becomes harder to maintain once the messiness of day-2 operations enters the picture. Julien frames this as an open question rather than a settled answer: GitOps may be excellent for establishing a clean baseline, but whether it holds up as a complete long-term operating model remains to be proven.
Who controls changes: developers, operators, or both?Andrey raises a governance concern: GitOps can look like a direct developer-to-cluster pathway. If a developer changes a YAML file, commits it, and the cluster automatically applies the change, operations staff are effectively bypassed — "there is nowhere an operation person can interfere with this."
Julien pushes back, arguing that the workflow — not the tooling — determines who has control. If changes go through pull requests with review and approval, it does not matter whether the author is a developer or an operator. Both participate in the same process. The mechanism is the same one used for application code: propose a change, review it, merge it.
Pull requests, compliance, and "push to master"The conversation takes its most opinionated turn when the topic shifts to pull requests.
Andrey is blunt: "Life is too short to do pull requests. You never get anything done. You do a pull request, you ask for review and then you hunt the person for two days." His preference is to push directly to master and build CI/CD pipelines strong enough to catch mistakes — "you build your system to defend yourself from the fools."
He does acknowledge an important exception: regulated industries where every production deployment must be peer-reviewed or approved. In those environments, formal review is not just a process preference but a compliance mechanism that can significantly reduce legal exposure when something goes wrong.
Andrey also shares a personal practice: because he frequently switches between projects and loses context, the first thing he does is document every verification step as part of the CI/CD pipeline. That way, when he returns to a project months later, the pipeline already encodes everything he would need to remember. "There is no guarantee that someone else has a better understanding of what I did."
Observability gaps in GitOps pipelinesAndrey identifies a practical developer-experience problem with GitOps: the visibility gap.
In a traditional pipeline, a developer can trace a change end-to-end — build, test, deploy — in one place. With GitOps, the CI pipeline ends when it commits changes to a repository. The actual deployment happens later, inside the cluster, through a separate reconciliation process. "My pipeline stops at the place where I do commit, push, done. Since then, pipeline doesn't have much to absorb."
To understand whether a deployment succeeded, the developer needs to inspect cluster state rather than the original pipeline. Bridging that gap requires additional tooling and represents a real paradigm shift in how teams observe deployments.
He also flags a repository-structure problem: if source code and deployment manifests live in the same repository, updating manifests can trigger the source-code pipeline again — requiring conditional logic to prevent unnecessary rebuilds.
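The conditional logic Andrey describes is commonly expressed as a path filter in the CI configuration. The hosts do not name a CI system; this sketch uses GitHub Actions syntax as one illustration:

```yaml
name: build-and-test
on:
  push:
    branches: [master]
    # Commits that only touch deployment manifests skip the app build,
    # avoiding the unnecessary rebuild loop described above.
    paths-ignore:
      - "deploy/**"
      - "**.md"
```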
Deployment ordering and full-system validationJulien closes the discussion with a practical concern: deployment order matters in real systems. A proxy may need a backend to exist first. Some components cannot be rolled out in arbitrary order without causing failures.
He also questions the validation model. In a software build pipeline, teams rebuild and test the entire application from the main branch to verify the whole system works. But with GitOps, a change to one part of the cluster may be applied incrementally without validating the full cluster state end-to-end. "I will never test the full master branch and rebuild the full cluster from it, except everything goes."
That leaves an open question the hosts do not fully resolve: how can teams preserve the elegance of declarative Git-driven deployment while managing sequencing, dependencies, and whole-system confidence?
Highlights "Unless you want someone mining bitcoin on your cluster"Andrey explains the security motivation behind the pull-based GitOps model — if you use an external CI system, you need to expose your Kubernetes API, which is not exactly ideal. His colorful warning about cryptocurrency miners makes the point memorable.
Listen to the episode for Andrey's full breakdown of why the pull-vs-push distinction is the real heart of GitOps.
"Life is too short to do pull requests."The spiciest take of the episode. Andrey argues that pull requests slow teams to a crawl — you open one, ask for review, then spend two days hunting the reviewer. His alternative: push to master and build pipelines strong enough to protect against mistakes. He does carve out an exception for regulated industries where peer review is legally required.
Listen to the episode and decide whether you agree or strongly disagree.
"GitOps is a nice way to set up your Kubernetes cluster — but is it a good tool to keep it running? I'm not sure."Julien draws a sharp line between bootstrapping a cluster and operating it long-term. Setting up things and running things for a long time are "not the same." It is a refreshingly honest admission that a clean architecture pattern does not automatically solve the messy reality of day-2 operations.
Listen to the episode for a take that many GitOps advocates skip over.
"You build your system to defend yourself from the fools."Andrey's philosophy in one sentence. Rather than relying on human review processes, invest in CI/CD pipelines and automated guardrails that prevent mistakes regardless of who pushes the change. He backs this up with a personal habit: encoding every verification step into the pipeline so future-him does not have to remember anything.
Listen to the episode for a practical argument in favor of automation over process.
"If anything else is running here — alert me or kill it."Julien describes the appeal of GitOps as an authoritative inventory of what should exist in a cluster. If the Git repository defines the desired state and the cluster enforces it, anything unauthorized can be flagged or removed. It is one of the clearest expressions of why teams are drawn to the GitOps model.
Listen to the episode for a practical view of GitOps as cluster hygiene.
The daughter interruptionMid-argument about observability gaps, Andrey's daughter walks in wanting to share something exciting. It is a charming reminder that even deep infrastructure debates happen in real life with real interruptions.
Listen to the episode for the unscripted moment — and Andrey's smooth recovery.
ResourcesGitops a new concept on devops. Whats is it and how can you use it when deploy and setup your k8s cluster.
SummaryGitOps sounds simple — put Kubernetes manifests in Git and let the cluster pull changes — but the episode quickly reveals the real debate is not about Git at all. Andrey argues the only genuinely novel thing about GitOps is the pull-based model where an in-cluster agent reconciles state, while Julien questions whether GitOps is good for day-2 operations or just for bootstrapping clusters. The spiciest moment: Andrey declares "life is too short to do pull requests" and advocates pushing straight to master with strong CI/CD guardrails instead.
Key Topics What GitOps actually is — and what it is notAndrey frames the discussion by separating what is genuinely new about GitOps from what teams have already been doing for years. Storing deployment specifications in Git, he argues, is just version control — teams have done that for a decade. The meaningful difference is the deployment model: instead of an external CI/CD server pushing changes into Kubernetes by calling the cluster API, GitOps places an agent inside the cluster that either receives a webhook or polls a Git repository, pulls in the desired state, and applies it from within.
That pull-based model is what Andrey identifies as the core innovation. It eliminates the need to expose the Kubernetes API externally — a real concern when using hosted CI services like CircleCI, which would otherwise need network access to the cluster. As Andrey puts it, exposing the API externally is risky "unless you want someone mining bitcoin on your cluster."
He references the tooling landscape at the time: Weaveworks (the company that coined the term "GitOps" and created WeaveNet, a Kubernetes CNI driver), Flux, Argo, and Jenkins X. He notes that Flux and Argo were joining forces at the time of recording. He also mentions Jenkins X as a potential GitOps tool, since it runs CI/CD jobs natively in Kubernetes, but expresses skepticism about using Kubernetes for build workloads — Kubernetes is declarative about desired state, but "you cannot declare my build is successful because you have no idea how your build gonna go."
Editor's note: Weaveworks, the company that originated the term "GitOps," shut down in February 2024. Flux continues as a CNCF graduated project. The GitOps principles have since been formalized by the OpenGitOps project under the CNCF.
The Weaveworks definition, read straight from the sourceAndrey reads Weaveworks' concise GitOps definition from their blog and walks through its key points:
Andrey also raises a nuance about Helm: since Helm templates can produce different output depending on input variables, true GitOps implies committing not only the Helm charts but also the rendered manifests — because the generated output is what actually represents the declarative desired state.
He draws a comparison to GitHub's earlier promotion of ChatOps, noting that many of the same ideas — observable, verifiable changes driven through a central workflow — were already part of GitHub's operational philosophy, just with a different interface.
Two layers: infrastructure-as-code and in-cluster GitOpsJulien offers a more practical framing, splitting the problem into two distinct layers:
In Julien's model, a Git repository becomes the authoritative inventory of everything that should exist in the cluster. He describes the ideal: "if anything else is running here, alert me or kill it." That gives teams confidence that the observed cluster state matches the intended one, and helps prevent configuration drift — a problem the hosts discussed in their earlier infrastructure-as-code episode.
Day-2 operations: where the model gets testedWhile Julien appreciates GitOps for defining and bootstrapping cluster state, he is openly skeptical about its effectiveness for long-running operations. He distinguishes between two very different challenges: "setting up things" versus "running things for a long time — they're not the same."
Real environments drift. People intervene manually during incidents. Urgent fixes happen outside the normal workflow. The clean desired-state model becomes harder to maintain once the messiness of day-2 operations enters the picture. Julien frames this as an open question rather than a settled answer: GitOps may be excellent for establishing a clean baseline, but whether it holds up as a complete long-term operating model remains to be proven.
Who controls changes: developers, operators, or both?Andrey raises a governance concern: GitOps can look like a direct developer-to-cluster pathway. If a developer changes a YAML file, commits it, and the cluster automatically applies the change, operations staff are effectively bypassed — "there is nowhere an operation person can interfere with this."
Julien pushes back, arguing that the workflow — not the tooling — determines who has control. If changes go through pull requests with review and approval, it does not matter whether the author is a developer or an operator. Both participate in the same process. The mechanism is the same one used for application code: propose a change, review it, merge it.
Pull requests, compliance, and "push to master"The conversation takes its most opinionated turn when the topic shifts to pull requests.
Andrey is blunt: "Life is too short to do pull requests. You never get anything done. You do a pull request, you ask for review and then you hunt the person for two days." His preference is to push directly to master and build CI/CD pipelines strong enough to catch mistakes — "you build your system to defend yourself from the fools."
He does acknowledge an important exception: regulated industries where every production deployment must be peer-reviewed or approved. In those environments, formal review is not just a process preference but a compliance mechanism that can significantly reduce legal exposure when something goes wrong.
Andrey also shares a personal practice: because he frequently switches between projects and loses context, the first thing he does is document every verification step as part of the CI/CD pipeline. That way, when he returns to a project months later, the pipeline already encodes everything he would need to remember. "There is no guarantee that someone else has a better understanding of what I did."
Observability gaps in GitOps pipelinesAndrey identifies a practical developer-experience problem with GitOps: the visibility gap.
In a traditional pipeline, a developer can trace a change end-to-end — build, test, deploy — in one place. With GitOps, the CI pipeline ends when it commits changes to a repository. The actual deployment happens later, inside the cluster, through a separate reconciliation process. "My pipeline stops at the place where I do commit, push, done. Since then, pipeline doesn't have much to absorb."
To understand whether a deployment succeeded, the developer needs to inspect cluster state rather than the original pipeline. Bridging that gap requires additional tooling and represents a real paradigm shift in how teams observe deployments.
He also flags a repository-structure problem: if source code and deployment manifests live in the same repository, updating manifests can trigger the source-code pipeline again — requiring conditional logic to prevent unnecessary rebuilds.
Deployment ordering and full-system validationJulien closes the discussion with a practical concern: deployment order matters in real systems. A proxy may need a backend to exist first. Some components cannot be rolled out in arbitrary order without causing failures.
He also questions the validation model. In a software build pipeline, teams rebuild and test the entire application from the main branch to verify the whole system works. But with GitOps, a change to one part of the cluster may be applied incrementally without validating the full cluster state end-to-end. "I will never test the full master branch and rebuild the full cluster from it, except everything goes."
That leaves an open question the hosts do not fully resolve: how can teams preserve the elegance of declarative Git-driven deployment while managing sequencing, dependencies, and whole-system confidence?
Highlights "Unless you want someone mining bitcoin on your cluster"Andrey explains the security motivation behind the pull-based GitOps model — if you use an external CI system, you need to expose your Kubernetes API, which is not exactly ideal. His colorful warning about cryptocurrency miners makes the point memorable.
Listen to the episode for Andrey's full breakdown of why the pull-vs-push distinction is the real heart of GitOps.
"Life is too short to do pull requests."The spiciest take of the episode. Andrey argues that pull requests slow teams to a crawl — you open one, ask for review, then spend two days hunting the reviewer. His alternative: push to master and build pipelines strong enough to protect against mistakes. He does carve out an exception for regulated industries where peer review is legally required.
Listen to the episode and decide whether you agree or strongly disagree.
"GitOps is a nice way to set up your Kubernetes cluster — but is it a good tool to keep it running? I'm not sure."Julien draws a sharp line between bootstrapping a cluster and operating it long-term. Setting up things and running things for a long time are "not the same." It is a refreshingly honest admission that a clean architecture pattern does not automatically solve the messy reality of day-2 operations.
Listen to the episode for a take that many GitOps advocates skip over.
"You build your system to defend yourself from the fools."Andrey's philosophy in one sentence. Rather than relying on human review processes, invest in CI/CD pipelines and automated guardrails that prevent mistakes regardless of who pushes the change. He backs this up with a personal habit: encoding every verification step into the pipeline so future-him does not have to remember anything.
Listen to the episode for a practical argument in favor of automation over process.
"If anything else is running here — alert me or kill it."
Julien describes the appeal of GitOps as an authoritative inventory of what should exist in a cluster. If the Git repository defines the desired state and the cluster enforces it, anything unauthorized can be flagged or removed. It is one of the clearest expressions of why teams are drawn to the GitOps model.
Listen to the episode for a practical view of GitOps as cluster hygiene.
The daughter interruption
Mid-argument about observability gaps, Andrey's daughter walks in wanting to share something exciting. It is a charming reminder that even deep infrastructure debates happen in real life with real interruptions.
Listen to the episode for the unscripted moment — and Andrey's smooth recovery.
Resources
Is infrastructure as code always the best way to go, and if not, when and where should you use it? Here we try to better understand when it is good to use and when it is not.
Summary
In this inaugural episode, Mattias, Andrey, and Julien discuss what infrastructure as code really means, why teams adopt it, and where it can go wrong. They explore the evolution from manual server management to declarative infrastructure, the differences between configuration management and infrastructure provisioning, the growing complexity of tools like Terraform and CloudFormation, and why culture, process, and operational discipline matter as much as the tooling itself.
Key Topics
What Infrastructure as Code Actually Solves
The discussion starts with Mattias describing the shift from manually editing Apache configs over SSH to defining cloud environments in code. He recalls the progression: first managing individual servers by hand, then adopting configuration management tools like Puppet, Chef, and Ansible, and finally arriving at cloud-native tools like AWS CloudFormation that can provision entire environments declaratively.
Andrey pushes the conversation toward first principles, arguing that it is important to separate the "what" from the "how." He explains that infrastructure as code depends on having APIs — software-defined interfaces that allow infrastructure to be created and managed programmatically. Without that kind of interface, teams are limited to SSH and the manual tools they had before. The rise of public cloud providers and platforms like OpenStack finally gave teams the APIs they needed to describe infrastructure declaratively in definition files.
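The chain Andrey describes, from a declarative definition file down to provider API calls, can be sketched in a few lines of Python. Everything here is an invented stand-in: `FakeCloudAPI`, the resource names, and the spec fields all substitute for a real provider interface such as the AWS API.

```python
# Illustrative sketch: a declarative definition applied through a
# hypothetical cloud API. Real tools (CloudFormation, Terraform) do
# this at far larger scale; all names here are invented.

DESIRED = {
    "vm-web-1": {"type": "vm", "size": "small"},
    "db-main":  {"type": "database", "engine": "postgres"},
}

class FakeCloudAPI:
    """Stand-in for a provider API; IaC depends on such an interface existing."""
    def __init__(self):
        self.resources = {}

    def create(self, name, spec):
        self.resources[name] = dict(spec)

def apply(api, desired):
    """Create every declared resource that does not yet exist."""
    created = []
    for name, spec in desired.items():
        if name not in api.resources:
            api.create(name, spec)
            created.append(name)
    return created

api = FakeCloudAPI()
created = apply(api, DESIRED)        # first run creates everything
created_again = apply(api, DESIRED)  # second run is a no-op
```

Running `apply` a second time creates nothing, which is the idempotence that makes declarative definitions safe to re-run.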
Configuration Management vs Infrastructure as Code
A key distinction in the episode is the difference between server configuration tools and true infrastructure as code. Andrey notes that tools like Puppet, Chef, and Ansible were originally conceived as server configuration management tools — designed to automate the provisioning and configuration of servers, not to define infrastructure itself.
He acknowledges this is a gray area, since tools like Ansible can now call AWS APIs and manage infrastructure directly. But historically, the configuration management era was about fighting configuration drift on existing servers, while the cloud era introduced the ability to declare entire environments as code. If you asked the vendors selling Chef, they would tell you Chef is "all about infrastructure as code" — but the original intent was different.
When to Automate — and When Not To
The hosts caution against automating too early. Andrey says he tends not to automate things until they genuinely need automation. If creating one cluster with a few nodes and one database is all you need, full automation may be premature. But if you know you will eventually manage hundreds or thousands, starting early makes sense.
Julien reinforces this point with a memorable gym analogy: "You go to the gym, you see Arnold Schwarzenegger lifting 200 kilos from the ground and you say, he does it, I can do it. And then you pick up the little weight and find out that if you start with 200 kilos, you're gonna break your back." His point is that infrastructure as code tools get you up and running fast — that is what they are designed for — but day two operations always come knocking. The automation itself can become a burden if you are not careful about what you automate and when.
Infrastructure as Documentation and Source of Truth
Mattias describes one of his main reasons for using infrastructure as code: knowing what is actually running. He sees the codebase as documentation and as proof of the intended state of the environment — a way to verify that what he thinks is deployed matches what is actually in the cloud.
The hosts agree with that idea, but they also point out the tension between declared state and reality. If people still make manual changes in the cloud console, the code drifts away from what is actually running. Andrey notes the problem: if undocumented manual changes are not reflected back into code, the next infrastructure deployment could recreate the original broken state — "you're back to the fire state, basically."
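The drift problem the hosts describe can be made concrete by comparing declared and actual state as two dictionaries. The resource names and attributes below are invented; real tools such as `terraform plan` perform this comparison against live provider APIs.

```python
# Toy drift check: declared state (from code) vs. actual state (from
# the cloud). Names and attributes are illustrative only.

declared = {
    "web-sg":      {"port": 443},
    "bucket-logs": {"versioning": True},
}

actual = {
    "web-sg":      {"port": 443},
    "bucket-logs": {"versioning": False},  # flipped by hand in the console
    "debug-vm":    {"size": "large"},      # created manually, never codified
}

def detect_drift(declared, actual):
    """Report resources that differ from code, exist only by hand, or are gone."""
    drifted = [n for n in declared if n in actual and actual[n] != declared[n]]
    unmanaged = [n for n in actual if n not in declared]
    missing = [n for n in declared if n not in actual]
    return {"drifted": drifted, "unmanaged": unmanaged, "missing": missing}

report = detect_drift(declared, actual)
```

Andrey's "back to the fire state" warning corresponds to the `drifted` list: the next deployment would overwrite the manual fix with whatever the code still says.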
The Terraform Complexity Problem
Julien brings up Terraform as "the elephant in the room" and argues that it has become significantly more complex over time. He says the language started out as purely descriptive, but newer features in HCL2 — such as for loops, conditionals, and sequencing logic — have pushed it closer to a general-purpose programming language.
His concern is that this makes infrastructure definitions harder to read and reason about. Instead of simply describing desired state, users now have to mentally execute the code to understand what it will produce. Andrey agrees there is a legitimate need for this evolution — once a declarative setup grows large enough, you genuinely want loops and conditionals — but acknowledges it creates a tension between readability and expressiveness.
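Julien's point about mentally executing the code can be illustrated in Python rather than HCL: the loop-and-conditional pattern below mirrors what HCL2's `for_each` and `count` make possible. The environment names and sizing rules are invented; the takeaway is that the resource list is no longer visible by reading, only by running the logic in your head.

```python
# Python analogue of loop-driven resource definitions: you cannot see
# what gets created without evaluating the logic. Names are invented.

def render(environments, enable_backups):
    resources = {}
    for env in environments:
        # Conditional sizing: more instances only in prod.
        resources[f"app-{env}"] = {"instances": 3 if env == "prod" else 1}
        if enable_backups and env == "prod":
            # Conditionally created resource, like count = var.x ? 1 : 0.
            resources[f"backup-{env}"] = {"schedule": "daily"}
    return resources

plan = render(["dev", "staging", "prod"], enable_backups=True)
```

Reading the function tells you far less than reading four literal resource blocks would, which is exactly the readability cost Julien worries about.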
Declarative vs Imperative Approaches
The episode explores the difference between declarative and imperative models. Andrey explains that shell scripts are imperative — you tell the system exactly what to do, step by step — while a declarative tool lets a team state the desired outcome and rely on the platform to converge on that state.
Kubernetes is presented as a strong example of the declarative model. You submit manifests that declare what you want, and operators work to make reality match that intent — not necessarily immediately, but as soon as all requirements are fulfilled. Andrey suggests infrastructure tooling may evolve in this direction, with systems that continuously enforce declared state rather than only applying changes on demand. He gives a security example: an intruder stops AWS CloudTrail, but a reactive system — like a Kubernetes operator — detects the deviation and turns it back on automatically.
Julian adds that this is already happening. He mentions that a Kubernetes operator exists to bridge the gap to cloud APIs, allowing teams to define infrastructure resources inside Kubernetes YAML manifests and have the operator create them in the cloud. Google Cloud's Config Connector is a concrete example of this pattern, letting teams manage GCP resources as native Kubernetes objects.
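The operator behaviour the hosts describe reduces to a reconcile loop: observe actual state, compare it with declared intent, and repair any deviation. This sketch uses an in-memory `World` and invented setting names in place of real cloud APIs; Andrey's CloudTrail example maps to re-enabling a setting an intruder switched off.

```python
# Sketch of the reconcile loop behind the Kubernetes operator pattern.
# All names are illustrative stand-ins for real infrastructure.

desired = {"cloudtrail": "enabled", "flow-logs": "enabled"}

class World:
    """Stand-in for infrastructure that can be observed and changed."""
    def __init__(self, state):
        self.state = dict(state)

    def observe(self):
        return dict(self.state)

    def set(self, key, value):
        self.state[key] = value

def reconcile(world, desired):
    """One pass of the loop: repair every observed deviation from intent."""
    fixed = []
    observed = world.observe()
    for key, want in desired.items():
        if observed.get(key) != want:
            world.set(key, want)
            fixed.append(key)
    return fixed

# An "intruder" has disabled audit logging; the loop restores it.
world = World({"cloudtrail": "disabled", "flow-logs": "enabled"})
fixed = reconcile(world, desired)
```

A real operator runs this loop continuously, which is the difference Andrey highlights between enforcement and only applying changes on demand.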
Immutable Infrastructure and Emergency Changes
Andrey strongly advocates for immutable infrastructure: baking golden images using tools like Packer, deploying them as-is, and replacing systems rather than patching them in place. In that model, people should not be logging into systems or making changes manually. If you need a change, you burn a new image and roll it out. SSH should not even be enabled in a proper cloud setup.
Mattias raises a practical challenge: in real incidents, people with admin access to the cloud console often need to click a button to resolve the problem quickly. He describes his own experience — the team started with read-only production access but had to grant write access once on-call responsibilities kicked in. Andrey agrees that teams should not be dogmatic when production is on fire: "You go and do whatever it takes to put fire down." But those emergency fixes must be reflected back into code, and the team must know exactly what was changed. Otherwise, the next deployment may recreate the original problem.
Culture and Process Matter More Than Tools
One of the clearest themes in the conversation is that infrastructure as code is not just a tooling choice. Julien argues that it does not matter what technology you use if your process and culture are not aligned with security and best practices: "You can fix the technology only so much, but it's mainly about people."
Mattias describes a setup where Jenkins applies all CloudFormation changes, and every modification to the cloud goes through pull requests, code review, and change management — the same workflow used for application code. This means infrastructure changes become auditable, reviewable, and easier to track. Andrey sees this as applying development principles to infrastructure: version history, visibility into who changed what, the ability to ask someone why they made a change, and code review before changes are applied.
Guardrails for Manual Changes
Andrey shares a practical example from a previous engagement where developers had near-admin access to the AWS console and would create EC2 instances, S3 buckets, and other resources outside of Terraform or CloudFormation. To control cost and reduce unmanaged resources, the team built a system using specific tags generated by a Terraform module.
A Lambda function ran every night, scanned for resources without the required tags, posted a Slack notification saying "I found these, gonna delete them next day," and tagged them for deletion. The following night, anything still tagged for deletion was removed. This gave developers flexibility for experimentation — they could spin up resources manually and try things out — while preventing forgotten resources from becoming permanent, invisible infrastructure. It also helped keep costs under control.
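The two-night sweep is simple enough to sketch. The tag names, and the in-memory stand-ins for the AWS resource scan and the Slack notification, are assumptions for illustration rather than the team's actual implementation.

```python
# Sketch of the nightly tag sweep: first pass warns and marks untagged
# resources, second pass deletes anything still marked. Tag names are
# invented; real code would call the AWS and Slack APIs.

REQUIRED_TAG = "managed-by"
DELETE_TAG = "marked-for-deletion"

def nightly_sweep(resources, notify):
    """One nightly run: delete previously marked resources, mark new strays."""
    deleted = []
    for name, tags in list(resources.items()):
        if tags.get(DELETE_TAG):               # marked last night: delete now
            del resources[name]
            deleted.append(name)
        elif REQUIRED_TAG not in tags:         # untagged stray: warn and mark
            notify(f"I found {name}, gonna delete it next day")
            tags[DELETE_TAG] = True
    return deleted

messages = []
resources = {
    "prod-db":    {"managed-by": "terraform"},  # managed, never touched
    "scratch-vm": {},                           # created by hand in the console
}
night1 = nightly_sweep(resources, messages.append)  # warns, deletes nothing
night2 = nightly_sweep(resources, messages.append)  # removes the marked VM
```

The one-night grace period is the key design choice: developers keep their freedom to experiment, but nothing unmanaged survives silently.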
Tooling Is Only the Start
Julien stresses that adopting infrastructure as code does not automatically make systems reliable, immutable, or resilient. In his view, it is "just the beginning of the journey." He warns against the myth that infrastructure as code equals immutable infrastructure — you can absolutely build stateful, mutable systems with code if you choose to.
He also pushes back on the assumption that automation always saves time, admitting with self-awareness: "I automated a task, it took me two days to automate it, and I saved barely 10 seconds of my life." His advice is to measure the actual benefit rather than being seduced by the marketing brochure. Data will tell you more about a tool's real value than excitement will.
Abstraction, Code Generation, and Developer Experience
The hosts discuss the challenge of making infrastructure easy for developers who just want a database and a connection string, not a deep understanding of DBA work and security configuration. Andrey argues that abstracting best practices away from developers saves enormous organizational time, since developer time is expensive and holds back feature delivery.
He describes a third approach beyond declarative and imperative: code generators. Large companies with resources sometimes build internal generators that take simplified YAML inputs and output fully declarative specs. This creates another level of abstraction on top of existing tools, allowing developers to be productive without needing to understand infrastructure details. It is controversial — in some ways it takes power away from people — but it can dramatically simplify the developer experience.
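A minimal version of such a generator might look like the sketch below: the developer supplies only a name, and organizational best practices are filled in. The output shape is invented; a real generator would emit Terraform or CloudFormation rather than a Python dict.

```python
# Toy code generator: a simplified developer-facing request expands into
# a fuller declarative spec with best practices baked in. The field names
# are illustrative assumptions.

def generate_database_spec(request):
    """Expand a simplified request into a hardened resource definition."""
    return {
        "resource": "database",
        "name": request["name"],
        "engine": request.get("engine", "postgres"),
        # Best practices the developer never has to think about:
        "encrypted": True,
        "backups": {"retention_days": 30},
        "network": {"public_access": False},
    }

spec = generate_database_spec({"name": "orders"})
```

The trade-off Andrey names is visible even here: the developer gains simplicity, but also loses the power to change anything the generator hard-codes.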
Pulumi vs Terraform and Community Support
Andrey introduces Pulumi as an interesting new branch of infrastructure tooling that lets teams describe infrastructure in general-purpose languages like TypeScript, Python, or Go instead of domain-specific languages like HCL. He notes that while it feels familiar to developers — you stay in your comfort zone — you still need to learn a new DSL embedded in that language. It is "not entirely like you just described infrastructure in the language you know."
Julien says he tried Pulumi and found it appealing for developers who want consistency across their codebase. But he remains cautious, arguing that "code is a liability" — referencing Kelsey Hightower's satirical GitHub project nocode ("write nothing, deploy nowhere, run securely") to make the point that less code means fewer problems. For beginners, Julien recommends starting with Terraform or the native tooling from cloud providers, mainly because the community is larger, tutorials are more abundant, and there are meetup groups where people can learn from each other. His advice is pragmatic: "Just make sure that you ditch Terraform the minute it gets in your way."
Start With the Problem, Not the Tool
Andrey repeatedly returns to the same question: what problem is being solved? He argues that teams should choose tools based on business needs and existing team capabilities, not because a tool is fashionable. "A lot of people and developers, they like shiny tools — and there's nothing wrong about that — but you always have to ask, what is the problem we are solving?"
Andrey's framing connects tool selection to team dynamics: if your team already has knowledge of a particular tool, relearning a new one just because it is trendy does not make sense. What you need is not a fancy tool but to deliver business value with the capabilities you have.
Migration, Legacy, and Incremental Adoption
The hosts acknowledge that many teams are not starting from a clean slate. Andrey points out that legacy infrastructure exists for a reason: it helped the business survive and grow. As Mattias puts it bluntly: "Legacy pays the bills."
For organizations with years of manually built systems and hybrid environments, Andrey suggests doing value stream mapping to identify the biggest pain point and tackling that first. A greenfield project can serve as a success story to demonstrate the approach before trying to transform everything. He emphasizes that coming into an organization with a shiny idea and telling people "whatever you did before was crap" is a sure way to lose allies. Technology boils down to working with people — the tools are fine, but they do not replace the people running them.
The Templating Dilemma
Mattias raises a specific frustration with infrastructure as code: templating. He likes looking at his Git repository and seeing exactly what is running, but heavy use of variables and templates means he sees placeholder names instead of actual values. This tension — between reusable, DRY templates and readable, concrete definitions — is a real challenge that the hosts acknowledge without a clean resolution.
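Mattias's complaint can be shown with Python's standard `string.Template`: the version stored in Git is reusable but opaque, while only the rendered form shows the concrete values. The variable names and values below are illustrative.

```python
# The templating dilemma in miniature: what Git shows vs. what runs.
from string import Template

template = Template("instance $env-web size=$size replicas=$replicas")
values = {"env": "prod", "size": "m5.large", "replicas": "3"}

stored_in_git = template.template              # placeholders: DRY but opaque
actually_running = template.substitute(values)  # concrete but not reusable
```

Neither form alone answers Mattias's question "what is actually running?", which is why the hosts leave the tension unresolved.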
Resilience and Recovery
Near the end of the episode, Andrey gives a concrete example of losing a Kubernetes cluster in production. Because the environment had been defined as code, the team was able to recreate it and recover in about one to two hours. Some things that were not properly documented slowed them down; with complete documentation the recovery could have been as fast as 15 to 20 minutes — mostly just waiting for AWS to provision the resources after the API calls.
Julien adds context to this: he argues that even with infrastructure as code, recreating a Kubernetes cluster and failing over traffic while maintaining service is genuinely hard. The concept is sound, but having the safety net to actually do it takes time, practice, and a lot of work. His advice for building confidence is to adopt the mentality of immutable infrastructure and get into the habit of regularly recreating things and practicing failovers.
Final Advice
Andrey recommends education first. He specifically mentions the book Infrastructure as Code by Kief Morris (published by O'Reilly, now in its third edition) as a strong foundation. His broader advice: understand the domain, define the problem clearly, ask yourself what outcome you want to deliver for the business, and let the answers to those questions guide your tool decisions.
Julien's closing thought is that in a large organization, a dedicated infrastructure team using infrastructure as code can manage everything — on-prem or cloud — with a single workflow. That team can abstract complexity so developers do not need to learn Terraform, CloudFormation, or any other tool. The specialization pays off by reducing onboarding friction and letting each team focus on what they do best.
Highlights