Introduction to Cloud Engineering
What cloud engineers actually do
Cloud engineering isn't one job. It's a set of overlapping roles that share a common workflow. Here are the main ones you'll see in job listings:
Cloud Engineer. Provisions and manages cloud infrastructure. Writes Terraform, configures networking, sets up databases, manages IAM policies. The broadest role and the one this track most closely follows.
Site Reliability Engineer (SRE). Focuses on keeping systems running. Defines SLOs, builds monitoring and alerting, responds to incidents, reduces toil. Heavier on operations than building. If the system is down, the SRE is the one getting paged.
Platform Engineer. Builds the internal tools and infrastructure that other engineers use to deploy their applications. CI/CD pipelines, developer environments, service templates, deployment automation. Makes infrastructure self-service so application teams don't need to file tickets for every change.
DevOps Engineer. Bridges development and operations. Automates build, test, and deployment pipelines. In practice, the role varies enormously — at some companies it means cloud engineering, at others it means CI/CD, at others it means "the person who fixes the Jenkins server."
These roles overlap significantly. A cloud engineer at a small company does all of it. At a larger company, the work is more specialized. Either way, the underlying workflow is the same.
The professional loop
Every cloud engineering project, whether it's a simple static site deployment or a multi-region production system, moves through the same cycle:
1. Requirements and architecture design. What infrastructure is needed? What are the availability, latency, compliance, and cost constraints? A surprising number of infrastructure problems trace back to nobody asking these questions clearly before provisioning started.
2. Infrastructure authoring. Write infrastructure as code. Terraform configurations, networking layouts, IAM policies, service definitions. This is where decisions become concrete — every resource block is a commitment.
3. Plan and review. Run terraform plan and read the diff. What will be created, changed, or destroyed? This step exists separately because the gap between "what I wrote" and "what will happen" is where most cloud engineering mistakes live. A clean plan diff is not the same as a safe change.
4. Provision and configure. Apply the changes. Infrastructure now exists in the real world. Unlike most code changes, which you can revert with another commit, many infrastructure changes are irreversible: you can't un-delete a production database.
5. Application deployment. Deploy applications onto the infrastructure. Containers, CI/CD pipelines, rollout strategies, health checks. The infrastructure exists; now something needs to run on it.
6. Observability and monitoring. Instrument the system. Metrics, logs, traces, alerts. Without observability, you're operating blind — you won't know something is broken until a user tells you.
7. Operations and incident response. Respond when things break. Triage alerts, diagnose failures, mitigate impact, write postmortems. This is where practitioners spend most of their time — not building new infrastructure, but keeping existing infrastructure running.
8. Cost and capacity management. Track what you're spending and why. Every running resource is a line item. Cloud bills are notoriously surprising — the architecture that seemed affordable at design time generates unexpected charges from data transfer, log ingestion, NAT gateways, and idle resources nobody remembered to shut down.
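Part of step 3 can be automated. Here is a minimal sketch, assuming the plan was saved with terraform plan -out=tfplan and exported with terraform show -json tfplan, which produces a resource_changes list where each entry carries an address and a change.actions list. The helper name risky_changes and the sample data are hypothetical. It flags deletions and replacements for a human to read; it does not prove a change is safe.

```python
import json


def risky_changes(plan: dict) -> list[tuple[str, str]]:
    """Return (address, kind) for each resource the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions and "create" in actions:
            # Terraform reports a replacement as a delete/create pair.
            flagged.append((rc["address"], "replace"))
        elif "delete" in actions:
            flagged.append((rc["address"], "delete"))
    return flagged


# Hypothetical, hand-written sample shaped like `terraform show -json` output.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "aws_instance.web",     "change": {"actions": ["update"]}},
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_s3_bucket.logs",   "change": {"actions": ["delete"]}},
    {"address": "aws_vpc.main",         "change": {"actions": ["no-op"]}}
  ]
}
""")

for address, kind in risky_changes(SAMPLE_PLAN):
    print(f"{kind:8} {address}")
```

An in-place update is deliberately not flagged here. Whether an update is safe depends on the resource, which is why reading the diff yourself still matters.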
You'll run this loop in every project in this track. What changes is the complexity: early projects give you a clean architecture and a specified set of services. Later projects give you inherited infrastructure, incomplete documentation, and a client who needs it fixed by Friday.
The building-to-maintaining inflection
The track has a deliberate phase transition. The first half (projects 1-8) is predominantly greenfield — you build new infrastructure from scratch, in clean environments. The second half (projects 9 onward) is predominantly operational — you inherit infrastructure someone else built, diagnose what's drifted, respond to incidents, and maintain systems under real constraints.
This mirrors the profession. Every cloud engineer starts by building something. But the majority of the job is maintaining what already exists. The infrastructure you build in the first half of the track becomes the kind of infrastructure you inherit in the second half: partially documented, slightly drifted, and carrying decisions that made sense at the time but don't anymore.
What you'll work on
Each project is built for a client with a specific infrastructure problem. You'll direct AI to build or fix the infrastructure, interact with the client to clarify requirements, verify the output, and deliver something that works. Here's a sample of what that looks like across the track:
- A cloud migration for a small business outgrowing shared hosting
- A CI/CD pipeline that deploys containers with automated rollback
- An observability stack with dashboards, alerts, and SLO tracking
- Inherited infrastructure with drift, undocumented resources, and no IaC
- A cascading incident requiring triage, mitigation, and postmortem
- A multi-environment setup with IaC governance and cost controls
The projects get harder in specific ways. The architecture gets messier. The documentation disappears. The client stops telling you exactly what they need. You go from clean greenfield builds to inherited systems with drift, manual changes, and resources nobody remembers creating. Throughout, AI is your primary tool: capable and fast, but prone to specific mistakes that you'll learn to catch. These include security groups that are too permissive, cost estimates that omit variable charges, state operations that could destroy resources, and plans that look clean but hide a change that forces replacement of a resource.
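To make the first of those mistakes concrete, here is a rough sketch of the kind of check you might run on generated security group rules. The rule dictionaries are a simplified, hypothetical shape loosely modeled on an AWS security group ingress rule (cidr_blocks, ipv6_cidr_blocks, protocol, from_port, to_port). The watch list of ports is an assumption for illustration, not an official list.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}
# Hypothetical watch list: SSH, MySQL, RDP, PostgreSQL.
SENSITIVE_PORTS = (22, 3306, 3389, 5432)


def exposed_ports(rule: dict) -> list[int]:
    """Return the watched ports this ingress rule opens to the whole internet."""
    cidrs = set(rule.get("cidr_blocks", [])) | set(rule.get("ipv6_cidr_blocks", []))
    if not cidrs & OPEN_CIDRS:
        return []
    if rule.get("protocol") == "-1":
        # Protocol "-1" means all protocols and all ports, whatever the port fields say.
        return list(SENSITIVE_PORTS)
    # Simplification: TCP and UDP are not distinguished here.
    return [p for p in SENSITIVE_PORTS if rule["from_port"] <= p <= rule["to_port"]]


# Hypothetical rules of the kind an AI-generated configuration might contain.
SAMPLE_RULES = [
    {"protocol": "tcp", "from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]},
    {"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]},
    {"protocol": "tcp", "from_port": 5432, "to_port": 5432, "cidr_blocks": ["10.0.0.0/16"]},
    {"protocol": "-1", "from_port": 0, "to_port": 0, "ipv6_cidr_blocks": ["::/0"]},
]

for index, rule in enumerate(SAMPLE_RULES):
    ports = exposed_ports(rule)
    if ports:
        print(f"rule {index}: open to the internet on ports {ports}")
```

Public HTTPS on port 443 passes, since serving the internet is its job. The check only complains when administrative or database ports are reachable from anywhere.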
Core tools
These are the tools cloud engineers use daily. You'll set up the core ones in the track setup; the rest are introduced as projects need them.
Terminal. Your command line. Everything runs through it: Terraform, AWS CLI, Docker, Claude Code itself.
Claude Code. Your AI coding agent. You'll direct it to write Terraform configurations, debug failed deployments, analyze infrastructure, and respond to incidents. It's strong at generating infrastructure code, but it makes specific, predictable mistakes with security, cost estimation, and state management that you'll learn to catch.
Git and GitHub. Version control. Every project lives in a repository. Every infrastructure change is tracked.
Terraform. Infrastructure as code. You describe what infrastructure should exist in declarative configuration files, and Terraform creates, modifies, or destroys resources to match. One of the most widely used IaC tools in the industry. You'll use it from project 1.
AWS CLI. Command-line interface for Amazon Web Services. Query resources, check configurations, debug networking, manage access. The track uses AWS as the primary cloud provider. The concepts transfer to other providers, but the specific commands are AWS-only.
Docker. Packages applications into containers. The standard unit of deployment in modern cloud infrastructure. You'll build container images, run them locally, and deploy them to cloud container services.
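The declarative model in the Terraform entry above is easier to see in miniature. This is a toy sketch, not how Terraform is implemented: real Terraform works from a state file, provider schemas, and a dependency graph. The function name diff_state and the resource names are hypothetical. It only shows the core idea of comparing what should exist against what does exist.

```python
def diff_state(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired and actual resources and list what must change to make them match."""
    create = sorted(desired.keys() - actual.keys())
    destroy = sorted(actual.keys() - desired.keys())
    update = sorted(
        name for name in desired.keys() & actual.keys() if desired[name] != actual[name]
    )
    return {"create": create, "update": update, "destroy": destroy}


# Hypothetical configuration (what you wrote) and real-world state (what exists).
desired = {
    "web": {"instance_type": "t3.small"},
    "db": {"instance_class": "db.t3.micro"},
}
actual = {
    "web": {"instance_type": "t3.micro"},  # exists, but differs from the configuration
    "cache": {"node_type": "cache.t3.micro"},  # exists, but is no longer described
}

print(diff_state(desired, actual))
```

Notice that removing a resource from the configuration shows up as a destroy. That is the declarative bargain: whatever you stop describing, the tool will offer to delete.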
You'll install additional tools as the track progresses: monitoring agents, CI/CD platforms, Kubernetes tooling, and others. Each project tells you what's needed.