Introduction
Cluster
- 3 processes need to run on every node
- Container runtime
- kubelet
- Interfaces with both container runtime and node
- Responsible for taking a container config and creating a container using the container runtime
- kube-proxy
- Responsible for forwarding requests from services to pods
- Intelligently routes requests to pods and tries to reduce network overhead
- Master Nodes
- 4 processes need to run on every master node
- API Server
- Takes care of authentication
- Scheduler
- Knows which node to schedule the pod to based on the incoming request
- The scheduler only decides which node the pod runs on; it's the kubelet that receives the request from the scheduler and creates the container
- Controller Manager
- Detects state changes like crashing pods and tries to recover the cluster state by making a request to the scheduler
- ETCD
- KV store for storing the cluster state
- The brain of the cluster
Cluster Architecture
Nodes
Controllers
Leases
Container Runtime Interface
Garbage Collection
ETCD
CoreDNS
Workloads
Containers
- Containers in the same pod are colocated and co-scheduled to run on the same node
- Container Runtime is responsible for managing the execution and lifecycle of containers within the k8s environment
- Supported runtimes: containerd, CRI-O and other implementations of Container Runtime Interface
Images
Runtime Class
- RuntimeClass is used to select the container runtime configuration that is used to run a pod's containers
- Different pods can use different runtime classes. For example, if part of your workload deserves a high level of information security assurance, you might choose to schedule those Pods so that they run in a container runtime that uses hardware virtualization. You'd then benefit from the extra isolation of the alternative runtime, at the expense of some additional overhead.
- RuntimeClass is also used to run different pods with the same runtime but different settings
- Setup
- Configure CRI implementation on the nodes
- Create corresponding RuntimeClass resources
- #to-read Configure CRI implementation on a node
- RuntimeClass assumes all nodes in the cluster support it by default.
- Usage
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  runtimeClassName: myclass
  # ...
- Scheduling
- By specifying the scheduling field for a RuntimeClass, you can set constraints to ensure that Pods running with this RuntimeClass are scheduled to nodes that support it. If scheduling is not set, this RuntimeClass is assumed to be supported by all nodes.
- The constraints are set using the label selector scheduling.nodeSelector
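- A minimal sketch of a RuntimeClass with a scheduling constraint; the handler name and node label are assumptions for illustration:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: myclass
handler: myconfiguration        # assumed CRI handler configured on the nodes
scheduling:
  nodeSelector:
    runtimeclass-support: "true"  # hypothetical node label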
Container Lifecycle
Container Lifecycle Hooks
Ephemeral Containers
- A special type of container that runs temporarily in an existing pod to be used for troubleshooting
kubectl debug pod-name --image alpine --target container-name
- Containers created this way keep running until they exit or the pod is deleted
- This creates a new container within the same pod
- This new container uses the same container/linux namespace as the original container that is being debugged
- Ephemeral containers lack guarantees for resources and execution
- They will not be automatically restarted
- Like regular containers, you may not change or remove an ephemeral container after you have added it to a Pod
- Ephemeral containers are useful for interactive troubleshooting when kubectl exec is insufficient because a container has crashed or a container image doesn't include debugging utilities.
- Process Namespace Sharing: https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/ #to-read
Sidecar Containers
Pods
- Smallest unit of abstraction
- Abstraction over a container
- Usually 1 application per pod
- Each pod gets its own IP address
- Pod can communicate using IP address
- Pods are ephemeral, meaning they can die easily
- New pods get a new IP address
Pod Lifecycle
Init Containers
Multi Container Pods
Sidecar Injection
Disruption
Pod Quality of Service Classes
Pod Restart Policy
Pod Priorities
Disruptions
- Involuntary Disruptions
- Hardware failure in the node
- VM is deleted by mistake
- Cloud Provider / Hypervisor failure
- A kernel panic
- Node disappears from the cluster due to cluster network partition
- Eviction of pods because the node ran out of resources
- Dealing with Disruptions
- Ensure pods have resource requests and limits set
- Replicate the app for HA
- Spread apps across zones
- Pod Disruption Budgets
- A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions
- Cluster managers and hosting providers should use tools which respect PodDisruptionBudgets by calling the Eviction API instead of directly deleting pods or deployments.
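- A minimal PDB sketch, assuming an nginx app label:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx   # assumed label of the replicated application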
Deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
- restartPolicy is Always by default
- The pod-template-hash label is added by default to the ReplicaSet by the deployment
Updating a Deployment
- A deployment's rollout is triggered if and only if the pod template spec is changed
- Scaling the deployment does not trigger a new rollout
Scalability
Deployment Types
- When a deployment is created the following happens
- A rollout is created by the deployment
- A replica set is created by the deployment
- The pods are created by the replica set
- Recreate
- Delete old pods
- Create new ones
- Results in downtime
- Rolling
- Delete one old pod and add a new one and so on
- Default
- maxUnavailable and maxSurge control how many replicas are rolled over at a point in time (see the sketch below)
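- A sketch of the corresponding strategy stanza in a Deployment spec (the 25% values are the defaults):
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%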
- Rollover
- The act of updating a deployment while an existing rollout is still in progress; a new rollout supersedes the old one
- Deployment strategy can be changed using .spec.strategy.type
- A new revision is created whenever a deployment is rolled out
kubectl rollout history deployment/name
- kubectl rollout undo rolls back
- kubectl rollout status shows the rollout status
- Rollback to a specific version
kubectl rollout undo deployment/nginx-deployment --to-revision=2
- Change cause of a deployment is set in the annotation kubernetes.io/change-cause
- A custom change cause can be set using
kubectl annotate deployment/nginx-deployment kubernetes.io/change-cause="image updated to 1.16.1"
- #to-read Proportional Scaling
- Autoscaling
- Use HPA
- If HPA is used, then don't set .spec.replicas
kubectl autoscale deployment/nginx-deployment --min=10 --max=15 --cpu-percent=80
- Pausing & Resuming a rollout
kubectl rollout pause deployment/name & kubectl rollout resume deployment/name
- Watch the status of a rollout using kubectl get rs -w
- A paused deployment cannot be rolled back until it is resumed
- Deployment Progress Deadline
- One way you can detect this condition is to specify a deadline parameter in your Deployment spec: [.spec.progressDeadlineSeconds](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#progress-deadline-seconds) denotes the number of seconds the Deployment controller waits before indicating (in the Deployment status) that the Deployment progress has stalled.
- Pausing a deployment will not result in a progress deadline check
- Deployment Revision History Limit
- You can set the .spec.revisionHistoryLimit field in a Deployment to specify how many old ReplicaSets for this Deployment you want to retain. The rest will be garbage-collected in the background. By default, it is 10.
ReplicaSets
- Maintain a stable set of replica pods running at any given time
- A RS is linked to its pods via the pods' metadata.ownerReferences field, which specifies what resource the current object is owned by
- It is through this link that the RS knows the state of the pods it's managing
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
  labels:
    app: guestbook
    tier: frontend
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: us-docker.pkg.dev/google-samples/containers/gke/gb-frontend:v5
- Any naked pods matching a RS's label selector will also be acquired by the RS if their owner reference is empty
- In the ReplicaSet, .spec.template.metadata.labels must match .spec.selector, or it will be rejected by the API.
- What happens if a RS is updated? Does it create a new rollout? #question
- Isolating pods from RS
- You can remove Pods from a ReplicaSet by changing their labels. This technique may be used to remove Pods from service for debugging, data recovery, etc. Pods that are removed in this way will be replaced automatically (assuming that the number of replicas is not also changed).
- Scaling Down a RS
- When scaling down, the ReplicaSet controller chooses which pods to delete by sorting the available pods to prioritize scaling down pods based on the following general algorithm:
- Pending (and unschedulable) pods are scaled down first
- If the controller.kubernetes.io/pod-deletion-cost annotation is set, then the pod with the lower value comes first.
- Pods on nodes with more replicas come before pods on nodes with fewer replicas.
- If the pods' creation times differ, the pod that was created more recently comes before the older pod (the creation times are bucketed on an integer log scale when the LogarithmicScaleDown feature gate is enabled)
If all of the above match, then selection is random.
- Pod Deletion Cost
- Using the [controller.kubernetes.io/pod-deletion-cost](https://kubernetes.io/docs/reference/labels-annotations-taints/#pod-deletion-cost) annotation, users can set a preference regarding which pods to remove first when downscaling a ReplicaSet.
- Pods with lower deletion cost are deleted first
- Default is 0
- This is honored on a best-effort basis, so it does not offer any guarantees on pod deletion order.
- HPA can also target a RS
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-scaler
spec:
  scaleTargetRef:
    kind: ReplicaSet
    name: frontend
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
StatefulSets
DaemonSets
- Runs a pod on each node in the cluster
- Can be used for collecting logs from all nodes
- Automatically calculates the number of replicas of the pod based on the number of nodes in the cluster
- Adds only one pod per node
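- A minimal DaemonSet sketch for a node-level log agent (name and image are assumptions):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
spec:
  selector:
    matchLabels:
      name: log-agent
  template:
    metadata:
      labels:
        name: log-agent
    spec:
      containers:
      - name: log-agent
        image: fluent/fluentd:v1.16   # assumed log collector image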
Jobs
Cron Jobs
Replication Controller
Networking
Network Types
Pod Network
Cluster network
Node Network
Dual Stack
Gateway API
Service
- Has a permanent IP address
- Lifecycle of a pod and service are not connected; even if the pod fails, the service will still be up
- Load balances the requests and distributes it to pods below it
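- A minimal Service sketch (label and ports are assumptions):
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app        # assumed pod label
  ports:
  - port: 80           # port the service listens on
    targetPort: 8080   # assumed container port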
Service Cluster IP Allocation
Ingress
SSL Termination
Ingress Rules
Ingress Controller
Gateway API
Endpoint Slices
Network Policies
- By default all communication is allowed within all pods
- NP restricts communications
- It is enforced by the CNI plugin
- Not all plugins support NP
- Define which app the NP applies to by using selector labels
- If empty the policy will be applied to all pods in the namespace
- Ingress rules are for incoming traffic and egress rules for outgoing
- policyTypes captures whether the policy covers ingress, egress, or both
- podSelector selects the pods the policy applies to
- namespaceSelector is used to filter namespaces
- How does the manifest look for Allow All / Deny All? #question
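- Partial answer: a default deny-all policy looks roughly like this; an empty podSelector selects every pod in the namespace, and listing both policy types with no rules denies all traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}   # empty selector: applies to all pods in the namespace
  policyTypes:
  - Ingress
  - Egress
  # no ingress/egress rules listed, so nothing is allowed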
DNS
Routing
Topology Aware Routing
Storage
Volumes
- Data stored in containers is gone when a pod restarts
- Volumes are used to store persistent data
- Volumes attaches a physical storage to the pod
- Can be on local machine
- Can be on cloud storage
- The k8s admin is responsible for backing up and restoring the data; k8s does not do this
Persistent Volumes
Persistent Volume Claims
Projected Volumes
Ephemeral Volumes
Storage Classes
Volume Attribute Classes
Volume Provisioning
Static Provisioning
Dynamic Provisioning
Volume Snapshots
Volume Snapshot Classes
Volume Health Monitoring
Configuration
Config Maps
- Contains config data to be used by the pod
- Config Maps are connected to the pod
- Will updating the config map reflect immediately in the App? #question
Secrets
- Only base64 encoded, not encrypted
- Mount to pods like config maps
Resource Requests & Limits
- It's good practice to set resource requests and limits for all resources
- Requests are the amount of resources a pod needs
- Limits are the max amount of resources the pod is allowed to use before it is killed
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{.spec.containers[*].resources}{"\n"}{end}'
to get the list of pods and their resources
- If pods need to be evicted, those without resource requests are evicted first
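- A sketch of requests and limits on a container (image and values are assumptions):
spec:
  containers:
  - name: app
    image: nginx          # assumed image
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"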
Autoscaling
Cluster Autoscaling
Pod Autoscaling
HPA
VPA
Pod Disruption Budgets
Priority Classes
Runtime Classes
Admission Controllers
Security
Authorization Modes
RBAC
Roles
- Roles define namespaced permissions
- Roles only define the permissions. They are not associated with users
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: namespace_name
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "create", "list"]
  resourceNames: ["myApp"]
Cluster Role
- Defines resources and permissions at a cluster level and not limited to a namespace
Users
- K8S does not manage users natively
- Allows different external auth strategies. Can be
- External Tokens
- Certificates
- 3rd Party IdP
RoleBinding
- Binds a role to a user or a user group
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-role-binding
subjects:
- kind: User # or Group / ServiceAccount
  name: username
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
- Checking API Access
kubectl auth can-i create deployments --namespace dev
Service Accounts
- Represents an application
- Application running both internally and externally need access to resources inside a cluster
- A service account can be linked to a role/cluster role using a binding
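- A sketch linking a ServiceAccount to the developer role from above via a RoleBinding (the account name is an assumption):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-sa
  namespace: namespace_name
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-sa-binding
  namespace: namespace_name
subjects:
- kind: ServiceAccount
  name: my-app-sa
  namespace: namespace_name
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io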
Other Authorization Modes
- ABAC, Node, Webhook
- /etc/kubernetes/manifests/kube-apiserver.yaml contains information on which authorization modes are enabled via the --authorization-mode flag
Cluster Hardening
Certificates
- When a cluster is bootstrapped with kubeadm init, the certs for k8s components are generated in /etc/kubernetes/pki
- Client certificates can be found in /etc/kubernetes/kubelet.conf, which allows nodes to talk to the master
- kubeadm generates a CA for the k8s cluster and the etcd server so that it can sign the certificates
- All the nodes get a copy of the CA that generated the certificates
Certificates API
- Process of signing a client certificate
- Create Key Pair
- Generate Certificate Signing Request
- Send CSR using K8S Certificates API
- K8S signs the certificate
- A K8S admin approves/denies the request
- After approval, a signed certificate is provided by k8s CA which is handed over to the client
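- A sketch of the CSR object sent to the Certificates API; the request field carries a base64-encoded PKCS#10 CSR (elided here):
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: developer-csr
spec:
  request: <base64-encoded CSR>
  signerName: kubernetes.io/kube-apiserver-client
  usages:
  - client auth
- An admin then approves it with kubectl certificate approve developer-csr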
Scheduling
Node Name & Selector
- Pods are automatically scheduled on the worker nodes; the scheduler intelligently decides where to place the pods
- In some cases, we might want to choose the node ourselves
- Done using the nodeName attribute in the spec
- If the nodes are dynamic, we can use nodeSelector
  - Specify nodes using labels
  - First see the labels on a node with kubectl get node --show-labels
  - kubectl label node node-name key=value to label a node
Node Affinity
- More expressive than node name and selector
- Match rules with logical operator
- Done using the affinity.nodeAffinity attribute
- Can have a bunch of rules with weights associated
- Has 2 types of preferences
  - Soft: preferredDuringSchedulingIgnoredDuringExecution
  - Hard: requiredDuringSchedulingIgnoredDuringExecution
- Supported Operators: In, Exists, Gt, Lt, NotIn, DoesNotExist
- Node anti-affinity is expressed with the negative operators (NotIn, DoesNotExist) in nodeAffinity rules; see the sketch below
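- A sketch of a hard node affinity rule (the node label is an assumption):
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype      # assumed node label
            operator: In
            values:
            - ssd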
Taints & Tolerations
- Affinity lets us tell which nodes a pod should be assigned to; taints let the node tell which pods it should accept
- Master nodes have taints on them that repel regular pods from being scheduled on them
  - node-role.kubernetes.io/master:NoSchedule is the exact taint
  - kubeadm sets these taints during installation
- Tolerations are used to work around a taint
  - i.e. to force a pod to be scheduled on a node whose taint is satisfied by the pod
- If we want a pod to run only on the master node (see the sketch below)
  - Add a toleration to the master node taint
  - Add a nodeName selector and set it to the master
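- A sketch, assuming a hypothetical dedicated=gpu taint: taint the node with kubectl taint nodes node1 dedicated=gpu:NoSchedule, then let a pod tolerate it:
spec:
  tolerations:
  - key: "dedicated"       # assumed taint key
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"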
Inter Pod Affinity
- Configure pods to run only on nodes that have another pod running, like collecting logs for a particular pod
- Configured using podAffinity
- Inter pod anti-affinity is the opposite of that
- By setting the anti-affinity rule to the same label as the pod being scheduled, you prevent more than 1 replica of a pod from running on the same node (see the sketch below)
- What is topologyKey? #question
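- Partial answer: topologyKey names a node label whose values define the failure domains the rule applies over; kubernetes.io/hostname means per node. A sketch of the one-replica-per-node pattern (app label assumed):
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app   # assumed: same label as the pod itself
        topologyKey: kubernetes.io/hostname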
Node Selector
Node Affinity
Node Anti Affinity
Pod Affinity
Taints
Tolerations
Assigning Pods to Nodes
Pod Overhead
- When a pod is run on a node, the pod itself takes an amount of system resources. These are additional to the resources needed to run the containers inside the pod.
- Specified in the runtime class configuration of a pod, this allows the k8s to take into account the overhead resources in addition to the actual pod resources.
- Overhead resources are the resources needed to run the actual pod and not the application inside the pod
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-fc
handler: kata-fc
overhead:
  podFixed:
    memory: "120Mi"
    cpu: "250m"
- Pod overhead is considered during the admission time of a pod by the admission controller.
- During admission, the pod spec is mutated to include the overhead of the runtime class
- If the podspec already has the overhead field, the pod will be rejected. #to-try
- Resource Quota section of https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/ #to-read
Pod Scheduling Readiness
Objects
Labels & Selectors
Namespaces
Annotations
Field Selectors
Finalizers
Owners & Dependents
Administration
Building a Cluster from scratch
- One master and two worker nodes
- Minimum server requirement is 2 GB RAM and 2 CPUs per node
- Master nodes usually require less resources than worker nodes
- Basics of TLS Certificates
- A way to establish trust and secure communication
- Symmetric Encryption
- Use a random string to encrypt a piece of message
- Random string is called Encryption Key
- Same key is used to decrypt the data
- Asymmetric Encryption
- Separate encryption and decryption keys also called Key Pair
- Data encrypted with one key can only be decrypted with the other key of the pair
- In this case, the server first generates a key pair and sends the public key to the client
- The client uses this key to encrypt information which can only be decrypted using the server's private key
- Certificate Authority
- Issues digital certificates that certifies the ownership
- These digital certs contain the public key of the website and its subdomains and are signed by the CA
- Client Certificates
- The server can also request a certificate from the client to verify the client's identity
- Trusted vs UnTrusted CA
- Browsers ship with the public keys of the official list of trusted CAs
- Cluster Installation
- Steps
- Deploy a container runtime on every node
- Install kubelet on every node that runs as a normal linux process
- Deploy the control plane components (API server, scheduler, controller manager, etcd, etc.) as pods on the master node
- Deploy the kube-proxy pod on all nodes
- Static Pods
- Just like normal pods but are directly scheduled by kubelet without need of k8s master components. i.e. without control plane
- Kubelet watches a specific location on the node it's running on: /etc/kubernetes/manifests
- Any pod manifest present in this location is scheduled as a static pod
- Kubelet watches static pods and restarts them if they fail; the controller manager does not do this
- Static pods are suffixed with the name of the node they are running on
- Certificates
- Controls how one k8s component/node can communicate with the other in a secure manner
- K8S components use mTLS
- Communications
- Almost every component talks to the API Server, which means all components should be able to supply a certificate to the API Server in order to authenticate themselves
- Generate self signed CA Certificate for K8S (Cluster root CA)
- This self signed CA Certificate is used to sign all client and server certs used in the cluster
- Certs are stored in /etc/kubernetes/pki
- API Server will have a server certificate; scheduler and controller manager will have client certificates
- API Server also talks to etcd & kubelet, so the API Server will have its own client cert while etcd and kubelet will have server certs
- Kubelet also talks back to API server
- Admin Users also talk to API Servers and hence admins need their own client certificate
- This cert should also be signed by self signed CA
- List of certs to create
- Generate self signed CA Certificate - Cluster root CA
- Sign all client and server certificates
- Server cert for API Server
- Client cert for scheduler and controller manager
- Server cert for etcd and kubelet
- Client cert for API server to talk to kubelets and etcd
- Client cert for kubelet to authenticate to API Server
- Client cert for k8s admins
- Kubeadm
- Bootstraps all the above configs/certs
- Maintained by K8S
- It only bootstraps. Does not provision the cluster
- Container Runtime
- Not a k8s component
- Used to run containers
- Every node needs to have a container runtime
- Master node needs this to run control plane components as containers
- Worker nodes needs this to run application workloads
- CRI is the interface that can be implemented by any container runtime so that k8s can use it
- containerd and CRI-O are popular container runtimes
- AWS/GCP/Azure use containerd as the container runtime
- Installing containerd
- Installing kubeadm, kubelet and kubectl
- kubeadm
- Run kubeadm init only on the master node; it initializes the control plane, i.e.
  - Generates the /etc/kubernetes folder
  - Generates a self signed CA
  - Generates static pod manifests into /etc/kubernetes/manifests (to be detected by kubelet to start static pods)
- Make all necessary configurations
sudo apt-get install -y kubelet kubeadm kubectl
Upgrading a cluster
- https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-upgrading-your-clusters-with-zero-downtime
- 2 Steps
- Upgrading the master nodes
- During the upgrade, the control plane components will not be accessible, but there won't be application downtime; the worker nodes will still be functioning
- But if there is an issue on the worker nodes, it won't be fixed as no master nodes are running
- Having 2 or more master nodes helps with the above problem
- kube-apiserver, controller-manager, kube-scheduler, kubelet, kube-proxy and kubectl needs to be upgraded
- In addition to above, kubeadm will take care of upgrading etcd and coredns as well
- The network plugin needs to be upgraded separately
- Version restrictions during upgrade:
- Not all components need to be on the same version, but there are some rules
- kube-apiserver should be the latest amongst all versions
- controller-manager and kube-scheduler can be 1 version behind
- kube-proxy and kubelet can be 2 versions behind
- kubectl can be one minor version above or below the API server
- Upgrade kubeadm and then run kubeadm upgrade apply version
- kubelet and kubectl needs to be upgraded separately as it was installed separately
- Since kubelet is responsible for running the pods on a node, including the static pods for master components, the node should first be drained before upgrading kubelet
kubectl drain node-name
- Will evict all pods safely and the node will be marked unschedulable
- Draining will cordon the node as well
- After upgrading kubelet and kubectl, make the master node schedulable again using kubectl uncordon master
- Upgrading the worker nodes
- Upgrade kubeadm
- Drain the node
- This means all pods on the node will be scheduled on another node
- Node gets marked as unschedulable
- Uncordon the worker node to make it schedulable again
- If there are 2 nodes and all pods have at least 2 replicas, there will be no downtime during this upgrade
- Draining a node
- Done by kubectl drain node-name
  - Will first mark the node as unschedulable
  - Then evicts all pods
- kubectl cordon node-name will only mark a node as unschedulable without evicting pods on the node; kubectl uncordon node-name marks it schedulable again
- k8s only officially supports the 3 most recent minor versions
- Upgrade a cluster 1 minor version at a time
kubeadm upgrade plan
kubeadm upgrade apply version
Certificate Management
Building a cluster
Cluster Installation
Node Configuration
Container Runtime Interface
kubeadm, kubelet and kubectl
Connecting to a cluster
Network Configuration
Container Communication
Container Network Interface
Network Plugin
Garbage Collection
Troubleshooting
- Check if pod is running ⇒ kubectl get pod name
- Check if service is forwarding the request ⇒ kubectl get ep or kubectl describe service name
- Check if service is accessible by pinging it from one of the pods in the cluster
nc SERVICE_IP SERVICE_PORT
ping SERVICE_NAME
- Check application logs
- Check pod status and recent events ⇒ kubectl describe pod POD_NAME
Debug with Temporary Pods
- Debug with BusyBox
- Has all tools pre installed for debugging
kubectl run debug-pod --image=busybox -it
kubectl exec -it debug-pod -- sh
- Directly run a command without opening a shell: kubectl exec -it debug-pod -- sh -c "printenv"
- Debug a running pod by logging into it: kubectl exec -it pod-name --container container-name -- /bin/sh
- Debug by adding an ephemeral container: kubectl debug -it pod-name --image=busybox --target=container-to-debug
Kubectl format output
- Get more detailed output using JSON and JSONPath (a query language for JSON)
kubectl get pod -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}'
- Custom Columns
kubectl get pod -o custom-columns=POD_NAME:.metadata.name,POD_IP:.status.podIP
Troubleshoot Kubelet & Kubectl
- Kubelet
- Runs as a linux process, so SSH into the node
- service kubelet status to check the status of the service
- journalctl -u kubelet to see kubelet logs
- sudo systemctl restart kubelet to restart the kubelet service
- Kubectl
- Sometimes kubectl returns errors or just hangs
- Check the kubectl config at ~/.kube/config
  - Check if the cluster certificate data and server are correct
- kubectl cluster-info dump provides the complete cluster info
Pod Access Problems
Objects
- Server Side Validation
- The API server offers server-side field validation that detects unrecognized or duplicate fields in an object. It provides all the functionality of kubectl --validate on the server side.
- The options are Strict, Warn & Ignore. Strict is the default
- Names & IDs
- Each namespace can have only one resource of a given kind with a given name
- For non unique user provided attributes, use labels and annotations
- Names must be unique across all API versions of the same resource
Labels & Selectors
Labels
- Key Value pairs attached to objects
- Used for identifying objects
- Can be attached to objects at creation time, added later, modified
- The keys of a label can have 2 parts separated by a /
- An optional prefix which should be a DNS subdomain
- A name
- Well Known Labels: https://kubernetes.io/docs/reference/labels-annotations-taints/
Selectors
- Equality based selector: (=,== , !=)
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
  - name: cuda-test
    image: "registry.k8s.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100
- Set based selector: (in, notin, exists)
- Just mentioning the keys without values matches if the label just exists
- If there are multiple selectors, all of them are compared using a logical AND
- Using Selectors in API
- List and watch commands can use -l to pass label selectors
kubectl get pods -l environment=production,tier=frontend
kubectl get pods -l 'environment in (production),tier in (frontend)'
- Using selectors in manifests
selector:
  matchLabels:
    component: redis
  matchExpressions:
  - { key: tier, operator: In, values: [cache] }
  - { key: environment, operator: NotIn, values: [dev] }
Namespaces
- Logical grouping of resources inside a cluster
Miscellaneous
Best Practices
Configuration Best Practices
Production Environment
Resource Profiles
Automated Placements
Observability
Probes
Liveness Probes
- Lets k8s know if the app inside the pod is alive
- Configured using livenessProbe; periodSeconds defines how often the check repeats
- 3 types
  - Exec command ⇒ non-zero exit code is considered an error
  - tcpSocket ⇒ success if a TCP connection can be opened
  - httpGet ⇒ 2xx-3xx is considered success
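- A sketch of an httpGet liveness probe (path, port and image are assumptions):
spec:
  containers:
  - name: app
    image: my-app:1.0        # assumed image
    livenessProbe:
      httpGet:
        path: /healthz       # assumed health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10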
Readiness Probes
- Checks if the pod is ready to receive requests
- Pods only receive traffic through Services once the readiness probe succeeds
Startup Probes
- Verifies if a container has started
- Runs before any other probe; liveness and readiness checks begin only after it succeeds
- Executed only at startup, unlike readiness probes which are executed periodically
Operators
Maturity Level
Operator Framework
Custom Resource Definition
APIs
API Groups
API Versions
Deprecations
Production
Best Practices
Behavioral Patterns
Batch Jobs
Periodic Jobs
Daemon Service
Singleton Service
Stateful Service
Commands
kubectl options
kubectl cluster-info
kubectl create --help
kubectl auth --help
kubectl auth can-i verb resource --as user
kubectl scale deployment deployment --replicas=X
kubectl rollout status deployment/nginx-deployment
Tools & Services
Crossplane
Kustomize
Helm
Kubeshark
Resources
Tasks
- [ ] Add code snippets wherever possible
- [ ] Go through the k8s docs and follow all links
Deployments
- Do not run naked pods
- Offer scalability and reliability of the application
- Provides updates and update strategy
- Automatically restart pods if anything goes wrong with the pods
- If replicas is not set, default value is 1
- Scalability
- kubectl scale deployment name --replicas=n
- kubectl edit deployment name
- Deployment Updates
- Supports zero downtime application updates
- When an update is applied, a new replica set is created
- After a successful update, the old RS is scaled down (but kept for rollback by default)
- The strategy field defines how to handle updates
- Types
  - Rolling
  - Recreate ⇒ used if the app does not support running multiple versions at the same time
Labels, selectors and annotations
- Label
- Key value pair to provide additional info
- Used to connect related resources
- Deployment finds its pods using selectors
- Services finds its endpoint pods using selectors
- Used in network policies to control access
- Syntax: --selector key=value
- app=app_name is an automatically created label
- kubectl get deployments --show-labels displays all labels in the deployments
- Modifying a deployment to add a label does not add the label to the pods managed by the deployment
- Max length of a label is 63 chars
- Annotation
- Provide detailed metadata in an object
- Cannot be used in queries
- Used for licenses, maintainer etc
- kubectl rollout history and kubectl rollout undo
- Every deployment keeps a record of the rollout history
- Question: How to get events/Alerts for new replicas added by HPA?
Networking
Ingress
- Make applications easily accessible by using DNS provided URLs
- Purpose: Expose services to outside
- Works by adding entries to DNS
- Ingress doesn't make sense for ClusterIP
- Works only for NodePort & Load Balancer
- Ingress can do load balancing
- Can Ingress load balance across services running in 2 clusters?
- Ingress needs an Ingress Controller which is not available by default
- Exposes HTTP & HTTPS Routes
- Traffic is controlled by the rules defined on the ingress resources
- Can terminate SSL/TLS
- Offers name based virtual hosting.
- Options for Ingress Controllers
- HA Proxy
- NGinx
- Traefik
- Kong
- Contour
- Ingress Rule
- Optional Host
- If no host is specified, the rule applies to all http traffic
- A list of path and its backend
- Backend consists of service name and service port
- Backend Types
- Simple Fanout ⇒ Traffic routed to multiple backends; this helps minimize the number of load balancers
- Name based virtual hosting ⇒ Traffic incoming on a specific name is routed to a specific service
- TLS Ingress → Uses a TLS secret to ensure TLS termination at the load balancer
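- A minimal Ingress sketch with one host rule and TLS termination (host, service and secret names are assumptions):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls        # assumed TLS secret
  rules:
  - host: app.example.com      # assumed host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service   # assumed backend service
            port:
              number: 80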
Storage
- File stored in the container will only live as long as the container itself
- Pod Volumes can be used to allocate storage that outlives a container and stays available during pod lifetime
- Pod spec can contain volume
- By default volume will be ephemeral
- This allows containers in a pod to share storage
- Question: How to share storage between containers in the same pod?
- PV ⇒ Persistent Volume
- Allows pods to connect to external storage
- There can be multiple PVs pointing to different external storage solutions
- PVs by themselves are independent
- PVC is required to bind a PV to a pod
- Request for storage
- Specify type and size
- PVC is bound to a pod
- ConfigMaps are specific volume objects that connect to config files and variables
- Secrets do the same, but by encoding the data in it
- Configuring volume
- Decide if the volume is a pod local volume or persistent volume
- Volumes need to be specified in spec.volumes and mounted into containers using spec.containers.volumeMounts
- Volume Types ⇒ emptyDir, hostPath, azureDisk, awsElasticBlockStore, gcePersistentDisk, cephfs, rbd
- Configuring PV Storage
- A PV is an independent object that connects to external storage
- Use PVC to bind a PV to a pod
- Question: What is a ReclaimPolicy?
- Configuring PVC
- A PVC requests access to a PV according to specific properties
- AccessModes
- Availability of Resources
- PVC Bind
- After connecting to a PV, the PVC will show as a bound
- Question: how to make a PVC link to a specific PV when many PVs are defined?
ConfigMaps & Secrets
- Special types of volumes
- Can be used in 3 ways
- To make variables available within a pod
- Provide command line arguments
- Mount on a location where the application expects the config file
- When mounting a config map as a volume, files are created at the mount point named after each key, with the value as the file content
- Secrets are base64 encoded config maps
- CM & secrets must be created before the pods that are using them
- Source of CM
- Directories: uses multiple files in a directory
- Files: Puts the content of a file in the CM
- Literal Values: Key value pairs which can be used as variables and CLI args
- Irrespective of the source, the usage from pod remains the same
- ConfigMap as a config file
- In this, the pod is using the CM as a mounted volume which contains the config file
- The CM itself contains the config file
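- A sketch of mounting a CM as a config file (names and key are assumptions):
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  volumes:
  - name: config
    configMap:
      name: app-config        # assumed CM containing the key app.conf
  containers:
  - name: app
    image: nginx              # assumed image
    volumeMounts:
    - name: config
      mountPath: /etc/app     # file appears as /etc/app/app.conf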
- Secrets
- Store sensitive data
- Create like config maps
- Encoded but not encrypted
- Types
- docker-registry: used for connecting to a docker registry
- TLS: TLS Secret
- Generic: secret from a local file / directory / literal value
- Question: How to see the contents of auto generated secrets?
- Creation:
- Same as config map.
kubectl create secret
API Usage
Troubleshooting
- Determining a troubleshooting strategy
- When executing an API call
- kubectl → API Server → server writes to etcd → kube-scheduler gets data from etcd → talks to kubelet on the node → kubelet talks to the container runtime
- Kubectl describe is the first line of defence. It applies to any resource type
- Next step is kubectl logs to check container logs
- Pod States
- Pending: Pod has been validated by the API Server and an entry has been created in etcd but some prerequisite conditions have not been met
- Running
- Completed: Pod has completed its work
- Failed: Pod has finished but something went wrong; restart policy is Never
- CrashLoopBackOff: Pod has failed and the cluster has restarted it.
- Unknown: Pod status could not be obtained
- Kubectl describe
- Have a look at the events
- Look at app state
- Check last state and check exit code
- If the exit code is non-zero, use kubectl logs
- Analyzing pod access problems
- Check if labels in service and pods are matching
- Use kubectl get endpoints to check services and corresponding pod endpoints
- If ingress is used, check if ingress controller is working properly
- Check network policy - NP can be applied to pods, namespaces, IP ranges to restrict traffic in both directions
kubectl get netpol -A to check if any network policies exist
kubectl describe netpol name
- Label comparisons are case sensitive
- If a pod doesn't have an endpoint, there is something wrong with the service config, probably the labels
- Monitoring Cluster Event Logs
- Gives generic overview of everything
kubectl get events -o wide
kubectl describe resource_name
kubectl describe nodes
Observability
- Health Probes
- Startup Probe
- Legacy apps take time to start
- Readiness Probe
- Checks if the application is ready to serve requests
- Liveness Probe
- Periodically checks the application’s responsiveness
- Each probes offer 3 ways to verify health
- Custom Command - zero exit code means healthy
- HttpGet - 200 to 399 status code responses
- TCP Socket - Checks if connection can be established
- Each probe has a set of attributes to configure
- initialDelaySeconds (default: 0) - Delay in seconds until the first check is executed
- periodSeconds (default: 10) - Interval for executing the check
- timeoutSeconds (default: 1)
- successThreshold (default: 1) - Number of successful check attempts until the probe is considered successful after a failure
- failureThreshold (default: 3) - Number of failed checks before the probe is marked as failed and action is taken
- Use Kube YAML to validate the manifests
- Common Pod Error Statuses
- ImagePullBackoff - Image could not be pulled from the registry
- ErrImagePull - Image could not be pulled from the registry
- CrashLoopBackOff - Application or command run in container crashes
- CreateContainerConfigError - ConfigMap or Secret referenced by the container cannot be found
- Check for number of restarts in pod statuses
- Ephemeral Containers
- Some containers are very minimal that they wont even provide a shell
- Google Distroless
- k8s.gcr.io/pause:3.1
- Ephemeral containers are meant to be disposable and can be used for troubleshooting
kubectl alpha debug -it minimal-pod --image=busybox
- Troubleshooting Services
- Check if labels are mapped properly
- Get endpoints for a service
kubectl get endpoints service_name
- Open a quick tmp container and query the cluster IP
kubectl run tmp --image=busybox -it --rm -- wget -O- 10.99.155.165:80
- Check if port mapping is done correctly between the service and the pod
- Monitoring
- Metrics Server should be installed
kubectl top nodes
kubectl top pod pod_name
Scheduling
- NodeSelector
- Allows specifying a label that a node must have for a pod to run on it
- Label must exist already on the node before deployment of the pod
- Pod will only be scheduled when a label is matched
- No conditional logic. Simple label equality is allowed
- NodeAffinity
- Complex conditions than node selector
- 2 Types
- requiredDuringSchedulingIgnoredDuringExecution
  - The expression defined must be matched during scheduling
  - Does not evict an already-scheduled pod if the expression later becomes invalid
- preferredDuringSchedulingIgnoredDuringExecution
- A node matching the expression is only a preference not a mandatory requirement
- The pod will still get scheduled if no node matching expression exists
- NodeSelectorTerms
- Specify the expressions. Expressions are evaluated as OR, so if one of the expressions is true, the whole selector matches
- Node Affinity Operators
- In: List of values that the value of the label can have
- NotIn: Opposite of In
- Exists: Checks if a label exists
- DoesNotExist: Opposite of Exist
- Gt: Numeric Comparison
- Lt: Numeric Comparison
- NodeAffinityWeight
- Each expression can have a weight
- Question: How to check the expression weights during scheduling?
- PodAffinity
- Specifies where pods can be located relative to each other
- Used to colocate pods or separate pods
- LabelSelectors are used to match pods
- Same affinity types as node affinity
- TopologyKey
- PodAntiAffinity: Opposite of pod affinity
- Taints
- NodeAffinity attracts pods to a node. Taints repel them
- We can then use toleration to allow particular pods to be scheduled
- Taints are applied to nodes and tolerations are applied to pods
- Taint Effects
  - NoSchedule
  - PreferNoSchedule
  - NoExecute
kubectl taint node node-name key=value:effect
- Tolerations
- Allows a pod to be scheduled despite a taint
- Toleration Operators
- In order for a toleration to work, the taint and the toleration should have the same key, value and effect
- Use cases for Taints & Tolerations
- Dedicated Nodes
- Specialized Hardware
- Eviction
- Question: Taints & Tolerations vs Affinity?
- Node Condition bases taints
node.kubernetes.io/memory-pressure
node.kubernetes.io/disk-pressure
node.kubernetes.io/pid-pressure
node.kubernetes.io/unschedulable
node.kubernetes.io/network-unavailable
Operators
- Bundle, package and manage k8s custom controllers
- The difference between an operator and other methods like Helm is that if something changes in a deployment/config etc, there is no reconciliation; operators allow for reconciliation
- Operators do the following: observe, diff, apply
- Handles upgrades of the images
- Auto healing of the components managed by the operator
- https://operatorframework.io
- Go lang based operators - Most feature rich
- Ansible based operators
- Helm based operators
- Capability Model (https://operatorframework.io/operator-capabilities/)
- Level 1 ⇒ Basic Install - Automated application provisioning and configuration management
- Level 2 ⇒ Seamless Upgrades - Patch and minor version upgrades supported
- Level 3 ⇒ Full Lifecycle - App Lifecycle, storage lifecycle (Backup, failure recovery)
- Level 4 ⇒ Deep Insights - Metrics, alerts, log processing and workload analysis
- Level 5 ⇒ Auto Pilot - Horizontal/Vertical scaling, auto config, tuning, abnormal detection, scheduling tuning
- Popular Operators
- Prometheus
- Elasticsearch
- Istio
- ArgoCD
- MinIO
Security
- Security 101
- Layered Defence/Defence in Depth
- Security should be redundant
- Least Privilege
- Runtime Security
- Host Security
- Network Security
- Threat Detection
- Image Hygiene
- SecOps
- Attacks
- Attacking API Server
- By default every pod gets a service account
- Each service account has a bunch of roles that allow the pod to talk to the API server
- Get token from pod (Like reading a file from the file system)
- Use token to attack cluster API server
- Get secrets to attack further
- Mitigation
- RBAC
- Roles given to users and service accounts
- Each role should have permission to perform some operation
- RBAC settings apply per namespace
- API Server Firewall
- Restrict access to API server to certain IP addresses
master-authorized-networks
- Network Policy
- Restrict access to important services to only the pods that require it
- Specify access via labels
- Network plugins should support it
- Get access to cluster components
- Manipulate cluster components like etcd
- Mitigation
- Encrypt data in etcd
- Use auth and firewalls to restrict access to etcd
- Security Issues
- Images
- Code from untrusted registries
- Vulnerabilities in tools of OS or code libraries
- Bloated base images
- Mitigation
- Use approved lean images
- Create list of trusted registry
- Image Scanning
- Used signed images
Hardening Clusters
Hardening a Kubernetes (K8s) cluster from a security perspective involves a multi-layered approach that encompasses several aspects of the Kubernetes environment. Here’s a comprehensive guide to securing your Kubernetes cluster:
- Cluster Configuration and Management
- Use Secure Access Controls: Implement role-based access control (RBAC) to limit access based on the principle of least privilege.
- Audit Logging: Enable audit logging to keep track of all actions and changes within the cluster. This can help in forensic analysis in case of security incidents.
- Regularly Update and Patch: Keep your Kubernetes components and dependencies up to date to protect against vulnerabilities.
- Network Policies and Security
- Implement Network Policies: Define network policies to control the traffic between pods within a cluster, ensuring only authorized pods can communicate with each other.
- Use a Service Mesh: Implement a service mesh like Istio for enhanced security features, including secure service-to-service communication.
- Encrypt Data in Transit: Ensure that all communications within your cluster are encrypted using TLS.
- Cluster Authentication and Authorization
- Secure API Access: Use Transport Layer Security (TLS) for all API communication and ensure that all clients authenticate before accessing the Kubernetes API.
- Use Strong Authentication Mechanisms: Implement strong authentication mechanisms such as multi-factor authentication for accessing the Kubernetes cluster.
- Limit API Access: Minimize the number of components and users that have access to the Kubernetes API.
- Encrypt data at rest in etcd
- Pod Security Policies and Practices
- Apply Pod Security Policies (PSP): Use PSPs to control the security specifications pods must adhere to. Note that PSPs were deprecated in Kubernetes 1.21 and removed in 1.25; use Pod Security Admission or policy engines like OPA Gatekeeper or Kyverno instead.
- Use Security Contexts: Define security settings for your pods and containers, including permissions and user/group IDs.
- Secrets Management
- Encrypt Secrets at Rest: Ensure that secrets stored in Kubernetes are encrypted at rest using a key management system.
- Manage Secrets Securely: Use external secrets management systems like HashiCorp Vault for better secrets management and rotation policies.
- Container Security
- Use Trusted Base Images: Only use trusted container images and regularly scan images for vulnerabilities.
- Implement Image Signing and Verification: Ensure that only signed images are used in your cluster to prevent tampering.
- Run Containers as Non-Root: Avoid running containers as root unless absolutely necessary to reduce the risk of privilege escalation.
- Monitoring and Alerting
- Implement Monitoring Solutions: Use tools like Prometheus and Grafana for monitoring the health and performance of your cluster.
- Set Up Alerting: Configure alerting mechanisms to notify you of potential security issues or anomalies.
- Disaster Recovery and Backup
- Regular Backups: Regularly back up your cluster’s etcd data to ensure you can recover from data loss or corruption.
- Disaster Recovery Plan: Have a disaster recovery plan in place, including procedures for restoring from backups in case of catastrophic failures.
- Security Tools and Audits
- Use Security Tools: Implement security tools like Aqua Security, Sysdig Secure, or Kube-Bench to continuously scan and audit your cluster for vulnerabilities.
- Conduct Regular Security Audits: Regularly audit the cluster's security posture and compliance with security best practices and standards.
- Education and Awareness
- Security Training for Teams: Ensure that your team is aware of best practices for Kubernetes security and the specific configurations and tools used in your environment.
Commands
kubectl scale
kubectl expose
kubectl autoscale
kubectl create service
kubectl get endpoints
kubectl describe
minikube addons list
kubectl explain ingress.spec.rules
kubectl exec -it name -- bash
kubectl create cm variables --from-env-file=variables
kubectl create cm special --from-literal=key=value
kubectl create cm config --from-file=file
kubectl create secret generic name --from-file=key=/filepath
kubectl get pods
kubectl describe pod pod_name
kubectl get events
kubectl get pods --show-labels
kubectl label pod pod_name key=value
kubectl label pod pod_name key=value1 --overwrite
kubectl label pod pod_name key-
Questions
- How to dynamically mention the port numbers inside the manifest?
Topics
- Crossplane
- Pod Restart Policy
- Production Best Practices
- Operators
- Kustomize
- Secrets Management
- GitOps
- Cluster Role
- Cluster Role Binding
- Helm
- Resource Profiles
- Pod Priorities
- Types of Deployments
- Rolling
- Canary
- Blue Green
- Replace
- Health probes
- Process health checks
- Liveness probes
- Readiness probes
- Startup probes
- Managed Lifecycle
- Sigterm, sigkill, poststart hook, prestop hook
- Lifecycle hooks
- Automated Placements
- Behavioural Patterns
- Batch Jobs
- Periodic Jobs
- Daemon Service
- Singleton Service
- Stateless Service
- Stateful Service
- Service Discovery
- Self Awareness
- Structural Patterns
- Init Container
- Sidecar
- Adapter
- Ambassador
- Security Patterns
- Process Containment
- Network Segmentation
- Secure Configuration
- Access Control
- Authentication
- Authorization
- Admission Controllers
- Subject
- Role Based Access Control
- Advanced Patterns
- Controller
- Operator
- Elastic Scale
- Image Builder
- Terraform
- Ansible
- Kube Proxy
- Pod Security Policy
- CNI
- Core DNS
- Monitoring
- Logging
- Cluster Logging
- Node Logging
- Troubleshooting
- Control Plane Nodes
- Worker Nodes
- Services
- Opening an interactive shell
- Observability
- Multitenancy
- SSL Certificates
- Image Hardening
- Cluster Hardening
- Host Hardening
- Distributed Tracing
- Packet Capture
- Service Graph
- Visualization of network flows
- Encryption
- Service Mesh
- Cluster Federation
- Multi node Cluster
- Service Accounts
- Role Bindings
- Backup & Restore etcd
- Upgrading a cluster
- PV & PVC
- Access Mode
- Volume Mode
- Reclaim policy
- Finalizers
- Owners & Dependents
- Object Names & IDs
- Leases
- Cgroup v2
- Container Runtime Initiative
- Garbage Collection
- Ephemeral Containers
- Disruptions
- Automatic Cleanup
- Cron Jobs
- Gateway API
- Endpoints
- Endpoint Slices
- Network Policies
- Dual Stack
- Topology Aware routing
- Service Cluster IP Allocation
- Service Internal Traffic Policy
- Volumes
- Persistent Volumes
- Projected Volumes
- Storage Classes
- Volume Attribute Classes
- Dynamic Volume Provisioning
- Volume Snapshots
- Volume Cloning
- Node Specific Volume Limits
- Volume Health Monitoring
- Bind Mounts
- Tmpfs mounts
- Data Backups
- Security
- Pod Security Standard
- Pod Security Admission
- Policies
- Limit Ranges
- Resource Quotas
- Process ID limits
- Node Resource Managers
- Scheduling, Preemption & Eviction
- Node Affinity & Pod Affinity
- Assigning pods to nodes
- Pod overheads
- Pod scheduling readiness
- Pod topology spread constraints
- Taints & Tolerations
- Scheduling Framework
- Dynamic Resource Allocation
- Scheduler Performance Tuning
- Resource Bin Packing
- A scheduling strategy to optimize the resource utilization in the cluster.
- Places pods on nodes that maximizes CPU, memory and other resources while still respecting scheduling constraints and requirements of each pod
- Question: How to do bin packing?
- Node Pressure Eviction
- API initiated Eviction
- Performance
- Cluster Admin
- Certificates
- Logging Architecture
- System Logs
- Traces for k8s components
- Proxies in k8s
- API Priority & fairness
- Installing Addons
- Extending
- Network Plugins
- Device Plugins
- Custom Resources
- Operator Pattern
- Pod Security Policy
- Manifest Validation
- ApplicationSet
- HPA & VPA
- Istio
- PodDisruptionBudgets
- Increases resiliency of the application
- Ensures that a minimum number of pods of a replicated application remain available during voluntary disruptions
- Voluntary Disruptions can occur during
- Node is drained to perform maintenance or upgrades
- Pods are deleted to scale down a deployment
- Pods are moved during a rescheduling
- Key Components
- MinAvailable ⇒ Can be an absolute number or percentage of pods
- MaxUnavailable ⇒ Can be an absolute number or percentage of pods
- K8S in IOT
- Pause Container
- Also known as pod infrastructure container
- A hidden container called the pause container runs in every pod
- The single job is to hold the network namespace
- Created before the business application container
- IP is given to the pause container
- If the pause container is deleted, k8s recreates the pod
- Sidecar
- Envoy Proxy
- High performance, OSS network proxy
- Operates at layer 7
- Can act as a service mesh which is both a proxy and reverse proxy
- Provides load balancing
- Architecture
- Downstream/Upstream
- Clusters
- Listeners
- Network Filters
- Threading Model
- Connection Pools
- Init Container
- Admission Hook
- PodSecurityPolicy
- ForegroundCascadeDeletion
- BackgroundCascadeDeletion
- Admission Controller
- Plugins that intercept requests to API server before the persistence of object configuration in etcd
- Runs after the request is authenticated and authorized.
- They can modify or reject requests to enforce certain policies or augment the incoming request based on specific logic
- 2 Types
- Validating Admission Controllers
- Validate the request after it is processed by the mutating controllers
- Failed requests are rejected and an error is returned
- Mutating Admission Controllers
- Modify incoming requests before validation
- Inbuilt Admission Controllers
- NamespaceLifecycle: All requests to non-existent namespaces are rejected
- LimitRanger: Enforces the defaults and limits defined in LimitRange objects
- ServiceAccount
- SecurityContextDeny
- PodSecurityPolicy
- MutationAdmissionWebhook & ValidatingAdmissionWebhook: Custom admission policies
- Draining a Node
- Is a process of evicting all pods from the node for maintenance or upgrades
- Workloads should be shut/moved gracefully
- Steps
- Mark node as unschedulable
- Drain the node
kubectl drain node_name --ignore-daemonsets --delete-local-data
- If pods can't be evicted because of PDBs or other scheduling policies, the drain command will wait and retry
- Perform Maintenance
- Uncordon the node
Network Policies
- By default any pod can talk to any pod using IP or DNS name across namespaces.
- Network policies control traffic from pod to pod
- The rules are matched using label selectors.
- NP defines the direction of traffic and whether to allow or disallow it
- Incoming traffic is called ingress and outgoing is called egress
Storage
Security
- kube-apiserver is a key component to secure the cluster
- TLS certificates for all communication within the cluster
- K8s does not manage user accounts; it integrates with other IdPs
- Authentication Mechanisms
- Static Password Files:
--basic-auth-file=users.csv
- Static Token File:
--token-auth-file=users.csv
- Certificates
- IdP
- Kube Config
- Clusters
- Contexts
- Combines users and clusters
- Users
kubectl config use-context context_name
- API Groups
- Core Group ⇒ Has all the core resources
- Named Group ⇒ New additions
- Authorization
  - Types
- Node
- ABAC
- Used by users / group of users
- Managed by policy files
- kube-apiserver needs to be restarted for any changes to ABAC
- RBAC
- Instead of associating policies with users directly, a role is created first and then associated with users
- Webhook
- Used for outsourcing auth mechanism
- Example: Open Policy Agent
- AlwaysAllow
- AlwaysDeny
- When multiple auth modes are activated, the order of evaluation is the order in which the modes are specified when starting the kube-apiserver
- RBAC
- Create a role definition file
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
rules:
- apiGroups: [""] # [""] represents the core api group
  resources: ["pods"]
  verbs: ["list", "get"]
  resourceNames: []
- Link user to the role by using a role binding
- Link a user to the role by using a role binding
kubectl auth can-i create deployments to check access
kubectl auth can-i create pods --as dev-user to impersonate and check
- Roles and Role Bindings are namespaced
- Cluster Scoped Resources
- Nodes
- PV
- ClusterRoles
- ClusterRoleBindings
- CSR etc
- Admission Controllers
- Happens after request authorization is successful
- Enforces validation on the manifest and applies some changes if needed
- Pre Built Admission Controllers
- AlwaysPullImages
- DefaultStorageClass
- EventRateLimit
- NamespaceExists
- Flags
- Flags: --enable-admission-plugins and --disable-admission-plugins
- It can not only perform validations and accept/reject requests but also perform operations on the backend, like NamespaceAutoProvision, which creates a namespace if it doesn't exist
- Validating Admission Controller
- Mutating Admission Controller
- Example: DefaultStorageClass
- Usually mutating controllers are applied first before the validating controllers
- Mutating & Validating Admission Webhooks
- Point the webhooks to a service running custom controllers either inside the cluster or outside the cluster
- Request is an AdmissionReview object; response is an AdmissionReview with the response.allowed field containing the result
- Can be any API server written using any technology
- POST /validate and POST /mutate are the endpoints for the webhooks
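- A sketch of registering a validating webhook that points at an in-cluster service (all names are assumptions; caBundle omitted):
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-policy-webhook
webhooks:
- name: pod-policy.example.com     # assumed webhook name
  clientConfig:
    service:
      namespace: default
      name: webhook-service        # assumed service running the custom controller
      path: /validate
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  admissionReviewVersions: ["v1"]
  sideEffects: None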
- API Versions
- API Deprecations
- Deprecation policy Rules
- API elements may only be removed by incrementing the version of the API Group. i.e a resource in v1alpha1 can be removed only after the version changes to v1alpha2
- API objects may be able to round trip between versions in a given release without information loss with the exception of whole REST resources that do not exist in some versions
kubectl convert -f file_name to migrate manifests from an old API to a new API
- CRD
- Each custom resource has its own controller that is responsible for managing the custom resource
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: # fully qualified name
spec:
  scope: Namespaced # or Cluster
  group: # api group
  names:
    kind:
    singular:
    plural:
    shortNames:
    - # short name
  versions:
  - name:
    served: true
    storage: true
    schema:
      openAPIV3Schema:
- Just adding a CRD will store it in etcd. No logic will be executed
- A custom controller is required for custom logic
- A controller is an object that runs in a loop and listens to changes in resources it is interested in
- Operator Framework
- Used to deploy the CRD and the controller together
- Operators do the job that humans manually do in terms of deploying, backing up, fixing issues etc
Volumes
- Each container in a pod gets its own temporary file system that cannot be accessed by another container
- Volumes can be used to exchange data between a main application container and a sidecar
- Volume Type
- emptyDir ⇒ Empty directory in pod with RW access. Only persisted for the lifespan of a pod. Useful for exchanging data between containers of a pod
- hostPath ⇒ File or directory from the host node’s filesystem
- configMap, secret ⇒ Provides a way to inject configuration data
- nfs ⇒ Network file system
- PVC ⇒ Claims a persistent volume
apiVersion: v1
kind: Pod
metadata:
  name: business-app
spec:
  volumes:
  - name: logs-volume
    emptyDir: {}
  containers:
  - image: nginx
    name: nginx
    volumeMounts:
    - mountPath: /var/logs
      name: logs-volume
- Static vs Dynamic Provisioning
- Static ⇒ Create the storage device first and create a PV
- Dynamic ⇒ a PV will automatically be created by setting a storage class name using spec.storageClassName
- A storage class is an abstraction that defines a class of storage device, like fast vs slow performance (see the sketch below)
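- A minimal StorageClass sketch; the provisioner and parameters depend on the cluster's storage backend and are assumptions here:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: ebs.csi.aws.com   # assumed CSI driver
parameters:
  type: gp3                    # assumed backend-specific parameter
reclaimPolicy: Delete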
- PV can only be created using manifest and not via CLI
- Access Modes
- ReadWriteOnce
- ReadOnlyMany
- ReadWriteMany
- Reclaim Policy ⇒ Determines what should happen with the PV after it has been released from its claim. By default the object will be retained
apiVersion: v1
kind: PersistentVolume
metadata:
  name: db-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /data/db
- PVC
- The purpose is to bind the PV to a pod
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: db-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 512Mi
- Once the PVC is created, its status is set to Bound, which means the binding to the PV was successful
- Once the PV and PVC are defined, the volume can be mounted on the pod by specifying the PVC in the pod manifest
apiVersion: v1
kind: Pod
metadata:
  name: app-consuming-pvc
spec:
  volumes:
  - name: app-storage
    persistentVolumeClaim:
      claimName: db-pvc
  containers:
  - image: alpine
    name: app
    command: ["/bin/sh"]
    args: ["-c", "while true; do sleep 60; done;"]
    volumeMounts:
    - mountPath: "/mnt/data"
      name: app-storage
ISTIO
- Service Mesh
- Provides visibility on the interconnection between the pods
- Each pod has a sidecar container called istio-proxy which tracks all the calls
- The proxies are collectively called the Data Plane
- istiod runs in the istio-system namespace and gets data from the istio proxies
- Pods running in istio-system are called the Control Plane
EKS
- Components
- Control Plane
- Has at least 2 API server nodes and 3 etcd nodes that run across 3 AZs in a region
- Automatically detects unhealthy control plane nodes
- Worker Nodes & Node Groups
- Group of ec2 instances to run workloads
- Node Group: one or more EC2 instances deployed in an ASG
- All instances in a node group should be of the same type, run the same AMI and use the same worker node role
- Fargate Profiles
- Serverless
- Runs only on private subnets; needs a VPC with at least 1 private subnet
- VPC
Tools
- Velero - Backup of data and cluster state
- Kubeval
- Conftest
- Datree