Introduction
Cluster
- 3 processes need to run on every node
- Container runtime
- kubelet
- Interfaces with both container runtime and node
- Responsible for taking a container config and creating a container using the container runtime
- kube-proxy
- Responsible for forwarding requests from services to pods
- Intelligently routes requests to pods and tries to reduce network overhead
- Master Nodes
- 4 processes need to run on every master node
- API Server
- Takes care of authentication
- Scheduler
- Knows which node to schedule the pod to based on the incoming request
- The scheduler only decides which node the pod runs on; it's the kubelet that receives the request from the scheduler and creates the container
- Controller Manager
- Detects state changes like crashing pods and tries to recover the cluster state by making a request to the scheduler
- ETCD
- KV store for storing the cluster state
- The brain of the cluster
Cluster Architecture
Nodes
Controllers
Leases
Container Runtime Interface
Garbage Collection
ETCD
CoreDNS
Workloads
Containers
- Containers in the same pod are colocated and co-scheduled to run on the same node
- Container Runtime is responsible for managing the execution and lifecycle of containers within the k8s environment
- Supported runtimes: containerd, CRI-O and other implementations of Container Runtime Interface
Images
Runtime Class
- RuntimeClass is used to select the container runtime configuration that is used to run a pod's containers
- Different pods can use different runtime classes. For example, if part of your workload deserves a high level of information security assurance, you might choose to schedule those Pods so that they run in a container runtime that uses hardware virtualization. You'd then benefit from the extra isolation of the alternative runtime, at the expense of some additional overhead.
- RuntimeClass is also used to run different pods with the same runtime but different settings
- Setup
- Configure CRI implementation on the nodes
- Create corresponding RuntimeClass resources
- #to-read Configure CRI implementation on a node
- RuntimeClass assumes all nodes in the cluster support it by default.
- Usage
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  runtimeClassName: myclass
  # ...
- Scheduling
- By specifying the scheduling field for a RuntimeClass, you can set constraints to ensure that Pods running with this RuntimeClass are scheduled to nodes that support it. If scheduling is not set, this RuntimeClass is assumed to be supported by all nodes.
- The constraints are set using the label selector scheduling.nodeSelector
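- A minimal sketch of a RuntimeClass with a scheduling constraint; the handler name and node label are assumptions for illustration:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: myclass
handler: myconfiguration        # assumed CRI handler configured on the nodes
scheduling:
  nodeSelector:
    runtimeclass-support: "true"  # hypothetical node label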
Container Lifecycle
Container Lifecycle Hooks
Ephemeral Containers
- A special type of container that runs temporarily in an existing pod to be used for troubleshooting
kubectl debug pod-name --image alpine --target container-name
- Containers created this way keep running until they exit or the pod is deleted
- This creates a new container within the same pod
- This new container uses the same container/linux namespace as the original container that is being debugged
- Ephemeral containers lack guarantees for resources and execution
- They will not be automatically restarted
- Like regular containers, you may not change or remove an ephemeral container after you have added it to a Pod
- Ephemeral containers are useful for interactive troubleshooting when kubectl exec is insufficient because a container has crashed or a container image doesn't include debugging utilities.
- Process Namespace Sharing: https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/ #to-read
Sidecar Containers
Pods
- Smallest unit of abstraction
- Abstraction over a container
- Usually 1 application per pod
- Each pod gets its own IP address
- Pod can communicate using IP address
- Pods are ephemeral, meaning they can die easily
- New pods get a new IP address
Pod Lifecycle
Init Containers
Multi Container Pods
Sidecar Injection
Disruption
Pod Quality of Service Classes
Pod Restart Policy
Pod Priorities
Disruptions
- Involuntary Disruptions
- Hardware failure in the node
- VM is deleted by mistake
- Cloud Provider / Hypervisor failure
- A kernel panic
- Node disappears from the cluster due to cluster network partition
- Eviction of pods because the node ran out of resources
- Dealing with Disruptions
- Ensure pods have resource requests and limits set
- Replicate the app for HA
- Spread apps across zones
- Pod Disruption Budgets
- A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions
- Cluster managers and hosting providers should use tools which respect PodDisruptionBudgets by calling the Eviction API instead of directly deleting pods or deployments.
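- A minimal PDB sketch, assuming an nginx app label:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx   # assumed label of the replicated application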
Deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
- restartPolicy is Always by default
- The pod-template-hash label is added by default to the ReplicaSet by the deployment
Updating a Deployment
- A deployment's rollout is triggered if and only if the pod template spec is changed
- Scaling the deployment does not trigger a new rollout
Scalability
Deployment Types
- When a deployment is created the following happens
- A rollout is created by the deployment
- A replica set is created by the deployment
- The pods are created by the replica set
- Recreate
- Delete old pods
- Create new ones
- Results in downtime
- Rolling
- Delete one old pod and add a new one and so on
- Default
- maxUnavailable and maxSurge control how many replicas are rolled over at a point in time (see the sketch below)
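- A sketch of the corresponding strategy stanza in a Deployment spec (the 25% values are the defaults):
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%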
- Rollover
- The act of updating a deployment while an existing rollout is still in progress; a new rollout supersedes the old one
- Deployment strategy can be changed using .spec.strategy.type
- A new revision is created whenever a deployment is rolled out
kubectl rollout history deployment/name
- kubectl rollout undo rolls back
- kubectl rollout status shows the rollout status
- Rollback to a specific version
kubectl rollout undo deployment/nginx-deployment --to-revision=2
- Change cause of a deployment is set in the annotation kubernetes.io/change-cause
- A custom change cause can be set using
kubectl annotate deployment/nginx-deployment kubernetes.io/change-cause="image updated to 1.16.1"
- #to-read Proportional Scaling
- Autoscaling
- Use HPA
- If HPA is used, then don't set .spec.replicas
kubectl autoscale deployment/nginx-deployment --min=10 --max=15 --cpu-percent=80
- Pausing & Resuming a rollout
kubectl rollout pause deployment/name & kubectl rollout resume deployment/name
- Watch the status of a rollout using kubectl get rs -w
- A paused deployment cannot be rolled back until it is resumed
- Deployment Progress Deadline
- One way you can detect this condition is to specify a deadline parameter in your Deployment spec: [.spec.progressDeadlineSeconds](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#progress-deadline-seconds) denotes the number of seconds the Deployment controller waits before indicating (in the Deployment status) that the Deployment progress has stalled.
- Pausing a deployment will not result in a progress deadline check
- Deployment Revision History Limit
- You can set the .spec.revisionHistoryLimit field in a Deployment to specify how many old ReplicaSets for this Deployment you want to retain. The rest will be garbage-collected in the background. By default, it is 10.
ReplicaSets
- Maintain a stable set of replica pods running at any given time
- A RS is linked to its pods via the pods' metadata.ownerReferences field, which specifies what resource the current object is owned by
- It is through this link that the RS knows the state of the pods it's managing
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
  labels:
    app: guestbook
    tier: frontend
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: us-docker.pkg.dev/google-samples/containers/gke/gb-frontend:v5
- Any naked pods matching a RS's label selector will also be acquired by the RS if their owner reference is empty
- In the ReplicaSet, .spec.template.metadata.labels must match .spec.selector, or it will be rejected by the API.
- What happens if a RS is updated? Does it create a new rollout? #question
- Isolating pods from RS
- You can remove Pods from a ReplicaSet by changing their labels. This technique may be used to remove Pods from service for debugging, data recovery, etc. Pods that are removed in this way will be replaced automatically (assuming that the number of replicas is not also changed).
- Scaling Down a RS
- When scaling down, the ReplicaSet controller chooses which pods to delete by sorting the available pods to prioritize scaling down pods based on the following general algorithm:
- Pending (and unschedulable) pods are scaled down first
- If the controller.kubernetes.io/pod-deletion-cost annotation is set, then the pod with the lower value comes first.
- Pods on nodes with more replicas come before pods on nodes with fewer replicas.
- If the pods' creation times differ, the pod that was created more recently comes before the older pod (the creation times are bucketed on an integer log scale when the LogarithmicScaleDown feature gate is enabled)
If all of the above match, then selection is random.
- Pod Deletion Cost
- Using the [controller.kubernetes.io/pod-deletion-cost](https://kubernetes.io/docs/reference/labels-annotations-taints/#pod-deletion-cost) annotation, users can set a preference regarding which pods to remove first when downscaling a ReplicaSet.
- Pods with lower deletion cost are deleted first
- Default is 0
- This is honored on a best-effort basis, so it does not offer any guarantees on pod deletion order.
- HPA can also target a RS
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-scaler
spec:
  scaleTargetRef:
    kind: ReplicaSet
    name: frontend
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
StatefulSets
DaemonSets
- Runs a pod on each node in the cluster
- Can be used for collecting logs from all nodes
- Automatically calculates the number of replicas of the pod based on the number of nodes in the cluster
- Adds only one pod per node
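- A minimal DaemonSet sketch for a node-level log agent (name and image are assumptions):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
spec:
  selector:
    matchLabels:
      name: log-agent
  template:
    metadata:
      labels:
        name: log-agent
    spec:
      containers:
      - name: log-agent
        image: fluent/fluentd:v1.16   # assumed log collector image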
Jobs
Cron Jobs
Replication Controller
Networking
Network Types
Pod Network
Cluster network
Node Network
Dual Stack
Gateway API
Service
- Has a permanent IP address
- Lifecycle of a pod and service are not connected; even if the pod fails, the service will still be up
- Load balances the requests and distributes it to pods below it
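- A minimal Service sketch (label and ports are assumptions):
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app        # assumed pod label
  ports:
  - port: 80           # port the service listens on
    targetPort: 8080   # assumed container port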
Service Cluster IP Allocation
Ingress
SSL Termination
Ingress Rules
Ingress Controller
Gateway API
Endpoint Slices
Network Policies
- By default all communication is allowed within all pods
- NP restricts communications
- It is enforced by the CNI plugin
- Not all plugins support NP
- Define which app the NP applies to by using selector labels
- If empty the policy will be applied to all pods in the namespace
- Ingress rules are for incoming traffic and egress rules for outgoing
- policyTypes captures whether the policy covers ingress, egress, or both
- podSelector selects the pods the policy applies to
- namespaceSelector is used to filter namespaces
- How does the manifest look for Allow All / Deny All? #question
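- Partial answer: a default deny-all policy looks roughly like this; an empty podSelector selects every pod in the namespace, and listing both policy types with no rules denies all traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}   # empty selector: applies to all pods in the namespace
  policyTypes:
  - Ingress
  - Egress
  # no ingress/egress rules listed, so nothing is allowed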
DNS
Routing
Topology Aware Routing
Storage
Volumes
- Data stored in containers is gone when a pod restarts
- Volumes are used to store persistent data
- Volumes attaches a physical storage to the pod
- Can be on local machine
- Can be on cloud storage
- The k8s admin is responsible for backing up and restoring the data; k8s does not do this
Persistent Volumes
Persistent Volume Claims
Projected Volumes
Ephemeral Volumes
Storage Classes
Volume Attribute Classes
Volume Provisioning
Static Provisioning
Dynamic Provisioning
Volume Snapshots
Volume Snapshot Classes
Volume Health Monitoring
Configuration
Config Maps
- Contains config data to be used by the pod
- Config Maps are connected to the pod
- Will updating the config map reflect immediately in the App? #question
Secrets
- Only base64 encoded, not encrypted
- Mount to pods like config maps
Resource Requests & Limits
- It's good practice to set resource requests and limits for all resources
- Requests are the amount of resources a pod needs
- Limits are the max amount of resources the pod is allowed to use before it is killed
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{.spec.containers[*].resources}{"\n"}{end}'
to get the list of pods and their resources
- If pods need to be evicted, those without resource requests are evicted first
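- A sketch of requests and limits on a container (image and values are assumptions):
spec:
  containers:
  - name: app
    image: nginx          # assumed image
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"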
Autoscaling
Cluster Autoscaling
Pod Autoscaling
HPA
VPA
Pod Disruption Budgets
Priority Classes
Runtime Classes
Admission Controllers
Security
Authorization Modes
RBAC
Roles
- Roles define namespaced permissions
- Roles only define the permissions. They are not associated with users
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: namespace_name
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "create", "list"]
  resourceNames: ["myApp"]
Cluster Role
- Defines resources and permissions at a cluster level and not limited to a namespace
Users
- K8S does not manage users natively
- Allows different external auth strategies. Can be
- External Tokens
- Certificates
- 3rd Party IdP
RoleBinding
- Binds a role to a user or a user group
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-role-binding
subjects:
- kind: User # or Group / ServiceAccount
  name: username
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
- Checking API Access
kubectl auth can-i create deployments --namespace dev
Service Accounts
- Represents an application
- Application running both internally and externally need access to resources inside a cluster
- A service account can be linked to a role/cluster role using a binding
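- A sketch linking a ServiceAccount to the developer role from above via a RoleBinding (the account name is an assumption):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-sa
  namespace: namespace_name
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-sa-binding
  namespace: namespace_name
subjects:
- kind: ServiceAccount
  name: my-app-sa
  namespace: namespace_name
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io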
Other Authorization Modes
- ABAC, Node, Webhook
- /etc/kubernetes/manifests/kube-apiserver.yaml contains information on which authorization modes are enabled via the --authorization-mode flag
Cluster Hardening
Certificates
- When a cluster is bootstrapped with kubeadm init, the certs for k8s components are generated in /etc/kubernetes/pki
- Client certificates can be found in /etc/kubernetes/kubelet.conf, which allows nodes to talk to the master
- kubeadm generates a CA for the k8s cluster and the etcd server so that it can sign the certificates
- All the nodes get a copy of the CA that generated the certificates
Certificates API
- Process of signing a client certificate
- Create Key Pair
- Generate Certificate Signing Request
- Send CSR using K8S Certificates API
- K8S signs the certificate
- A K8S admin approves/denies the request
- After approval, a signed certificate is provided by k8s CA which is handed over to the client
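- A sketch of the CSR object sent to the Certificates API; the request field carries a base64-encoded PKCS#10 CSR (elided here):
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: developer-csr
spec:
  request: <base64-encoded CSR>
  signerName: kubernetes.io/kube-apiserver-client
  usages:
  - client auth
- An admin then approves it with kubectl certificate approve developer-csr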
Scheduling
Node Name & Selector
- Pods are automatically scheduled on the worker nodes; the scheduler intelligently decides where to place the pods
- In some cases, we might want to choose the node ourselves
- Done using the nodeName attribute in the spec
- If the nodes are dynamic, we can use nodeSelector
  - Specify nodes using labels
  - First see the labels on a node with kubectl get node --show-labels
  - kubectl label node node-name key=value to label a node
Node Affinity
- More expressive than node name and selector
- Match rules with logical operator
- Done using the affinity.nodeAffinity attribute
- Can have a bunch of rules with weights associated
- Has 2 types of preferences
  - Soft: preferredDuringSchedulingIgnoredDuringExecution
  - Hard: requiredDuringSchedulingIgnoredDuringExecution
- Supported Operators: In, Exists, Gt, Lt, NotIn, DoesNotExist
- Node anti-affinity is expressed with the negative operators (NotIn, DoesNotExist) in nodeAffinity rules; see the sketch below
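- A sketch of a hard node affinity rule (the node label is an assumption):
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype      # assumed node label
            operator: In
            values:
            - ssd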
Taints & Tolerations
- Affinity lets us tell which nodes a pod should be assigned to; taints let the node tell which pods it should accept
- Master nodes have taints on them that repel regular pods from being scheduled on them
  - node-role.kubernetes.io/master:NoSchedule is the exact taint
  - kubeadm sets these taints during installation
- Tolerations are used to work around a taint
  - i.e. to force a pod to be scheduled on a node whose taint is satisfied by the pod
- If we want a pod to run only on the master node (see the sketch below)
  - Add a toleration to the master node taint
  - Add a nodeName selector and set it to the master
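- A sketch, assuming a hypothetical dedicated=gpu taint: taint the node with kubectl taint nodes node1 dedicated=gpu:NoSchedule, then let a pod tolerate it:
spec:
  tolerations:
  - key: "dedicated"       # assumed taint key
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"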
Inter Pod Affinity
- Configure pods to run only on nodes that have another pod running, like collecting logs for a particular pod
- Configured using podAffinity
- Inter pod anti-affinity is the opposite of that
- By setting the anti-affinity rule to the same label as the pod being scheduled, you prevent more than 1 replica of a pod from running on the same node (see the sketch below)
- What is topologyKey? #question
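- Partial answer: topologyKey names a node label whose values define the failure domains the rule applies over; kubernetes.io/hostname means per node. A sketch of the one-replica-per-node pattern (app label assumed):
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app   # assumed: same label as the pod itself
        topologyKey: kubernetes.io/hostname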
Node Selector
Node Affinity
Node Anti Affinity
Pod Affinity
Taints
Tolerations
Assigning Pods to Nodes
Pod Overhead
- When a pod is run on a node, the pod itself takes an amount of system resources. These are additional to the resources needed to run the containers inside the pod.
- Specified in the runtime class configuration of a pod, this allows the k8s to take into account the overhead resources in addition to the actual pod resources.
- Overhead resources are the resources needed to run the actual pod and not the application inside the pod
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-fc
handler: kata-fc
overhead:
  podFixed:
    memory: "120Mi"
    cpu: "250m"
- Pod overhead is considered during the admission time of a pod by the admission controller.
- During admission, the pod spec is mutated to include the overhead of the runtime class
- If the podspec already has the overhead field, the pod will be rejected. #to-try
- Resource Quota section of https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/ #to-read
Pod Scheduling Readiness
Objects
Labels & Selectors
Namespaces
Annotations
Field Selectors
Finalizers
Owners & Dependents
Administration
Building a Cluster from scratch
- One master and two worker nodes
- Minimum server requirement is 2 GB RAM and 2 CPUs per node
- Master nodes usually require less resources than worker nodes
- Basics of TLS Certificates
- A way to establish trust and secure communication
- Symmetric Encryption
- Use a random string to encrypt a piece of message
- Random string is called Encryption Key
- Same key is used to decrypt the data
- Asymmetric Encryption
- Separate encryption and decryption keys also called Key Pair
- Data encrypted with one key can only be decrypted with the other key of the pair
- In this case, the server first generates a key pair and sends the public key to the client
- The client uses this key to encrypt information which can only be decrypted using the server's private key
- Certificate Authority
- Issues digital certificates that certifies the ownership
- These digital certs contain the public key of the website and its subdomains and are signed by the CA
- Client Certificates
- The server can also request a certificate from the client to verify the client's identity
- Trusted vs UnTrusted CA
- Browsers ship with the public keys of the official list of trusted CAs
- Cluster Installation
- Steps
- Deploy a container runtime on every node
- Install kubelet on every node that runs as a normal linux process
- Deploy the control plane components (API server, scheduler, controller manager, etcd, etc.) as pods on the master node
- Deploy the kube-proxy pod on all nodes
- Static Pods
- Just like normal pods but are directly scheduled by kubelet without need of k8s master components. i.e. without control plane
- Kubelet watches a specific location on the node it's running on: /etc/kubernetes/manifests
- Any pod manifest present in this location is scheduled as a static pod
- Kubelet watches static pods and restarts them if they fail; the controller manager does not do this
- Static pods are suffixed with the name of the node they are running on
- Certificates
- Controls how one k8s component/node can communicate with the other in a secure manner
- K8S components use mTLS
- Communications
- Almost every component talks to the API Server, which means all components should be able to supply a certificate to the API Server in order to authenticate themselves
- Generate self signed CA Certificate for K8S (Cluster root CA)
- This self signed CA Certificate is used to sign all client and server certs used in the cluster
- Certs are stored in /etc/kubernetes/pki
- API Server will have a server certificate; scheduler and controller manager will have client certificates
- API Server also talks to etcd & kubelet, so the API Server will have its own client cert while etcd and kubelet will have server certs
- Kubelet also talks back to API server
- Admin Users also talk to API Servers and hence admins need their own client certificate
- This cert should also be signed by self signed CA
- List of certs to create
- Generate self signed CA Certificate - Cluster root CA
- Sign all client and server certificates
- Server cert for API Server
- Client cert for scheduler and controller manager
- Server cert for etcd and kubelet
- Client cert for API server to talk to kubelets and etcd
- Client cert for kubelet to authenticate to API Server
- Client cert for k8s admins
- Kubeadm
- Bootstraps all the above configs/certs
- Maintained by K8S
- It only bootstraps. Does not provision the cluster
- Container Runtime
- Not a k8s component
- Used to run containers
- Every node needs to have a container runtime
- Master node needs this to run control plane components as containers
- Worker nodes needs this to run application workloads
- CRI is the interface that can be implemented by any container runtime so that k8s can use it
- containerd and CRI-O are popular container runtimes
- AWS/GCP/Azure use containerd as the container runtime
- Installing containerd
- Installing kubeadm, kubelet and kubectl
- kubeadm
- Run kubeadm init only on the master node; it initializes the control plane, i.e.
  - Generates the /etc/kubernetes folder
  - Generates a self signed CA
  - Generates static pod manifests into /etc/kubernetes/manifests (to be detected by kubelet to start static pods)
- Make all necessary configurations
sudo apt-get install -y kubelet kubeadm kubectl
Upgrading a cluster
- https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-upgrading-your-clusters-with-zero-downtime
- 2 Steps
- Upgrading the master nodes
- During the upgrade, the control plane components will not be accessible, but there won't be application downtime; the worker nodes will still be functioning
- But if there is an issue on the worker nodes, it won't be fixed as no master nodes are running
- Having 2 or more master nodes helps with the above problem
- kube-apiserver, controller-manager, kube-scheduler, kubelet, kube-proxy and kubectl needs to be upgraded
- In addition to above, kubeadm will take care of upgrading etcd and coredns as well
- The network plugin needs to be upgraded separately
- Version restrictions during upgrade:
- Not all components need to be on the same version, but there are some rules
- kube-apiserver should be the latest amongst all versions
- controller-manager and kube-scheduler can be 1 version behind
- kube-proxy and kubelet can be 2 versions behind
- kubectl can be one minor version above or below the API server
- Upgrade kubeadm and then run kubeadm upgrade apply version
- kubelet and kubectl needs to be upgraded separately as it was installed separately
- Since kubelet is responsible for running the pods on a node, including the static pods for master components, the node should first be drained before upgrading kubelet
kubectl drain node-name
- Will evict all pods safely and the node will be marked unschedulable
- Draining will cordon the node as well
- After upgrading kubelet and kubectl, make the master node schedulable again using kubectl uncordon master
- Upgrading the worker nodes
- Upgrade kubeadm
- Drain the node
- This means all pods on the node will be scheduled on another node
- Node gets marked as unschedulable
- Uncordon the worker node to make it schedulable again
- If there are 2 nodes and all pods have at least 2 replicas, there will be no downtime during this upgrade
- Draining a node
- Done by kubectl drain node-name
  - Will first mark the node as unschedulable
  - Then evicts all pods
- kubectl cordon node-name will only mark a node as unschedulable without evicting pods on the node; kubectl uncordon node-name marks it schedulable again
- k8s only officially supports the 3 most recent minor versions
- Upgrade a cluster 1 minor version at a time
kubeadm upgrade plan
kubeadm upgrade apply version
Certificate Management
Building a cluster
Cluster Installation
Node Configuration
Container Runtime Interface
kubeadm, kubelet and kubectl
Connecting to a cluster
Network Configuration
Container Communication
Container Network Interface
Network Plugin
Garbage Collection
Troubleshooting
- Check if pod is running ⇒ kubectl get pod name
- Check if service is forwarding the request ⇒ kubectl get ep or kubectl describe service name
- Check if service is accessible by pinging it from one of the pods in the cluster
nc SERVICE_IP SERVICE_PORT
ping SERVICE_NAME
- Check application logs
- Check pod status and recent events ⇒ kubectl describe pod POD_NAME
Debug with Temporary Pods
- Debug with BusyBox
- Has all tools pre installed for debugging
kubectl run debug-pod --image=busybox -it
kubectl exec -it debug-pod -- sh
- Directly run a command without opening a shell: kubectl exec -it debug-pod -- sh -c "printenv"
- Debug a running pod by logging into it: kubectl exec -it pod-name --container container-name -- /bin/sh
- Debug by adding an ephemeral container: kubectl debug -it pod-name --image=busybox --target=container-to-debug
Kubectl format output
- Get more detailed output using JSON and JSONPath (a query language for JSON)
kubectl get pod -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}'
- Custom Columns
kubectl get pod -o custom-columns=POD_NAME:.metadata.name,POD_IP:.status.podIP
Troubleshoot Kubelet & Kubectl
- Kubelet
- Runs as a linux process, so SSH into the node
- service kubelet status to check the status of the service
- journalctl -u kubelet to see kubelet logs
- sudo systemctl restart kubelet to restart the kubelet service
- Kubectl
- Sometimes kubectl returns errors or just hangs
- Check the kubectl config at ~/.kube/config
  - Check if the cluster certificate data and server are correct
- kubectl cluster-info dump provides the complete cluster info
Pod Access Problems
Objects
- Server Side Validation
- The API server offers server-side field validation that detects unrecognized or duplicate fields in an object. It provides all the functionality of kubectl --validate on the server side.
- The options are Strict, Warn & Ignore. Strict is the default
- Names & IDs
- Each namespace can have only one resource of a given kind with a given name
- For non unique user provided attributes, use labels and annotations
- Names must be unique across all API versions of the same resource
Labels & Selectors
Labels
- Key Value pairs attached to objects
- Used for identifying objects
- Can be attached to objects at creation time, added later, modified
- The keys of a label can have 2 parts separated by a /
- An optional prefix which should be a DNS subdomain
- A name
- Well Known Labels: https://kubernetes.io/docs/reference/labels-annotations-taints/
Selectors
- Equality based selector: (=,== , !=)
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
  - name: cuda-test
    image: "registry.k8s.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100
- Set based selector: (in, notin, exists)
- Just mentioning the keys without values matches if the label just exists
- If there are multiple selectors, all of them are compared using a logical AND
- Using Selectors in API
- List and watch commands can use -l to pass label selectors
kubectl get pods -l environment=production,tier=frontend
kubectl get pods -l 'environment in (production),tier in (frontend)'
- Using selectors in manifests
selector:
  matchLabels:
    component: redis
  matchExpressions:
  - { key: tier, operator: In, values: [cache] }
  - { key: environment, operator: NotIn, values: [dev] }
Namespaces
- Logical grouping of resources inside a cluster
Miscellaneous
Best Practices
Configuration Best Practices
Production Environment
Resource Profiles
Automated Placements
Observability
Probes
Liveness Probes
- Lets k8s know if the app inside the pod is alive
- Configured using livenessProbe; periodSeconds defines how often the check repeats
- 3 types
  - Exec command ⇒ non-zero exit code is considered an error
  - tcpSocket ⇒ success if a TCP connection can be opened
  - httpGet ⇒ 2xx-3xx is considered success
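- A sketch of an httpGet liveness probe (path, port and image are assumptions):
spec:
  containers:
  - name: app
    image: my-app:1.0        # assumed image
    livenessProbe:
      httpGet:
        path: /healthz       # assumed health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10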
Readiness Probes
- Checks if the pod is ready to receive requests
- Pods only receive traffic through Services once the readiness probe succeeds
Startup Probes
- Verifies if a container has started
- Runs before any other probe; liveness and readiness checks begin only after it succeeds
- Executed only at startup, unlike readiness probes which are executed periodically
Operators
Maturity Level
Operator Framework
Custom Resource Definition
APIs
API Groups
API Versions
Deprecations
Production
Best Practices
Behavioral Patterns
Batch Jobs
Periodic Jobs
Daemon Service
Singleton Service
Stateful Service
Commands
kubectl options
kubectl cluster-info
kubectl create --help
kubectl auth --help
kubectl auth can-i verb resource --as user
kubectl scale deployment deployment --replicas=X
kubectl rollout status deployment/nginx-deployment
Tools & Services
Crossplane
Kustomize
Helm
Kubeshark
Resources
Tasks
- [ ] Add code snippets wherever possible
- [ ] Go through the k8s docs and follow all links
Deployments
- Do not run naked pods
- Offer scalability and reliability of the application
- Provides updates and update strategy
- Automatically restart pods if anything goes wrong with the pods
- If replicas is not set, default value is 1
- Scalability
- kubectl scale deployment name --replicas=n
- kubectl edit deployment name
- Deployment Updates
- Supports zero downtime application updates
- When an update is applied, a new replica set is created
- After a successful update, the old RS is scaled down (but kept for rollback by default)
- The strategy field defines how to handle updates
- Types
  - Rolling
  - Recreate ⇒ used if the app does not support running multiple versions at the same time
Labels, selectors and annotations
- Label
- Key value pair to provide additional info
- Used to connect related resources
- Deployment finds its pods using selectors
- Services finds its endpoint pods using selectors
- Used in network policies to control access
- Syntax: --selector key=value
- app=app_name is an automatically created label
- kubectl get deployments --show-labels displays all labels in the deployments
- Modifying a deployment to add a label does not add the label to the pods managed by the deployment
- Max length of a label is 63 chars
- Annotation
- Provide detailed metadata in an object
- Cannot be used in queries
- Used for licenses, maintainer etc
- kubectl rollout history and kubectl rollout undo
- Every deployment keeps a record of the rollout history
- Question: How to get events/Alerts for new replicas added by HPA?
Networking
Ingress
- Make applications easily accessible by using DNS provided URLs
- Purpose: Expose services to outside
- Works by adding entries to DNS
- Ingress doesn't make sense for ClusterIP
- Works only for NodePort & Load Balancer
- Ingress can do load balancing
- Can Ingress load balance across services running in 2 clusters?
- Ingress needs an Ingress Controller which is not available by default
- Exposes HTTP & HTTPS Routes
- Traffic is controlled by the rules defined on the ingress resources
- Can terminate SSL/TLS
- Offers name based virtual hosting.
- Options for Ingress Controllers
- HA Proxy
- NGinx
- Traefik
- Kong
- Contour
- Ingress Rule
- Optional Host
- If no host is specified, the rule applies to all http traffic
- A list of path and its backend
- Backend consists of service name and service port
- Backend Types
- Simple Fanout ⇒ Traffic routed to multiple backends; this helps minimize the number of load balancers
- Name based virtual hosting ⇒ Traffic incoming on a specific name is routed to a specific service
- TLS Ingress → Uses a TLS secret to ensure TLS termination at the load balancer
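- A minimal Ingress sketch with one host rule and TLS termination (host, service and secret names are assumptions):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls        # assumed TLS secret
  rules:
  - host: app.example.com      # assumed host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service   # assumed backend service
            port:
              number: 80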
Storage
- File stored in the container will only live as long as the container itself
- Pod Volumes can be used to allocate storage that outlives a container and stays available during pod lifetime
- Pod spec can contain volume
- By default volume will be ephemeral
- This allows containers in a pod to share storage
- Question: How to share storage between containers in the same pod?
- PV ⇒ Persistent Volume
- Allows pods to connect to external storage
- There can be multiple PVs pointing to different external storage solutions
- PVs by themselves are independent
- PVC is required to bind a PV to a pod
- Request for storage
- Specify type and size
- PVC is bound to a pod
- ConfigMaps are specific volume objects that connect to config files and variables
- Secrets do the same, but by encoding the data in it
- Configuring volume
- Decide if the volume is a pod local volume or persistent volume
- Volumes need to be specified in spec.volumes and mounted into containers using spec.containers.volumeMounts
- Volume Types ⇒ emptyDir, hostPath, azureDisk, awsElasticBlockStore, gcePersistentDisk, cephfs, rbd
- Configuring PV Storage
- A PV is an independent object that connects to external storage
- Use PVC to bind a PV to a pod
- Question: What is a ReclaimPolicy?
- Configuring PVC
- A PVC requests access to a PV according to specific properties
- AccessModes
- Availability of Resources
- PVC Bind
- After connecting to a PV, the PVC will show as a bound
- Question: how to make a PVC link to a specific PV when many PVs are defined?
ConfigMaps & Secrets
- Special types of volumes
- Can be used in 3 ways
- To make variables available within a pod
- Provide command line arguments
- Mount on a location where the application expects the config file
- When mounting a config map as a volume, files are created at the mount point named after each key, with the value as the file content
- Secrets are base64 encoded config maps
- CM & secrets must be created before the pods that are using them
- Source of CM
- Directories: uses multiple files in a directory
- Files: Puts the content of a file in the CM
- Literal Values: Key value pairs which can be used as variables and CLI args
- Irrespective of the source, the usage from pod remains the same
- ConfigMap as a config file
- In this, the pod is using the CM as a mounted volume which contains the config file
- The CM itself contains the config file
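- A sketch of mounting a CM as a config file (names and key are assumptions):
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  volumes:
  - name: config
    configMap:
      name: app-config        # assumed CM containing the key app.conf
  containers:
  - name: app
    image: nginx              # assumed image
    volumeMounts:
    - name: config
      mountPath: /etc/app     # file appears as /etc/app/app.conf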
- Secrets
- Store sensitive data
- Create like config maps
- Encoded but not encrypted
- Types
- docker-registry: used for connecting to a docker registry
- TLS: TLS Secret
- Generic: secret from a local file / directory / literal value
- Question: How to see the contents of auto generated secrets?
- Creation:
- Same as config map.
kubectl create secret
API Usage
Troubleshooting
- Determining a troubleshooting strategy
- When executing an API call
- kubectl → API Server → server writes to etcd → kube-scheduler gets data from etcd → talks to kubelet on the node → kubelet talks to the container runtime
- Kubectl describe is the first line of defence. It applies to any resource type
- Next step is kubectl logs to check container logs
- Pod States
- Pending: Pod has been validated by the API Server and an entry has been created in etcd but some prerequisite conditions have not been met
- Running
- Completed: Pod has completed its work
- Failed: Pod has finished but something went wrong; restart policy is Never
- CrashLoopBackOff: Pod has failed and the cluster has restarted it.
- Unknown: Pod status could not be obtained
- Kubectl describe
- Have a look at the events
- Look at app state
- Check last state and check exit code
- If the exit code is non-zero, use kubectl logs
- Analyzing pod access problems
- Check if labels in service and pods are matching
- Use kubectl get endpoints to check services and corresponding pod endpoints
- If ingress is used, check if ingress controller is working properly
- Check network policy - NP can be applied to pods, namespaces, IP ranges to restrict traffic in both directions
kubectl get netpol -A to check if any network policies exist
kubectl describe netpol name
- Label comparisons are case sensitive
- If a pod doesn't have an endpoint, there is something wrong with the service config, probably the labels
- Monitoring Cluster Event Logs
- Gives generic overview of everything
kubectl get events -o wide
kubectl describe resource_name
kubectl describe nodes
Observability
- Health Probes
- Startup Probe
- Legacy apps take time to start
- Readiness Probe
- Checks if the application is ready to serve requests
- Liveness Probe
- Periodically checks the application’s responsiveness
- Each probes offer 3 ways to verify health
- Custom Command - zero exit code means healthy
- HttpGet - 200 to 399 status code responses
- TCP Socket - Checks if connection can be established
- Each probe has a set of attributes to configure
- initialDelaySeconds (default: 0) - Delay in seconds until the first check is executed
- periodSeconds (default: 10) - Interval for executing the check
- timeoutSeconds (default: 1)
- successThreshold (default: 1) - Number of successful check attempts until the probe is considered successful after a failure
- failureThreshold (default: 3) - Number of failed checks before the probe is marked as failed and action is taken
- Use Kube YAML to validate the manifests
- Common Pod Error Statuses
- ImagePullBackoff - Image could not be pulled from the registry
- ErrImagePull - Image could not be pulled from the registry
- CrashLoopBackOff - Application or command run in container crashes
- CreateContainerConfigError - ConfigMap or Secret referenced by the container cannot be found
- Check for number of restarts in pod statuses
- Ephemeral Containers
- Some containers are very minimal that they wont even provide a shell
- Google Distroless
- k8s.gcr.io/pause:3.1
- Ephemeral containers are meant to be disposable and can be used for troubleshooting
kubectl alpha debug -it minimal-pod --image=busybox
- Troubleshooting Services
- Check if labels are mapped properly
- Get endpoints for a service
kubectl get endpoints service_name
- Open a quick tmp container and query the cluster IP
kubectl run tmp --image=busybox -it --rm -- wget -O- 10.99.155.165:80
- Check if port mapping is done correctly between the service and the pod
- Monitoring
- Metrics Server should be installed
kubectl top nodes
kubectl top pod pod_name
Scheduling
- NodeSelector
- Allows specifying a label that a node must have for a pod to run on it
- Label must exist already on the node before deployment of the pod
- Pod will only be scheduled when a label is matched
- No conditional logic. Simple label equality is allowed
- NodeAffinity
- Complex conditions than node selector
- 2 Types
- requiredDuringSchedulingIgnoredDuringExecution
  - The expression defined must be matched during scheduling
  - Does not evict an already-scheduled pod if the expression later becomes invalid
- preferredDuringSchedulingIgnoredDuringExecution
- A node matching the expression is only a preference not a mandatory requirement
- The pod will still get scheduled if no node matching expression exists
- NodeSelectorTerms
- Specify the expressions. Expressions are evaluated as OR, so if one of the expressions is true, the whole selector matches
- Node Affinity Operators
- In: List of values that the value of the label can have
- NotIn: Opposite of In
- Exists: Checks if a label exists
- DoesNotExist: Opposite of Exist
- Gt: Numeric Comparison
- Lt: Numeric Comparison
- NodeAffinityWeight
- Each expression can have a weight
- Question: How to check the expression weights during scheduling?
- PodAffinity
- Specifies where pods can be located relative to each other
- Used to colocate pods or separate pods
- LabelSelectors are used to match pods
- Same affinity types as node affinity
- TopologyKey
- PodAntiAffinity: Opposite of pod affinity
- Taints
- NodeAffinity attracts pods to a node. Taints repel them
- We can then use toleration to allow particular pods to be scheduled
- Taints are applied to nodes and tolerations are applied to pods
- Taint Effects
  - NoSchedule
  - PreferNoSchedule
  - NoExecute
kubectl taint node node-name key=value:effect
- Tolerations
- Allows a pod to be scheduled despite a taint
- Toleration Operators
- In order for a toleration to work, the taint and the toleration should have the same key, value and effect
- Use cases for Taints & Tolerations
- Dedicated Nodes
- Specialized Hardware
- Eviction
- Question: Taints & Tolerations vs Affinity?
- Node Condition bases taints
node.kubernetes.io/memory-pressure
node.kubernetes.io/disk-pressure
node.kubernetes.io/pid-pressure
node.kubernetes.io/unschedulable
node.kubernetes.io/network-unavailable
Operators
- Bundle, package and manage k8s custom controllers
- The difference between an operator and other methods like Helm is that if something changes in a deployment/config etc, there is no reconciliation; operators allow for reconciliation
- Operators do the following: observe, diff, apply
- Handles upgrades of the images
- Auto healing of the components managed by the operator
- https://operatorframework.io
- Go lang based operators - Most feature rich
- Ansible based operators
- Helm based operators
- Capability Model (https://operatorframework.io/operator-capabilities/)
- Level 1 ⇒ Basic Install - Automated application provisioning and configuration management
- Level 2 ⇒ Seamless Upgrades - Patch and minor version upgrades supported
- Level 3 ⇒ Full Lifecycle - App Lifecycle, storage lifecycle (Backup, failure recovery)
- Level 4 ⇒ Deep Insights - Metrics, alerts, log processing and workload analysis
- Level 5 ⇒ Auto Pilot - Horizontal/Vertical scaling, auto config, tuning, abnormal detection, scheduling tuning
- Popular Operators
- Prometheus
- Elasticsearch
- Istio
- ArgoCD
- MinIO
Security
- Security 101
- Layered Defence/Defence in Depth
- Security should be redundant
- Least Privilege
- Runtime Security
- Host Security
- Network Security
- Threat Detection
- Image Hygiene
- SecOps
- Attacks
- Attacking API Server
- By default every pod gets a service account
- Each service account has a bunch of roles that allow the pod to talk to the API server
- Get token from pod (Like reading a file from the file system)
- Use token to attack cluster API server
- Get secrets to attack further
- Mitigation
- RBAC
- Roles given to users and service accounts
- Each role should have permission to perform some operation
- RBAC settings apply per namespace
- API Server Firewall
- Restrict access to API server to certain IP addresses
master-authorized-networks
- Network Policy
- Restrict access to important services to only the pods that require it
- Specify access via labels
- Network plugins should support it
- Get access to cluster components
- Manipulate cluster components like etcd
- Mitigation
- Encrypt data in etcd
- Use auth and firewalls to restrict access to etcd
- Security Issues
- Images
- Code from untrusted registries
- Vulnerabilities in tools of OS or code libraries
- Bloated base images
- Mitigation
- Use approved lean images
- Create list of trusted registry
- Image Scanning
- Used signed images
Hardening Clusters
Hardening a Kubernetes (K8s) cluster from a security perspective involves a multi-layered approach that encompasses several aspects of the Kubernetes environment. Here’s a comprehensive guide to securing your Kubernetes cluster:
- Cluster Configuration and Management
- Use Secure Access Controls: Implement role-based access control (RBAC) to limit access based on the principle of least privilege.
- Audit Logging: Enable audit logging to keep track of all actions and changes within the cluster. This can help in forensic analysis in case of security incidents.
- Regularly Update and Patch: Keep your Kubernetes components and dependencies up to date to protect against vulnerabilities.
- Network Policies and Security
- Implement Network Policies: Define network policies to control the traffic between pods within a cluster, ensuring only authorized pods can communicate with each other.
- Use a Service Mesh: Implement a service mesh like Istio for enhanced security features, including secure service-to-service communication.
- Encrypt Data in Transit: Ensure that all communications within your cluster are encrypted using TLS.
- Cluster Authentication and Authorization
- Secure API Access: Use Transport Layer Security (TLS) for all API communication and ensure that all clients authenticate before accessing the Kubernetes API.
- Use Strong Authentication Mechanisms: Implement strong authentication mechanisms such as multi-factor authentication for accessing the Kubernetes cluster.
- Limit API Access: Minimize the number of components and users that have access to the Kubernetes API.
- Encrypt data at rest in etcd
- Pod Security Policies and Practices
- Apply Pod Security Policies (PSP): Use PSPs to control the security specifications pods must adhere to. Note that PSPs were deprecated in Kubernetes 1.21 and removed in 1.25; use Pod Security Admission or policy engines like OPA Gatekeeper or Kyverno instead.
- Use Security Contexts: Define security settings for your pods and containers, including permissions and user/group IDs.
- Secrets Management
- Encrypt Secrets at Rest: Ensure that secrets stored in Kubernetes are encrypted at rest using a key management system.
- Manage Secrets Securely: Use external secrets management systems like HashiCorp Vault for better secrets management and rotation policies.
- Container Security
- Use Trusted Base Images: Only use trusted container images and regularly scan images for vulnerabilities.
- Implement Image Signing and Verification: Ensure that only signed images are used in your cluster to prevent tampering.
- Run Containers as Non-Root: Avoid running containers as root unless absolutely necessary to reduce the risk of privilege escalation.
- Monitoring and Alerting
- Implement Monitoring Solutions: Use tools like Prometheus and Grafana for monitoring the health and performance of your cluster.
- Set Up Alerting: Configure alerting mechanisms to notify you of potential security issues or anomalies.
- Disaster Recovery and Backup
- Regular Backups: Regularly back up your cluster’s etcd data to ensure you can recover from data loss or corruption.
- Disaster Recovery Plan: Have a disaster recovery plan in place, including procedures for restoring from backups in case of catastrophic failures.
- Security Tools and Audits
- Use Security Tools: Implement security tools like Aqua Security, Sysdig Secure, or Kube-Bench to continuously scan and audit your cluster for vulnerabilities.
- Conduct Regular Security Audits: Regularly audit the cluster's security posture and compliance with security best practices and standards.
- Education and Awareness
- Security Training for Teams: Ensure that your team is aware of best practices for Kubernetes security and the specific configurations and tools used in your environment.
Commands
kubectl scale
kubectl expose
kubectl autoscale
kubectl create service
kubectl get endpoints
kubectl describe
minikube addons list
kubectl explain ingress.spec.rules
kubectl exec -it name -- bash
kubectl create cm variables --from-env-file=variables
kubectl create cm special --from-literal=key=value
kubectl create cm config --from-file=file
kubectl create secret generic name --from-file=key=/filepath
kubectl get pods
kubectl describe pod pod_name
kubectl get events
kubectl get pods --show-labels
kubectl label pod pod_name key=value
kubectl label pod pod_name key=value1 --overwrite
kubectl label pod pod_name key-
Questions
- How to dynamically mention the port numbers inside the manifest?
Topics
- Crossplane
- Pod Restart Policy
- Production Best Practices
- Operators
- Kustomize
- Secrets Management
- GitOps
- Cluster Role
- Cluster Role Binding
- Helm
- Resource Profiles
- Pod Priorities
- Types of Deployments
- Rolling
- Canary
- Blue Green
- Replace
- Health probes
- Process health checks
- Liveness probes
- Readiness probes
- Startup probes
- Managed Lifecycle
- Sigterm, sigkill, poststart hook, prestop hook
- Lifecycle hooks
- Automated Placements
- Behavioural Patterns
- Batch Jobs
- Periodic Jobs
- Daemon Service
- Singleton Service
- Stateless Service
- Stateful Service
- Service Discovery
- Self Awareness
- Structural Patterns
- Init Container
- Sidecar
- Adapter
- Ambassador
- Security Patterns
- Process Containment
- Network Segmentation
- Secure Configuration
- Access Control
- Authentication
- Authorization
- Admission Controllers
- Subject
- Role Based Access Control
- Advanced Patterns
- Controller
- Operator
- Elastic Scale
- Image Builder
- Terraform
- Ansible
- Kube Proxy
- Pod Security Policy
- CNI
- Core DNS
- Monitoring
- Logging
- Cluster Logging
- Node Logging
- Troubleshooting
- Control Plane Nodes
- Worker Nodes
- Services
- Opening an interactive shell
- Observability
- Multitenancy
- SSL Certificates
- Image Hardening
- Cluster Hardening
- Host Hardening
- Distributed Tracing
- Packet Capture
- Service Graph
- Visualization of network flows
- Encryption
- Service Mesh
- Cluster Federation
- Multi node Cluster
- Service Accounts
- Role Bindings
- Backup & Restore etcd
- Upgrading a cluster
- PV & PVC
- Access Mode
- Volume Mode
- Reclaim policy
- Finalizers
- Owners & Dependents
- Object Names & IDs
- Leases
- Cgroup v2
- Container Runtime Initiative
- Garbage Collection
- Ephemeral Containers
- Disruptions
- Automatic Cleanup
- Cron Jobs
- Gateway API
- Endpoints
- Endpoint Slices
- Network Policies
- Dual Stack
- Topology Aware routing
- Service Cluster IP Allocation
- Service Internal Traffic Policy
- Volumes
- Persistent Volumes
- Projected Volumes
- Storage Classes
- Volume Attribute Classes
- Dynamic Volume Provisioning
- Volume Snapshots
- Volume Cloning
- Node Specific Volume Limits
- Volume Health Monitoring
- Bind Mounts
- Tmpfs mounts
- Data Backups
- Security
- Pod Security Standard
- Pod Security Admission
- Policies
- Limit Ranges
- Resource Quotas
- Process ID limits
- Node Resource Managers
- Scheduling, Preemption & Eviction
- Node Affinity & Pod Affinity
- Assigning pods to nodes
- Pod overheads
- Pod scheduling readiness
- Pod topology spread constraints
- Taints & Tolerations
- Scheduling Framework
- Dynamic Resource Allocation
- Scheduler Performance Tuning
- Resource Bin Packing
- A scheduling strategy to optimize the resource utilization in the cluster.
- Places pods on nodes that maximizes CPU, memory and other resources while still respecting scheduling constraints and requirements of each pod
- Question: How to do bin packing?
- Node Pressure Eviction
- API initiated Eviction
- Performance
- Cluster Admin
- Certificates
- Logging Architecture
- System Logs
- Traces for k8s components
- Proxies in k8s
- API Priority & fairness
- Installing Addons
- Extending
- Network Plugins
- Device Plugins
- Custom Resources
- Operator Pattern
- Pod Security Policy
- Manifest Validation
- ApplicationSet
- HPA & VPA
- Istio
- PodDisruptionBudgets
- Increases resiliency of the application
- Ensures that a minimum number of pods of a replicated application remain available during voluntary disruptions
- Voluntary Disruptions can occur during
- Node is drained to perform maintenance or upgrades
- Pods are deleted to scale down a deployment
- Pods are moved during a rescheduling
- Key Components
- MinAvailable ⇒ Can be an absolute number or percentage of pods
- MaxUnavailable ⇒ Can be an absolute number or percentage of pods
- K8S in IOT
- Pause Container
- Also known as pod infrastructure container
- A hidden container called the pause container runs in every pod
- The single job is to hold the network namespace
- Created before the business application container
- IP is given to the pause container
- If the pause container is deleted, k8s recreates the pod
- Sidecar
- Envoy Proxy
- High performance, OSS network proxy
- Operates at layer 7
- Can act as a service mesh which is both a proxy and reverse proxy
- Provides load balancing
- Architecture
- Downstream/Upstream
- Clusters
- Listeners
- Network Filters
- Threading Model
- Connection Pools
- Init Container
- Admission Hook
- PodSecurityPolicy
- ForegroundCascadeDeletion
- BackgroundCascadeDeletion
- Admission Controller
- Plugins that intercept requests to API server before the persistence of object configuration in etcd
- Runs after the request is authenticated and authorized.
- They can modify or reject requests to enforce certain policies or augment the incoming request based on specific logic
- 2 Types
- Validating Admission Controllers
- Validate the request after it is processed by the mutating controllers
- Failed requests are rejected and an error is returned
- Mutating Admission Controllers
- Modify incoming requests before validation
- Inbuilt Admission Controllers
- NamespaceLifecycle: All requests to non-existent namespaces are rejected
- LimitRanger: Enforces the defaults and limits defined in LimitRange objects
- ServiceAccount
- SecurityContextDeny
- PodSecurityPolicy
- MutationAdmissionWebhook & ValidatingAdmissionWebhook: Custom admission policies
- Draining a Node
- Is a process of evicting all pods from the node for maintenance or upgrades
- Workloads should be shut/moved gracefully
- Steps
- Mark node as unschedulable
- Drain the node
kubectl drain node_name --ignore-daemonsets --delete-local-data
- If pods can't be evicted because of PDBs or other scheduling policies, the drain command will wait and retry
- Perform Maintenance
- Uncordon the node
Network Policies
- By default any pod can talk to any pod using IP or DNS name across namespaces.
- Network policies control traffic from pod to pod
- The rules are matched using label selectors.
- NP defines the direction of traffic and whether to allow or disallow it
- Incoming traffic is called ingress and outgoing is called egress
Storage
Security
- kube-apiserver is a key component to secure the cluster
- TLS certificates for all communication within the cluster
- K8s does not manage user accounts; it integrates with other IdPs
- Authentication Mechanisms
- Static Password Files:
--basic-auth-file=users.csv
- Static Token File:
--token-auth-file=users.csv
- Certificates
- IdP
- Kube Config
- Clusters
- Contexts
- Combines users and clusters
- Users
kubectl config use-context context_name
- API Groups
- Core Group ⇒ Has all the core resources
- Named Group ⇒ New additions
- Authorization
  - Types
- Node
- ABAC
- Used by users / group of users
- Managed by policy files
- kube-apiserver needs to be restarted for any changes to ABAC
- RBAC
- Instead of associating policies with users directly, a role is created first and then associated with users
- Webhook
- Used for outsourcing auth mechanism
- Example: Open Policy Agent
- AlwaysAllow
- AlwaysDeny
- When multiple auth modes are activated, the order of evaluation is the order in which the modes are specified when starting the kube-apiserver
- RBAC
- Create a role definition file
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
rules:
- apiGroups: [""] # [""] represents the core api group
  resources: ["pods"]
  verbs: ["list", "get"]
  resourceNames: []
- Link user to the role by using a role binding
- Link a user to the role by using a role binding
kubectl auth can-i create deployments to check access
kubectl auth can-i create pods --as dev-user to impersonate and check
- Roles and Role Bindings are namespaced
- Cluster Scoped Resources
- Nodes
- PV
- ClusterRoles
- ClusterRoleBindings
- CSR etc
- Admission Controllers
- Happens after request authorization is successful
- Enforces validation on the manifest and applies some changes if needed
- Pre Built Admission Controllers
- AlwaysPullImages
- DefaultStorageClass
- EventRateLimit
- NamespaceExists
- Flags
- Flags: --enable-admission-plugins and --disable-admission-plugins
- It can not only perform validations and accept/reject requests but also perform operations on the backend, like NamespaceAutoProvision, which creates a namespace if it doesn't exist
- Validating Admission Controller
- Mutating Admission Controller
- Example: DefaultStorageClass
- Usually mutating controllers are applied first before the validating controllers
- Mutating & Validating Admission Webhooks
- Point the webhooks to a service running custom controllers either inside the cluster or outside the cluster
- Request is an AdmissionReview object; response is an AdmissionReview with the response.allowed field containing the result
- Can be any API server written using any technology
- POST /validate and POST /mutate are the endpoints for the webhooks
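- A sketch of registering a validating webhook that points at an in-cluster service (all names are assumptions; caBundle omitted):
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-policy-webhook
webhooks:
- name: pod-policy.example.com     # assumed webhook name
  clientConfig:
    service:
      namespace: default
      name: webhook-service        # assumed service running the custom controller
      path: /validate
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  admissionReviewVersions: ["v1"]
  sideEffects: None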
- API Versions
- API Deprecations
- Deprecation policy Rules
- API elements may only be removed by incrementing the version of the API Group. i.e a resource in v1alpha1 can be removed only after the version changes to v1alpha2
- API objects may be able to round trip between versions in a given release without information loss with the exception of whole REST resources that do not exist in some versions
kubectl convert -f file_name to migrate manifests from an old API to a new API
- CRD
- Each custom resource has its own controller that is responsible for managing the custom resource
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: # fully qualified name
spec:
  scope: Namespaced # or Cluster
  group: # api group
  names:
    kind:
    singular:
    plural:
    shortNames:
    - # short name
  versions:
  - name:
    served: true
    storage: true
    schema:
      openAPIV3Schema:
- Just adding a CRD will store it in etcd. No logic will be executed
- A custom controller is required for custom logic
- A controller is an object that runs in a loop and listens to changes in resources it is interested in
- Operator Framework
- Used to deploy the CRD and the controller together
- Operators do the job that humans manually do in terms of deploying, backing up, fixing issues etc
Volumes
- Each container in a pod gets its own temporary file system that cannot be accessed by another container
- Volumes can be used to exchange data between a main application container and a sidecar
- Volume Type
- emptyDir ⇒ Empty directory in pod with RW access. Only persisted for the lifespan of a pod. Useful for exchanging data between containers of a pod
- hostPath ⇒ File or directory from the host node’s filesystem
- configMap, secret ⇒ Provides a way to inject configuration data
- nfs ⇒ Network file system
- PVC ⇒ Claims a persistent volume
apiVersion: v1
kind: Pod
metadata:
  name: business-app
spec:
  volumes:
  - name: logs-volume
    emptyDir: {}
  containers:
  - image: nginx
    name: nginx
    volumeMounts:
    - mountPath: /var/logs
      name: logs-volume
- Static vs Dynamic Provisioning
- Static ⇒ Create the storage device first and create a PV
- Dynamic ⇒ a PV will automatically be created by setting a storage class name using spec.storageClassName
- A storage class is an abstraction that defines a class of storage device, like fast vs slow performance (see the sketch below)
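- A minimal StorageClass sketch; the provisioner and parameters depend on the cluster's storage backend and are assumptions here:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: ebs.csi.aws.com   # assumed CSI driver
parameters:
  type: gp3                    # assumed backend-specific parameter
reclaimPolicy: Delete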
- PV can only be created using manifest and not via CLI
- Access Modes
- ReadWriteOnce
- ReadOnlyMany
- ReadWriteMany
- Reclaim Policy ⇒ Determines what should happen with the PV after it has been released from its claim. By default the object will be retained
apiVersion: v1
kind: PersistentVolume
metadata:
  name: db-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /data/db
- PVC
- The purpose is to bind the PV to a pod
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: db-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 512Mi
- Once the PVC is created, its status is set to Bound, which means the binding to the PV was successful
- Once the PV and PVC are defined, the volume can be mounted on the pod by specifying the PVC in the pod manifest
apiVersion: v1
kind: Pod
metadata:
  name: app-consuming-pvc
spec:
  volumes:
  - name: app-storage
    persistentVolumeClaim:
      claimName: db-pvc
  containers:
  - image: alpine
    name: app
    command: ["/bin/sh"]
    args: ["-c", "while true; do sleep 60; done;"]
    volumeMounts:
    - mountPath: "/mnt/data"
      name: app-storage
ISTIO
- Service Mesh
- Provides visibility on the interconnection between the pods
- Each pod has a sidecar container called istio-proxy which tracks all the calls
- The proxies are collectively called the Data Plane
- istiod runs in the istio-system namespace and gets data from the istio proxies
- Pods running in istio-system are called the Control Plane
EKS
- Components
- Control Plane
- Has at least 2 API server nodes and 3 etcd nodes that run across 3 AZs in a region
- Automatically detects unhealthy control plane nodes
- Worker Nodes & Node Groups
- Group of ec2 instances to run workloads
- Node Group: one or more EC2 instances deployed in an ASG
- All instances in a node group should be of the same type, run the same AMI and use the same worker node role
- Fargate Profiles
- Serverless
- Runs only on private subnets; needs a VPC with at least 1 private subnet
- VPC
Tools
- Velero - Backup of data and cluster state
- Kubeval
- Conftest
- Datree