SUSE Adaptive Telco Infrastructure Platform (ATIP)
SUSE ATIP is a platform designed for hosting modern, cloud native, Telco applications at scale from core to edge.
This section documents how to configure telco-specific features on ATIP-deployed clusters.
We will cover the following topics:
- BIOS configuration: BIOS settings to optimize the hardware performance for the real time kernel.
- Kernel image for Real Time: the real time kernel image to be used by the system.
- CPU Tuned configuration: Tuned configuration used to isolate CPU cores for the real time kernel.
- Multus-Calico configuration: Multus CNI configuration to be used by the Kubernetes cluster.
- SRIOV configuration: SRIOV configuration to be used by the Kubernetes workloads.
- DPDK configuration: DPDK configuration to be used by the system.
- Huge Pages: Huge Pages configuration to be used by the Kubernetes workloads.
- CPU Pinning configuration: CPU Pinning configuration to be used by the Kubernetes workloads.
- NUMA Aware scheduling configuration: NUMA Aware scheduling configuration to be used by the Kubernetes workloads.
- Metal LB configuration: Metal LB configuration to be used by the Kubernetes workloads.
BIOS configuration
Note: This configuration depends on the hardware vendor, so please check with your hardware vendor for the best configuration to use.
This section is important to optimize the performance of the real time kernel on the hardware, because some of these parameters can limit the performance of the real time kernel if set incorrectly. The following table shows the recommended configuration for the most common hardware vendors:
Option | Value | Description |
---|---|---|
Workload Profile | Telco Optimized | Telco profile to optimize the performance in the hardware. |
Boot Performance Mode | Max. Performance | Maximize the performance in the boot process. |
Hyper-Threading (Logical Processor) | Enable | This option enables Intel® Hyper-Threading Technology, converting processor cores (pCores) into logical cores (lCores). |
Virtualization Technology (XAPIC) | Enable | This option is for Extended Advanced Programmable Interrupt Controller (xAPIC) support for the Intel® Virtualization Technology for Directed I/O (Intel® VT-d) feature. |
Uncore Frequency Scaling | Disable | If enabled, Uncore Frequency Scaling (UFS) allows the uncore to operate at a lower frequency when the Power Control Unit (PCU) has detected low utilization. Conversely, UFS allows the uncore to operate at a higher frequency when the PCU has detected high utilization. |
CPU P-State Control (EIST PSD Function) | HW_ALL | Optimization of the voltage and CPU frequency during operation. |
CPU C-State Control | Disable | This option is for the CPU C-State Control feature, which provides power savings by placing the processor into lower power states when the processor is idle. |
CPU C1E Support | Disable | This option is for the CPU Enhanced Halt (C1E) feature, which provides power savings by placing the processor into a low power state when the processor is idle. |
AVX License Pre-Grant | Enable | If enabled, this option enables the pre-grant license level selection based on workload with the AVX ICCP Pre-Grant Level option. |
AVX ICCP Pre-Grant Level | Level 5 | This option selects a workload level for the Intel® Advanced Vector Extensions (Intel® AVX): Intel® AVX-512 Heavy. |
AVX P1 | Level 2 | This option serves a dual purpose: 1 -Specifies the base P1 ratio for Intel® Streaming SIMD Extensions (Intel® SSE) or Intel® AVX workloads. 2- Pre-grants a license level based on the workload level. |
Energy Efficient Turbo | Disable | This option allows entry into the Intel® Turbo Boost Technology frequency when the Power Control Unit (PCU) has detected high utilization. |
Turbo Mode | Enable | Enabling this Intel® Turbo Boost Technology mode setting allows the CPU cores to operate at higher than the rated frequency. |
GPSS timer | 0us | This option allows the Global P-State Selection (GPSS) timer to be reduced; it can be set in the range from 0 μs to 500 μs. |
LLC prefetch | Enable | This option enables Last Level Cache (LLC) hardware prefetch logic. |
Frequency Prioritization (RAPL Prioritization) | Disable | This setting controls whether the Running Average Power Limit (RAPL) balancer is enabled. If enabled, it activates per core power budgeting. |
Hardware P-States | Native with no Legacy Support | When enabled, this option allows the hardware to choose a Performance State (P-State) based on an OS request (that is, a legacy P-State). |
EPP Enable | Disable | When this option is enabled, the system uses the energy performance bias register for the Energy Performance Preference (EPP) input to make decisions on Performance State (P-State) or Processor Core Idle State (C-State) transitions. |
APS Rocketing | Disable | Rocketing mechanism in the HWP P-State selection pcode algorithm. Rocketing enables the core ratio to jump to max turbo instantaneously, as opposed to a smooth ramp up. |
Scalability | Disable | Core Performance to frequency scalability based on optimizations in the CPU. |
Native ASPM | Disable | Active State Power Management (ASPM) is off and not controlled by the BIOS or OS. |
Power Performance Tuning | OS Controls EPB | This option selects the BIOS or OS that controls the Energy Performance Bias (EPB) functionality. |
Workload Configuration | I/O sensitive | This option allows the system power and performance profile to be set to favor compute intensive workload or I/O sensitive workload. |
Dynamic L1 | Disable | This option applies only to the package-level setting to allow dynamically entering the lower power link state L1. |
Set Fan Profile | Performance | This option allows the fan profile to be set to Performance, Balanced, or Quiet. |
Cooling Configuration - Fan Speed Offset | Medium | This option allows the fan speed offset to be set to Low, Medium, or High. |
Kernel Real Time
The real time kernel image is not necessarily better than a standard kernel. It is a different kernel, tuned to a specific use case: the real time kernel trades throughput for lower latency. The real time kernel is not recommended for general purpose use, but in our case it is the recommended kernel for Telco workloads.
There are four top features:
- Deterministic Execution:
Get greater predictability – ensure critical business processes complete in time, every time and deliver high quality of service, even under heavy system loads. By shielding key system resources for high-priority processes, you can ensure greater predictability for time-sensitive applications.
- Low Jitter:
The low jitter built upon the highly deterministic technology helps to keep applications synchronized with the real world. This helps services that need ongoing and repeated calculation.
- Priority Inheritance:
Priority inheritance refers to the ability of a lower priority process to assume a higher priority when there is a higher priority process that requires the lower priority process to finish before it can accomplish its task. SUSE Linux Enterprise Real Time solves these priority inversion problems for mission-critical processes.
- Thread Interrupts:
Processes running in interrupt mode in a general-purpose operating system are not preemptible. With SUSE Linux Enterprise Real Time these interrupts have been encapsulated by kernel threads, which are interruptible, and in turn allow the hard and soft interrupts to be preempted by user-defined higher priority processes.
In our case, if you have installed a real time image like SLE Micro RT, the real time kernel is already installed and you do not need to install it again.
You can check it by looking at the kernel version and verifying that it contains the rt string at the end:
uname -r
5.14.21-150400.15.11-rt
For more information about the real time kernel, please visit https://www.suse.com/products/realtime/
CPU Tuned Configuration
The first thing is to create a Tuned profile for the CPU cores we want to isolate. In this case, we will isolate cores 1-30 and 33-62.
echo "export tuned_params" >> /etc/grub.d/00_tuned
echo "isolated_cores=1-30,33-62" >> /etc/tuned/cpu-partitioning-variables.conf
tuned-adm profile cpu-partitioning
Cannot talk to Tuned daemon via DBus. Is Tuned daemon running?
Trying to (re)start tuned...
Tuned (re)started, changes applied.
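You can check the active profile at any time with:
tuned-adm active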
Then we need to modify the GRUB options to isolate the CPU cores, as well as other parameters that are important for CPU usage.
Modify the following line in /etc/default/grub to add the CPU cores to isolate. These options are the most important ones to customize for your current hardware:
parameter | value | description |
---|---|---|
isolcpus | 1-30,33-62 | Isolate the cores 1-30 and 33-62. |
skew_tick | 1 | This option allows the kernel to skew the timer interrupts across the isolated CPUs. |
nohz | on | This option enables dyntick-idle mode, stopping the timer tick on idle CPUs. |
nohz_full | 1-30,33-62 | This kernel boot parameter is the current main interface to configure full dynticks along with CPU isolation. |
rcu_nocbs | 1-30,33-62 | This option offloads RCU callbacks from the listed CPUs, so they do not run on the isolated cores. |
kthread_cpus | 0,31,32,63 | This option restricts kernel threads to the listed (housekeeping) CPUs. |
irqaffinity | 0,31,32,63 | This option restricts interrupt handling to the listed (housekeeping) CPUs. |
processor.max_cstate | 1 | This option prevents the CPU from dropping into a sleep state when idle |
intel_idle.max_cstate | 0 | This option disables the intel_idle driver and allows acpi_idle to be used |
With the values shown above, we are isolating 60 cores and using 4 cores for the OS.
Let's modify the grub file with the previous values:
vi /boot/efi/EFI/sle_rt/grub.cfg
set tuned_params="skew_tick=1 nohz=on nohz_full=1-30,33-62 rcu_nocbs=1-30,33-62 tuned.non_isolcpus=80000001,80000001 nosoftlockup"
vi /etc/default/grub
GRUB_CMDLINE_LINUX="intel_iommu=on intel_pstate=passive processor.max_cstate=1 intel_idle.max_cstate=0 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=0 crashkernel=auto softlockup_panic=0 audit=0 mce=off hugepagesz=1G hugepages=40 hugepagesz=2M hugepages=0 default_hugepagesz=1G kthread_cpus=0,31,32,63 irqaffinity=0,31,32,63 isolcpus=1-30,33-62 skew_tick=1 nohz_full=1-30,33-62 rcu_nocbs=1-30,33-62 rcu_nocb_poll"
transactional-update grub.cfg
To validate that the parameters are applied after reboot, you could check:
cat /proc/cmdline
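Besides /proc/cmdline, you can also confirm that the CPU isolation took effect by reading the kernel's own view of the isolated and full-dynticks CPUs (standard sysfs paths; with the values used in this example both should print 1-30,33-62):
cat /sys/devices/system/cpu/isolated
cat /sys/devices/system/cpu/nohz_full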
Multus + Calico
Multus CNI is a CNI plugin that enables attaching multiple network interfaces to pods. Multus does not replace CNI plugins, instead it acts as a CNI plugin multiplexer. Multus is useful in certain use cases, especially when pods are network intensive and require extra network interfaces that support dataplane acceleration techniques such as SR-IOV.
Multus can not be deployed standalone. It always requires at least one conventional CNI plugin that fulfills the Kubernetes cluster network requirements. That CNI plugin becomes the default for Multus, and will be used to provide the primary interface for all pods. In our case, most of the workloads in Telco will be deployed using Multus + calico.
To enable Multus on RKE2 cluster, add multus as the first list entry in the cni config key, followed by the name of the plugin you want to use alongside Multus (or none if you will provide your own default plugin). Note that multus must always be in the first position of the list. For example, to use Multus with calico as the default plugin you could specify:
# /etc/rancher/rke2/config.yaml
cni:
- multus
- calico
This can also be specified with command-line arguments, i.e. --cni=multus,calico or --cni=multus --cni=calico.
You can also install Multus directly during the edge cluster installation.
For more information about Multus, please visit https://github.com/k8snetworkplumbingwg/multus-cni
For more information about CNI plugins, please visit https://docs.rke2.io/install/network_options
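As an illustration of how an additional interface is consumed once Multus is running, the sketch below defines a secondary macvlan network and shows the pod annotation that requests it. The master interface eth1 and the address range are assumptions to adapt to your environment:
cat <<EOF | kubectl apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-secondary
  namespace: default
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "eth1",
    "mode": "bridge",
    "ipam": { "type": "host-local", "subnet": "192.168.100.0/24" }
  }'
EOF
A pod then requests the extra interface through the annotation k8s.v1.cni.cncf.io/networks: macvlan-secondary.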
SRIOV
SR-IOV allows a device, such as a network adapter, to separate access to its resources among various PCIe hardware functions. There are different ways to deploy SRIOV, and in this case, we will show two different options:
- Option 1: using the SRIOV CNI device plugins and a config map to configure them properly.
- Option 2: using the SRIOV Helm chart from Rancher to make this deployment easy.
Option 1 - Installation of SR-IOV CNI device plugins and a config map to configure it properly
Prepare the config map for the device plugin
You could get the information to fill the config map from the lspci command:
lspci | grep -i acc
8a:00.0 Processing accelerators: Intel Corporation Device 0d5c
lspci | grep -i net
19:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
19:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
19:00.2 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
19:00.3 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
51:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
51:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
51:01.0 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:01.1 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:01.2 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:01.3 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:11.0 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:11.1 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:11.2 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:11.3 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
0d5d is the device ID of the VF from the FEC card (note that it is different from the lspci | grep acc result, because it is the VF, not the PF). Normally it is the first VF of the card, so the device names will be consecutive after the VF creation.
The config map consists of a JSON file that describes devices using filters to discover them, and creates groups for the interfaces. The key is to understand the filters and the groups: the filters are used to discover the devices, and the groups are used to create the interfaces.
For example, we could filter using:
- vendorID: 8086 (Intel)
- deviceID: 0d5d (FEC)
- driver: vfio-pci (SRIOV driver)
- pfNames: p2p1 (PF name)
We could also set placeholders like:
- pfNames: ["eth1#1,2,3,4,5,6"]
Regarding the groups, we could create a group for the FEC card and another group for the Intel card, even creating a prefix depending on our use case:
- resourceName: pci_sriov_net_bh_dpdk
- resourcePrefix: Rancher.io
There are many possible combinations of filters and groups in order to discover and create the resource group that allocates some VFs to the pods.
For more information about the filters and groups, please visit https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
name: sriovdp-config
namespace: kube-system
data:
config.json: |
{
"resourceList": [
{
"resourceName": "intel_fec_5g",
"devicetype": "accelerator",
"selectors": {
"vendors": ["8086"],
"devices": ["0d5d"]
}
},
{
"resourceName": "intel_sriov_odu",
"selectors": {
"vendors": ["8086"],
"devices": ["1889"],
"drivers": ["vfio-pci"],
"pfNames": ["p2p1"]
}
},
{
"resourceName": "intel_sriov_oru",
"selectors": {
"vendors": ["8086"],
"devices": ["1889"],
"drivers": ["vfio-pci"],
"pfNames": ["p2p2"]
}
}
]
}
EOF
Prepare the daemonset for the device plugin
No changes are needed in the daemonset, so you can use the same upstream daemonset file.
cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: sriov-device-plugin
namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: kube-sriov-device-plugin-amd64
namespace: kube-system
labels:
tier: node
app: sriovdp
spec:
selector:
matchLabels:
name: sriov-device-plugin
template:
metadata:
labels:
name: sriov-device-plugin
tier: node
app: sriovdp
spec:
hostNetwork: true
nodeSelector:
kubernetes.io/arch: amd64
tolerations:
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
serviceAccountName: sriov-device-plugin
containers:
- name: kube-sriovdp
image: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:latest-amd64
imagePullPolicy: IfNotPresent
args:
- --log-dir=sriovdp
- --log-level=10
securityContext:
privileged: true
resources:
requests:
cpu: "250m"
memory: "40Mi"
limits:
cpu: 1
memory: "200Mi"
volumeMounts:
- name: devicesock
mountPath: /var/lib/kubelet/
readOnly: false
- name: log
mountPath: /var/log
- name: config-volume
mountPath: /etc/pcidp
- name: device-info
mountPath: /var/run/k8s.cni.cncf.io/devinfo/dp
volumes:
- name: devicesock
hostPath:
path: /var/lib/kubelet/
- name: log
hostPath:
path: /var/log
- name: device-info
hostPath:
path: /var/run/k8s.cni.cncf.io/devinfo/dp
type: DirectoryOrCreate
- name: config-volume
configMap:
name: sriovdp-config
items:
- key: config.json
path: config.json
EOF
After that you should see the pods running:
kubectl get pods -n kube-system | grep sriov
kube-system kube-sriov-device-plugin-amd64-twjfl 1/1 Running 0 2m
- Check the interfaces discovered and available in the nodes to be used by the pods:
kubectl get $(kubectl get nodes -oname) -o jsonpath='{.status.allocatable}' | jq
{
"cpu": "64",
"ephemeral-storage": "256196109726",
"hugepages-1Gi": "40Gi",
"hugepages-2Mi": "0",
"intel.com/intel_fec_5g": "1",
"intel.com/intel_sriov_odu": "4",
"intel.com/intel_sriov_oru": "4",
"memory": "221396384Ki",
"pods": "110"
}
- The FEC resource will be intel.com/intel_fec_5g, and the value will be 1.
- The VF resources will be intel.com/intel_sriov_odu or intel.com/intel_sriov_oru if you deploy them with the device plugin and the config map, without Helm charts.
Important Note: If the interfaces are not available here, it does not make sense to continue with the workload, because the interfaces will not be available for the pods.
Option 2 - Installation from Rancher using the Helm chart for SR-IOV CNI and device plugins
Get helm if not present
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 500 get_helm.sh
./get_helm.sh
Install SRIOV
This part can be done in two ways: using the CLI or using the Rancher UI.
- Install Operator from CLI
helm repo add rancher-charts https://raw.githubusercontent.com/rancher/charts/dev-v2.7/
helm install sriov-crd rancher-charts/sriov-crd
helm install sriov rancher-charts/sriov -n kube-system
- Install Operator from Rancher UI
Once your cluster is installed and you have access to the Rancher UI, you can install the SR-IOV Operator from the Apps tab in the Rancher UI.
Check the deployed resources crd and pods
kubectl -n sriov-network-operator get crd
kubectl -n sriov-network-operator get pods
Check the label in the nodes
Now, if you have all the resources running, the label should appear automatically on your node:
kubectl get nodes -oyaml | grep feature.node.kubernetes.io/network-sriov.capable
feature.node.kubernetes.io/network-sriov.capable: "true"
If it is not present, you can add it manually:
kubectl label $(kubectl get nodes -oname) feature.node.kubernetes.io/network-sriov.capable=true
Review the daemonsets to see the new sriov-network-config-daemon and sriov-rancher-nfd-worker as active and ready:
kubectl get daemonset -A
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
calico-system calico-node 1 1 1 1 1 kubernetes.io/os=linux 15h
cattle-sriov-system sriov-network-config-daemon 1 1 1 1 1 feature.node.kubernetes.io/network-sriov.capable=true 45m
cattle-sriov-system sriov-rancher-nfd-worker 1 1 1 1 1 <none> 45m
kube-system rke2-ingress-nginx-controller 1 1 1 1 1 kubernetes.io/os=linux 15h
kube-system rke2-multus-ds 1 1 1 1 1 kubernetes.io/arch=amd64,kubernetes.io/os=linux 15h
After a few minutes (it can take up to 10 minutes to be updated), the detected and configured nodes will appear:
kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -A
NAMESPACE NAME AGE
cattle-sriov-system xr11-2 83s
Check the interfaces detected
The interfaces discovered should show the PCI address of the network device. Check this information with the lspci command on the host.
$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -n kube-system -oyaml
apiVersion: v1
items:
- apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
creationTimestamp: "2023-06-07T09:52:37Z"
generation: 1
name: xr11-2
namespace: cattle-sriov-system
ownerReferences:
- apiVersion: sriovnetwork.openshift.io/v1
blockOwnerDeletion: true
controller: true
kind: SriovNetworkNodePolicy
name: default
uid: 80b72499-e26b-4072-a75c-f9a6218ec357
resourceVersion: "356603"
uid: e1f1654b-92b3-44d9-9f87-2571792cc1ad
spec:
dpConfigVersion: "356507"
status:
interfaces:
- deviceID: "1592"
driver: ice
eSwitchMode: legacy
linkType: ETH
mac: 40:a6:b7:9b:35:f0
mtu: 1500
name: p2p1
pciAddress: "0000:51:00.0"
totalvfs: 128
vendor: "8086"
- deviceID: "1592"
driver: ice
eSwitchMode: legacy
linkType: ETH
mac: 40:a6:b7:9b:35:f1
mtu: 1500
name: p2p2
pciAddress: "0000:51:00.1"
totalvfs: 128
vendor: "8086"
syncStatus: Succeeded
kind: List
metadata:
resourceVersion: ""
Note: If your interface is not detected here, you should ensure that it is present in the following config map:
kubectl get cm supported-nic-ids -oyaml -n cattle-sriov-system
If your device is not there, edit the config map to add the right values to be discovered (it may be necessary to restart the sriov-network-config-daemon daemonset).
Create the NetworkNode Policy to configure the VFs
Basically, you will create some VFs (numVfs) from the device (rootDevices), and they will be configured with the driver (deviceType) and the MTU (mtu):
cat <<EOF | kubectl apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
name: policy-dpdk
namespace: kube-system
spec:
nodeSelector:
feature.node.kubernetes.io/network-sriov.capable: "true"
resourceName: intelnicsDpdk
deviceType: vfio-pci
numVfs: 8
mtu: 1500
nicSelector:
deviceID: "1592"
vendor: "8086"
rootDevices:
- 0000:51:00.0
EOF
Validate configurations
kubectl get $(kubectl get nodes -oname) -o jsonpath='{.status.allocatable}' | jq
{
"cpu": "64",
"ephemeral-storage": "256196109726",
"hugepages-1Gi": "60Gi",
"hugepages-2Mi": "0",
"intel.com/intel_fec_5g": "1",
"memory": "200424836Ki",
"pods": "110",
"rancher.io/intelnicsDpdk": "8"
}
Create the SriovNetwork (optional, in case we need a different network):
cat <<EOF | kubectl apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
name: network-dpdk
namespace: kube-system
spec:
ipam: |
{
"type": "host-local",
"subnet": "192.168.0.0/24",
"rangeStart": "192.168.0.20",
"rangeEnd": "192.168.0.60",
"routes": [{
"dst": "0.0.0.0/0"
}],
"gateway": "192.168.0.1"
}
vlan: 500
resourceName: intelnicsDpdk
EOF
Check the network created:
kubectl get network-attachment-definitions.k8s.cni.cncf.io -A -oyaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
annotations:
k8s.v1.cni.cncf.io/resourceName: rancher.io/intelnicsDpdk
creationTimestamp: "2023-06-08T11:22:27Z"
generation: 1
name: network-dpdk
namespace: kube-system
resourceVersion: "13124"
uid: df7c89f5-177c-4f30-ae72-7aef3294fb15
spec:
config: '{ "cniVersion":"0.3.1", "name":"network-dpdk","type":"sriov","vlan":500,"vlanQoS":0,"ipam":{"type":"host-local","subnet":"192.168.0.0/24","rangeStart":"192.168.0.10","rangeEnd":"192.168.0.60","routes":[{"dst":"0.0.0.0/0"}],"gateway":"192.168.0.1"}
}'
kind: List
metadata:
resourceVersion: ""
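To consume this network from a workload, the pod references both the network attachment and the resource exposed by the operator; a minimal sketch (the container image is only a placeholder) could be:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-test-pod
  namespace: kube-system
  annotations:
    # Attach the secondary interface from the SriovNetwork created above
    k8s.v1.cni.cncf.io/networks: kube-system/network-dpdk
spec:
  containers:
  - name: test
    image: registry.suse.com/bci/bci-base:latest
    command: ["sleep", "infinity"]
    resources:
      requests:
        rancher.io/intelnicsDpdk: '1'
      limits:
        rancher.io/intelnicsDpdk: '1'
EOF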
DPDK
Kernel parameters
To use DPDK with certain drivers, we need to enable some kernel parameters:
parameter | value | description |
---|---|---|
iommu | pt | This option enables the use of vfio for the DPDK interfaces. |
intel_iommu | on | This option enables the use of vfio for the VFs. |
To enable these parameters, we need to add them to the kernel command line:
vi /etc/default/grub
GRUB_CMDLINE_LINUX="intel_iommu=on intel_pstate=passive processor.max_cstate=1 intel_idle.max_cstate=0 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=0 crashkernel=auto softlockup_panic=0 audit=0 mce=off hugepagesz=1G hugepages=40 hugepagesz=2M hugepages=0 default_hugepagesz=1G kthread_cpus=0,31,32,63 irqaffinity=0,31,32,63 isolcpus=1-30,33-62 skew_tick=1 nohz_full=1-30,33-62 rcu_nocbs=1-30,33-62 rcu_nocb_poll"
Then you need to update the grub configuration and reboot the system to apply the changes:
transactional-update grub.cfg
reboot
To validate that the parameters are applied after the reboot you can check the command line:
cat /proc/cmdline
Load vfio-pci kernel module
modprobe vfio-pci
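Optionally, to make the module load persist across reboots, you can declare it for systemd-modules-load (the file name below is just a convention):
echo "vfio-pci" > /etc/modules-load.d/vfio-pci.conf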
Create VFs from the NICs
To create 4 VFs (for example) from 2 different NICs, we need to execute the following commands:
echo 4 > /sys/bus/pci/devices/0000:51:00.0/sriov_numvfs
echo 4 > /sys/bus/pci/devices/0000:51:00.1/sriov_numvfs
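You can verify that the VFs were created by reading back the sysfs attribute and listing the new virtual functions:
cat /sys/bus/pci/devices/0000:51:00.0/sriov_numvfs
lspci | grep -i "Virtual Function"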
Bind the new VFs with the vfio-pci driver
dpdk-devbind.py -b vfio-pci 0000:51:01.0 0000:51:01.1 0000:51:01.2 0000:51:01.3 0000:51:11.0 0000:51:11.1 0000:51:11.2 0000:51:11.3
Review the configuration applied:
dpdk-devbind.py -s
Network devices using DPDK-compatible driver
============================================
0000:51:01.0 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:01.1 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:01.2 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:01.3 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:11.0 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:11.1 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:11.2 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:11.3 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
Network devices using kernel driver
===================================
0000:19:00.0 'BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet 1751' if=em1 drv=bnxt_en unused=igb_uio,vfio-pci *Active*
0000:19:00.1 'BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet 1751' if=em2 drv=bnxt_en unused=igb_uio,vfio-pci
0000:19:00.2 'BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet 1751' if=em3 drv=bnxt_en unused=igb_uio,vfio-pci
0000:19:00.3 'BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet 1751' if=em4 drv=bnxt_en unused=igb_uio,vfio-pci
0000:51:00.0 'Ethernet Controller E810-C for QSFP 1592' if=eth13 drv=ice unused=igb_uio,vfio-pci
0000:51:00.1 'Ethernet Controller E810-C for QSFP 1592' if=rename8 drv=ice unused=igb_uio,vfio-pci
Huge Pages
When a process uses RAM, the CPU marks it as used by that process. For efficiency, the CPU allocates RAM in chunks—4K bytes is the default value on many platforms. Those chunks are named pages. Pages can be swapped to disk, etc.
Since the process address space is virtual, the CPU and the operating system need to remember which pages belong to which process, and where each page is stored. The more pages you have, the more time it takes to find where memory is mapped. When a process uses 1 GB of memory, that is 262144 entries to look up (1 GB / 4 KB). If one page table entry consumes 8 bytes, that is 2 MB (262144 * 8) to look up.
Most current CPU architectures support larger-than-default pages, which give the CPU/OS fewer entries to look up.
Kernel parameters
To enable huge pages, we should add the following kernel parameters:
parameter | value | description |
---|---|---|
hugepagesz | 1G | This option sets the size of huge pages to 1 GB. |
hugepages | 40 | This is the number of huge pages of the size defined above. |
default_hugepagesz | 1G | This is the default huge page size. |
Modify the grub file to add them to the kernel command line:
vi /etc/default/grub
GRUB_CMDLINE_LINUX="intel_iommu=on intel_pstate=passive processor.max_cstate=1 intel_idle.max_cstate=0 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=0 crashkernel=auto softlockup_panic=0 audit=0 mce=off hugepagesz=1G hugepages=40 hugepagesz=2M hugepages=0 default_hugepagesz=1G kthread_cpus=0,31,32,63 irqaffinity=0,31,32,63 isolcpus=1-30,33-62 skew_tick=1 nohz_full=1-30,33-62 rcu_nocbs=1-30,33-62 rcu_nocb_poll"
Usage of huge pages
In order to use the huge pages we need to mount them:
mkdir -p /hugepages
mount -t hugetlbfs nodev /hugepages
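To make the mount persistent across reboots, an /etc/fstab entry along these lines can be used (a sketch; adjust pagesize if you use 2M pages):
nodev /hugepages hugetlbfs pagesize=1G 0 0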
Now you can deploy your Kubernetes workload, creating the resources as well as the volumes:
...
resources:
requests:
memory: "24Gi"
hugepages-1Gi: 16Gi
intel.com/intel_sriov_oru: '4'
limits:
memory: "24Gi"
hugepages-1Gi: 16Gi
intel.com/intel_sriov_oru: '4'
...
...
volumeMounts:
- name: hugepage
mountPath: /hugepages
...
volumes:
- name: hugepage
emptyDir:
medium: HugePages
...
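You can check the huge page pool and its current usage on the node with:
grep -i huge /proc/meminfo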
CPU Pinning Configuration
Requirements
- You must have the CPU tuned with the performance profile covered in the CPU Tuned Configuration section above.
- You must have the RKE2 cluster kubelet configured with the CPU management arguments (a sketch is shown after this list).
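A minimal sketch of the RKE2 kubelet arguments for the static CPU manager could look like the following; the reserved CPU sizing is an assumption and should be aligned with the housekeeping cores defined earlier:
# /etc/rancher/rke2/config.yaml
kubelet-arg:
- "cpu-manager-policy=static"
- "kube-reserved=cpu=1"
- "system-reserved=cpu=1"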
Use CPU Pinning on Kubernetes
There are three ways to use this feature with the Static Policy defined in the kubelet, depending on the requests and limits you define on your workload:
BestEffort QoS Class: If you don't define any request or limit for CPU, the pod will be scheduled on the first CPU available on the system.
An example using the BestEffort QoS Class could be:
spec:
containers:
- name: nginx
image: nginx
Burstable QoS Class: If you define a request for CPU which is not equal to the limits, or there is no CPU request.
Examples using the Burstable QoS Class could be:
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
requests:
memory: "100Mi"
or
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "2"
requests:
memory: "100Mi"
cpu: "1"
Guaranteed QoS Class: If you define a request for CPU which is equal to the limits.
An example using the Guaranteed QoS Class could be:
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "2"
requests:
memory: "200Mi"
cpu: "2"
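To confirm that exclusive cores were actually assigned to a Guaranteed pod, you can inspect the kubelet CPU manager state on the node (the default kubelet data directory is assumed here):
cat /var/lib/kubelet/cpu_manager_state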
NUMA Aware scheduling
Non-Uniform Memory Access or Non-Uniform Memory Architecture (NUMA) is a physical memory design used in SMP (multiprocessors) architecture, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.
Identify NUMA nodes
To identify the NUMA nodes on your system, you can use the following command:
numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 257167 MB
node 0 free: 246390 MB
node distances:
node 0
0: 10
Note: In this case we have only one NUMA node.
NUMA has to be enabled in the BIOS. If dmesg does not have records of NUMA initialization during bootup, then it is possible that NUMA-related messages in the kernel ring buffer have been overwritten.
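For NUMA-aware placement of the resources described in this document (CPUs, huge pages, SR-IOV VFs), the kubelet Topology Manager can be enabled alongside the static CPU manager. A sketch of the additional RKE2 kubelet argument, shown only as an example:
# /etc/rancher/rke2/config.yaml (additional kubelet argument)
kubelet-arg:
- "topology-manager-policy=single-numa-node"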
VRAN Acceleration (Intel ACC100)
As communications service providers move from 4G to 5G networks, many are adopting virtualized radio access network (vRAN) architectures for higher channel capacity and easier deployment of edge-based services and applications. vRAN solutions are ideally located to deliver low-latency services with the flexibility to increase or decrease capacity based on the volume of real-time traffic and demand on the network.
Kernel parameters
To enable the vRAN acceleration, we need to enable the following kernel parameters (if not already present):
parameter | value | description |
---|---|---|
iommu | pt | This option enables the use of vfio for the DPDK interfaces. |
intel_iommu | on | This option enables the use of vfio for the VFs. |
Modify the grub file to add them to the kernel command line:
vi /etc/default/grub
GRUB_CMDLINE_LINUX="intel_iommu=on intel_pstate=passive processor.max_cstate=1 intel_idle.max_cstate=0 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=0 crashkernel=auto softlockup_panic=0 audit=0 mce=off hugepagesz=1G hugepages=40 hugepagesz=2M hugepages=0 default_hugepagesz=1G kthread_cpus=0,31,32,63 irqaffinity=0,31,32,63 isolcpus=1-30,33-62 skew_tick=1 nohz_full=1-30,33-62 rcu_nocbs=1-30,33-62 rcu_nocb_poll"
Then you need to update the grub configuration and reboot the system to apply the changes:
transactional-update grub.cfg
reboot
To validate that the parameters are applied after the reboot you can check the command line:
cat /proc/cmdline
Load igb_uio and vfio-pci kernel modules
modprobe igb_uio
modprobe vfio-pci
Get interface information Acc100
In some cases (depending on the OS), you may need to add /sbin/ to the PATH for the lspci command:
export PATH=$PATH:/sbin/
lspci | grep -i acc
8a:00.0 Processing accelerators: Intel Corporation Device 0d5c
Bind the PF with igb_uio module
dpdk-devbind.py -b igb_uio 0000:8a:00.0
Create the VFs from the PF
To create 2 VFs from the PF and bind them with vfio-pci, follow the next steps:
echo 2 > /sys/bus/pci/devices/0000:8a:00.0/max_vfs
dpdk-devbind.py -b vfio-pci 0000:8b:00.0
Configure the ACC100 with the proposed configuration file:
pf_bb_config ACC100 -c /opt/pf-bb-config/acc100_config_vf_5g.cfg
Tue Jun 6 10:49:20 2023:INFO:Queue Groups: 2 5GUL, 2 5GDL, 2 4GUL, 2 4GDL
Tue Jun 6 10:49:20 2023:INFO:Configuration in VF mode
Tue Jun 6 10:49:21 2023:INFO: ROM version MM 99AD92
Tue Jun 6 10:49:21 2023:WARN:* Note: Not on DDR PRQ version 1302020 != 10092020
Tue Jun 6 10:49:21 2023:INFO:PF ACC100 configuration complete
Tue Jun 6 10:49:21 2023:INFO:ACC100 PF [0000:8a:00.0] configuration complete!
Check the new VFs created from the FEC PF:
dpdk-devbind.py -s
...
...
...
Baseband devices using DPDK-compatible driver
=============================================
0000:8a:00.0 'Device 0d5c' drv=igb_uio unused=vfio-pci
0000:8b:00.0 'Device 0d5d' drv=vfio-pci unused=igb_uio
Other Baseband devices
======================
0000:8b:00.1 'Device 0d5d' unused=igb_uio,vfio-pci
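A workload can then request the FEC VF through the resource name exposed by the device plugin config map shown in SRIOV Option 1; as a sketch, the container resources would include:
...
    resources:
      requests:
        intel.com/intel_fec_5g: '1'
      limits:
        intel.com/intel_fec_5g: '1'
...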
Metal LB (Beta)
TBC