jingxu97

Member Since 6 years ago

Google, Mountain View, CA

82 followers
0 following
2 stars
31 repos

105 contributions in the last year

Pinned
⚡ Container Cluster Manager from Google
⚡ Website/documentation repo
⚡ Test infrastructure for the Kubernetes project.
⚡ Kubernetes community content
⚡ Compute Resource Usage Analysis and Monitoring of Container Clusters
⚡ External storage plugins, provisioners, and helper libraries
Activity
May
18
5 days ago
push

jingxu97 push jingxu97/enhancements

jingxu97
jingxu97

Merge pull request #2571 from xiaoxubeii/kep-memory-qos

KEP-2570: Support memory qos with cgroups v2

jingxu97
jingxu97

Updates API migration requirements

jingxu97
jingxu97

Updates stage and milestone

jingxu97
jingxu97

Removes feature-gates section since it's in GA stage

jingxu97
jingxu97

Add PRR information for maxUnavailable

jingxu97
jingxu97

Merge pull request #2720 from seans3/kubectl-headers-beta

Update kubectl headers KEP to beta

jingxu97
jingxu97

Merge pull request #2688 from krmayankk/maxun

Add PRR information for maxUnavailable

jingxu97
jingxu97

Graduate memory manager to beta

Signed-off-by: Artyom Lukianov [email protected]

jingxu97
jingxu97

Updates motivation, graduation criteria, scalability and monitoring

jingxu97
jingxu97

Merge pull request #2708 from cynepco3hahue/update_memory_manager_kep

Graduate memory manager to beta

jingxu97
jingxu97

Updates on knowing if the feature works

jingxu97
jingxu97

Add more on custom metrics and API server/etcd unavailability behavior

jingxu97
jingxu97

updates template checkboxes

jingxu97
jingxu97

Updates on service dependency

jingxu97
jingxu97

Update CSI Windows KEP for GA

update CSI windows kep

jingxu97
jingxu97

Merge pull request #2670 from jingxu97/kep

Update CSI Windows KEP for GA

commit sha: 6a4aadc1a4aa6cbf931fdb91f52f6ff72f436a1d

push time: 4 days ago
issue

jingxu97 issue comment kubernetes/enhancements

jingxu97
jingxu97

Support recovery from volume expansion failure

Enhancement Description

May
16
1 week ago
created branch

jingxu97 in jingxu97/kubernetes create branch may/kubectlexec

created 6 days ago
push

jingxu97 push jingxu97/kubernetes

jingxu97
jingxu97

kubelet/stats: update cadvisor stats provider with new log location

In https://github.com/kubernetes/kubernetes/pull/74441, the namespace and name were added to the pod log location.

However, the cAdvisor stats provider wasn't correspondingly updated.

Since CRI-O uses the cAdvisor stats provider by default, despite being a CRI implementation, eviction based on ephemeral storage and container logs didn't work as expected until now.

Signed-off-by: Peter Hunt [email protected]

jingxu97
jingxu97

kubelet/stats: take container log stats into account when checking ephemeral stats

This commit updates checkEphemeralStorage to be able to add container log stats, if applicable.

It also updates the old check, used when container log stats aren't found, to be more accurate. Specifically, that check previously worked because of a programming fluke:

According to this block in pkg/kubelet/stats/helper.go:113:

if result.Rootfs != nil {
    rootfsUsage := *cfs.BaseUsageBytes
    result.Rootfs.UsedBytes = &rootfsUsage
}

BaseUsageBytes should be the value added, not TotalUsageBytes. However, since in this case one also needs to account for the calculated log size, which is TotalUsageBytes - BaseUsageBytes, using the TotalUsageBytes value accidentally worked.

Updating the case to use the correct value AND log offset fixes this accident and makes the behavior more in line with what happens when calculating ephemeral storage.

Signed-off-by: Peter Hunt [email protected]
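To make the intended accounting concrete, here is a minimal Go sketch of the corrected computation, using made-up field and function names rather than the actual kubelet code:

// Sketch only: use BaseUsageBytes for the writable layer plus explicit log
// usage, instead of relying on TotalUsageBytes to cover both by accident.
package main

import "fmt"

type fsStats struct {
    BaseUsageBytes  uint64 // writable-layer usage
    TotalUsageBytes uint64 // writable layer + container logs
}

// ephemeralUsage returns rootfs usage and log usage separately, the way the
// fixed check accounts for them when dedicated log stats are available.
func ephemeralUsage(cfs fsStats, logUsageBytes uint64) (rootfs, logs uint64) {
    rootfs = cfs.BaseUsageBytes
    if logUsageBytes == 0 {
        // Fallback when no log stats exist: derive the log offset.
        logs = cfs.TotalUsageBytes - cfs.BaseUsageBytes
    } else {
        logs = logUsageBytes
    }
    return rootfs, logs
}

func main() {
    cfs := fsStats{BaseUsageBytes: 100, TotalUsageBytes: 130}
    r, l := ephemeralUsage(cfs, 30)
    fmt.Println(r+l, "bytes of ephemeral storage") // 130 bytes of ephemeral storage
}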

jingxu97
jingxu97

kubelet/stats: add unit test for when container logs are found

Signed-off-by: Peter Hunt [email protected]

jingxu97
jingxu97

[e2e][azure] Make internalStaticIP flexible. Currently, internalStaticIP is hard-coded to "10.240.11.11". Such an IP works for an aks-engine cluster but not for CAPZ ones (node-subnet 10.1.0.0/16).

Signed-off-by: Zhecheng Li [email protected]

jingxu97
jingxu97

kubectl rollout: support recursive flag for rollout status

jingxu97
jingxu97

tests: adding integration test for rollout status

jingxu97
jingxu97

benchmark unstructuredToVal

jingxu97
jingxu97

update kubectl doc url

Signed-off-by: xin.li [email protected]

jingxu97
jingxu97

Optimize test cases for kubelet

jingxu97
jingxu97

Add wrapper for TimingHistogram

Do not bother wrapping WeightedHistogram because it is not used in k/k.

jingxu97
jingxu97

GCE: skip updating and deleting external loadbalancers if service is managed outside of service controller

jingxu97
jingxu97

Don't clone headers twice

CloneRequest() clones headers too

jingxu97
jingxu97

Make more ordinary and add benchmarks of wrapped timing histograms

jingxu97
jingxu97

Count inline volume for NodeVolumeLimit when CSI migration enabled

Previously, when kube-scheduler scheduled a pod, it did not take in-tree inline volumes into account when CSI migration was enabled. This could lead to failures where a pod was scheduled to a node but volume attachment failed.
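The idea can be illustrated with a rough sketch (hypothetical types and a hypothetical in-tree-to-CSI translation table, not the scheduler's real plugin code): when CSI migration is enabled, an in-tree inline volume is translated to its CSI driver and counted against the same per-node limit as regular CSI volumes.

// Sketch only: count volumes against a driver's per-node limit, including
// in-tree inline volumes translated to that driver when migration is on.
package main

import "fmt"

type volume struct {
    csiDriver    string // non-empty for CSI volumes
    inTreePlugin string // non-empty for in-tree inline volumes, e.g. "gce-pd"
}

// translateInTree maps an in-tree plugin to its CSI driver (made-up table).
func translateInTree(plugin string) (string, bool) {
    m := map[string]string{"gce-pd": "pd.csi.storage.gke.io"}
    d, ok := m[plugin]
    return d, ok
}

func countAgainstLimit(vols []volume, driver string, csiMigration bool) int {
    n := 0
    for _, v := range vols {
        switch {
        case v.csiDriver == driver:
            n++
        case csiMigration && v.inTreePlugin != "":
            if d, ok := translateInTree(v.inTreePlugin); ok && d == driver {
                n++ // previously this inline volume was silently ignored
            }
        }
    }
    return n
}

func main() {
    vols := []volume{{csiDriver: "pd.csi.storage.gke.io"}, {inTreePlugin: "gce-pd"}}
    fmt.Println(countAgainstLimit(vols, "pd.csi.storage.gke.io", true)) // 2
}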

jingxu97
jingxu97

enforce strict alpha handling for API serving

jingxu97
jingxu97

remove PDB v1beta1 usage where v1 is equivalent

jingxu97
jingxu97

generated: remove no longer served APIs

jingxu97
jingxu97

amend comment of NodeInclusionPolicy

Signed-off-by: kerthcet [email protected]

jingxu97
jingxu97

commit sha: 9f460160c1d6d199f75453e1ae529c230e8a6b1f

push time: 6 days ago
May
12
1 week ago
issue

jingxu97 issue comment kubernetes/kubernetes

jingxu97
jingxu97

gce: KCM detaches all in-tree volumes during update from K8s 1.20 to 1.21

What happened?

KCM wrongly detaches all in-tree volumes during update from K8s 1.20 to 1.21.

What did you expect to happen?

KCM should not detach all in-tree volumes during the update from K8s 1.20 to 1.21.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a K8s 1.20.13 cluster.

  2. Create in-tree and out-of-tree StorageClasses.

    allowVolumeExpansion: true
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: default-intree
    parameters:
      type: pd-standard
    provisioner: kubernetes.io/gce-pd
    reclaimPolicy: Delete
    volumeBindingMode: Immediate
    
    allowVolumeExpansion: true
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      annotations:
        storageclass.kubernetes.io/is-default-class: "true"
      name: default
    parameters:
      type: pd-standard
    provisioner: pd.csi.storage.gke.io
    reclaimPolicy: Delete
    volumeBindingMode: WaitForFirstConsumer
    
  3. Create 3 StatefulSets with 4 replicas each (1 StatefulSet uses the out-of-tree StorageClass, the other 2 use the in-tree one):

    apiVersion: v1
    kind: Service
    metadata:
      name: app1
      labels:
        app: app1
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app1
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app1
    spec:
      serviceName: app1
      replicas: 4
      selector:
        matchLabels:
          app: app1
      template:
        metadata:
          labels:
            app: app1
        spec:
          containers:
            - name: app1
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app1
                mountPath: /data
              livenessProbe:
                exec:
                  command:
                  - tail
                  - -n 1
                  - /data/out.txt
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app1
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 1Gi
    
    apiVersion: v1
    kind: Service
    metadata:
      name: app2
      labels:
        app: app2
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app2
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app2
    spec:
      serviceName: app2
      replicas: 4
      selector:
        matchLabels:
          app: app2
      template:
        metadata:
          labels:
            app: app2
        spec:
          containers:
            - name: app2
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app2
                mountPath: /data
              livenessProbe:
                exec:
                  command:
                  - tail
                  - -n 1
                  - /data/out.txt
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app2
        spec:
          storageClassName: default-intree
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 2Gi
    
    apiVersion: v1
    kind: Service
    metadata:
      name: app3
      labels:
        app: app3
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app3
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app3
    spec:
      serviceName: app3
      replicas: 4
      selector:
        matchLabels:
          app: app3
      template:
        metadata:
          labels:
            app: app3
        spec:
          containers:
            - name: app3
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app3
                mountPath: /data
              livenessProbe:
                exec:
                  command:
                  - tail
                  - -n 1
                  - /data/out.txt
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app3
        spec:
          storageClassName: default-intree
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 3Gi
    
  4. Update the cluster to K8s 1.21.10.

  5. Observe that kube-controller-manager detaches all in-tree volumes during the update:

5.1 kube-controller-manager marks all in-tree volumes as uncertain.

2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--589d5a8a-cf3a-4428-bd5f-4e03d1615e1e\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--7e6663f2-2446-45ee-bca3-a53771b7226b\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--6b11de8e-2115-42d6-8d26-c52bf97a1076\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--148e95ac-196f-4b59-a547-c01ab7ae3f2d\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--4f4d8963-b336-41c8-81ea-d1bbf25b61b9\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--ec3c3c19-5a62-4b29-b263-713da5a07d6e\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--ad094298-f42a-4f18-b453-9772ce21386b\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--53b113af-aaf2-4306-9c6f-817ca0de62eb\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--9a237184-7f03-451f-aad5-c85b09d6d580\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--26b788f6-03e5-4b39-a5f2-c20ef9a8a884\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}

5.2 Six minutes after marking many volume attachments as uncertain, KCM detaches the in-tree volumes.

2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--2bc75fc8-af96-4327-a9a7-cd648c41ec96\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--148e95ac-196f-4b59-a547-c01ab7ae3f2d\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--8ad228d7-187b-4972-ab4c-74e5a46c6ad8\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--26b788f6-03e5-4b39-a5f2-c20ef9a8a884\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--53b113af-aaf2-4306-9c6f-817ca0de62eb\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--4f4d8963-b336-41c8-81ea-d1bbf25b61b9\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--6b11de8e-2115-42d6-8d26-c52bf97a1076\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--7e6663f2-2446-45ee-bca3-a53771b7226b\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--9a237184-7f03-451f-aad5-c85b09d6d580\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--ad094298-f42a-4f18-b453-9772ce21386b\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--ec3c3c19-5a62-4b29-b263-713da5a07d6e\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--589d5a8a-cf3a-4428-bd5f-4e03d1615e1e\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
  6. After these detachments, the Node turns into an unhealthy state with reason FilesystemIsReadOnly:
  Normal   FilesystemIsReadOnly  48m                     kernel-monitor  Node condition ReadonlyFilesystem is now: True, reason: FilesystemIsReadOnly

The corresponding Pods fail with IO errors as their volumes are detached.

Anything else we need to know?

I see that in ASW the volume is reported as attached

I0407 06:37:53.956930       1 actual_state_of_world.go:507] Report volume "kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941" as attached to node "worker-1-z1-7b85b-9r2tb"

Then the VA is marked as uncertain

I0407 06:37:53.957975       1 attach_detach_controller.go:769] Marking volume attachment as uncertain as volume:"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941" ("worker-1-z1-7b85b-9r2tb") is not attached (Detached)

Note the diff in the volume name

-kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941
+kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941
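To make the mismatch concrete, here is a minimal illustration, assuming (as the log lines suggest) that attachments are tracked by the full unique volume name string, so a differing zone segment makes the two records look like different volumes:

// Sketch: the same disk reported under two unique names becomes two distinct
// map keys, so the "attached" record never matches the "uncertain" one.
package main

import "fmt"

func main() {
    attached := map[string]bool{
        "kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941": true,
    }
    reported := "kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941"

    fmt.Println(attached[reported]) // false: the zone segment differs, so the lookup misses
}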

Kubernetes version

Update from K8s 1.20.13 to 1.21.10.

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

https://github.com/gardener/gardener

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

external-provisioner - k8s.gcr.io/sig-storage/[email protected]
external-attacher - k8s.gcr.io/sig-storage/[email protected]
gcp-compute-persistent-disk-csi-driver - gcr.io/gke-release/[email protected]

jingxu97
jingxu97

@ialidzhikov thanks for sharing the detailed information. About the label, I confirmed with @msau42 that it is OK to not reapply it since it is mainly informational; the controller and scheduler do not depend on the zone label to make decisions. Also, as we discussed, in option 5 you can choose not to downgrade the external provisioner as long as you make sure no new PVs are created during the upgrade.

May
11
1 week ago
pull request

jingxu97 pull request kubernetes/kubernetes

jingxu97
jingxu97

WIP: fix issue #109354

Change-Id: I774a429442327a8700db975c625dd681f8577bc8

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


created branch

jingxu97 in jingxu97/kubernetes create branch may/recover

created 1 week ago
push

jingxu97 push jingxu97/kubernetes

jingxu97
jingxu97

Fix misspelling of success.

Signed-off-by: JunYang [email protected]

jingxu97
jingxu97

Added --sum flag to kubectl top pod

jingxu97
jingxu97

fixing the panic in TestVersion

jingxu97
jingxu97

Add missing test cases for RunAsGroup and SetRunAsGroup methods

jingxu97
jingxu97

fix comment of e2e test case garbage_collector

Signed-off-by: sayaoailun [email protected]

jingxu97
jingxu97

change to use require.NoError

jingxu97
jingxu97

Replace dbus-send with godbus for fake PrepareForShutdown message

jingxu97
jingxu97

refactor: Change the users of IsQualifiedName to ValidateQualifiedName

jingxu97
jingxu97

fix typo for nodelifecycle controller

jingxu97
jingxu97

Add pod status info log for e2e creating pods

jingxu97
jingxu97

kube-controller-manager: Remove the deprecated --experimental-cluster-signing-duration flag

Signed-off-by: ialidzhikov [email protected]

jingxu97
jingxu97

Improvement: Updated the serviceaccount flag for multiple subjects.

jingxu97
jingxu97

e2e/cleanup: fix package name and dir name mismatches

jingxu97
jingxu97

pkg/volume: fix incorrect klog.Infof usage

klog.Infof expects a format string as its first parameter and expands format specifiers inside it. What gets passed here is the final string that must be logged as-is; therefore, klog.Info has to be used.

Signed-off-by: yuswift [email protected]
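For illustration, a small self-contained example of the difference (using k8s.io/klog/v2; the message content is made up):

// Passing an already-formatted string to Infof re-interprets any '%' in it as
// format verbs; Info logs the string verbatim.
package main

import (
    "fmt"

    "k8s.io/klog/v2"
)

func main() {
    defer klog.Flush()

    msg := fmt.Sprintf("mount failed for volume %q: usage at 100%%", "pvc-123")

    klog.Info(msg)  // correct: logs the string as-is
    klog.Infof(msg) // wrong: the trailing '%' is treated as a format verb and gets mangled
}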

jingxu97
jingxu97

fix: exclude non-ready nodes and deleted nodes from azure load balancers

Make sure that nodes that are not in the ready state and are not newly created (i.e. not having the "node.cloudprovider.kubernetes.io/uninitialized" taint) get removed from load balancers. Also remove nodes that are being deleted from the cluster.

Signed-off-by: Riccardo Ravaioli [email protected]

jingxu97
jingxu97

kubelet: fix panic triggered when playing with a wip CRI

jingxu97
jingxu97

Update rs.extensions to rs.apps

jingxu97
jingxu97

For each call, log apf_execution_time

jingxu97
jingxu97

kubelet: more resilient node allocatable ephemeral-storage data getter

commit sha: b53be1d66ef1c7f79410d619864d3788e084dd49

push time: 1 week ago
push

jingxu97 push jingxu97/kubernetes

jingxu97
jingxu97

Fixed portName validation error message.

jingxu97
jingxu97

apiserver cacher: don't accept requests if stopped

The cacher blocks requests until it is ready; however, the ready variable doesn't differentiate whether the cacher was stopped.

The cacher uses a condition variable based on sync.Cond to handle readiness, but this did not take into account whether it was not ready because it was still waiting to become ready or because it was stopped.

Add a new condition to the condition variable to handle the stop case, and return an error to signal the goroutines that they should stop waiting and bail out.
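A minimal sketch of that pattern with hypothetical names (not the actual apiserver cacher code): waiters block on a sync.Cond until the gate becomes ready, and receive an error instead of blocking forever once it is stopped.

// Sketch only: a readiness gate that distinguishes "still becoming ready"
// from "stopped"; stopped waiters get an error rather than waiting forever.
package main

import (
    "fmt"
    "sync"
)

type state int

const (
    pending state = iota
    ready
    stopped
)

type readyGate struct {
    mu   sync.Mutex
    cond *sync.Cond
    st   state
}

func newReadyGate() *readyGate {
    g := &readyGate{}
    g.cond = sync.NewCond(&g.mu)
    return g
}

// wait blocks until the gate is ready, or returns an error if it was stopped.
func (g *readyGate) wait() error {
    g.mu.Lock()
    defer g.mu.Unlock()
    for g.st == pending {
        g.cond.Wait()
    }
    if g.st == stopped {
        return fmt.Errorf("gate is stopped")
    }
    return nil
}

func (g *readyGate) set(s state) {
    g.mu.Lock()
    g.st = s
    g.mu.Unlock()
    g.cond.Broadcast()
}

func main() {
    g := newReadyGate()
    go g.set(stopped)
    fmt.Println(g.wait()) // prints the "gate is stopped" error
}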

jingxu97
jingxu97

Close events recording sink in integration tests

jingxu97
jingxu97

tests: Include the Windows node name in the exception

There are a few tests that fail because the hostnames apparently do not match. Logging the name would help find the problem.

jingxu97
jingxu97

Update test/e2e/windows/host_process.go

Co-authored-by: James Sturtevant [email protected]

jingxu97
jingxu97

correct coverage MainStart argument order

jingxu97
jingxu97

do not skip DownwardAPIHugePages

jingxu97
jingxu97

azure: remove GA IPv6DualStack feature-gate

jingxu97
jingxu97

Add sanposhiho to SIG Scheduling reviewers

jingxu97
jingxu97

Fix discovery cache TTL to 6 hours

Signed-off-by: Kazuki Suda [email protected]

jingxu97
jingxu97

replace all the deprecated ioutil with io and os

jingxu97
jingxu97

Migrate ipallocator and portallocator to new Events API

jingxu97
jingxu97

integration: force close httpserver on exit

jingxu97
jingxu97

etcd3/store: update creation test to use storage client

There is no functional difference between checking for an empty key using the database client and doing so with the storage interface. Using the latter allows this test to be more portable.

Signed-off-by: Steve Kuznetsov [email protected]

jingxu97
jingxu97

etcd3/store: update cancelled watch test to be generic

There's no reason to create the watch using the underlying watcher.

Signed-off-by: Steve Kuznetsov [email protected]

jingxu97
jingxu97

etcd3/store: call a generic cancelled watch test

Signed-off-by: Steve Kuznetsov [email protected]

jingxu97
jingxu97

storage/testing: move cancelled watch test to generic package

Signed-off-by: Steve Kuznetsov [email protected]

jingxu97
jingxu97

test/integration: clarify namespace utilities

For a developer that's not very familiar with the integration flow, it is very surprising to see that the namespace creation logic does not create anything and that the namespace deletion logic does not delete anything, either.

Signed-off-by: Steve Kuznetsov [email protected]

jingxu97
jingxu97

node-perf: use tf-wide-deep:1.2

jingxu97
jingxu97

Merge pull request #109888 from sanposhiho/patch-3

Add sanposhiho to SIG Scheduling reviewers

commit sha: 564b2049231c971b6e2e51c0822ecbf030da4094

push time: 1 week ago
issue

jingxu97 issue kubernetes/kubernetes

jingxu97
jingxu97

Failing test: Feature gate checking is not enabled

Failure cluster 16ae3f774c3255733716

Error text:
test/e2e/common/node/downwardapi.go:295
May  4 11:36:57.292: Feature gate checking is not enabled, don't use SkipUnlessFeatureGateEnabled(DownwardAPIHugePages). Instead use the Feature tag.
test/e2e/common/node/downwardapi.go:296

Recent failures:

2022/5/7 10:22:24 ci-kubernetes-e2e-gci-gce-serial-kube-dns
2022/5/7 08:45:24 ci-kubernetes-e2e-gci-gce-serial-kube-dns-nodecache
2022/5/7 06:31:10 ci-cri-containerd-e2e-cos-gce-serial
2022/5/7 05:57:09 ci-kubernetes-e2e-gci-gce-serial
2022/5/7 04:22:09 ci-kubernetes-e2e-gci-gce-serial-kube-dns

/kind failing-test

/sig node

issue

jingxu97 issue comment kubernetes/kubernetes

jingxu97
jingxu97

Failing test: Feature gate checking is not enabled

Failure cluster 16ae3f774c3255733716

Error text:
test/e2e/common/node/downwardapi.go:295
May  4 11:36:57.292: Feature gate checking is not enabled, don't use SkipUnlessFeatureGateEnabled(DownwardAPIHugePages). Instead use the Feature tag.
test/e2e/common/node/downwardapi.go:296

Recent failures:

2022/5/7 10:22:24 ci-kubernetes-e2e-gci-gce-serial-kube-dns
2022/5/7 08:45:24 ci-kubernetes-e2e-gci-gce-serial-kube-dns-nodecache
2022/5/7 06:31:10 ci-cri-containerd-e2e-cos-gce-serial
2022/5/7 05:57:09 ci-kubernetes-e2e-gci-gce-serial
2022/5/7 04:22:09 ci-kubernetes-e2e-gci-gce-serial-kube-dns

/kind failing-test

/sig node

issue

jingxu97 issue kubernetes/kubernetes

jingxu97
jingxu97

AddVolumeToReportAsAttached logic issue

What happened?

If a volume is marked as uncertain, it should not be on the volumesAttached list. But if detach is triggered and fails, the attach_detach_controller will try to add this volume back as attached in the node status (AddVolumeToReportAsAttached). However, the volume might not even be attached (since the state is uncertain). So in this case, if detach fails, we should keep the volume marked as uncertain.

What did you expect to happen?

If the volume is marked as uncertain, it should not be listed as volumeAttached in node status

How can we reproduce it (as minimally and precisely as possible)?

Use a customized driver to make the attach operation time out, so the volume is marked as uncertain in the actual state. Then delete the pod that is using this volume and make the detach operation fail.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# paste output here
all the supported kubernetes versions

Cloud provider

all the supported versions

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

May
9
2 weeks ago
open pull request

jingxu97 wants to merge kubernetes/kubernetes

jingxu97
jingxu97

Skip mount point checks when possible during mount cleanup.

What type of PR is this?

/kind bug /kind api-change

What this PR does / why we need it:

This is a continuation of https://github.com/kubernetes/kubernetes/pull/109117. Please see that PR for more background.

Calls to mounter.Unmount are preceded and followed by expensive mount point checks. These checks are not necessary on *nix systems with a umount implementation that performs a similar check itself. This PR adds a mechanism to detect the "safe" behavior and avoid mount point checks when possible.

This change represents a significant optimization of CleanupMountPoint, enabling use cases where pods have many mounts and there is high pod churn. We (EKS) have observed several cases of instability and poor node health in such scenarios, which were resolved by this change.

Which issue(s) this PR fixes:

No issue available.

Special notes for your reviewer:

I chose to add a function to the Mounter interface in order to keep the CleanupMountPoint implementation generic for Unix and Windows. If the "safe" umount behavior is not detected, the existing code paths are unchanged.

Does this PR introduce a user-facing change?

A function (CanSafelySkipMountPointCheck() bool) was added to mount-utils Mounter interface, exposing the mounter's support for skipping mount point checks.
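As a rough sketch of how cleanup code could consult such a capability (hypothetical helper and type names; not the actual mount-utils implementation):

// Sketch only: skip the expensive "is this a mount point?" check when the
// mounter reports that its umount already fails safely on non-mount paths.
package main

import (
    "fmt"
    "os"
)

// Mounter is a pared-down stand-in for the mount-utils interface.
type Mounter interface {
    Unmount(target string) error
    // CanSafelySkipMountPointCheck reports whether Unmount can be called
    // without first verifying that target is a mount point.
    CanSafelySkipMountPointCheck() bool
}

// cleanupMountPoint is a simplified stand-in for CleanupMountPoint.
func cleanupMountPoint(m Mounter, target string, isMountPoint func(string) (bool, error)) error {
    if !m.CanSafelySkipMountPointCheck() {
        mounted, err := isMountPoint(target) // expensive check, only when needed
        if err != nil {
            return err
        }
        if !mounted {
            return os.Remove(target)
        }
    }
    if err := m.Unmount(target); err != nil {
        return fmt.Errorf("unmount %s: %w", target, err)
    }
    return os.Remove(target)
}

// fakeMounter pretends to be a "safe" umount implementation.
type fakeMounter struct{}

func (fakeMounter) Unmount(string) error               { return nil }
func (fakeMounter) CanSafelySkipMountPointCheck() bool { return true }

func main() {
    dir, _ := os.MkdirTemp("", "cleanup-demo")
    fmt.Println(cleanupMountPoint(fakeMounter{}, dir, nil)) // <nil>; dir removed without a mount check
}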

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


jingxu97
jingxu97

If mountPath is not a mount point, will it return an error at line 104 during Unmount? In that case, it will not call removePath afterwards, right?

pull request

jingxu97 merge to kubernetes/kubernetes

jingxu97
jingxu97

Skip mount point checks when possible during mount cleanup.

What type of PR is this?

/kind bug /kind api-change

What this PR does / why we need it:

This is a continuation of https://github.com/kubernetes/kubernetes/pull/109117. Please see that PR for more background.

Calls to mounter.Unmount are preceded and followed by expensive mount point checks. These checks are not necessary on *nix systems with a umount implementation that performs a similar check itself. This PR adds a mechanism to detect the "safe" behavior and avoid mount point checks when possible.

This change represents a significant optimization of CleanupMountPoint, enabling use cases where pods have many mounts and there is high pod churn. We (EKS) have observed several cases of instability and poor node health in such scenarios, which were resolved by this change.

Which issue(s) this PR fixes:

No issue available.

Special notes for your reviewer:

I chose to add a function to the Mounter interface in order to keep the CleanupMountPoint implementation generic for Unix and Windows. If the "safe" umount behavior is not detected, the existing code paths are unchanged.

Does this PR introduce a user-facing change?

A function (CanSafelySkipMountPointCheck() bool) was added to mount-utils Mounter interface, exposing the mounter's support for skipping mount point checks.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


issue

jingxu97 issue comment kubernetes/kubernetes

jingxu97
jingxu97

Failing test: Feature gate checking is not enabled

Failure cluster 16ae3f774c3255733716

Error text:
test/e2e/common/node/downwardapi.go:295
May  4 11:36:57.292: Feature gate checking is not enabled, don't use SkipUnlessFeatureGateEnabled(DownwardAPIHugePages). Instead use the Feature tag.
test/e2e/common/node/downwardapi.go:296

Recent failures:

2022/5/7 10:22:24 ci-kubernetes-e2e-gci-gce-serial-kube-dns
2022/5/7 08:45:24 ci-kubernetes-e2e-gci-gce-serial-kube-dns-nodecache
2022/5/7 06:31:10 ci-cri-containerd-e2e-cos-gce-serial
2022/5/7 05:57:09 ci-kubernetes-e2e-gci-gce-serial
2022/5/7 04:22:09 ci-kubernetes-e2e-gci-gce-serial-kube-dns

/kind failing-test

/sig node

May
5
2 weeks ago
issue

jingxu97 issue comment kubernetes/kubernetes

jingxu97
jingxu97

If kubelet is unavailable, AttachDetachController fails to force detach on pod deletion

/kind bug

What you expected to happen: When a pod using an attached volume is deleted (gracefully) but the kubelet on the corresponding node is down, the AttachDetachController should assume the node is unrecoverable after a timeout (currently 6 min) and forcefully detach the volume.

What happened: The volume is never detached.

How to reproduce it (as minimally and precisely as possible):

  • Create a pod with a volume using any of the attachable plugins (I used GCE PD to test).
  • Stop kubelet inside the node where the pod is scheduled.
  • Delete the pod.
  • Wait for 6min+
  • Check to see if the volume is still attached.

Anything else we need to know?: This doesn't happen if the pod is force-deleted.

It's likely due to the last condition checked in this line. Once kubelet is down, the container status is no longer reported correctly. Inside the Attach Detach Controller, this function is called by the pod update informer handler, which sets whether the volume should be attached in the desired state of the world. On pod force deletion, the pod object is deleted immediately, and this triggers the pod delete informer handler, which doesn't call this function.

/sig storage /cc @saad-ali @gnufied @jingxu97 @NickrenREN /assign

jingxu97
jingxu97

We now have a new alpha feature, "Non-graceful node shutdown", which can mostly address this situation. Please check out the KEP for details: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown. There will be a blog post soon. Please let us know if you have any feedback.

May
4
2 weeks ago
issue

jingxu97 issue comment kubernetes/kubernetes

jingxu97
jingxu97

[Failing test] pull-kubernetes-e2e-gce-iscsi-serial and pull-kubernetes-e2e-gce-iscsi are failing

Which jobs are failing?

pull-kubernetes-e2e-gce-iscsi-serial pull-kubernetes-e2e-gce-iscsi

Error text:

W0106 06:54:08.830] scp: /var/log/cluster-autoscaler.log*: No such file or directory
W0106 06:54:08.830] scp: /var/log/kube-addon-manager.log*: No such file or directory
W0106 06:54:08.908] scp: /var/log/fluentd.log*: No such file or directory
W0106 06:54:08.909] scp: /var/log/kubelet.cov*: No such file or directory
W0106 06:54:08.909] scp: /var/log/startupscript.log*: No such file or directory
W0106 06:54:08.915] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
I0106 06:54:09.138] Dumping logs from nodes locally to '/workspace/_artifacts'
I0106 06:54:09.139] Detecting nodes in the cluster
I0106 06:55:02.414] Changing logfiles to be world-readable for download
I0106 06:55:02.912] Changing logfiles to be world-readable for download
I0106 06:55:02.913] Changing logfiles to be world-readable for download
... skipped 9 lines ...
W0106 06:55:10.794] scp: /var/log/containers/konnectivity-agent-*.log*: No such file or directory
W0106 06:55:10.795] scp: /var/log/fluentd.log*: No such file or directory
W0106 06:55:10.795] scp: /var/log/node-problem-detector.log*: No such file or directory
W0106 06:55:10.795] scp: /var/log/kubelet.cov*: No such file or directory
W0106 06:55:10.795] scp: /var/log/startupscript.log*: No such file or directory
W0106 06:55:10.801] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
W0106 06:55:11.299] scp: /var/log/containers/konnectivity-agent-*.log*: No such file or directory
W0106 06:55:11.299] scp: /var/log/fluentd.log*: No such file or directory
W0106 06:55:11.300] scp: /var/log/node-problem-detector.log*: No such file or directory
W0106 06:55:11.300] scp: /var/log/kubelet.cov*: No such file or directory
W0106 06:55:11.300] scp: /var/log/startupscript.log*: No such file or directory
W0106 06:55:11.305] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
W0106 06:55:11.484] scp: /var/log/containers/konnectivity-agent-*.log*: No such file or directory
W0106 06:55:11.485] scp: /var/log/fluentd.log*: No such file or directory
W0106 06:55:11.485] scp: /var/log/node-problem-detector.log*: No such file or directory
W0106 06:55:11.485] scp: /var/log/kubelet.cov*: No such file or directory
W0106 06:55:11.485] scp: /var/log/startupscript.log*: No such file or directory
W0106 06:55:11.488] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
W0106 06:55:15.713] INSTANCE_GROUPS=e2e-d44c9f5815-8b654-minion-group
W0106 06:55:15.713] NODE_NAMES=e2e-d44c9f5815-8b654-minion-group-dn5x e2e-d44c9f5815-8b654-minion-group-lcfz e2e-d44c9f5815-8b654-minion-group-pl8z
I0106 06:55:17.189] Failures for e2e-d44c9f5815-8b654-minion-group (if any):
W0106 06:55:19.045] 2022/01/06 06:55:19 process.go:155: Step './cluster/log-dump/log-dump.sh /workspace/_artifacts' finished in 1m55.805053979s
W0106 06:55:19.046] 2022/01/06 06:55:19 process.go:153: Running: ./hack/e2e-internal/e2e-down.sh
... skipped 66 lines ...
W0106 06:59:26.454]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 111, in check_env
W0106 06:59:26.454]     subprocess.check_call(cmd, env=env)
W0106 06:59:26.454]   File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
W0106 06:59:26.454]     raise CalledProcessError(retcode, cmd)
W0106 06:59:26.455] subprocess.CalledProcessError: Command '('kubetest', '--dump=/workspace/_artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--build=quick', '--stage=gs://kubernetes-release-pull/ci/pull-kubernetes-e2e-gce-iscsi-serial', '--up', '--down', '--test', '--provider=gce', '--cluster=e2e-d44c9f5815-8b654', '--gcp-network=e2e-d44c9f5815-8b654', '--extract=local', '--gcp-node-image=ubuntu', '--image-family=ubuntu-2004-lts', '--image-project=ubuntu-os-cloud', '--gcp-zone=us-west1-b', '--test_args=--ginkgo.focus=\\[Driver:.iscsi\\].*(\\[Serial\\]|\\[Disruptive\\]) --ginkgo.skip=\\[Flaky\\] --minStartupPods=8', '--timeout=120m')' returned non-zero exit status 1
E0106 06:59:26.455] Command failed
I0106 06:59:26.455] process 686 exited with code 1 after 34.1m
E0106 06:59:26.455] FAIL: pull-kubernetes-e2e-gce-iscsi-serial
I0106 06:59:26.456] Call:  gcloud auth activate-service-account --key-file=/etc/service-account/service-account.json
W0106 06:59:27.131] Activated service account credentials for: [[email protected]]
I0106 06:59:27.248] process 83564 exited with code 0 after 0.0m
I0106 06:59:27.249] Call:  gcloud config get-value account
I0106 06:59:27.895] process 83577 exited with code 0 after 0.0m
I0106 06:59:27.895] Will upload results to gs://kubernetes-jenkins/pr-logs using [email protected]
I0106 06:59:27.896] Upload result and artifacts...
I0106 06:59:27.896] Gubernator results at https://gubernator.k8s.io/build/kubernetes-jenkins/pr-logs/pull/104732/pull-kubernetes-e2e-gce-iscsi-serial/1478975577011523584
I0106 06:59:27.896] Call:  gsutil ls gs://kubernetes-jenkins/pr-logs/pull/104732/pull-kubernetes-e2e-gce-iscsi-serial/1478975577011523584/artifacts
W0106 06:59:28.914] CommandException: One or more URLs matched no objects.
E0106 06:59:29.117] Command failed

link: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/104732/pull-kubernetes-e2e-gce-iscsi-serial/1478975577011523584/

Since when has it been failing?

https://prow.k8s.io/job-history/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce-iscsi-serial?buildId=1469132815470694400

https://prow.k8s.io/job-history/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce-iscsi-serial?buildId=

2021 Dec 09 11:19:02

Testgrid link

https://testgrid.k8s.io/presubmits-kubernetes-nonblocking#pull-kubernetes-e2e-gce-iscsi-serial

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Relevant SIG(s)

/sig testing

Apr
18
1 month ago
issue

jingxu97 issue comment kubernetes/kubernetes

jingxu97
jingxu97

Failed to delete pod volume because of directory not empty

What happened:

When the init process of a pod is blocked requesting a remote URL, I delete the pod. But the init process is still blocked and the unmounter is unmounting the pod volume. At the same time, the process wakes up and writes something to the volume, leaving something behind in the pod volume path (mine is /var/lib/kubelet/pods/ea23394f-db52-11e8-ad88-6c92bf6f20b2/volumes/hulk~lvm/hulklvm). Then the unmounter tries to rmdir the pod volume path, but it fails and prints:

nestedpendingoperations.go:262] Operation for "\"hulk/lvm/hulklvm\" (\"ea23394f-db52-11e8-ad88-6c92bf6f20b2\")" failed. No retries permitted until 2018-10-29 16:16:38.785298614 +0800 CST (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "hulk/lvm/hulklvm" (volume.spec.Name: "hulklvm") pod "ea23394f-db52-11e8-ad88-6c92bf6f20b2" (UID: "ea23394f-db52-11e8-ad88-6c92bf6f20b2") with: remove /var/lib/kubelet/pods/ea23394f-db52-11e8-ad88-6c92bf6f20b2/volumes/hulk~lvm/hulklvm: directory not empty

What you expected to happen:

image

It should remove the volume path even if the directory is not empty. That means in the unmounter's TearDownAt, we should use os.RemoveAll at the end instead of os.Remove.
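For illustration, a tiny self-contained example of the difference being requested (paths are temporary and made up):

// os.Remove fails on a non-empty directory; os.RemoveAll deletes it together
// with its contents, which is what the unmounter's TearDownAt needs here.
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    dir, _ := os.MkdirTemp("", "volume-path")
    _ = os.WriteFile(filepath.Join(dir, "leftover.txt"), []byte("written after unmount started"), 0o644)

    fmt.Println(os.Remove(dir))    // error: directory not empty
    fmt.Println(os.RemoveAll(dir)) // <nil>: path and leftovers removed
}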

How to reproduce it (as minimally and precisely as possible):

While creating a pod, if the pod is stuck, delete it.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):1.6.6
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):centos 7
  • Kernel (e.g. uname -a):3.10.0-693.mt20180403.47.el7.x86_64
  • Install tools:
  • Others:

/kind bug

jingxu97
jingxu97
Apr
14
1 month ago
issue

jingxu97 issue comment kubernetes-sigs/vsphere-csi-driver

jingxu97
jingxu97

CNS volumes disappear and all in cluster operations fail

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: In 2 separate Kubernetes clusters we have observed failures when pods get moved and, as a result, the PVs have to be attached to the new node. The CNS volume object seems to just disappear, while the FCD is still intact on the vSAN. Nothing has triggered a deletion as far as I can see...

Events from the kubernetes side:

kubectl get event -n workload
LAST SEEN   TYPE      REASON               OBJECT                       MESSAGE
26s         Normal    Scheduled            pod/redis-754fbf4bd-26lbf    Successfully assigned workload/redis-754fbf4bd-26lbf to worker3
11s         Warning   FailedAttachVolume   pod/redis-754fbf4bd-26lbf    AttachVolume.Attach failed for volume "pvc-202662bf-3ce7-40a0-96af-d22d58198dce" : rpc error: code = Internal desc = failed to attach disk: "eb05ab8b-dcd0-4217-9c34-3ec8bda666a9" with node: "worker3" err ServerFaultCode: Received SOAP response fault from [<cs p:00007fa4dc0a6290, TCP:localhost:443>]: retrieveVStorageObject

Logs from csi-controller:

2021-09-17 12:27:36	
I0917 09:27:36.946521       1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
2021-09-17 12:27:36	
I0917 09:27:36.340379       1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
2021-09-17 12:27:35	
I0917 09:27:35.969819       1 csi_handler.go:226] Error processing "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba": failed to detach: rpc error: code = Internal desc = volumeID "eb05ab8b-dcd0-4217-9c34-3ec8bda666a9" not found in QueryVolume
2021-09-17 12:27:35	
I0917 09:27:35.969770       1 csi_handler.go:612] Saved detach error to "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba"
2021-09-17 12:27:35	
I0917 09:27:35.967871       1 controller.go:158] Ignoring VolumeAttachment "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba" change
2021-09-17 12:27:35	
I0917 09:27:35.952349       1 csi_handler.go:601] Saving detach error to "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba"
2021-09-17 12:27:35	
{"level":"error","time":"2021-09-17T09:27:35.951799411Z","caller":"vanilla/controller.go:883","msg":"volumeID \"eb05ab8b-dcd0-4217-9c34-3ec8bda666a9\" not found in QueryVolume","TraceId":"e3830e8e-5954-433b-a777-e5623668c7b3","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:883\nsigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:937\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5200\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerUnpublishVolume\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:141\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:88\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5202\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:722"}
2021-09-17 12:27:35	
{"level":"info","time":"2021-09-17T09:27:35.904885335Z","caller":"vanilla/controller.go:857","msg":"ControllerUnpublishVolume: called with args {VolumeId:eb05ab8b-dcd0-4217-9c34-3ec8bda666a9 NodeId:worker3 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"e3830e8e-5954-433b-a777-e5623668c7b3"}
2021-09-17 12:27:35	
I0917 09:27:35.903737       1 csi_handler.go:715] Found NodeID worker3 in CSINode worker3

And some entries from vsanvcmgmtd:

2021-09-15T22:18:29.157Z verbose vsanvcmgmtd[10186] [[email protected] sub=PyBackedMO opId=sps-Main-13723-334-8a1a]  Enter vasa.NotificationManager.getAlarm, Pending: 1 (5269718e-b4b7-0511-b82b-b8f707de46d5)
2021-09-15T22:18:29.157Z verbose vsanvcmgmtd[10186] [[email protected] sub=PyBackedMO opId=sps-Main-13723-334-8a1a]  Exit  vasa.NotificationManager.getAlarm (0 ms)
2021-09-15T22:18:38.683Z info vsanvcmgmtd[07066] [[email protected] sub=AccessChecker] User CLISTER.LOCAL\[email protected] was authenticated with soap session id. 52db629f-f84a-15b2-6401-b41b25af9ec7 (52a55bd9-9ac9-d4bd-d9f0-d6145bb4f7a5)
2021-09-15T22:18:38.704Z verbose vsanvcmgmtd[07065] [[email protected] sub=PyBackedMO opId=07dc8a1b]  Enter vim.cns.VolumeManager.queryAll, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:38.706Z verbose vsanvcmgmtd[07065] [[email protected] sub=PyBackedMO opId=07dc8a1b]  Exit  vim.cns.VolumeManager.queryAll (1 ms)
2021-09-15T22:18:38.731Z verbose vsanvcmgmtd[07104] [[email protected] sub=PyBackedMO opId=07dc8a1c]  Enter vim.cns.VolumeManager.query, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:38.885Z verbose vsanvcmgmtd[07104] [[email protected] sub=PyBackedMO opId=07dc8a1c]  Exit  vim.cns.VolumeManager.query (154 ms)
2021-09-15T22:18:38.923Z verbose vsanvcmgmtd[10186] [[email protected] sub=PyBackedMO opId=07dc8a1d]  Enter vim.cns.VolumeManager.create, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:38.928Z info vsanvcmgmtd[10186] [[email protected] sub=VolumeManager opId=07dc8a1d] CNS: create volume task created: task-314869
2021-09-15T22:18:38.928Z info vsanvcmgmtd[10186] [[email protected] sub=VolumeManager opId=07dc8a1d] CNS: Creating volume task scheduled.
2021-09-15T22:18:38.928Z verbose vsanvcmgmtd[10186] [[email protected] sub=PyBackedMO opId=07dc8a1d]  Exit  vim.cns.VolumeManager.create (4 ms)
2021-09-15T22:18:38.933Z info vsanvcmgmtd[25351] [[email protected] sub=VolumeManager opId=07dc8a1d] CNS: Creating volume task started
2021-09-15T22:18:39.029Z error vsanvcmgmtd[25351] [[email protected] sub=VolumeManager opId=07dc8a1d] CNS: backingDiskId not found: eb05ab8b-dcd0-4217-9c34-3ec8bda666a9, N3Vim5Fault8NotFound9ExceptionE(Fault cause: vim.fault.NotFound
--> )
--> [context]zKq7AVECAAAAAFnREgEVdnNhbnZjbWdtdGQAAFy7KmxpYnZtYWNvcmUuc28AAAw5GwB+uxgBNG38bGlidmltLXR5cGVzLnNvAIG7mg8BgVrlDwECtWEObGlidm1vbWkuc28AAspdEQJXXxEDHuoCbGliUHlDcHBWbW9taS5zbwACHJISAvGKEgTozgNsaWJ2c2xtLXR5cGVzLnNvAAXF3AdfY25zLnNvAAVN1gUFixQGABYkIwCSJiMA5RIrBtRzAGxpYnB0aHJlYWQuc28uMAAHzY4ObGliYy5zby42AA==[/context]
2021-09-15T22:18:39.031Z info vsanvcmgmtd[25351] [[email protected] sub=VolumeManager opId=07dc8a1d] Create volume completed: task-314869
2021-09-15T22:18:39.054Z verbose vsanvcmgmtd[12908] [[email protected] sub=PyBackedMO opId=07dc8a1e]  Enter vim.cns.VolumeManager.create, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:39.057Z info vsanvcmgmtd[12908] [[email protected] sub=VolumeManager opId=07dc8a1e] CNS: create volume task created: task-314870
2021-09-15T22:18:39.057Z info vsanvcmgmtd[12908] [[email protected] sub=VolumeManager opId=07dc8a1e] CNS: Creating volume task scheduled.
2021-09-15T22:18:39.057Z verbose vsanvcmgmtd[12908] [[email protected] sub=PyBackedMO opId=07dc8a1e]  Exit  vim.cns.VolumeManager.create (3 ms)
2021-09-15T22:18:39.062Z info vsanvcmgmtd[43408] [[email protected] sub=VolumeManager opId=07dc8a1e] CNS: Creating volume task started
2021-09-15T22:18:39.158Z verbose vsanvcmgmtd[07049] [[email protected] sub=PyBackedMO opId=sps-Main-13723-334-8a1f]  Enter vasa.NotificationManager.getAlarm, Pending: 1 (5269718e-b4b7-0511-b82b-b8f707de46d5)
2021-09-15T22:18:39.158Z verbose vsanvcmgmtd[07049] [[email protected] sub=PyBackedMO opId=sps-Main-13723-334-8a1f]  

What you expected to happen:

CNS volume objects should not just disappear?

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • csi-vsphere version: image
  • vsphere-cloud-controller-manager version: v1.20.0
  • Kubernetes version: 1.20.10
  • vSphere version: 6.7u3 (6.7.0.48000)
  • OS (e.g. from /etc/os-release): ubuntu 20.04
  • Kernel (e.g. uname -a): 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:
Apr
13
1 month ago
issue

jingxu97 issue comment kubernetes-sigs/vsphere-csi-driver

jingxu97
jingxu97

CNS volumes disappear and all in cluster operations fail

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: In 2 separate Kubernetes clusters we have observed failures when pods get moved and, as a result, the PVs have to be attached to the new node. The CNS volume object seems to just disappear, while the FCD is still intact on the vSAN. Nothing has triggered a deletion as far as I can see...

Events from the kubernetes side:

kubectl get event -n workload
LAST SEEN   TYPE      REASON               OBJECT                       MESSAGE
26s         Normal    Scheduled            pod/redis-754fbf4bd-26lbf    Successfully assigned workload/redis-754fbf4bd-26lbf to worker3
11s         Warning   FailedAttachVolume   pod/redis-754fbf4bd-26lbf    AttachVolume.Attach failed for volume "pvc-202662bf-3ce7-40a0-96af-d22d58198dce" : rpc error: code = Internal desc = failed to attach disk: "eb05ab8b-dcd0-4217-9c34-3ec8bda666a9" with node: "worker3" err ServerFaultCode: Received SOAP response fault from [<cs p:00007fa4dc0a6290, TCP:localhost:443>]: retrieveVStorageObject

Logs from csi-controller:

2021-09-17 12:27:36	
I0917 09:27:36.946521       1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
2021-09-17 12:27:36	
I0917 09:27:36.340379       1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
2021-09-17 12:27:35	
I0917 09:27:35.969819       1 csi_handler.go:226] Error processing "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba": failed to detach: rpc error: code = Internal desc = volumeID "eb05ab8b-dcd0-4217-9c34-3ec8bda666a9" not found in QueryVolume
2021-09-17 12:27:35	
I0917 09:27:35.969770       1 csi_handler.go:612] Saved detach error to "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba"
2021-09-17 12:27:35	
I0917 09:27:35.967871       1 controller.go:158] Ignoring VolumeAttachment "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba" change
2021-09-17 12:27:35	
I0917 09:27:35.952349       1 csi_handler.go:601] Saving detach error to "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba"
2021-09-17 12:27:35	
{"level":"error","time":"2021-09-17T09:27:35.951799411Z","caller":"vanilla/controller.go:883","msg":"volumeID \"eb05ab8b-dcd0-4217-9c34-3ec8bda666a9\" not found in QueryVolume","TraceId":"e3830e8e-5954-433b-a777-e5623668c7b3","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:883\nsigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:937\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5200\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerUnpublishVolume\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:141\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:88\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5202\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:722"}
2021-09-17 12:27:35	
{"level":"info","time":"2021-09-17T09:27:35.904885335Z","caller":"vanilla/controller.go:857","msg":"ControllerUnpublishVolume: called with args {VolumeId:eb05ab8b-dcd0-4217-9c34-3ec8bda666a9 NodeId:worker3 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"e3830e8e-5954-433b-a777-e5623668c7b3"}
2021-09-17 12:27:35	
I0917 09:27:35.903737       1 csi_handler.go:715] Found NodeID worker3 in CSINode worker3

And some entries from vsanvcmgmtd:

2021-09-15T22:18:29.157Z verbose vsanvcmgmtd[10186] [[email protected] sub=PyBackedMO opId=sps-Main-13723-334-8a1a]  Enter vasa.NotificationManager.getAlarm, Pending: 1 (5269718e-b4b7-0511-b82b-b8f707de46d5)
2021-09-15T22:18:29.157Z verbose vsanvcmgmtd[10186] [[email protected] sub=PyBackedMO opId=sps-Main-13723-334-8a1a]  Exit  vasa.NotificationManager.getAlarm (0 ms)
2021-09-15T22:18:38.683Z info vsanvcmgmtd[07066] [[email protected] sub=AccessChecker] User CLISTER.LOCAL\[email protected] was authenticated with soap session id. 52db629f-f84a-15b2-6401-b41b25af9ec7 (52a55bd9-9ac9-d4bd-d9f0-d6145bb4f7a5)
2021-09-15T22:18:38.704Z verbose vsanvcmgmtd[07065] [[email protected] sub=PyBackedMO opId=07dc8a1b]  Enter vim.cns.VolumeManager.queryAll, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:38.706Z verbose vsanvcmgmtd[07065] [[email protected] sub=PyBackedMO opId=07dc8a1b]  Exit  vim.cns.VolumeManager.queryAll (1 ms)
2021-09-15T22:18:38.731Z verbose vsanvcmgmtd[07104] [[email protected] sub=PyBackedMO opId=07dc8a1c]  Enter vim.cns.VolumeManager.query, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:38.885Z verbose vsanvcmgmtd[07104] [[email protected] sub=PyBackedMO opId=07dc8a1c]  Exit  vim.cns.VolumeManager.query (154 ms)
2021-09-15T22:18:38.923Z verbose vsanvcmgmtd[10186] [[email protected] sub=PyBackedMO opId=07dc8a1d]  Enter vim.cns.VolumeManager.create, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:38.928Z info vsanvcmgmtd[10186] [[email protected] sub=VolumeManager opId=07dc8a1d] CNS: create volume task created: task-314869
2021-09-15T22:18:38.928Z info vsanvcmgmtd[10186] [[email protected] sub=VolumeManager opId=07dc8a1d] CNS: Creating volume task scheduled.
2021-09-15T22:18:38.928Z verbose vsanvcmgmtd[10186] [[email protected] sub=PyBackedMO opId=07dc8a1d]  Exit  vim.cns.VolumeManager.create (4 ms)
2021-09-15T22:18:38.933Z info vsanvcmgmtd[25351] [[email protected] sub=VolumeManager opId=07dc8a1d] CNS: Creating volume task started
2021-09-15T22:18:39.029Z error vsanvcmgmtd[25351] [[email protected] sub=VolumeManager opId=07dc8a1d] CNS: backingDiskId not found: eb05ab8b-dcd0-4217-9c34-3ec8bda666a9, N3Vim5Fault8NotFound9ExceptionE(Fault cause: vim.fault.NotFound
--> )
--> [context]zKq7AVECAAAAAFnREgEVdnNhbnZjbWdtdGQAAFy7KmxpYnZtYWNvcmUuc28AAAw5GwB+uxgBNG38bGlidmltLXR5cGVzLnNvAIG7mg8BgVrlDwECtWEObGlidm1vbWkuc28AAspdEQJXXxEDHuoCbGliUHlDcHBWbW9taS5zbwACHJISAvGKEgTozgNsaWJ2c2xtLXR5cGVzLnNvAAXF3AdfY25zLnNvAAVN1gUFixQGABYkIwCSJiMA5RIrBtRzAGxpYnB0aHJlYWQuc28uMAAHzY4ObGliYy5zby42AA==[/context]
2021-09-15T22:18:39.031Z info vsanvcmgmtd[25351] [[email protected] sub=VolumeManager opId=07dc8a1d] Create volume completed: task-314869
2021-09-15T22:18:39.054Z verbose vsanvcmgmtd[12908] [[email protected] sub=PyBackedMO opId=07dc8a1e]  Enter vim.cns.VolumeManager.create, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:39.057Z info vsanvcmgmtd[12908] [[email protected] sub=VolumeManager opId=07dc8a1e] CNS: create volume task created: task-314870
2021-09-15T22:18:39.057Z info vsanvcmgmtd[12908] [[email protected] sub=VolumeManager opId=07dc8a1e] CNS: Creating volume task scheduled.
2021-09-15T22:18:39.057Z verbose vsanvcmgmtd[12908] [[email protected] sub=PyBackedMO opId=07dc8a1e]  Exit  vim.cns.VolumeManager.create (3 ms)
2021-09-15T22:18:39.062Z info vsanvcmgmtd[43408] [[email protected] sub=VolumeManager opId=07dc8a1e] CNS: Creating volume task started
2021-09-15T22:18:39.158Z verbose vsanvcmgmtd[07049] [[email protected] sub=PyBackedMO opId=sps-Main-13723-334-8a1f]  Enter vasa.NotificationManager.getAlarm, Pending: 1 (5269718e-b4b7-0511-b82b-b8f707de46d5)
2021-09-15T22:18:39.158Z verbose vsanvcmgmtd[07049] [[email protected] sub=PyBackedMO opId=sps-Main-13723-334-8a1f]  

What you expected to happen:

CNS volume objects should not just disappear?

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • csi-vsphere version: image
  • vsphere-cloud-controller-manager version: v1.20.0
  • Kubernetes version: 1.20.10
  • vSphere version: 6.7u3 (6.7.0.48000)
  • OS (e.g. from /etc/os-release): ubuntu 20.04
  • Kernel (e.g. uname -a): 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:
jingxu97
jingxu97

I don't think there is any datastore evacuation happening.

The flaky test is "an existing volume should be accessible on a new node after cluster scale up" from the Anthos qualification test suite. I think the failure only happens in this situation (roughly sketched in the code after this list):

  1. a pod with a volume is created
  2. the pod is deleted
  3. a new node is created and joins the cluster
  4. the pod is recreated on the new node

Other tests also detach the volume and attach it to a different node, but only this test fails. The only difference I can see is that here the volume is attached to a newly added node. In this case, the attach failed with "Failed to retrieve datastore for vol"; after the timeout, a detach was attempted and failed with the same error.
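For reference, a minimal client-go sketch of that repro flow, under the assumption of an existing PVC already bound to a vSphere CSI volume; the namespace, pod/PVC names, and the new node's name are placeholders, not values from the actual test:

// Hypothetical repro sketch, assuming client-go and a PVC named "data-pvc"
// already bound to a vSphere CSI volume.
package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// podUsingPVC builds a pod pinned to a specific node that mounts the PVC.
func podUsingPVC(name, node string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "workload"},
		Spec: corev1.PodSpec{
			NodeName: node, // pin to the newly added node
			Containers: []corev1.Container{{
				Name:         "app",
				Image:        "busybox",
				Command:      []string{"sleep", "3600"},
				VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: "data-pvc"},
				},
			}},
		},
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()
	pods := cs.CoreV1().Pods("workload")

	// Steps 1-2: the pod with the volume already exists; delete it.
	_ = pods.Delete(ctx, "redis", metav1.DeleteOptions{})
	time.Sleep(30 * time.Second) // give the old node time to detach

	// Step 3: a new node ("worker-new") joins the cluster via scale up (out of band).

	// Step 4: recreate the pod pinned to the new node; in the failing runs this is
	// where attach fails with "Failed to retrieve datastore for vol".
	if _, err := pods.Create(ctx, podUsingPVC("redis", "worker-new"), metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}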

Activity icon
issue

jingxu97 issue comment kubernetes/website

jingxu97
jingxu97

Update docs to mark in-tree GCP PD plugin as deprecated

This is a Bug Report

Problem: CSI Migration for gcepersistentdisk moved to Beta in the 1.17 release, so the in-tree plugin was already deprecated at that time. The in-tree plugin should be marked as deprecated in the Kubernetes docs.

https://kubernetes.io/docs/concepts/storage/volumes/#gcepersistentdisk

Proposed Solution:

Page to Update: https://kubernetes.io/...

Apr
12
1 month ago
Activity icon
issue

jingxu97 issue comment kubernetes-sigs/vsphere-csi-driver

jingxu97
jingxu97

CNS volumes disappear and all in cluster operations fail

jingxu97
jingxu97

Another issue related to this test: since the attach call failed with a "not found" error, the attach_detach_controller marks the volume as uncertain and then tries to detach it when the pod is deleted; however, the detach also fails with the same "not found" error (logs below; a rough sketch of this flow follows them):

{"level":"info","time":"2022-04-10T17:44:49.039474284Z","caller":"vanilla/controller.go:951","msg":"ControllerPublishVolume: called with args {VolumeId:e5dc0f1b-399b-4839-955e-18a174e4e1ea NodeId:08050d6e39f9-qual-322-0afbac17 VolumeCapability:mount:<fs_type:\"ext4\" > access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[storage.kubernetes.io/csiProvisionerIdentity:1649603959450-8081-csi.vsphere.vmware.com type:vSphere CNS Block Volume] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"6d250629-8008-41b7-baed-2deb2a6baca9"}
{"level":"error","time":"2022-04-10T17:44:49.072198256Z","caller":"volume/manager.go:616","msg":"CNS AttachVolume failed from vCenter \"atl-qual-vc02.anthos\" with err: ServerFaultCode: CNS: Failed to retrieve datastore for vol e5dc0f1b-399b-4839-955e-18a174e4e1ea. (vim.fault.NotFound) {\n   faultCause = (vmodl.MethodFault) null, \n   faultMessage = <unset>\n   msg = \"The vStorageObject (vim.vslm.ID) {\n   dynamicType = null,\n   dynamicProperty = null,\n   id = e5dc0f1b-399b-4839-955e-18a174e4e1ea\n} was not found\"\n}","TraceId":"6d250629-8008-41b7-baed-2deb2a6baca9","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/common/cns-lib/volume.(*defaultManager).AttachVolume.func1\n\t/build/pkg/common/cns-lib/volume/manager.go:616\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/common/cns-lib/volume.(*defaultManager).AttachVolume\n\t/build/pkg/common/cns-lib/volume/manager.go:672\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common.AttachVolumeUtil\n\t/build/pkg/csi/service/common/vsphereutil.go:548\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerPublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:1037\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerPublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:1050\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerPublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5632\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerPublishVolume\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:120\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:86\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerPublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5634\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email 
protected]/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:722"}
{"level":"info","time":"2022-04-10T17:44:49.07233599Z","caller":"volume/util.go:343","msg":"Extract vimfault type: +types.NotFound. SoapFault Info: +&{{http://schemas.xmlsoap.org/soap/envelope/ Fault} ServerFaultCode CNS: Failed to retrieve datastore for vol e5dc0f1b-399b-4839-955e-18a174e4e1ea. (vim.fault.NotFound) {\n   faultCause = (vmodl.MethodFault) null, \n   faultMessage = <unset>\n   msg = \"The vStorageObject (vim.vslm.ID) {\n   dynamicType = null,\n   dynamicProperty = null,\n   id = e5dc0f1b-399b-4839-955e-18a174e4e1ea\n} was not found\"\n} {{{{<nil> []}}}}} from err +ServerFaultCode: CNS: Failed to retrieve datastore for vol e5dc0f1b-399b-4839-955e-18a174e4e1ea. (vim.fault.NotFound) {\n   faultCause = (vmodl.MethodFault) null, \n   faultMessage = <unset>\n   msg = \"The vStorageObject (vim.vslm.ID) {\n   dynamicType = null,\n   dynamicProperty = null,\n   id = e5dc0f1b-399b-4839-955e-18a174e4e1ea\n} was not found\"\n}","TraceId":"6d250629-8008-41b7-baed-2deb2a6baca9"}
{"level":"error","time":"2022-04-10T17:44:49.072375655Z","caller":"common/vsphereutil.go:550","msg":"failed to attach disk \"e5dc0f1b-399b-4839-955e-18a174e4e1ea\" with VM: \"VirtualMachine:vm-1013755 [VirtualCenterHost: atl-qual-vc02.anthos, UUID: 4204f258-985f-b7f0-0782-e9a78fe37425, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: atl-qual-vc02.anthos]]\". err: ServerFaultCode: CNS: Failed to retrieve datastore for vol e5dc0f1b-399b-4839-955e-18a174e4e1ea. (vim.fault.NotFound) {\n   faultCause = (vmodl.MethodFault) null, \n   faultMessage = <unset>\n   msg = \"The vStorageObject (vim.vslm.ID) {\n   dynamicType = null,\n   dynamicProperty = null,\n   id = e5dc0f1b-399b-4839-955e-18a174e4e1ea\n} was not found\"\n} faultType \"vim.fault.NotFound\"","TraceId":"6d250629-8008-41b7-baed-2deb2a6baca9","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common.AttachVolumeUtil\n\t/build/pkg/csi/service/common/vsphereutil.go:550\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerPublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:1037\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerPublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:1050\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerPublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5632\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerPublishVolume\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:120\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:86\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerPublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5634\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email 
protected]/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:722"}
{"level":"info","time":"2022-04-10T17:47:20.066064719Z","caller":"vanilla/controller.go:1075","msg":"ControllerUnpublishVolume: called with args {VolumeId:e5dc0f1b-399b-4839-955e-18a174e4e1ea NodeId:08050d6e39f9-qual-322-0afbac17 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"604cd65a-62ce-49f2-9859-41f6630b7b85"}
{"level":"error","time":"2022-04-10T17:47:20.084717468Z","caller":"vanilla/controller.go:1108","msg":"volumeID \"e5dc0f1b-399b-4839-955e-18a174e4e1ea\" not found in QueryVolume","TraceId":"604cd65a-62ce-49f2-9859-41f6630b7b85","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:1108\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:1162\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5650\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerUnpublishVolume\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:141\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:88\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5652\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:722"}
Activity icon
issue

jingxu97 issue comment kubernetes-csi/docs

jingxu97
jingxu97

Update security considerations for CSI inline ephemeral volumes

This PR:

  • Updates the milestones for Generic Ephemeral Inline Volumes (already went GA in 1.23)
  • Updates the documentation for CSI inline volumes per the KEP's Security Considerations and Read-only Volumes sections (an illustrative pod spec follows below).

KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/596-csi-inline-volumes Enhancement: https://github.com/kubernetes/enhancements/issues/596

Update security considerations for CSI inline ephemeral volumes
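For context, an illustrative pod spec of the kind the updated docs cover: a CSI inline ephemeral volume that is both published read-only by the driver and mounted read-only in the container. The driver name and volume attributes are placeholders, not taken from the PR or KEP.

// Illustrative only: a pod with a CSI inline ephemeral volume mounted read-only,
// the pattern the updated security guidance describes. Driver name and
// attributes are placeholders.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	readOnly := true
	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "inline-vol-example"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "busybox",
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "inline-secrets",
					MountPath: "/secrets",
					ReadOnly:  true, // mount read-only inside the container as well
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "inline-secrets",
				VolumeSource: corev1.VolumeSource{
					CSI: &corev1.CSIVolumeSource{
						Driver:           "inline.example.com", // placeholder driver name
						ReadOnly:         &readOnly,            // ask the driver to publish read-only
						VolumeAttributes: map[string]string{"secretName": "example"}, // placeholder
					},
				},
			}},
		},
	}
	out, err := yaml.Marshal(pod)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}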
Activity icon
issue

jingxu97 issue comment kubernetes/kubernetes

jingxu97
jingxu97

When volume is not marked in-use, do not backoff

We unnecessarily trigger exponential backoff when a volume is not yet marked in-use. Instead, we can wait for the volume to be marked as in-use before triggering the operation_executor (see the sketch below). This could reduce the time it takes to mount attached volumes.

/sig storage /kind bug

cc @jsafrane @jingxu97

Allow attached volumes to be mounted quicker by skipping exp. backoff when checking for reported-in-use volumes
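A minimal sketch of the idea (hypothetical types and names; not the real kubelet reconciler code): while a volume has not yet been reported in-use, the loop skips it instead of attempting the mount and accruing exponential backoff.

// Hypothetical sketch of "skip backoff until reported in-use"; the names mirror
// kubelet concepts (reconciler, operation executor) but this is not the real code.
package main

import "fmt"

type volumeToMount struct {
	Name          string
	ReportedInUse bool // true once the volume appears in node.Status.VolumesInUse
}

type operationExecutor struct{ failures map[string]int }

func (oe *operationExecutor) MountVolume(v volumeToMount) error {
	// A failure here would normally feed this volume's exponential backoff.
	oe.failures[v.Name]++
	return fmt.Errorf("mount of %s failed (failure count %d)", v.Name, oe.failures[v.Name])
}

func reconcile(volumes []volumeToMount, oe *operationExecutor) {
	for _, v := range volumes {
		if !v.ReportedInUse {
			// Previously the mount would be attempted here, fail, and start
			// backing off; instead, just wait until the attach is reported.
			fmt.Printf("volume %s not yet reported in-use, skipping without backoff\n", v.Name)
			continue
		}
		if err := oe.MountVolume(v); err != nil {
			fmt.Println(err)
		}
	}
}

func main() {
	oe := &operationExecutor{failures: map[string]int{}}
	reconcile([]volumeToMount{
		{Name: "pvc-202662bf", ReportedInUse: false},
		{Name: "pvc-d22d5819", ReportedInUse: true},
	}, oe)
}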
jingxu97
jingxu97

Oh, I missed it. Maybe 1.21 can also be useful.

Activity icon
issue

jingxu97 issue comment kubernetes-sigs/vsphere-csi-driver

jingxu97
jingxu97

CNS volumes disappear and all in cluster operations fail

jingxu97
jingxu97

From our testing, it looks like 6.7u3 is OK, but 7.0 fails very often.

Activity icon
issue

jingxu97 issue comment kubernetes-sigs/vsphere-csi-driver

jingxu97
jingxu97

CNS volumes disappear and all in cluster operations fail

jingxu97
jingxu97

vSphere version: 7.0u3, CSI driver: 2.4.0

The test (Anthos storage qualification test) is designed to move the pod to a different node and then access the same volume.

Activity icon
issue

jingxu97 issue comment kubernetes-sigs/vsphere-csi-driver

jingxu97
jingxu97

CNS volumes disappear and all in cluster operations fail
