/close
dockershim is dead
smarterclayton issue comment kubernetes/kubernetes
Since upgrading to v1.22.7-gke.1500 in a GKE cluster, we have had customer reports that traffic from one namespace is going to another namespace.
After investigating these reports, we found containers with these statuses: OutOfmemory, Terminated, ContainerStatusUnknown, OOMKilled, OutOfcpu. Their IP address is given to a new pod in a different namespace, yet the pods with the statuses above are still running and serving requests. The only way to resolve it is to manually delete the container, which immediately resolves the issue.
We are not sure where to start looking. We assume that these containers are still running on the nodes, although we haven't seen them running on the node when using this command: crictl ps
However, they must still be running, because when we hit this issue the web application loads fully in that namespace.
We expect the pods to be deleted, and if their IP address is reused, for the IP address not to be routed to the old pod anymore.
Start a GKE cluster using OS: Container-Optimized OS, on version: v1.22.7-gke.1500
Achieve one of the following statuses on a pod: OutOfmemory, Terminated, ContainerStatusUnknown, OOMKilled, OutOfcpu
Observe that a new pod in the cluster is assigned the same IP address as the pod with one of the above failed statuses, and that the old pod is still serving requests on the now-reused IP address.
No response
$ kubectl version
v1.22.7-gke.1500
GKE
# On Linux:
$ cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
KERNEL_COMMIT_ID=ccbab0481cec29d7f07947bcb6255f325b88513f
GOOGLE_CRASH_ID=Lakitu
GOOGLE_METRICS_PRODUCT_ID=26
VERSION=93
VERSION_ID=93
BUILD_ID=16623.102.23
$ uname -a
Linux gke-cf-europe-west2-cluster-ego-c2-pm-08b6d578-8kz5 5.10.90+ #1 SMP Sat Mar 5 10:09:49 UTC 2022 x86_64 Intel(R) Xeon(R) CPU @ 3.10GHz GenuineIntel GNU/Linux
Are CNI plug-ins depending on pod status IPs as the authoritative record of allocation? Also, do we formally define CNI destroy as happening before that release?
The answer to those two questions is required to correctly determine where in pod shutdown the logic needs to be added (and this is another kubelet e2e test we need to add). I.e., if the second is no, we can clear the status podIPs once the pod containers are confirmed shut down. If the second is yes, we have to defer the final pod status update and the clear until after CNI destroy is guaranteed to succeed (that has other safety implications).
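For illustration only, a minimal Go sketch of the two orderings described above, using hypothetical helper names (waitForContainersTerminated, teardownCNI, updatePodStatus) and a simplified Pod type rather than the real kubelet code:

```go
package shutdownsketch

// Hypothetical, minimal stand-ins for the real kubelet/API types and helpers;
// this is only a sketch of the two orderings being discussed.
type Pod struct {
	StatusPodIPs []string
}

func waitForContainersTerminated(p *Pod) error { return nil } // assumed helper
func teardownCNI(p *Pod) error                 { return nil } // assumed helper
func updatePodStatus(p *Pod) error             { return nil } // assumed helper

// If CNI destroy is NOT defined as happening before IP release, the status
// podIPs can be cleared as soon as the containers are confirmed shut down.
func finalizeClearEarly(p *Pod) error {
	if err := waitForContainersTerminated(p); err != nil {
		return err
	}
	p.StatusPodIPs = nil
	return updatePodStatus(p)
}

// If CNI destroy IS the release point, the final status update (and the
// podIPs clear) must be deferred until network teardown is known to succeed.
func finalizeClearAfterCNI(p *Pod) error {
	if err := waitForContainersTerminated(p); err != nil {
		return err
	}
	if err := teardownCNI(p); err != nil {
		return err // keep podIPs populated until teardown succeeds
	}
	p.StatusPodIPs = nil
	return updatePodStatus(p)
}
```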
smarterclayton issue comment kubernetes/kubernetes
What type of PR is this? /kind cleanup
What this PR does / why we need it: This PR adds a test to test the following untested endpoints:
Which issue(s) this PR fixes: Fixes #108641
Testgrid Link: Batchv1JobLifecycleTest Testgrid
Special notes for your reviewer: Adds +4 endpoint test coverage (good for conformance)
Does this PR introduce a user-facing change?:
NONE
Release note:
NONE
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
NONE
/sig testing /sig architecture /area conformance
/approve
smarterclayton merge to kubernetes/enhancements
smarterclayton wants to merge kubernetes/enhancements
At this point we will now have incrementally implemented 90% of etcd, vs 50%.
So the first meta question I want to discuss is whether the final shape is our idealized version of storage (a reference in-memory version), and if so, whether our approach is building towards something we should formally define. Second is whether the etcd proxy / core etcd data structures do a better job of this than the approach we are taking. Third will be identifying the core data-structure tradeoff - is the btree the ideal structure for us vs. other types (why you started here), and what key semantics are "must support", so that we can decide whether the data structure is as minimal as possible.
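As a rough illustration of the "must support" key semantics (ordered keys with prefix/range reads plus point writes), here is a hedged sketch of an in-memory B-tree keyed store; it assumes the github.com/google/btree package and is not the actual storage or watch-cache code:

```go
package storagesketch

import (
	"strings"

	"github.com/google/btree"
)

// kv is a simplified key/value item ordered by key.
type kv struct {
	key   string
	value []byte
}

func (a kv) Less(b btree.Item) bool { return a.key < b.(kv).key }

type store struct{ tree *btree.BTree }

func newStore() *store { return &store{tree: btree.New(32)} }

func (s *store) set(key string, value []byte) {
	s.tree.ReplaceOrInsert(kv{key: key, value: value})
}

// rangeRead returns all values whose keys share the given prefix, in key
// order -- the list-on-a-prefix semantic that etcd provides today.
func (s *store) rangeRead(prefix string) [][]byte {
	var out [][]byte
	s.tree.AscendGreaterOrEqual(kv{key: prefix}, func(i btree.Item) bool {
		item := i.(kv)
		if !strings.HasPrefix(item.key, prefix) {
			return false // past the prefix; stop iterating
		}
		out = append(out, item.value)
		return true
	})
	return out
}
```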
smarterclayton wants to merge kubernetes/enhancements
“do not change”
smarterclayton merge to kubernetes/enhancements
smarterclayton issue comment kubernetes/kubernetes
What type of PR is this? /kind cleanup
What this PR does / why we need it: This PR adds a test to test the following untested endpoints:
Which issue(s) this PR fixes: Fixes #108113
Testgrid Link: testgrid-link
Special notes for your reviewer: Adds +3 endpoint test coverage (good for conformance)
Does this PR introduce a user-facing change?:
NONE
Release note:
NONE
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
NONE
/sig testing /sig architecture /area conformance
/approve /lgtm
smarterclayton issue comment kubernetes/kubernetes
/kind bug
We previously guaranteed the thread safety of methods called on `Request` in client-go; the `retry` interface introduced is a member variable and is not thread safe. This PR introduces a factory function that returns a `retry` interface inside `Watch`, `Do`, `DoRaw`, and `Stream`, making it thread safe as it was before.
Please note there are other member variables in `Request` that are not thread safe today; this PR does not address that.
Fixes #109155
NONE
/lgtm /approve
After 1.24 I'd like to update the godoc of client-go to make thread safety obvious, and add a test that verifies client behavior in the presence of retries (such that the race detector would flag us).
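For context, a hedged sketch of the pattern the fix relies on (names here are hypothetical, not the actual client-go types): retry state moves from a shared member variable to a value constructed inside each call.

```go
package clientsketch

import "sync"

// retryState stands in for the mutable per-attempt retry bookkeeping.
type retryState struct {
	attempts int
}

// Before (racy): one retry value stored on the shared request and mutated by
// every goroutine that calls Do/Watch/DoRaw/Stream.
type requestBefore struct {
	mu    sync.Mutex // would be needed to make this safe, but isn't there
	retry *retryState
}

// After: each entry point builds its own retry state via a factory-style
// function, so concurrent calls on the same request share no mutable state.
type requestAfter struct {
	newRetryFn func() *retryState
}

func (r *requestAfter) Do() error {
	retry := r.newRetryFn() // fresh, call-local state
	for retry.attempts < 3 {
		retry.attempts++
		// ... issue the HTTP request, decide whether to retry ...
	}
	return nil
}
```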
smarterclayton wants to merge kubernetes/kubernetes
/kind bug
We previously guaranteed the thread safety of methods called on `Request` in client-go; the `retry` interface introduced is a member variable and is not thread safe. This PR introduces a factory function that returns a `retry` interface inside `Watch`, `Do`, `DoRaw`, and `Stream`, making it thread safe as it was before.
Please note there are other member variables in `Request` that are not thread safe today; this PR does not address that.
Fixes #109155
NONE
I'd prefer not to use "factory" (not part of our style).
Generally, this would be:
retryFn requestRetryFunc
(we use Fn for variables and Func for types). Also, the godoc should describe this as being for testing.
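A tiny sketch of that convention, with hypothetical names:

```go
package namingsketch

// retryer stands in for the retry interface discussed in the PR.
type retryer interface {
	ShouldRetry(err error) bool
}

// requestRetryFunc is a type, so per the convention it ends in "Func".
type requestRetryFunc func(maxRetries int) retryer

type request struct {
	// retryFn is a field (variable), so it ends in "Fn"; the godoc would also
	// note that this hook exists primarily for testing.
	retryFn requestRetryFunc
}
```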
smarterclayton merge to kubernetes/kubernetes
/kind bug
We previously guaranteed the thread safety of methods called on `Request` in client-go; the `retry` interface introduced is a member variable and is not thread safe. This PR introduces a factory function that returns a `retry` interface inside `Watch`, `Do`, `DoRaw`, and `Stream`, making it thread safe as it was before.
Please note there are other member variables in `Request` that are not thread safe today; this PR does not address that.
Fixes #109155
NONE
smarterclayton issue comment kubernetes/kubernetes
/kind bug
We previously guaranteed the thread safety of methods called on `Request` in client-go; the `retry` interface introduced is a member variable and is not thread safe. This PR introduces a factory function that returns a `retry` interface inside `Watch`, `Do`, `DoRaw`, and `Stream`, making it thread safe as it was before.
Please note there are other member variables in `Request` that are not thread safe today; this PR does not address that.
Fixes #109155
NONE
I'm going to try to get to this tonight, but other stuff has been blocking me.
smarterclayton issue comment kubernetes/kubernetes
What type of PR is this? /kind cleanup
What this PR does / why we need it: This PR adds a test to test the following untested endpoints:
Which issue(s) this PR fixes: Fixes #108113 Fixes #108641
Testgrid Link: Batchv1JobLifecycleTest Testgrid
Testgrid Link: BatchV1NamespacedJobStatus testgrid-link
Special notes for your reviewer: Adds +7 endpoint test coverage (good for conformance)
Does this PR introduce a user-facing change?:
NONE
Release note:
NONE
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
NONE
/sig testing /sig architecture /area conformance
err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
patchedJob, err = jobClient.Get(context.TODO(), jobName, metav1.GetOptions{})
framework.ExpectNoError(err, "Unable to get job %s", jobName)
patchedJob.Spec.Suspend = pointer.BoolPtr(false)
patchedJob.Annotations["updated"] = "true"
^ you can't assume annotations is set, you need to initialize it to an empty map if it's nil first
updatedJob, err = e2ejob.UpdateJob(f.ClientSet, ns, patchedJob)
return err
})
So that's a bug in the test that would fix the flake.
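A minimal sketch of the fix suggested above, mirroring the snippet with the nil-map guard added (it assumes the same surrounding test variables as the original):

```go
err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
	patchedJob, err = jobClient.Get(context.TODO(), jobName, metav1.GetOptions{})
	framework.ExpectNoError(err, "Unable to get job %s", jobName)
	patchedJob.Spec.Suspend = pointer.BoolPtr(false)
	if patchedJob.Annotations == nil {
		// Annotations may never have been set; initialize before writing.
		patchedJob.Annotations = map[string]string{}
	}
	patchedJob.Annotations["updated"] = "true"
	updatedJob, err = e2ejob.UpdateJob(f.ClientSet, ns, patchedJob)
	return err
})
```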
smarterclayton issue comment kubernetes/kubernetes
CSI spec 1.5 enhanced the spec to add an optional secrets field to NodeExpandVolumeRequest. This commit adds NodeExpandSecret to the CSI PV source and also derives the expansion secret in csiclient to send it out as part of the node expand request.
/kind feature
Optionally add one or more of the following kinds if applicable: /kind api-change
Fixes #95367
This release adds support for NodeExpandSecret for the CSI driver client, which enables CSI drivers to make use of this secret while performing a node expansion operation based on the user request. Previously no secret was provided as part of the node expansion call, so CSI drivers could not make use of one while expanding the volume on the node side.
KEP reference: https://github.com/kubernetes/enhancements/pull/3173/
/approve
for API changes and the feature gate
smarterclayton issue comment kubernetes/kubernetes
This commit refactors the retry logic to include resetting the request body. The reset logic is only called if it is not the first attempt. This refactor is necessary mainly because, per the retry logic, we now always ensure that the request body is reset after the response body is fully read and closed, in order to reuse the same TCP connection.
Previously, the reset of the request body and the call to read and close the response body were not in the right order, which led to race conditions.
xref https://github.com/kubernetes/kubernetes/issues/108906
Fixes a bug in our client retry logic that was not draining/closing the body prior to trying to reset the request body for a retry. According to go upstream, this is required to avoid races.
client-go: if resetting the body fails before a retry, an error is now surfaced to the user.
Continuing https://github.com/kubernetes/kubernetes/pull/109028
/priority critical-urgent /assign @aojea @liggitt @smarterclayton /cc @dims @tkashem would something like this work?
Would be good to see that test added; this is lgtm from my perspective, but I'll let one other person do the tag when the test is added (@tkashem probably).
/approve
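As a reference point, here is a hedged sketch of the ordering the PR enforces, written against the plain net/http API rather than the actual client-go retry code: drain and close the previous response body first (so the TCP connection can be reused), then rebuild the request body via `Request.GetBody`, surfacing an error if the body cannot be reset.

```go
package retrysketch

import (
	"fmt"
	"io"
	"net/http"
)

// prepareForRetry is an illustrative helper, not client-go code.
func prepareForRetry(req *http.Request, resp *http.Response) error {
	if resp != nil {
		// Read to EOF and close so the underlying connection returns to the
		// pool instead of being torn down.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	if req.Body != nil {
		if req.GetBody == nil {
			// Non-rewindable body: report an error rather than retrying with
			// an already-consumed body.
			return fmt.Errorf("cannot reset request body for retry")
		}
		body, err := req.GetBody()
		if err != nil {
			return err
		}
		req.Body = body
	}
	return nil
}
```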
smarterclayton wants to merge kubernetes/kubernetes
/kind feature
Graceful node shutdown currently places pods into terminal phase upon shutdown. As discussed in https://github.com/kubernetes/kubernetes/issues/104531#issuecomment-982763592 and https://github.com/kubernetes/kubernetes/issues/104531#issuecomment-982766908 some users/distributions would benefit from the ability to toggle this behavior and instead to not put pods into terminal phase on shutdown.
For example, if it's expected that after shutdown the node will reboot, it may be desirable to not place the pods into terminal phase, so that way after reboot the pods will start running again after the node comes back up.
This PR adds a new kubelet configuration option, `GracefulNodeShutdownPodPolicy`, to toggle this behavior. The default setting of `GracefulNodeShutdownPodPolicy` is `SetTerminal`, which matches the current behavior since 1.20, when graceful node shutdown was introduced.
Fixes #108991
Add a new kubelet configuration option `GracefulNodeShutdownPodPolicy` to toggle setting pods to terminal phase during graceful node shutdown.
Only default GracefulNodeShutdownPodPolicy to SetTerminal if Graceful Node Shutdown feature is enabled (i.e. ShutdownGracePeriod) in pkg/kubelet/apis/config/v1beta1/defaults.go. Otherwise the setting will be left unset, "".
Let me translate this to make sure we're all on the same page with the possible configurations:
1. `ShutdownGracePeriod` is set but `GracefulNodeShutdownPodPolicy` is empty or null -> defaults to `SetTerminal` (preserves behavior for beta users)
2. `GracefulNodeShutdownPodPolicy` is set but `ShutdownGracePeriod` is empty or zero -> defaults to `LeaveRunning`
3. `GracefulNodeShutdownPodPolicy` and `ShutdownGracePeriod` are both set -> no default necessary
If we did this, then I think users who move from 2 to 3 would be confused (they'd switch from LeaveRunning to SetTerminal, which is bad).
So I might argue that 2 should be:
2. `GracefulNodeShutdownPodPolicy` is set but `ShutdownGracePeriod` is empty or zero -> validation error; the user must explicitly specify `ShutdownGracePeriod`
Which then trains new users to make a choice (and we can say "LeaveRunning is the default behavior for pods on kubelet shutdown when the grace period is disabled" or something).
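A sketch of what option 2's validation could look like, using a simplified config struct (field names follow the proposal above; this is not the actual kubelet validation code):

```go
package validationsketch

import (
	"fmt"
	"time"
)

// kubeletConfigSketch is a stand-in for the relevant KubeletConfiguration fields.
type kubeletConfigSketch struct {
	ShutdownGracePeriod           time.Duration
	GracefulNodeShutdownPodPolicy string // "", "SetTerminal", or "LeaveRunning"
}

func validateShutdownConfig(c kubeletConfigSketch) error {
	if c.GracefulNodeShutdownPodPolicy != "" && c.ShutdownGracePeriod <= 0 {
		// Force the user to make an explicit choice rather than silently
		// defaulting to LeaveRunning.
		return fmt.Errorf("gracefulNodeShutdownPodPolicy requires shutdownGracePeriod to be set")
	}
	return nil
}
```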
smarterclayton merge to kubernetes/kubernetes
/kind feature
Graceful node shutdown currently places pods into terminal phase upon shutdown. As discussed in https://github.com/kubernetes/kubernetes/issues/104531#issuecomment-982763592 and https://github.com/kubernetes/kubernetes/issues/104531#issuecomment-982766908 some users/distributions would benefit from the ability to toggle this behavior and instead to not put pods into terminal phase on shutdown.
For example, if it's expected that after shutdown the node will reboot, it may be desirable to not place the pods into terminal phase, so that way after reboot the pods will start running again after the node comes back up.
This PR adds a new kubelet configuration option, `GracefulNodeShutdownPodPolicy`, to toggle this behavior. The default setting of `GracefulNodeShutdownPodPolicy` is `SetTerminal`, which matches the current behavior since 1.20, when graceful node shutdown was introduced.
Fixes #108991
Add a new kubelet configuration option `GracefulNodeShutdownPodPolicy` to toggle setting pods to terminal phase during graceful node shutdown.
smarterclayton wants to merge kubernetes/kubernetes
CSI spec 1.5 enhanced the spec to add an optional secrets field to NodeExpandVolumeRequest. This commit adds NodeExpandSecret to the CSI PV source and also derives the expansion secret in csiclient to send it out as part of the node expand request.
/kind feature
Optionally add one or more of the following kinds if applicable: /kind api-change
Fixes #95367
This release adds support for NodeExpandSecret for the CSI driver client, which enables CSI drivers to make use of this secret while performing a node expansion operation based on the user request. Previously no secret was provided as part of the node expansion call, so CSI drivers could not make use of one while expanding the volume on the node side.
KEP reference: https://github.com/kubernetes/enhancements/pull/3173/
Sorry for this - but we wouldn't use `nil`, because this apidoc is intended for JSON users (I should have clarified). Now that I think you've answered the question of "the field is absent / omitted", I'd suggest this sentence (and all the others in this file, which can be done later) as:
This field is optional, and may be omitted if no secret is required.
smarterclayton merge to kubernetes/kubernetes
CSI spec 1.5 enhanced the spec to add an optional secrets field to NodeExpandVolumeRequest. This commit adds NodeExpandSecret to the CSI PV source and also derives the expansion secret in csiclient to send it out as part of the node expand request.
/kind feature
Optionally add one or more of the following kinds if applicable: /kind api-change
Fixes #95367
This release adds support for NodeExpandSecret for the CSI driver client, which enables CSI drivers to make use of this secret while performing a node expansion operation based on the user request. Previously no secret was provided as part of the node expansion call, so CSI drivers could not make use of one while expanding the volume on the node side.
KEP reference: https://github.com/kubernetes/enhancements/pull/3173/
smarterclayton issue comment kubernetes/kubernetes
This commit refactors the retry logic to include resetting the request body. The reset logic is only called if it is not the first attempt. This refactor is necessary mainly because, per the retry logic, we now always ensure that the request body is reset after the response body is fully read and closed, in order to reuse the same TCP connection.
Previously, the reset of the request body and the call to read and close the response body were not in the right order, which led to race conditions.
xref https://github.com/kubernetes/kubernetes/issues/108906
Fixes a bug in our client retry logic that was not draining/closing the body prior to trying to reset the request body for a retry. According to go upstream, this is required to avoid races.
client-go: if resetting the body fails before a retry, an error is now surfaced to the user.
Continuing https://github.com/kubernetes/kubernetes/pull/109028
/priority critical-urgent /assign @aojea @liggitt @smarterclayton /cc @dims @tkashem would something like this work?
Do we have enough testing to be confident the race is now closed with the current changes? If so, I'll tag this now, otherwise I'll wait for that.
smarterclayton merge to kubernetes/kubernetes
This commit refactors the retry logic to include resetting the request body. The reset logic is only called if it is not the first attempt. This refactor is necessary mainly because, per the retry logic, we now always ensure that the request body is reset after the response body is fully read and closed, in order to reuse the same TCP connection.
Previously, the reset of the request body and the call to read and close the response body were not in the right order, which led to race conditions.
xref https://github.com/kubernetes/kubernetes/issues/108906
Fixes a bug in our client retry logic that was not draining/closing the body prior to trying to reset the request body for a retry. According to go upstream, this is required to avoid races.
client-go: if resetting the body fails before a retry, an error is now surfaced to the user.
Continuing https://github.com/kubernetes/kubernetes/pull/109028
/priority critical-urgent /assign @aojea @liggitt @smarterclayton /cc @dims @tkashem would something like this work?
smarterclayton wants to merge kubernetes/kubernetes
This commit refactors the retry logic to include resetting the request body. The reset logic is only called if it is not the first attempt. This refactor is necessary mainly because, per the retry logic, we now always ensure that the request body is reset after the response body is fully read and closed, in order to reuse the same TCP connection.
Previously, the reset of the request body and the call to read and close the response body were not in the right order, which led to race conditions.
xref https://github.com/kubernetes/kubernetes/issues/108906
Fixes a bug in our client retry logic that was not draining/closing the body prior to trying to reset the request body for a retry. According to go upstream, this is required to avoid races.
client-go: if resetting the body fails before a retry, an error is now surfaced to the user.
Continuing https://github.com/kubernetes/kubernetes/pull/109028
/priority critical-urgent /assign @aojea @liggitt @smarterclayton /cc @dims @tkashem would something like this work?
I think we may need to assert that this is `!apierrors.IsInternalError(err)`, or assert that this is not a server error at all (possibly by checking what the type is after unwrapping).
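A hedged sketch of that assertion (the helper and test scaffolding are hypothetical; `apierrors` is k8s.io/apimachinery/pkg/api/errors):

```go
package assertsketch

import (
	"errors"
	"testing"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

func assertNotServerError(t *testing.T, err error) {
	t.Helper()
	if apierrors.IsInternalError(err) {
		t.Fatalf("expected a client-side error, got an internal server error: %v", err)
	}
	// Alternatively, unwrap and check that the concrete type is not an API
	// status error at all.
	var statusErr *apierrors.StatusError
	if errors.As(err, &statusErr) {
		t.Fatalf("expected a non-API error, got %v", statusErr)
	}
}
```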
smarterclayton wants to merge kubernetes/kubernetes
pkg/storage/etcd3: correctly validate resourceVersions
In a number of tests, the underlying storage backend interaction will return the revision (logical clock underpinning the MVCC implementation) at the call-time of the RPC. Previously, the tests validated that this returned revision was exactly equal to some previously seen revision. This assertion is only true in systems where no other events are advancing the logical clock. For instance, when using a single etcd cluster as a shared fixture for these tests, the assertion is not valid any longer. By checking that the returned revision is no older than the previously seen revision, the validation logic is correct in all cases.
Signed-off-by: Steve Kuznetsov [email protected]
Depends on https://github.com/kubernetes/kubernetes/pull/108936
/kind cleanup
NONE
/sig api-machinery /assign @liggitt @smarterclayton @sttts @deads2k
I could see three possibilities around the `==` vs. `>=` semantics in the test, but we might want the generalized tests to be parameterized to describe the exact relationship (or we simply duplicate the tests via good old cut and paste). I assume here, Steve, you'd like to generalize these tests, but wanted to be sure?
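For example, a parameterized assertion might look roughly like this (helper and parameter names are hypothetical, not the existing test code): the exact-match check is only valid with a dedicated etcd, while the monotonic check also holds when the cluster is a shared fixture.

```go
package rvsketch

import "testing"

// checkRevision parameterizes the relationship between the observed revision
// and the last-seen revision.
func checkRevision(t *testing.T, got, lastSeen int64, sharedCluster bool) {
	t.Helper()
	if sharedCluster {
		// Other clients may have advanced the logical clock; only require
		// that the revision never goes backwards.
		if got < lastSeen {
			t.Errorf("revision went backwards: got %d, last seen %d", got, lastSeen)
		}
		return
	}
	// Dedicated fixture: the stricter exact-equality semantics still apply.
	if got != lastSeen {
		t.Errorf("expected revision %d, got %d", lastSeen, got)
	}
}
```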
smarterclayton merge to kubernetes/kubernetes
pkg/storage/etcd3: correctly validate resourceVersions
In a number of tests, the underlying storage backend interaction will return the revision (logical clock underpinning the MVCC implementation) at the call-time of the RPC. Previously, the tests validated that this returned revision was exactly equal to some previously seen revision. This assertion is only true in systems where no other events are advancing the logical clock. For instance, when using a single etcd cluster as a shared fixture for these tests, the assertion is not valid any longer. By checking that the returned revision is no older than the previously seen revision, the validation logic is correct in all cases.
Signed-off-by: Steve Kuznetsov [email protected]
Depends on https://github.com/kubernetes/kubernetes/pull/108936
/kind cleanup
NONE
/sig api-machinery /assign @liggitt @smarterclayton @sttts @deads2k
smarterclayton issue comment kubernetes/kubernetes
pkg/storage/etcd3: correctly validate resourceVersions
In a number of tests, the underlying storage backend interaction will return the revision (logical clock underpinning the MVCC implementation) at the call-time of the RPC. Previously, the tests validated that this returned revision was exactly equal to some previously seen revision. This assertion is only true in systems where no other events are advancing the logical clock. For instance, when using a single etcd cluster as a shared fixture for these tests, the assertion is not valid any longer. By checking that the returned revision is no older than the previously seen revision, the validation logic is correct in all cases.
Signed-off-by: Steve Kuznetsov [email protected]
Depends on https://github.com/kubernetes/kubernetes/pull/108936
/kind cleanup
NONE
/sig api-machinery /assign @liggitt @smarterclayton @sttts @deads2k
Would `kine` be easier to test if this was fixed? Is there any real disadvantage to weakening this assumption (such as compaction or future etcd changes that lead to potential background changes)?
smarterclayton wants to merge kubernetes/kubernetes
CSI spec 1.5 enhanced the spec to add an optional secrets field to NodeExpandVolumeRequest. This commit adds NodeExpandSecret to the CSI PV source and also derives the expansion secret in csiclient to send it out as part of the node expand request.
/kind feature
Optionally add one or more of the following kinds if applicable: /kind api-change
Fixes #95367
This release adds support for NodeExpandSecret for the CSI driver client, which enables CSI drivers to make use of this secret while performing a node expansion operation based on the user request. Previously no secret was provided as part of the node expansion call, so CSI drivers could not make use of one while expanding the volume on the node side.
KEP reference: https://github.com/kubernetes/enhancements/pull/3173/
To clarify - "empty" in the context of a client in a JSON API would generally mean `"nodeExpandSecretRef": {}`, which is definitely not allowed by validation.
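A small Go illustration of the distinction, using simplified stand-in structs rather than the real API types: a nil pointer with `omitempty` leaves the field absent, while a non-nil empty struct serializes to the `{}` form that validation rejects.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// secretRef and csiSource are simplified stand-ins for the API structs.
type secretRef struct {
	Name      string `json:"name,omitempty"`
	Namespace string `json:"namespace,omitempty"`
}

type csiSource struct {
	NodeExpandSecretRef *secretRef `json:"nodeExpandSecretRef,omitempty"`
}

func main() {
	omitted, _ := json.Marshal(csiSource{})                                // field absent: {}
	empty, _ := json.Marshal(csiSource{NodeExpandSecretRef: &secretRef{}}) // {"nodeExpandSecretRef":{}}
	fmt.Println(string(omitted), string(empty))
}
```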
smarterclayton merge to kubernetes/kubernetes
CSI spec 1.5 enhanced the spec to add an optional secrets field to NodeExpandVolumeRequest. This commit adds NodeExpandSecret to the CSI PV source and also derives the expansion secret in csiclient to send it out as part of the node expand request.
/kind feature
Optionally add one or more of the following kinds if applicable: /kind api-change
Fixes #95367
This release adds support for NodeExpandSecret for the CSI driver client, which enables CSI drivers to make use of this secret while performing a node expansion operation based on the user request. Previously no secret was provided as part of the node expansion call, so CSI drivers could not make use of one while expanding the volume on the node side.
KEP reference: https://github.com/kubernetes/enhancements/pull/3173/
smarterclayton wants to merge kubernetes/kubernetes
CSI spec 1.5 enhanced the spec to add an optional secrets field to NodeExpandVolumeRequest. This commit adds NodeExpandSecret to the CSI PV source and also derives the expansion secret in csiclient to send it out as part of the node expand request.
/kind feature
Optionally add one or more of the following kinds if applicable: /kind api-change
Fixes #95367
This release adds support for NodeExpandSecret for the CSI driver client, which enables CSI drivers to make use of this secret while performing a node expansion operation based on the user request. Previously no secret was provided as part of the node expansion call, so CSI drivers could not make use of one while expanding the volume on the node side.
KEP reference: https://github.com/kubernetes/enhancements/pull/3173/
(I realize we're just copying the other fields, but as an API user it isn't clear to me what you mean by "empty", so I'm trying to understand what it means so we can decide on a follow-up.)
smarterclayton merge to kubernetes/kubernetes
CSI spec 1.5 enhanced the spec to add an optional secrets field to NodeExpandVolumeRequest. This commit adds NodeExpandSecret to the CSI PV source and also derives the expansion secret in csiclient to send it out as part of the node expand request.
/kind feature
Optionally add one or more of the following kinds if applicable: /kind api-change
Fixes #95367
This release adds support for NodeExpandSecret for the CSI driver client, which enables CSI drivers to make use of this secret while performing a node expansion operation based on the user request. Previously no secret was provided as part of the node expansion call, so CSI drivers could not make use of one while expanding the volume on the node side.
KEP reference: https://github.com/kubernetes/enhancements/pull/3173/
smarterclayton wants to merge kubernetes/kubernetes
CSI spec 1.5 enhanced the spec to add an optional secrets field to NodeExpandVolumeRequest. This commit adds NodeExpandSecret to the CSI PV source and also derives the expansion secret in csiclient to send it out as part of the node expand request.
/kind feature
Optionally add one or more of the following kinds if applicable: /kind api-change
Fixes #95367
This release adds support for NodeExpandSecret for the CSI driver client, which enables CSI drivers to make use of this secret while performing a node expansion operation based on the user request. Previously no secret was provided as part of the node expansion call, so CSI drivers could not make use of one while expanding the volume on the node side.
KEP reference: https://github.com/kubernetes/enhancements/pull/3173/
`may be empty` reads weird. What do you mean by it?
dockershim takes 1h30m to successfully kill a pod in node-serial tests
In https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial/1420578748725465088
E2eNode Suite: [sig-node] Restart [Serial] [Slow] [Disruptive] [NodeFeature:ContainerRuntimeRestart] Container Runtime Network should recover from ip leak
is failing because it takes 1h30m to terminate the pod test-e7842f3f-c74c-4285-9b78-3d26c9d53bac. Looking through the logs, we attempt to call syncTerminatingPod 86 times and each time it failed. Grepping for the error returned shows the failure comes from kill pod in dockershim, which was unable to kill a container.
This would probably be a release blocker, but I don't know that it's a regression.
/kind bug /sig node