One minute problem (Clock Drift Issues)

While working on a chatbot project based on Apache OpenNLP, we followed a microservice architecture, as the product was new and being built from scratch. We had multiple independent services that communicated internally using a custom token along with a user ticket for authentication.

The custom token was time-based and valid for only one minute. Validation also required a positive time difference: the server compared the token's timestamp with its own current time and accepted the token only if the difference was positive and no more than one minute. If the client's clock was ahead of the server's, the server calculated a negative difference, so the token was treated as invalid before it had even reached the server.
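To make the rule concrete, here is a minimal sketch of such a check. It is not our production code: the class and method names are hypothetical, and treating an age of exactly zero as invalid is my assumption based on the word "positive".

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical validator: names are illustrative, not taken from the project.
final class TokenAgeValidator {
    // The token is valid for one minute.
    static final Duration MAX_AGE = Duration.ofMinutes(1);

    // issuedAt is stamped by the client's clock, serverNow is read from the server's clock.
    static boolean isFresh(Instant issuedAt, Instant serverNow) {
        Duration age = Duration.between(issuedAt, serverNow);
        // A token stamped in the server's future yields a negative age and is rejected.
        // Assumption: an age of exactly zero is rejected too ("positive" read strictly).
        if (age.isNegative() || age.isZero()) {
            return false;
        }
        // Otherwise the token must be no older than the one-minute window.
        return age.compareTo(MAX_AGE) <= 0;
    }
}
```

With this rule, a token stamped ten seconds ago passes, while a token stamped even one second in the server's future fails, no matter how recently it was created.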

A request would be processed successfully only if both the custom token and the user ticket were valid. If either validation failed, the request would return an error response.

We deployed all microservices on a centralized high-performance server, allowing each service to run independently. During development, instead of running the entire microservice ecosystem locally, I would start only the required services on my machine. For other service dependencies, my application would communicate with the instances running on the centralized server.

A developer was debugging an issue and encountered an authentication error due to an invalid custom token. However, when he enabled the debugger, the same request started working. Interestingly, the same request worked perfectly fine for everyone else, whether or not they used the debugger. This unexpected behavior left him confused and unable to proceed, which led to a deeper investigation. On the positive side, it told us that the issue was specific to his machine.

To confirm this, we tested the scenario again with and without the debugger. The issue consistently did not occur when the debugger was on but reappeared when the debugger was off.

As we analyzed the situation, we made a wild guess: the debugger might be delaying the request because execution paused at multiple breakpoints. To test this theory, we simply added a one-minute sleep in the code and disabled the debugger. Guess what? The issue did not occur even without the debugger. Eureka! We had our first clue: the problem was related to time.

Upon further investigation, we discovered that his machine's clock was one minute ahead of the actual time.

In-Depth Analysis of the Issue:

Since the system time was one minute ahead, the code generated the token with a future timestamp. On the server side, during validation, the time difference was calculated as a negative integer.

According to our custom token validation logic, the time difference must be a positive number and within one minute. Since the negative difference violated this rule, the server rejected the request with an "Invalid Token" error.

However, when using the debugger or adding a sleep delay, the request was delayed long enough for the token to fall within the valid time window. By the time it reached the server, the token was considered just created, allowing successful validation and returning the expected response.
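The effect is easy to reproduce with fixed clocks. The sketch below is only a simulation under assumed values (a made-up server time, a client clock exactly one minute ahead, and hypothetical class and method names); it applies the same "positive and within one minute" rule to a request that arrives quickly and to one that arrives after a delay.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical simulation of the bug; not the project's actual code.
final class ClockSkewDemo {
    static final Duration MAX_AGE = Duration.ofMinutes(1);

    // Validation rule from the post: age must be positive and at most one minute.
    static boolean isFresh(Instant issuedAt, Instant serverNow) {
        Duration age = Duration.between(issuedAt, serverNow);
        return !age.isNegative() && !age.isZero() && age.compareTo(MAX_AGE) <= 0;
    }

    // Is a token from a client whose clock runs `skew` ahead accepted
    // when the request reaches the server after `delay`?
    static boolean acceptedAfterDelay(Duration skew, Duration delay) {
        // Arbitrary fixed server time, so the result does not depend on the real clock.
        Clock server = Clock.fixed(Instant.parse("2024-01-01T12:00:00Z"), ZoneOffset.UTC);
        Clock client = Clock.offset(server, skew);
        Instant issuedAt = client.instant();                      // stamped by the skewed client
        Instant arrival = Clock.offset(server, delay).instant();  // server time on arrival
        return isFresh(issuedAt, arrival);
    }

    public static void main(String[] args) {
        Duration skew = Duration.ofMinutes(1);
        // Debugger off: the request arrives almost immediately, age is about -60s.
        System.out.println(acceptedAfterDelay(skew, Duration.ofMillis(50)));          // false
        // Debugger on or sleep added: the delay outlasts the skew, age is +5s.
        System.out.println(acceptedAfterDelay(skew, Duration.ofSeconds(65)));         // true
        // A machine with a correct clock is accepted right away.
        System.out.println(acceptedAfterDelay(Duration.ZERO, Duration.ofMillis(50))); // true
    }
}
```

Under the same assumptions, a delay of more than two minutes would fail again, because by then the token would be older than the one-minute window.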

We corrected the system time, and the issue was completely resolved. Everything then worked perfectly on his machine, just as it did on the others.

Our curiosity didn’t stop there. We had another question: why don’t data centers face this issue?

As we dug deeper, we realized this was a clock drift issue: a machine's clock gradually gains or loses time unless something corrects it. In large-scale environments like data centers, system clocks are synchronized using the Network Time Protocol (NTP), which keeps every machine closely aligned with a reference time source. This prevents issues like the one we encountered.
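For the curious, the heart of that synchronization is a simple estimate built from four timestamps taken around one request and reply. The sketch below shows only the standard offset formula, ((t1 - t0) + (t2 - t3)) / 2; it is not an NTP client, and the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper showing the clock offset formula used by NTP-style exchanges.
final class ClockOffsetEstimate {
    // t0: client sends the request   (client clock)
    // t1: server receives the request (server clock)
    // t2: server sends the reply      (server clock)
    // t3: client receives the reply   (client clock)
    // Returns how far the client clock must be shifted to match the server:
    // negative when the client is ahead, positive when it is behind.
    static Duration estimateOffset(Instant t0, Instant t1, Instant t2, Instant t3) {
        Duration outbound = Duration.between(t0, t1); // t1 - t0
        Duration inbound = Duration.between(t3, t2);  // t2 - t3
        // Averaging the two legs cancels the network delay when it is symmetric.
        return outbound.plus(inbound).dividedBy(2);
    }
}
```

For a client that is one minute ahead, as in our case, the estimate comes out at minus sixty seconds when the network delay is the same in both directions.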

P.S.: There were hidden sentences between the second and third paragraphs, which could only be seen by selecting all the text. If we had known those details from the start, the issue would have been resolved immediately. However, since we discovered them only at the very end, I chose to reveal them in the same way. 😊

