About

A showcase of interesting debugging sessions and other technical writeups related to software development or security challenges.

From these case studies, we can extract:

Reusable methodologies to apply in similar scenarios;
Disclosed information that can spark new bug reports or patches. Consider how often interface errors are aggravated by insufficient, misleading, or unintended messages.

Inspiration

Computer Security

Data Analysis

Systems Programming

Contraptions Programming

Yak Shaving

Development Challenges

Exposing Kafka brokers inside a k8s cluster via load balancers. After scaling up the number of brokers, an external application could not send batched requests to some brokers.
- The cause was a misconfiguration: public addresses were not set for the load balancers associated with the extra Kafka nodes. This resulted in cluster-internal addresses being exposed, which were unreachable from outside the cluster.
- I found this challenge interesting due to several confounding factors:
  - Log messages only reported timeouts expiring request batches (a batch might not be ready to send for reasons other than a lost network connection);
  - Broker connection issues were ignored (the Kafka client silently removed unreachable nodes, which was only confirmed with a debugger);
  - Downscaling Kafka didn’t roll back to a correct state (the rolling update triggered via Flux reconciliation simply brought down the extra nodes, without restarting any of the remaining nodes, so one of them could still be part of the misconfigured set);
  - Although we had readiness probes for these brokers, they only covered reachability inside the cluster. External health checks would have helped.
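An external health check for this kind of misconfiguration could start by inspecting the addresses that brokers hand out to clients. The following is a minimal sketch, assuming those addresses have already been collected as `host:port` strings keyed by broker id. The function names, the example addresses, and the `.cluster.local` suffix (the Kubernetes default cluster domain) are assumptions for illustration, not part of the original setup.

```python
import ipaddress

# Assumption: cluster-internal DNS names end with the default Kubernetes domain.
INTERNAL_SUFFIX = ".cluster.local"


def is_cluster_internal(address):
    """Return True if a "host:port" address looks unreachable from outside the cluster."""
    host = address.rsplit(":", 1)[0]
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: fall back to a DNS suffix check.
        return host.endswith(INTERNAL_SUFFIX)
    # Private ranges (e.g. 10.0.0.0/8) are not routable from external networks.
    return ip.is_private


def misconfigured_brokers(advertised):
    """Return the ids of brokers whose exposed address is cluster-internal."""
    return sorted(
        broker_id
        for broker_id, address in advertised.items()
        if is_cluster_internal(address)
    )


# Hypothetical addresses: brokers 3 and 4 were added when scaling up.
advertised = {
    0: "kafka-0.example.com:9094",
    1: "kafka-1.example.com:9094",
    3: "10.42.0.17:9094",
    4: "kafka-4.kafka-headless.default.svc.cluster.local:9092",
}
print(misconfigured_brokers(advertised))
```

A check like this only catches addresses that are obviously internal. An actual connection attempt from outside the cluster would still be needed to confirm reachability.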
Sorting paginated search results for a web interface. These results were retrieved from multiple databases, running distinct database engines. It would be highly inefficient to retrieve the full result sets in a single request.
- The solution I developed was to asynchronously perform a ranged SQL query for each database. At the application level, we merged and sorted the result sets of these queries. Whenever some results of a given database were returned to the web interface, we incremented and cached the corresponding range offset, so that requesting the next page would fetch the next ranged result set.
- I found this challenge interesting because it required implementing an algorithm from scratch for a complex use case that the frameworks we were using did not cover.
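The merge step can be sketched as follows. This is not the original implementation: it assumes each database returns rows already sorted by the same key, and it uses in-memory lists and a hypothetical `fetch_range` coroutine in place of real ranged SQL queries. The `offsets` dictionary plays the role of the cached per-database range offsets.

```python
import asyncio
import heapq


async def fetch_range(rows, offset, limit):
    """Stand-in for a ranged SQL query (ORDER BY key, limited to a range)."""
    await asyncio.sleep(0)  # yield control, as a real database call would
    return rows[offset:offset + limit]


async def next_page(sources, offsets, page_size):
    """Return the next globally sorted page and advance the per-source offsets."""
    names = list(sources)
    # Query every source concurrently for its next `page_size` rows.
    chunks = await asyncio.gather(
        *(fetch_range(sources[name], offsets[name], page_size) for name in names)
    )
    # Tag each row with its source so we know whose offset to advance.
    tagged = [[(row, name) for row in chunk] for name, chunk in zip(names, chunks)]
    page = []
    for row, name in heapq.merge(*tagged, key=lambda pair: pair[0]):
        if len(page) == page_size:
            break
        page.append(row)
        offsets[name] += 1  # only rows actually returned consume the range
    return page


# Hypothetical sorted result sets from three databases.
sources = {"db_a": [1, 4, 7, 10], "db_b": [2, 3, 8], "db_c": [5, 6, 9]}
offsets = {name: 0 for name in sources}
first = asyncio.run(next_page(sources, offsets, 4))
second = asyncio.run(next_page(sources, offsets, 4))
```

Fetching `page_size` rows from every source is enough, because the next page can never need more than `page_size` rows from any single source. Rows that were fetched but not returned are simply fetched again on the next request.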
Managing an application’s lifecycle with the service manager systemd. When the process was stopped through our service, some subprocesses did not perform a clean shutdown and had to be started manually afterwards. However, stopping the application manually resulted in all subprocesses shutting down successfully.
- The root cause was found by comparing the system calls made during the two shutdown procedures. The service sent a kill signal to the parent process and to each child, while the manual stop only sent a kill signal to the parent process, which in turn sent each subprocess a network request containing a command to shut down gracefully. After reconfiguring the service to only send a kill signal to the parent process, the issue was solved.
- I found this challenge interesting because it required low-level analysis, since there was no evidence of this behaviour in typical indicators such as application logs.
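The difference between the two shutdown paths can be reproduced in a small experiment. The sketch below is an illustration under assumptions, not the original application: the child is a hypothetical Python process, the shutdown command travels over a stdin pipe instead of a network request, and the signal is SIGTERM. In systemd, signalling only the main process is what `KillMode=process` does. Whether that was the directive used here is an assumption.

```python
import signal
import subprocess
import sys

# Hypothetical child: it exits cleanly only when it receives a "shutdown" command.
CHILD_CODE = """
import sys
for line in sys.stdin:
    if line.strip() == "shutdown":
        print("clean exit", flush=True)
        break
"""


def spawn_child():
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def stop_via_parent(child):
    """Manual-stop path: the parent asks the child to shut down gracefully."""
    output, _ = child.communicate("shutdown\n", timeout=10)
    return child.returncode, output.strip()


def stop_via_signal(child):
    """Service path: the child is signalled directly and never runs its cleanup."""
    child.terminate()  # sends SIGTERM on POSIX
    output, _ = child.communicate(timeout=10)
    return child.returncode, output.strip()


graceful = stop_via_parent(spawn_child())
signalled = stop_via_signal(spawn_child())
print(graceful, signalled)
```

On a POSIX system, the signalled child reports a negative return code (the signal number) and no cleanup output. That trace is visible at the process level but not in application logs.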
Running applications on distinct hosts, although they did not support this scenario. An endpoint of an application (host A) returned an address for another application in the sub-network (host B). This address was consumed by both A and an external application in a VPN network (host C). A sub-network address couldn’t be resolved by C, while a VPN network address couldn’t be resolved by host A.
- The solution I applied was to add a NAT OUTPUT rule in the firewall of the endpoint host, causing locally generated packets destined for a given IP and port in the VPN network range to be sent to a sub-network IP and port instead. This allowed A to communicate with B, while setting an address reachable by C.
- I found this challenge interesting because it required cross-cutting knowledge in networking, and the fix allowed us to continue using our applications in the scenario we needed.
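A rule of this kind can be sketched as below, assuming the firewall is managed with `iptables` (the writeup only says "NAT OUTPUT rule", so the tool is an assumption). The helper name, addresses, and port are hypothetical. The function only builds the command, since applying it requires root privileges on the endpoint host.

```python
import ipaddress


def dnat_output_rule(vpn_ip, vpn_port, subnet_ip, subnet_port, protocol="tcp"):
    """Build (not run) an iptables command that redirects locally generated packets."""
    # Validate early so a typo does not end up in a firewall rule.
    ipaddress.ip_address(vpn_ip)
    ipaddress.ip_address(subnet_ip)
    for port in (vpn_port, subnet_port):
        if not 0 < port < 65536:
            raise ValueError(f"invalid port: {port}")
    return [
        "iptables", "-t", "nat", "-A", "OUTPUT",
        "-p", protocol, "-d", vpn_ip, "--dport", str(vpn_port),
        "-j", "DNAT", "--to-destination", f"{subnet_ip}:{subnet_port}",
    ]


# Hypothetical addresses: 10.8.0.20 is B as seen from the VPN, 192.168.1.20 is B in the sub-network.
rule = dnat_output_rule("10.8.0.20", 8080, "192.168.1.20", 8080)
print(" ".join(rule))
```

The OUTPUT chain of the `nat` table only applies to packets generated on the host itself, which fits the case where A is the one consuming the address it returns.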

More…

Debugging: Methodologies, Case Studies
Reverse Engineering: Methodologies, Case Studies