Currently our API has a status endpoint which consists of:

- A database read.
- Getting the current time.

This is useful for checking that the API is alive: it requires the web server to be correctly configured, the time to be set, and the database to be reachable. Unfortunately the same wasn't true for other services such as our Content Provider (AS) and File Transcoder (HFP). During this time I standardised the responses of the endpoints so that all these projects return the same body on HTTP OK.
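The API's checks can be sketched as a small function that returns an HTTP status and a JSON body. This is a minimal illustration, not our actual implementation; `check_database` and the response fields are hypothetical names.

```python
import json
import time

def check_database():
    # Hypothetical stand-in for a real database read (e.g. a SELECT 1 query)
    return True

def status():
    """Return (http_status, json_body) for the status endpoint."""
    db_ok = check_database()
    body = {
        "database": "ok" if db_ok else "error",
        # Current time in UTC, so a stale clock is visible in the response
        "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    return (200 if db_ok else 503, json.dumps(body))
```

Returning 503 when a check fails means load balancers and monitoring can treat any non-200 response as unhealthy.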
For the Content Provider:

- DB read (not available on an EC2 instance, where it is ignored).
- API request for authorisation.
- HEAD request for a file (also switches behaviour on EC2).
- Get the current time.

For the File Transcoder:

- HEAD request for the original content (different on EC2).
- API request for authorisation.
- Get the current time.

None of the status endpoints use any caching; these are direct hits. Caching can be tested with the Smoke Tests, which run after a successful deploy.
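Since every service now returns the same response shape on HTTP OK, the pattern can be sketched as a generic runner over named checks. This is an illustrative sketch, not the real code; the check names and the lambda stand-ins are hypothetical.

```python
import json
import time

def run_status_checks(checks):
    """Run named check callables; all must pass for HTTP 200."""
    results = {name: check() for name, check in checks.items()}
    ok = all(results.values())
    body = {
        "status": "ok" if ok else "error",
        "checks": {name: ("ok" if passed else "error")
                   for name, passed in results.items()},
        "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    return (200 if ok else 503, json.dumps(body))

# e.g. the File Transcoder's checks, with stand-ins for the real calls
transcoder_checks = {
    "original_content": lambda: True,  # stand-in for the HEAD request
    "authorisation": lambda: True,     # stand-in for the auth API request
}
```

Each service plugs in its own dict of checks (the Content Provider would add a DB read, for instance) but the response body stays identical across projects.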
Although we already had checks to see if an application worked, there was no certainty that the minimum requirements were met. As more services are deployed across more environments and clusters this was getting out of control; now we can simply hit a specific service to see if it works.
After the work was done, the status endpoints were added to Zabbix and the Load Balancers, which will give a more realistic health check of each service's performance.
Around the same time the Web team did related work, adding status checks during deploy by hitting the application's main page and expecting HTTP OK. With some modifications to this deploy script it became possible to see whether any service was successfully deployed and to receive a warning in TeamCity, so there is no need to wait for warnings in the error logs or on the load balancers.
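A deploy-time check like the one above can be sketched as a small script that polls the status URL and reports a failure to TeamCity via its service-message format. This is an assumed shape for the script, not our actual deploy code; the URL and retry count are placeholders.

```python
import urllib.request
import urllib.error

def check_deploy(url, retries=3):
    """Hit the service's status URL after deploy; warn TeamCity on failure."""
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # retry: the service may still be starting up
    # TeamCity picks up service messages printed to stdout and
    # surfaces them as build problems
    print("##teamcity[buildProblem description='status check failed: %s']" % url)
    return False
```

Failing the build here means a broken deploy is flagged immediately instead of surfacing later as errors in the logs or load balancer health checks.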
The more useful information we get, and the earlier we get it, the more our uptime improves and the safer changes become, thanks to very quick feedback.
# Useful to do in the future:
- The different services use different technologies, so it is hard to have exactly the same behaviour and options.
- Status should be a resource; that way we could expose sub-groups like /DB and /API to check the individual connections that roll up into the main status.
- Error messages and codes need to be more explicit.
- Make every single application and service expose the same status endpoint.

The next month I did some work to try out ways to use this information, which will be in a future post.
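The "status as a resource" idea could look like a small dispatcher where `/status` runs every check and sub-paths run just one. This is a hypothetical sketch of the proposal; the paths, check names, and lambda stand-ins are all assumptions.

```python
import json

# Hypothetical sub-checks grouped under the main status resource
CHECKS = {
    "db": lambda: True,   # stand-in for the database connection check
    "api": lambda: True,  # stand-in for the authorisation API check
}

def status_resource(path):
    """Dispatch /status and sub-resources like /status/db or /status/api."""
    if path == "/status":
        # Main resource: run every sub-check
        results = {name: check() for name, check in CHECKS.items()}
    elif path.startswith("/status/") and path[len("/status/"):] in CHECKS:
        # Sub-resource: run a single named check
        name = path[len("/status/"):]
        results = {name: CHECKS[name]()}
    else:
        return 404, json.dumps({"error": "unknown check"})
    ok = all(results.values())
    body = {name: ("ok" if passed else "error")
            for name, passed in results.items()}
    return (200 if ok else 503, json.dumps(body))
```

Grouping the sub-checks this way would also make the more explicit error codes easier: a failing `/status/db` pinpoints the broken dependency without reading logs.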