This story is about our recent experience migrating WSO2 ESB 4.8.0 to WSO2 Enterprise Integrator 6.5.0. Together with our customer, we’ve developed an enterprise-grade services platform that connects and integrates a significant number of their backend applications. The platform fulfills business integration processes in a secure, timely and complete manner. We had been successfully running the production platform on a single-node installation for several years, but occasional peak loads sometimes made it challenging to keep the platform up and running without interruptions.
As the platform grew, it was acknowledged that an upgrade was necessary at both the application and infrastructure level. A project was therefore initiated to upgrade from a single-node WSO2 ESB 4.8.0 with WSO2 Data Services Server (DSS) 3.2.2 to a highly available, failover-capable infrastructure based on WSO2 Enterprise Integrator 6.5. Upscaling would bring several benefits, such as performing maintenance on the platform without outage and handling peak loads more resiliently, because we could now configure auto-healing capabilities on the platform.
The end result was a multi-node setup on RHEL 7, fronted by a Linux-HA load balancer and reverse proxy, with WSO2 Enterprise Integrator 6.5, Apache ActiveMQ and a database as backend, all in a highly available configuration.
The services platform consisted of:
- Some custom OSGi modules.
- 3 custom Axis2 services.
- 61 CARs containing 119 services.
- 2 supporting Spring-boot services.
To make it all work we have:
- Connectivity with ActiveMQ providing guaranteed message delivery.
- All database data sources connecting to MS SQL. (Customer’s choice of database product)
- Custom profiles for HTTPS outbound traffic.
- VFS adapters configured.
- Connectivity with 20 backend applications.
As you can see, quite a big setup. Initially, we had a reasonable startup time of 240 seconds for WSO2 Enterprise Integrator to get the platform up and running. But after a while the startup time suddenly jumped to more than 1,200 seconds! It also took an awfully long time to browse the registry with the Carbon console: each click took about 10 to 20 seconds to get to the next step in the registry tree.
This blog tells how we eventually found and solved the slow startup issue and got back to the original 240 seconds.
Observing
First of all, we rechecked WSO2’s performance tuning guidelines to ensure everything was respected (see the sketch after this paragraph). All settings were applied, so we could eliminate this item. We also comfortably met WSO2’s recommendations regarding machine resources, and on all nodes in the cluster the CPU load and memory usage was normal. Secondly, we noticed faster startup times at various irregular moments, for example during testing at night or in the late afternoon, when the business was not pressuring the platform that much.
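For reference, the OS-level part of those guidelines boils down to a few simple checks. A sketch of typical Linux settings (the exact values should come from WSO2’s official tuning guide):

# Check the open-file limit for the user running WSO2 (the guide recommends 65535)
ulimit -n

# Typical TCP tuning entries in /etc/sysctl.conf
net.ipv4.tcp_fin_timeout = 30
net.ipv4.ip_local_port_range = 1024 65535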
We started observing JVM and WSO2 resources through JMX. We have Jolokia deployed on our servers to monitor their internal state, so it was easy to observe the MBeans for the database data sources. The MBeans showed differences between slow and fast startups: during a slow startup, the registry data source needed over 56,000 borrowed connections, whereas during a normal startup it was no more than 76. This gave us the hint to look into tuning the data source configuration, so that enough connections would be available in the database connection pool to support all startup and deployment routines.
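Since Jolokia exposes JMX over HTTP, a reading like the one above can be pulled with a simple request. A minimal sketch, assuming Jolokia listens on port 8080; the exact MBean object name is an assumption, so look it up in your own JMX tree first:

# Read the borrowed/active connection counters of the registry data source pool
curl -s -X POST http://localhost:8080/jolokia -d '{"type":"read","mbean":"org.apache.tomcat.jdbc.pool.jmx:name=jdbc/WSO2RegistryDB,type=ConnectionPool","attribute":["BorrowedCount","Active"]}'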
After tuning the data source (see the sketch below) and restarting, we had a great startup time of 210 seconds. With this we thought the issue was solved. Until the slowness returned after half a day.
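For illustration, the pool settings we touched live in the <EI_HOME>/conf/datasources/master-datasources.xml file. A trimmed sketch of a tuned registry data source (names and values are examples, not a recommendation):

<datasource>
    <name>WSO2_REGISTRY_DB</name>
    <jndiConfig><name>jdbc/WSO2RegistryDB</name></jndiConfig>
    <definition type="RDBMS">
        <configuration>
            <url>jdbc:sqlserver://hostname:1433;databaseName=REG_DB</url>
            <maxActive>150</maxActive>
            <maxIdle>30</maxIdle>
            <maxWait>60000</maxWait>
            <testOnBorrow>true</testOnBorrow>
            <validationQuery>SELECT 1</validationQuery>
            <validationInterval>30000</validationInterval>
        </configuration>
    </definition>
</datasource>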
The search continues
During a slow start, log lines like these always appeared at the beginning:
[EI] 2020-02-12 03:27:03,971 WARN [Timer-4] {org.wso2.carbon.application.deployer.internal.AppDeployerServiceComponent} - Waiting for required OSGi services: org.wso2.carbon.application.deployer.synapse.service.SynapseAppDeployerService,org.wso2.carbon.mediation.initializer.services.SynapseEnvironmentService,
[EI] 2020-02-12 03:27:04,368 WARN [Timer-5] {org.wso2.carbon.core.internal.StartupFinalizerServiceComponent} - Waiting for required OSGi services: org.wso2.carbon.application.deployer.service.CappDeploymentService,org.wso2.carbon.server.admin.common.IServerAdmin,org.wso2.carbon.throttling.agent.ThrottlingAgent,
[EI] 2020-02-12 03:28:03,970 WARN [Timer-4] {org.wso2.carbon.application.deployer.internal.AppDeployerServiceComponent} - Waiting for required OSGi services: org.wso2.carbon.application.deployer.synapse.service.SynapseAppDeployerService,org.wso2.carbon.mediation.initializer.services.SynapseEnvironmentService,
[EI] 2020-02-12 03:28:04,368 WARN [Timer-5] {org.wso2.carbon.core.internal.StartupFinalizerServiceComponent} - Waiting for required OSGi services: org.wso2.carbon.application.deployer.service.CappDeploymentService,org.wso2.carbon.server.admin.common.IServerAdmin,org.wso2.carbon.throttling.agent.ThrottlingAgent,
[EI] 2020-02-12 03:28:23,650 INFO [localhost-startStop-1] {org.wso2.carbon.mediation.initializer.ServiceBusInitializer} - Starting ESB...
When an OSGi module is waiting to be deployed, it is because it is waiting for other OSGi modules it depends on to be deployed and available. Even after tuning the registry data source, we still had no idea what exactly the problem was.
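If you want to see which bundles or service components are actually blocking, WSO2 Carbon can be started with the Equinox OSGi console. We mention this as a diagnostic aid; the exact commands available depend on the Carbon version:

sh <EI_HOME>/bin/integrator.sh -DosgiConsole

osgi> ss            # short status of all bundles
osgi> ls            # list declarative-service components and their state
osgi> diag <id>     # explain why bundle <id> is not resolved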
We executed a couple of scenarios to see where it started to slow down:
WSO2 Enterprise Integrator Setup | Result
Empty installation | Super-fast startup, below 30 seconds
With configuration, but leaving out the registry.xml part | Still a fast startup
With configuration + registry.xml part | Slow startup with OSGi WARN messages
Full setup (configuration and services), but leaving the registry.xml part out | Fast startup, around 60 seconds!
Full setup (configuration and services), registry.xml included | Slow startup
As we have a clustered setup, we had originally altered the registry.xml to reflect the cluster configuration. For the test we disconnected the registry from the database and went back to the embedded H2 database that is configured by default in WSO2 products.
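To make it concrete, the “registry.xml part” in the table above refers to a registry mount along these lines (a trimmed sketch; ids and target paths are illustrative):

<dbConfig name="mounted_registry">
    <dataSource>jdbc/WSO2RegistryDB</dataSource>
</dbConfig>
<remoteInstance url="https://localhost:9443/registry">
    <id>instanceid</id>
    <dbConfig>mounted_registry</dbConfig>
    <readOnly>false</readOnly>
    <enableCache>true</enableCache>
    <registryRoot>/</registryRoot>
</remoteInstance>
<mount path="/_system/config" overwrite="true">
    <instanceId>instanceid</instanceId>
    <targetPath>/_system/eiconfig</targetPath>
</mount>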
The conclusion was that whenever the registry data source pointed to the MS SQL database (running on a Windows 2012 server), the startup was slow. This directed our investigation towards the database itself rather than the WSO2 configuration.
Narrowing the problem area
To see what happens on WSO2’s registry data source, we set up spy logs with a tool called log4jdbc DriverSpy. It is a JDBC driver wrapper that sits between the data source and the actual database driver and can publish some very useful application logging regarding query performance.
These are the steps that we took for setting up DriverSpy:
1. Download the log4jdbc driver and add it to the <EI_HOME>/lib folder.
2. Add the following modifications to the <EI_HOME>/conf/datasources/master-datasources.xml file.
2a. Change the data source URLs as below, by adding the ‘log4jdbc’ prefix:
<url>jdbc:log4jdbc:sqlserver://hostname:1433;databaseName=REG_DB;applicationIntent=ReadWrite</url>
In our case the above is based on a connection to MS SQL, but the same procedure works for any kind of database connection.
2b. Modify the ‘driverClassName’ parameter value as below
<driverClassName>net.sf.log4jdbc.DriverSpy</driverClassName>
2c. Add the following properties to the <EI_HOME>/conf/log4j.properties file.
log4j.logger.jdbc.sqlonly=INFO
log4j.logger.jdbc.sqltiming=INFO
log4j.logger.jdbc.audit=OFF
log4j.logger.jdbc.resultset=ERROR
log4j.logger.jdbc.connection=ERROR
3. Restart the server
During the restart, we could see how much time it took to complete database statements. The output prints logs like the one below, where we can see that the query was executed in 1 millisecond.
>[2020-02-14 12:43:12,437] [EI-Core] INFO - sqltiming SELECT REG_MEDIA_TYPE, REG_CREATOR, REG_CREATED_TIME, REG_LAST_UPDATOR, REG_LAST_UPDATED_TIME, REG_VERSION, REG_DESCRIPTION, REG_CONTENT_ID, REG_UUID FROM REG_RESOURCE WHERE REG_PATH_ID=45 AND REG_NAME = 'edit.gif' AND REG_TENANT_ID=-1234 {executed in 1 msec}
The blog posts “How to Enable JDBC Spy Logs for WSO2 Servers” and “How to use log4jdbc with WSO2 products to log JDBC calls” give a deeper explanation of the above.
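To total these timings over a whole startup, a one-liner over the log file is enough. A minimal sketch, assuming the spy logs end up in <EI_HOME>/repository/logs/wso2carbon.log:

# Sum all '{executed in N msec}' entries and report query count and total time
grep -o 'executed in [0-9]* msec' repository/logs/wso2carbon.log \
  | awk '{ total += $3; count++ } END { printf "%d queries, %.3f s total\n", count, total/1000 }'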
After counting all the query calls during startup, we had the following results:
Normal Start:
Time for DB calls: 14.490s
Time for car app deployments: 2m 44s
Total startup time: 353s (5m 53s)
Slow Start:
Time for DB calls: 168.696s (2m 49s)
Time for car app deployments: 15m 36s
Total startup time: 1244s (20m 44s)
According to this analysis, during a slow start a lot of time is consumed by DB calls and CAR deployments, with the CAR deployments taking the largest portion of the total startup time.
It was now clear the database was the problem, which also matched the irregular behavior at times when no one was working in the office. On a Saturday night, we reached out to the DBA team to check what was going on with the database.
Focusing on Windows
During our investigation we had already contacted the DBA to look for indicators in the database, but he had only looked at queries that were queued longer than normal, and that was not the case. After our latest finding, we were convinced that the database server was the culprit and asked the DBA team and Operations team to continue the investigation on their side. The Operations team discovered that on the database server (a Windows 2012 server with 2 vCPUs) one CPU was fully used all the time, and the other one mostly as well, averaging over 80% and peaking at 100%!
One process was responsible for this overhead: “svchost.exe” with the parameter -k netsvcs, constantly using 50% CPU. This is a host process used by many Windows services. Another process that regularly used 50% CPU alongside svchost.exe was tiworker.exe, the worker process that performs Windows OS updates. It was not possible to stop this service (or the process was hanging); we presume that when stopped, it automatically restarts after a while. If you search for “tiworker.exe” on the Internet, you’ll find a lot of common issues around Windows 2012 performance.
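To find out which services hide inside a busy svchost.exe instance, the built-in Windows tooling is usually enough. A sketch of the commands we would run on the database server:

REM List every svchost.exe instance together with the services it hosts
tasklist /svc /fi "IMAGENAME eq svchost.exe"

REM Sample per-process CPU usage to match the busy PID to its service group
typeperf "\Process(svchost*)\% Processor Time" -sc 5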
The Operations team decided to clear the Windows Update cache and restart the update process, which caused it to finish properly, and the database regained access to the CPU resources.
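Clearing the cache roughly follows the well-known procedure below (a sketch; on a production database server you would coordinate this with the Windows administrators):

REM Stop the Windows Update services
net stop wuauserv
net stop bits

REM Rename the update cache so it is rebuilt on the next run
ren C:\Windows\SoftwareDistribution SoftwareDistribution.old

REM Start the services again and let the update process finish
net start bits
net start wuauserv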
After monitoring the performance while tiworker.exe was not running, we saw a substantial improvement in the startup time: 276 seconds, down from a situation where the startup took over 1,300 seconds.
Summary
At our customer, which runs a big services platform, we experienced a sudden drop in startup performance of our newly installed and configured WSO2 Enterprise Integrator 6 installation. One moment it was starting up in an acceptable time frame, the next it slowed down by more than 500%. We investigated various parts of the WSO2 installation, eventually directing our efforts to the database server and, more specifically, the Windows Update services.
In this blog we’ve described how we approached this problem and which tools and steps we used to peel off the layers of the problem, eventually ending up with an acceptable WSO2 startup once again.