Step by Step Trouble Shooting

Performance SOP

This article provides general steps for finding the bottleneck in a slow ZK application. It also offers conclusions and tips on what to do next once the problem area has been identified.

Identify the Bottleneck

Usually the bottleneck can be found in one of these areas:

  • client
  • network
  • server

To break down a slow web application, it is a good idea to start where the bad performance is perceived: in the browser. Most modern browsers provide sophisticated tools that help you locate a bottleneck, draw conclusions, and rule out other possible causes easily.

chrome network tab
Developer tools - Net(work)
Chrome -> [F12] / [CTRL + SHIFT + I]
Firefox -> [CTRL + SHIFT + Q]
Firefox with Firebug -> [F12]
IE9+ -> [F12]
IE8 & others -> fiddler2

Investigating the network traffic with the questions below should reveal the biggest problem area(s) within a few minutes:

1. Are there one or more long running requests?

NO → #Client Side Issue

YES

2. Is it a static resource?

YES (js, css, images...) → check #ZK Server Configuration (debug / cache / compression) → STILL SLOW → #Network Issue

NO (dynamic request into ZK application)

  • *.zul = full page request (can be followed by ajax requests)
  • zkau/* = ajax request

Example showing a long waiting time, i.e. the server takes 3.83 seconds to create the response

3. Which PHASE of the request is slowest? (wording based on Chrome developer tools)

CONNECTING (or one of Proxy, DNS Lookup, Blocking, SSL)

3.a) Is this a network problem (everything between browser and ZK Application)?
  • test ping / trace route to different servers
  • test dns lookup timing
YES → #Network Issue
NO → #Server Side Issue (application takes long time to accept connection, or even times out)


SENDING

3.b) Is the request unreasonably big? (rare case, usually due to an upload (reasonable), or form posting a lot of data)
YES → #Client Side Issue
NO
3.c) Is the bandwidth low?
  • e.g. try uploading the same amount of data to the server via ftp/scp to check the possible upload speed
YES → #Network Issue
NO → #Server Side Issue (application server receiving request data slowly)


WAITING → #Server Side Issue (application server taking long time to prepare response)


RECEIVING

3.d) Is the response unreasonably big?
YES → #ZK Server Configuration (render on demand / compression)
NO
3.e) Is the bandwidth low?
  • e.g. try downloading the same amount of data from the server via ftp/scp to check the download speed
YES → ask your administrator to fix it ;)
NO → #Server Side Issue (appserver sending response data slowly)
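
The phases above can also be approximated outside the browser. The following is a minimal sketch, assuming a plain GET request is representative of the slow request; the class name `RequestPhaseTimer` and its fields are made up for illustration, and the split into phases is coarser than what the browser tools report (SENDING is folded into WAITING).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class RequestPhaseTimer {

    /** Elapsed milliseconds per phase plus the number of body bytes read. */
    static final class Timings {
        final long connectMs;
        final long waitingMs;
        final long receivingMs;
        final long bodyBytes;

        Timings(long connectMs, long waitingMs, long receivingMs, long bodyBytes) {
            this.connectMs = connectMs;
            this.waitingMs = waitingMs;
            this.receivingMs = receivingMs;
            this.bodyBytes = bodyBytes;
        }

        @Override
        public String toString() {
            return "connect=" + connectMs + "ms, waiting=" + waitingMs
                    + "ms, receiving=" + receivingMs + "ms, bytes=" + bodyBytes;
        }
    }

    /** Issues one GET request and measures rough phase timings. */
    public static Timings measure(String url) {
        try {
            HttpURLConnection con =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            long t0 = System.nanoTime();
            con.connect();                       // CONNECTING: name lookup and TCP connect
            long t1 = System.nanoTime();
            int status = con.getResponseCode();  // sends the request, blocks until the status line arrives
            long t2 = System.nanoTime();
            long bytes = 0;
            InputStream in = status < 400 ? con.getInputStream() : con.getErrorStream();
            if (in != null) {
                try (InputStream body = in) {    // RECEIVING: drain the response body
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = body.read(buf)) != -1) {
                        bytes += n;
                    }
                }
            }
            long t3 = System.nanoTime();
            con.disconnect();
            return new Timings((t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000,
                    (t3 - t2) / 1_000_000, bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass the URL of the slow page as the first argument.
        System.out.println(measure(args[0]));
    }
}
```

A large `waiting` value with small `connect` and `receiving` values points at the server; a large `receiving` value together with a large byte count points at the response size or the bandwidth.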

Client Side Issue

If no significant time is spent on the network or server side, the slowdown must happen somewhere on the client side.

Client side performance is affected by many factors and may vary with browser type and version. Other factors are:

  • operating system
  • available memory
  • CPU speed / load
  • screen resolution
  • graphics card speed

So it is good to compare the client performance on different computers with different browsers, to identify the configuration causing the issue.

If client performance among configurations / browsers is equally bad, the issue will more likely be found in the Rendering area → #Client Side Profiling

Once you have established that client side rendering takes very long, check the size of the response: if the client engine needs to render a lot (e.g. a Grid with 1000 lines), it will take its time. Compare the timing with a smaller response, and consider whether the problem can be avoided by reducing the data sent to the client using Render on Demand or Pagination (most users don't need 1000 lines visible at once).
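
To compare timings with a smaller response, it can help to hand the client only one page of the data at a time. The sketch below is plain Java and not ZK's own paging or Render on Demand mechanism (those are configured on the components themselves); the class `PageSlicer` is a hypothetical helper that only illustrates the idea.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class PageSlicer {

    /**
     * Returns the items of the zero-based page `pageIndex`,
     * or an empty list if the page lies past the end of the data.
     */
    public static <T> List<T> page(List<T> all, int pageIndex, int pageSize) {
        if (pageIndex < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0 and pageSize > 0");
        }
        long from = (long) pageIndex * pageSize; // long avoids int overflow for large indexes
        if (from >= all.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(from + pageSize, all.size());
        // Copy the slice so the result does not keep a view on the full list.
        return new ArrayList<>(all.subList((int) from, to));
    }
}
```

With 1000 rows and a page size of 20, each response would then carry 20 rows instead of 1000.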


Performance degrading over time when using the application (while network and server timings remain constant) might indicate a client side memory leak → #Client Memory Issue

Client Side Profiling

Make sure your local computer is not under heavy CPU load and has "enough" memory available before starting to profile the JavaScript execution in the browser.

To measure and break down the time spent in the JS engine, you can try the following steps (in Chrome) and interpret the results:

  1. switch to the "profiles"-tab
  2. choose "Collect Javascript CPU Profile"
  3. click "start"
  4. perform your action e.g. reload the screen, or load the search results
  5. click "stop"
  6. switch to "Flame Chart" (choice at the bottom)

You'll get a nice view like this:

Js profile flame chart.png

This brilliant visualization of the JS execution flow and stack depth can be used / interpreted in many ways to extract the information you require.

Another interesting view to determine the render time is the Timeline - Events view in Chrome.

The timeline at the top shows the whole period between "start" and "stop"; I selected the range we are interested in. The colorful area at the bottom shows which methods are actually called and their timing (you can zoom in and out using the mouse wheel). Clicking on a method leads directly to the associated line in the source code (enabling debug-js helps when using this feature).

The small peak (at 1300ms) on the left side is my actual event refreshing the page. The gap until 1500ms represents the idle time of the JS engine waiting for the first response from the server. After that you can follow the order in which the JS files are executed and see which of them consume most of the time.

Additional waiting times in the middle indicate the load time of additional JS files and garbage collection times. At about 1840ms the JS engine stops, meaning ZK has finished rendering the page widgets and updating the DOM elements.

We can conclude that our page took about 540ms (after the initial event) to load and render before becoming available to the user, and that there is no significant slowdown on either the JS or the network side.

General conclusions:

  • wider mountains → more JS time (e.g. ZK render time, a third party library)
  • more valleys (flat lines) → more waiting time (mostly network, maybe also CSS formatting or other time the browser does not assign to the JS engine)

A similar but less detailed and colorful view is available in Firefox [Shift + F5], showing a basic timeline. In IE you can also get profiling data.

I find the flame chart in Chrome the most powerful. Even if the performance issue only occurs in a different browser, it is a good starting point to visualize the timing of the problem and to understand the complexity of the execution path (e.g. in IE you learn that most time is spent in method x(), and in Chrome you can see that method in a bigger context). When using an older version of IE, it also helps you pick the methods of interest in which to put log statements to trace performance manually.

Client Memory Issue

Use a system management tool (e.g. on windows: task manager or process explorer) to watch the memory consumption of your browser over time.

If memory keeps increasing while you repeat an action on a page, successively remove/add items from your ZUL file to identify the component causing the issue.

Chrome again offers a memory timeline view, real time memory profiling functions and JS heap dumps.

Server Side Issue

After identifying the server side as the bottleneck, we can drill down into the problem even further.

Many things on the server side can cause a slowdown in the response. Of course you'll first check that there is enough physical memory available and that no other (unrelated) process on the server causes CPU load while the ZK application itself is idle.

Basically ensure that the application server has all the resources it needs and is configured to use them:

e.g.

  • CPUs (multiple cores)
    the server process might only have limited access to the available CPU cores
  • Memory
    you might have a lot of physical memory, but the JVM is configured to use only a small amount
  • Incoming connections
    the application server might be configured to handle only a small number of simultaneous requests even if it could handle more
    → additional requests exceeding these limits are queued or denied
  • DB connection pools
    there might be a very fast DB waiting for your input, but your connection pools are too small
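
The first two items in the list can be checked from inside the JVM itself. This is a small sketch using only the standard `Runtime` API; the class name `JvmResourceReport` is made up, and the values only reflect what the JVM it runs in can see, so it would have to run inside the application server process (for example from a temporary debug page) to be meaningful.

```java
final class JvmResourceReport {

    /** Number of CPU cores the JVM is allowed to use. */
    public static int availableCores() {
        return Runtime.getRuntime().availableProcessors();
    }

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static String describe() {
        return "cores visible to the JVM: " + availableCores()
                + ", max heap: " + maxHeapMb() + " MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the reported numbers are far below what the machine offers, the JVM or container settings are worth a look before any deeper analysis.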


Then perform the "slow" operation and observe CPU load, IO times and Memory.

Busy or Waiting?

If the CPU is almost idle but most time is spent in IO operations, the cause could be a slow or very busy hard disk, or long running network operations (such as accessing external web services or querying a DB on a different host). The network or DB doesn't necessarily have to be slow; it could just be a large number of very quick calls to external resources.

E.g. 300 requests of 10ms each (which I would not consider slow) still take 3 seconds. During these 3 seconds your Java process could be idly waiting for the responses, and you'll see a low CPU figure compared to a high value in IO wait times.

You'll most likely know which external resources you are accessing in your code, so commenting these out and temporarily replacing them with mock implementations will help you exclude them. If you have no idea which resources are involved, continue with Server Side Profiling to identify the places that take most time when creating the response.
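
The effect can be reproduced with a few lines of Java. The sketch below is only a simulation: `Thread.sleep` stands in for a blocking remote call, and the class `BusyOrWaiting` is hypothetical. It compares wall clock time with the CPU time of the current thread as reported by the standard `ThreadMXBean`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class BusyOrWaiting {

    /** Wall clock and CPU time in milliseconds; cpuMs is -1 if the JVM cannot measure it. */
    static final class Result {
        final long wallMs;
        final long cpuMs;

        Result(long wallMs, long cpuMs) {
            this.wallMs = wallMs;
            this.cpuMs = cpuMs;
        }
    }

    /** Simulates `calls` sequential remote calls that each block for `latencyMs`. */
    public static Result simulate(int calls, long latencyMs) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = threads.isCurrentThreadCpuTimeSupported();
        long cpuStart = cpuSupported ? threads.getCurrentThreadCpuTime() : 0;
        long wallStart = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            try {
                Thread.sleep(latencyMs); // the thread waits here and burns almost no CPU
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        long cpuMs = cpuSupported
                ? (threads.getCurrentThreadCpuTime() - cpuStart) / 1_000_000
                : -1;
        return new Result(wallMs, cpuMs);
    }

    public static void main(String[] args) {
        Result r = simulate(300, 10);
        System.out.println("wall: " + r.wallMs + " ms, cpu: " + r.cpuMs + " ms");
    }
}
```

The wall time comes out at roughly 3 seconds while the CPU time stays a small fraction of that, which is the same pattern a monitoring tool shows for a waiting process.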
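
Swapping in a mock is easiest when the external resource sits behind an interface. The following sketch assumes such an interface exists or can be introduced; `PriceLookup`, `RemotePriceLookup` and `MockPriceLookup` are invented names, and the "remote" implementation only simulates latency with `Thread.sleep`.

```java
/** Hypothetical interface in front of an external resource (web service, DB, ...). */
interface PriceLookup {
    double priceOf(String itemId);
}

/** Stand-in for the real implementation: each call blocks like a 10ms round trip. */
final class RemotePriceLookup implements PriceLookup {
    @Override
    public double priceOf(String itemId) {
        try {
            Thread.sleep(10); // simulated network latency, not a real call
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 9.99;
    }
}

/** Mock that answers immediately with a canned value. */
final class MockPriceLookup implements PriceLookup {
    @Override
    public double priceOf(String itemId) {
        return 9.99;
    }
}

final class MockSwapDemo {

    /** Runs the same calling code against whichever implementation is passed in. */
    public static long elapsedMillis(PriceLookup lookup, int calls) {
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            lookup.priceOf("item-" + i);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("remote: " + elapsedMillis(new RemotePriceLookup(), 300) + " ms");
        System.out.println("mock:   " + elapsedMillis(new MockPriceLookup(), 300) + " ms");
    }
}
```

If the slow operation becomes fast with the mock in place, the external resource (or the number of calls made to it) is the bottleneck; if it stays slow, the resource can be excluded.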

Here are some helpful tools to determine busy or waiting processes:

Linux:

  • top (in most cases sufficient to distinguish between busy and waiting)

Windows:


Performance Debugging/Logging/Tracing

Feeling Lucky

Sometimes real profiling is overkill, or you just want to make a quick test. A simple way to find a bottleneck is then to launch your server in debug mode without any breakpoint and trigger the slow operation. Right in the middle of the operation, "Suspend" the execution in your IDE (yes, you can suspend manually without defining a breakpoint). You'll have several suspended threads. Examine their call stacks; usually the longest stack is the one you are interested in. Then go down the stack to search for the bottleneck. Chances are high that you'll suspend the execution during the actual call causing the bottleneck: either it takes a long time to execute, or it is called very often.

In Eclipse a very obvious case might look like this:

Suspended process.png
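
The "longest stack" heuristic can also be applied without an IDE. The sketch below uses the standard `Thread.getAllStackTraces()` call to take a snapshot and pick the thread with the deepest stack; `DeepestStackFinder` is a made-up name, and the code has to run inside the JVM under investigation. (The JDK's `jstack` tool prints comparable stack dumps from outside the process.)

```java
import java.util.Map;

final class DeepestStackFinder {

    /** Returns the thread with the most stack frames, or null for an empty snapshot. */
    public static Thread deepest(Map<Thread, StackTraceElement[]> stacks) {
        Thread best = null;
        int bestDepth = -1;
        for (Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet()) {
            if (entry.getValue().length > bestDepth) {
                bestDepth = entry.getValue().length;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Snapshot of all live threads in this JVM at this instant.
        Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
        Thread thread = deepest(stacks);
        if (thread == null) {
            return;
        }
        System.out.println("Deepest stack: " + thread.getName());
        for (StackTraceElement frame : stacks.get(thread)) {
            System.out.println("    at " + frame);
        }
    }
}
```

As with suspending in the debugger, a single snapshot is a matter of luck; taking a few snapshots during the slow operation and looking for frames that keep reappearing gives a more reliable hint.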

Not so lucky

In most cases the bottleneck will not stand out as clearly as in the image above, but the call stack can still be a good place to start from, e.g. by outputting some counters and tracing information to the log.

Adding extra tracing information is not always possible and takes some time. If it cannot be done quickly, real profiling is the better choice. Still, tracing can sometimes be a quick way, and sometimes it is the last option available when profiling is not possible and no debugging port is available.
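
Counters and timing traces of this kind need very little code. The following is a minimal sketch, not a replacement for a logging or metrics library; the class `TraceCounters` and its labels are hypothetical, and the report would normally be written to the application log rather than returned as a string.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

final class TraceCounters {

    private static final Map<String, LongAdder> CALLS = new ConcurrentHashMap<>();
    private static final Map<String, LongAdder> NANOS = new ConcurrentHashMap<>();

    /** Runs `work`, counting the call and adding its duration under `label`. */
    public static <T> T timed(String label, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            CALLS.computeIfAbsent(label, k -> new LongAdder()).increment();
            NANOS.computeIfAbsent(label, k -> new LongAdder()).add(System.nanoTime() - start);
        }
    }

    /** Number of recorded calls for `label` (0 if never recorded). */
    public static long calls(String label) {
        LongAdder adder = CALLS.get(label);
        return adder == null ? 0 : adder.sum();
    }

    /** Accumulated duration for `label` in milliseconds (0 if never recorded). */
    public static long totalMillis(String label) {
        LongAdder adder = NANOS.get(label);
        return adder == null ? 0 : adder.sum() / 1_000_000;
    }

    /** One line per label, suitable for writing to the log after the slow operation. */
    public static String report() {
        StringBuilder sb = new StringBuilder();
        for (String label : CALLS.keySet()) {
            sb.append(label).append(": ").append(calls(label)).append(" calls, ")
              .append(totalMillis(label)).append(" ms total\n");
        }
        return sb.toString();
    }
}
```

Wrapping a suspicious call as `TraceCounters.timed("loadOrders", () -> dao.loadOrders())` (where `dao.loadOrders()` is a placeholder for your own code) shows both how often it is called and how much time it accumulates, which covers the two cases mentioned above: one slow call, or very many quick ones.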

Server Side Profiling

If the JVM is busy during the operation, or you have no idea which parts of your application consume most of the time, profiling is the final weapon.

There are several profiling tools in various price ranges, and there is JVisualVM (included in the JDK), so it should be available everywhere.

Here is a nice tutorial on how to get started with JVisualVM:

https://blogs.oracle.com/nbprofiler/entry/profiling_with_visualvm_part_1
https://blogs.oracle.com/nbprofiler/entry/profiling_with_visualvm_part_2

Memory Issue

ZK Server Configuration

Network Issue

If something in your network infrastructure (routers, proxies, web servers...) is causing the performance issues, there is little you can do as a web application developer.

Here are some ideas for identifying possible bottlenecks by reducing network complexity:

  • using IP addresses directly → a speedup indicates a DNS problem
  • avoiding proxies, routers, firewalls (e.g. access the application server from a browser on a remote desktop "closer" to the actual server)
  • accessing the application server directly, instead of going through a web server or load balancer
  • disabling SSL and checking the difference
→ "Kindly" inform your network administrator about your observations and ask for help identifying, excluding, fixing these infrastructure problems.
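
To put a number on the DNS part before talking to the administrator, the name lookup can be timed on its own. This is a small sketch using the standard `InetAddress` API; the class `DnsLookupTimer` is made up, and `localhost` in `main` is only a placeholder for the host name of your server.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class DnsLookupTimer {

    /**
     * Resolves the host once and returns the elapsed time in milliseconds,
     * or -1 if the name cannot be resolved.
     * The JVM caches successful lookups, so only the first call per host
     * in a fresh JVM reflects the real resolver latency.
     */
    public static long timeLookupMillis(String host) {
        long start = System.nanoTime();
        try {
            InetAddress.getByName(host);
        } catch (UnknownHostException e) {
            return -1;
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Pass the server's host name as the first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println("lookup of " + host + ": " + timeLookupMillis(host) + " ms");
    }
}
```

Passing a literal IP address skips the name resolution, so comparing the result for the host name with the result for its IP address gives a rough idea of how much time the lookup itself costs.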