Friday, May 29, 2009

core file analysis on solaris

A core dump is a detailed snapshot of the process at that point in time.
It has all the information related to OS threads(pstack info), data mapping to memory locations (pmap info), PC (program counter), stack pointer etc.
A core file has be studied by experienced OS admin or the vendor. It is lot easier to run debugger on the system where process crash has occurred.

Before the core dump/core file is sent to vendor, we can check few things and in most cases we should be able to figure out the culprit.

With pstack, we will able to get the LWP that triggered the crash.

# pstack core
core 'core' of 6604: /WebLogic/jdk1.5.0_10/bin/sparcv9/java -Xms2048m -Xmx4096m
----------------- lwp# 28 / thread# 28 --------------------
ffffffff7eea8ebc uadmin (6, 0, fffffffe454fb7f0, 10050, 10000, 11c00) + 4
ffffffff7ee3e698 addsev (f9c8, 10011b5d0, 0, ffffffff7eadd140, 0, ffffffff7ea1a000) + 40
ffffffff7e7b2de8 __1cCosFabort6Fi_v_ (1, fc00, e, ffffffff7ea1a000, 26726c, b400) + 60
ffffffff7e84a348 __1cHVMErrorOreport_and_die6M_v_ (0, ffffffff7eac9424, ffffffff7eac93f0, ffffffff7e9076ec, ffffffff7eabd938, 0) + c68
ffffffff7e36ea90 JVM_handle_solaris_signal (b, 13070, fffffffe454fbf00, 10022fa60, fffffffe454fc1e0, 10800) + b00
ffffffff7f217f6c __sighndlr (b, fffffffe454fc1e0, fffffffe454fbf00, ffffffff7e36df60, 0, 0) + c
ffffffff7f211ab0 call_user_handler (fffffffe52702c00, fffffffe454fc1e0, fffffffe454fbf00, 0, 0, 0) + 25c
ffffffff7f211c74 sigacthandler (b, fffffffe454fc1e0, fffffffe454fbf00, fffffffffffffff8, ffffffffffffffe0, 2) + 68
--- called from signal handler with signal 11 (SIGSEGV) ---
fffffffe43b06204 Java_weblogic_socket_DevPollSocketMuxer_initDevPoll (10022fba8, fffffffe454fc590, 10800, 0, 1bfbfa0, ffff) + 2fc ffffffff75c0d098 ???????? (10022fa60, fffffffe454fc680, fffffffe454fc590, ffffffffffffff10, 11800, ffffffff7ea1a000) ffffffff75c0d038 ???????? (80000000, 0, 0, 0, 8, fffffffe454fbdc1) ffffffff75c05978 ???????? (c000, 6, 0, ffffffff75c17180, 3, fffffffe454fbed1) ffffffff75c0023c ???????? (fffffffe454fc8b8, fffffffe454fcc80, a, fffffffe577b54c8, ffffffff75c0b560, fffffffe454fcca0)
ffffffff7e29b9b8 __1cJJavaCallsLcall_helper6
................
.................
ffffffff7e7b2878 __1cG_start6Fpv_0_ (10022fa60, d800, b000, b378, ffffffff7eabd1a4, ffffffff7ea1a000) + 210
ffffffff7f217c9c _lwp_start (0, 0, 0, 0, 0, 0)

From the above, we can figure out that signal 11 was raised fron initDevpoll function.
Java_weblogic_socket_DevPollSocketMuxer_initDevPoll

We will get more details with a debugger.
I am using a debugger mdb that ships with solaris OS. There are other debuggers like gdb and dbx that can be used.

Debuggers can do a lot more than simple commands. We can debug an active userprocess, live OS kernel and device drivers etc.
Complete documentatio of mdb can be found at
http://docs.sun.com/app/docs/doc/817-2543

I am running this against the same core file as shown above.

# mdb core
mdb: warning: core file is from SunOS 5.9 Generic_118558-36; shared text mappings may not match installed libraries
Loading modules: [ libthread.so.1 libc.so.1 ld.so.1 ]
Now once you are in debugger session, you can run commands (called dcmd)
The "status" command as shown below would print the summary informtion of the process.

> ::status
debugging core file of java (64-bit) from
executable file: /WebLogic/jdk1.5.0_10/bin/sparcv9/java
initial argv:
/WebLogic/jdk1.5.0_10/bin/sparcv9/java -Xms2048m -Xmx4096m .
threading model: multi-threaded
status: process terminated by SIGABRT (Abort)

To get the stack that caused the crash you can use "stack" command:

> ::stack
libc.so.1`uadmin+4(6, 0, fffffffe454fb7f0, 10050, 10000, 11c00)
libc.so.1`addsev+0x40(f9c8, 10011b5d0, 0, ffffffff7eadd140, 0, ffffffff7ea1a000)
libjvm.so`__1cCosFabort6Fi_v_+0x60(1, fc00, e, ffffffff7ea1a000, 26726c, b400)
libjvm.so`__1cHVMErrorOreport_and_die6M_v_+0xc68(0, ffffffff7eac9424,ffffffff7eac93f0, ffffffff7e9076ec, ffffffff7eabd938, 0)
libjvm.so`JVM_handle_solaris_signal+0xb00(b, 13070, fffffffe454fbf00, 10022fa60, fffffffe454fc1e0, 10800)
libthread.so.1`__sighndlr+0xc(b, fffffffe454fc1e0, fffffffe454fbf00,ffffffff7e36df60, 0, 0)
libthread.so.1`call_user_handler+0x25c(fffffffe52702c00, fffffffe454fc1e0,fffffffe454fbf00, 0, 0, 0)
libthread.so.1`sigacthandler+0x68(b, fffffffe454fc1e0, fffffffe454fbf00,fffffffffffffff8, ffffffffffffffe0, 2)
libmuxer.so`Java_weblogic_socket_DevPollSocketMuxer_initDevPoll+0x2fc(10022fba8, fffffffe454fc590, 10800, 0, 1bfbfa0, ffff)
...........
libjvm.so`__1cG_start6Fpv_0_+0x210(10022fa60, d800, b000, b378, ffffffff7eabd1a4, ffffffff7ea1a000)
libthread.so.1`_lwp_start(0, 0, 0, 0, 0, 0)

This would give a little more information than pstack by printing the shared object of the funtion.
In the above case the crash originated from:
libmuxer.so`Java_weblogic_socket_DevPollSocketMuxer_initDevPoll
// libmuxer.so is the shared object where initDevPoll is called from

In this scenario this was an issue from weblogic performance pack libmuxer.so and getting the updated performance pack from BEA support fixed the issue.

Tuesday, May 26, 2009

WebLogic WLST Script Generator

WLS10.x comes with a recording feature whereby WLST scripts can be generated by recording a set of actions through Admin Console.
This is a screen snapshot of recording to generate WLST script.



Once generated, the script can be augmented to suit our needs.
For instance, I created a set of managed servers and added them to a cluster through Admin console.
The generated script (say sample.py) would look like:

startEdit()
cd('/')
cmo.createServer('managed1')
cd('/Servers/managed1')
cmo.setListenAddress('')
cmo.setListenPort(7003)
cd('/')
cmo.createCluster('testcluster')
cd('/Clusters/testcluster')
cmo.setClusterMessagingMode('unicast')
cmo.setClusterBroadcastChannel('')
cd('/Servers/managed1')
cmo.setCluster(getMBean('/Clusters/testcluster'))
activate()
startEdit()
cd('/')
cmo.createServer('managed2')
cd('/Servers/managed2')
cmo.setListenAddress('')
cmo.setListenPort(7005)
cmo.setCluster(getMBean('/Clusters/testcluster'))
activate()
startEdit()

We can augment the script by adding following actions:
1) Connect to Admin Server
2) Start a change session
3) < Add the generated WLST code >
4) Disconnect from Admin Server
5) Exit WLST

This would look like:

# connect to Admin Server(username being "system" and password "weblogic" and url localhost:7001)
connect(system,weblogic,t3://localhost:7001)
edit()
< Add the generated WLST code >
disconnect()
exit()

Now the script can be called either from shell script or ant.
Here is how you call from shell script.
I learnt this from David's blog.

#!/bin/ksh
/common/bin/.wlst.sh /mywlst/sample.py "$@"

Running wlst script through ant is documented at http://e-docs.bea.com/wls/docs100/config_scripting/using_WLST.html#wp1093337

Another posting I found useful was to automate tasks like creating domain by reading a CSV file in WLST.
Check out Bala's blog for this and other WLST related features.

Tuesday, May 19, 2009

WebLogic JDBC with Oracle RAC

WebLogic can connect to oracle RAC either through weblogic multi pools or through Connect-Time failover (underlying JDBC driver will be responsible for failover)

In case of Multi Pool/Multi Datasource, each pool/datasource is assigned to one RAC node and weblogic JDBC framework would make sure of loadbalancing/failover. For WLS8.1, this can be done by configuring multi pools and from WLS9.x through Multi datasource.

In case of Connect-Time failover, underlying JDBC driver will perform the failover under failure conditions.

Transaction handling would different based on whether the transaction is local or Global( XA transaction with Prepare/commit phases).

As for global transaction, weblogic does not provide transparent failover to database connections.
If RAC node fails before COMMIT phase, application would receive an exception and this has to be handled in the application by getting a new connection and attempting to process the transaction.

There is good article that provides step by step instructions on setting on weblogic multi datasource with oracle RAC
http://www.oracle.com/technology/products/weblogic/howto/rac/index.html

Thursday, May 14, 2009

Types of java.lang.OutOfMemoryError

These are the list of OutOfMemoryErrors a typical java application can see:

java.lang.OutOfMemoryError: Perm Space
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: unable to create new native thread
java.lang.OutOfMemoryError: requested xxxx bytes for Chunk::new. Out of swap space

java.lang.OutOfMemoryError: Perm Space
This indicates JVM ran out of PermSpace.
permspace is specific to jvm implementations.
For Eg, SUN JVM has one but Jrockit doesn't have an allocated permspace, i.e its part of heap.
To understand what information is stored in permspace, an example would be:
classA obj = new classA()
obj is an object i.e instance of classA.

Now JVM maintains the structure for "obj" on heap where as stucture for classA will be on permspace.
This is the reason applications that has lots of classes(libraries, JSPs) tend to occupy more permspace.
Also, we would see this space used when classloading (dynamic) is done.
Typical example would be application deployments on appservers like weblogic.
There is other kinds of information stored on permspace like information for JVM optimizations etc.

java.lang.OutOfMemoryError: Java heap space
This is a typical java out of memory from application.
This message simply says, for this particular request the response from JVM was "OutOfMemoryError".
Typically we see this message if JVM heap is full and FullGC is not able to reclaim any objects.
This can be an expected memory shortage or a memory leak.
Typically out of memory from memory leaks occur over a period of time, where as out of memory from load would occur under peak usage and JVM would recover (continue servicing requests even if some request fails with OutOfMemory error.
Obviously for load based OOM, increasing heap would help.
For memory leak scenario, we would have to use profiler tools to get to root cause. JDK6 provides a utility jmap that would serve the basic needs. We can have JVM generate a heap dump on OOM error. The heap dumps can be read by tools like jhat.
There are lot of commercial tools available like YourKit, Jprofiler etc to analyze memory leaks.

java.lang.OutOfMemoryError: unable to create new native thread
This would mean, the java process size has reached its limit.
When an application creates a java thread object, correspondingly an OS thread has to be created and to create this thread OS tries to reserve space for its stack (thread stack to store local variables) within the process address space.
In most cases, we will see this error if the application spawns too many threads.

java.lang.OutOfMemoryError: requested xxx bytes for Chunk::new. Out of swap space
This would mean the process could not allocate memory as the virtual memory was full. This can happen if there are too many process running on the machine or there is some other process colocated on same machine that consuming most of the process memory.
If neither is the case, the the current process might be having a native memory leak. This can some from native code within the application and rarely can be from JVM itself which has JNI code.

Tuesday, May 12, 2009

Dealing with java.net.BindException: Address already in use

Depending on the software, we see these errors. The underlying reason is the same.
We see the error if a process tries to listen on address+port thats already in use.

java.net.BindException: Address already in use

Address already in use: make_sock: could not bind to address

Unable to bind socket - Address already in use

Under ideal circumstances, after socket connection is dropped (both ends acknowledge FIN), underlying application would call close() which inturn is handled by OS kernel. Unless the OS kernel releases the concerned network resource, the port cannot be used by another process.

TO fix the issue, get the process ID thats already listening and kill it.

On Windows we can use netstat:
Here I am looking for my weblogic process running on port 7001.
netstat would give the process ID (2056 in this case).

C:\>netstat -o -a -n | findstr :7001
TCP   0.0.0.0:7001   0.0.0.0:0   LISTENING   2056
Go to windows process list and kill the process.

On Unix we can use lsof (need lsof package installed)
Here I am looking for my weblogic server listening on port 7002.
18664 is the PID.

# lsof -i | grep 7002
java   18664   root   47u   IPv4   0x30079bee4e0   0t0   TCP   unixmachine:7002   (LISTEN)

We can get the complete process listing as:
# /usr/ucb/ps auwwx | grep 18664

Kill the process as:
# kill -9 18664

Monday, May 11, 2009

Horizontal scaling vs vertical scaling of JVMs.

What is the best approach for scaling a JVM ?

Horizontal scaling : Stacking more memory (java heap) to existing JVM.
Vertical scaling : Adding more JVMs.

One of the fundamental scaling questions we hear is whether to add more heap to existing JVM/s or add more instances with existing heap sizing.

The answer is "it depends". Adding more heap to exisiting JVM/s would be better interms of performance (TPS), but this will hold true only to a certain limit.

Once the JVMs approach 2GB+ heaps, adding more heap will start to degrade
performance because of factors like GC. So, after some point adding more instances would make sense.

Based on simple webapp tests, horizontal scaling is better until 2GB heaps and there after vertical scaling of JVMs would be a better option.

Monday, May 4, 2009

garbage collection in java

Recently one of our developers wanted to know why the application seems to use a lot more memory than the steady state (amount of memory JVM consumes after the startup of application server with application deployed) when there were no users on the system for a while.

For the purpose of discussion, say the JVM has max of 1GB heap and steady state memory consumption is 300MB. The heap usage was around 700MB on an idle system.

Well, this is an expected behavior from the JVM. (SUN JVM)
Garbage collection is a condition based trigger. This would mean garbage collection is reactive and not proactive.
Most probable reason for this is GC is resource intensive operation and we only want this when needed.
Only a request will trigger a Full GC , If the request needs 100MB of memory (as an example) and JVM is not able to find 100MB of contiguous memory, this will trigger a Full GC, else, it will not initiate a GC cycle. (even if there are objects that have no references and are ready to be GCed).
Just because the system/jvm is idle, GC will not be "proactive" to clean up the unused objects.