Wednesday, February 15, 2012

Masking confidential information in log files with logback framework


Recently I came across one logging related issue where in I was supposed to mask some confidential information before logging. I was using logback framework. The simplest approach would be to programmatically mask the information before log statement.

E.g: Suppose we have a method logCCDetails(), which logs credit card details. Below is the pseudo code to do logging with masking.

logCCDetails(){
       LoggerFactory logger = LoggerFactory.getLogger(MyClass.class);
       logger.info(mask(accountNumber));
       logger.info(accountType);
}


//method to mask the confidential number
String mask(String acctNum){
//logic to replace all the digits of account number to 'X'
//then return the masked account number
}


So, basically we are using an utility method mask() to mask the confidential information before logger.info() statement. This approach works well if logger.info() statements for confidential information is in our application code. In those cases where we are using some external jars and that jar's code is having these logging statement, we can't add mask() method call before log statement, so this approach will not work.
In my case I was using Spring framework RestTemplate(available in spring-web jar) and apache HttpClient(available in commons-httpclient jar) to invoke web service calls using Json request/response.  Apache HttpClient internally logs every json request/response and headers before and after invoking web service call. There were some web service calls which had some confidential information in request/response. My task was to mask those confidential information in logs.
Below is the entry defined in logback.xml for web service calls:

<appender name="RESTSERVICE"
class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>restservice.log</file>
<append>true</append>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>restservice-%d{yyyy-MM-dd}.log</fileNamePattern>
<MaxHistory>1</MaxHistory>
</rollingPolicy>
<layout class="ch.qos.logback.classic.PatternLayout">
<Pattern>%d %-5p [%X{IPAddress}] [%X{SessionId}] %c - %m%n</Pattern>
</layout>
</appender>


As per the logback documentation, there is a conversion specifier called "replace", that can be used to replace strings in log statement. The format of this specifier is as follows:
replace(p){r, t} : Replaces occurrences of 'r', a regex, with its replacement 't' in the string produces by the sub-pattern 'p'. For example, "%replace(%msg){'\s', ''}" will remove all spaces contained in the event message. This can be used in above configuration inside <Pattern> tag.
<Pattern>%d %-5p [%X{IPAddress}] [%X{SessionId}] %c - %replace(%msg){'\s', ''}%n</Pattern>


I tried this in multiple ways, but it didn't work as expected. Probably I might be doing something wrong.

Then the solution I tried was to add my own custom Pattern Layout class. In above configuration the configured pattern layout class is ch.qos.logback.classic.PatternLayout provided by logback. I extended this class and overriden the doLayout() method as follows:


public class RestServicePatternLayout extends PatternLayout{


@Override
public String doLayout(LoggingEvent event) {
   String message=super.doLayout(event);
   if(event.getLoggerRemoteView().getName().equalsIgnoreCase("httpclient.wire.content")){
    message = MaskingUtil.maskConfidentialInfo( message );
   }
   return message;
 }
} 


public class  MaskingUtil {
     public static String maskConfidentialInformations(String message) {
      //  first match the pattern of confidential information
     // for e.g. - the pattern inside json for confidential info could be "confInfo":"1234".
    // Once match is found, replace all the matching strings with "X"
    }
}


This method returns the String back to the calling method, that will be printed in log files. In above code, I am comparing the logger name with string "httpclient.wire.content". This is the name which apache httpclient jar uses to initialize LogFactory for logging web service requests/responses. You can find this string in org.apache.commons.httpclient.Wire.java file.
Once the pattern matches, we mask the confidential information with X's. In above code I have written, a separate utility method to do this masking task.

After this, I configured this custom PatternLayout class in logback.xml as follows:

<appender name="RESTSERVICE"
class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>restservice.log</file>
<append>true</append>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>restservice-%d{yyyy-MM-dd}.log</fileNamePattern>
<MaxHistory>1</MaxHistory>
</rollingPolicy>
<layout class="com.my.logging.RestServicePatternLayout">
<Pattern>%d %-5p [%X{IPAddress}] [%X{SessionId}] %c - %m%n</Pattern>
</layout>
</appender>


One of the advantage what I could see is that, in logback.xml we can configure our custom layout to run for some specific appenders to improve performance. The appender should have logger name configured in such a way that it runs only for the log statements where we can expect the confidential information. We don't need to run it for all the log statements. Since we are doing String comparison every time before actual logging, it may impact performance if we run it for every log statement. So, to avoid this, I configured the "RESTSERVICE" appender only for below logger:



<logger name="httpclient.wire" additivity="false">
<level value="debug" />
<appender-ref ref="RESTSERVICE" />
</logger>
The logger name "httpclient.wire" is used only for web service requests/responses, so our custom layout will run only for those log statements. This way we can avoid unnecessary String comparison for other log statements.







Wednesday, February 8, 2012

JAVA Garbage Collection


JAVA Garbage Collection
I was reading some of the articles about java garbage collection, so thought of posting it. This post is completely based on my personal understanding of garbage collection and can’t claim that everything I am writing is correct. So, please read it at your own risk ;)

From the name it is clear that Garbage Collection is removing garbage from your neighborhood. By reading garbage word itself, we must be feeling to clean that out. Obviously we all like to live in clean place (except few who prefers to live in mess :) ). Same applies to a JAVA applications, there having too much of garbage can cause OutOfMemory error and our application may crash. Our first task is to find out what could be the garbage for a java application.
Lets look into JAVA memory model first.




- PC Register: It stores the pointer to the next JVM instruction to be executed. It will keep on changing as the execution progresses to next JVM instruction. So, it does not look like garbage.
- Method Stacks: In above diagram I have mentioned 2 stacks - JVM and native method stacks.
JVM stacks keeps frames for every thread. Thread adds one frame for every method execution. Every frame stores local variables, intermediate calculations of a method execution, method parameters and return value. Once the thread execution completes, the respective frame will be removed from stack. So, this memory area also does not seem to be a candidate for garbage collection. Native Method Stacks is also similar to normal stacks except that It keeps frames for a native method execution.
- Method Area: The method area stores class structure. That includes methods, constructors, constants and static fields.This can be garbage collected sometimes. If a class becomes unreferenced by application, then JVM will unload this class by using Garbage collection. This can happen in some specific scenarios, I will not be discussing this type of garbage collection in this blog.
- Heap: Heap stores class instances(objects) and arrays. Throughout execution of any JAVA application, we create numerous objects. These objects are of varying size depending on class structure. This is the area which consumes maximum of memory among all memory areas. JVM does not provide control to programmer to explicitly remove any object from Heap, once that object is no longer in use. Due to large sizes of objects we must need to deallocate this memory time to time once we feel that some of the unused objects are no longer in use. To reclaim the memory occupied by the unused object JVM uses it’s own Garbage Collector, which runs as a daemon thread by JVM and reclaims unused objects from the heap memory.

Now we know the memory area which requires to be garbage collected.

Now we can formally say:
Garbage Collection is a process using which JVM reclaims the object from heap memory, once the object is assumed to be no longer needed.
In C++, It was programmer’s responsibility to manually release these dynamically allocated objects, so it was more error prone. In case where due to programmer’s error if lots of unused objects are present in heap memory and not reclaimed, then it may lead to Out Of Memory error and ultimately the application will crash. In JAVA, this task is done automatically by JVM, so it frees programmer from this complicated task and let him concentrate more on logic implementation.
JVM uses a daemon thread called Garbage Collector(GC) to execute the garbage collection task. Before reclaiming an object from Heap memory, it invokes finalize() method. The finalize() method is declared in the java.lang.Object class. It has below signature:
protected void finalize () throws throwable { //  }
finalize() method is called just before garbage collection. It is not called immediately after the object goes out of scope. So, we should not rely on this method for normal program operation. The intent of this method is to release system resources such as open files or open sockets before getting collected.

When an object will be garbage collected ?
An object will be eligible for garbage collection when it is unreachable from application code. Below are the 4 scenarios, which can make an object unreachanble:
1) When reference to an object is set to null explicitly.
method()1{
Object obj = new Object();
obj = null;
-----------
-----------
}
In above code, we create an instance of class Object. After that we explicitly set the reference obj to null. After this statement, we can’t reach the object created in 1st statement, so this object is eligible for garbage collection.
2) When object reference goes out of scope.
method2(){

for(int i=0;i<10;i++){

Object obj = new Object();

break;

}

----------------------------------

----------------------------------
}
In above code, once execution comes out of for loop, the object created inside the loop becomes unreachable. So, it became eligible for garbage collection.
3) Island of Isolation
If an object obj1 has internal reference to another object obj2 and obj2 has internal reference to obj1 but none of them has any outside reference. In this case both of the objects will not be reachable, then both obj1 and obj2 will be eligible for garbage collection.
class A{
B b;;
}
class B{
A a;
}
class C{
public static void main(String[] args){
A a = new A();
a.b = new B();
B b = new B();
B.a=new A();
a =null;
b=null;
//after above 2 lines a and b both are set to null. The objects referenced by a and     
// b has internal references to objects of type B and A respectively, but we can’t
// access those 2 objects. So those will be garbage collected.
}
}
4) Weak references
Weak reference does not prevent referred object from being garbage collected. Any object which does not have any other reference apart from weak reference, can be garbage collected.

Can we programmatically invoke GC?
We can't force JVM to run GC. We can only request JVM to do garbage collection, after that it is up to JVM whether to run GC immediately or after some time. System.gc() method is used for this purpose.


Now lets discuss some of the approaches of garbage collectors:
1) Reference Counting Collectors
In this approach JVM keeps count of references of every object in heap. Once  the count becomes zero the object will be reclaimed from heap. One of the disadvantage is overhead of incrementing/decrementing the counter. Another disadvantage is that it cannot identify those objects for garbage collection which comes under Island of Isolation category.
2) Tracing Collectors
In this approach, JVM first traces out the graph of object references starting with root node and marks all the referred objects as reachable. Other objects are assumed to be unreachable and later be reclaimed from heap. One of the tracing algorithm is “Mark and Sweep”, in which first phase is marking the object as reachable and in sweep phase it frees up the memory occupied by unmarked objects.
3) Compacting Collectors
Compacting collectors moves live objects towards one end of heap area. So, other end of heap become contiguous free area, from which new objects can be allocated.  This process could be slow because of overhead of moving live objects to one end of heap. This includes marking the objects as live, copying live objects to other end of heap and updating the object references to point to new location. Below picture shows the status of heap memory before and after GC run:
4) Copying Collectors
In this approach, all the live objects are copied to a new area in heap as contiguous area similar to compacting collectors. But it does not use mark and sweep algorithm. It copies live objects on the fly and updates the referencing pointers. This algorithm is known as “Stop and Copy”. It also divides heap into two parts similar to compacting collectors. It uses only one heap area out of two areas at once. Once one area is full, it moves all the live objects from this area to another heap area. One of the disadvantage of this approach is that it requires more heap area, since it uses only half of the heap area at once.
5) Generational Collectors
In above algorithms every object in being scanned and/or moved in every GC run.  With large number of objects in memory, these algorithms may not perform good. Also in practical scenario we may have observed that some objects have very short life and some objects have very long life. So, copying those long lived objects multiple times is not a good strategy from performance point of view. Generation collectors are designed to address these performance issues. Generational collectors divides the heap into multiple young and old heap areas. After surviving few GC runs the object is promoted to next level(from young to old). In this approach GC runs more often in younger heap area and less often in older heap area. That makes sure that short lived objects are reclaimed quickly and long lived objects are not scanned/moved more often.
The default arrangement of generations is shown in below picture:
(picture from http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html)

Young generation is divided into one Eden and two Survivor spaces. Initially after object creation, it gets allocated into Eden space. If object from Eden survives after next minor garbage collection, then it moves to Survivor 1 and then to Survivor 2. After major garbage collection if the object survives then it is promoted to Tenured area.  The perm generation stores data needed by the virtual machine to describe objects that do not have an equivalence at the Java language level. Example: objects describing classes and methods.
6) Train(Incremental) Collectors
JVM usually pauses other running programs while running GC. Collecting larger heap area may take longer time, that will ultimately cause other programs to stop for a longer time. To address this issue, incremental approach is used. In this approach, GC runs on a portion of heap area in one run than running on whole heap memory. This approach avoids other running programs to pause for longer time and does garbage collection incrementally.


JVM provides parameters to set JVM memory areas to different values. We can also set JVM parameters to set Heap memory components to different values to tune GC. Depending on our java application requirements, we may need more younger heap memory and lesser tenured memory or vice versa. I am not covering the JVM parameters used for these purposes.

Please let me know your feedback on this post.