Sunday, March 3, 2013

My experience with Nodejs

After being a Java developer for many years, I recently got a chance to develop a web application on Node.js. This post is about my good and bad experiences with Node.js on that project. Node is a JavaScript platform for server-side programming. Earlier I thought of JavaScript as a superb tool for UI programming, where it really does amazing things. Node changed that perspective: it lets JavaScript run on the server side. Those hearing about Node for the first time may think, "What? JavaScript can run on the server?" (I thought the same.) But it is a really cool technology for server-side programming. It uses Google's V8 engine (also used by Chrome) as its runtime, which is known for its fast performance. In Node, a single process serves all incoming requests. It uses asynchronous callbacks, so the single process is not stuck on one request when I/O is slow and can process other requests in the meantime.
Before writing up my findings, a disclaimer: for brilliant programmers it does not matter which technology they use, they will always make things run smoothly. But for an average programmer like me (and many others), there is always a good and a not-so-good programming paradigm for solving a given problem. OK, after this disclaimer I hope good programmers won't beat me up after reading my thoughts about Node :)

Let's come to the findings:

  • In Node, you write code in JavaScript. Almost every programmer who has worked on a web application knows a little JavaScript. Its syntax is simple and similar to that of other programming languages. Overall, learning this programming paradigm won't take much time if you have a programming background in any language. Once you are comfortable with writing asynchronous blocks, you are done.
  • Node servers are very lightweight. You don't need an external web server. A few lines of code and you are done. Below is a sample Node server:
      var http = require("http");
      http.createServer(function(request, response) {
          response.writeHead(200, {"Content-Type": "text/plain"});
          response.write("Hello World...");
          response.end();
      }).listen(8080);
      console.log("server has started...");

Save the above lines in a file "sample.js". Run it by typing node sample.js and your server is ready. Go to your browser and type http://localhost:8080, and you will see "Hello World..." displayed on your screen. While executing the above code you might have observed that the http.createServer() function is executed and registers the inner callback function, which takes request and response as arguments. Execution then moves on to the next statement and prints "server has started..." without waiting for HTTP requests. Whenever you hit the URL "http://localhost:8080", Node calls the inner callback function, which returns the response. So execution does not get stuck at createServer(); it continues to the next statement, and the callback does the job of handling HTTP requests whenever it is called. That is the power of asynchronous callback programming: it makes sure that your single process does not get stuck anywhere. Because it is a single-process application, you don't need to worry about multithreading issues; you only need to write your logic in the correct asynchronous blocks.

  • Due to its very low memory consumption per connection and its asynchronous I/O approach, Node scales well under high load. It can handle a large number of connections with very few servers deployed, which also lowers your hardware cost.
  • As you might have observed, all I/O is done asynchronously. Your server makes an I/O request and goes on to process other tasks. No I/O waiting is involved, which saves CPU time. But for CPU-intensive work, which happens inside the server without much I/O, your single process keeps doing that work until it finishes and cannot do any other task, so other requests wait during that period. So, if your application is dominated by I/O tasks rather than long-running CPU-intensive tasks, Node should perform very well for it.
  • There is one major disadvantage of being a single-process application. If an error occurs at run time and is not handled, the Node process stops, which brings down your server, and you need to start it again.
  • npm (https://npmjs.org/) is a repository of Node modules with thousands of libraries for different purposes. You can also develop your own library and contribute it there. I have used many libraries from npm, but I felt that most of them are not mature enough to be used in a big project. So, before using any of those libraries, please analyze and test them carefully.
  • To develop a large enterprise-level application, you need many general frameworks, for example: a good MVC framework, logging, ORM, web services (REST/SOAP) frameworks, a unit test framework and many others specific to your application. I found a few very good Node frameworks and used them without any issue. But for a few areas I could not find any mature framework, and I ended up extending an available library to make it suitable for my application. The frameworks I used earlier for those purposes in other programming languages are much more mature than the Node libraries and cover all the general requirements, maybe because they are developed and maintained by well-known organizations or groups and go through a proper review cycle. I feel this problem will go away with time: once many tech companies start using Node, more mature frameworks and libraries will start coming out. But at this time, of all the Node libraries I tried, I found many not very useful.
  • Any big project goes through many changes, and you will keep refactoring the code to keep it clean. Because JavaScript is interpreted and dynamically typed, there is no compile step to catch mistakes such as a misspelled function or property name; you may not learn about the error until it happens in production. So it is highly desirable to have unit test coverage as high as possible, otherwise refactoring will be a nightmare. High code coverage at least makes sure that your code is free of such basic errors. Personally, I would prefer a statically typed language over a dynamically typed one for big projects, keeping maintenance in mind.
  • Many times you will want to write sequential logic, in which the outcome of the first statement is used in the second, the outcome of the second in the third, and so on. For example:
         i = fun1();
         j = fun2(i);
         k = fun3(j);

If all of the above functions are asynchronous in nature, then your code in Node will look like:

fun1(function(i){
  fun2(i, function(j){
    fun3(j, function(k){
      //do some operation on k
    });
  });
});
Having this many levels of nested asynchronous blocks makes your code look not so good.
Let's see another example, where one function returns an array and another function has to work on every element of the returned array:
fun1(function(error, results) {
  var completedTasks = 0;
  for (var i = 0; i < results.length; i++) {
    fun2(results[i], function(j){
      //do something with j
      completedTasks++;
      if (completedTasks == results.length) {
        console.log("completed all tasks");
      }
    });
  }
});
This kind of sequential logic does not look good in the asynchronous style of coding. There are design patterns that make this type of logic look much more decent than the approach above. I won't discuss those design patterns here, but you can use them to make your code look better than this. Personally, I didn't like some of the code I wrote in the asynchronous way.

  • You need to be careful while writing synchronous blocks of code. If synchronous code in your application takes a long time, Node's single process will be busy during that time and will not be able to serve any other job. That slows down the overall performance of your application. So it is better to get used to writing asynchronous blocks of code; otherwise the wrong coding style will make your single-process Node application perform slowly.

Finally, I would say every technology choice depends on the nature of the problem. I would really prefer Node for applications that require more I/O than CPU-intensive work. A few examples are chat applications and server-push applications. These applications must support a large number of user connections but mostly perform I/O operations for those requests. In these scenarios, I would prefer Node over other technologies. But for large web applications that require a lot of logic to be written, have many important problems to solve besides slow I/O, and keep the CPU busy for user requests, I would prefer other, more stable web technologies over Node.


Saturday, October 20, 2012

Timer vs ScheduledThreadPoolExecutor

Many times we come across scenarios where we need a background thread that keeps running and executes tasks at scheduled times. The JDK provides java.util.Timer and java.util.concurrent.ScheduledThreadPoolExecutor for this purpose.

java.util.Timer
Timer was introduced in JDK 1.3.
There is a single background thread for every Timer object, and it executes all of that timer's tasks. We can create an instance of a TimerTask subclass (TimerTask is an abstract class that implements Runnable) and add it to the Timer object. The Timer object takes care of running the TimerTask at the scheduled time. Timer runs its tasks sequentially. So, if there are 4 tasks to be executed by the same Timer object at the same time and any one of them takes a long time, the remaining tasks may get delayed. We should keep this in mind while designing our system.


code example:


import java.util.Timer;

public class TimerTest {

    /**
     * @param args
     */
    public static void main(String[] args) {
        Timer timer = new Timer();
        timer.schedule(new MyTask(), 1000);
    }

}

class MyTask extends java.util.TimerTask {
    public void run() {
        System.out.println("Time task is running");
    }
}


In the above code, I scheduled a task to run after 1000 milliseconds (1 second).

A task can be scheduled for single or recurring execution. See the Javadoc for the various overloads of Timer.schedule().
The Timer class also provides a cancel() method, which terminates the timer and discards its scheduled tasks. An individual task can be cancelled with TimerTask.cancel().

The Timer class internally maintains a queue, and all TimerTask objects are inserted into that queue. A TimerTask object can be in one of these states: VIRGIN, SCHEDULED, EXECUTED, CANCELLED. When a task is created, its initial state is VIRGIN. Once it is scheduled on a Timer, its state changes to SCHEDULED. After a non-repeating task has been executed, its state becomes EXECUTED. If we set a task's state to CANCELLED (by calling its cancel() method), it will never be picked up by the Timer for execution.
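
The single-thread behaviour described above can be checked with a small experiment. This is only a sketch: the class name TimerSingleThreadDemo, the method runTwoTasks() and the thread name "demo-timer" are made up for illustration. It schedules two tasks on one Timer and records which thread ran each of them.

```java
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class: one Timer runs all of its tasks on a single thread.
class TimerSingleThreadDemo {

    // Schedules two tasks on one Timer and returns the names of the threads that ran them.
    static List<String> runTwoTasks() {
        final List<String> threadNames = new CopyOnWriteArrayList<>();
        final CountDownLatch done = new CountDownLatch(2);
        Timer timer = new Timer("demo-timer"); // names the timer's background thread

        for (int n = 0; n < 2; n++) {
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    threadNames.add(Thread.currentThread().getName());
                    done.countDown();
                }
            }, 100); // both tasks become due after 100 ms
        }

        try {
            done.await(5, TimeUnit.SECONDS); // wait for both tasks to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            timer.cancel(); // stop the timer thread so the JVM can exit
        }
        return threadNames;
    }

    public static void main(String[] args) {
        System.out.println(runTwoTasks());
    }
}
```

Both recorded names should be the same, because the two tasks share the timer's only thread. If the first task slept for a few seconds, the second one would simply start late.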

java.util.concurrent.ScheduledThreadPoolExecutor
You have seen above that a Timer has a single worker thread which executes multiple tasks sequentially. That may not serve your purpose if you want parallel execution or if you have long-running tasks.

ScheduledThreadPoolExecutor provides multiple worker threads by using a thread pool. This class extends ThreadPoolExecutor and uses a pool of threads to execute multiple tasks in parallel.
It provides scheduling methods similar to those of Timer. See the Javadoc for method details.
ScheduledThreadPoolExecutor was introduced in JDK 1.5.


Code example:


import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ScheduledThreadPoolExecutorTest
{

    /**
     * @param args
     */
    public static void main(String[] args){
        //creates thread pool of size 2
        ScheduledThreadPoolExecutor threadPool = new ScheduledThreadPoolExecutor(2);
        threadPool.schedule(new MyTask1(), 1, TimeUnit.SECONDS);
        threadPool.schedule(new MyTask2(), 1, TimeUnit.SECONDS);
    }

}
class MyTask1 implements Runnable
{
   public void run()
   {
      System.out.println("Task1 is running");
   }
}
class MyTask2 implements Runnable
{
   public void run()
   {
      System.out.println("Task2 is running");
   }
}


The above code runs MyTask1 and MyTask2 in parallel after 1 second, using the 2 threads in the thread pool.

If you look at the Javadoc of ScheduledThreadPoolExecutor, you can see that it provides a richer set of methods than Timer.
Internally it keeps all the runnable tasks in a delay queue (java.util.concurrent.DelayQueue in the JDK versions current when this was written; later JDKs use a specialized internal queue). In a delay queue, an element can be taken only when its delay has expired.
You can also use Future and Callable through its submit() method, which returns a Future representing the pending result of the task.
It provides a shutdown() method to stop accepting new tasks; a task that is currently running is completed before the executor shuts down.

In my opinion, one should go with ScheduledThreadPoolExecutor when multiple worker threads are required for parallel execution. It is also a good choice when the requirement is complex and a richer infrastructure is needed. For simple cases, Timer and TimerTask will be sufficient.
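
As a small illustration of the Callable and Future support mentioned above, here is a sketch that schedules a Callable with a delay and waits for its result. The class name ScheduledCallableDemo and the method computeLater() are invented for this example; schedule(), ScheduledFuture.get() and shutdown() are the standard JDK calls.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class for scheduling a task that returns a value.
class ScheduledCallableDemo {

    // Schedules a Callable to run after delayMillis and blocks until its result is available.
    static int computeLater(final int a, final int b, long delayMillis) {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(2);
        try {
            Callable<Integer> task = () -> a + b;
            ScheduledFuture<Integer> future =
                    pool.schedule(task, delayMillis, TimeUnit.MILLISECONDS);
            // get() waits for the delayed task to complete (at most 5 seconds here)
            return future.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("scheduled task failed", e);
        } finally {
            pool.shutdown(); // no new tasks are accepted; worker threads can then exit
        }
    }

    public static void main(String[] args) {
        System.out.println(computeLater(2, 3, 200)); // prints 5
    }
}
```

Calling shutdown() in the finally block also lets the JVM exit, since the pool's worker threads are not daemon threads by default.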

Thursday, May 10, 2012

Java utility to invoke Restful web services using Spring RestTemplate


This is a utility/framework to invoke RESTful web services. Internally it uses Spring RestTemplate to invoke the web services. Apart from the Spring jars, it uses some external jars. If your application already includes these jars, you do not need to include them again.

You can download the RestFramework.rar file containing all the resources from the link below:
https://docs.google.com/open?id=0B8O-miA80x0gUTBlSHoxQ0paWGc

The archive contains the following files:
  1. RestfulWS.jar - jar to be included in your application. It has all the classes of this utility.
  2. RestfulWS-src.jar - jar containing the source code
  3. External lib - other external jars required by RestfulWS.jar. Include these if they are not present in your application.
  4. Test - folder containing all the test classes. Refer to these to see how to use this utility.
  5. Class diagram.jpg - class diagram

Class Diagram:



If you are not able to view this class diagram, you can refer to the Class diagram.jpg file included in RestFramework.rar.
In the above class diagram, you can see that there are a couple of concrete classes as well as a couple of abstract classes. The concrete classes can be instantiated directly and used by passing the URL, HTTP method and other required parameters through the constructor and the invoker method. The abstract classes at the lower end of the class hierarchy are meant to be used as templates. You can extend them and pass the required parameters (URL, HTTP method, etc.) by overriding the corresponding methods in your concrete class.
This utility supports JSON as well as XML based REST web services. If your request/response involves some other kind of data (for example, some web service calls may return an array of bytes as the response), you can extend any suitable class in this hierarchy and add message converters (marshaller/unmarshaller) for it.

Details of interfaces, abstract classes and concrete classes:

    1) RestfulWebServiceInterface<WI,WO>
This is the topmost interface, parameterized with:
WI - input class type for the web service
WO - class type for the web service response

It has only one method, execute(), which is used to invoke the web service.

    2) GenericRestfulWSInvoker<WI,WO>
This is an abstract class which implements RestfulWebServiceInterface<WI,WO>. It implements the execute() method of RestfulWebServiceInterface. It also has multiple protected methods, which are used by execute() internally. These methods can be overridden in subclasses if required.

    3) JsonBasedRestfulWSInvoker<WI,WO>
This is a concrete class which extends GenericRestfulWSInvoker<WI,WO>. It overrides those protected methods of GenericRestfulWSInvoker<WI,WO> that need to behave differently when the request and response type of the web service is JSON. This class can be instantiated, and the web service can be invoked directly using an instance of it.
See Test2.java in the Test folder for sample code that uses this class.

    4) XMLBasedRestfulWSInvoker<WI,WO>
This is a concrete class which extends GenericRestfulWSInvoker<WI,WO>. It overrides those protected methods of GenericRestfulWSInvoker<WI,WO> that need to behave differently when the request and response type of the web service is XML. This class can be instantiated, and the web service can be invoked directly using an instance of it.
See Test1.java in the Test folder for sample code that uses this class.

    5) RestfulWSInvokerTemplateInterface<I,O,WI,WO>
This interface provides a more generic and extended way of invoking a web service. It can be used if your web service layer is isolated from your application's other layers and you don't want to expose the web service layer's VOs (for request and response) to those layers.
For example:

In the above diagram you can see that we have a separate business layer and web service layer. The business layer passes an object of type I. The web service layer accepts the instance of I and converts it into an instance of WI. This WI is then converted into a request and sent to invoke the web service. The response is mapped to an instance of WO, which is then converted into an instance of O and sent back to the business layer. In this scenario the business layer knows nothing about WI, WO and other web service specific details. If you follow this template pattern, you need to create one concrete class for each web service call, containing all the details of that particular call.

This interface has only one method, invoke(), to invoke the web service.
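
The I -> WI -> WO -> O flow described above can be sketched in a few lines. Note that everything in this sketch is an illustrative stand-in: the names InvokerTemplate, AbstractInvokerTemplate, CustomerNameInvoker and the three protected methods are assumptions, not the actual signatures found in RestfulWS.jar, and the HTTP call is replaced by a stub so that the example runs on its own.

```java
// Hypothetical stand-in for the template interface; real signatures may differ.
interface InvokerTemplate<I, O, WI, WO> {
    O invoke(I input);
}

// Hypothetical template: fixes the order of the steps, subclasses fill in the details.
abstract class AbstractInvokerTemplate<I, O, WI, WO> implements InvokerTemplate<I, O, WI, WO> {
    protected abstract WI toRequest(I input);      // business object -> web service request VO
    protected abstract WO callService(WI request); // the actual HTTP call would go here
    protected abstract O toResult(WO response);    // web service response VO -> business object

    @Override
    public final O invoke(I input) {
        return toResult(callService(toRequest(input)));
    }
}

// One concrete class per web service call. Here the business layer passes a
// customer id (I = Integer) and gets back a name (O = String); WI and WO are JSON strings.
class CustomerNameInvoker extends AbstractInvokerTemplate<Integer, String, String, String> {
    @Override
    protected String toRequest(Integer id) {
        return "{\"customerId\":" + id + "}";
    }

    @Override
    protected String callService(String json) {
        // Stubbed response; a real implementation would send json over HTTP.
        return "{\"name\":\"customer-" + json.replaceAll("\\D", "") + "\"}";
    }

    @Override
    protected String toResult(String json) {
        // Extracts the text between ":" + opening quote and the last quote.
        return json.substring(json.indexOf(":\"") + 2, json.lastIndexOf('"'));
    }
}

class TemplateFlowDemo {
    public static void main(String[] args) {
        InvokerTemplate<Integer, String, String, String> invoker = new CustomerNameInvoker();
        System.out.println(invoker.invoke(42)); // prints customer-42
    }
}
```

The business layer only ever sees Integer and String here; the JSON request and response types stay inside the invoker class, which is the point of the template pattern.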

    6) GenericRestfulWSInvokerTemplate<I,O,WI,WO>
This abstract class implements the RestfulWSInvokerTemplateInterface<I,O,WI,WO> interface and extends GenericRestfulWSInvoker<WI,WO>. It provides the implementation of the invoke() method. It also defines some abstract methods required for this template.

    7) JsonBasedRestfulWSInvokerTemplate<I,O,WI,WO>
This is an abstract class which extends GenericRestfulWSInvokerTemplate<I,O,WI,WO>.
It overrides those protected methods of GenericRestfulWSInvoker<WI,WO> that need to behave differently when the request and response type of the web service is JSON. This class can be extended to create a concrete template class, and the web service can be invoked directly using an instance of the extended class.
See Test4.java in the Test folder for sample code that uses this class. The folder also has a template class, JsonBasedRestfulTemplateImpl.java, which is used by Test4.java.

    8) XMLBasedRestfulWSInvokerTemplate<I,O,WI,WO>
This is an abstract class which extends GenericRestfulWSInvokerTemplate<I,O,WI,WO>.
It overrides those protected methods of GenericRestfulWSInvoker<WI,WO> that need to behave differently when the request and response type of the web service is XML. This class can be extended to create a concrete template class, and the web service can be invoked directly using an instance of the extended class.
See Test3.java in the Test folder for sample code that uses this class. The folder also has a template class, XMLBasedRestfulTemplateImpl.java, which is used by Test3.java.





Please let me know your thoughts about this utility; that will help me improve it and make it more generic.








Tuesday, May 1, 2012

Python link checker to find broken links in a website

Recently I took the online course "CS 101: Building A Search Engine" at www.udacity.com. In this course I learnt the basics of Python programming, crawlers and search engines. I thought of building a small utility that uses a crawler to locate broken links in a web site. This utility crawls every page of your web site and checks whether all the links are accessible. In case of an error it reports the URL along with its HTTP error code. The utility is suitable for web sites that have multiple pages connected through links (using "href") and do not rely much on form submissions and AJAX calls.

Algorithm:
  • It accepts 2 parameters: 1) the start page URL and 2) the max depth to which the crawler runs.
  • It first crawls the root page and adds all its links to the to-be-crawled list.
  • Once a page is crawled, it is removed from the to-be-crawled list and added to the crawled list along with its status (for an accessible page the status is "OK", for an HTTP error the status is the HTTP status code, and for a wrong URL the status is "invalid url").
  • It keeps crawling the links until it reaches the max depth.
  • After it finishes crawling, it writes a file named "site-health.txt" containing all the URLs along with their status.

Note:
  • This utility is most useful during the release or support phase of a project, where after every new build you want to make sure that all the links are working.
  • It does not crawl pages which are loaded through AJAX calls.
  • It only follows links written as <a href="<url>"></a>.
  • It crawls only those links which are internal to the domain of the root URL. It does not crawl links on external domains. For example, if your root URL is www.a.com and your web site has a link to an external site www.b.com, it will crawl all the links inside the www.a.com domain and check the link to www.b.com itself, but it won't crawl the links found on www.b.com. If you want to add more domains for crawling, you need to edit the source code as follows:
    • Find the statement domain = get_domain(root) in the source code and change it to
                   domain = get_domain(root, "b", "c", <other domains>)
                  It will then crawl the root URL's domain as well as domains b, c and any other domains given in that statement.
  • I have tested it using Python 2.7.2. Please make sure that you have Python 2.7.2 or a later version installed on your machine.
Source code:
https://docs.google.com/open?id=0B8O-miA80x0gS2JnSkVqTkZtTWs

How to use:
  • Download check-web-health.py from the above link and open the source code in an editor.
  • Go to the last line of the code. It reads: check_web_health('http://google.com',2)
  • Edit this line to check_web_health(<url of start page>,<max depth of crawling>)
  • Save and run the program.
  • After the program exits, look for a file named "site-health.txt" in the same directory as check-web-health.py. Each line in this file has a URL along with its status.

My knowledge of Python programming is at an intermediate level, so there may well be some issues with this utility. Please use it at your own risk :)

Please let me know about any issues you face while using this utility.

Thanks

Wednesday, February 15, 2012

Masking confidential information in log files with logback framework


Recently I came across a logging-related issue in which I was supposed to mask some confidential information before logging it. I was using the logback framework. The simplest approach would be to mask the information programmatically before the log statement.

E.g.: Suppose we have a method logCCDetails(), which logs credit card details. Below is the pseudo code to do logging with masking.

void logCCDetails(){
       Logger logger = LoggerFactory.getLogger(MyClass.class);
       logger.info(mask(accountNumber));
       logger.info(accountType);
}


//method to mask the confidential number
String mask(String acctNum){
       //replace every digit of the account number with 'X'
       //and return the masked account number
       return acctNum.replaceAll("\\d", "X");
}


So, basically we are using a utility method mask() to mask the confidential information before the logger.info() statement. This approach works well if the logger.info() statements for confidential information are in our own application code. When we are using external jars and the logging statements are inside the jar's code, we can't add a mask() call before the log statement, so this approach will not work.
In my case I was using the Spring framework RestTemplate (available in the spring-web jar) and Apache HttpClient (available in the commons-httpclient jar) to invoke web service calls using JSON requests/responses. Apache HttpClient internally logs every JSON request/response and the headers before and after invoking the web service call. Some web service calls had confidential information in the request/response. My task was to mask that confidential information in the logs.
Below is the entry defined in logback.xml for web service calls:

<appender name="RESTSERVICE"
class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>restservice.log</file>
<append>true</append>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>restservice-%d{yyyy-MM-dd}.log</fileNamePattern>
<MaxHistory>1</MaxHistory>
</rollingPolicy>
<layout class="ch.qos.logback.classic.PatternLayout">
<Pattern>%d %-5p [%X{IPAddress}] [%X{SessionId}] %c - %m%n</Pattern>
</layout>
</appender>


As per the logback documentation, there is a conversion specifier called "replace" that can be used to replace strings in a log statement. The format of this specifier is as follows:
replace(p){r, t} : Replaces occurrences of 'r', a regex, with its replacement 't' in the string produced by the sub-pattern 'p'. For example, "%replace(%msg){'\s', ''}" will remove all spaces contained in the event message. This can be used in the above configuration inside the <Pattern> tag.
<Pattern>%d %-5p [%X{IPAddress}] [%X{SessionId}] %c - %replace(%msg){'\s', ''}%n</Pattern>


I tried this in multiple ways, but it didn't work as expected. I was probably doing something wrong.

The solution I tried next was to add my own custom pattern layout class. In the above configuration the configured pattern layout class is ch.qos.logback.classic.PatternLayout, provided by logback. I extended this class and overrode the doLayout() method as follows:


public class RestServicePatternLayout extends PatternLayout{

    @Override
    public String doLayout(LoggingEvent event) {
        String message = super.doLayout(event);
        if (event.getLoggerRemoteView().getName().equalsIgnoreCase("httpclient.wire.content")) {
            message = MaskingUtil.maskConfidentialInfo(message);
        }
        return message;
    }
}


public class MaskingUtil {
    public static String maskConfidentialInfo(String message) {
        // first match the pattern of confidential information
        // for e.g. - the pattern inside json for confidential info could be "confInfo":"1234".
        // Once a match is found, replace all the matching strings with "X"
        // and return the masked message
    }
}


The doLayout() method returns the String to the calling method, and that String is what gets printed in the log files. In the above code, I am comparing the logger name with the string "httpclient.wire.content". This is the name which the Apache HttpClient jar uses to initialize the LogFactory for logging web service requests/responses. You can find this string in the org.apache.commons.httpclient.Wire.java file.
Once the logger name matches, we mask the confidential information with X's. In the above code I have written a separate utility method to do this masking.
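
One possible way to fill in the masking method is sketched below. It is only an illustration under assumptions: the field name "confInfo" is the example from the comment above, the class name JsonMaskingSketch is invented, and the regex handles only a simple quoted string value without escaped quotes. Real payloads may need a stricter pattern or a JSON parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a masking utility for JSON log messages.
class JsonMaskingSketch {

    // Matches "confInfo":"<value>"; group 1 is the key part, group 2 the value, group 3 the closing quote.
    private static final Pattern CONF_INFO =
            Pattern.compile("(\"confInfo\"\\s*:\\s*\")([^\"]*)(\")");

    // Replaces every character of each matched value with 'X' and leaves the rest untouched.
    static String maskConfidentialInfo(String message) {
        Matcher m = CONF_INFO.matcher(message);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String masked = m.group(2).replaceAll(".", "X");
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + masked + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints {"user":"bob","confInfo":"XXXX"}
        System.out.println(maskConfidentialInfo("{\"user\":\"bob\",\"confInfo\":\"1234\"}"));
    }
}
```

Keeping the mask the same length as the original value makes the logs easier to compare, but it also reveals the length of the secret; a fixed-length mask is a reasonable alternative.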

After this, I configured this custom PatternLayout class in logback.xml as follows:

<appender name="RESTSERVICE"
class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>restservice.log</file>
<append>true</append>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>restservice-%d{yyyy-MM-dd}.log</fileNamePattern>
<MaxHistory>1</MaxHistory>
</rollingPolicy>
<layout class="com.my.logging.RestServicePatternLayout">
<Pattern>%d %-5p [%X{IPAddress}] [%X{SessionId}] %c - %m%n</Pattern>
</layout>
</appender>


One advantage I could see is that in logback.xml we can configure our custom layout to run only for specific appenders, which helps performance. The appender should be attached to a logger whose name is chosen so that the layout runs only for the log statements where we can expect confidential information. We don't need to run it for all log statements. Since we do a String comparison every time before the actual logging, it may impact performance if we run it for every log statement. To avoid this, I configured the "RESTSERVICE" appender only for the logger below:



<logger name="httpclient.wire" additivity="false">
<level value="debug" />
<appender-ref ref="RESTSERVICE" />
</logger>
The logger name "httpclient.wire" is used only for web service requests/responses, so our custom layout runs only for those log statements. This way we avoid unnecessary String comparisons for other log statements.







Wednesday, February 8, 2012

JAVA Garbage Collection


I was reading some articles about Java garbage collection, so I thought of posting about it. This post is based entirely on my personal understanding of garbage collection, and I can't claim that everything I am writing is correct. So, please read it at your own risk ;)

From the name it is clear that garbage collection is about removing garbage from your neighborhood. Just reading the word "garbage" makes us feel like cleaning it out. Obviously we all like to live in a clean place (except a few who prefer to live in a mess :) ). The same applies to Java applications: having too much garbage can cause an OutOfMemoryError, and our application may crash. Our first task is to find out what counts as garbage for a Java application.
Let's look into the Java memory model first.




- PC Register: It stores a pointer to the next JVM instruction to be executed. It keeps changing as execution progresses to the next instruction, so it does not look like garbage.
- Method Stacks: In the above diagram I have mentioned 2 stacks: JVM stacks and native method stacks.
Each thread has its own JVM stack, which holds frames. A thread adds one frame for every method execution. Each frame stores the local variables, intermediate calculations, method parameters and return value of a method execution. Once the method execution completes, the respective frame is removed from the stack. So this memory area also does not seem to be a candidate for garbage collection. Native method stacks are similar to JVM stacks, except that they keep frames for native method executions.
- Method Area: The method area stores class structure. That includes methods, constructors, constants and static fields. It can be garbage collected sometimes: if a class becomes unreferenced by the application, the JVM can unload the class through garbage collection. This happens only in some specific scenarios, and I will not discuss this type of garbage collection in this post.
- Heap: The heap stores class instances (objects) and arrays. Throughout the execution of any Java application we create numerous objects, of varying size depending on the class structure. This is the area which consumes the most memory among all the memory areas. The JVM does not let the programmer explicitly remove an object from the heap once that object is no longer in use. Because objects can take up a lot of space, this memory needs to be deallocated from time to time once objects are no longer in use. To reclaim the memory occupied by unused objects, the JVM uses its own garbage collector, which runs as a daemon thread and reclaims unused objects from heap memory.

Now we know which memory area needs to be garbage collected.

Now we can formally say:
Garbage Collection is the process by which the JVM reclaims objects from heap memory once they are assumed to be no longer needed.
In C++, it is the programmer's responsibility to manually release dynamically allocated objects, which is more error prone. If, due to a programmer's error, lots of unused objects stay in heap memory and are never reclaimed, it may lead to an Out Of Memory error and ultimately the application will crash. In Java, this task is done automatically by the JVM, so it frees the programmer from this complicated task and lets him concentrate more on implementing the logic.
The JVM uses a daemon thread called the Garbage Collector (GC) to execute the garbage collection task. Before reclaiming an object from heap memory, it invokes the object's finalize() method. The finalize() method is declared in the java.lang.Object class. It has the below signature:
protected void finalize() throws Throwable { /* cleanup code */ }
The finalize() method is called just before the object is garbage collected. It is not called immediately after the object goes out of scope. So, we should not rely on this method for normal program operation. The intent of this method is to release system resources such as open files or open sockets before the object gets collected.

When will an object be garbage collected?
An object is eligible for garbage collection when it is unreachable from application code. Below are the 4 scenarios which can make an object unreachable:
1) When the reference to an object is explicitly set to null.
void method1() {
    Object obj = new Object();
    obj = null;
    // ... rest of the method ...
}
In the above code, we create an instance of class Object. After that we explicitly set the reference obj to null. After this statement, we can’t reach the object created in the 1st statement, so this object is eligible for garbage collection.
2) When the object reference goes out of scope.
void method2() {
    for (int i = 0; i < 10; i++) {
        Object obj = new Object();
        break;
    }
    // ... rest of the method ...
}
In the above code, once execution comes out of the for loop, the object created inside the loop becomes unreachable. So, it becomes eligible for garbage collection.
3) Island of Isolation
Suppose an object obj1 has an internal reference to another object obj2, and obj2 has an internal reference back to obj1, but neither of them has any outside reference. In this case neither object is reachable from the application, so both obj1 and obj2 are eligible for garbage collection.
class A {
    B b;
}
class B {
    A a;
}
class C {
    public static void main(String[] args) {
        A a = new A();
        B b = new B();
        a.b = b;   // the A object refers to the B object
        b.a = a;   // the B object refers back to the A object
        a = null;
        b = null;
        // After the above 2 lines, a and b are both set to null. The two objects
        // still refer to each other internally, but we can't reach either of
        // them from the application. So both are eligible for garbage collection.
    }
}
4) Weak references
A weak reference does not prevent the referred object from being garbage collected. Any object which has no reference other than weak references can be garbage collected.
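To make the weak reference case concrete, here is a small sketch using java.lang.ref.WeakReference from the standard library. The class name WeakReferenceDemo and the helper isAlive are my own, just for illustration. Note that System.gc() is only a request, so the second line printed is not guaranteed to be false on every JVM.

```java
import java.lang.ref.WeakReference;

class WeakReferenceDemo {
    // True while the referent can still be obtained through the weak reference.
    static boolean isAlive(WeakReference<?> ref) {
        return ref.get() != null;
    }

    public static void main(String[] args) {
        Object strong = new Object();
        WeakReference<Object> weak = new WeakReference<>(strong);

        // The strong reference keeps the object reachable, so get() returns it.
        System.out.println("alive while strongly referenced: " + isAlive(weak));

        strong = null;  // now only the weak reference points at the object
        System.gc();    // a request only; the JVM decides if and when to collect

        // Usually false after a collection, but the JVM gives no guarantee.
        System.out.println("alive after dropping the strong reference: " + isAlive(weak));
    }
}
```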

Can we programmatically invoke GC?
We can't force the JVM to run GC. We can only request the JVM to do garbage collection; after that it is up to the JVM whether to run GC immediately or after some time. The System.gc() method is used for this purpose.


Now let's discuss some of the approaches used by garbage collectors:
1) Reference Counting Collectors
In this approach the JVM keeps a count of references to every object in the heap. Once the count becomes zero, the object is reclaimed from the heap. One disadvantage is the overhead of incrementing/decrementing the counter. Another disadvantage is that it cannot identify objects in the Island of Isolation category: they refer to each other, so their counts never drop to zero.
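The island problem is easy to see in a toy model. The sketch below is not how any real JVM works; RefCountDemo, Node, retain, release and link are names I made up to simulate reference counting by hand. A simple chain is fully reclaimed, while two nodes that point at each other keep a count of 1 forever.

```java
class RefCountDemo {
    static final class Node {
        final String name;
        int refCount;  // how many references currently point at this node
        Node next;     // a single outgoing reference keeps the sketch small

        Node(String name) { this.name = name; }
    }

    // Count a new reference to node (from a variable or from another node).
    static void retain(Node node) {
        node.refCount++;
    }

    // Drop one reference. When the count reaches zero the node is "reclaimed",
    // which also drops the reference it was holding.
    static void release(Node node) {
        node.refCount--;
        if (node.refCount == 0 && node.next != null) {
            Node target = node.next;
            node.next = null;
            release(target);
        }
    }

    // Store target into from.next (assumes from.next was empty before).
    static void link(Node from, Node target) {
        retain(target);
        from.next = target;
    }

    public static void main(String[] args) {
        // Chain x -> y with one outside reference to x.
        Node x = new Node("x"), y = new Node("y");
        retain(x);
        link(x, y);
        release(x);  // drop the outside reference
        System.out.println("chain:  x=" + x.refCount + " y=" + y.refCount);  // x=0 y=0

        // Island a <-> b with one outside reference to each.
        Node a = new Node("a"), b = new Node("b");
        retain(a);
        retain(b);
        link(a, b);
        link(b, a);
        release(a);  // drop both outside references
        release(b);
        System.out.println("island: a=" + a.refCount + " b=" + b.refCount);  // a=1 b=1
    }
}
```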
2) Tracing Collectors
In this approach, the JVM first traces the graph of object references starting from the root nodes and marks all the referred objects as reachable. All other objects are assumed to be unreachable and are later reclaimed from the heap. One of the tracing algorithms is “Mark and Sweep”: the mark phase marks the reachable objects, and the sweep phase frees up the memory occupied by unmarked objects.
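Here is a minimal sketch of the two phases over a toy object graph. It is only a simulation: the "heap" is a plain list, and MarkSweepDemo, Obj, mark, sweep and collect are names I invented for this example. Unlike reference counting, an island of isolation is reclaimed here, because nothing reachable from the roots ever marks it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class MarkSweepDemo {
    static final class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();  // outgoing references
        boolean marked;

        Obj(String name) { this.name = name; }
    }

    // Mark phase: flag every object reachable from the roots.
    static void mark(List<Obj> roots) {
        Deque<Obj> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Obj current = pending.pop();
            if (current.marked) {
                continue;  // already visited, this also handles cycles
            }
            current.marked = true;
            pending.addAll(current.refs);
        }
    }

    // Sweep phase: remove unmarked objects from the heap and clear the marks
    // of the survivors for the next run. Returns the names of reclaimed objects.
    static List<String> sweep(List<Obj> heap) {
        List<String> reclaimed = new ArrayList<>();
        Iterator<Obj> it = heap.iterator();
        while (it.hasNext()) {
            Obj obj = it.next();
            if (obj.marked) {
                obj.marked = false;
            } else {
                reclaimed.add(obj.name);
                it.remove();
            }
        }
        return reclaimed;
    }

    static List<String> collect(List<Obj> heap, List<Obj> roots) {
        mark(roots);
        return sweep(heap);
    }
}
```

For example, with a heap holding r -> a plus an island b <-> c and r as the only root, collect(heap, roots) returns [b, c] and leaves r and a in the heap.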
3) Compacting Collectors
Compacting collectors move live objects towards one end of the heap area, so the other end of the heap becomes a contiguous free area from which new objects can be allocated. This process can be slow because of the overhead of moving live objects: it includes marking the objects as live, moving them to one end of the heap and updating the object references to point to the new locations. Below picture shows the status of heap memory before and after a GC run:
4) Copying Collectors
In this approach, all the live objects are copied to a new area in the heap as one contiguous block, similar to compacting collectors. But it does not use the mark and sweep algorithm. It copies live objects on the fly and updates the referencing pointers. This algorithm is known as “Stop and Copy”. It divides the heap into two areas and uses only one of them at a time. Once that area is full, it moves all the live objects from it to the other area. One disadvantage of this approach is that it requires more heap, since only half of the heap area is usable at any time.
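The copy-and-update-pointers step can be sketched over the same kind of toy object graph. Again this is a simulation and not JVM code: StopAndCopyDemo, Obj, collect and copy are hypothetical names, the "to" area is just a list, and a forwarding map remembers where each live object was copied so that every reference can be updated.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

class StopAndCopyDemo {
    static final class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();  // outgoing references

        Obj(String name) { this.name = name; }
    }

    // Copies everything reachable from the roots into a fresh "to" area and
    // returns it. The entries of roots are updated to point at the copies.
    static List<Obj> collect(List<Obj> roots) {
        List<Obj> toSpace = new ArrayList<>();
        Map<Obj, Obj> forwarding = new IdentityHashMap<>();  // old object -> its copy

        for (int i = 0; i < roots.size(); i++) {
            roots.set(i, copy(roots.get(i), toSpace, forwarding));
        }
        // Scan the copies in order and redirect their references.
        // toSpace grows during the scan, so it doubles as the work queue.
        for (int scan = 0; scan < toSpace.size(); scan++) {
            Obj copied = toSpace.get(scan);
            for (int i = 0; i < copied.refs.size(); i++) {
                copied.refs.set(i, copy(copied.refs.get(i), toSpace, forwarding));
            }
        }
        return toSpace;  // objects that were never copied are simply left behind
    }

    // Copies one object unless it was copied already, and returns the copy.
    static Obj copy(Obj original, List<Obj> toSpace, Map<Obj, Obj> forwarding) {
        Obj existing = forwarding.get(original);
        if (existing != null) {
            return existing;
        }
        Obj clone = new Obj(original.name);
        clone.refs.addAll(original.refs);  // still old references, fixed during the scan
        forwarding.put(original, clone);
        toSpace.add(clone);
        return clone;
    }
}
```

Notice that garbage is never visited at all here. The cost of a run depends on the number of live objects, which is one reason this approach suits areas where most objects die young.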
5) Generational Collectors
In the above algorithms every object is scanned and/or moved in every GC run. With a large number of objects in memory, these algorithms may not perform well. Also, in practice we can observe that some objects have a very short life and some objects have a very long life. So, copying those long lived objects multiple times is not a good strategy from a performance point of view. Generational collectors are designed to address these performance issues. Generational collectors divide the heap into multiple young and old heap areas. After surviving a few GC runs, an object is promoted to the next level (from young to old). In this approach GC runs more often in the younger heap area and less often in the older heap area. That makes sure that short lived objects are reclaimed quickly and long lived objects are not scanned/moved too often.
The default arrangement of generations is shown in the below picture:
(picture from http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html)

The young generation is divided into one Eden and two Survivor spaces. A newly created object is allocated in the Eden space. If the object survives a minor garbage collection, it moves from Eden to one of the Survivor spaces, and on later minor collections it is copied between the two Survivor spaces. Once it has survived enough minor collections, it is promoted to the Tenured area. The perm generation stores data needed by the virtual machine to describe objects that do not have an equivalence at the Java language level. Example: objects describing classes and methods.
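You can ask a running JVM which heap areas it actually has. The sketch below uses the standard java.lang.management API; only the class name HeapPoolsDemo and the helper heapPoolNames are mine. The pool and collector names it prints depend on the JVM version and on the collector in use, so I am not showing any expected output.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.util.ArrayList;
import java.util.List;

class HeapPoolsDemo {
    // Names of the heap memory pools reported by the running JVM.
    static List<String> heapPoolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                names.add(pool.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Heap pools: " + heapPoolNames());

        // One bean per collector; typically one for the young area and one for the old area.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount() returns -1 when the count is undefined for a collector.
            System.out.println(gc.getName() + ": " + gc.getCollectionCount() + " collections");
        }
    }
}
```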
6) Train (Incremental) Collectors
The JVM usually pauses the running application threads while running GC. Collecting a larger heap area takes longer, which ultimately causes the application to stop for a longer time. To address this issue, an incremental approach is used. In this approach, GC works on a portion of the heap area in one run rather than on the whole heap memory. This avoids pausing the application for a long time and does garbage collection incrementally.


The JVM provides parameters to set the sizes of its memory areas. We can also use JVM parameters to size the individual heap components in order to tune GC. Depending on our Java application's requirements, we may need more young generation memory and less tenured memory, or vice versa. I am not covering the JVM parameters used for these purposes.

Please let me know your feedback on this post.