Thursday, 29 December 2016

How does Google Search work?

Like other search engines, Google uses special algorithms to generate its search results. Google shares general facts about these algorithms, but the specifics are kept secret. This helps Google stay ahead of competing search engines and reduces the chance of someone figuring out how to abuse the system as a whole.

Google uses spiders, or crawlers: automated programs that crawl over millions of pages on the web. Like other search engines, Google maintains a large index of keywords and metadata, such as where those words can be found and which pages link to them. Google uses its own algorithm, called PageRank, to assign each page a relevancy score and rank search results, which in turn determines the order in which Google displays results on the search results page.

The spider does the search engine's grunt work. It scans web pages, builds the keyword index and categorizes the pages. Once a spider has visited a web page, it follows the links from that page to other pages, and it continues to crawl from one site to the next, so the search engine's index becomes more comprehensive and robust over time.
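
To make this concrete, here is a minimal sketch of a crawler using only Python's standard library: it fetches a page, records its words in an inverted index and follows the links it finds. It is purely illustrative; a real crawler must respect robots.txt, rate limits and much more, and the seed URL is just a placeholder.

```python
# Minimal crawler sketch: fetch pages, index their words, follow their links.
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkAndTextParser(HTMLParser):
    """Collects outgoing links and visible words from one page."""
    def __init__(self):
        super().__init__()
        self.links, self.words = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(data.lower().split())

def crawl(seed_url, max_pages=10):
    index = defaultdict(set)            # keyword -> set of URLs (the inverted index)
    queue, seen = deque([seed_url]), set()
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                    # skip unreachable or non-HTTP links
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in parser.words:
            index[word].add(url)
        queue.extend(urljoin(url, link) for link in parser.links)
    return index
```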

A page's relevancy score, its PageRank, depends on a few factors:

  • The frequency and location of keywords within the page.
  • How long the webpage has existed.
  • The number of other Webpages linked to the page in question.
Let's understand this with an example where we search for the term "Dreamweaver".
 


As more pages link to Adobe's Dreamweaver page, Adobe's PageRank increases. When Adobe's PageRank is higher than the others, its page shows up at the top of the Google search results page. Since Google uses links to a webpage as an input to the relevancy score, it's not easy to cheat the system. The genuine way to make sure your web page ranks high in Google's search results is to provide great content so that users link back to your page. The more links your page gets, the higher its relevancy score will be.

Google uses lots of tricks to prevent people from gaming the system for a higher score. For example, as a web page adds links to more sites, its voting power per link decreases: a web page with a high PageRank and lots of outgoing links can have less influence on each linked page than a lower-ranked page with only one or two outgoing links.
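
To make the idea concrete, here is a minimal sketch of the PageRank iteration over a tiny hand-made link graph. The damping factor and the graph itself are illustrative assumptions; Google's real ranking uses many more signals.

```python
# Minimal PageRank sketch: a page's vote is split among its outgoing links.
def pagerank(links, damping=0.85, iterations=20):
    """links: dict mapping page -> list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = rank[page] / len(targets)   # voting power is split among outgoing links
            for target in targets:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# Pages that attract more incoming links end up with a higher score.
graph = {
    "adobe.com/dreamweaver": [],
    "blog-a": ["adobe.com/dreamweaver"],
    "blog-b": ["adobe.com/dreamweaver", "blog-a"],
}
print(sorted(pagerank(graph).items(), key=lambda kv: -kv[1]))
```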


Useful resources:
https://www.google.com/insidesearch/howsearchworks/thestory/

What happens when you type a URL in the browser and press ENTER?

Here are the basic steps that follow when you type a URL in the browser and press ENTER:
  1. Browser checks cache; if requested object is in cache and is fresh, skip to #9
  2. Browser asks OS for server's IP address
  3. OS makes a DNS lookup and replies with the IP address to the browser
  4. Browser opens a TCP connection to server (this step is much more complex with HTTPS)
  5. Browser sends the HTTP request through TCP connection
  6. Browser receives HTTP response and may close the TCP connection, or reuse it for another request
  7. Browser checks if the response is a redirect (3xx result status codes), authorization request (401), error (4xx and 5xx), etc.; these are handled differently from normal responses (2xx)
  8. If cacheable, response is stored in cache
  9. Browser decodes response (e.g. if it's gzipped)
  10. Browser determines what to do with response (e.g. is it a HTML page, is it an image, is it a sound clip?)
  11. Browser renders response, or offers a download dialog for unrecognized types

Also, there are many other things happening in parallel to this (processing the typed-in address, adding the page to browser history, displaying progress to the user, notifying plugins and extensions, rendering the page while it's downloading, pipelining, connection tracking for keep-alive, etc.).
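
To make steps 2-6 concrete, here is a minimal sketch using Python's standard socket module: resolve the host name, open a TCP connection, send a plain HTTP request and read the response. The host name is just a placeholder, and real browsers add far more (caching, HTTPS, keep-alive, parallel connections).

```python
import socket

host, path = "example.com", "/"      # placeholder host

# Steps 2-3: ask the OS resolver for the server's IP address (DNS lookup).
ip_address = socket.gethostbyname(host)
print("Resolved", host, "to", ip_address)

# Step 4: open a TCP connection to port 80 (HTTPS would add a TLS handshake here).
with socket.create_connection((ip_address, 80), timeout=5) as conn:
    # Step 5: send the HTTP request over the TCP connection.
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    conn.sendall(request.encode("ascii"))

    # Step 6: read the HTTP response until the server closes the connection.
    response = b""
    while chunk := conn.recv(4096):
        response += chunk

print(response.split(b"\r\n")[0].decode())   # e.g. "HTTP/1.1 200 OK"
```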


Thursday, 22 December 2016

Switching to HTTPS - it's easy!

For a back-end developer, switching to HTTPS is fairly straightforward in practice. The basic steps are as follows:


  1. Purchase an SSL certificate and a dedicated IP address from a hosting company such as godaddy.com.
  2. Install and configure the SSL certificate; ask your hosting company for help if needed.
  3. Perform a full back-up of your site in case you need to revert.
  4. Update any hard-coded internal links within the website from HTTP to HTTPS.
  5. Update any code libraries, such as JavaScript, Ajax or third-party plugins, if needed.
  6. Update your web server configuration (e.g. Apache .htaccess, LiteSpeed, NGINX config) or your internet service manager to redirect HTTP traffic to HTTPS (a minimal redirect sketch follows after this list).
  7. Update your CDN's SSL settings, if any.
  8. Implement 301 redirects on a page-by-page basis.
  9. Update any links used in marketing automation tools, such as email links.
  10. Set up the HTTPS site in Google Search Console and Google Analytics.
For a small website, switching will be fairly straightforward, as some of the above points (such as code libraries and CDNs) won't apply. For a larger site, however, this is a non-trivial undertaking and should be managed by an experienced webmaster.
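
The redirect from steps 6 and 8 is conceptually just a 301 response pointing at the https:// version of the same path. Below is a minimal sketch written as a tiny Python server purely for illustration; in practice this is usually a few lines of Apache or NGINX configuration, and the port and fallback host name here are placeholders.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class RedirectToHTTPS(BaseHTTPRequestHandler):
    def do_GET(self):
        host = self.headers.get("Host", "example.com")       # placeholder fallback
        self.send_response(301)                               # permanent redirect
        self.send_header("Location", f"https://{host}{self.path}")
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectToHTTPS).serve_forever()   # port 8080 for illustration
```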

HTTP vs HTTPS - why should one care?

Until recently, HTTPS was really used only by eCommerce sites, and only on specific pages such as payment and login pages. But Google recently announced that HTTPS is a ranking signal, so failing to adopt it could mean your ranking takes a hit.

And that would mean less traffic and less business.

So, first of all, what exactly does the 'S' in HTTPS mean?

HTTP stands for Hypertext Transfer Protocol, which enables communication (message exchange) between different systems on the internet. It is used for transferring data from a web server to a browser in order to view a web page.

HTTP (note: no 'S' at the end) data is not encrypted, so it can be intercepted by third parties at any hop (router) between the two systems to gather the data being transferred.

HTTPS is the secure version of HTTP and involves the use of an SSL certificate, where SSL stands for Secure Sockets Layer, which creates a secure, encrypted connection between the server and the browser. It offers a base level of web security.



With HTTPS, the two computers agree on a code between them and then scramble the messages using that code so that no one in between can read them. The code is negotiated over Secure Sockets Layer (SSL), also known as Transport Layer Security (TLS), and used to send messages back and forth. This keeps the messages safe from hackers and interceptors.
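
To make the handshake concrete, here is a minimal sketch using Python's standard ssl module: it negotiates the encrypted channel with a server and prints a few fields from the certificate it presents. The host name is just a placeholder.

```python
import socket, ssl

host = "example.com"                              # placeholder host
context = ssl.create_default_context()            # verifies the certificate chain

with socket.create_connection((host, 443), timeout=5) as tcp:
    with context.wrap_socket(tcp, server_hostname=host) as tls:
        print("TLS version :", tls.version())     # e.g. TLSv1.3
        print("Cipher      :", tls.cipher()[0])
        cert = tls.getpeercert()
        print("Issued to   :", dict(x[0] for x in cert["subject"]))
        print("Valid until :", cert["notAfter"])
```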

An SSL certificate contains the site owner's public key.

The owner shares the public key, via the SSL certificate, with anyone who needs it, and other users use this public key to encrypt messages to the owner. The owner never shares the private key with anyone.

HTTPS uses an asymmetric Public Key Infrastructure (PKI). PKI uses two keys to encrypt communication, known as the public key and the private key. Anything encrypted with the public key can only be decrypted with the private key.

As the names suggest, the private key must be kept strictly protected and accessible only to its owner; it remains securely ensconced on the web server. The public key is intended to be distributed to anybody and everybody. The recipient's public key is used to encrypt a message, and the decryption key is the recipient's private key.


Though the private and public keys are related mathematically, it is not feasible to calculate the private key from the public key. In fact, the clever part of any public-key cryptosystem is designing that relationship between the two keys.
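
A minimal sketch of that public/private key relationship, using the third-party cryptography package (pip install cryptography) rather than anything specific to HTTPS itself: the public key encrypts, and only the matching private key can decrypt.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# The recipient generates a key pair and shares only the public key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Anyone can encrypt with the public key...
ciphertext = public_key.encrypt(b"card number 4111-xxxx", oaep)

# ...but only the holder of the private key can decrypt.
print(private_key.decrypt(ciphertext, oaep))
```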

That's all for the current entry. We will discuss the process of switching to HTTPS in the next entry.

Enjoy Reading!!

Monday, 19 December 2016

Design a Batch processing system

There are numerous batch-oriented applications in place today that can leverage on-demand processing, including claims processing, large-scale transformation, media transcoding and multi-part data processing work.

Batch processing architectures are often synonymous with highly variable usage patterns: a significant usage peak (e.g. month-end processing) followed by a significant period of under-utilization.

There are numerous approaches to building a batch processing architecture. In this entry we will discuss a basic batch processing architecture that supports job scheduling, job status inspection, uploading raw data, outputting job results and reporting job performance data.


Users interact with the Job Manager application, which is deployed on an Amazon Elastic Compute Cloud (EC2) instance. This component controls the process of accepting, scheduling, starting, managing and completing batch jobs. It also provides access to final results, job and worker statistics, and progress information.

Raw job data is uploaded to Amazon Simple Storage Service (S3), a highly-available and persistent data store.

Individual job tasks are inserted by the Job Manager into an Amazon Simple Queue Service (SQS) input queue on the user's behalf.

Worker nodes are Amazon EC2 instances deployed in an Auto Scaling group. This group is a container that ensures the health and scalability of worker nodes. Worker nodes pick up job parts from the input queue automatically and perform single tasks that are parts of the list of batch processing steps.

Interim results from worker nodes are stored in Amazon S3.

Progress information and statistics are stored in the analytics store. This component can be either an Amazon SimpleDB domain or a relational database such as an Amazon Relational Database Service (RDS) instance.

Optionally, completed tasks can be inserted in an Amazon SQS queue for chaining to a second processing stage.
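
A minimal sketch of how the Job Manager and a worker node might talk through the SQS input queue, using boto3 (the AWS SDK for Python); the queue URL, result bucket and the process_task callback are placeholders, not part of the original architecture description.

```python
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
INPUT_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/batch-input"  # placeholder
RESULT_BUCKET = "my-batch-results"                                                # placeholder

def submit_job(job_id, tasks):
    """Job Manager: split a job into tasks and push them into the SQS input queue."""
    for i, task in enumerate(tasks):
        sqs.send_message(QueueUrl=INPUT_QUEUE_URL,
                         MessageBody=json.dumps({"job_id": job_id, "part": i, "task": task}))

def worker_loop(process_task):
    """Worker node: pull one task at a time, process it, store the interim result in S3."""
    while True:
        resp = sqs.receive_message(QueueUrl=INPUT_QUEUE_URL,
                                   MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            task = json.loads(msg["Body"])
            result = process_task(task)                       # the actual batch step
            s3.put_object(Bucket=RESULT_BUCKET,
                          Key=f"{task['job_id']}/part-{task['part']}.json",
                          Body=json.dumps(result).encode())
            sqs.delete_message(QueueUrl=INPUT_QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])   # done, remove from queue
```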

Friday, 16 December 2016

Design content and media sharing system

Media sharing is one of the hottest markets on the internet. Customers have a staggering appetite for placing photos and videos on social networking sites and sharing them in custom online photo/video albums.

The growing popularity of media sharing means scaling problems for site owners, who face ever-increasing storage and bandwidth requirements and increased go-to-market pressure to deliver faster than the competition.

In this entry we will discuss an example of a highly available, durable and cost-effective media sharing and processing platform built on Amazon Web Services.



Sharing content starts with uploading media files to the online service. In this configuration, an Elastic Load Balancer distributes incoming network traffic to the upload servers, a dynamic fleet of Amazon Elastic Compute Cloud (Amazon EC2) instances. Amazon CloudWatch monitors these servers and Auto Scaling manages them, automatically scaling EC2 capacity up or down based on load. In the above diagram, a separate endpoint to receive media uploads was created in order to off-load this task from the website's servers.

Original uploaded files are stored in Amazon Simple Storage Service (Amazon S3), a highly available and durable storage service.

To submit a new file to be processed, the upload web servers push a message into an Amazon Simple Queue Service (Amazon SQS) queue. This queue acts as a communication pipeline between the file reception and file processing components.

The processing pipeline is a dedicated group of Amazon EC2 instances used to execute any kind of post-processing task on the uploaded files (e.g. video transcoding, image resizing). Auto Scaling manages this group to automatically adjust the needed capacity. We can use Spot Instances to dynamically extend the capacity of the group and significantly reduce the cost of file processing.

Once the processing is completed, Amazon S3 stores the output files. The original files remain stored there with high durability.

Media-related metadata can be stored in a relational database such as Amazon Relational Database Service (Amazon RDS) or in a key-value store such as Amazon SimpleDB.

A third fleet of EC2 instances is dedicated to hosting the website front-end of the media sharing service.

Media files are distributed from Amazon S3 to end users via Amazon CloudFront, which offers low-latency delivery through a worldwide network of edge locations.
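
A minimal sketch of the upload path using boto3: the upload server hands the client a pre-signed S3 URL for the original file and then enqueues a processing task for the pipeline. The bucket and queue names are placeholders, and a real service would also authenticate the caller.

```python
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
UPLOAD_BUCKET = "media-originals"                                                   # placeholder
PROCESSING_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/media-tasks"   # placeholder

def create_upload(media_id, filename):
    """Return the S3 key and a pre-signed URL the client can PUT the original file to."""
    key = f"uploads/{media_id}/{filename}"
    url = s3.generate_presigned_url("put_object",
                                    Params={"Bucket": UPLOAD_BUCKET, "Key": key},
                                    ExpiresIn=3600)
    return key, url

def enqueue_processing(media_id, key):
    """Tell the processing fleet (transcoders, resizers, ...) about the new file."""
    sqs.send_message(QueueUrl=PROCESSING_QUEUE,
                     MessageBody=json.dumps({"media_id": media_id, "s3_key": key}))
```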

Saturday, 10 December 2016

Design a highly available and fault-tolerant system

Most of the higher-level services provided by a cloud provider like Amazon (Amazon Simple Storage Service - S3, Amazon SimpleDB, Amazon Simple Queue Service - SQS, Amazon Elastic Load Balancing - ELB, etc.) have been built with fault tolerance and high availability in mind. Services that provide basic infrastructure, such as Amazon Elastic Compute Cloud (EC2) and Amazon Elastic Block Store (EBS), provide specific features, such as Availability Zones, Elastic IP addresses and snapshots, that a fault-tolerant and highly available system must take advantage of and use correctly. Just moving into the cloud does not make a system fault-tolerant or highly available.

In this entry we will discuss a design of fault tolerant and high available system.

Design Overview:
Load balancing is an effective way to increase the availability of a system. Instances that fail can be replaced seamlessly behind the load balancer while other instances continue to operate. Elastic Load Balancing can be used to balance traffic across instances in multiple Availability Zones of a region.

Amazon EC2 is hosted in multiple locations worldwide. The locations are composed of regions and Availability Zones. Each region is a separate geographical area, and each region has multiple, isolated locations known as Availability Zones. Amazon EC2 gives you the ability to place resources, such as instances and data, in multiple locations. Resources are not replicated across regions unless you do so explicitly.

Availability Zones (AZs) are distinct locations that are engineered to be insulated from failures in other Availability Zones. By placing Amazon EC2 instances in multiple Availability Zones, an application can be protected from failure at a single location. It is important to run independent application stacks in more than one AZ, either in the same region or in another region, so that if the application fails in one zone it continues running in another. When you design such a system, you need to understand the zone dependencies.


Elastic IP addresses are public IP addresses that can be programmatically remapped between instances within a region. They are associated with the AWS account, not with a specific instance or the lifetime of an instance.

An Elastic IP can be used to work around a host or Availability Zone failure by quickly remapping the address to another running instance, or to a replacement instance that has just started. Reserved Instances can help guarantee that such capacity is available in another zone.
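
A minimal sketch of that remapping using boto3; the allocation and instance IDs are hypothetical, and in practice the call would be triggered by a health check or monitoring alarm rather than run by hand.

```python
import boto3

ec2 = boto3.client("ec2")

def fail_over(allocation_id, standby_instance_id):
    """Remap the Elastic IP to a healthy standby instance in another AZ."""
    ec2.associate_address(AllocationId=allocation_id,       # the Elastic IP
                          InstanceId=standby_instance_id,   # the replacement instance
                          AllowReassociation=True)          # take it over from the failed host

# fail_over("eipalloc-0123456789abcdef0", "i-0fedcba9876543210")   # hypothetical IDs
```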

Valuable data should never be stored only on instance storage without proper backups, replication or the ability to recreate the data. Amazon Elastic Block Store (EBS) offers persistent off-instance storage volumes that are about an order of magnitude more durable than on-instance storage. EBS volumes are automatically replicated within a single Availability Zone. To increase durability further, point-in-time snapshots can be created to store volume data in Amazon S3, which is then replicated to multiple AZs. While an EBS volume is tied to a specific AZ, a snapshot is tied to a region, so you can use a snapshot to create a new EBS volume in any AZ of the same region. This is an effective way to deal with disk failures or other host-level issues, as well as with problems affecting an entire AZ. Snapshots are incremental, so it's best to keep recent snapshots.
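
A minimal sketch of snapshot-based recovery using boto3; the volume ID and target AZ are placeholders, and a real setup would wait for the snapshot to complete before restoring from it.

```python
import boto3

ec2 = boto3.client("ec2")

def snapshot_volume(volume_id):
    """Take a point-in-time backup of an EBS volume; the snapshot is stored in S3."""
    snap = ec2.create_snapshot(VolumeId=volume_id, Description="nightly backup")
    return snap["SnapshotId"]

def restore_in_other_az(snapshot_id, target_az):
    """Recreate the volume from the snapshot in any AZ of the same region."""
    vol = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=target_az)
    return vol["VolumeId"]

# restore_in_other_az(snapshot_volume("vol-0123456789abcdef0"), "us-east-1b")  # hypothetical IDs
```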

That's all about our highly available and fault-tolerant system. Keep reading!!