Codingslover Programming Blog, Tutorials, jQuery, Ajax, PHP, MySQL

Wednesday 3 October 2012

How Google Works

Published on: Wednesday, October 03, 2012 by Unknown - 1 comment

If you aren’t interested in learning how Google creates the index and the database of documents that it accesses when processing a query, skip this description. I adapted the following overview from Chris Sherman and Gary Price’s wonderful description of How Search Engines Work in Chapter 2 of The Invisible Web (CyberAge Books, 2001).
Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing. Google has three distinct parts:

Googlebot, a web crawler that finds and fetches web pages.
The indexer that sorts every word on every page and stores the resulting index of words in a huge database.
The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.

Let’s take a closer look at each part.

1. Googlebot, Google’s Web Crawler

Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.
Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.
Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web.
Screen shot of web page for adding a URL to Google.

Screen shot of web page for adding a URL to Google.

Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with millions of URLs pointing to commercial propaganda. Google rejects those URLs submitted through its Add URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky redirects, creating doorways, domains, or sub-domains with substantially similar content, sending automated queries to Google, and linking to bad neighbors. So now the Add URL form also has a test: it displays some squiggly letters designed to fool automated “letter-guessers”; it asks you to enter the letters you see — something like an eye-chart test to stop spambots.
When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page in the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.
Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of “visit soon” URLs must be constantly examined and compared with URLs already in Google’s index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Googlebot must determine how often to revisit a page. On the one hand, it’s a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results.
To keep the index current, Google continuously recrawls popular frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls keep an index current and are known as fresh crawls. Newspaper pages are downloaded daily, pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably current.

2. Google’s Indexer

Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.
To improve search performance, Google ignores (doesn’t index) common words called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple spaces, as well as converting all letters to lowercase, to improve Google’s performance.

3. Google’s Query Processor

The query processor has several parts, including the user interface (search box), the “engine” that evaluates queries and matches them to relevant documents, and the results formatter.
PageRank is Google’s system for ranking web pages. A page with a higher PageRank is deemed more important and is more likely to be listed above a page with a lower PageRank.
Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. A patent application discusses other factors that Google considers when ranking a page. Visit SEOmoz.org’s report for an interpretation of the concepts and the practical applications contained in Google’s patent application.
Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate relevance; they’re tweaked to improve quality and performance, and to outwit the latest devious techniques used by spammers.
Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives more priority to pages that have search terms near each other and in the same order as the query. Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, and in links to the page, options offered by Google’s Advanced Search Form and Using Search Operators (Advanced Operators).
Let’s see how Google processes a query.

$1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book--it tells which pages contain the words that match any particular query term. 2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result. 3. The search results are returned to the user in a fraction of a second.$

For more information on how Google works, take a look at the following articles.

Google’s page on Google’s Technology, www.google.com/technology/.
How does Google collect and rank results?, www.google.com/newsletter/librarian/librarian_2005_12/article1.html.
Google’s PageRank Algorithm and How it Works, www.iprcom.com/papers/pagerank/
Google’s PageRank Explained and How to Make the Most of It, www.webworkshop.net/pagerank.html

tags (keywords): crawling, google, PageRank, queries, results, spider, stop words, technology, URLs

Sunday 30 September 2012

How web was developed?

Published on: Sunday, September 30, 2012 by Unknown - 14 comments

Web Development:

Web development is a broad term for the work involved in developing a web site for the Internet (World Wide Web) or an intranet (a private network). This can include web design, web content development, client liaison, client-side/server-side scripting, web server and network security configuration, and e-commerce development. However, among web professionals, "web development" usually refers to the main non-design aspects of building web sites: writing markup and coding. Web development can range from developing the simplest static single page of plain text to the most complex web-based internet applications, electronic businesses, or social network services.

For larger organizations and businesses, web development teams can consist of hundreds of people (web developers). Smaller organizations may only require a single permanent or contractingwebmaster, or secondary assignment to related job positions such as a graphic designer and/or information systems technician. Web development may be a collaborative effort between departments rather than the domain of a designated department.

Web development as an industry

Since the mid-1990s, web development has been one of the fastest growing industries in the world. In 1995 there were fewer than 1,000 web development companies in the United States, but by 2005 there were over 30,000 such companies^{[citation needed]} in the U.S. alone. The growth of this industry is being pushed by large businesses wishing to sell products and services to their customers and to automate business workflow.

Typical Areas

Web Development can be split into many areas and a typical and basic web development hierarchy might consist of:

[.]Client Side Coding

Ajax Asynchronous JavaScript provides new methods of using JavaScript, and other languages to improve the user experience.
Flash Adobe Flash Player is an ubiquitous browser plugin ready for RIAs. Flex 2 is also deployed to the Flash Player (version 9+).
JavaScript JavaScript is a ubiquitous client side platform for creating and delivering rich Web applications that can also run across a wide variety of devices. It is a dialect of the scripting language ECMAScript.
jQuery Cross-browser JavaScript library designed to simplify and speed up the client-side scripting of HTML.
Microsoft Silverlight Microsoft's browser plugin that enables animation, vector graphics and high-definition video playback, programmed using XAML and .NET programming languages.
HTML5 and CSS3 Latest HTML proposed standard combined with the latest proposed standard for CSS natively supports much of the client-side functionality provided by other frameworks such as Flash and Silverlight

Looking at these items from an "umbrella approach", client side coding such as XHTML is executed and stored on a local client (in a web browser) whereas server side code is not available to a client and is executed on a web server which generates the appropriate XHTML which is then sent to the client. The nature of client side coding allows you to alter the HTML on a local client and refresh the pages with updated content (locally), web designers must bear in mind the importance and relevance to security with their server side scripts. If a server side script accepts content from a locally modified client side script, the web development of that page is poorly sanitized with relation to security.

[edit]Server Side Coding

ASP (Microsoft proprietary)
CSP, Server-Side ANSI C
ColdFusion (Adobe proprietary, formerly Macromedia, formerly Allaire)
CGI
Groovy (programming language) Grails (framework)
Java, e.g. Java EE or WebObjects
Lotus Domino
Node.js
Perl, e.g. Catalyst, Dancer (all open source)
PHP (open source)
Python, e.g. Django (web framework) (open source)
Real Studio Web Edition
Ruby, e.g. Ruby on Rails (open source)
Smalltalk e.g. Seaside, AIDA/Web
SSJS Server-Side JavaScript, e.g. Aptana Jaxer, Mozilla Rhino
WebDNA (WSC proprietary)
Websphere (IBM proprietary)
.NET and .NET MVC Frameworks (Microsoft proprietary)

The World Wide Web has become a major delivery platform for web development a variety of complex and sophisticated enterprise applications in several domains. In addition to their inherent multifaceted functionality, these web applications exhibit complex behavior and place some unique demands on their usability, performance, security and ability to grow and evolve. However, a vast majority of these applications continue to be developed in an ad-hoc way, contributing to problems of usability, maintainability, quality and reliability.(1)(2) While web development can benefit from established practices from other related disciplines, it has certain distinguishing characteristics that demand special considerations. In recent years of web development there have been some developments towards addressing these problems and requirements. As an emerging discipline, web engineering actively promotes systematic, disciplined and quantifiable approaches towards successful development of high-quality, ubiquitously usable web-based systems and applications.(3)(4) In particular, web engineering focuses on the methodologies, techniques and tools that are the foundation of web application development and which support their design, development, evolution, and evaluation. Web application development has certain characteristics that make it different from traditional software, information system, or computer application development.

Web engineering is multidisciplinary and encompasses contributions from diverse areas: systems analysis and design, software engineering, hypermedia/hypertext engineering, requirements engineering, human-computer interaction, user interface, information engineering, information indexing and retrieval, testing, modelling and simulation, project management, and graphic design and presentation. Web engineering is neither a clone, nor a subset of software engineering, although both involve programming and software development. While web engineering uses software engineering principles, web development encompasses new approaches, methodologies, tools, techniques, and guidelines to meet the unique requirements for web-based applications.

Client Side + Server Side

Google Web Toolkit provides tools to create and maintain complex JavaScript front-end applications in Java.
Dart provides tools to create and maintain complex JavaScript front-end applications as well as supporting server-side code in Dart (programming language).
Opa is a high-level language in which both the client and the server parts are implemented. The compiler then decides which parts run on the client (and are translated automatically toJavaScript) and which parts run on the server. The developer can tune those decisions with simple directives. (open source)
Pyjamas is a tool and framework for developing Ajax applications and Rich Internet Applications in python.
Tersus is a platform for the development of rich web applications by visually defining user interface, client side behavior and server side processing. (open source)

However languages like Ruby and Python are often paired with database servers other than MySQL (the M in LAMP). Below are example of other databases currently in wide use on the web. For instance some developers prefer a LAPR(Linux/Apache/PostgreSQL/Ruby on Rails) setup for development.

Codingslover.com

Wednesday 3 October 2012

How Google Works

1. Googlebot, Google’s Web Crawler

2. Google’s Indexer

3. Google’s Query Processor

Sunday 30 September 2012

How web was developed?

Web Development:

[.]Client Side Coding

[edit]Server Side Coding

Client Side + Server Side

[.]Database Technology

Recent Posts

Popular Posts

Blog Archive

Blogger news

Categories

Profile

Followers

Blogroll

Category