Tuesday, May 27, 2008

Better Performance - LEFT JOIN or NOT IN?

Which T-SQL construct gives better performance when writing a query: LEFT JOIN or NOT IN? The answer is: it depends! It all depends on the data, the kind of query, and so on. Just for fun, pick one option, LEFT JOIN or NOT IN, before you test. If you need queries that demonstrate the two approaches, review the following two.

USE AdventureWorks;
GO
SELECT ProductID
FROM Production.Product
WHERE ProductID
NOT IN (
SELECT ProductID
FROM Production.WorkOrder);
GO
SELECT p.ProductID
FROM Production.Product p
LEFT JOIN Production.WorkOrder w ON p.ProductID = w.ProductID
WHERE w.ProductID IS NULL;
GO
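One rough way to see which form wins for your own data (a sketch only; the exact numbers depend on indexes, row counts, and the plan chosen) is to compare the I/O and timing statistics reported for the two queries:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO
-- Run both versions and compare the logical reads and CPU time
-- reported in the Messages tab.
SELECT ProductID
FROM Production.Product
WHERE ProductID NOT IN (SELECT ProductID FROM Production.WorkOrder);
GO
SELECT p.ProductID
FROM Production.Product p
LEFT JOIN Production.WorkOrder w ON p.ProductID = w.ProductID
WHERE w.ProductID IS NULL;
GO
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO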


Reference : Pinal Dave (http://www.SQLAuthority.com)


SQL SERVER Database Coding Standards and Guidelines Complete List Download

Just like my previous series, SQL Server Interview Questions and Answers Complete List Download, I have received many comments and emails regarding this series. Once I go through all the emails and comments, I will summarize them and integrate them into the series. I have also received emails asking me to create a PDF for download. I have created that as well. Please feel free to download it and use it.

Please ask me any questions you might have. Contact me if you are interested in writing a mini-series with me.

Download SQL SERVER Database Coding Standards and Guidelines Complete List

Complete Series of Database Coding Standards and Guidelines
SQL SERVER Database Coding Standards and Guidelines - Introduction
SQL SERVER - Database Coding Standards and Guidelines - Part 1
SQL SERVER - Database Coding Standards and Guidelines - Part 2
SQL SERVER Database Coding Standards and Guidelines Complete List Download


Other popular Series
SQL Server Interview Questions and Answers Complete List Download
SQL SERVER - Data Warehousing Interview Questions and Answers Complete List Download

Taken from : http://blog.sqlauthority.com


Delete Duplicate Records - Rows - Readers Contribution

This works in 2000. WARNING: According to MS, SET ROWCOUNT will not work with INSERT, DELETE, and UPDATE in later versions.

-- Create table with a number of values between zero and nine
select a+b+c as val
into dbo.rmtemp
from (select 0 a union all select 1 union all select 2 union all select 3) a
, (select 0 b union all select 1 union all select 2 union all select 3) b
, (select 0 c union all select 1 union all select 2 union all select 3) c

-- Show what you've got
select val,count(*) row_count from dbo.rmtemp group by val

-- Limit processing to a single row
set rowcount 1
-- While you've got duplicates, delete a row
while (select top 1 val from dbo.rmtemp group by val having count(*) > 1) is not null
delete from dbo.rmtemp where val in (select top 1 val from dbo.rmtemp group by val having count(*) > 1);
-- Remove single row processing limit
set rowcount 0

-- Confirm that only uniques remain
select val,count(*) row_count from dbo.rmtemp group by val

-- Clean up
drop table dbo.rmtemp
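On SQL Server 2005 and later, ROW_NUMBER() gives an alternative that does not rely on SET ROWCOUNT at all. A sketch against the same dbo.rmtemp table: number the rows within each group of duplicates and delete everything past the first one.

-- Keep one row per val, delete the rest
WITH numbered AS
(
    SELECT val,
           ROW_NUMBER() OVER (PARTITION BY val ORDER BY val) AS rn
    FROM dbo.rmtemp
)
DELETE FROM numbered
WHERE rn > 1;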


Reference : Pinal Dave (http://www.SQLAuthority.com)


Four Basic SQL Statements - SQL Operations

There are four basic SQL Operations or SQL Statements.

SELECT - This statement selects data from database tables.

UPDATE - This statement updates existing data in database tables.

INSERT - This statement inserts new data into database tables.

DELETE - This statement deletes existing data from database tables.
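Minimal forms of the four statements look like this (the table and column names below are made up purely for illustration):

SELECT FirstName, LastName FROM dbo.Employee WHERE DepartmentID = 5;

UPDATE dbo.Employee SET DepartmentID = 7 WHERE EmployeeID = 42;

INSERT INTO dbo.Employee (FirstName, LastName, DepartmentID)
VALUES ('Ana', 'Silva', 5);

DELETE FROM dbo.Employee WHERE EmployeeID = 42;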

If you want the complete syntax for these four basic statements, please download the FAQ (PDF) from SQL SERVER - Download FAQ Sheet - SQL Server in One Page

Reference : Pinal Dave (http://www.SQLAuthority.com)


COMMON SQL PROBLEMS

COMMON SQL PROBLEMS - 1

NOT USING AN INDEX:

• In most cases, an index should be used for each table in the query. Generally, when an index isn’t used the entire table is scanned. This is bad :>(
• Know what tables are indexed and how. Understand how/when indexes are used. [More slides on this.]
• Understand how certain predicate constructions prevent use of an index. [More slides on this.]
• Use ‘showplan’ to confirm expectations about index usage.
• If an apparently obvious index was not used…understand why.

COMMON SQL PROBLEMS - 2
• Avoid joining too many tables.
• Much depends on the indexes used and the efficiency of those indexes.
• Max is about 6-7 tables on IRF2, but only if joined properly on narrow clustered index keys.
• Avoid joining more than 3 really big tables (> 10^6 rows)
• Avoid excessively complex predicates
• It is easy to write predicates that prevent proper index usage. Avoid this. [More slides on this.] (See the sketch after this list.)
• Avoid combining two or more special predicate statements like ‘GROUP BY’ with an aggregate and/or ‘SORT’ and/or ‘COMPUTE BY’ and/or ‘HAVING’ et cetera.
• Avoid multiple OR operators in predicates.
• Avoid more than one subquery in a predicate.
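As an illustration of the kind of predicate construction the slides warn about (the dbo.Orders table and its OrderDate index here are hypothetical), wrapping the indexed column in a function typically forces a scan, while an equivalent range predicate can use the index:

-- Likely prevents an index on OrderDate from being used (function applied to the column):
SELECT OrderID FROM dbo.Orders WHERE YEAR(OrderDate) = 1997;

-- Equivalent range predicate that can use the index:
SELECT OrderID FROM dbo.Orders
WHERE OrderDate >= '19970101' AND OrderDate < '19980101';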

COMMON SQL PROBLEMS - 3
• Some properly formed queries legitimately ask the RDBMS server to do a lot of work and may take time to execute.
• Know the size of the objects in the query and try to understand how much work is being requested.
• Start queries off more simply with fewer tables and/or more simple or more restrictive predicates to develop a performance baseline.

COMMON SQL PROBLEMS - 4
• When using cursors, acquire a large stock of garlic, crucifixes and wooden stakes.
• Use cursors ONLY when absolutely necessary. There are always unpredictable performance consequences to the use of cursors.
• Never use cursors when ‘set SQL’ will suffice. (See the sketch after this list.)
• If cursors must be used, be attentive to transaction blocking issues.
• Some cursor operations require specific types of indexes to support them (like unique or clustered).
• Keep it simple…never use cursors.
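As an example of replacing a cursor with ‘set SQL’ (a sketch; dbo.Orders and its Status column are hypothetical), the two blocks below do the same work, and the single set-based UPDATE is almost always the better choice:

-- Cursor version: one UPDATE per row
DECLARE @id INT;
DECLARE c CURSOR FOR SELECT OrderID FROM dbo.Orders WHERE Status = 'OPEN';
OPEN c;
FETCH NEXT FROM c INTO @id;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Orders SET Status = 'CLOSED' WHERE OrderID = @id;
    FETCH NEXT FROM c INTO @id;
END
CLOSE c;
DEALLOCATE c;

-- Set-based version: one statement does it all
UPDATE dbo.Orders SET Status = 'CLOSED' WHERE Status = 'OPEN';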

By
Don Madden & Jerome Roa
NSIT/CSI


Ordering in SQL

Ordering Guarantees in SQL Server 2005

SQL is a declarative language that returns multi-sets (sets allowing duplicates) of rows. The operations exposed in SQL, such as join, filter, group by, and project, do not inherently have any ordering guarantees. ANSI SQL does expose an ORDER BY clause for the top-most SELECT scope in a query, and this can be used to return rows in a presentation order to the user (or to a cursor so that you can iterate over the results of the query). This is the only operation in ANSI SQL that actually guarantees row order.



Microsoft SQL Server provides additional ordering guarantees beyond ANSI, mostly for backwards-compatibility with previous releases of the product. For example, variable assignment in the top-most SELECT list when an ORDER BY is specified is done in the presentation order of the query.

Example:

SELECT @a = @a + col FROM Table ORDER BY col2;



SQL Server also contains a non-ANSI operator called TOP. TOP allows you to limit a result set to a certain number or percentage of the result set. If an ORDER BY is used in the same scope, it qualifies rows based on the ORDER BY. So, TOP 1 … ORDER BY col will return the “first” row from that result set based on the order by list. However, SQL Server does not guarantee that the rows will be returned in that order from the intermediate result set. It only guarantees which rows actually qualify. You’d need to put an ORDER BY at the top of the query to guarantee the output order returned to the client. (In a previous blog entry, I noted how SQL 2005 actually doesn’t bother processing TOP 100 PERCENT … ORDER BY since it is “meaningless” under this definition).
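A small sketch of the distinction (table T and column col are hypothetical): inside a derived table, the ORDER BY only decides which rows qualify for TOP; only the outermost ORDER BY guarantees the order returned to the client.

-- The inner ORDER BY decides WHICH 10 rows qualify,
-- but not the order in which the outer query returns them.
SELECT col
FROM (SELECT TOP (10) col FROM T ORDER BY col) AS t10;

-- Repeating the ORDER BY at the top level adds the presentation-order guarantee.
SELECT col
FROM (SELECT TOP (10) col FROM T ORDER BY col) AS t10
ORDER BY col;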



Other operations in SQL Server also have this “which rows qualify” semantic. ROW_NUMBER, RANK, DENSE_RANK, and NTILE contain an OVER clause in which an ORDER BY can be specified. This ORDER BY guarantees the output of the operation, but not the order in which the rows are output. So, the following is valid output for the ROW_NUMBER query – the function outputs row values as if it had been evaluated in a specific order.



SELECT col2, ROW_NUMBER() OVER (ORDER BY col1) FROM T

Col2   Col1   row_num
1      1      1
3      3      3
2      2      2



Now, in practice, we don’t currently generate a plan that would return rows in this order for this specific query. However, different query plans for the same query can return rows in different order, so it is important to understand what is guaranteed and what is not. When you build an assumption in your SQL application during development about one query plan, then deploy it and a customer with their own data in the database gets a different plan, you can very quickly learn about this dependency the hard way. (A more plausible possibility for this example is that we could assign the row numbers backwards if we knew the exact rowcount from a previously completed operation in the SQL query plan).



Essentially, the SQL Server Query Optimizer is guaranteeing that the internal operator in the query tree will process its input in a particular order. There is no corresponding guarantee that the output of that operator will imply that the next operator in the query tree is performed in that order. The reordering rules can and will violate this assumption (and do so when it is inconvenient to you, the developer ;). Please understand that when we reorder operations to find a more efficient plan, we can cause the ordering behavior to change for intermediate nodes in the tree. If you’ve put an operation in the tree that assumes a particular intermediate ordering, it can break.


by Conor
(Taken from http://blogs.msdn.com/queryoptteam/)


How to Read Statistics Profile in SQL

In SQL Server, “Statistics Profile” is a mode in which a query is run where you can see the number of invocations of each physical plan operator in the tree. Instead of running a query and just printing the output rows, this mode also collects and returns per-operator row counts. Statistics Profile is used by the SQL Query Optimizer Team to identify issues with a plan which can cause the plan to perform poorly. For example, it can help identify a poor index choice or poor join order in a plan. Oftentimes, it can help identify the needed solution, such as updating statistics (as in the histograms and other statistical information used during plan generation) or perhaps adding a plan hint. This document describes how to read the statistics profile information from a query plan so that you can also debug plan issues.

A simple example query demonstrates how to retrieve the statistics profile output from a query:

use nwind
set statistics profile on
select * from customers c inner join orders o on c.customerid = o.customerid;

The profile output has a number of columns and is a bit tricky to print in a regular document. The key pieces of information that it prints are the plan, which looks like this:

StmtText
------------------------------------------------------------------------------------------------------
select * from customers c inner join orders o on c.customerid = o.customerid
|--Hash Match(Inner Join, HASH:([c].[CustomerID])=([o].[CustomerID]),
|--Clustered Index Scan(OBJECT:([nwind].[dbo].[Customers].[aaaaa_PrimaryKey] AS [c]))
|--Clustered Index Scan(OBJECT:([nwind].[dbo].[Orders].[aaaaa_PrimaryKey] AS [o]))

Other pieces of useful information are the estimated row count and the actual row count for each operator and the estimated and actual number of invocations of this operator. Note that the actual rows and # of executions are physically listed as early columns, while the other columns are listed later in the output column list (so you typically have to scroll over to see them).

Rows Executes
-------------------- --------------------
1078 1
1078 1
91 1
1078 1

EstimateRows
------------------------
1051.5834
1051.5834
91.0
1078.0

EstimateExecutions
------------------------
NULL
1.0
1.0
1.0

Other fields, such as the estimated cost of the subtree, the output columns, and the average row size, also exist in the output (but are omitted for space in this document).

Note: The output from statistics profile is typically easiest to read if you set the output from your client (Query Analyzer or SQL Server Management Studio) to output to text, using a fixed-width font. You can then see the columns pretty easily and you can even move the results into a tool like Excel if you want to cluster the estimates and actual results near each other by rearranging the columns (SSMS also can let you do this).

There are a few key pieces of information needed to understand the output from Statistics profile. First, query plans are generally represented as trees. In the output, children are printed below their parents and are indented:
StmtText
------------------------------------------------------------------------------------------------------
select * from customers c inner join orders o on c.customerid = o.customerid
|--Hash Match(Inner Join, HASH:([c].[CustomerID])=([o].[CustomerID]),
|--Clustered Index Scan(OBJECT:([nwind].[dbo].[Customers].[aaaaa_PrimaryKey] AS [c]))
|--Clustered Index Scan(OBJECT:([nwind].[dbo].[Orders].[aaaaa_PrimaryKey] AS [o]))

In this example, the scan of Customers and Orders are both below a hash join operator. Next, the “first” child of the operator is listed first. So, the Customers Scan is the first child of the hash join. Subsequent operators follow, in order. Finally, query execution plans are executed in this “first” to “last” order. So, for this plan, the very first row is returned from the Customers table. (A more detailed discussion of operators will happen later in the article). Notice that the output from the graphical showplan is similar but slightly different:

(see attachment tree.jpg).

In this tree representation, both Scans are printed to the right on the screen, and the first child is above the other operators. In most Computer Science classes, trees are printed in a manner transposed from this:

          Hash Join
         /         \
  Customers       Orders

The transposition makes printing trees in text easier. In this classical view, the tree is evaluated left to right with rows flowing bottom to top.

With an understanding of the nuances of the query plan display, it is possible to understand what happens during the execution of this query plan. The query returns 1078 rows. Not coincidentally, there are also 1078 orders in this database. Since there’s a Foreign Key relationship between Orders and Customers, it requires that a match exist for each order to each customer. So, the 91 rows in Customers match the 1078 rows in Orders to return the result.

The query estimates that the join will return 1051.5834 rows. First, this is a bit less than the actual (1078) but is not a substantial difference. Given that the Query Optimizer is making educated guesses based on sampled statistical information that may itself be out-of-date, this estimate is actually pretty good. Second, the number is not an integer because we use floating point for our estimates to improve accuracy on estimates we make. For this query, the number of executions is 1 for both the estimate and actual. This won’t always be the case, but it happens to be true for this query because of the way hash joins work. In a hash join, the first child is scanned and a hash table is built. Once the hash join is built, the second child is then scanned and each row probes the hash table to see if there is a matching row.

Loops join does not work this way, as we’ll see in a slightly modified example.

select * from customers c with (index=1) inner loop join orders o with (index=1) on c.customerid = o.customerid

In this example, I’ve forced a loop join and the use of clustered indexes for each table. The plan now looks like this:

StmtText
-----------------------------------------------------------------
select * from customers c with (index=1) inner loop join orders o
|--Nested Loops(Inner Join, WHERE:([nwind].[dbo].[Orders].[Cust
|--Clustered Index Scan(OBJECT:([nwind].[dbo].[Customers].
|--Table Spool
|--Clustered Index Scan(OBJECT:([nwind].[dbo].[Orders

Beyond the different join algorithm, you’ll notice that there is now a table spool added to the plan. The spool is on the second child (also called the “inner” child for loops join because it is usually invoked multiple times). The spool scans rows from its child and can store them for future invocations of the inner child. The actual row count and execution count from the statistics profile is a bit different from the previous plan:

Rows Executes
-------------------- --------------------
1078 1
1078 1 <--- Loop Join
91 1 <--- Scan of Customers
98098 91 <--- Spool
1078 1 <--- Scan of Orders

In this plan, the second child of the loop join is scanned 91 times returning a total number of 98098 rows. For the actual executions, the total number of rows is the sum over all invocations of that operator, so it is 91*1078=98098. This means that the inner side of this tree is scanned 91 times. Nested Loops joins require rescans of the inner subtree (Hash Joins do not, as you saw in the first example). Note that the spool causes only one scan of the Orders table, and it only has one execution as a result. It isn’t hard to see that there are far more rows touched in this plan compared to the hash join, and thus it shouldn’t be a huge surprise that this plan runs more slowly.

Note: When comparing the estimated vs. actual number of rows, it is important to remember that the actual counts need to be divided by the actual number of executions to get a value that is comparable to the estimated number of rows returned. The estimate is the per-invocation estimate.

As a more complicated example, we can try something with a few more operators and see how things work on one of the TPC-H benchmark queries (Query 8, for those who are interested, on a small-scale 100MB database):

SELECT O_YEAR,
SUM(CASE WHEN NATION = 'MOROCCO'
THEN VOLUME
ELSE 0
END) / SUM(VOLUME) AS MKT_SHARE
FROM ( SELECT datepart(yy,O_ORDERDATE) AS O_YEAR,
L_EXTENDEDPRICE * (1-L_DISCOUNT) AS VOLUME,
N2.N_NAME AS NATION
FROM PART,
SUPPLIER,
LINEITEM,
ORDERS,
CUSTOMER,
NATION N1,
NATION N2,
REGION
WHERE P_PARTKEY = L_PARTKEY AND
S_SUPPKEY = L_SUPPKEY AND
L_ORDERKEY = O_ORDERKEY AND
O_CUSTKEY = C_CUSTKEY AND
C_NATIONKEY = N1.N_NATIONKEY AND
N1.N_REGIONKEY = R_REGIONKEY AND
R_NAME = 'AFRICA' AND
S_NATIONKEY = N2.N_NATIONKEY AND
O_ORDERDATE BETWEEN '1995-01-01' AND '1996-12-31' AND
P_TYPE = 'PROMO BURNISHED NICKEL' AND
L_SHIPDATE >= CONVERT(datetime,(1156)*(30),121) AND L_SHIPDATE < CONVERT(datetime,((1185)+(1))*(30),121)
) AS ALL_NATIONS
GROUP BY O_YEAR
ORDER BY O_YEAR

As the queries get more complex, it gets harder to print them in a standard page of text. So, I’ve truncated the plan somewhat in this example. Notice that the same tree format still exists, and the main operators in this query are Scans, Seeks, Hash Joins, Stream Aggregates, a Sort, and a Loop Join. I’ve also included the actual number of rows and actual number of executions columns as well.

Rows Executes Plan
0 0 Compute Scalar(DEFINE:([Expr1028]=[Expr1026]/[Expr1027]))
2 1 |--Stream Aggregate(GROUP BY:([Expr1024]) DEFINE:([Expr1026]=SUM([par
2 1 |--Nested Loops(Inner Join, OUTER REFERENCES:([N1].[N_REGIONKEY]
10 1 |--Stream Aggregate(GROUP BY:([Expr1024], [N1].[N_REGIONKEY
1160 1 | |--Sort(ORDER BY:([Expr1024] ASC, [N1].[N_REGIONKEY] A
1160 1 | |--Hash Match(Inner Join, HASH:([N2].[N_NATIONKEY
25 1 | |--Clustered Index Scan(OBJECT:([tpch100M].[
1160 1 | |--Hash Match(Inner Join, HASH:([N1].[N_NATI
25 1 | |--Index Scan(OBJECT:([tpch100M].[dbo].
1160 1 | |--Hash Match(Inner Join, HASH:([tpch10
1000 1 | |--Index Scan(OBJECT:([tpch100M].[
1160 1 | |--Hash Match(Inner Join, HASH:([t
1160 1 | |--Hash Match(Inner Join, HAS
1432 1 | | |--Hash Match(Inner Join
126 1 | | | |--Clustered Index
0 0 | | | |--Compute Scalar(D
224618 1 | | | |--Clustered I
0 0 | | |--Compute Scalar(DEFINE
45624 1 | | |--Clustered Index
15000 1 | |--Index Scan(OBJECT:([tpch10
2 10 |--Clustered Index Seek(OBJECT:([tpch100M].[dbo].[REGION].[

I’ll point out a few details about the statistics profile output. Notice that the Compute Scalars (also called Projects) return zero for both columns. Since Compute Scalar always returns exactly as many rows as it is given from its child, there isn’t any logic to count rows again in this operator simply for performance reasons. The zeros can be safely ignored, and the values for its child can be used instead. Another interesting detail can be seen in the last operator in this printout (the Seek into the Region table). In this operator, there are 10 executions but only 2 rows returned. This means that even though there were 10 attempts to find rows in this index, only two rows were ever found. The parent operator (the Nested Loops near the top) has 10 rows coming from its first (left) child and only 2 rows output by the operator, which matches what you see in the seek. Another interesting tidbit can be found if you look at the estimates for the Seek operator:

Est.# rows Est. #executes
1.0 20.106487

The SQL Server Query Optimizer will estimate a minimum of one row coming out of a seek operator. This is done to avoid the case when a very expensive subtree is picked due to a cardinality underestimation. If the subtree is estimated to return zero rows, many plans cost about the same and there can be errors in plan selection as a result. So, you’ll notice that the estimation is “high” for this case, and some errors could result. You also might notice that we estimate 20 executions of this branch instead of the actual 10. However, given the number of joins that have been evaluated before this operator, being off by a factor of 2 (10 rows) isn’t considered to be too bad. (Errors can increase exponentially with the number of joins).

SQL Server supports executing query plans in parallel. Parallelism can add complexity to the statistics profile output as there are different kinds of parallelism that have different impacts on the counters for each operator. Parallel Scans exist at the leaves of the tree, and these will count all rows from the table into each thread even though each thread only returns a fraction of the rows. The number of executions (the second column in the output) will also have 1 execution for each thread. So, it is typical to just divide the number of threads into the total number of rows to see how many rows were actually returned by the table. Parallel zones higher in the tree usually work the same way. These will have N (where N is the degree of parallelism) more executions than the equivalent non-parallel query. There are a few cases where we will broadcast one row to multiple threads. If you examine the type of the parallelism exchange operation, you can identify these cases and notice that one row becomes multiple rows through the counts in the statistics profile results.

The most common use of the statistics profile output is to identify areas where the Optimizer may be seeing and using incomplete or incorrect information. This is often the root cause of many performance problems in queries. If you can identify areas where the estimated and actual cardinality values are far apart, then you likely have found a reason why the Optimizer is not returning the “best” plan. The reasons for the estimate being incorrect can vary, but it can include missing or out-of-date statistics, too low of a sample rate on those statistics, correlations between data columns, or use of operators outside of the optimizer’s statistical model, to name a few common cases.

by QueryOptTeam
Taken From : http://blogs.msdn.com/queryoptteam/


TIPS : an Intro to CHAID Analysis

CHAID Analysis : Cereal Case*

Below, we give step-by-step tips and tricks for performing a CHAID analysis with SPSS software.

First, from the SPSS menu, click:
Analyze → Classify → Tree

Then, put ‘Preferred Breakfast’ into the Dependent Variable column, and the other variables into Independent Variables.

Click Ok.

Then click Define Variable Properties in the Classification Tree window.
Select all target categories and move them to the Exclude column. Click Continue.

In the Classification Tree: Output dialog, check ‘Tree’ (submenu), ‘Top Down’ (in Orientation), ‘Tables & Chart’ (in Node Contents), ‘Automatic’ (in Scale), ‘Independent variables statistics’, and ‘Node Definition’, then click Continue.

In the Statistics submenu, check all items in the Model & Node Performance submenu, then click Continue.

We get the results below:

Model Summary

Specifications
  Growing Method                    CHAID
  Dependent Variable                Preferred breakfast
  Independent Variables             Age category, Gender, Marital status, Lifestyle
  Validation                        NONE
  Maximum Tree Depth                3
  Minimum Cases in Parent Node      100
  Minimum Cases in Child Node       50
Results
  Independent Variables Included    Age category, Marital status, Lifestyle
  Number of Nodes                   13
  Number of Terminal Nodes          8
  Depth                             2


Misclassification Costs

                  Predicted
Observed          Breakfast Bar   Oatmeal   Cereal
Breakfast Bar     .000            1.000     1.000
Oatmeal           1.000           .000      1.000
Cereal            1.000           1.000     .000
Dependent Variable: Preferred breakfast


For Cereal, look at the large chart, i.e. nodes 5, 8, and 9, with N = 3, 12, and 31 and Sum = 139; see the Classification table below.

Classification

                      Predicted
Observed              Breakfast Bar   Oatmeal   Cereal   Percent Correct
Breakfast Bar         112             34        85       48.5%
Oatmeal               13              251       46       81.0%
Cereal                84              116       139      41.0%
Overall Percentage    23.8%           45.6%     30.7%    57.0%
Growing Method: CHAID
Dependent Variable: Preferred breakfast

(*) Data Cereal taken from SPSS


Tips: Minitab Tutorials

All versions of Minitab Statistical Software include step-by-step tutorials accessible through the Help menu.
Minitab also provides the following additional free tutorials:
Meet Minitab
Meet Minitab is a concise guide to help you quickly get started using Minitab. It shows you how to:
• Manage and manipulate data and files
• Produce graphs
• Analyze data and assess quality
• Design an experiment
• Generate reports
• Customize your Minitab software
• Access the Help files provided with Minitab
Meet Minitab also includes a thorough and easy-to-use reference section.
You can download a free electronic copy of Meet Minitab (PDF) or order a hard copy (book version) of the guide. Meet Minitab is also available to download in other languages.
Accessing the Power of Minitab
Tips and tricks to help you quickly harness the power of Minitab Statistical Software and save time.
Answers Knowledgebase
Search our Answers Knowledgebase for detailed information on performing statistical operations, as well as answers to hundreds of the most frequently asked technical support questions.
For articles about using Minitab for basic and applied statistics, visit our Help With Statistics page.

Taken from : http://www.minitab.com/resources/tutorials/


C++ Programming (Part 1)

C++ is a general-purpose, platform-neutral programming language that supports object-oriented programming and other useful programming paradigms, including procedural programming, object-based programming, generic programming, and functional programming.
C++ is viewed as a superset of C, and thus offers backward compatibility with this language. This reliance on C provides important benefits:
• Reuse of legacy C code in new C++ programs
• Efficiency
• Platform neutrality
• Relatively quick migration from C to C++
Yet it also incurs certain complexities and ailments such as manual memory management, pointers, unchecked array bounds, and a cryptic declarator syntax, as described in the following sections.
As opposed to many other programming languages, C++ doesn't have versions. Rather, it has an International ANSI/ISO Standard, ratified in 1998, that defines the core language, its standard libraries, and implementation requirements. The C++ Standard is treated as a skeleton on which vendors might add their own platform-specific extensions, mostly by means of code libraries. However, it's possible to develop large-scale applications using pure standard C++, thereby ensuring code portability and easier maintenance.
The primary reason for selecting C++ is its support of object-oriented programming. Yet even as a procedural programming language, C++ is considered an improvement over ANSI C in several aspects. C programmers who prefer for various reasons not to switch to object-oriented programming can still benefit from the migration to C++ because of its tighter type-safety, strongly typed pointers, improved memory management, and many other features that make a programmer's life easier. Let's look at some of these improvements more closely:
• Improved memory management. In C, you have to call library functions to allocate storage dynamically and release it afterwards. But C++ treats dynamically allocated objects as first-class citizens: it uses the keywords new and delete to allocate and deallocate objects dynamically.
• User-defined types are treated as built-in types. For example, a struct or a union's name can be used directly in declarations and definitions just as a built-in type:

struct Date
{
 int day;
 int month;
 int year;
};

Date d; //In C, 'struct' is required before Date
void func(Date *pdate); //ditto
• Pass-by-reference. C has two types of argument passing: by address and by value. C++ defines a third argument-passing mechanism: passing by reference. When you pass an argument by reference, the callee gets an alias of the original object and can modify it. In fact, references are rather similar to pointers in their semantics; they're efficient because the callee doesn't get a copy of the original variable, but rather a handle that's bound to the original object. Syntactically, however, references look like variables that are passed by value. Here's an example:

Date date;
void func(Date &date_ref); //func takes Date by reference
func(date); // date is passed by reference, not by value
• Default argument values. C++ allows you to declare functions that take default argument values. When the function call doesn't provide such an argument, the compiler automatically inserts its respective default value into the function call. For example:

void authorize(const string & username, bool log=true);
authorize(user); // equivalent to: authorize(user, true);
In the function call above, the programmer didn't provide the second argument. Because this argument has a default value, the compiler silently inserted true as a second argument.
• Mandatory function prototypes. In classic C, functions could be called without being previously declared. In C++, you must either declare or define a function before calling it. This way, the compiler can check the type and number of each argument. Without mandatory prototypes, passing arguments by reference or using default argument values wouldn't be possible because the compiler must replace the arguments with their references or add the default values as specified in the prototype.
A well-formed C++ program must contain a main() function and a pair of matching braces:
int main()
{}
Though perfectly valid, this program doesn't really do anything. To get a taste of C++, let's look at a more famous example:
#include <iostream>
int main()
{
 std::cout<<"hello world!"<<std::endl;
}
If you're a C programmer with little or no prior experience in C++, the code snippet above might shock you. It's entirely different from the equivalent C program:
#include <stdio.h>
int main()
{
printf("hello world!\n");
}
Let's take a closer look at the C++ example. The first line is a preprocessor directive that #includes the standard <iostream> header in the program's source file:
#include <iostream>
<iostream> contains the declarations and definitions of the standard C++ I/O routines and classes. Taking after C, the creators of C++ decided to implement I/O support by means of a code library rather than a built-in keyword. The following line contains the main() function. The program consists of a single line:
std::cout<<"hello world!"<<std::endl;
Let's parse it:
• std::cout is the qualified name of the standard output stream object, cout. This object is automatically created whenever you #include <iostream>.
• The overloaded insertion operator << comes next. It takes an argument and passes it on to cout. In this case, the argument is a literal string that we want to print on the screen.
• The second argument, std::endl, is a manipulator that appends the newline character after the string and forces buffer flushing.
As trivial as this program seems, it exposes some of the most important features of C++. For starters, C++ is an object-oriented language. Therefore, it uses objects rather than functions to perform I/O. Secondly, the standard libraries of C++, including <iostream>, are declared in the namespace std (std stands for standard; an exhaustive discussion about namespaces is available in the "Namespaces" section). Thus, instead of declaring standard functions, classes, and objects globally, C++ declares them in a dedicated namespace, thereby reducing the chances of clashing with user code and third-party libraries that happen to use the same names.


Path Analysis

Overview
Path analysis is an extension of the regression model, used to test the fit of the correlation matrix against two or more causal models which are being compared by the researcher. The model is usually depicted in a circle-and-arrow figure in which single arrows indicate causation. A regression is done for each variable in the model as a dependent on others which the model indicates are causes. The regression weights predicted by the model are compared with the observed correlation matrix for the variables, and a goodness-of-fit statistic is calculated. The best-fitting of two or more models is selected by the researcher as the best model for advancement of theory.
Path analysis requires the usual assumptions of regression. It is particularly sensitive to model specification because failure to include relevant causal variables or inclusion of extraneous variables often substantially affects the path coefficients, which are used to assess the relative importance of various direct and indirect causal paths to the dependent variable. Such interpretations should be undertaken in the context of comparing alternative models, after assessing their goodness of fit discussed in the section on structural equation modeling (SEM packages are commonly used today for path analysis in lieu of stand-alone path analysis programs). When the variables in the model are latent variables measured by multiple observed indicators, path analysis is termed structural equation modeling, treated separately. We follow the conventional terminology by which path analysis refers to single-indicator variables.
Key Concepts and Terms
Note that path estimates may be calculated by OLS regression or by MLE maximum likelihood estimation, depending on the computer package. Two-Stage Least Squares (2SLS), discussed separately, is another path estimation procedure designed to extend the OLS regression model to situations where non-recursivity is introduced because the researcher must assume the covariances of some disturbance terms are not 0 (this assumption is discussed below). Click here for a separate discussion.
• Path model. A path model is a diagram relating independent, intermediary, and dependent variables. Single arrows indicate causation between exogenous or intermediary variables and the dependent(s). Arrows also connect the error terms with their respective endogenous variables. Double arrows indicate correlation between pairs of exogenous variables. Sometimes the arrows in the path model are drawn with widths proportional to the absolute magnitude of the corresponding path coefficients (see below).
• Causal paths to a given variable include (1) the direct paths from arrows leading to it, and (2) correlated paths from endogenous variables correlated with others which have arrows leading to the given variable. Consider this model:

This model has correlated exogenous variables A, B, and C, and endogenous variables D and E. Error terms are not shown. The causal paths relevant to variable D are the paths from A to D, from B to D, and the paths reflecting common anteceding causes -- the paths from B to A to D, from C to A to D, and from C to B to D. Paths involving two correlations (C to B to A to D) are not relevant. Likewise, paths that go backward (E to B to D, or E to B to A to D) reflect common effects and are not relevant.
• Exogenous and endogenous variables. Exogenous variables in a path model are those with no explicit causes (no arrows going to them, other than the measurement error term). If exogenous variables are correlated, this is indicated by a double-headed arrow connecting them. Endogenous variables, then, are those which do have incoming arrows. Endogenous variables include intervening causal variables and dependents. Intervening endogenous variables have both incoming and outgoing causal arrows in the path diagram. The dependent variable(s) have only incoming arrows.
• Path coefficient/path weight. A path coefficient is a standardized regression coefficient (beta) showing the direct effect of an independent variable on a dependent variable in the path model. Thus when the model has two or more causal variables, path coefficients are partial regression coefficients which measure the extent of effect of one variable on another in the path model controlling for other prior variables, using standardized data or a correlation matrix as input. Recall that for bivariate regression, the beta weight (the b coefficient for standardized data) is the same as the correlation coefficient, so for the case of a path model with a variable as a dependent of a single exogenous variable (and an error residual term), the path coefficient in this special case is a zero-order correlation coefficient.
Consider this model, based on Bryman, A. and D. Cramer (1990). Quantitative data analysis for social scientists, pp. 246-251.

This model is specified by the following path equations:
Equation 1. satisfaction = b11*age + b12*autonomy + b13*income + e1
Equation 2. income = b21*age + b22*autonomy + e2
Equation 3. autonomy = b31*age + e3
where the b's are the regression coefficients and their subscripts are the equation number and variable number (thus b21 is the coefficient in Equation 2 for variable 1, which is age).
Note: In each equation, only (and all of) the direct priors of the endogenous variable being used as the dependent are considered. The path coefficients, which are the betas in these equations, are thus the standardized partial regression coefficients of each endogenous variable on its priors. That is, the beta for any path (that is, the path coefficient) is a partial weight controlling for other priors for the given dependent variable.
Formerly called p coefficients, now path coefficients are called simply beta weights, based on usage in multiple regression models. Bryman and Cramer computed the path coefficients = standardized regression coefficients = beta weights, to be:

Correlated Exogenous Variables. If exogenous variables are correlated, it is common to label the corresponding double-headed arrow between them with its correlation coefficient.
Disturbance terms. The residual error terms, also called disturbance terms, reflect unexplained variance (the effect of unmeasured variables) plus measurement error. Note that the dependent in each equation is an endogenous variable (in this case, all variables except age, which is exogenous). Note also that the independents in each equation are all the variables with arrows to the dependent.
The effect size of the disturbance term for a given endogenous variable, which reflects unmeasured variables, is (1 - R2), and its variance is (1 - R2) times the variance of that endogenous variable, where R2 is based on the regression in which it is the dependent and those variables with arrows to it are independents. The path coefficient for the disturbance term is SQRT(1 - R2).
The correlation between two disturbance terms is the partial correlation of the two endogenous variables, using as controls all their common causes (all variables with arrows to both). The covariance estimate is the partial covariance: the partial correlation times the product of the standard deviations of the two endogenous variables.
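Stated compactly (this simply restates the two sentences above; here e denotes the disturbance term of an endogenous variable Y, p_e its path coefficient, and R^2 comes from the regression of Y on its direct priors):

p_e = \sqrt{1 - R^2}, \qquad \operatorname{Var}(e) = (1 - R^2)\,\operatorname{Var}(Y)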
• Path multiplication rule: The value of any compound path is the product of its path coefficients. Imagine a simple three-variable compound path where education causes income causes conservatism. Let the regression coefficient of income on education be 1000: for each year of education, income goes up $1,000. Let the regression coefficient of conservatism on income be .0002: for every dollar income goes up, conservatism goes up .0002 points on a 5-point scale. Thus if education goes up 1 year, income goes up $1,000, which means conservatism goes up .2 points. This is the same as multiplying the coefficients: 1000*.0002 = .2. The same principle would apply if there were more links in the path. If standardized path coefficients (beta weights) were used, the path multiplication rule would still apply, but the interpretation is in standardized terms. Either way, the product of the coefficients along the path reflects the weight of that path.
• Effect decomposition. Path coefficients may be used to decompose correlations in the model into direct and indirect effects, corresponding, of course, to direct and indirect paths reflected in the arrows in the model. This is based on the rule that in a linear system, the total causal effect of variable i on variable j is the sum of the values of all the paths from i to j. Considering "satisfaction" as the dependent in the model above, and considering "age" as the independent, the indirect effects are calculated by multiplying the path coefficients for each path from age to satisfaction:
age -> income -> satisfaction is .57*.47 = .26
age -> autonomy -> satisfaction is .28*.58 = .16
age -> autonomy -> income -> satisfaction is .28*.22*.47 = .03
total indirect effect = .45
That is, the total indirect effect of age on satisfaction is plus .45. In comparison, the direct effect is only minus .08. The total causal effect of age on satisfaction is (-.08 + .45) = .37.
Effect decomposition is equivalent to effects analysis in regression with one dependent variable. Path analysis, however, can also handle effect decomposition for the case of two or more dependent variables.
In general, any bivariate correlation may be decomposed into spurious and total causal effects, and the total causal effect can be decomposed into a direct and an indirect effect. The total causal effect is the coefficient in a regression with all of the model's prior but not intervening variables for x and y controlled (the beta coefficient for the usual standardized solution, the partial b coefficient for the unstandardized or raw solution). The spurious effect is the total effect minus the total causal effect. The direct effect is the partial coefficient (beta for standardized, b for unstandardized) for y on x controlling for all prior variables and all intervening variables in the model. The indirect effect is the total causal effect minus the direct effect, and measures the effect of the intervening variables. Where effects analysis in regression may use a variety of coefficients (partial correlation or regression, for instance), effect decomposition in path analysis is restricted to use of regression.
For instance, imagine a five-variable model in which the exogenous variable Education is correlated with the exogenous variable Skill Level, and both Education and Skill Level are correlated with the exogenous variable Job Status. Further imagine that Education and each of the other two exogenous variables are modeled to be direct causes of Income and also of Median House Value, which are the two dependent variables. We might then decompose the correlation of Education and Income:
1. Direct effect of Education on Income, indicated by the path coefficient of the single-headed arrow from Education to Income.
2. Indirect effect due to Education's correlation with Skill Level, and Skill Level's direct effect on Income, indicated by multiplying the correlation of Education and Skill Level by the path coefficient from Skill Level to Income.
3. Indirect effect due to Education's correlation with Job Status, and Job Status's direct effect on Income, indicated by multiplying the correlation of Education and Job Status by the path coefficient from Job Status to Income.
As a second example, decomposition for the same five-variable model is a bit more complex if we wish to break down the correlation of the two dependent variables, Income and Median House Value. Since here, somewhat implausibly, the model specifies no direct effect from Income to House Value, the true correlation is hypothesized to be zero and the entire observed correlation is spurious.
4. The spurious direct effect of Education as a common anteceding variable directly causing both dependents, indicated by multiplying the path coefficient from Education to Income by the path coefficient of Education to House Value.
5. The spurious direct effect of Skill Level as a common anteceding variable directly causing both dependents, indicated by multiplying the path coefficient from Skill Level to Income by the path coefficient of Skill Level to House Value.
6. The spurious direct effect of Job Status as a common anteceding variable directly causing both dependents, indicated by multiplying the path coefficient from Job Status to Income by the path coefficient of Job Status to House Value.
7. The spurious indirect effect of Education and Skill Level as common anteceding variables directly causing both dependents, indicated by multiplying the path coefficient from Education to Income by the correlation of Education and Skill Level by the path from Skill Level to House Value, and adding the product of the path from Skill Level to Income by the correlation of Education and Skill Level by the path from Education to Median House Value.
8. The spurious indirect effect of Education and Job Status as common anteceding variables directly causing both dependents, indicated by multiplying the path coefficient from Education to Income by the correlation of Education and Job Status by the path from Job Status to House Value, and adding the product of the path from Job Status to Income by the correlation of Education and Job Status by the path from Education to Median House Value.
9. The spurious indirect effect of Skill Level and Job Status as common anteceding variables directly causing both dependents, indicated by multiplying the path coefficient from Skill Level to Income by the correlation of Skill Level and Job Status by the path from Job Status to House Value, and adding the product of the path from Job Status to Income by the correlation of Skill Level and Job Status by the path from Skill Level to Median House Value.
10. The residual effect is the difference between the correlation of Income and Median House Value and the sum of the spurious direct and indirect effects.
Correlated exogenous variables. The path weights connecting correlated exogenous variables are equal to the Pearson correlations. When calculating indirect paths, not only direct arrows but also the double-headed arrows connecting correlated exogenous variables, are used in tracing possible indirect paths, except:
Tracing rule: An indirect path cannot enter and exit on an arrowhead. This means that you cannot have a direct path composed of the paths of two correlated exogenous variables.
• Significance and Goodness of Fit in Path Models
o To test individual path coefficients one uses the standard t or F test from regression output.
o To test the model with all its paths one uses a goodness of fit test from a structural equation modeling program. If a model is correctly specified, including all relevant and excluding all irrelevant variables, with arrows correctly indicated, then the sum of path values from i to j will equal the regression coefficient for j predicted on the basis of i. That is, for standardized data, where the bivariate regression coefficient equals the correlation coefficient, the sum of path coefficients (standardized) will equal the correlation coefficient. This means one can compare the path-estimated correlation matrix with the observed correlation matrix to assess the goodness-of-fit of path models. As a practical matter, goodness-of-fit is calculated by entering the model and its data into a structural equation modeling program such as LISREL or AMOS, which compute a variety of alternative goodness-of-fit coefficients, discussed separately.
o To modify the path model one uses modification indexes (MI) to add arrows and uses nonsignificance of path coefficients to drop arrows, in a model-building and model-trimming process discussed in the section on structural equation modeling.

Assumptions
• Linearity: relationships among variables are linear (though, of course, variables may be nonlinear transforms).
• Additivity: there are no interaction effects (though, of course, variables may be interaction crossproduct terms).
• Interval level data for all variables, if regression is being used to estimate path parameters. As in other forms of regression modeling, it is common to use dichotomies and ordinal data in practice. If dummy variables are used to code a categorical variable, one must be careful that they are represented as a block in the path diagram (ex., if an arrow is drawn to one dummy it must be drawn to all others in the set). If an arrow were to be drawn from one dummy variable to another dummy variable in the same set, this would violate the recursivity assumption discussed below.
• Residual (unmeasured) variables are uncorrelated with any of the variables in the model other than the one they cause.
• Disturbance terms are uncorrelated with endogenous variables. As a corollary of the previous assumption, path analysis assumes that for any endogenous variable, its disturbance term is uncorrelated with any other endogenous variable in the model. This is a critical assumption, violation of which may make regression inappropriate as a method of estimating path parameters. This assumption may be violated due to measurement error in measuring an endogenous variable; when an endogenous variable is actually a direct or indirect cause of a variable which the model states is the cause of that endogenous variable (reverse causation); or when a variable not in the model is a cause of an endogenous variable and a variable the model specifies as a cause of that endogenous variable (spurious causation).
• Low multicollinearity (otherwise one will have large standard errors of the b coefficients used in removing the common variance in partial correlation analysis).
• Identification: the model must not be underidentified or underdetermined. For underidentified models there are too few structural equations to solve for the unknowns. Overidentification usually provides better estimates of the underlying true values than does just identification.
o Recursivity: all arrows flow one way, with no feedback looping. Also, it is assumed that disturbance (residual error) terms for the endogenous variables are uncorrelated. Recursive models are never underidentified.
• Proper specification of the model is required for interpretation of path coefficients. Specification error occurs when a significant causal variable is left out of the model. The path coefficients will reflect the shared covariance with such unmeasured variables and will not be accurately interpretable in terms of direct and indirect effects. In particular, if a variable specified as prior to a given variable is really consequent to it, "we can do ourselves considerable damage" (Davis, 1985: 64) because if a variable is consequent it would be estimated to have no path effect, whereas when it is included as a prior variable in the model, this erroneously changes the coefficients for other variables in the model. Note, however, that while interpretation of path coefficients is inaccurate under specification error, it is still possible to compare the relative fit of two models, perhaps both with specification error.
• Appropriate correlation input. When using a correlation matrix as input, it is appropriate to use Pearsonian correlation for two interval variables, polychoric correlation for two ordinals, tetrachoric for two dichotomies, polyserial for an interval and an ordinal, and biserial for an interval and a dichotomy.
• Adequate sample size is needed to assess significance. Kline (1998) recommends 10 times as many cases as parameters (or ideally 20 times). He states that 5 times or less is insufficient for significance testing of model effects.
• The same sample is required for all regressions used to calculate the path model. This may require reducing the data set down so that there are no missing values for any of the variables included in the model. This might be achieved by listwise dropping of cases or by data imputation.

Taken from : http://www2.chass.ncsu.edu/garson/pa765/path.htm


Monday, May 26, 2008

Central Limit Theorem

The central limit theorem (CLT) states that the sum of a large number of independent and identically-distributed random variables will be approximately normally distributed (i.e., following a Gaussian distribution, or bell-shaped curve) if the random variables have a finite variance. Formally, a central limit theorem is any of a set of weak-convergence results in probability theory. They all express the fact that any sum of many independent identically distributed random variables will tend to be distributed according to a particular "attractor distribution".

Since many real populations yield distributions with finite variance, this explains the prevalence of the normal probability distribution. For other generalizations for finite variance which do not require identical distributions, see the Lindeberg condition and the Lyapunov condition, as well as the work of Gnedenko and Kolmogorov.

History

Tijms (2004, p.169) writes:

“The central limit theorem has an interesting history. The first version of this theorem was postulated by the French-born mathematician Abraham de Moivre, who, in a remarkable article published in 1733, used the normal distribution to approximate the distribution of the number of heads resulting from many tosses of a fair coin. This finding was far ahead of its time, and was nearly forgotten until the famous French mathematician Pierre-Simon Laplace rescued it from obscurity in his monumental work Théorie Analytique des Probabilités, which was published in 1812. Laplace expanded De Moivre's finding by approximating the binomial distribution with the normal distribution. But as with De Moivre, Laplace's finding received little attention in his own time. It was not until the nineteenth century was at an end that the importance of the central limit theorem was discerned, when, in 1901, Russian mathematician Aleksandr Lyapunov defined it in general terms and proved precisely how it worked mathematically. Nowadays, the central limit theorem is considered to be the unofficial sovereign of probability theory.”

A thorough account of the theorem's history, detailing Laplace's foundational work, as well as Cauchy's, Bessel's and Poisson's contributions, is provided by Hald.[1] Two historic accounts, one covering the development from Laplace to Cauchy, the second the contributions by von Mises, Pólya, Lindeberg, Lévy, and Cramér during the 1920s, are given by Hans Fischer.[2] See Bernstein (1945) for a historical discussion focusing on the work of Pafnuty Chebyshev and his students Andrey Markov and Aleksandr Lyapunov that led to the first proofs of the C.L.T. in a general setting.

Classical central limit theorem

The central limit theorem is also known as the second fundamental theorem of probability. (The Law of large numbers is the first.) Let X1, X2, X3, ... be a set of n independent and identically distributed random variables having finite values of mean µ and variance σ² > 0. The central limit theorem states that as the sample size n increases, the distribution of the sample average approaches the normal distribution with a mean µ and variance σ²/n irrespective of the shape of the original distribution.

Let the sum of the random variables be Sn, given by

Sn = X1 + ... + Xn. Then, defining

Z_n = \frac{S_n - n \mu}{\sigma \sqrt{n}}\,,

the distribution of Zn converges towards the standard normal distribution N(0,1) as n approaches ∞ (this is convergence in distribution).[3] This means: if Φ(z) is the cumulative distribution function of N(0,1), then for every real number z, we have

\lim_{n \to \infty} \mbox{P}(Z_n \le z) = \Phi(z)\,,

or,

\lim_{n\rightarrow\infty}\mbox{P}\left(\frac{\overline{X}_n-\mu}{\sigma/ \sqrt{n}}\leq z\right)=\Phi(z)\,,

where

\overline{X}_n=S_n/n=(X_1+\cdots+X_n)/n\,

is the sample mean. [3] [4]
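As an informal numerical check of this statement (an addition, not part of the original article; it assumes NumPy is installed), one can draw many samples of size n from a non-normal distribution such as the exponential, standardize the sample means, and compare the empirical proportions P(Zn ≤ z) with the standard normal CDF Φ(z):

import math
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000          # sample size and number of repeated samples
mu, sigma = 1.0, 1.0             # mean and standard deviation of an Exponential(1) variable

# Sample means of n exponential draws, standardized as Z_n = (X-bar - mu) / (sigma / sqrt(n))
samples = rng.exponential(scale=1.0, size=(trials, n))
z = (samples.mean(axis=1) - mu) / (sigma / math.sqrt(n))

phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))   # standard normal CDF
for t in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(t, round(float((z <= t).mean()), 4), round(phi(t), 4))   # empirical vs Phi(t)

Even though the parent distribution is strongly skewed, the empirical proportions should already track Φ reasonably well at n = 50.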

Proof of the central limit theorem

For a theorem of such fundamental importance to statistics and applied probability, the central limit theorem has a remarkably simple proof using characteristic functions. It is similar to the proof of a (weak) law of large numbers. For any random variable, Y, with zero mean and unit variance (var(Y) = 1), the characteristic function of Y is, by Taylor's theorem,

\varphi_Y(t) = 1 - {t^2 \over 2} + o(t^2), \quad t \rightarrow 0

where o (t2 ) is "little o notation" for some function of t that goes to zero more rapidly than t2. Letting Yi be (Xi − μ)/σ, the standardized value of Xi, it is easy to see that the standardized mean of the observations X1, X2, ..., Xn is

Z_n = \frac{n\overline{X}_n-n\mu}{\sigma\sqrt{n}} = \sum_{i=1}^n {Y_i \over \sqrt{n}}.

By simple properties of characteristic functions, the characteristic function of Zn is

\left[\varphi_Y\left({t \over \sqrt{n}}\right)\right]^n = \left[ 1 - {t^2  \over 2n} + o\left({t^2 \over n}\right) \right]^n \, \rightarrow \, e^{-t^2/2}, \quad n \rightarrow \infty.

But, this limit is just the characteristic function of a standard normal distribution, N(0,1), and the central limit theorem follows from the Lévy continuity theorem, which confirms that the convergence of characteristic functions implies convergence in distribution.
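As an added numerical illustration of this limit (not from the original article; NumPy assumed), one can take a standardized non-normal variable, here a uniform variable on (−√3, √3), and watch [φ_Y(t/√n)]^n approach e^{−t²/2}:

import numpy as np

def phi_uniform(t):
    # Characteristic function of Uniform(-sqrt(3), sqrt(3)), a variable with mean 0 and variance 1:
    # phi(t) = sin(sqrt(3) t) / (sqrt(3) t), written via np.sinc (sinc(x) = sin(pi x) / (pi x))
    return np.sinc(np.sqrt(3.0) * np.asarray(t, dtype=float) / np.pi)

t = np.array([0.5, 1.0, 2.0])
target = np.exp(-t ** 2 / 2.0)                    # characteristic function of N(0, 1)
for n in (1, 10, 100, 1000):
    approx = phi_uniform(t / np.sqrt(n)) ** n     # [phi_Y(t / sqrt(n))]^n
    print(n, np.round(approx, 4), np.round(target, 4))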

Convergence to the limit

If the third central moment E((X1 − μ)3) exists and is finite, then the above convergence is uniform and the speed of convergence is at least of order 1/√n (see the Berry-Esséen theorem).
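To see this rate informally (an added sketch, not part of the article; NumPy assumed), one can measure the largest gap between the empirical distribution of Zn and Φ over a grid of points for several n; for the Bernoulli case below the gap shrinks roughly like 1/√n:

import math
import numpy as np

rng = np.random.default_rng(1)
phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))   # standard normal CDF
grid = np.linspace(-3.0, 3.0, 61)
trials = 200_000
mu, sigma = 0.5, 0.5             # mean and standard deviation of a Bernoulli(0.5) variable

for n in (4, 16, 64, 256):
    s = rng.binomial(n, 0.5, size=trials)                   # sums of n Bernoulli(0.5) draws
    z = (s - n * mu) / (sigma * math.sqrt(n))
    gap = max(abs(float((z <= t).mean()) - phi(t)) for t in grid)
    print(n, round(gap, 4), round(1.0 / math.sqrt(n), 4))   # observed gap vs 1/sqrt(n)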

The convergence to the normal distribution is monotonic, in the sense that the entropy of Zn increases monotonically to that of the normal distribution, as proven by Artstein, Ball, Barthe and Naor (2004).

Pictures of a distribution being "smoothed out" by summation, showing the original density and three subsequent summations obtained by convolution of density functions, are not reproduced here.

(See Illustration of the central limit theorem for further details on these images.)

A graphical illustration of the central limit theorem can be formed by plotting the means of random samples drawn from a population. Let Xn denote a single random draw and let An denote the mean of the first n draws:

An = (X1 + ... + Xn) / n.

Computing An for successive values of n gives

A1 = X1 / 1

A2 = (X1 + X2) / 2

A3 = (X1 + X2 + X3) / 3

and so on. For the CLT, it is common to plot the means up to a sample size of about 30. If we standardize An by setting Zn = (An − μ) / (σ / n½), we obtain the same variable Zn as above, and its distribution approaches the standard normal distribution; a short sketch of this construction follows.
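A minimal version of this construction (added here; not part of the original article, and it assumes NumPy), using a fair six-sided die as the population: it computes the running means A_n for one sequence of draws, then checks that many independent copies of the standardized mean Z_30 behave like a standard normal variable:

import math
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 3.5, math.sqrt(35.0 / 12.0)      # mean and standard deviation of a fair die roll

# One sequence of 30 draws and its running means A_1, ..., A_30
x = rng.integers(1, 7, size=30).astype(float)
A = np.cumsum(x) / np.arange(1, 31)          # A_n = (X_1 + ... + X_n) / n
print("A_1, A_5, A_30:", A[0], A[4], A[29])

# Many replicates of A_30, standardized: the spread should look standard normal
n, trials = 30, 100_000
means = rng.integers(1, 7, size=(trials, n)).mean(axis=1)
z = (means - mu) / (sigma / math.sqrt(n))
print("P(|Z_30| <= 1) =", round(float((np.abs(z) <= 1).mean()), 3), "(about 0.683 for N(0, 1))")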

The central limit theorem, as an approximation for a finite number of observations, is reasonable only near the peak of the normal distribution; a very large number of observations is needed before the approximation becomes accurate in the tails.

The central limit theorem applies in particular to sums of independent and identically distributed discrete random variables. A sum of discrete random variables is still a discrete random variable, so we are confronted with a sequence of discrete random variables whose cumulative distribution functions converge towards a cumulative distribution function corresponding to a continuous variable (namely that of the normal distribution). This means that if we build a histogram of realisations of the sum of n independent identical discrete variables, the curve that joins the centers of the upper faces of the rectangles forming the histogram converges toward a Gaussian curve as n approaches infinity. The binomial distribution article details such an application of the central limit theorem in the simple case of a discrete variable taking only two possible values.
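A small added sketch of this histogram argument (not from the original text; NumPy assumed), comparing the frequencies of a sum of n Bernoulli variables with the matching Gaussian density evaluated at integer points:

import math
import numpy as np

n, p = 100, 0.3
mu, var = n * p, n * p * (1 - p)             # mean and variance of the Binomial(n, p) sum

rng = np.random.default_rng(3)
s = rng.binomial(n, p, size=200_000)         # realisations of the sum of n Bernoulli(p) variables

# Empirical frequency of each value k versus the normal density at k
for k in (20, 25, 30, 35, 40):
    empirical = float((s == k).mean())
    gaussian = math.exp(-(k - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    print(k, round(empirical, 4), round(gaussian, 4))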

Relation to the law of large numbers

The law of large numbers and the central limit theorem are both partial solutions to a general problem: "What is the limiting behavior of Sn as n approaches infinity?" In mathematical analysis, asymptotic series are among the most popular tools employed to approach such questions.

Suppose we have an asymptotic expansion of f(n):

f(n)= a_1 \varphi_{1}(n)+a_2 \varphi_{2}(n)+O(\varphi_{3}(n)) \  (n \rightarrow \infty).

Dividing both parts by \varphi_{1}(n) and taking the limit produces a_1, the coefficient of the highest-order term in the expansion, which represents the rate at which f(n) changes in its leading term:

\lim_{n\to\infty}\frac{f(n)}{\varphi_{1}(n)}=a_1.

Informally, one can say: "f(n) grows approximately as  a_1 \varphi_{1}(n) ". Taking the difference between f(n) and its approximation and then dividing by the next term in the expansion, we arrive at a more refined statement about f(n):

\lim_{n\to\infty}\frac{f(n)-a_1 \varphi_{1}(n)}{\varphi_{2}(n)}=a_2

Here one can say that the difference between the function and its approximation grows approximately as  a_2 \varphi_{2}(n) . The idea is that dividing the function by appropriate normalizing functions, and looking at the limiting behavior of the result, can tell us much about the limiting behavior of the original function itself.
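As a concrete added example (not in the original text): if f(n) = n^2 + 3n, then taking \varphi_{1}(n)=n^2 and \varphi_{2}(n)=n gives \lim_{n\to\infty} f(n)/n^2 = 1 = a_1 and \lim_{n\to\infty} (f(n)-n^2)/n = 3 = a_2, so the expansion f(n) \approx n^2 + 3n is recovered term by term.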

Informally, something along these lines happens when the sum Sn of independent identically distributed random variables X1, X2, ... is studied in classical probability theory. Under certain regularity conditions, by the law of large numbers, \frac{S_n}{n} \rightarrow \mu, and by the central limit theorem, \frac{S_n-n\mu}{\sqrt{n}} \rightarrow \xi, where ξ is distributed as N(0,σ2). These provide the values of the first two constants in the informal expansion

S_n \approx \mu n+\xi \sqrt{n}.

It can be shown that if X1, X2, X3, ... are i.i.d. and  E(|X_1|^{\beta}) < \infty for some 1 \le \beta < 2, then  \frac{S_n-n\mu}{n^{1/\beta}} \to 0; hence  \sqrt{n} is the largest power of n which, when used as a normalizing function, yields a non-trivial (non-zero) limiting behavior. Interestingly, the law of the iterated logarithm tells us what happens "in between" the law of large numbers and the central limit theorem. Specifically, it says that the normalizing function  \sqrt{n\log\log n} , intermediate in size between the n of the law of large numbers and the  \sqrt{n} of the central limit theorem, provides a non-trivial limiting behavior.
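An added numerical sketch of these three normalizations (not in the original article; NumPy assumed): for a single long sequence of Uniform(0, 1) draws it divides S_n − nμ by n, by √(n log log n), and by √n; the first ratio drifts toward zero while the last stays on the scale of the standard deviation:

import math
import numpy as np

rng = np.random.default_rng(4)
mu = 0.5                                     # mean of a Uniform(0, 1) variable
x = rng.random(1_000_000)
centered = np.cumsum(x) - mu * np.arange(1, x.size + 1)   # S_n - n * mu for every n

for n in (100, 10_000, 1_000_000):
    d = float(centered[n - 1])
    print(n,
          round(d / n, 5),                                        # law-of-large-numbers scale
          round(d / math.sqrt(n * math.log(math.log(n))), 5),     # law-of-the-iterated-logarithm scale
          round(d / math.sqrt(n), 5))                             # central-limit-theorem scale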

Alternative statements of the theorem

Density functions

The density of the sum of two or more independent variables is the convolution of their densities (if these densities exist). Thus the central limit theorem can be interpreted as a statement about the properties of density functions under convolution: the convolution of a number of density functions tends to the normal density as the number of density functions increases without bound, under the conditions stated above.

Since the characteristic function of a convolution is the product of the characteristic functions of the densities involved, the central limit theorem has yet another restatement: the product of the characteristic functions of a number of density functions tends to the characteristic function of the normal density as the number of density functions increases without bound, under the conditions stated above.

An equivalent statement can be made about Fourier transforms, since the characteristic function is essentially a Fourier transform.
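As an added illustration of the convolution statement (not part of the original article; NumPy assumed), one can repeatedly convolve a discretized uniform density with itself and compare the peak of the result with the peak of the matching normal density:

import math
import numpy as np

dx = 0.001
grid = np.arange(0.0, 1.0, dx)
f = np.ones_like(grid)                   # discretized density of Uniform(0, 1)

density = f.copy()
for k in range(2, 9):                    # density of the sum of k independent uniforms
    density = np.convolve(density, f) * dx
    mean, var = k * 0.5, k / 12.0        # mean and variance of the sum of k uniforms
    peak_gaussian = 1.0 / math.sqrt(2.0 * math.pi * var)
    print(k, round(float(density.max()), 4), round(peak_gaussian, 4))

The two peak values approach each other as k grows, reflecting the convergence of the convolved densities to the normal density.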

Products of positive random variables

The central limit theorem tells us what to expect about the sum of independent random variables, but what about the product? The logarithm of a product is simply the sum of the logarithms of the factors, so the logarithm of a product of random variables that take only positive values tends to have a normal distribution, which makes the product itself have a log-normal distribution. Many physical quantities (especially mass or length, which are a matter of scale and cannot be negative) are the products of different random factors, so they follow a log-normal distribution.

Whereas the central limit theorem for sums of random variables requires the condition of finite variance, the corresponding theorem for products requires the corresponding condition that the density function be square-integrable (see Rempala 2002).
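A brief added sketch of the product statement (not from the original article; NumPy assumed): multiply many independent positive factors, take logarithms, standardize, and check that the result behaves like a standard normal sample:

import math
import numpy as np

rng = np.random.default_rng(5)
n, trials = 100, 50_000

# Products of n independent positive factors drawn from Uniform(0.5, 1.5)
factors = rng.uniform(0.5, 1.5, size=(trials, n))
log_product = np.log(factors).sum(axis=1)          # log of a product = sum of the logs

# Mean and standard deviation of log(Uniform(0.5, 1.5)), estimated numerically
u = rng.uniform(0.5, 1.5, size=2_000_000)
m, s = float(np.log(u).mean()), float(np.log(u).std())

z = (log_product - n * m) / (s * math.sqrt(n))
print("P(|Z| <= 1) =", round(float((np.abs(z) <= 1).mean()), 3), "(about 0.683 for N(0, 1))")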

Lyapunov condition

See also Lyapunov's central limit theorem.

Let Xn be a sequence of independent random variables defined on the same probability space. Assume that Xn has finite expected value μn and finite standard deviation σn. We define

s_n^2 = \sum_{i = 1}^n \sigma_i^2.

Assume that the third central moments

r_n^3 = \sum_{i = 1}^n \mbox{E}\left({\left| X_i - \mu_i \right|}^3 \right)

are finite for every n, and that

\lim_{n \to \infty} \frac{r_n}{s_n} = 0.

(This is the Lyapunov condition.) We again consider the sum S_n=X_1+\cdots+X_n; its expected value is m_n = \sum_{i=1}^{n}\mu_i and its standard deviation is s_n. If we standardize it by setting

Z_n = \frac{S_n - m_n}{s_n}

then the distribution of Zn converges to the standard normal distribution N(0,1).
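A small added check of this condition (not part of the original article; NumPy assumed), for independent but non-identically distributed variables X_i ~ Uniform(-√i, √i), whose moments are known in closed form; the ratio r_n / s_n shrinks toward zero (here on the order of n^(-1/6)), and the simulated standardized sum behaves like a standard normal variable:

import math
import numpy as np

# X_i ~ Uniform(-sqrt(i), sqrt(i)): mu_i = 0, sigma_i^2 = i / 3, E|X_i - mu_i|^3 = i^(3/2) / 4
def lyapunov_ratio(n):
    i = np.arange(1, n + 1, dtype=float)
    s_n = math.sqrt((i / 3.0).sum())
    r_n = ((i ** 1.5 / 4.0).sum()) ** (1.0 / 3.0)
    return r_n / s_n

for n in (10, 100, 1000, 10_000):
    print(n, round(lyapunov_ratio(n), 4))           # decreases toward 0

# Simulate Z_n = (S_n - m_n) / s_n (here m_n = 0) and check that it looks standard normal
rng = np.random.default_rng(6)
n, trials = 200, 20_000
i = np.arange(1, n + 1, dtype=float)
a = np.sqrt(i)
x = rng.uniform(-a, a, size=(trials, n))            # each row is one realisation X_1, ..., X_n
z = x.sum(axis=1) / math.sqrt((i / 3.0).sum())
print("P(|Z_n| <= 1) =", round(float((np.abs(z) <= 1).mean()), 3), "(about 0.683)")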

Lindeberg condition

In the same setting and with the same notation as above, we can replace the Lyapunov condition with the following weaker one (from Lindeberg in 1920). For every ε > 0

\lim_{n \to \infty} \frac{1}{s_n^2}\sum_{i = 1}^{n} \mbox{E}\left( \left( X_i - \mu_i \right)^2 : \left| X_i - \mu_i \right| > \varepsilon s_n \right) = 0

where E( U : V > c) is E( U 1{V > c}), i.e., the expectation of the random variable U 1{V > c} whose value is U if V > c and zero otherwise. Then the distribution of the standardized sum Zn converges towards the standard normal distribution N(0,1).
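An added sketch that evaluates the Lindeberg sum numerically for the same kind of non-identical uniform variables (not part of the original article; NumPy assumed), approximating each truncated expectation by Monte Carlo; the sum shrinks to zero as n grows, so the condition is satisfied:

import math
import numpy as np

rng = np.random.default_rng(7)
eps = 0.1

def lindeberg_sum(n, draws=20_000):
    # Monte Carlo estimate of (1 / s_n^2) * sum_i E((X_i - mu_i)^2 : |X_i - mu_i| > eps * s_n)
    # for X_i ~ Uniform(-sqrt(i), sqrt(i)), which have mean mu_i = 0.
    i = np.arange(1, n + 1, dtype=float)
    s_n2 = float((i / 3.0).sum())
    s_n = math.sqrt(s_n2)
    total = 0.0
    for a in np.sqrt(i):
        x = rng.uniform(-a, a, size=draws)
        total += float(np.where(np.abs(x) > eps * s_n, x ** 2, 0.0).mean())
    return total / s_n2

for n in (10, 50, 200, 1000):
    print(n, round(lindeberg_sum(n), 5))    # decreases toward 0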

Non-independent case

There are some theorems which treat the case of sums of non-independent variables, for instance the martingale central limit theorem and central limit theorems for mixing (weakly dependent) sequences.

Applications and examples

There are a number of useful and interesting examples arising from the central limit theorem. Below are brief outlines of two such examples; a large number of further CLT applications are presented as part of the SOCR CLT Activity.

  • The probability distribution for the total distance covered in a random walk (biased or unbiased) will tend toward a normal distribution.
  • Flipping a large number of coins will result in a normal distribution for the total number of heads (or, equivalently, the total number of tails); a simulation sketch of this example follows below.
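A brief added simulation of the coin-flipping example above (not from the original article; NumPy assumed), counting heads in many repetitions and comparing the spread with the normal approximation N(n/2, n/4):

import math
import numpy as np

rng = np.random.default_rng(8)
n, trials = 1000, 100_000
heads = rng.binomial(n, 0.5, size=trials)        # number of heads in n fair coin flips

mu, sigma = n / 2.0, math.sqrt(n) / 2.0          # normal approximation N(n/2, n/4)
z = (heads - mu) / sigma
for k in (1, 2, 3):
    print(f"P(|heads - {mu:.0f}| <= {k} sigma) =", round(float((np.abs(z) <= k).mean()), 4))
# Expect roughly 0.683, 0.954, 0.997, the familiar normal-distribution proportions.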

Signal processing

Signals can be smoothed by applying a Gaussian filter, which is just the convolution of a signal with an appropriately scaled Gaussian function. Due to the central limit theorem this smoothing can be approximated by several filter steps that can be computed much faster, like the simple moving average.

The central limit theorem implies that, to achieve a Gaussian of variance σ2, n filters with window variances \sigma_1^2,\dots,\sigma_n^2 satisfying \sigma^2 = \sigma_1^2+\dots+\sigma_n^2 must be applied.
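An added sketch of this decomposition (not part of the original article; NumPy assumed): repeatedly applying a short moving-average (box) filter yields an effective kernel close to a Gaussian whose variance is the sum of the individual window variances:

import math
import numpy as np

w, k = 5, 4                              # box-filter width and number of passes
box = np.ones(w) / w                     # moving-average kernel; its variance is (w^2 - 1) / 12

kernel = np.array([1.0])
for _ in range(k):
    kernel = np.convolve(kernel, box)    # effective kernel after k smoothing passes

var = k * (w ** 2 - 1) / 12.0            # variance of the matching Gaussian
x = np.arange(kernel.size) - (kernel.size - 1) / 2.0
gauss = np.exp(-x ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

print("max |kernel - gaussian| =", round(float(np.abs(kernel - gauss).max()), 5))
print("kernel weights sum to", round(float(kernel.sum()), 6))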

Notes

  1. ^ Anders Hald, A History of Mathematical Statistics from 1750 to 1930, Ch. 17.
  2. ^ Hans Fischer: (1) "The Central Limit Theorem from Laplace to Cauchy: Changes in Stochastic Objectives and in Analytical Methods"; (2) "The Central Limit Theorem in the Twenties".
  3. ^ a b For decades, a large sample size was taken to be n > 29; however, research since 1990 has indicated that larger samples, such as 100 or 250, might be needed if the population is skewed far from normal: the more skew, the larger the sample needed. Such conditions might be rare but are critical when they occur; computer animations are used to illustrate the cases. The n > 29 cutoff has allowed Student-t tables to be formatted within a limited number of pages; however, that sample size might be too small. See "Using graphics and simulation to teach statistical concepts" by Marasinghe et al. (below), and "Identification of Misconceptions in the Central Limit Theorem and Related Concepts and Evaluation of Computer Media as a Remedial Tool" by Chong Ho Yu and Dr. John T. Behrens, Arizona State University, and Spencer Anthony, Univ. of Oklahoma, Annual Meeting of the American Educational Research Association, presented April 19, 1995, paper revised Feb 12, 1997, webpage (accessed 2007-10-25): CWisdom-rtf.
  4. ^ Marasinghe, M., Meeker, W., Cook, D. & Shin, T.S.(1994 August), "Using graphics and simulation to teach statistical concepts", Paper presented at the Annual meeting of the American Statistician Association, Toronto, Canada.

References

  • Henk Tijms, Understanding Probability: Chance Rules in Everyday Life, Cambridge: Cambridge University Press, 2004.
  • S. Artstein, K. Ball, F. Barthe and A. Naor, "Solution of Shannon's Problem on the Monotonicity of Entropy", Journal of the American Mathematical Society 17, 975-982 (2004).
  • S. N. Bernstein, On the work of P. L. Chebyshev in Probability Theory, in Nauchnoe Nasledie P. L. Chebysheva. Vypusk Pervyi: Matematika [The Scientific Legacy of P. L. Chebyshev. First Part: Mathematics], edited by S. N. Bernstein, Academiya Nauk SSSR, Moscow-Leningrad, 1945, 174 pp.
  • G. Rempala and J. Wesolowski, "Asymptotics of products of sums and U-statistics", Electronic Communications in Probability, vol. 7, pp. 47-54, 2002.

Source : http://en.wikipedia.org/wiki/Central_limit_theorem

