Statistics Assignment

The purpose of this assignment is to give you an opportunity to demonstrate your skills in describing and analysing data using concepts and tools that we have developed in the course so far.

Below are instructions on how to collect a specified set of data and what to do with it. Your goal is to produce a **report** in MS Word discussing the data and submit this along with a single MS Excel workbook showing your workings. A suggested target range for the word count of the report is 700-1000 words.

I have prepared and attached an example Excel workbook which I will refer to below. Note: my Excel workbook is **not** a model answer. You may choose to use different visualisations and do not necessarily need all the computed statistics and charts I have included. It really depends on the features of the data you have, so you need to use your own judgement as to how best to present and describe the data. Besides, the primary output for this assignment is the report itself, not the workbook.

**Data collection**

Collect quantitative data on two variables from the Sustainable Development Report 2021 website.

- Go to https://dashboards.sdgindex.org/ and browse around the site to become familiar with its purpose and the information publicly available there.
- Go to the “Downloads” page and click on “Database EXCEL” to download the database of indicators used to assess countries’ progress towards the UN Sustainable Development Goals. You will be taking data from the **“SDR2021 Data” sheet** in the workbook. Starting from **column AR** of that sheet, there are columns of cross-country data for the SDG indicators, one row for each country. Note that from **row 195** the data are for regional blocs so these should be __excluded__ from the data you take.
- You have been assigned __two__ variables according to the last digit of your Student ID number. You can find the variables assigned to you in the attached file “Assigned variables.xlsx”. For example, my student ID number (a long long time ago in a galaxy not so far away) ended with 2 so I would be using variables “Poverty headcount ratio at $1.90/day (%)” and “Cereal yield (tonnes per hectare of harvested land)”. I have chosen pairs of variables that may potentially have a statistical relationship. If you wish, you are welcome to switch one of the variables with another one from the database that you are interested in investigating and you think is related to the variable you retain.

**S: My student number ends with 7. Therefore, my topics are**

7 | Corruption Perception Index (worst 0-100 best) | 16 | Population with access to clean fuels and technology for cooking (%) | 7 |

**Please refer to the “Assigned variables.xlsx” file for more information.**

- Look up your variables in the Data Explorer on the website or in the report from page 75 (some newer variables are not included on the website yet, it seems). The main thing you want to understand is what a given value of each of your variables means. E.g. I found that the “Poverty headcount ratio at $1.90/day (%)” is the estimated percentage of the population that is living under the poverty threshold of US$1.90 a day.
- Each indicator has a number of associated columns in the Database workbook: **the first column of the set has the data**; the others can be safely ignored. So for the indicators I use in my example Excel workbook, I took **data from columns QZ and GO (and only down to row 194)**. Using Excel’s Find tool is a quick way to find your data. Copy and paste the data you will use into your workbook. You should keep the country names alongside the data so that you can identify which observation is for which country.

Prepare your data.

- Construct __separate__ univariate data sets for analysis. There will probably be many countries where there is no estimate for the indicators you are looking at. **If there is no observation recorded, do not assume the observed value is zero**. In general, missing observations in data rarely mean they should be replaced with zeros. Also consider if it is appropriate to include observations that are recorded as zero. In my example Excel workbook I have retained observations of zero for CO₂ emissions because this suggests those countries are not exporting fossil fuels, while blank cells mean there is no observation. It is fine to have blank cells within your data ranges; Excel will usually ignore them (as long as they are truly blank).
- Construct a bivariate (__paired__) data set – i.e. for each country you should have an observation for both variables. You can see in my example Excel workbook how I use some Excel formulas and the Replace tool to blank out cells for countries where there is only an observation for one of the two variables. If you find that the number of countries that you have left in the bivariate data set is low, say fewer than 30, it might be best to go back to the Database and replace the variable that is causing many countries to be dropped.
- In your report you should note any difficulties with the data preparation and the implications of dropping countries from the data sets if this was required.
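For anyone who wants to cross-check their Excel data preparation in code, the sketch below shows one possible way to build the separate and paired data sets in Python. The country names and values are made up for illustration, and `None` is assumed to stand for a blank cell; none of this comes from the SDR2021 database.

```python
# Hypothetical raw rows: (country, variable A, variable B).
# None marks a blank cell, i.e. no observation recorded.
raw = [
    ("Country A", 12.5, 3.1),
    ("Country B", None, 4.0),   # no observation for A: dropped from A, NOT set to 0
    ("Country C", 0.0, None),   # a recorded zero for A is kept as a real observation
    ("Country D", 40.2, 1.7),
]


def univariate(rows, index):
    """Keep (country, value) pairs where the chosen variable was observed."""
    return [(row[0], row[index]) for row in rows if row[index] is not None]


def paired(rows):
    """Keep only countries that have an observation for both variables."""
    return [row for row in rows if row[1] is not None and row[2] is not None]


var_a = univariate(raw, 1)   # separate data set for variable A
var_b = univariate(raw, 2)   # separate data set for variable B
both = paired(raw)           # paired (bivariate) data set

print(len(var_a), len(var_b), len(both))  # prints: 3 3 2
```

Comparing the three counts is a quick way to see how many countries were lost when pairing, which is worth noting in the report.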

**An educated guess**

Guess the average value for each variable.

- Run your eye down the column of univariate data you have for each variable (the separate data not the paired data), and make a guess what you think the cross-country average would be for each one. **Do not use Excel to calculate the averages here.**
- Just take a note of your guesses; you will use them later.

**Data description**

Use numerical summary measures and graphical representations to describe the two variables (using the __separate data__).

- You can use the “Descriptive Statistics” tool in the data analysis tool pack and also calculate quartiles, coefficients of variation etc.
- Draw a histogram, boxplot, etc. for each data set.
- You should __discuss__ the **important and interesting features** of the data revealed by your descriptive statistics and graphical representations in your report. In my example workbook you will see the CO₂ data is strongly positively skewed, so much so that the boxplot is almost meaningless. The two options I had were to drop some of the largest observations, or to transform the data. I chose the latter – by taking the log of the data I end up with a data set distribution that can be usefully presented on a boxplot or histogram. Outliers and skewness are common features of cross-country data like this, so you should be prepared to drop observations or transform data if necessary, and explain why you did this in your report. (Just because data is skewed doesn’t mean you have to transform it! You’ll notice I did not transform the literacy data.)
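As a cross-check on the Excel output, the summary measures above can also be sketched in Python using only the standard library. The data values below are invented to show a strongly positively skewed set, and the `"inclusive"` quartile method is chosen on the assumption that it matches the interpolation used by Excel's QUARTILE.INC. The log transform only applies to strictly positive values.

```python
import math
import statistics


def describe(values):
    """Return a few numerical summary measures for one variable."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample standard deviation
    return {
        "n": len(values),
        "mean": mean,
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,                    # interquartile range
        "sd": sd,
        "cv": sd / mean,                   # coefficient of variation
    }


# Hypothetical data with one very large observation (positive skew).
values = [1.0, 2.0, 3.0, 4.0, 100.0]
summary = describe(values)

# Log transform as one way to tame the skew (requires all values > 0).
log_summary = describe([math.log(v) for v in values])

print(summary)
print(log_summary)
```

A mean well above the median, as in `summary`, is one numerical hint of positive skew; the histogram and boxplot should still be inspected before deciding whether to transform.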

Use numerical summary measures and graphical representations to consider if there might be a relationship between the two variables (using the __paired data__).

- Use the correlation coefficient and a scatterplot to see the strength and direction of the relationship (if any) between the two variables.
- In your report, __discuss__ the above and __explain__ why you think the relationship might be causative, spurious, or driven by a third factor.
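To illustrate what the correlation coefficient does and does not capture, here is a minimal Python sketch of the Pearson correlation coefficient computed from its definition. The function name `pearson_r` and the small data sets are illustrative only. The last example shows a perfect but non-linear relationship for which the coefficient is zero, which is why the scatterplot matters as well.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Perfect positive and perfect negative linear relationships.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])

# y = x squared: a clear relationship, but not a linear one.
xs = [-2, -1, 0, 1, 2]
r_curve = pearson_r(xs, [x * x for x in xs])

print(r_up, r_down, r_curve)
```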

**Data analysis**

Construct confidence intervals (using the __separate data__).

- Now assume that the data for each variable is a random sample and construct a confidence interval for the population mean of each variable. Since you don’t know the population standard deviations you should use **critical values from the Student t-distribution**.
- State your confidence interval in your report, explaining what it means (to a layperson) and also discuss if you have any doubts about the validity of the interval.
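The interval described above has the form sample mean ± t critical value × (sample standard deviation / √n). A minimal Python sketch follows, assuming the critical value is looked up separately (for example from a t-table, or with Excel's T.INV.2T) and passed in. The sample values are invented, and 2.262 is the standard two-tail 5% critical value for 9 degrees of freedom.

```python
import math
import statistics


def t_confidence_interval(values, t_crit):
    """Confidence interval for the mean, given a t critical value.

    t_crit must match the chosen confidence level and df = n - 1.
    """
    n = len(values)
    mean = statistics.mean(values)
    std_error = statistics.stdev(values) / math.sqrt(n)
    margin = t_crit * std_error
    return mean - margin, mean + margin


# Hypothetical sample of n = 10 observations, so df = 9.
sample = [4, 8, 6, 5, 3, 7, 9, 5, 6, 7]
low, high = t_confidence_interval(sample, t_crit=2.262)  # 95% level

print(round(low, 2), round(high, 2))
```

Note that the sample mean always sits exactly in the middle of the interval, so its position there says nothing about how accurate the estimate is.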

Compute p-values (using the __separate data__).

- Now assume that your “educated guess” of the average for each variable is the true mean of that variable. How likely is it that you would observe the sample mean you have obtained, or something more extreme, if your parameter assumption for each variable is correct? I.e. find the **two-tail p-value** associated with each sample mean. *You can obtain the p-value by doing a two-tail hypothesis test for the mean, for each data set.*
- State the p-values in your report and explain their meaning. Conclude by stating whether your educated guesses were probably right or wrong. (There is no penalty if your educated guesses are wrong!)
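For those curious about what sits behind the two-tail p-value, the sketch below computes the t statistic for a guessed mean and then approximates the p-value by numerically integrating the Student t density with Simpson's rule. This is only one possible approach using the Python standard library; in Excel the same quantity should be available from T.DIST.2T. The function names, sample values and guesses are all illustrative.

```python
import math
import statistics


def t_statistic(values, guess):
    """t statistic for testing whether the population mean equals guess."""
    n = len(values)
    std_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - guess) / std_error


def t_pdf(x, df):
    """Density of the Student t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_tail_p(t_stat, df, steps=2000):
    """Two-tail p-value: 1 minus the area between -|t| and +|t|.

    The area from 0 to |t| is approximated with Simpson's rule
    (steps must be even) and doubled by symmetry.
    """
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1 - 2 * area)


# Hypothetical sample (n = 10, df = 9) with a sample mean of 6.
sample = [4, 8, 6, 5, 3, 7, 9, 5, 6, 7]
df = len(sample) - 1

p_far = two_tail_p(t_statistic(sample, guess=8), df)   # guess far from the mean
p_same = two_tail_p(t_statistic(sample, guess=6), df)  # guess equal to the mean

print(p_far, p_same)
```

The second case shows why guessing the sample mean itself is uninformative: the t statistic is zero and the p-value is 1.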

**Report**

As noted above, your assignment output should consist of a report and a spreadsheet workbook. Imagine that the reader of your report is a busy executive with only a basic understanding of statistics. Your report should therefore be of professional appearance and be able to be fully understood without reference to the workbook. I.e. paste relevant charts into the report; __do not__ paste the full descriptive statistics table into the report but rather use an abridged table and/or discussion; __do not__ show the __computation__ of the confidence intervals and p-values in the report but do __state__ and __interpret__ them.

Remember the suggested word count is 700-1000 words but this is a guide only: if you accomplish everything required above with less, that is fine; ideally don’t go much over 1000 – this would indicate you are not being concise enough.

Finally, I have attached some collated feedback I provided to students last year. You may like to refer to this to see what I am hoping to see in your report.

**General:**

- Check that you included everything that was asked for in the report (not just the workbook). If you missed out computing or discussing e.g. the p-values I couldn’t give you marks for those!

**Data description – univariate**

- Things I looked for were:
  - A brief introduction to the variables: what do the quantities mean?
  - Use of descriptive statistics to describe the main features of the data e.g. IQR, CV, mean, median, quartiles, standard deviation (together with empirical rule or Chebyshev theorem)
  - A little discussion about outliers that were interesting or were removed
  - Histograms, polygons, boxplots and/or normal probability plots and discussion of the data distribution shape

- If there was no observation in a data set for a country it should not be treated as zero. Also be suspicious of zeros that do appear in the raw data set – they might have ended up there in place of no observation.
- Be careful with units – state what the units of the variables are and keep using those units for things like mean, standard deviation etc. E.g. cereal yield is in tonnes per hectare, not percentage.
- Left skewed or negatively skewed data has a peak near the top of the distribution and a long lower tail.
- A data set does not have to fall into {left skewed, symmetric/normal, right skewed}. There are many other variations (without specific names, you could just call it asymmetric for instance).
- It wasn’t necessary to transform a data set for the univariate analysis if it was skewed, only if it was so skewed or outliers were so extreme that boxplots etc. were not useful. It could of course be useful for bivariate analysis if one or both variables seem to be lognormal, because then you could still see a linear relation in the scatterplot using the logged variable(s) and linear correlation and regression would be valid.

**Data description – bivariate**

- Things I looked for were:
  - A scatterplot and associated discussion (not a line chart or bar chart with series side by side)
  - The correlation coefficient being stated and interpreted
  - Discussion about why there might or might not be a relationship

- You shouldn’t just rely on the correlation coefficient – if the scatterplot indicates almost no relationship then you should say it is doubtful there is an actual relationship
- If the scatterplot indicates there is a relationship, but it is more likely to be non-linear, you should mention this.
- Don’t just say there might be a third variable and leave it at that. Some discussion is important to show you know what this actually means.

**Confidence intervals**

- Things I looked for were:
  - Statement of each confidence interval in a sentence, in the context of the variable
  - Discussion of why the confidence intervals were valid (appealing to CLT)

- Some said the sample mean was a good estimate because it was inside the confidence interval. Of course, the sample mean is at the center of the interval! This does not guarantee anything about the accuracy of the estimate.
- If you have dropped countries from the data set, the confidence interval only estimates the mean of the population you have sampled from, which may not be the same population that the countries you dropped come from. E.g. if you drop many African countries from the data set due to lack of data, then you should use caution saying that the confidence interval estimate for the mean still applies to African countries (i.e. it is only valid if you are quite sure there is no systematic difference between such countries and those included in your sample in the context of your variable).

**P-values**

- Things I looked for were:
  - Computation of two-tail p-values in the workbook
  - An interpretation of the two-tail (not one-tail) p-value for each variable in terms of how likely it is to observe the sample estimate if the assumed population parameter was correct (not in terms of rejection of the null hypothesis or not).

- Don’t guess the parameter value to be equal to the sample mean, otherwise the p-value is guaranteed to be 1 (and is just cheating!)

**Professional appearance**

- The report should have been easy to read, well-structured, and include charts with titles and axis labels etc.
- You shouldn’t put bar charts of all the countries for each variable into the report – if you want to highlight certain countries e.g. top 5 and bottom 5 just put them into a bar chart.
- You also shouldn’t paste in whole descriptive statistics tables or give values with all the decimal places Excel gives you by default. A good rule of thumb is two or three significant figures. E.g. 542,997 becomes 543,000 and 0.024897874 becomes 0.025 or 0.0249. Also don’t paste in the confidence interval or p-value calculation templates. Managers don’t have time or understanding for technical output. They won’t be impressed; they may just be confused or think you are showing off. They will be impressed if you can boil that technical material down in such a way that they can understand the key points enough to know what decisions need to be made.
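The significant-figures rule of thumb can be sketched as a small Python helper. The function name `round_sig` is hypothetical and not part of any library; it is applied here to the two example values above.

```python
import math


def round_sig(value, sig=3):
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0
    # Position of the leading digit, e.g. 5 for 542997 and -2 for 0.0248...
    magnitude = math.floor(math.log10(abs(value)))
    return round(value, sig - 1 - magnitude)


big = round_sig(542997, 3)            # 543000
small_3 = round_sig(0.024897874, 3)   # 0.0249
small_2 = round_sig(0.024897874, 2)   # 0.025

print(big, small_3, small_2)
```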