Sample Exam (Enhanced Edition)

This paper combines the sample exams from the following years:

2022 S1, S2

2023 S2

2024 S1, S2

Question 1

Drew Conway Venn Diagram: Which of the following best explains the “Danger Zone” intersection in the Drew Conway Venn Diagram?

A. It describes people who are well versed in conducting end-to-end machine learning and reporting coefficients, but without understanding what they mean.

B. It describes people who are not sure what they are doing, although they can explain the meaning of the output of the coefficients.

C. It is an area that we should not enter as it is dangerous and can result in harmful analysis.

D. It is an area where people conduct trial and error experiments with data and report the best results.

Answer

Question 2

Complete this sentence. The most common algorithm for linear regression aims to

Select one:

A. find a function which minimises the squared error between the data and proposed function.

B. minimise the error between the data and proposed function.

C. maximise the squared error between the data and proposed function.

D. draw a straight line.

Answer

Question 3

Machine learning is useful when:

A. human expertise is not available

B. humans cannot explain their expertise (as a set of rules)

C. humans are expensive to use for the work

D. All of the other cases

Answer

Question 4

Which of the following is true about “open data”?

A. Open data is both private and machine readable.

B. Open data is always useful.

C. Open data is machine-readable data that is publicly available.

D. None of the above options.

Answer

Question 5

Which of the following statements about Data Wrangling tools is TRUE:

A. Python and R are general purpose languages that cannot be used for data wrangling.

B. All data wrangling tools require users to write code.

C. Data wrangling tools are all open source.

D. ALL of the above statements are false.

Answer

Question 6

What is the proper explanation for the following Python code:

A. It groups the data based on Sex and Class and returns the average

B. It shows the average of age in each class and sex

C. It groups the data based on Sex and Class

D. It first groups the data by sex. Then shows the average age in different classes

Answer

Question 7

What is Hadoop?

A. An abbreviation for “Hadrian's Loop”, a firewall management system

B. A programming language designed for agile development

C. An encryption system used extensively at Google

D. A system for partitioning computation across a compute cluster

Answer

Question 8

The 3Vs of big data are important because:

A. they are an industry standard.

B. they are the basis for the development of more Vs (e.g. Value).

C. they are used to describe in what way a dataset may be too big to handle.

D. they are from the influential Gartner Inc.

Answer

Question 9

Privacy: What is the technological reason for the continued increase in lack of privacy?

A. the flow of technology makes surveillance easier unless particular measures are set in place.

B. the increase in cybercrime and terrorism makes it a necessity.

C. the open internet and the cloud remove privacy.

D. it follows from Koomey's Law.

Answer

Question 10

Which of the following statements about Python is TRUE?

A. The first element of an array in Python has the index 1.

B. Python was designed by statisticians.

C. Python is a scripting language.

D. Python is a proprietary programming language.

Answer

Question 11

Unix shell commands like “less” and “grep”:

A. can be used to manipulate large data files easily

B. are poorly documented

C. are used to fit regression tree models

D. are examples of technology that is too old to be useful to a modern data scientist

Answer

Question 12

Over the years, disk capacity is generally growing:

A. exponentially

B. quadratically

C. linearly

D. logarithmically

Answer

Question 13

Which of the following are real-world applications of Machine Learning?

A. Self-driving cars

B. Spam filtering

C. Weather forecasting

D. All of the options

Answer

Question 14

The growth of NoSQL databases occurred because:

A. they were better suited for distributed implementation

B. the variety, volume and specific processing demands of some classes of data challenge RDBMSs

C. they were more easily integrated with web client applications

D. enterprising database developers expanded in the niche markets of NoSQL

Answer

Question 15

Which one is correct about R and Python:

A. Python is more powerful compared to R as it has more libraries

B. R supports dataframes while Python does not have the concept of a dataframe

C. Both R and Python can be used for data wrangling

D. R and Python are not comparable as they cannot be used for the same purposes

Answer

Short Answer Question

Question 16

Explain what big data is.

📝

Big data refers to the large, diverse sets of information that grow at ever-increasing rates. It encompasses the volume of information, the velocity or speed at which it is created and collected, the variety or scope of the data points being covered, and the quality and availability of data (veracity).

Explain what big data is. Consider the four V’s of big data and explain veracity in a few words.

📝

Big data is data for which any attribute (among the V’s) challenges the constraints of a system's capability or a business need. Veracity is the uncertainty of the data.

Volume is the size of the data.

Velocity is the frequency/pace of incoming data that needs to be processed.

Variety refers to the different types of data.

Veracity refers to how accurate or truthful a data set may be, that is, how accurate and reliable the data is.

Question 17

List some types of metadata that might be associated with an image.

📝

Descriptive metadata: title of image, the owner, the number of versions…

Structural metadata: keywords, collections…

Administrative metadata: rights management information, restrictions on use, licensing rights, and expiration date…

Question 18

Name two typical tasks performed while “wrangling data”.

📝

Handling missing values

Fixing typos. Example: the country name is Austria instead of Australia

Fixing probable issues with dates. Example: the month is greater than 12

Fixing impossible semantic errors. Example: the start date of a staff member is later than the end date of that staff member in our data set, and we need to swap them (you don’t need to mention that it is a semantic error. As long as you explain the error, you will get the full mark)
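A minimal sketch of the wrangling tasks above, using only the Python standard library. The records, the field names (`country`, `age`, `start`, `end`), the `COUNTRY_FIXES` table and the choice to fill missing ages with the median are all assumptions made for illustration; a real project would more likely use a library such as pandas.

```python
from datetime import date
from statistics import median

# Hypothetical staff records containing typical problems.
records = [
    {"name": "Ann", "country": "Austria", "age": 34,
     "start": date(2020, 1, 5), "end": date(2021, 3, 1)},
    {"name": "Bob", "country": "Australia", "age": None,
     "start": date(2022, 6, 1), "end": date(2019, 2, 1)},
    {"name": "Cy", "country": "Australia", "age": 40,
     "start": date(2018, 4, 9), "end": date(2020, 4, 9)},
]

# Assumed correction table; deciding that "Austria" is a typo needs domain knowledge.
COUNTRY_FIXES = {"Austria": "Australia"}


def parse_date(text):
    """Return a date for 'YYYY-MM-DD' text, or None if the date is impossible."""
    try:
        return date.fromisoformat(text)
    except ValueError:  # e.g. a month greater than 12
        return None


def wrangle(rows):
    """Return cleaned copies of the rows; the input rows are left unchanged."""
    fill = median(r["age"] for r in rows if r["age"] is not None)
    cleaned = []
    for row in rows:
        row = dict(row)
        if row["age"] is None:  # handle missing values
            row["age"] = fill
        # fix known typos
        row["country"] = COUNTRY_FIXES.get(row["country"], row["country"])
        if row["start"] > row["end"]:  # fix the semantic error by swapping
            row["start"], row["end"] = row["end"], row["start"]
        cleaned.append(row)
    return cleaned


cleaned = wrangle(records)
```

Here `parse_date("2021-13-05")` returns `None`, which flags the value for manual review instead of guessing the intended date.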

Question 19

Assume you are collecting data about traffic accidents in Melbourne to develop a predictive model. Would it be better to collect “more data” (e.g., the locations of accidents over many years) or “more types of data” (e.g., the types of vehicles involved, the weather conditions, etc)? Give a brief justification.

📝

Assuming there is already sufficient data for building a predictive model, collecting more types of data usually helps the model more than simply collecting more of the same data. However, if there is an insufficient amount of data, it is better to first ensure that there is enough data to build the model.

Question 20

If you want to audit your data, what features of your data will you check? Name two.

📝

Outliers

Missing values
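Both checks can be sketched in a few lines of standard-library Python. The `audit` helper, the sample `ages` list and the 1.5 × IQR outlier rule are assumptions chosen for illustration; other outlier definitions (for example z-scores) are equally valid.

```python
from statistics import quantiles


def audit(values):
    """Count missing entries and flag outliers using the 1.5 * IQR rule."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return {
        "missing": len(values) - len(present),
        "outliers": [v for v in present if v < low or v > high],
    }


# Made-up ages: two missing entries and one implausible value.
ages = [23, 25, None, 27, 24, 26, 120, None, 25]
report = audit(ages)
```

For this sample the report counts two missing values and flags 120 as an outlier.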

Question 21

Explain the differences between classification and regression. Which one can be used to predict a salary based on the age and job title of a person?

📝

Classification: The dependent variable is categorical, i.e., it takes discrete values such as Spam or not Spam.

Regression: The dependent variable is a continuous value such as price.

We can use regression to predict the salary based on the age and job as salary is a continuous value.
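As a rough illustration of the regression case, the sketch below fits a straight line to made-up age and salary figures with ordinary least squares. It ignores job title for simplicity (a categorical predictor would first need encoding), and `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data that happens to be exactly linear.
ages = [25, 30, 35, 40, 45]
salaries = [50000, 60000, 70000, 80000, 90000]

intercept, slope = fit_line(ages, salaries)
predicted_salary = intercept + slope * 50  # a continuous value, not a class label
```

The prediction is a number on a continuous scale, which is what distinguishes this from a classifier that would return a label such as "high" or "low".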

Question 22

Explain "sensitivity" as one of the classification metrics.

📝

Sensitivity (the True Positive Rate) is the proportion of actual positive cases that were correctly predicted as positive: TP / (TP + FN)
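A small sketch of the calculation, assuming labels are encoded as 1 (positive) and 0 (negative). The `sensitivity` function and the two label lists are made up for illustration.

```python
def sensitivity(actual, predicted, positive=1):
    """True positive rate: TP / (TP + FN)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("sensitivity is undefined without actual positives")
    return tp / (tp + fn)


actual = [1, 1, 1, 1, 0, 0, 0, 1]     # five actual positives
predicted = [1, 0, 1, 1, 0, 1, 0, 0]  # three of them predicted positive
score = sensitivity(actual, predicted)  # 3 / (3 + 2) = 0.6
```

Note that the false positive at the sixth position does not affect sensitivity; only the actual positives enter the ratio.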

Question 23

Would you consider users’ emails to be sensitive information? Why or why not?

📝

Yes, emails should be considered sensitive information. They may contain different types of private information such as addresses, cell phone numbers, vacation notices, financial information and so on. This information should be kept confidential.

Question 24

Explain the k-means algorithm.

📝

K-means is an unsupervised machine learning algorithm for clustering. It groups similar data points into a fixed number of clusters (k) to help us discover underlying patterns. The algorithm is as follows:

1- Define k (the number of clusters)

2- Initialize the centroids

3- Assign each data point to its nearest centroid

4- Update the centroids (each becomes the mean of the points assigned to it)

5- If the new centroid values changed significantly from the previous values, return to step 3; otherwise, stop the algorithm.
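The loop described above can be sketched in plain Python. This is a minimal illustration under several assumptions: points are tuples of numbers, centroids are initialised by randomly sampling k data points, distance is Euclidean, and "changed significantly" means the largest centroid shift exceeds a tolerance `tol`. The names `kmeans`, `max_iter`, `tol` and `seed` are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Return (centroids, clusters) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise the centroids
    clusters = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update each centroid to the mean of its assigned points.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Stop once no centroid has moved by more than the tolerance.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, clusters


# Two made-up, well-separated groups of 2D points.
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, k=2)
```

Because the result depends on the initial centroids, practical implementations usually run the algorithm several times with different seeds and keep the best clustering.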

Question 25

Name two different data science roles (jobs) and explain their responsibilities.

📝

Data developers: people focused on the technical problem of managing data—how to get it, store it, and learn from it

Data researchers: people with an academic research background, using their training to “understand complex processes”

Data businesspersons: most focused on the organization and how data projects yield profit

Data creatives: the broadest of data scientists, those who excel at applying a wide range of tools and technologies to a problem

(You could have used role names such as data scientist, data analyst, business analyst to answer this question as well.)

Question 26

There are different basic types of data. Name two and explain them briefly.

📝

Numeric-Discrete: Numeric, but the values are enumerable

Numeric-Continuous: Numeric, not enumerable (i.e., real numbers)

Categorical-Nominal: Discrete numbers of values, no inherent ordering

Categorical-Ordinal: Discrete number of states, but with an ordering

Question 27

Name two benefits of open data.

📝

Transparency

Public Service Improvement

Innovation and Economic value

Efficiency