Self-Made Mock Paper 2

🧠 FIT1043 Introduction to Data Science – Mock Exam (Custom Edition)

Exam Duration: 2 hours 10 minutes

Total Marks: 65 marks

Part 1: 15 Multiple Choice Questions (15 marks)

Part 2: 25 Short Answer Questions (50 marks)

🧩 Part 1: Multiple Choice Questions (1 mark each)

Multiple Choice Answer Key

B, A, C, B, B, B, A, B, C, A, B, B, C, B, B

Question 1

Which of the following best describes machine learning?

A. Writing fixed rules for computers to follow

B. Developing algorithms that allow computers to learn from data

C. Cleaning and visualising data for human use

D. Creating new programming languages

Question 2

In the Data Science Venn Diagram by Drew Conway, the intersection of hacking skills and domain knowledge without statistics represents:

A. Danger Zone

B. Machine Learning

C. Pure Research

D. Big Data

Question 3

Which of the following statements about open data is TRUE?

A. It is always clean and ready for analysis

B. It is always confidential

C. It is publicly available and machine-readable

D. It cannot be shared

Question 4

In Python (pandas), the expression df.groupby('Gender')['Age'].mean() will:

A. Display the unique values of “Gender”

B. Compute the mean age for each gender

C. Filter the dataset by gender

D. Drop missing values
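A quick way to check the behaviour of this expression is to run it on a tiny table. The sketch below assumes pandas is installed; the data values are invented for illustration, and only the column names come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Age": [30, 40, 50, 20],
})

# Split the rows by Gender, select the Age column, then average within each group.
mean_age = df.groupby("Gender")["Age"].mean()
print(mean_age)
```

The result is a Series indexed by the distinct values of Gender, with one mean per group.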

Question 5

Which plot type best shows the relationship between two numeric variables?

A. Pie chart

B. Scatter plot

C. Bar chart

D. Box plot

Question 6

Which statement about data wrangling is correct?

A. It only applies to structured data

B. It includes cleaning, transforming, and merging data

C. It is done after modelling

D. It refers only to visualisation

Question 7

Which measure of central tendency is most affected by outliers?

A. Mean

B. Median

C. Mode

D. Percentile
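The effect of an outlier on each measure can be checked with the standard library. This is a small sketch on invented salary figures, not data from the unit.

```python
from statistics import mean, median

# Hypothetical salaries in thousands of dollars.
salaries = [40, 42, 45, 47, 50]
# The same list with one extreme value appended.
with_outlier = salaries + [500]

print("mean:  ", mean(salaries), "->", mean(with_outlier))
print("median:", median(salaries), "->", median(with_outlier))
```

The single extreme value drags the mean from 44.8 to roughly 120.7, while the median only moves from 45 to 46.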

Question 8

In regression, overfitting occurs when:

A. The model fits too simply and misses patterns

B. The model fits the training data too closely

C. There is too little bias

D. The dataset has no variance

Question 9

Which law suggests a new, lower-priced class of computing emerges roughly every decade?

A. Moore’s Law

B. Koomey’s Law

C. Bell’s Law

D. Zimmermann’s Law

Question 10

Which Unix command counts the number of words in a file?

A. wc -w

B. wc -l

C. grep

D. cat

Question 11

Which database type is best for hierarchical, semi-structured data?

A. SQL

B. Graph Database

C. CSV

D. Relational

Question 12

Which of the following best defines metadata?

A. Extra notes written by the analyst

B. Data about data that helps describe or manage it

C. The data itself

D. Random annotations

Question 13

The Random Forest algorithm is an example of:

A. Unsupervised learning

B. Reinforcement learning

C. Ensemble learning

D. Clustering

Question 14

Which of the following best describes the purpose of a loss function?

A. To add noise to training data

B. To measure how far predictions are from true values

C. To visualise training results

D. To tune hyperparameters manually

Question 15

Which is a privacy risk described by Zimmermann’s Law?

A. Better encryption

B. Increased surveillance with advancing technology

C. Slower internet speeds

D. Stricter data governance

✍️ Part 2: Short Answer Questions (2 marks each)

Question 16

Define data science and give one example of its application.

📝

Data science is the process of collecting, cleaning, analysing, and interpreting data to extract meaningful insights and support decision-making.

Example: Using sales records to predict next-month demand.

Question 17

Explain the three V’s of big data and describe what veracity adds to them.

📝

Volume: The amount or size of data generated.

Velocity: The speed at which data is created and processed.

Variety: The different forms of data (text, image, video, etc.).

Veracity (the fourth V): Adds the question of data quality, i.e. how accurate, reliable, and trustworthy the data is.

Question 18

What is the difference between structured, semi-structured, and unstructured data? Give one example each.

📝

Structured: Highly organised and stored in tables, e.g. SQL database of customers.

Semi-structured: Has some organisation but no fixed schema, e.g. JSON or XML file.

Unstructured: No predefined data model, such as free text, photos, or videos, e.g. social-media posts.

Question 19

Explain the difference between supervised and unsupervised learning, with an example of each.

📝

Supervised learning: Trains on labelled data to predict outputs, e.g. predicting house prices.

Unsupervised learning: Finds patterns or groups in unlabelled data, e.g. K-means clustering of customers.

Question 20

What is the difference between regression and classification problems?

📝

Regression predicts continuous numeric values (e.g. salary).

Classification predicts discrete categories (e.g. spam or not-spam).

Question 21

Describe the steps of the K-Means clustering algorithm.

📝

1. Choose the number of clusters K.

2. Randomly initialise K centroids.

3. Assign each data point to the nearest centroid.

4. Recalculate centroids as the mean of assigned points.

5. Repeat steps 3–4 until the centroids stop changing significantly.
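The steps above can be sketched in plain Python. This is a minimal illustration under a few assumptions: 2-D points, squared Euclidean distance, initial centroids sampled from the data, and a hypothetical function name `kmeans`. It is not a reference implementation, and library versions such as scikit-learn's differ in initialisation and stopping rules.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means on a list of 2-D (x, y) tuples."""
    rng = random.Random(seed)
    # Steps 1-2: K is given; pick K distinct data points as initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(c)  # leave an empty cluster's centroid in place
        # Step 5: stop once no centroid moves any more.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On this toy data the two centroids settle at the centre of each group. The order in which they are returned depends on the random initialisation, which is why the example sorts them before printing.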

Question 22

Explain what a confusion matrix is and name two metrics derived from it.

📝

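A confusion matrix is a table that counts predictions by actual class versus predicted class (true positives, false positives, false negatives, true negatives) to evaluate model performance.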

Metrics include Accuracy, Precision, Recall (Sensitivity), and F1-score.
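As a worked example, the four metrics can be computed directly from the four cells of a binary confusion matrix. The labels below are invented, and the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four cells of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # correctly predicted positive
        elif a != positive and p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive and p != positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # correctly predicted negative
    return tp, fp, fn, tn

# Hypothetical labels: 1 = positive class, 0 = negative class.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, tn, accuracy, precision, recall, f1)
```

Here TP = 3, FP = 1, FN = 1, TN = 5, giving accuracy 0.8 and precision, recall, and F1 of 0.75 each.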

Question 23

What does the following Python code output?

📝

It filters and returns all rows where Age > 30 and Gender == 'Female'.
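The snippet this question refers to does not appear in this copy of the paper. Judging only from the answer, it was probably a boolean-mask filter. The sketch below is a hypothetical reconstruction on invented data (pandas assumed installed), not the original exam code.

```python
import pandas as pd

# Hypothetical data; the Age and Gender column names are taken from the answer.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cho", "Dee"],
    "Age": [35, 42, 28, 31],
    "Gender": ["Female", "Male", "Female", "Female"],
})

# Each condition yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than > and ==.
result = df[(df["Age"] > 30) & (df["Gender"] == "Female")]
print(result)
```

Only the rows for Ana and Dee satisfy both conditions, so only those two rows are returned.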

Question 24

Give two causes of data quality issues and one method to fix each.

📝

Missing values → Impute using mean/median or remove rows.

Inconsistent formatting → Standardise values (e.g. “U.S.A.” → “USA”).
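Both fixes can be sketched in pandas. The table below is invented for illustration, and median imputation is only one of the options named above.

```python
import pandas as pd

# Hypothetical data with one missing age and inconsistent country spellings.
df = pd.DataFrame({
    "Country": ["U.S.A.", "USA", "usa ", "Australia"],
    "Age": [25.0, None, 35.0, 30.0],
})

# Missing values: impute with the median of the observed ages.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Inconsistent formatting: trim whitespace, drop full stops, unify case.
df["Country"] = (
    df["Country"]
    .str.strip()
    .str.replace(".", "", regex=False)
    .str.upper()
)
print(df)
```

After cleaning, the three spellings of the same country collapse to "USA" and the missing age becomes 30.0, the median of the observed values.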

Question 25

What is data wrangling and why is it necessary in data science?

📝

Data wrangling is cleaning and transforming raw data into a usable format for analysis.

It ensures accuracy and consistency before modelling or visualisation.

Question 26

What does a boxplot show, and how can it be used to detect outliers?

📝

A boxplot displays the five-number summary (min, Q1, median, Q3, max).

Points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 − Q1, are treated as outliers.
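The outlier rule can be checked numerically with the standard library. The data is invented, and the quartile method (`method="inclusive"`) is an assumption; plotting libraries may compute quartiles slightly differently.

```python
from statistics import quantiles

# Hypothetical measurements with one suspiciously large value.
data = [10, 12, 13, 14, 15, 16, 18, 40]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything beyond 1.5 * IQR from the box is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)
```

Only the value 40 falls outside the fences here, so it is the point a boxplot would draw separately from the whiskers.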

Question 27

What is the difference between bias and variance in model evaluation?

📝

Bias: Error from simplifying assumptions that miss true patterns.

Variance: Error from sensitivity to small changes in training data.

Question 28

Explain the bias–variance trade-off with a simple example.

📝

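Simple models (e.g. a straight line fitted to clearly curved data) → high bias / low variance (underfitting).

Complex models (e.g. a very high-degree polynomial that passes through every training point) → low bias / high variance (overfitting).

The best model balances both to minimise error on unseen data.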
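The two extremes can be illustrated with two deliberately crude models on synthetic data. Everything here is an assumption made for the sketch: the ground truth y = x² plus noise, the constant model standing in for "too simple", and a 1-nearest-neighbour model standing in for "too complex".

```python
import random

rng = random.Random(0)

def make_data(n):
    # Hypothetical ground truth: y = x^2 plus Gaussian noise.
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    return [(x, x * x + rng.gauss(0, 1)) for x in xs]

train, test = make_data(30), make_data(30)

def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# High bias: ignore x entirely and always predict the training mean.
mean_y = sum(y for _, y in train) / len(train)

def constant_model(x):
    return mean_y

# High variance: copy the y value of the single nearest training point.
def nearest_neighbour_model(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

for name, model in [("constant", constant_model), ("1-NN", nearest_neighbour_model)]:
    print(name, "train MSE:", round(mse(model, train), 2),
          "test MSE:", round(mse(model, test), 2))
```

The constant model has a large error even on its own training data because it cannot represent the curve. The 1-NN model scores exactly zero on the training data by memorising the noise, yet its error on fresh data is not zero. That gap between training and test error is the symptom of high variance.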

Question 29

What is the No Free Lunch Theorem, and what does it imply about ML algorithms?

📝

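It states that, averaged over all possible problems, no algorithm outperforms any other, so no single algorithm performs best for all problems.

Algorithm performance depends on the specific data and task, so candidate models should be compared empirically.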

Question 30

Define an ensemble model and give one example.

📝

An ensemble model combines multiple individual models to improve accuracy and robustness.

Example: Random Forest (combines many decision trees).
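The core idea of combining models can be sketched with simple majority voting. The three lists of predictions are invented, and the helper name `majority_vote` is hypothetical. A real Random Forest also trains each tree on a bootstrap sample with random feature subsets, which this sketch does not show.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the individual models' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from three classifiers on the same four emails.
model_a = ["spam", "ham", "spam", "ham"]
model_b = ["spam", "spam", "ham", "ham"]
model_c = ["ham", "ham", "spam", "ham"]

# zip groups the three votes for each email; the ensemble takes the majority.
ensemble = [majority_vote(votes) for votes in zip(model_a, model_b, model_c)]
print(ensemble)
```

Each individual model disagrees with the others on at least one email, but the vote settles on spam, ham, spam, ham.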

Question 31

What is the purpose of data visualisation in exploratory data analysis (EDA)?

📝

It reveals trends, relationships, and anomalies, helping analysts understand data structure and communicate insights effectively.

Question 32

Explain what MapReduce does and how it helps with big data.

📝

Map: Processes chunks of a large dataset in parallel and emits intermediate key-value pairs.

Reduce: Groups the intermediate pairs by key and aggregates each group into a final result.

This allows distributed, scalable computation across many machines.
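The classic word-count example shows the pattern on a single machine. This is only a sketch of the programming model: the function names are hypothetical, and a real framework would run the map and reduce calls on different machines and handle the grouping (shuffle) step itself.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)

# Hypothetical input: each string stands for a chunk stored on a different node.
documents = ["big data is big", "data science uses data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each call to `map_phase` looks at only one document and each call to `reduce_phase` looks at only one key, both phases can be spread across many machines.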

Question 33

Compare Hadoop and Spark in terms of processing method and speed.

📝

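Hadoop (MapReduce): Batch processing that writes intermediate results to disk, so it is slower; suitable for offline jobs.

Spark: Keeps intermediate data in memory, so it is faster, especially for iterative analytics and near-real-time workloads.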

Question 34

Explain what metadata is and give one example related to image files.

📝

Metadata is data describing other data.

Example: EXIF metadata in photos—camera type, timestamp, GPS location, resolution.

Question 35

Describe Bell’s Law and give an example of a possible future computing class.

📝

Bell’s Law states that roughly every decade a new, lower-priced class of computers emerges, enabled by advances in technology.

Example: Wearable AI or quantum-edge devices as the next class after smartphones.

Question 36

What is data governance, and why is it important in organisations?

📝

Data governance defines policies for who can access, manage, and protect data.

It ensures ethical use, consistency, and compliance with regulations.

Question 37

Explain the difference between privacy, confidentiality, and security.

📝

Privacy: Control over how personal data is shared.

Confidentiality: Keeping sensitive information secret.

Security: Technical protection preventing unauthorised access or misuse.

Question 38

What is an API and how can it be used to collect data for a project?

📝

An API (Application Programming Interface) lets software communicate with other systems.

Developers can request external data (e.g. weather API) for analysis.

Question 39

Explain one real-world use of machine learning in daily life.

📝

Machine learning powers everyday tools such as email spam filters, movie recommendations, and facial recognition on smartphones.

Question 40

What does a loss function measure and how is it used during training?

📝

A loss function measures the difference between predicted and actual values.

Training algorithms minimise this loss to improve model accuracy.
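A minimal sketch of this loop uses mean squared error as the loss and plain gradient descent as the training algorithm. The one-parameter model y = w·x, the data, and the learning rate are all assumptions chosen for illustration.

```python
def mse(w, xs, ys):
    """Loss: mean squared difference between predictions w*x and targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w, xs, ys):
    """Derivative of the MSE loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data generated by y = 2x, so the best weight is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0              # initial guess
learning_rate = 0.01
losses = []
for step in range(200):
    losses.append(mse(w, xs, ys))
    # Move w a small step in the direction that reduces the loss.
    w -= learning_rate * gradient(w, xs, ys)

print(round(w, 4), losses[0], losses[-1])
```

The loss starts at 30.0 with w = 0 and shrinks towards zero as w approaches 2, which is what "minimising the loss" means in practice.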