🗒️FIT 1043 EXAM 1

Cs, Go!

FIT 1043|2021-7-2|Last updated: 2025-8-25|
Place
People
Definitions

Data Science Venn Diagram (Drew Conway's Venn diagram)

Hacking skills
Needs creativity to prepare data for analysis.
Math and statistics knowledge
To extract insight from data, baseline familiarity with these is required.
Substantive domain knowledge
What are the goals and constraints of a domain?

Data Science Process (Standard Value Chain)

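Collection - getting the data
Engineering - designing storage and computation for the data and results
Governance - managing quality, security and privacy
Wrangling - selecting and filtering the information
Analysis - analysing the data
Presentation - data visualisation and presenting the results
Operationalisation - putting the results to work in real-world cases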

pandas groupby

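As a rough sketch of what a pandas groupby does (split the rows by a key, then aggregate each group), assuming a small made-up passenger table. The column names and values below are hypothetical and are not taken from the slides.

```python
import pandas as pd

# Hypothetical passenger data, invented for illustration only.
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "female"],
    "Survived": [1, 0, 1, 1, 0],
    "Fare": [71.0, 8.0, 53.0, 8.0, 10.0],
})

# Split the rows by "Sex", then take the mean of "Survived" in each group.
survival_rate = df.groupby("Sex")["Survived"].mean()

# Several aggregates at once: one output column per named aggregation.
summary = df.groupby("Sex").agg(
    passengers=("Survived", "size"),
    mean_fare=("Fare", "mean"),
)

print(survival_rate)
print(summary)
```

The result is indexed by the grouping key, so `survival_rate["female"]` looks up one group's value.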

Open Data

  • A common format for open data is “Linked Open Data (LOD)”
  • A basic idea behind LOD is that data has more value if it can be connected to other data.
  • Triples: subject, predicate (verb) and object
  • Enables data from different sources to be connected and queried.
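To make the triple idea concrete, here is a minimal sketch in plain Python. It assumes two hypothetical sources whose facts are stored as (subject, predicate, object) tuples; real Linked Open Data uses URIs and a query language, which this toy `match` helper only imitates.

```python
# Each fact is a (subject, predicate, object) triple.
# Both "sources" are hypothetical and written by hand for illustration.
source_a = [
    ("FIT1043", "offeredBy", "Monash University"),
    ("Monash University", "locatedIn", "Melbourne"),
]
source_b = [
    ("Melbourne", "locatedIn", "Australia"),
]

# Connecting the sources is just pooling their triples: they link up
# wherever they use the same name for the same thing.
triples = source_a + source_b


def match(s=None, p=None, o=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Follow links across both sources: unit -> provider -> city -> country.
provider = match(s="FIT1043", p="offeredBy")[0][2]
city = match(s=provider, p="locatedIn")[0][2]
country = match(s=city, p="locatedIn")[0][2]
print(provider, city, country)
```

The last hop only succeeds because the second source describes the same "Melbourne" that the first source points to, which is the sense in which connected data has more value.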

API

One of the best ways to gather data for a data science project is to use APIs.
API: Application Programming Interface
Routines providing programmatic access to an application.
  • An API is like a user interface, but it is designed for computers to access the functionality of a piece of software
  • Computers talk to each other to access data or services from applications
  • API consumers vs. API providers
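The consumer/provider split can be sketched with the Python standard library alone. Everything specific here is an assumption made for illustration: the `/units` endpoint, the JSON payload, and running both sides in one process on a local port.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class ProviderHandler(BaseHTTPRequestHandler):
    """API provider: answers GET /units with JSON (hypothetical endpoint)."""

    def do_GET(self):
        if self.path == "/units":
            body = json.dumps([{"code": "FIT1043", "week": 2}]).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), ProviderHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# API consumer: a program, not a person, requests the data and parses it.
# The empty ProxyHandler makes the request go straight to localhost.
opener = build_opener(ProxyHandler({}))
with opener.open(f"http://127.0.0.1:{port}/units", timeout=5) as response:
    units = json.loads(response.read().decode("utf-8"))

server.shutdown()
server.server_close()
print(units)
```

In a real project the provider is someone else's service, and only the consumer half (request, then parse the structured response) is written by the data scientist.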

Wrangling

  • Working with raw data is challenging!
    • Data comes in all shapes and sizes
    • Different files have different formatting
    • Mistakes in data entries
Data wrangling is the process of transforming “raw” data into data that can be analyzed to generate valid, actionable results and insights.
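A minimal sketch of one wrangling step in pandas, assuming a made-up raw table with stray whitespace, mixed capitalisation and numbers stored as text. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw export: untidy names, and heights stored as strings.
raw = pd.DataFrame({
    "name": ["  Alice", "BOB ", "carol"],
    "height_cm": ["170", "165.5", "n/a"],
})

clean = raw.copy()

# Remove surrounding whitespace and use one capitalisation style.
clean["name"] = clean["name"].str.strip().str.title()

# Convert text to numbers; errors="coerce" turns unparseable entries
# such as "n/a" into NaN instead of raising an error.
clean["height_cm"] = pd.to_numeric(clean["height_cm"], errors="coerce")

print(clean)
print(clean["height_cm"].mean())  # NaN is skipped by default
```

After this step the height column is numeric, so statistics such as the mean can be computed, which was not possible on the raw strings.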

Data Wrangling

  • data pre-processing
  • data preparation
  • data cleansing
  • data transformation

Data Issues

Sources of Data Quality Issues

  • Interpretability Issue
  • Data format issue
  • Inconsistent and faulty data
  • Missing and incomplete data
  • Outliers
  • Duplicates

Data quality problems - Inconsistent and faulty data

  • mistyped data
  • inconsistent entry
  • extraneous (irrelevant or unrelated) data
 
 

Discussions

Misspelling and inconsistency

💭 Spelling mistakes, or values that are not recorded in a consistent way.
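One possible way to handle this in pandas, as a sketch: first remove formatting differences, then map known misspellings to a canonical spelling. The city values and the hand-written `corrections` table are assumptions for illustration.

```python
import pandas as pd

# Hypothetical survey column with inconsistent entries.
city = pd.Series(["Melbourne", "melbourne ", "Melborne", "SYDNEY", "Sydney"])

# Step 1: remove formatting differences (whitespace, capitalisation).
normalised = city.str.strip().str.lower()

# Step 2: map known misspellings to the canonical spelling.
# This lookup table is written by hand; it is an assumption, not a rule.
corrections = {"melborne": "melbourne"}
cleaned = normalised.replace(corrections)

print(cleaned.value_counts())
```

Checking `value_counts()` before and after is a quick way to see whether near-duplicate categories have been merged.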

Irregularities

💭 Any value that is not legal for its attribute, for example a value outside the attribute's domain.
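A domain check can be sketched as a membership test. The grade codes below are an assumed domain chosen for the example, not a definition from the unit.

```python
import pandas as pd

# Hypothetical results table.
df = pd.DataFrame({
    "student": ["A", "B", "C", "D"],
    "grade": ["HD", "D", "Z", "P"],
})

# Assumed domain: the only values the "grade" column is allowed to take.
valid_grades = {"HD", "D", "C", "P", "N"}

# Rows whose grade falls outside the domain are irregular.
irregular = df[~df["grade"].isin(valid_grades)]
print(irregular)
```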

Integrity Constraint Violation

💭 This depends on the situation. For example, a head count cannot be a fractional number.
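The head-count example can be written as an explicit rule and checked against the data. This is a sketch with made-up rooms and counts; the rule itself (whole and non-negative) is the assumed constraint.

```python
import pandas as pd

# Hypothetical room occupancy data.
df = pd.DataFrame({
    "room": ["R1", "R2", "R3", "R4"],
    "people": [12.0, 3.5, -1.0, 0.0],
})

# Assumed constraint: a head count is a whole number and is not negative.
is_whole = df["people"] % 1 == 0
is_non_negative = df["people"] >= 0

# Any row that fails either condition violates the constraint.
violations = df[~(is_whole & is_non_negative)]
print(violations)
```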

Duplications

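A short pandas sketch for finding and removing exact duplicate rows, using a made-up table.

```python
import pandas as pd

# Hypothetical table in which one record was entered twice.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Ann", "Ben", "Ben", "Cat"],
})

# True for the second and later copies of an identical row.
dup_mask = df.duplicated()

# Keep the first copy of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

print(dup_mask.tolist())
print(deduped)
```

Both calls accept a `subset` argument when only some columns (for example an id) should decide what counts as a duplicate.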

Missing values

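Three common ways of dealing with missing values (a special "unknown" value, the average, or dropping the row) can be sketched in pandas as follows. The table is made up, and which option is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one incomplete row.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Melbourne", None, "Sydney"],
})

# Option 1: replace with a special "unknown" value (suits categorical data).
with_unknown = df["city"].fillna("Unknown")

# Option 2: replace with the average of the observed values (numeric data).
with_mean = df["age"].fillna(df["age"].mean())

# Option 3: remove every row that contains a missing value.
rows_dropped = df.dropna()

print(with_unknown.tolist())
print(with_mean.tolist())
print(len(rows_dropped))
```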

Outliers

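One widely used convention for flagging outliers is the 1.5 × IQR rule, the same rule that box plots usually use for their whiskers. The notes do not state which rule the unit uses, so treat this as an assumed example on made-up heights.

```python
import pandas as pd

# Hypothetical heights in cm; 250 looks like a data entry error.
heights = pd.Series([160, 165, 170, 172, 175, 180, 250])

q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Assumed convention: flag values more than 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = heights[(heights < lower) | (heights > upper)]

print(lower, upper)
print(outliers.tolist())
```

A flagged value is only a candidate: whether it is an error or a genuine extreme still needs domain knowledge.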

Basic Types of Data

Categorical-Nominal

  • Discrete number of values, no inherent ordering
  • E.g. country of birth, sex

Categorical-Ordinal

  • Discrete number of states, but with an ordering
  • E.g. Education status, State of disease progression

Numeric-Discrete

  • Numeric, but the values are enumerable
  • E.g. Number of live births, Age (in whole years)

Numeric-Continuous

  • Numeric, not enumerable (i.e., real numbers)
  • E.g. Weight, Height, Distance from CBD
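The four types map fairly directly onto pandas types. The sketch below uses made-up values, and the ordering chosen for the education levels is an assumption for the example.

```python
import pandas as pd

# Categorical-Nominal: a fixed set of labels with no ordering.
country = pd.Categorical(["AU", "CN", "AU"])

# Categorical-Ordinal: the same idea, but the categories are ordered.
education = pd.Categorical(
    ["Bachelor", "High school", "Master"],
    categories=["High school", "Bachelor", "Master"],  # assumed order
    ordered=True,
)

# Numeric-Discrete: countable values, stored as integers.
live_births = pd.Series([0, 2, 1])

# Numeric-Continuous: real-valued measurements, stored as floats.
weight_kg = pd.Series([61.5, 72.3, 80.0])

print(country.ordered)                     # nominal: no order defined
print(education.min(), education.max())    # ordinal: min and max make sense
print(live_births.dtype, weight_kg.dtype)
```

Only the ordered categorical supports comparisons such as `min()` and `max()`, which is the practical difference between nominal and ordinal data.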

Data Visualisation

  • It is often useful to visualise data
    • can sometimes quickly reveal patterns
    • However, going beyond two dimensions is problematic
  • For categorical data, standard visualisations
    • Bar graphs
    • Pie charts
  • For numeric data (continuous and discrete), we can use
    • Histograms
    • Box plots
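Each of these charts is a drawing of a small summary table. As a sketch that avoids any plotting dependency, the code below computes the numbers behind a bar graph, a histogram and a box plot for made-up data; a plotting library such as matplotlib would then draw them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one categorical and one numeric variable.
sex = pd.Series(["female", "male", "male", "female", "male"])
age = pd.Series([22, 25, 31, 38, 47, 52, 29, 35])

# Bar graph / pie chart: one bar or slice per category, sized by its count.
bar_heights = sex.value_counts()

# Histogram: counts of values falling into equal-width bins.
counts, bin_edges = np.histogram(age, bins=3)

# Box plot: drawn from the minimum, quartiles, median and maximum.
five_numbers = age.quantile([0, 0.25, 0.5, 0.75, 1.0])

print(bar_heights)
print(counts, bin_edges)
print(five_numbers)
```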
Quizzes
Week 1 Quiz
  1. Data science is the extraction of knowledge from data, and is a continuation of the fields of data mining and predictive analytics.
    1. Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.
  1. Venn diagram
    1. Hacking skills: needs creativity to prepare data for analysis.
      Math and statistics: to extract insight from data, baseline familiarity with these is required.
      Substantive expertise: what are the goals and constraints of a domain.

      Hacking skills plus substantive expertise (without math and statistics): the "Danger Zone". The result appears to be a legitimate analysis, but without any understanding of how it was reached or how to interpret what was created.
  1. Machine learning:
    1. Definition: machine learning is concerned with the development of algorithms and techniques that allow computers to learn.
    2. Why use machine learning: human expertise is unavailable, many solutions need to be adapted automatically, humans are expensive to use for the work, solutions change over time, and there are large amounts of data.
 
  1. Standard Value Chain:
    1. Collection: getting the data.
    2. Engineering: storage and computational resources across the full lifecycle.
    3. Governance: overall management of data, such as security, across the full lifecycle.
    4. Wrangling: data preprocessing, cleaning.
    5. Analysis: discovery (learning, visualisation, etc.)
    6. Presentation: arguing the case that the results are significant and useful.
    7. Operationalisation: putting the results to work, so as to gain benefits or value.
The Key Difference
  • Engineer = working with data infrastructure & pipelines
    • Ensuring the data is usable (designing schemas, cleaning, transforming, integrating).
    • Lays the groundwork.
  • Analyse = extracting patterns & building intelligence
    • Applying statistics, ML, or AI to interpret data.
    • Generates actionable insights and predictions.
Think of it like building the kitchen vs. cooking a meal:
  • Database schema design = arranging the kitchen, shelves, and utensils so food can be handled properly.
  • Training an ML model = actually cooking the meal from ingredients to deliver value.
 
Week 2 quiz
  1. What is .ipynb: Interactive Python Notebook
  1. >>> df.loc[(df['Sex'] == 'female') & (df['Survived'] == 1)]. What does the above code return:
    1. A table containing all the data for all female passengers that survived.
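The quiz answer can be checked on a tiny stand-in table. The rows below are made up; only the column names `Sex` and `Survived` come from the question.

```python
import pandas as pd

# Hypothetical Titanic-like rows, invented for illustration.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dan"],
    "Sex": ["female", "male", "female", "male"],
    "Survived": [1, 1, 0, 0],
})

# Each comparison produces one True/False value per row.
is_female = df["Sex"] == "female"
survived = df["Survived"] == 1

# & combines the two masks row by row. In the one-line form each
# comparison needs parentheses, because & binds more tightly than ==.
result = df.loc[is_female & survived]
print(result)
```

The result keeps every column and only the rows where both conditions hold, which matches "all the data for all female passengers that survived".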
  1. Open data: publicly available and machine readable. But it is not always usable: it is hard to make sense of the huge amount of government data, and people need the right skills.
  1. API: an API is one of the best ways to gather data for a data science project
    1. API: Application Programming Interface
      An API is like a user interface, but it is designed for computers to access the functionality of a piece of software.
      API consumers and API providers
  1. When we have missing data in a dataset, we could: replace it with a special "unknown" value, replace it with the average value, or remove the row or column in which the missing data resides.
 
Week 3 quiz
df.corr(): check the correlation among variables
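A small sketch of `df.corr()` on made-up study data. The column names and numbers are hypothetical; by default pandas computes the Pearson correlation between every pair of numeric columns.

```python
import pandas as pd

# Hypothetical study data, invented for illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 70, 79],
    "absences": [8, 7, 5, 4, 1],
})

# Square table with one row and one column per variable.
# Values near 1 or -1 mean a strong linear relationship.
corr = df.corr()
print(corr.round(2))
```

The diagonal is always 1 (each variable against itself), and the table is symmetric, so only one triangle needs to be read.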
  1. The issues with data:
    1. misspelling and inconsistency
    2. Irregularities
    3. Integrity constraint violation
    4. duplications
    5. missing values
    6. outliers
 
Week 4
 