Chapter 2: The Data Engineering Lifecycle, the Heart of the Whole Book

The 5 Stages of the Data Engineering Lifecycle

graph LR
    G["1. Generation<br/>Source Systems"] --> S["2. Storage<br/>Underlies every stage"]
    S --> I["3. Ingestion<br/>Batch & Streaming"]
    I --> T["4. Transformation<br/>Adds value to data"]
    T --> V["5. Serving<br/>Analytics, ML, Reverse ETL"]

    U1["Security"] -.-> G
    U2["Data Management"] -.-> G
    U3["DataOps"] -.-> G
    U4["Data Architecture"] -.-> G
    U5["Orchestration"] -.-> G
    U6["Software Engineering"] -.-> G

The 6 undercurrents (Security, Data Management, DataOps, Data Architecture, Orchestration, Software Engineering) run through every stage of the lifecycle, not just a single step.

1. Generation (Source Systems)

The starting point: where data is created. Data engineers do not control source systems, but they need to understand them in depth.

What to evaluate for every source system:

Rate & volume: events/sec, GB/hr
Consistency & errors: nulls, duplicates, bad formatting, late-arriving data
Schema: schemaless (app-defined) or fixed-schema (relational)? Expect schemas to change over time.
Impact of reading: will querying the production source DB for ingestion degrade its performance?
CDC (Change Data Capture) vs snapshots: does the source provide periodic full snapshots or a stream of change events?
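
The consistency checks in the list above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function `profile_batch`, the field names, and the one-hour lateness threshold are all assumptions made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def profile_batch(records, key_field, event_time_field, received_at, max_lateness):
    """Summarize basic quality signals for one batch of source records."""
    null_counts = Counter()
    seen_keys = set()
    duplicates = 0
    late = 0
    for record in records:
        # Count missing values per field.
        for field, value in record.items():
            if value is None:
                null_counts[field] += 1
        # Count records whose key was already seen in this batch.
        key = record.get(key_field)
        if key in seen_keys:
            duplicates += 1
        seen_keys.add(key)
        # Count events that arrived later than the allowed lateness.
        event_time = record.get(event_time_field)
        if event_time is not None and received_at - event_time > max_lateness:
            late += 1
    return {
        "rows": len(records),
        "nulls": dict(null_counts),
        "duplicates": duplicates,
        "late": late,
    }


# Hypothetical sample data: one duplicate key, one null, one late event.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    {"id": 1, "email": "a@example.com", "ts": now - timedelta(minutes=1)},
    {"id": 1, "email": None, "ts": now - timedelta(minutes=2)},
    {"id": 2, "email": "b@example.com", "ts": now - timedelta(hours=3)},
]
report = profile_batch(
    sample, "id", "ts", received_at=now, max_lateness=timedelta(hours=1)
)
```

A report like this, tracked per source over time, gives an early signal when a source system starts to drift.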

2. Storage

Storage runs through every stage of the lifecycle rather than being a single step, and most organizations use several storage solutions at once.

Key considerations:

Data temperature: hot (accessed many times a day), lukewarm (about once a month), cold (archival)
Query capability: pure storage (S3) vs storage with compute (Snowflake)
Schema flexibility: schema-agnostic (object storage) → enforced schema (warehouse)
Metadata capture: investing in metadata (lineage, schema evolution) is investing in the future
Do not commit "unnatural acts", such as random-access updates on object storage

3. Ingestion

The most painful point in the lifecycle: source systems are outside your control, data arrives late, and quality is poor.

Batch vs Streaming:

"All data is inherently streaming": batch is simply processing a stream in chunks
Streaming-first is appealing but has trade-offs: can downstream systems handle the rate? Is it worth the cost?
Do not adopt streaming until a business use case justifies the trade-offs
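
The idea that batch is just a stream processed in chunks can be shown with a small generator. This is a toy sketch: `event_stream` is a hypothetical stand-in for a real source, and a fixed batch size is only one of several ways to cut a stream (time windows are another).

```python
from itertools import islice


def event_stream(n):
    """Stand-in for an unbounded source; here it yields n numbered events."""
    for i in range(n):
        yield {"event_id": i}


def micro_batches(stream, batch_size):
    """Group a stream into fixed-size lists; the last batch may be smaller."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Seven events cut into batches of three.
batches = list(micro_batches(event_stream(7), batch_size=3))
```

The same events flow through either way. What changes is how long each one waits before it is processed, which is the latency trade-off the list above asks you to justify.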

Push vs Pull:

Push: the source system writes data out to a target
Pull: the ingestion system queries the source
In practice, most pipelines mix push and pull across their stages
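
To make the distinction concrete, here is a toy model in which the same source supports both styles. The class `SourceSystem` and its methods are invented for illustration and do not correspond to any real library.

```python
class SourceSystem:
    """Toy source that can be polled (pull) or can notify targets (push)."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._subscribers = []

    def query_since(self, last_id):
        # Pull: the ingestion side asks for rows newer than its cursor.
        return [row for row in self._rows if row["id"] > last_id]

    def subscribe(self, callback):
        # Push: a target registers to be called on every new row.
        self._subscribers.append(callback)

    def write(self, row):
        self._rows.append(row)
        for callback in self._subscribers:
            callback(row)


source = SourceSystem([{"id": 1}, {"id": 2}])

# Pull: ingestion decides when to read and tracks its own position.
pulled = source.query_since(last_id=1)

# Push: the source decides when data moves.
pushed = []
source.subscribe(pushed.append)
source.write({"id": 3})
```

In the pull case the ingestion system owns the schedule and the cursor. In the push case the source owns the timing, so the target must be ready to receive whenever data arrives.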

4. Transformation

The point where data starts to create value: raw data sits inert until it is transformed.

Order of transformations:

Post-ingestion: map types, standardize formats, drop bad records
Schema transformation: normalization, restructuring
Business logic: apply domain rules (profit = revenue - cost - marketing)
Aggregation & featurization: for reporting or ML

Key insight: transformation happens everywhere (in flight, in the warehouse, in the stream), not in a single stage.
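
The ordering listed above can be sketched as a chain of small functions. The field names and sample rows are hypothetical, and the schema-restructuring step is left out to keep the example short. Only the profit rule comes from the text.

```python
from collections import defaultdict


def clean(raw_rows):
    """Post-ingestion: cast types, standardize formats, drop bad records."""
    cleaned = []
    for row in raw_rows:
        try:
            cleaned.append({
                "region": row["region"].strip().upper(),
                "revenue": float(row["revenue"]),
                "cost": float(row["cost"]),
                "marketing": float(row["marketing"]),
            })
        except (KeyError, ValueError, AttributeError, TypeError):
            # A record that cannot be parsed is dropped, not repaired.
            continue
    return cleaned


def add_profit(rows):
    """Business logic: profit = revenue - cost - marketing."""
    return [
        {**row, "profit": row["revenue"] - row["cost"] - row["marketing"]}
        for row in rows
    ]


def profit_by_region(rows):
    """Aggregation: total profit per region for reporting."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["profit"]
    return dict(totals)


raw = [
    {"region": " eu ", "revenue": "100", "cost": "60", "marketing": "10"},
    {"region": "EU", "revenue": "50", "cost": "20", "marketing": "5"},
    {"region": "us", "revenue": "n/a", "cost": "1", "marketing": "1"},  # dropped
]
summary = profit_by_region(add_profit(clean(raw)))
```

Keeping each step as its own function makes it easier to move a step to wherever it runs best, whether that is in flight or inside the warehouse.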

5. Serving Data

The payoff: if nobody uses the data, it is a data vanity project.

3 major use cases:

Analytics: BI (past and present), operational analytics (the present, in real time), embedded analytics (customer-facing, so multitenancy matters a great deal)
ML: feature engineering, feature stores, training pipelines. Do not start on ML until a solid data foundation is in place.
Reverse ETL: sends transformed data back to source systems (CRM, ad platforms). Once considered an antipattern, it has become a new product category.

6 Undercurrents (Currents That Run Through Every Stage)

1. Security

Principle of least privilege: grant only the access that is needed, only for as long as it is needed
The weakest link is people and organizational culture, not technology
Encryption both at rest and in transit, IAM (Identity and Access Management), network security
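
Least privilege combines two ideas: narrow scope and limited duration. The sketch below models both with a deny-by-default check. The `Grant` type and `is_allowed` function are hypothetical and far simpler than a real IAM system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    principal: str
    resource: str
    action: str
    expires_at: datetime


def is_allowed(grants, principal, resource, action, now):
    """Deny by default; allow only an exact, unexpired grant."""
    return any(
        g.principal == principal
        and g.resource == resource
        and g.action == action
        and now < g.expires_at
        for g in grants
    )


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
# Read-only access to one table, valid for eight hours.
grants = [Grant("analyst", "sales_table", "read", now + timedelta(hours=8))]

can_read = is_allowed(grants, "analyst", "sales_table", "read", now)
can_write = is_allowed(grants, "analyst", "sales_table", "write", now)
can_read_tomorrow = is_allowed(
    grants, "analyst", "sales_table", "read", now + timedelta(days=1)
)
```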

2. Data Management

Data Governance: discoverability, accountability, security
4 types of metadata:
- Business metadata: what counts as a "customer"?
- Technical metadata: schema, lineage, pipeline configs
- Operational metadata: job logs, error logs
- Reference metadata: country codes, units of measurement
3 pillars of data quality: accuracy, completeness, timeliness
Master Data Management (MDM): golden records, meaning entity definitions that are consistent across the organization
Data Lineage: an audit trail of where data came from and what it passed through
Ethics & Privacy: masking PII (Personally Identifiable Information), bias tracking, and compliance with GDPR (General Data Protection Regulation, the EU data protection law) and CCPA (California Consumer Privacy Act)
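
One common way to mask PII is to replace a value with a keyed hash, which hides the raw value while still letting records be joined on the masked field. The sketch below uses Python's standard `hmac` and `hashlib` modules. The function names, the sample record, and the hard-coded key are assumptions for illustration, and a real key would come from a secrets manager. Keyed hashing is one technique among several and does not settle GDPR or CCPA compliance by itself.

```python
import hashlib
import hmac


def mask_value(value, secret_key):
    """Return a keyed SHA-256 digest of a string value."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record, pii_fields, secret_key):
    """Mask only the listed fields; leave everything else untouched."""
    return {
        field: mask_value(value, secret_key)
        if field in pii_fields and value is not None
        else value
        for field, value in record.items()
    }


SECRET_KEY = b"demo-key-not-for-production"  # assumption: placeholder key
customer = {"id": 42, "email": "jane@example.com", "country": "TH"}
masked = mask_record(customer, pii_fields={"email"}, secret_key=SECRET_KEY)
```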

3. DataOps

DevOps for data, built on 3 pillars:

Automation: cron → orchestration (Airflow/Dagster) → CI/CD (Continuous Integration/Continuous Deployment) → automated deployment
Observability & Monitoring: "Data is a silent killer". Bad data can sit unnoticed for months.
Incident Response: find problems before the business reports them, and run blameless postmortems
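
A first step toward observability is a check that fails loudly when a table looks wrong. The sketch below tests volume and freshness only. The function `check_table_health` and its thresholds are hypothetical, and real monitoring would also cover schema and value distributions.

```python
from datetime import datetime, timedelta, timezone


def check_table_health(row_count, last_loaded_at, now, min_rows, max_staleness):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if row_count < min_rows:
        problems.append(f"volume: {row_count} rows, expected at least {min_rows}")
    if now - last_loaded_at > max_staleness:
        problems.append("freshness: last load is older than the allowed staleness")
    return problems


now = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)

# A table that is both too small and too old.
problems = check_table_health(
    row_count=120,
    last_loaded_at=now - timedelta(hours=30),
    now=now,
    min_rows=1000,
    max_staleness=timedelta(hours=24),
)

# A table that passes both checks.
healthy = check_table_health(
    row_count=5000,
    last_loaded_at=now - timedelta(hours=2),
    now=now,
    min_rows=1000,
    max_staleness=timedelta(hours=24),
)
```

Running a check like this after every load is what turns silent bad data into an alert the data team sees before the business does.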

4. Data Architecture

Understand business needs → translate them into requirements → design systems that balance cost, simplicity, and scale

5. Orchestration

More than a scheduler: orchestration understands dependencies through DAGs (Directed Acyclic Graphs, cycle-free graphs that define task order)

Kick off tasks when upstream tasks complete, not at a fixed time
Monitor external systems, set error conditions, send alerts
Backfill historical runs
Orchestration is a batch concept. Streaming has its own streaming DAGs, which are harder to build.
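
The difference between a scheduler and an orchestrator shows up in how tasks get started. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) to release a task only once everything upstream has finished. The task names and the `run_dag` helper are made up for the example, and real orchestrators such as Airflow add retries, alerting, and backfills on top of this core idea.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks it depends on.
dag = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "transform_sales": {"ingest_orders", "ingest_customers"},
    "serve_dashboard": {"transform_sales"},
}


def run_dag(dag, run_task):
    """Run tasks in dependency order and return the execution order."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        # get_ready() returns only tasks whose upstream tasks are all done.
        for task in sorter.get_ready():
            run_task(task)
            order.append(task)
            sorter.done(task)
    return order


executed = run_dag(dag, run_task=lambda name: None)
```

No task here has a start time. `transform_sales` runs when both ingestion tasks are done, however long they take.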

6. Software Engineering

Despite rising abstraction, data engineers still need to write code. Only the kind of code has changed:

Core data processing (Spark, SQL, Beam)
Open source framework contribution
Streaming: inherently more complex than batch
Infrastructure as code (Terraform, Helm)
Pipelines as code (Airflow DAGs)
General problem solving: when a tool has no connector, you write a custom one
