Instructure Developer Documentation Portal
Community
  • Introduction
  • Services
    • Elevate Standards Alignment - AB Connect API
      • Introduction
        • Authentication
        • Addressing Object Properties
        • Requesting Additional Properties in the Response
        • Filtering Using ODATA Like Statements
        • Sorting
        • Facets
        • Paging Data
        • Call Throttling
        • Working with Related Object
        • Error Responses
        • Character Set Support
        • How To Articles, Recommendations and Suggestions
        • Examples
        • Using AB Connect's Embeddable Widgets
      • Reference
        • Standards
        • Standard Collections
        • Events
        • Topics
        • Concepts
        • Assets
        • Asset Definitions
        • Asset Collections
        • Managing and Predicting Relationships
        • Providers
    • Canvas LMS
      • Basics
        • GraphQL
        • API Change Log
        • SIS IDs
        • Pagination
        • Throttling
        • Compound Documents
        • File Uploads
        • API Endpoint Attributes
        • Masquerading
      • OAuth2
        • OAuth2 Overview
        • OAuth2 Endpoints
        • Developer Keys
      • Resources
        • Access Tokens
        • Account Calendars
        • Account Domain Lookups
        • Account Notifications
        • Account Reports
        • Accounts
        • Accounts (LTI)
        • Admins
        • Analytics
        • Announcement External Feeds
        • Announcements
        • API Token Scopes
        • Appointment Groups
        • Assignment Extensions
        • Assignment Groups
        • Assignments
        • Authentication Providers
        • Authentications Log
        • Blackout Dates
        • BlockEditorTemplate
        • Blueprint Courses
        • Bookmarks
        • Brand Configs
        • Calendar Events
        • Collaborations
        • CommMessages
        • Communication Channels
        • Conferences
        • Content Exports
        • Content Migrations
        • Content Security Policy Settings
        • Content Shares
        • Conversations
        • Course Audit log
        • Course Pace
        • Course Quiz Extensions
        • Course Reports
        • Courses
        • Custom Gradebook Columns
        • Developer Key Account Bindings
        • Developer Keys
        • Discussion Topics
        • Enrollment Terms
        • Enrollments
        • ePortfolios
        • ePub Exports
        • Error Reports
        • External Tools
        • Favorites
        • Feature Flags
        • Files
        • Grade Change Log
        • Gradebook History
        • Grading Period Sets
        • Grading Periods
        • Grading Standards
        • Group Categories
        • Groups
        • History
        • InstAccess tokens
        • JWTs
        • Late Policy
        • Learning Object Dates
        • Line Items
        • LiveAssessments
        • Logins
        • LTI Launch Definitions
        • LTI Registrations
        • LTI Resource Links
        • Media Objects
        • Moderated Grading
        • Modules
        • Names and Role
        • New Quiz Items
        • New Quizzes
        • New Quizzes Accommodations
        • New Quizzes Reports
        • Notification Preferences
        • Originality Reports
        • Outcome Groups
        • Outcome Imports
        • Outcome Results
        • Outcomes
        • Pages
        • Peer Reviews
        • Planner
        • Poll Sessions
        • PollChoices
        • Polls
        • PollSubmissions
        • Proficiency Ratings
        • Progress
        • Public JWK
        • Quiz Assignment Overrides
        • Quiz Extensions
        • Quiz IP Filters
        • Quiz Question Groups
        • Quiz Questions
        • Quiz Reports
        • Quiz Statistics
        • Quiz Submission Events
        • Quiz Submission Files
        • Quiz Submission Questions
        • Quiz Submission User List
        • Quiz Submissions
        • Quizzes
        • Result
        • Roles
        • Rubrics
        • Sandboxes
        • Score
        • Search
        • Sections
        • Services
        • Shared Brand Configs
        • SIS Import Errors
        • SIS Imports
        • SIS Integration
        • Smart Search
        • Submission Comments
        • Submissions
        • Tabs
        • Temporary Enrollment Pairings
        • User Observees
        • Users
        • What If Grades
      • Outcomes
        • Outcomes CSV Format
      • Group Categories
        • Group Categories CSV Format
      • SIS
        • SIS CSV Format
      • External Tools
        • LTI
          • Introduction
          • Registration
          • Launch Overview
          • Configuring
          • Variable Substitutions
          • Deep Linking
          • Grading
          • Provisioning
          • PostMessage
          • Platform Notification Service
          • Placements
            • Placements Overview
            • Navigation
            • Homework Submission
            • Editor Button
            • Migration Selection
            • Link Selection (Modules)
            • Assignment Selection
            • Collaborations
        • xAPI
        • Canvas Roles
        • Plagiarism Detection Platform
          • Overview
          • Plagiarism Detection Platform Assignments
          • Plagiarism Detection Platform Users
          • Plagiarism Detection Submissions
          • Webhooks Subscriptions for Plagiarism Platform
          • JWT Access Tokens
      • Data Services
        • Live Events
          • Overview
            • Introduction
            • Setup
            • Caliper
            • Metadata
          • Event Format
            • Canvas
              • Account
              • Asset
              • Assignment
              • Attachment
              • Content
              • Conversation
              • Course
              • Discussion
              • Enrollment
              • Grade
              • Group
              • Learning
              • Logged
              • Module
              • Outcome
              • Outcomes
              • Plagiarism
              • Quiz
              • Rubric
              • Sis
              • Submission
              • Syllabus
              • User
              • Wiki
            • Caliper IMS 1.1
              • Assessment
              • Basic
              • Forum
              • Grading
              • Navigation Events
              • Session
    • Catalog
      • APIs
        • Analytics
        • Bulk Enrollments
        • Catalogs
        • Certificates
        • Completed Certificates
        • Courses
        • Email Domain Set
        • Enrollments
        • Orders
        • Programs
        • Progresses
        • Tags
        • User Registrations
        • Users
        • Waitlist Applicants
    • Credentials
      • Getting Started
      • Authentication
        • Password-Based Authentication
        • Authorization Code-Based Authentication
      • Pagination
      • APIs
        • Assertions
        • Backpack
        • Badgeclasses
        • Issuers
        • Organizations
        • Users
      • Release Notes
    • Data Access Platform
      • Key Concepts
      • Data Formats
      • Rate Limits & Policies
      • Datasets
        • Namespaces
          • canvas
            • canvas types
          • canvas_logs
          • catalog
        • Additional Notes
        • Entity Relationship Diagram
      • Query API
        • Authentication
        • Reference
      • Command Line (DAP CLI)
        • Getting Started
        • Secure Connection
        • Reference
          • dap snapshot
          • dap incremental
          • dap list
          • dap schema
          • dap initdb
          • dap syncdb
          • dap dropdb
      • Client Library
        • Examples
        • Reference
      • Release Notes
      • Status
    • DataSync
      • Interop API
      • Interop Data API
      • Grades Exchange API
      • OneRoster API
      • Platform API
    • Instructure Media
      • API Reference
        • Captions
        • Collection
        • Courses
        • Group
        • Insights
        • Media
        • Media Upload
        • Ping
        • Professional Captioning
        • Tags
        • Transfer Media
        • User
    • Quizzes
      • Quiz API
Powered by GitBook

Copyright © 2008-2024 Instructure, Inc. All rights reserved. Various trademarks held by their respective owners.

On this page
  • Text (TSV)
  • Comma-separated values (CSV)
  • JSON Lines
  • Parquet
  • Metadata
  • Action
  • Timestamp
  • Format transformations

Was this helpful?

  1. Services
  2. Data Access Platform

Data Formats

Understand how data is structured, transformed.

PreviousKey ConceptsNextRate Limits & Policies

Last updated 4 months ago

Was this helpful?

Text (TSV)

Text format is a simple tabular format in which each record (table row) occupies a single line.

  • Output always begins with a header row, which lists all metadata and data field names.

  • Fields (table columns) are delimited by tab characters.

  • Non-printable characters and special values are escaped with backslash (\), as shown below:

Escape
Interpretation

NULL value

\b

Backspace (ASCII 8)

\f

Form feed (ASCII 12)

Newline (ASCII 10)

Carriage return (ASCII 13)

Tab (ASCII 9)

\v

Vertical tab (ASCII 11)

\\

Backslash (single character)

This format allows data to be easily imported into a database engine, e.g. with PostgreSQL .

Output in this format is transmitted as media type text/plain in UTF-8 encoding.

Comma-separated values (CSV)

Comma-separated values (CSV) output follows with a few extensions:

  • Output always begins with a header row, which lists all metadata and data field names.

  • Strings are quoted with double quotes (") if they contain special characters such as the double quote itself, the comma delimiter, a newline, a carriage return, a tab character, etc.

  • Empty strings are always represented as "" to avoid ambiguity with missing values.

  • Missing values (a.k.a. NULL) are represented with no data (no characters between delimiters).

  • Each row has the same number of fields.

These extensions allow differentiating empty strings ("") from missing values (a.k.a. NULL, represented as no data), for which RFC 4180 defines no rules. If a field is missing, the comma separators are still included, i.e. multiple comma separators may follow one another in a row if there is no data in subsequent fields.

Double quotes act as escape sequences inside a quoted string. If there are two consecutive double quote characters (i.e. ""), the sequence is interpreted as a single double quote character ("). If a string contains newline or carriage return characters, they are emitted verbatim (in compliance with RFC 4180). As such, a record may be broken into several lines if the data contains newlines. (Some applications might not interpret these flawlessly, double-check your integration when you deal with CSV files.)

The following example demonstrates some of the above:

meta.action,key.pkey,value.prop1,value.prop2
U,1,a string,42
U,2,"a string, but in ""quotes"".",
D,3,,
U,4,"a multi-line
string",

Output in this format is transmitted as media type text/csv in UTF-8 encoding.

JSON Lines

Output in this format is transmitted as media type application/jsonlines in UTF-8 encoding.

Please note that JSONL skip values if they have value NULL.

Parquet

Parquet is a columnar storage format that provides efficient data compression and encoding schemes. It is highly optimized for querying large datasets and widely supported.

  • Data Structure:

    • Parquet format stores data in a binary columnar format.

    • It supports complex nested structures, including arrays and maps, efficiently storing schema and data side-by-side.

    • Parquet is self-describing, meaning the schema of the data is embedded in the file, eliminating the need for separate metadata files.

  • File Header:

    • Parquet files do not have a simple header row as seen in TSV or CSV formats. Instead, each file contains the schema embedded at the top.

    • The schema defines the column names, data types, and structural relationships in a format that can be read by various systems (e.g., Apache Hive, Apache Spark).

  • Compression:

    • Data in Parquet files is compressed, and the format supports various compression methods, including Snappy, Gzip, and LZO.

    • Parquet files are compressed in a columnar manner, meaning that each column's values are compressed separately, which improves compression rates.

  • Field Handling:

    • Parquet supports nullable fields, and NULL values are encoded according to the specified data type.

    • Fields in Parquet are stored using their native data type. For instance, strings are stored as UTF-8 encoded, while integers are stored as binary encoded integers.

  • Use Case:

    • Parquet is ideal for handling large datasets, particularly for analytics workloads in big data environments.

    • It is commonly used in conjunction with distributed processing systems such as Apache Spark and Apache Drill.

  • Media Type:

    • Parquet files are transmitted with the media type application/parquet when sent over HTTP, and they are typically stored in file systems that support columnar data formats, such as HDFS or AWS S3.

  • Example: Suppose a Parquet file contains the following dataset:

    pkey
    prop1
    prop2

    1

    value1

    42

    2

    value2

    NULL

    In Parquet, the schema and data are both embedded. The schema for this file would indicate that pkey is an integer, prop1 is a string, and prop2 is an integer or null. The data itself would be stored in a columnar format for efficient reading.

  • NULL Handling:

    • Parquet uses nullable columns for missing data and stores them efficiently.

Metadata

Output of DAP API may include record-level metadata in addition to table data.

In tabular formats (such as text and CSV), metadata are included in the output as additional columns. Consider the following example:

meta.action,key.pkey,value.prop1,value.prop2
U,1,"value1",42
U,2,"value2",NULL
D,3,,

This CSV output has a metadata section (meta), a primary key section (key) and a record value section (value). The metadata section contains a single field called action. The key and value sections comprise of several fields: pkey, prop1 and prop2.

In the JSON Lines format, metadata, key and value sections are top-level properties meta, key and value, and have properties of their own:

{ "meta": { "action": "U", ... }, "key": { "pkey": 1 }, "value": { "prop1": "value1", "prop2": 42 } }
{ "meta": { "action": "U", ... }, "key": { "pkey": 2 }, "value": { "prop1": "value2", "prop2": null } }
{ "meta": { "action": "D", ... }, "key": { "pkey": 3 } }

The set of metadata fields returned depends on the context. Some contexts may produce fields that other contexts do not. If output would contain no metadata fields, the section is omitted entirely.

Action

The metadata field action identifies whether a record is upserted (inserted or updated) or (hard) deleted for an incremental query. In the result of a snapshot query, all records are to be understood as upserted.

  • Upserted records (denoted by U) have all fields present in the data.

  • Deleted records (denoted by D) only have the primary key field in their data, other field values are missing.

Occasionally, the term soft delete is used, which in this context is equivalent to an update, and is denoted with a U, and all field values are included in the output.

Timestamp

The metadata field ts indicates when a record was last updated in the underlying transactional data lake table. For an incremental query with since and until timestamp parameters, ts for all returned records is always strictly greater than since, and always less than or equal to until.

The timestamp may correlate to but does not correspond to the real time when the event took place (e.g. when a student enrolled to a course). If you need to know when the event happened, use the timestamp embedded in the data. Specifically, many tables have timestamp data columns such as created_at or updated_at, which are controlled by the product or application that generates the event (e.g. Canvas).

Format transformations

Tabular data formats such as CSV cannot capture the hierarchy that JSON can represent easily. Nested JSON objects are flattened before they are included in the output. For example, consider the JSON data:

{
    "id": 1,
    "question": {
        "headline": "title",
        "text": "some text"
    },
    "answers": [
        { "answer": "A", "score": 0 },
        { "answer": "B", "score": 1 },
        { "answer": "C", "score": 0 }
    ]
}

Here, the property question with two fixed sub-properties can be flattened into CSV columns question.headline and question.text. However, the property answers cannot be flattened because the list has an indeterminate cardinality. Items with indeterminate cardinality are transmitted as a JSON string. (Cardinality check is performed on the data (JSON) schema, not the actual data.)

This is how text output would look like after flattening (tabs are shown as four spaces):

data.id    data.question.headline    data.question.text    data.answers
1    title    some text    [{"answer":"A","score":0},{"answer":"B","score":1},{"answer":"C","score":0}]

In a similar fashion, this is how CSV output would look after flattening:

data.id,data.question.headline,data.question.text,data.answers
1,title,some text,"[{""answer"":""A"",""score"":0},{""answer"":""B"",""score"":1},{""answer"":""C"",""score"":0}]"

If you wish to avoid format transformations entirely, use the JSON Lines data format.

When the output data is represented in the format, each record (table row) occupies a single line. Each line is a JSON object, which can be validated against the JSON schema returned by DAP API.

Timestamps are stored in fields of JSON type string, are formatted as per ISO-8601, and are to be understood as in time zone UTC. This is aligned with how timestamps are represented in the OpenAPI format date-time as per .

COPY
RFC 4180
JSON Lines
RFC 3339