Crawldata Schema

Introduction

The crawldata database lives on our MongoDB server and stores all the data collected from different websites. We create a new collection for each combination of report ID and history ID, named using the convention data_<report_id>_<history_id>. Within each collection, we use the following short key mappings to reduce data storage.
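As a minimal sketch of the naming convention, the helper below builds a collection name from the two IDs. The function name collection_name is hypothetical, and the sketch assumes the IDs are inserted as-is with no padding or escaping.

```python
def collection_name(report_id, history_id):
    """Build a name following the data_<report_id>_<history_id> convention.

    Hypothetical helper: assumes both IDs are used verbatim.
    """
    return f"data_{report_id}_{history_id}"


if __name__ == "__main__":
    print(collection_name(12, 3))
```

With a driver such as pymongo, the resulting string would typically be used as the key when looking up the collection on the crawldata database object.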

Schema

Here is the JSON structure:

{
   "id" : Index ID,
   "t" : Type ID, # 0 = ROWS, 1 = PAGE INFO, 2 = JSON
   "p" : Page ID,
   "n" : Page Name,
   "h" : History ID,
   "d" : Data,
   "o" : Object,
   "z" : TimeStamp
}
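When reading documents back, it can help to expand the short keys into readable names. The sketch below is one way to do that; the long names (index_id, type_id, and so on) and the function expand_keys are illustrative choices, not part of the stored schema.

```python
# Short key -> descriptive name, following the structure listed above.
# The long names are hypothetical and chosen only for readability.
SHORT_TO_LONG = {
    "id": "index_id",
    "t": "type_id",
    "p": "page_id",
    "n": "page_name",
    "h": "history_id",
    "d": "data",
    "o": "object",
    "z": "timestamp",
}


def expand_keys(doc):
    """Return a copy of doc with known short keys replaced by long names.

    Unknown keys (for example MongoDB's own _id) are kept unchanged.
    """
    return {SHORT_TO_LONG.get(key, key): value for key, value in doc.items()}
```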

Attributes:

Type ID: It represents the type of data the collection stores.

Note

0 represents row-structured data.
1 represents page information (i.e., what the documents in the collection mean).
2 represents JSON data.
A collection can contain either JSON-structured data or row-structured data, not both.
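The type IDs can be captured as named constants, and the rule that a collection holds either row data or JSON data can be checked over a batch of documents. This is a sketch only: TypeId and has_mixed_data_types are hypothetical names, and the check assumes each document carries its type under the "t" key.

```python
from enum import IntEnum


class TypeId(IntEnum):
    """Values of the "t" key, as listed in the note above."""
    ROWS = 0
    PAGE_INFO = 1
    JSON = 2


def has_mixed_data_types(docs):
    """Return True if docs contain both ROWS and JSON documents.

    PAGE_INFO documents are ignored, since they describe the collection
    rather than hold crawled data.
    """
    seen = {TypeId(doc["t"]) for doc in docs}
    return TypeId.ROWS in seen and TypeId.JSON in seen
```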

Page ID: The page ID is the SHA1 hash of the page name. It is used to count the number of records in the document.
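A page ID of this kind can be computed with Python's standard hashlib module. The sketch below assumes the page name is encoded as UTF-8 and that the hash is stored as a lowercase hex string; neither detail is stated here, so check them against existing data before relying on this. The function name page_id is hypothetical.

```python
import hashlib


def page_id(page_name):
    """Return the SHA1 hex digest of a page name.

    Assumptions: UTF-8 encoding and hex-string output.
    """
    return hashlib.sha1(page_name.encode("utf-8")).hexdigest()
```

Because the hash is deterministic, the same page name always maps to the same page ID.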
Data: The data can be JSON-structured data or rows. If it contains JSON data, it is stored in the document as an array of objects with the keys 'k' and 'v', which represent 'key' and 'value'.
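The 'k'/'v' layout for JSON data can be illustrated by converting a flat dictionary to that form and back. This sketch assumes a flat record with unique keys; the function names are hypothetical, and nested values are passed through unchanged.

```python
def to_kv_pairs(record):
    """Convert a flat dict into a list of {"k": ..., "v": ...} objects."""
    return [{"k": key, "v": value} for key, value in record.items()]


def from_kv_pairs(pairs):
    """Rebuild a dict from a list of {"k": ..., "v": ...} objects."""
    return {pair["k"]: pair["v"] for pair in pairs}


if __name__ == "__main__":
    stored = to_kv_pairs({"title": "Example", "price": 10})
    print(stored)
    print(from_kv_pairs(stored))
```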