This is the test script for [DERA - ORDS Data Development / DODD-169] https://jira/browse/DODD-169
NCEN - HIVE only - Develop Automated Test to Verify all data points

#Issue description
The Oozie workflow takes multiple XML files and converts them into an AVRO file. Hive creates an external table to access the AVRO file.
The purpose of this ticket is to make sure that nothing is added and nothing is lost in the ETL process.
The comparison is done between the XML files (600+ files as of 4/5/2018) and the AVRO query result on headerdata and formdata. All of the following rules must hold.
1. All information in headerdata and formdata in the XML files is in the query result on AVRO.
2. The XML hierarchy is the same as that of the query result.
3. All numeric types in XML appear in double quotes in the query result.
4. Order of siblings doesn't matter.
5. "null" or "[]" values in the query result can be ignored.
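
Rules 3 to 5 can be expressed as a normalization step applied to both sides before comparing. The following is a minimal sketch only, not the code of XML2JsonComparison.py. The function names `normalize` and `same_content` are hypothetical, and it assumes both sides have already been parsed into Python dicts and lists and that "null" arrives as a JSON null.

```python
import json


def normalize(value):
    """Return a canonical form of a parsed JSON value (rules 3, 4 and 5)."""
    if isinstance(value, dict):
        cleaned = {}
        for key, child in value.items():
            norm = normalize(child)
            # Rule 5: null and [] values are ignored.
            if norm is None or norm == []:
                continue
            cleaned[key] = norm
        return cleaned
    if isinstance(value, list):
        items = [normalize(item) for item in value]
        items = [item for item in items if item is not None and item != []]
        # Rule 4: sibling order does not matter, so sort by serialized form.
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    if isinstance(value, bool):
        return str(value).lower()
    if isinstance(value, (int, float)):
        # Rule 3: numbers are compared as quoted strings.
        return str(value)
    return value


def same_content(xml_json, avro_json):
    """True when both structures carry the same data after normalization."""
    return normalize(xml_json) == normalize(avro_json)
```

Rules 1 and 2 are then covered by the equality check itself, because nested dicts only compare equal when every key and every level of the hierarchy matches.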


#Solution (Python)
1. Convert the XML files into JSON files using the tool from [Convert XML to JSON]
https://github.com/bojanbjelic?tab=overview&from=2017-12-01&to=2017-12-31
2. Combine all JSON files into one file and normalize the format (EOL, spaces, case, order).
3. Compare the combined file with the AVRO query result.
4. Generate a comparison report of the differences.
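
For illustration, step 1 can be sketched with the Python standard library alone. This is a stand-in for the referenced conversion tool, not a copy of it, and its output layout may differ from what that tool produces. The helper names and the "text" key used for element text next to attributes are assumptions.

```python
import json
import xml.etree.ElementTree as ET


def strip_namespace(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag name."""
    return tag.split("}", 1)[-1]


def element_to_dict(element):
    """Convert an element into nested dicts, lists and strings."""
    children = list(element)
    text = (element.text or "").strip()
    if not children and not element.attrib:
        return text
    result = {}
    for name, value in element.attrib.items():
        result[strip_namespace(name)] = value
    if text and not children:
        # Hypothetical key for text that sits next to attributes.
        result["text"] = text
    for child in children:
        key = strip_namespace(child.tag)
        value = element_to_dict(child)
        if key in result:
            # Repeated siblings are collected into a list.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
    return result


def xml_to_json(xml_text):
    """Serialize one XML document as a single-line JSON string."""
    root = ET.fromstring(xml_text)
    data = {strip_namespace(root.tag): element_to_dict(root)}
    return json.dumps(data, sort_keys=True)
```

Writing one JSON string per XML file, one per line, gives a combined file whose line order can follow xmlFileList.txt.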

#Test done
The following tests can be reproduced by creating an external table in Hive from the .avro file.
The source XML files are checked in to the subdirectories.
1. Three sample files
The schema and the sample JSON converted from AVRO for the 3 sample files (which came with the XSD) are located at:
https://mdc-sec-derasas:18080/svn/nport-ncen-artifacts/branches/dev/Schema%20-%203-29/updates/EDGAR%20Form%20N-CEN%20XML%20schema%20files/.
2. 680 files in dev
https://mdc-sec-derasas:18080/svn/nport-ncen-artifacts/trunk/ncen_test_scripts/ncen_structured/xmlfiles


#Prerequisites:
Due to limitations of the tools available, the full test needs to run Hive on a Linux server and Python 3.5 on Windows.
Assumptions
* A database named "ncen_1211_test" is created in Hive, and the AVRO external table is "ncen_filer_comparison" on the EDW-DEV server.
* Test XML files are transferred by FTP to the ..xmlfiles directory on Windows.
* The following files are under the ..testscripts directory

If the test environment is different, please update the related names.

#Actions before running this script:
step1: $ hive -e "use ncen_1211_test; select doccode from ncen_filer_comparison order by doccode;" > xmlFileList.txt
Each line is read to get the XML filename.
step2: $ hive -e "use ncen_1211_test; select doccode, headerdata, formdata from ncen_filer_comparison order by doccode;" > avro_queryOrdered.txt
step3: FTP all XML source files to the ..xmlfiles subdirectory and avro_queryOrdered.txt to the Windows testing directory.
step4: Run the Python script (CLI command: "python XML2JsonComparison.py xmlFileList.txt xmlfiles").
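
Because both queries use "order by doccode", line N of xmlFileList.txt and line N of avro_queryOrdered.txt should refer to the same document. The sketch below checks that alignment before any content comparison. It assumes the Hive CLI wrote one row per line with tab-separated columns, which is the usual default, and that headerdata contains no literal tab. All function names are hypothetical.

```python
def parse_file_list(lines):
    """Return the non-empty doccode entries from xmlFileList.txt lines."""
    return [line.strip() for line in lines if line.strip()]


def parse_query_rows(lines):
    """Split each avro_queryOrdered.txt line into (doccode, headerdata, formdata)."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Assumption: columns are tab-separated; split at the first two tabs only.
        doccode, headerdata, formdata = line.split("\t", 2)
        rows.append((doccode, headerdata, formdata))
    return rows


def misaligned_lines(file_list, rows):
    """Return 1-based line numbers where the two files disagree on doccode."""
    problems = []
    for index in range(max(len(file_list), len(rows))):
        expected = file_list[index] if index < len(file_list) else None
        actual = rows[index][0] if index < len(rows) else None
        if expected != actual:
            problems.append(index + 1)
    return problems
```

Both parsers accept any iterable of lines, so an open file handle can be passed directly, for example `parse_file_list(open("xmlFileList.txt"))`.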

# Result and interpretation
The last line of the output gives the result as true/false. True means the files are identical.
Any file comparison tool, such as diff or WinMerge, can be used to view the details of the differences. The line numbers match the file order in xmlFileList.txt.
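
Since line numbers follow the order in xmlFileList.txt, a differing line can be mapped back to its source file. The following sketch shows one way to do that. It is not the report format of XML2JsonComparison.py, and the function names and message wording are assumptions.

```python
from itertools import zip_longest


def first_difference(left, right):
    """Return the 0-based column of the first differing character, or -1."""
    for column, (a, b) in enumerate(zip(left, right)):
        if a != b:
            return column
    if len(left) != len(right):
        return min(len(left), len(right))
    return -1


def compare_lines(xml_lines, avro_lines, file_list):
    """List the lines that differ, labeled with the matching file name."""
    report = []
    rows = zip_longest(xml_lines, avro_lines, file_list)
    for number, (left, right, name) in enumerate(rows, start=1):
        if left == right:
            continue
        if left is None or right is None:
            detail = "line missing on one side"
        else:
            column = first_difference(left, right)
            detail = "first difference at column {}".format(column)
        report.append("line {} ({}): {}".format(number, name, detail))
    return report
```

An empty report corresponds to the True result described above.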


#Limitations and enhancement
Attributes are hardcoded in the script; they could instead be generated from the XML schema file (XSD file).
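
As a rough sketch of that enhancement, attribute names can be collected from an XSD with the standard library. It assumes the schema declares attributes with named xs:attribute elements in the standard XML Schema namespace. The function name `attribute_names` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard XML Schema namespace, in ElementTree's "{uri}" notation.
XSD_NS = "{http://www.w3.org/2001/XMLSchema}"


def attribute_names(xsd_text):
    """Return the sorted, de-duplicated names of all declared attributes."""
    root = ET.fromstring(xsd_text)
    names = set()
    for node in root.iter(XSD_NS + "attribute"):
        name = node.get("name")
        # Declarations that only use ref="..." carry no name and are skipped.
        if name:
            names.add(name)
    return sorted(names)
```

The returned list could replace the hardcoded attribute list, provided the schema files checked in with the sample files cover every attribute used in the XML.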
     
 