Reporting ETL and OLAP Project
Contents |
Intern: Gstrak
Mentor (Primary): Michael Seaton
Mentor (Backup): Justin Miranda
[edit]
Background
[edit]
Goal
The goal of this project is to build ETL functionality into OpenMRS in order to support OLAP analysis.
[edit]
Use Cases
- TBD
[edit]
Design
[edit]
Definitions
- OLAP
- OLAP stands for Online Analytical Processing. OLAP generally refers to an approach of finding answers to questions that deal with multiple dimensions. For instance, how many patients were enrolled at a particular clinic in a particular year. Answering that question with a conventional, normalized database system would be complex and slow. Using OLAP, the answer would usually be pre-generated and waiting to be retrieved.
- OLAP System
- An OLAP system is a database that stores read-only, aggregated data. An OLAP system usually consists of a denormalized schema (star schema) to improve the speed of reporting. Any database can be used to create an OLAP system.
- Star Schema
- A star schema refers to the de-normalized data model used to store the aggregated data for an OLAP system. A star schema has a central fact table and outer dimension tables to help make it easier (and faster) to retrieve answers to complex questions.
- Data Warehouse
- A data warehouse is a database that stores an organization's historical, aggregated data. A data warehouse is basically an implementation of an OLAP system, usually with an added layer of cube structures that helps dimensionalize the data into "cubes". Mondrian is an example of a data warehouse tool.
- ETL
- ETL stands for Extraction, Transformation, and Load. This term refers to the manner in which a system would move data from a conventional, normalized schema to a de-normalized star schema. However, this term is loosely used for functionality that takes data from one data store, transforms it to create a new data structure, and stores the transformed data into a schema based on the new data structure.
[edit]
Requirements
- Define a star schema to answer a set of general questions
- Analyze reporting requirements
- Find general pattern of reporting questions
- Implement star schema
- Install MySQL
- Configure MySQL as a decision support / OLAP system
- Design star schema as a denormalized SQL schema
- Implement ETL functionality
- Install Pentaho's KETTLE
- Work with KETTLE Spoon to determine the types of ETL functionality available
- Implement functionality in OpenMRS module using KETTLE API
- Implement ETL scripts to move data from the OpenMRS to OLAP system
- Write scripts that will perform ETL functionality for
- Implement core reporting code within OpenMRS that can use KETTLE module
- Define a new interface or class that will use ETL functionality (IDataSet <-- TransformDataSet?)
- Implement data warehouse / OLAP system
- Install Mondrian
- Design cubes (Pentaho has a cube designer) based on reporting and analysis needs
- Define MDX queries to access data and answer questions
[edit]
Installation Guide
- Install OpenMRS
- You can find instructions for this at Step-by-Step_Installation_for_Implementers and Step-by-Step_Installation_for_Developers.
- Install Ant
- Download and install Ant from http://ant.apache.org/. You can find installation instructions at http://ant.apache.org/manual/install.html.
- Install MySQL databases (schemas)
- Download pentaho-openmrs.zip and unzip it. In the pentaho-openmrs directory that is created, you will find scripts and configuration files to customize Pentaho for MySQL and OpenMRS.
- To create the schema for the reporting database and others needed by Pentaho components execute the following command, from the pentaho-openmrs directory. Before executing, replace "password" with your own MySQL root user password.
- > mysql -u root -ppassword mysql < openmrs_pentaho_create_db.sql
- The above script will also create a user called "test", that is used to connect to the schemas.
- To create the database objects (tables, indexes, e.t.c.) and load test data execute the following commands:
- For the target database:
- > mysql -u test -ptest openmrs_dw < openmrs_dw.sql
- > mysql -u test -ptest openmrs_dw < etl_control.sql
- > mysql -u test -ptest openmrs_dw < etl_extract.sql
- For the ETL tool (Kettle) design repository
- > mysql -u test -ptest OPENMRS_REP
- mysql> source openmrs_rep.sql
- mysql> call p_repository_table_names_to_upper;
- For the other databases
- > mysql -u test -ptest hibernate < pentaho_hibernate.sql
- > mysql -u test -ptest quartz < pentaho_quartz.sql
- > mysql -u test -ptest shark < pentaho_shark.sql
- Install Pentaho BI Server
- Download the latest stable release pentaho_j2ee_deployments-1.2.1.625-GA.zip from sourceforge (http://downloads.sourceforge.net/pentaho/pentaho_j2ee_deployments-1.2.1.625-GA.zip?modtime=1178540158&big_mirror=0) under Business Intelligence Suite, 1.2.1 (GA).
- Unzip pentaho_j2ee_deployments-1.2.1.625-GA.zip to a directory of your choice, say [pentaho_home].
- Copy from the pentaho-openmrs directory:
- images directory to [pentaho_home]/pentaho-webapp
- META-INF directory to [pentaho_home]/pentaho-webapp
- If the MySQL port is not the default (3306) or MySQL is installed in a different machine make the corresponding changes to the [pentaho_home]/pentaho-webapp/META-INF/context.xml file for your data sources.
- Replace the [pentaho_home]/pentaho-webapp/jsp/Home.jsp file with the pentaho-openmrs/Home.jsp.
- Replace the [pentaho_home]/pentaho-res/hibernate/mysql5/hibernate.cfg.xml file with the pentaho-openmrs/hibernate.cfg.xml.
- From [pentaho_home] execute the following command:
- > ant war-pentaho-tomcat-mysql
- Using the Tomcat Manager web application go to the "Select WAR file to upload" field.
- Browse to [pentaho_home]/build/pentaho-wars/tomcat/mysql5/ and deploy the pentaho.war file.
- Browse to [pentaho_home]/build/pentaho-wars/ and deploy the pentaho-style.war file.
- Modify the following entries of the web.xml file of the pentaho web application, for your environment. The web.xml file is located under [Tomcat_Home]/webapps/pentaho/WEB-INF/.
#*:<context-param>- <param-name>base-url</param-name>
- <param-value>http://localhost:8080/pentaho/</param-value>
- </context-param>
- code>
- The above url will be used as the base url for the Pentaho web application.
- Download and unzip the pentaho-solutions.zip.
- Copy the pentaho-solutions directory next to the [Tomcat_Home] directory (they should be siblings).
- You may copy it to a directory of your choice, but change the following entry in the web.xml file:
#*:<context-param>- <param-name>solution-path</param-name>
- <param-value>pentaho-solutions</param-value>
- </context-param>
- code>
- Set up your data source references. In the web.xml add the following entry
#*:<resource-ref>- <description>OpenMRS DW Connection</description>
- <res-ref-name>jdbc/openmrs_dw</res-ref-name>
- <res-type>javax.sql.DataSource</res-type>
- <res-auth>Container</res-auth>
- </resource-ref>
- code>
- Install Kettle
- Download Kettle-2.5.0.zip under Pentaho Data Integration from sourceforge (http://downloads.sourceforge.net/pentaho/Kettle-2.5.0.zip?modtime=1178288438&big_mirror=0).
- Unzip it to a directory of your choice.
- Set an environment variable called KETTLE_HOME, which points at the installation directory.
- Copy the following directories and files from pentaho-openmrs to the [KETTLE_HOME] directory:
- .kettle
- etl_init.bat, etl_init.sh
- etl_daily.bat, etl_daily.sh
- Modify the kettle.properties file located at .kettle for your setup.
- To run the job that loads data for dimensions that are loaded only once (e.g. date_dim), execute the following scripts in the [KETTLE_HOME] directory:
- In Windows, at a command prompt run: etl_init.bat
- In UNIX: sh etl_init.sh
- This should be done only once. If you try to run it again the job will abort.
- To run the jobs that load historical data from the OpenMRS database as well as changed and new data from the last successful execution to refresh the data warehouse database:
- In Windows, at a command prompt run: etl_daily.bat
- In UNIX: sh etl_daily.sh
- The above scripts can be scheduled to run at specific time intervals using a scheduler, like Windows scheduler.
- To monitor the execution of jobs and transformations use your favorite SQL client to connect to the DW database and access these tables: audit and job_audit.
- Rejected records (records that cause errors) are stored in the error_event table.
Using your browser point to http://localhost:8080/pentaho to go to the Pentaho BI server homepage, which looks like the following image.
From there, use the menu Go->Solutions to browse the pages with the reports.
[edit]
Developer`s Guide
TO DO: Describe the development process.
[edit]
Schedule
[edit]
Week 0 (May 28 - June 2)
- Install / deploy OpenMRS.
- Work with OpenMRS mentor to write specifications for the project
- Sent example reports to Gjergji to go along with requirements
[edit]
Week 1 (June 3 - 9)
- Work with mentor to gather ETL/OLAP resources
- Independently research ETL/OLAP topics
- Discussed
- Setup trac/SVN account
- Created wiki pages for project specification and requirements
- ETL Integration Project - http://openmrs.org/wiki/Reporting_ETL_and_OLAP_Project
- KETTLE Integration Requirements - http://openmrs.org/wiki/KETTLE_Integration_Module
[edit]
Week 2 (June 10 - 16)
- Create KETTLE integration module branch in Subversion
[edit]
Week 3 (June 17 - 23)
[edit]
Week 4 (June 24 - 30)
[edit]
Week 5 (July 1 - 7)
[edit]
Week 6 (July 8 - 14)
[edit]
Resources
- None

