Multiple columns datatype changes
Team,
The datalinks in Hive is created with string datatypes for all numeric fields. Since I have to get the aggregation of all those fields. Changing one by one is time consuming. Is there a way to select multiple fields and change all of them from one datatype to another. I am doing import job and changing even then it take a min of 20sec to max of 40 sec per columns. I have nearly 400 fields which require data type changes.
Appreciate your help
-
Hello Ganesh.
At the moment there is no option for bulk column type changes in a DataLink / ImportJob configuration dialog. Datameer evaluates a data sample and makes a decision for data type's auto-detection based on this sample.
Here is the Data type mapping when importing from a Hive table. What data type appropriate Hive table's column have? I'm curious why e.g. a FLOAT or INTEGER column is autodetected as STRING in Datameer.Anyway, there is a workaround one can use to change datatypes faster. However, you should keep in mind that in case a Hive table column has a STRING datatype, it might have values that can't be converted into e.g. a FLOAT or INTEGER, such values will be dropped from import.
- Create a DataLink and keep all the column's datatypes as they auto-detected by Datameer.
- Use REST API to get the JSON definition of this artifact. Read Import/ Link/ Upload.
- Update the JSON file with required datatypes (you could automate or at least semi-automate this via a script).
- Use the edited JSON definition to create new or update the existing DataLink. Upload Import/ Link/ Upload.
In addition to the above. From my experience, Datameer users on rare occasions apply formulas to every column of a source dataset or require every imported column in the final report.
Are you going to apply formulas to all 400 columns you have in your dataset? How many columns should be included in the final result? These two questions are quite important to address before you start to build a use-case in Datameer. If you only need to apply formulas to 20-30 columns, it is not required to change the datatype for the other 370. In case 350 out of the 400 columns are not required either for data analysis or the final report, why ingest all of them into Datameer and process the extra data you do not need. I would recommend reviewing the source data at a DataLink or an ImportJob creation step and include only columns you plan to work with. -
Thanks Konsta Danyliuk.
Hive table is created with String types. I am doing a data quality exercise from another source and hence requirement is to get count of rows by all categorical variables and sum of all measures. since the measures are created as STRING it also flows as String and I am not able to sum the measures without converting them.
Since all fields are in scope, i have to apply the formula to convert them into FLOAT which is a tedious time consuming. Not familiar with REST API's as I am very new to Datameer.
Also is there a way to store all group by variables in a column and use them to get the record counts distribution of a particular group by column. for example if region, city, locality are the three columns and each of them will have its own data values, Requirement is to have colum name, count for each of the columns i passed in the previous sheet.
Again thank you very much for your post
-
Hello Ganesh.
According to the use-case description you've provided, it seems to load an artifact's definition via REST API, edit the JSON, and upload it back approach would be the best from a time-saving perspective. You could either change columns' datatype in a DataLink itself or edit the Workbook's JSON definition to add e.g. FLOAT function to desired columns. Please refer to Accessing Datameer Using the REST API section of our documentation to get yourself familiarised with the REST API method.Regarding your second question, may I ask you to provide a mock-up of the source data and the result you are looking for, about 10 records should be enough.
-
Hi thanks for your response. given below is a sample. two dimensions Region and Currency. Desired result expect the count by each dimension along with dimension values. for smaller count I created individual sheets by dimension and finally did a union of all sheets. here I have nearly 500 dimension and creating individual sheets is obviously not an efficient process. is there a way to list all the dimensions in one sheet and aggregation is done based on the datasheet and final result presented. Thanks in advance
-
Thank you, Ganesh.
Yes, the SQL approach for this use-case seems easier, especially if you have a lot of dimensions.
In your SQL query, you prepared data for each dimension and then united them to get a consolidated output. In Datameer, you could get such results in two different ways.Let's say you have the following source data.
Using the traditional approach, you would need to apply the following transformation for each dimension type,
- Create a new Sheet where you apply GROUPBY / GROPUPCOUNT to get the count for distinct values for the dimension.
- Create an additional column with the dimension name.
Then you would need to unite all the dimension Sheets to bring the data to a single page.
However, I understand that this approach is quite time-consuming, especially if you have 500+ diimentons.
Alternatively, you could leverage SQL Worksheets and execute the SQL query. The following query returns the same results as the traditional approach.
select 'Region' as Dimension, Region,count(cardno) as records from datagen group by Region union
select 'Currency' as Dimension, Currency,count(cardno) as records from datagen group by Currency union
select 'RepYear' as Dimension, RepYear,count(cardno) as records from datagen group by RepYear -
Hello Ganesh.
As far as I understand, you are trying to view data exported from Datameer as a CSV file in another tool, aren't you?
May I ask you to clarify how exactly have you get this dataset from Datameer and in what exactly tool are you trying to view it?
Are you able to see the mentioned values in standard notation in case you simply open this CSV file in notepad or any other text editor? -
Hello Ganesh.
May I ask you to provide:- The screenshot from the Workbook wherefrom you are trying to export/download the data. I would like to confirm how exactly the values look like in Datameer and what is the type of column (you could get the data type by hovering over the column header).
- The screenshot from the File -> Browse All Data menu of the Workbook that shows data appearance right before you click the Download Data button.
Have you tried to view this downloaded file via any text editor, not Excel or SAS, e.g. in notepad?
What Datameer version do you use?
Thank you in advance.
-
Hello Konsta Danyliuk,
Thanks for the suggestion. Column A is a float and column B is Bigdecimal. yet after downloading as a CSV and on opening we still get to see in scientific notation. to overcome this I formatted as number and saved as excel and it helped (second figure). considering the number of variables are close to 500 which need change, is there a way to do it at one go rather than doing one by one.
Thanks
-
Hello Ganesh.
When Datameer writes data into a CSV file, it is just a sequence of string values separated by a configured delimiter. The Float values of a certain length being represented in scientific notation is the standard behavior of the Java
.toSting()
method Datameer leverages to convert FLOATS to STRINGS.You could easily check the result already in a Workbook by converting FLOAT to STRING via the T function.
- Here is an example of the Datameer Workbook that shows differences in the same FLOAT value representation when I convert it directly to STRING (scientific notation) and when I perform FLOAT -> BIGDECIMAL -> STRING conversion.
- This is how this data looks like in the downloaded CSV file (FLOAT got converted into STRING in the scientific notation).
- The same data in Excel when I simply open the CSV file.
- The same data in Excel when I load a file via Data -> From Text menu.
From my perspective, Excel could be an indicator in this case, as it has its own settings for the appearance of different values.
To avoid scientific notation when FLOAT values are represented as STRINGS (e.g. after export into a CSV) you should convert them into BIGDECIMAL either in Datameer Workbook or when you initially ingest this data into Datameer.
In case you need to update the type for multiple columns, the easiest way to do this is via REST API. Below instructions are for a DataLink, but the same approach is applicable for Workbooks.
- Create a DataLink and keep all the column's datatypes as they auto-detected by Datameer.
- Use REST API to get the JSON definition of this artifact. Read Import/ Link/ Upload.
- Update the JSON file with required datatypes (you could automate or at least semi-automate this via a script).
- Use the edited JSON definition to create new or update the existing DataLink. Upload Import/ Link/ Upload.
- Here is an example of the Datameer Workbook that shows differences in the same FLOAT value representation when I convert it directly to STRING (scientific notation) and when I perform FLOAT -> BIGDECIMAL -> STRING conversion.
Please sign in to leave a comment.
Comments
17 comments