Expand a list to columns, one list element per column on the same row
I'm working on clustering a corpus of free text documents. On each row I have a document ID in the first column and an ordered element list of integer values in the second. In order to take advantage of K Means Clustering in Advanced Analytics, I need to spread the list across columns.
How can I do that?
-
Hi Steve,
There is a function called EXPAND that extrapolates a list/array into rows, but this isn't quite what you've asked for. I have an idea - do you know what the maximum number of array/list values you might have? I.e., is it a known maximum number of columns you'll need to create?
-
Hi Grant. You're correct, EXPAND doesn't apply in this case. Yes the max is known--I set that. There are exactly 1000 list elements in each document's list. I picked that length though in the steps to create the list from the docs. I could shorten it, but it needs to be pretty big, say >500.
Super curious to hear your idea...
-
Hi Steve,
It sounds like there would be exactly the same number of populated columns per row, which makes this a bit easier (or more computationally efficient) as compared to an indeterminate number of columns per row, which would require some lightweight Boolean logic for each column.
However, my idea probably isn't feasible for such a huge number of columns as you stated. My idea is that if it was a small number of columns you were deciphering, say, less than 30-60, I would manually create each column as a formula to pull-out the array's positional values using LISTELEMENT.
Another option, although I haven't quite thought through it in detail, would be to convert the list/array into a JSON object and then export that to JSON format. You might then be able to re-ingest the JSON object and have the array values materialized into columns. There are a number of JSON functions that could potentially help with this... Perhaps someone else will respond here with a more concrete way of how to achieve this using JSON functions and/or export/import via the JSON format as I don't have the precise recipe figured out in my head.
Does that last idea give you something you can experiment with?
-
Ah, right, it occurred to me too that manually inserting values via LISTELEMENT functions would work--wondered if the API could be used to automate/scale this somehow (I'm completely unfamiliar with the API).
Thought the SELECT function might be relevant, but I think the task is equally manual. There's a GROUPSELECT too, same problem.
There's a Pivot Table widget documented--that might work--need to explore how to deploy it.
-
Hi Steve,
You can use the REST API to achieve this task. I put together a quick python script that should achieve what you are looking for. It might need some minor tweaks but should work. Not sure if you are familiar with python, but feel free to shoot me an email if you need assistance.
FYI: The current Java API wont allow you to create a function that wouldautomatically populate multiple columns at once.
Script:
#creates muliple columns froma list field
import requests
import json#input variables
url = 'http://localhost:8081/rest/workbook/'
user = 'admin'
pwd = 'admin'
workbook_id = '1038'
mysheet = 'Expand_List' #Worbook's sheet name were you have the list element
mycolumn = 'List' #Column name with the list
listsize = 10 #List max size
newWbk = "/Users/admin/Research/Miscelanious/Workbook_NEW.wbk" #Name and path of new workbook.#other variables
headers = {'content-type': 'application/json', 'Accept-Charset': 'UTF-8'}#get workbook definition
data = requests.get(url+workbook_id, auth=(user,pwd))
if data.status_code == 200:
print "Succesful request..."
json_data = data.json()#find sheet with information
sheets = json_data['sheets']
for i in sheets:
if i['name']==mysheet:
nextColumn = i['nextColumnId']
#create a column for each list element
for x in range(0,listsize):
i['formulas'].append({"columnName": "C","columnId": str(nextColumn+x),"columnIndex": nextColumn+x,"formulaString": "=LISTELEMENT(#List;"+str(x)+")"})
i['columnStyles'].append({"columnId": str(nextColumn+x),"name": "Col_"+str(nextColumn+x),"width": 100,"thousandSeparator": True})
#update workbook json definition
json_data['sheets']=sheets#Change name
json_data['file']['path']=newWbk
print json_data
#Create new workbook with new parameters
r=requests.post('http://localhost:8081/rest/workbook', auth=(user,pwd),headers=headers,json=json_data)
print r.text
else:
print "Error while requesting workbook definition."
print data -
You know what would be much better?--update the K Means function to support a LIST as input. It has to be the case that most applications will end up with a list--the data almost never start out as a set of columns all ready for clustering. The whole point is to transform to create the dimensional data.
Please sign in to leave a comment.
Comments
8 comments