How to select features for clustering to detect the number of unique products in a search result?
I am trying to use clustering to determine the number of distinct products in a set of product search results. So far I am using k-means clustering, but I have run into a problem: I cannot find good features to use. Here are my clustering results for one search.
As you can see, there are clearly three different groups showing up at the moment. However, the top terms of each cluster reveal a clear issue: many of the terms repeat across clusters.
There is another issue. When I have the model predict clusters for a few test items that are clearly different products, it assigns them all to the same group. Here are the test titles:
Adjustable DC-DC LM 2596 Converter Buck Step Down Regulator Power Module LW SZUS
10pcs Mini360 3A DC Voltage Step Down Power Converter Buck Module 3.3V 5V 9V 12V
5V Out, 6V to 12V In AMS1117-5.0 5.0V Step-Down Linear Voltage Regulator Module
10pcs Mini LM2596s 3A DC to DC Buck Converter Power Supply Step Down Module
I have thought about using the grammar of each title as a feature, but since titles are limited to 80 characters, they rarely contain any grammar. I am at a loss for where to go next in picking features that would give better results. I am doing all of this in Python using scikit-learn and NLTK.
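To make the term overlap concrete, here is a minimal sketch that counts how many of the four test titles each token appears in. It uses only the Python standard library rather than the actual scikit-learn pipeline, and the `tokenize` helper and its splitting rule are assumptions made for illustration, not the tokenizer used for the clustering above.

```python
import re
from collections import Counter

# The four test titles quoted in the question.
titles = [
    "Adjustable DC-DC LM 2596 Converter Buck Step Down Regulator Power Module LW SZUS",
    "10pcs Mini360 3A DC Voltage Step Down Power Converter Buck Module 3.3V 5V 9V 12V",
    "5V Out, 6V to 12V In AMS1117-5.0 5.0V Step-Down Linear Voltage Regulator Module",
    "10pcs Mini LM2596s 3A DC to DC Buck Converter Power Supply Step Down Module",
]


def tokenize(title):
    # Hypothetical tokenizer: lowercase, then split on anything that is not
    # a letter, a digit, or a dot (so "3.3v" survives as one token).
    return [tok for tok in re.split(r"[^a-z0-9.]+", title.lower()) if tok]


# One set of tokens per title, so repeats inside a title count once.
token_sets = [set(tokenize(t)) for t in titles]

# Document frequency: in how many titles does each token occur?
doc_freq = Counter(tok for tokens in token_sets for tok in tokens)

# Tokens present in every title carry no information for separating them.
shared = sorted(tok for tok, n in doc_freq.items() if n == len(titles))

# Tokens present in exactly one title are candidates for distinguishing it.
unique = [sorted(tok for tok in tokens if doc_freq[tok] == 1) for tokens in token_sets]

print("In every title:", shared)
for title, toks in zip(titles, unique):
    print(title[:30], "->", toks)
```

On these four titles, the tokens found in every title are generic category words, while the tokens that occur in only one title are mostly model-number-like strings such as `ams1117` and `mini360`.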