Dirty data science machine learning on non-curated data

source link: https://www.slideshare.net/GaelVaroquaux/dirty-data-science-machine-learning-on-noncurated-data

  1. 1. Dirty data science machine learning on non-curated data Gaël Varoquaux,
  3. 3. Industry challenges to data science www.kaggle.com/ash316/novice-to-grandmaster
  4. 4. Industry challenges to data science www.kaggle.com/ash316/novice-to-grandmaster On some dirty-data problems, progress in machine learning can ease the pain
  5. 5. Talk outline 1 What models cannot fit 2 Learning with missing values 3 Machine learning on dirty categories G Varoquaux 3
  6. 6. 1 What models cannot fit. Outside of statistics’ comfort zone ($X \in \mathbb{R}^{n \times p}$)
  7. 7. 1 The full life-cycle of a data-science project: Framing the domain question; Finding and understanding the data; Assembling and reshaping it; Designing an AI / statistical model (*); Evaluating model performance; Inspecting the model for unwanted behavior; Bringing the model to stakeholders / production. (*): what we think is cool
  8. 8. 1 Understanding the data, between human and machine Age 60 26 38 139 52 86 17 48 Just numbers G Varoquaux 6
  9. 9. 1 Understanding the data, between human and machine Age 60 26 38 ?? 139 52 86 17 48 Numbers with a meaning A numerical column expresses a quantity, with a corresponding scale... G Varoquaux 6
  10. 10. 1 Understanding the data, between human and machine. (Age, Name): 60, Bono; 26, Justin Bieber; 38, Giselle Knowles-Carter*; 139, Pablo Picasso; 52, Céline Dion; 86, Léonard Cohen; 17, Greta Thunberg; 48, Justin Trudeau. (* Beyonce) A numerical column expresses a quantity, with a corresponding scale... Recognized entries shed light on the numbers.
  11. 11. 1 Understanding the data, between human and machine. (Age, Name, Born in, Activity): 60, Bono, Ireland, Singer; 26, Justin Bieber, Canada, Singer; 38, Giselle Knowles-Carter*, USA, Singer; 139, Pablo Picasso, Spain, Painter; 52, Céline Dion, Canada, Singer; 86, Léonard Cohen, Canada, Singer; 17, Greta Thunberg, Sweden, Activist; 48, Justin Trudeau, Sweden, Politician. (* Beyonce) A numerical column expresses a quantity, with a corresponding scale... Recognized entries shed light on the numbers. They can be used to bring in additional information (features).
  12. 12. 1 Understanding the data, between human and machine. (Age, Name, Born in, Activity): 60, Bono, Ireland, Singer; 26, Justin Bieber, Canada, Singer; 38, Giselle Knowles-Carter*, USA, Singer; 139, Pablo Picasso, Spain, Painter; 52, Céline Dion, Canada, Singer; 86, Léonard Cohen, Canada, Singer; 17, Greta Thunberg, Sweden, Activist; 48, Justin Trudeau, Sweden, Politician. (* Beyonce) A numerical column expresses a quantity, with a corresponding scale... Recognized entries shed light on the numbers. They can be used to bring in additional information (features). And find errors. Knowledge representation, relational algebra.
  13. 13. 1 Assembling data, of different natures and sources. First table (Age, Name, Position): 60, John Doe, Electrician; 48, Jane Austen, Senior Professor; 52, Jack Daniels, Professor. Second table (Position, Salary): Electrician, 35 lizards; Professor, 13 horses; Senior Professor, 1 dragon. To model the link between age and salary, a join is necessary. Databases: to maintain consistency and minimize storage, data are normalized: multiple tables are used to minimize redundancy. Statistics: needs samples and features, i.e. multiple observations of the same kind ⇒ data is denormalized in 1 table (Age, Name, Position, Salary, Coffees/day): 60, John Doe, Electrician, 35 lizards, 2; 48, Jane Austen, Senior Professor, 1 dragon, 128.
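The join described on slide 13 can be sketched with pandas. This is only an illustration, assuming pandas is installed; the variable names (`people`, `salaries`, `flat`) are invented for the example, and the values are the playful ones shown on the slide.

```python
import pandas as pd

# Normalized tables, as a database would store them (values copied from the slide).
people = pd.DataFrame({
    "Age": [60, 48, 52],
    "Name": ["John Doe", "Jane Austen", "Jack Daniels"],
    "Position": ["Electrician", "Senior Professor", "Professor"],
})
salaries = pd.DataFrame({
    "Position": ["Electrician", "Professor", "Senior Professor"],
    "Salary": ["35 lizards", "13 horses", "1 dragon"],
})

# Denormalize into one table: a left join keeps every person
# and attaches the salary that matches their position.
flat = people.merge(salaries, on="Position", how="left")
```

A left join is used here so that a person whose position is absent from the salary table would be kept, with a missing salary, rather than silently dropped.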
  14. 14. 1 Aggregations – long vs wide tables. Long table (Person ID, Measure type, Value): 12345, Blood Pressure, 139; 45673, Sugar Level, 113; 12345, Heart Rate, 71; 45673, Blood Pressure, 84. Flexible data representation. Wide table (Person ID, Blood Pressure, Sugar Level, Heart Rate): 12345, 139, NA, 71; 45673, 84, 113, NA. Amenable to statistics on Person. Long to wide in Pandas: unstack, pivot. Also: count coffees per day per person from coffee-machine logs.
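The long-to-wide conversion mentioned on slide 14 could look like the following in pandas. This is a sketch under the assumption that each (person, measure type) pair appears at most once; the column names are hypothetical snake_case versions of the slide headers.

```python
import pandas as pd

# Long table from the slide: one row per (person, measure type) pair.
long_table = pd.DataFrame({
    "person_id": [12345, 45673, 12345, 45673],
    "measure_type": ["Blood Pressure", "Sugar Level", "Heart Rate", "Blood Pressure"],
    "value": [139, 113, 71, 84],
})

# pivot: one row per person, one column per measure type.
# Measurements that were never taken show up as NaN.
wide_table = long_table.pivot(index="person_id", columns="measure_type", values="value")

# Same result through a two-level index followed by unstack.
wide_unstacked = long_table.set_index(["person_id", "measure_type"])["value"].unstack()
```

If a pair could occur several times (for instance repeated measurements), `pivot` would raise an error and an aggregation such as `pivot_table` or `groupby` would be needed instead.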
  15. 15. 1 Data wrangling: assembling unfamiliar sources. Relational algebra: joins; aggregations (# coffees a day); selections (finding the data). Challenges: understanding the data store and domain logic; errors in the data (correspondences in names). Example (Age, Name, Country, Position, Coffees/day): 48, Justin Trudeau, Canada, Prime minister, 3000; NA, Gaël Varoquaux, NA, NA, NA.
  16. 16. 1 Data wrangling: assembling unfamiliar sources. Relational algebra: joins; aggregations (# coffees a day); selections (finding the data). Challenges: understanding the data store and domain logic; errors in the data (correspondences in names). In health: assembling information across large electronic health records systems.
  17. 17. 1 Systematic errors: data require external checks Measurement biases: Volunteer bias More women volunteer in medical studies G Varoquaux 10
  18. 18. 1 Systematic errors: data require external checks Measurement biases: Volunteer bias More women volunteer in medical studies Selection bias Healthy people seldom go to the hospital (causal inference) G Varoquaux 10
  19. 19. 1 Systematic errors: data require external checks Measurement biases: Volunteer bias More women volunteer in medical studies Selection bias Healthy people seldom go to the hospital (causal inference) Survival bias Data loss related to the process under study (survival models) G Varoquaux 10
  20. 20. 1 Systematic errors: data require external checks Measurement biases: Volunteer bias More women volunteer in medical studies Selection bias Healthy people seldom go to the hospital (causal inference) Survival bias Data loss related to the process under study (survival models) Partly addressed by machine-learning models for dataset shift (transfer learning) if you know the bias. Brings us back to understanding the data G Varoquaux 10
  21. 21. Data-science is much more than fitting a statistical model Data require assembling information Different data sources = different conventions Measurements come with errors and biases These challenges require domain knowledge and data wrangling G Varoquaux 11
  22. 22. 2 Learning with missing values [Josse... 2019]. (Gender, Date Hired, Employee Position Title): M, 09/12/1988, Master Police Officer; F, NA, Social Worker IV; M, 07/16/2007, Police Officer III; F, 02/05/2007, Police Aide; M, 01/13/2014, Electrician I; M, 04/28/2002, Bus Operator; M, NA, Bus Operator; F, 06/26/2006, Social Worker III; F, 01/26/2000, Library Assistant I; M, NA, Library Assistant I.
  23. 23. Why doesn’t the #$@! machine learning toolkit work?! Machine learning models need entries in a vector space (or at least a metric space). $\mathrm{NA} \notin \mathbb{R}$. More than an implementation problem.
  24. 24. Why doesn’t the #$@! machine learning toolkit work?! Machine learning models need entries in a vector space (or at least a metric space). $\mathrm{NA} \notin \mathbb{R}$. More than an implementation problem. Categorical entries are discrete anyhow: for missing values in categorical variables, create a special category “missing”. Rest of talk on NA in numerical variables.
  25. 25. 2 Classic statistics points of view. Model a) a distribution $f_\theta$ for the complete data $x$. Model b) a random process $g_\phi$ occluding entries (mask $m$). Missing at random situation (MAR): for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019]: $\mathrm{observed}(x', m_i) = \mathrm{observed}(x_i, m_i) \Rightarrow g_\phi(m_i \mid x') = g_\phi(m_i \mid x_i)$. Theorem [Rubin 1976]: in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a).
  26. 26. 2 Classic statistics points of view. Model a) a distribution $f_\theta$ for the complete data $x$. Model b) a random process $g_\phi$ occluding entries (mask $m$). Missing at random situation (MAR): for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019]. Theorem [Rubin 1976]: in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a). Missing Completely at random situation (MCAR): missingness is independent from data. Missing Not at Random situation (MNAR): missingness not ignorable.
  27. 27. 2 Classic statistics points of view. Model a) a distribution $f_\theta$ for the complete data $x$. Model b) a random process $g_\phi$ occluding entries (mask $m$). Missing at random situation (MAR): for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019]. Theorem [Rubin 1976]: in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a). Missing Completely at random situation (MCAR): missingness is independent from data. Missing Not at Random situation (MNAR): missingness not ignorable. [Figure: three panels: Complete, MCAR, MNAR.]
  28. 28. 2 Classic statistics points of view. Model a) a distribution $f_\theta$ for the complete data $x$. Model b) a random process $g_\phi$ occluding entries (mask $m$). Missing at random situation (MAR): for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019]. Theorem [Rubin 1976]: in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a). Missing Completely at random situation (MCAR): missingness is independent from data. Missing Not at Random situation (MNAR): missingness not ignorable. [Figure: three panels: Complete, MCAR, MNAR.] But: there isn’t always an unobserved value (age of spouse of singles?). Machine-learning’s goal is not to maximize likelihoods.
  29. 29. 2 Imputation: fill in information. (Gender, Date Hired, Employee Position Title): M, 09/12/1988, Master Police Officer; F, NA → 2000, Social Worker IV; M, 07/16/2007, Police Officer III; M, 01/13/2014, Electrician I; M, 04/28/2002, Bus Operator; M, NA → 2012, Bus Operator; F, 06/26/2006, Social Worker III; F, 01/26/2000, Library Assistant I; M, NA → 2014, Library Assistant I. Large statistical literature. Procedures and results focused on in-sample settings. How about completing the test set with the train set? What to do with the prediction target y?
  30. 30. 2 Imputation and prediction with test-time missing values. Settings: $y = f(x) + \varepsilon$. Theorem [Josse... 2019]: for $f$ a trained predictor achieving Bayes risk on full data, conditional multiple imputation achieves Bayes risk on test set with missing data (in MAR settings): $f^\star_{\text{mult imput}}(\tilde{x}) = \mathbb{E}_{X_m \mid X_o = x_o}[f(X_m, x_o)]$. Notations: $\tilde{x} \in (\mathbb{R} \cup \{\mathrm{NA}\})^p$: data at hand; $x_o$: observed values; $x_m$: unobserved values.
  31. 31. 2 Imputation procedures that work out of sample. Mean imputation (special case of univariate imputation): replace NA by the mean of the feature. sklearn.impute.SimpleImputer
  32. 32. 2 Imputation procedures that work out of sample. Mean imputation (special case of univariate imputation): replace NA by the mean of the feature. sklearn.impute.SimpleImputer. Conditional imputation: modeling one feature as a function of others. Possible implementation: iteratively predict one feature as a function of the others. Classic implementations in R: MICE, missForest. sklearn.impute.IterativeImputer (bad computational scalability).
  33. 33. 2 Imputation procedures that work out of sample. Mean imputation (special case of univariate imputation): replace NA by the mean of the feature. sklearn.impute.SimpleImputer. Conditional imputation: modeling one feature as a function of others. Possible implementation: iteratively predict one feature as a function of the others. Classic implementations in R: MICE, missForest. sklearn.impute.IterativeImputer (bad computational scalability). Classic statistics point of view: mean imputation is disastrous, because it distorts the distribution. “Congeniality” conditions: good imputation must preserve data properties used by later analysis steps.
  34. 34. 2 Constant imputation for supervised learning Theorem [Josse... 2019] For a powerful learner (universally consistent) imputing both train and test with the mean of train is consistent ie it converges to the best possible prediction Intuition The learner “recognizes” imputed entries and compensates at test time G Varoquaux 18
  35. 35. 2 Constant imputation for supervised learning Theorem [Josse... 2019] For a powerful learner (universally consistent) imputing both train and test with the mean of train is consistent ie it converges to the best possible prediction Intuition The learner “recognizes” imputed entries and compensates at test time Constant imputation breaks simple models (eg linear models) [Morvan... 2020] G Varoquaux 18
  36. 36. 2 Imputation for supervised learning. Simulation: MCAR + Gradient boosting. [Figure: r2 score as a function of sample size (10^2 to 10^4) for Mean and Iterative imputation; panels: Convergence, Small sample size.] Notebook: github – @nprost / supervised missing. Conclusions: IterativeImputer is useful for small sample sizes.
  37. 37. 2 Imputation is not enough: predictive missingness. Pathological case [Josse... 2019]: y depends only on whether data is missing or not, e.g. tax fraud detection (theory: MNAR = “Missing Not At Random”). Imputing makes prediction impossible. Solution: add a missingness indicator, an extra feature to predict: ...SimpleImputer(add_indicator=True), ...IterativeImputer(add_indicator=True).
  38. 38. 2 Imputation is not enough: predictive missingness. Pathological case [Josse... 2019]: y depends only on whether data is missing or not, e.g. tax fraud detection (theory: MNAR = “Missing Not At Random”). Imputing makes prediction impossible. Solution: add a missingness indicator, an extra feature to predict: ...SimpleImputer(add_indicator=True), ...IterativeImputer(add_indicator=True). Simulation: y depends indirectly on missingness (censoring). [Figure: r2 score as a function of sample size (10^2 to 10^4) for Mean, Mean + indicator, Iterative, and Iterative + indicator; panels: Convergence, Small sample size.] Notebook: github – @nprost / supervised missing. Adding a mask is crucial. Iterative imputation can be detrimental.
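Mean imputation with a missingness indicator, as recommended on slides 37 and 38, might be written as below with scikit-learn. The toy arrays are invented for illustration; the point is that the imputer is fitted on the training set only and then applied unchanged to the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up): two numerical features with some missing entries.
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan],
                    [5.0, 30.0]])
X_test = np.array([[np.nan, 40.0],
                   [2.0, np.nan]])

# Learn the feature means on the training set, and append one binary
# indicator column per feature that had missing values during fit.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_train_imputed = imputer.fit_transform(X_train)

# The test set is completed with the *training* means.
X_test_imputed = imputer.transform(X_test)
```

The output has four columns here: the two imputed features followed by the two indicator columns, which a downstream learner can use to pick up predictive missingness.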
  39. 39. 2 Tree models with missing values. MIA (Missing Incorporated Attribute) [Josse... 2019]. [Figure: decision tree with splits such as x10 < -1.5, x2 < 2, x7 < 0.3, x1 < 0.5, where each split sends missing values to either its Yes or its No branch; one leaf predicts +1.3.] sklearn.ensemble.HistGradientBoostingClassifier: the learner readily handles missing values.
  40. 40. 2 Tree models with missing values (MCAR). Simulation: MCAR + Gradient boosting. [Figure: r2 score as a function of sample size (10^2 to 10^4) for Inside trees, Mean, and Iterative; panels: Convergence, Small sample size.] Notebook: github – @nprost / supervised missing.
  41. 41. 2 Tree models with missing values (censored). Simulation: MCAR + Gradient boosting. [Figure: r2 score as a function of sample size (10^2 to 10^4) for Inside trees, Mean, Iterative, Mean + indicator, and Iterative + indicator; panels: Convergence, Small sample size.] Notebook: github – @nprost / supervised missing.
  42. 42. 2 Neural networks with missing values. Gradient-based optimization of continuous models. Difficulty: half-discrete input space ($\{\mathrm{NA}\} \cup \mathbb{R}$). Example: $Y = \beta^\star_1 X_1 + \beta^\star_2 X_2 + \beta^\star_0$ with $\mathrm{cor}(X_1, X_2) = 0.5$. If $X_2$ is missing, the coefficient of $X_1$ should compensate for the missingness of $X_2$: up to $2^d$ sets of slopes. [Figure: two panels: effect of $X_2$ lost; effect of $X_2$ accounted for by $X_1$.]
  43. 43. 2 NeuMiss network: adapted neural architecture [Le Morvan... 2020]. Neural networks that approximate optimal predictors (functions of $\Sigma^{-1}$). Tailored architecture which learns all slopes jointly.
  44. 44. 2 NeuMiss network: adapted neural architecture [Le Morvan... 2020]. Neural networks that approximate optimal predictors (functions of $\Sigma^{-1}$). Tailored architecture which learns all slopes jointly. [Figure: R2 score minus Bayes rate as a function of the number of parameters (10^3 to 10^4), on train and test sets, for MLP Deep (network depth 1, 3, 5, 7, 9), MLP Wide (width 1 d, 3 d, 10 d, 30 d, 50 d), and NeuMiss.] NeuMiss needs less data.
  45. 45. 2 NeuMiss network: adapted neural architecture [Le Morvan... 2020]. Neural networks that approximate optimal predictors (functions of $\Sigma^{-1}$). Tailored architecture which learns all slopes jointly. [Figure: R2 score minus Bayes rate as a function of the number of parameters (10^3 to 10^4), on train and test sets, for MLP Deep (network depth 1, 3, 5, 7, 9), MLP Wide (width 1 d, 3 d, 10 d, 30 d, 50 d), and NeuMiss.] NeuMiss needs less data. Also suitable for MNAR settings.
  46. 46. Learning with missing values. Imputation is motivated only in MAR settings. Rather than a sophisticated imputation, use a powerful supervised learner. sklearn’s HistGradientBoostingClassifier readily models missing values. Can work in MNAR settings. Different regime than standard statistics.
  47. 47. 3 Machine learning on dirty categories [Cerda... 2018, Cerda and Varoquaux 2020] Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I G Varoquaux 27
  48. 48. 3 Categorical entries in a statistical model. Employee Position Title: Master Police Officer; Social Worker IV; Police Officer III; Police Aide; Electrician I; Bus Operator; Bus Operator; Social Worker III; Library Assistant I; Library Assistant I. One-hot encoding: one 0/1 indicator column per category (Master Police Officer, Social Worker IV, Police Officer II, ...), giving $X \in \mathbb{R}^{n \times p}$.
  49. 49. 3 Non-normalized categorical entries in a statistical model Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I Break OneHotEncoder Overlapping categories “Master Police Officer”, “Police Officer III”, “Police Officer II”... High cardinality 400 unique entries in 10 000 rows Rare categories Only 1 “Architect III” New categories in test set G Varoquaux 29
  50. 50. 3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001]. High-cardinality categories: represent each category by the average target y. Police Officer II → average salary of police officer II. [Figure: categories placed along the axis y: Employee salary (40000 to 140000): Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II.]
  51. 51. 3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001]. High-cardinality categories: represent each category by the average target y. Police Officer II → average salary of police officer II. [Figure: categories placed along the axis y: Employee salary (40000 to 140000): Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II.] Embedding categories with the same y close to each other can help build a simple decision function.
  52. 52. 3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001]. High-cardinality categories: represent each category by the average target y. Police Officer II → average salary of police officer II. DirtyCat: Dirty category software: http://dirty-cat.github.io. from dirty_cat import TargetEncoder; target_encoder = TargetEncoder(); transformed_values = target_encoder.fit_transform(df)
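To make the idea of target encoding concrete, here is a dependency-free sketch. It is not the dirty_cat implementation: the function names and salary figures are hypothetical, and it deliberately omits the shrinkage toward the global mean that practical target encoders usually apply to rare categories.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Return (mean target per category, global mean target)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def target_encode(categories, means, global_mean):
    """Replace each category by its mean target; unseen ones get the global mean."""
    return [means.get(category, global_mean) for category in categories]


# Hypothetical training data: position titles and salaries.
train_positions = ["Police Officer II", "Police Officer II", "Library Aide", "Manager III"]
train_salaries = [60000.0, 70000.0, 40000.0, 120000.0]

means, global_mean = fit_target_encoding(train_positions, train_salaries)
encoded = target_encode(["Police Officer II", "Architect III"], means, global_mean)
# "Architect III" was never seen in training, so it falls back to the global mean.
```

Falling back to the global mean is one simple way to handle the "new categories in test set" problem raised earlier in the talk; other fallbacks are possible.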
  53. 53. 3 Data curation Database normalization Feature engineering Employee Position Title Master Police Officer Social Worker III Police Officer II Social Worker II Police Officer III ⇒ Position Rank Police Officer Master Social Worker III Police Officer II Social Worker II Police Officer III G Varoquaux 31
  54. 54. 3 Data curation Database normalization Feature engineering Employee Position Title Master Police Officer Social Worker III ... ⇒ Position Rank Police Officer Master Social Worker III ... Merging entities Deduplication & record linkage Output a “clean” database Company name Pfizer Inc. Pfizer Pharmaceuticals LLC Pfizer International LLC Pfizer Limited Pfizer Corporation Hong Kong Limited Pfizer Pharmaceuticals Korea Limited ... Difficult without supervision Potentially suboptimal Pfizer Corporation Hong Kong = ? Pfizer Pharmaceuticals Korea G Varoquaux 31
  55. 55. 3 Data curation Database normalization Feature engineering Employee Position Title Master Police Officer Social Worker III ... ⇒ Position Rank Police Officer Master Social Worker III ... Merging entities Deduplication & record linkage Output a “clean” database Company name Pfizer Inc. Pfizer Pharmaceuticals LLC ... Hard to make automatic and turn-key Harder than supervised learning G Varoquaux 31
  56. 56. Our goal: supervised learning on dirty categories The statistical question should inform curation Pfizer Corporation Hong Kong = ? Pfizer Pharmaceuticals Korea G Varoquaux 32
  57. 57. 3 Adding similarities to one-hot encoding. One-hot encoding (columns London, Londres, Paris): Londres → 0, 1, 0; London → 1, 0, 0; Paris → 0, 0, 1. $X \in \mathbb{R}^{n \times p}$. New categories? Link categories? Similarity encoding [Cerda... 2018] (columns London, Londres, Paris): Londres → 0.3, 1.0, 0.0; London → 1.0, 0.3, 0.0; Paris → 0.0, 0.0, 1.0. The off-diagonal entries are string distance(Londres, London).
  58. 58. 3 Some string similarities. Levenshtein: number of edits on one string to match the other. Jaro-Winkler: $d_{\text{jaro}}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m - t}{3m}$, with $m$: number of matching characters, $t$: number of character transpositions. n-gram similarity: an n-gram is a group of n consecutive characters (illustrated on “London...” with three overlapping 3-grams); similarity = (# n-grams in common) / (# n-grams in total).
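The n-gram similarity of slide 58 can be sketched in a few lines of plain Python. This version assumes that n-grams are collected as sets (each distinct n-gram counted once) and that no padding is added around the string; an actual library such as DirtyCat may make different choices, so the numbers need not match exactly.

```python
def ngrams(string, n=3):
    """Set of all groups of n consecutive characters in the string."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}


def ngram_similarity(s1, s2, n=3):
    """Number of n-grams in common divided by the total number of distinct n-grams."""
    grams1, grams2 = ngrams(s1, n), ngrams(s2, n)
    total = grams1 | grams2
    if not total:
        # Both strings are shorter than n: fall back to exact matching.
        return 1.0 if s1 == s2 else 0.0
    return len(grams1 & grams2) / len(total)


# Similarity encoding of three entries against three prototype categories.
prototypes = ["London", "Londres", "Paris"]
entries = ["Londres", "London", "Paris"]
encoded = [[ngram_similarity(entry, proto) for proto in prototypes] for entry in entries]
```

With these definitions, "London" and "Londres" share the 3-grams "Lon" and "ond" out of 7 distinct ones, giving 2/7 (about 0.29), which is in line with the 0.3 shown on the similarity-encoding slide.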
  59. 59. 3 Python implementation: DirtyCat. DirtyCat: Dirty category software: http://dirty-cat.github.io. from dirty_cat import SimilarityEncoder; similarity_encoder = SimilarityEncoder(similarity='ngram'); transformed_values = similarity_encoder.fit_transform(df)
  60. 60. 3 Dirty categories blow up dimension G Varoquaux 36
  61. 61. 3 Dirty categories blow up dimension New words in natural language G Varoquaux 36
  62. 62. 3 Dirty categories blow up dimension. New words in natural language. $X \in \mathbb{R}^{n \times p}$, $p$ is large. Statistical problems. Computational problems.
  63. 63. 3 Tackling the high cardinality Similarity encoding, one-hot encoding = Prototype methods How to choose a small number of prototypes? G Varoquaux 37
  64. 64. 3 Tackling the high cardinality. Similarity encoding, one-hot encoding = Prototype methods. How to choose a small number of prototypes? All training-set ⇒ huge dimensionality. Most frequent? Maybe the right prototypes ∉ training set: “big cat”, “fat cat”, “big dog”, “fat dog”. Estimate prototypes.
  65. 65. 3 Substring information Drug Name alcohol ethyl alcohol isopropyl alcohol polyvinyl alcohol isopropyl alcohol swab 62% ethyl alcohol alcohol 68% alcohol denat benzyl alcohol dehydrated alcohol Employee Position Title Police Aide Master Police Officer Mechanic Technician II Police Officer III Senior Architect Senior Engineer Technician Social Worker III G Varoquaux 38
  66. 66. 3 Modeling substrings [Cerda and Varoquaux 2020]. Model on sub-strings (GaP: Gamma-Poisson factorization). Models strings as a combination of substrings. [Figure: binary matrix marking which 3-grams (er_, cer, fic, off, _of, ce_, ice, lic, pol) occur in the entries police officer, pol off, polis, policeman, policier.] sklearn.feature_extraction.text: CountVectorizer (analyzer: 'word', 'char', 'char_wb'); HashingVectorizer (fast, stateless); TfidfVectorizer (normalize counts).
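The substring counts that feed such a model could be obtained with scikit-learn's `CountVectorizer`, as named on slide 66. The sketch below only builds the entries × 3-grams count matrix for the five strings shown on the slide; it does not perform the Gamma-Poisson factorization itself, and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The five dirty variants shown on the slide.
entries = ["police officer", "pol off", "polis", "policeman", "policier"]

# Count character 3-grams; analyzer="char" lets n-grams span word boundaries.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
counts = vectorizer.fit_transform(entries)  # sparse matrix: entries x 3-grams

# Look up how often the 3-gram "pol" occurs in each entry.
pol_column = vectorizer.vocabulary_["pol"]
pol_counts = counts[:, pol_column].toarray().ravel()
```

All five entries share the 3-gram "pol", which is the kind of overlap that lets a substring model relate them even though no two strings are identical.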
  67. 67. 3 Latent category model [Cerda and Varoquaux 2020]. Topic model on sub-strings (GaP: Gamma-Poisson factorization). Models strings as a linear combination of substrings. [Figure: the binary matrix of entries (police officer, pol off, polis, policeman, policier) by 3-grams (er_, cer, fic, off, _of, ce_, ice, lic, pol) is factorized into a documents × topics matrix and a topics × 3-grams matrix.] What substrings are in a latent category. What latent categories are in an entry.
  68. 68. 3 String models of latent categories [Cerda and Varoquaux 2020]. Encodings that extract latent categories. [Figure: latent-category activations (column labels cropped: …brary, …rator, …alist, …house, …nager, …unity, …escue, …ficer) for the categories Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant.]
  69. 69. 3 String models of latent categories [Cerda and Varoquaux 2020]. Inferring plausible feature names. [Figure: same activations with inferred feature names as column labels (cropped: …stant, library; …ment, operator; …on, specialist; …ker, warehouse; …ogram, manager; …nic, community; …escuer, rescue; …ction, officer) for the categories Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant.]
  70. 70. 3 Data science with dirty categories. [Figure: permutation importances (axis from 0.0 to 0.2) of the inferred feature names: Information, Technology, Technologist; Officer, Office, Police; Liquor, Clerk, Store; School, Health, Room; Environmental, Telephone, Capital; Lieutenant, Captain, Chief; Income, Assistance, Compliance; Manager, Management, Property.]
  71. 71. Learning does not require clean entities. Model continuous similarities across entries. Sub-string models can capture these. Requires a powerful statistical model (Gradient-boosted trees). Explainable machine-learning techniques to give insight.
  72. 72. @GaelVaroquaux Machine learning with dirty data. What models cannot fit; Dirty categories; Missing values. Understanding and formatting data is unavoidable: master these aspects. Powerful machine-learning models can cope with dirtiness - If it is well represented (representing similarities and missingness) - If they have supervision information
  73. 73. 4 References I. P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020. P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, 2018. J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019. M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss networks: differentiable programming for supervised learning with missing values. In Advances in Neural Information Processing Systems 33, 2020. D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27–32, 2001.
  74. 74. 4 References II. M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS, 2020. D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
