Dirty data science machine learning on non-curated data
source link: https://www.slideshare.net/GaelVaroquaux/dirty-data-science-machine-learning-on-noncurated-data
- 1. Dirty data science machine learning on non-curated data Gaël Varoquaux,
- 2. Dirty data science machine learning on non-curated data Gaël Varoquaux,
- 3. Industry challenges to data science www.kaggle.com/ash316/novice-to-grandmaster
- 4. Industry challenges to data science www.kaggle.com/ash316/novice-to-grandmaster On some dirty-data problems, progress in machine learning can ease the pain
- 5. Talk outline 1 What models cannot fit 2 Learning with missing values 3 Machine learning on dirty categories G Varoquaux 3
- 6. 1 What models cannot fit Outside of statistics’ comfort zone (X ∈ Rn×p ) G Varoquaux 4
- 7. 1 The full life-cycle of a data-science project Framing the domain question Finding and understanding the data Assembling and reshaping it Designing an AI / statistical model? Evaluating model performance Inspecting the model for unwanted behavior Bringing the model to stakeholders / production ?: what we think is cool G Varoquaux 5
- 8. 1 Understanding the data, between human and machine Age 60 26 38 139 52 86 17 48 Just numbers G Varoquaux 6
- 9. 1 Understanding the data, between human and machine Age 60 26 38 ?? 139 52 86 17 48 Numbers with a meaning A numerical column expresses a quantity, with a corresponding scale... G Varoquaux 6
- 10. 1 Understanding the data, between human and machine Age Name 60 Bono 26 Justin Bieber 38 Giselle Knowles-Carter? 139 Pablo Picasso 52 Céline Dion 86 Léonard Cohen 17 Greta Thunberg 48 Justin Trudeau ? Beyonce A numerical column expresses a quantity, with a corresponding scale... Recognized entries shed light on the numbers G Varoquaux 6
- 11. 1 Understanding the data, between human and machine Age Name Born in Activity 60 Bono Ireland Singer 26 Justin Bieber Canada Singer 38 Giselle Knowles-Carter? USA Singer 139 Pablo Picasso Spain Painter 52 Céline Dion Canada Singer 86 Léonard Cohen Canada Singer 17 Greta Thunberg Sweden Activist 48 Justin Trudeau Sweden Politician ? Beyonce A numerical column expresses a quantity, with a corresponding scale... Recognized entries shed light on the numbers They can be used to bring in additional information (features) G Varoquaux 6
- 12. 1 Understanding the data, between human and machine Age Name Born in Activity 60 Bono Ireland Singer 26 Justin Bieber Canada Singer 38 Giselle Knowles-Carter? USA Singer 139 Pablo Picasso Spain Painter 52 Céline Dion Canada Singer 86 Léonard Cohen Canada Singer 17 Greta Thunberg Sweden Activist 48 Justin Trudeau Sweden Politician ? Beyonce A numerical column expresses a quantity, with a corresponding scale... Recognized entries shed light on the numbers They can be used to bring in additional information (features) And find errors Knowledge representation, relational algebra G Varoquaux 6
- 13. 1 Assembling data, of different natures and sources Age Name Position 60 John Doe Electrician 48 Jane Austen Senior Professor 52 Jack Daniels Professor Position Salary Electrician 35 lizards Professor 13 horses Senior Professor 1 dragon To model the link between age and salary, a join is necessary Databases: To maintain consistency and minimize storage, data are normalized: multiple tables are used to minimize redundancy. Statistics: Needs samples and features: multiple observations of the same kind ⇒ data is denormalized in 1 table Age Name Position Salary Coffees/day 60 John Doe Electrician 35 lizards 2 48 Jane Austen Senior Professor 1 dragon 128 G Varoquaux 7
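The denormalizing join described above can be sketched with pandas (toy data taken from the slide; column names are my own):

```python
import pandas as pd

# Hypothetical mini-tables mirroring the slide's example
people = pd.DataFrame({
    "name": ["John Doe", "Jane Austen", "Jack Daniels"],
    "age": [60, 48, 52],
    "position": ["Electrician", "Senior Professor", "Professor"],
})
salaries = pd.DataFrame({
    "position": ["Electrician", "Professor", "Senior Professor"],
    "salary": ["35 lizards", "13 horses", "1 dragon"],
})

# Denormalize: join the two tables into one samples-by-features table
denormalized = people.merge(salaries, on="position", how="left")
print(denormalized)
```

A `how="left"` join keeps every person even when their position has no salary entry, which surfaces join errors as NAs rather than silently dropping rows.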
- 14. 1 Aggregations – long vs wide tables Person ID Measure type Value 12345 Blood Pressure 139 45673 Sugar Level 113 12345 Heart Rate 71 45673 Blood Pressure 84 Long table Flexible data representation Person ID Blood Pressure Sugar Level Heart Rate 12345 139 NA 71 45673 84 113 NA Wide table Amenable to statistics Long to wide in Pandas: unstack, pivot Also: count coffees per day per person from coffee-machine logs G Varoquaux 8
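The long-to-wide reshape mentioned here can be done with `DataFrame.pivot` (toy data from the slide; column names are assumptions):

```python
import pandas as pd

# Hypothetical long table mirroring the slide's example
long_table = pd.DataFrame({
    "person_id": [12345, 45673, 12345, 45673],
    "measure": ["blood_pressure", "sugar_level", "heart_rate", "blood_pressure"],
    "value": [139, 113, 71, 84],
})

# Long to wide: one row per person, one column per measure type
wide = long_table.pivot(index="person_id", columns="measure", values="value")
print(wide)
```

Measures never recorded for a person come out as NA, which is exactly where the next section on missing values starts.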
- 15. 1 Data wrangling: assembling unfamiliar sources Relational algebra: joins aggregations (# coffees a day) selections (finding the data) Challenges: understanding the data store and domain logic errors in the data (correspondences in names) Age Name Country Position Coffees/day 48 Justin Trudeau Canada Prime minister 3000 NA Gaël Varoquaux NA NA NA G Varoquaux 9
- 16. 1 Data wrangling: assembling unfamiliar sources Relational algebra: joins aggregations (# coffees a day) selections (finding the data) Challenges: understanding the data store and domain logic errors in the data (correspondences in names) In health: Assembling information across large electronic health records systems G Varoquaux 9
- 17. 1 Systematic errors: data require external checks Measurement biases: Volunteer bias More women volunteer in medical studies G Varoquaux 10
- 18. 1 Systematic errors: data require external checks Measurement biases: Volunteer bias More women volunteer in medical studies Selection bias Healthy people seldom go to the hospital (causal inference) G Varoquaux 10
- 19. 1 Systematic errors: data require external checks Measurement biases: Volunteer bias More women volunteer in medical studies Selection bias Healthy people seldom go to the hospital (causal inference) Survival bias Data loss related to the process under study (survival models) G Varoquaux 10
- 20. 1 Systematic errors: data require external checks Measurement biases: Volunteer bias More women volunteer in medical studies Selection bias Healthy people seldom go to the hospital (causal inference) Survival bias Data loss related to the process under study (survival models) Partly addressed by machine-learning models for dataset shift (transfer learning) if you know the bias. Brings us back to understanding the data G Varoquaux 10
- 21. Data-science is much more than fitting a statistical model Data require assembling information Different data sources = different conventions Measurements come with errors and biases These challenges require domain knowledge and data wrangling G Varoquaux 11
- 22. 2 Learning with missing values [Josse... 2019] Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F NA Social Worker IV M 07/16/2007 Police Officer III F 02/05/2007 Police Aide M 01/13/2014 Electrician I M 04/28/2002 Bus Operator M NA Bus Operator F 06/26/2006 Social Worker III F 01/26/2000 Library Assistant I M NA Library Assistant I G Varoquaux 12
- 23. Why doesn’t the #$@! machine learning toolkit work?! Machine learning models need entries in a vector space (or at least a metric space). NA ∉ R More than an implementation problem G Varoquaux 13
- 24. Why doesn’t the #$@! machine learning toolkit work?! Machine learning models need entries in a vector space (or at least a metric space). NA ∉ R More than an implementation problem Categorical entries are discrete anyhow For missing values in categorical variables, create a special category ”missing”. Rest of talk on NA in numerical variables G Varoquaux 13
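The "special category" trick for categorical variables can be sketched with scikit-learn's `SimpleImputer` (toy data; the `"missing"` label is an arbitrary choice):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A categorical column with missing entries
X = np.array([["Singer"], ["Painter"], [np.nan], ["Activist"], [np.nan]],
             dtype=object)

# Treat missingness itself as a category: fill NAs with a dedicated label
imputer = SimpleImputer(strategy="constant", fill_value="missing")
X_imputed = imputer.fit_transform(X)
print(X_imputed.ravel())
```

A downstream one-hot encoder then gives the "missing" category its own column, so the model can learn from missingness directly.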
- 25. 2 Classic statistics points of view Model a) a distribution fθ for the complete data x Model b) a random process gφ occluding entries (mask m) Missing at random situation (MAR) for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019]: for any x′ with observed(x′, mᵢ) = observed(xᵢ, mᵢ), gφ(mᵢ | x′) = gφ(mᵢ | xᵢ) Theorem [Rubin 1976], in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a). G Varoquaux 14
- 26. 2 Classic statistics points of view Model a) a distribution fθ for the complete data x Model b) a random process gφ occluding entries (mask m) Missing at random situation (MAR) for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019] Theorem [Rubin 1976], in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a). Missing Completely at random situation (MCAR) Missingness is independent from data Missing Not at Random situation (MNAR) Missingness not ignorable G Varoquaux 14
- 27. 2 Classic statistics points of view Model a) a distribution fθ for the complete data x Model b) a random process gφ occluding entries (mask m) Missing at random situation (MAR) for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019] Theorem [Rubin 1976], in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a). Missing Completely at random situation (MCAR) Missingness is independent from data Missing Not at Random situation (MNAR) Missingness not ignorable [Figure: scatter plots of the same data complete, under MCAR, and under MNAR] G Varoquaux 14
- 28. 2 Classic statistics points of view Model a) a distribution fθ for the complete data x Model b) a random process gφ occluding entries (mask m) Missing at random situation (MAR) for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019] Theorem [Rubin 1976], in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a). Missing Completely at random situation (MCAR) Missingness is independent from data Missing Not at Random situation (MNAR) Missingness not ignorable [Figure: scatter plots of the same data complete, under MCAR, and under MNAR] But There isn’t always an unobserved value Age of spouse of singles? Machine-learning’s goal is not to maximize likelihoods G Varoquaux 14
- 29. 2 Imputation Fill in information Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F NA –2000 Social Worker IV M 07/16/2007 Police Officer III M 01/13/2014 Electrician I M 04/28/2002 Bus Operator M NA –2012 Bus Operator F 06/26/2006 Social Worker III F 01/26/2000 Library Assistant I M NA –2014 Library Assistant I Large statistical literature Procedures and results focused on in-sample settings How about completing the test set with the train set? What to do with the prediction target y? G Varoquaux 15
- 30. 2 Imputation and prediction with test-time missing values Settings: y = f(x) + ε Theorem [Josse... 2019] f: trained predictor achieving Bayes risk on full data Conditional multiple imputation achieves Bayes risk on test set with missing data (in MAR settings): f*_mult-imput(x̃) = E_{X_m | X_o = x_o}[f(X_m, x_o)] Notations: x̃ ∈ (R ∪ NA)^p: data at hand x_o: observed values x_m: unobserved values G Varoquaux 16
- 31. 2 Imputation procedures that work out of sample Mean imputation special case of univariate imputation Replace NA by the mean of the feature sklearn.impute.SimpleImputer G Varoquaux 17
- 32. 2 Imputation procedures that work out of sample Mean imputation special case of univariate imputation Replace NA by the mean of the feature sklearn.impute.SimpleImputer Conditional imputation Modeling one feature as a function of others Possible implementation: iteratively predict one feature as a function of the others Classic implementations in R: MICE, missforest sklearn.impute.IterativeImputer bad computational scalability G Varoquaux 17
- 33. 2 Imputation procedures that work out of sample Mean imputation special case of univariate imputation Replace NA by the mean of the feature sklearn.impute.SimpleImputer Conditional imputation Modeling one feature as a function of others Possible implementation: iteratively predict one feature as a function of the others Classic implementations in R: MICE, missforest sklearn.impute.IterativeImputer bad computational scalability Classic statistics point of view Mean imputation is disastrous, because it distorts the distribution “Congeniality” conditions: good imputation must preserve data properties used by later analysis steps [Figure: mean imputation collapses the imputed feature onto a single value, distorting the joint distribution] G Varoquaux 17
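Both procedures named above are available in scikit-learn; a minimal sketch on synthetic data (note that `IterativeImputer` is still experimental and must be enabled explicitly):

```python
import numpy as np
from sklearn.impute import SimpleImputer
# IterativeImputer is experimental; its import must be enabled first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
X[::5, 0] = np.nan  # knock out every fifth entry of the first feature

# Univariate: replace each NA by the feature's mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Conditional: iteratively model each feature from the others (MICE-like)
X_iter = IterativeImputer(random_state=0).fit_transform(X)
```

Both imputers are fitted transformers, so the statistics learned on the train set can be applied to a test set, addressing the out-of-sample question raised above.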
- 34. 2 Constant imputation for supervised learning Theorem [Josse... 2019] For a powerful learner (universally consistent) imputing both train and test with the mean of train is consistent ie it converges to the best possible prediction Intuition The learner “recognizes” imputed entries and compensates at test time G Varoquaux 18
- 35. 2 Constant imputation for supervised learning Theorem [Josse... 2019] For a powerful learner (universally consistent) imputing both train and test with the mean of train is consistent ie it converges to the best possible prediction Intuition The learner “recognizes” imputed entries and compensates at test time Constant imputation breaks simple models (eg linear models) [Morvan... 2020] G Varoquaux 18
- 36. 2 Imputation for supervised learning Simulation: MCAR + Gradient boosting [Figure: r² score vs sample size (10² to 10⁴) for mean and iterative imputation; iterative wins at small sample sizes, both converge] Notebook: github – @nprost / supervised missing Conclusions: IterativeImputer is useful for small sample sizes G Varoquaux 19
- 37. 2 Imputation is not enough: predictive missingness Pathological case [Josse... 2019] y depends only on whether data is missing or not eg tax fraud detection theory: MNAR = “Missing Not At Random” Imputing makes prediction impossible Solution Add a missingness indicator: extra feature to predict ...SimpleImputer(add_indicator=True) ...IterativeImputer(add_indicator=True) G Varoquaux 20
- 38. 2 Imputation is not enough: predictive missingness Pathological case [Josse... 2019] y depends only on whether data is missing or not eg tax fraud detection theory: MNAR = “Missing Not At Random” Imputing makes prediction impossible Solution Add a missingness indicator: extra feature to predict ...SimpleImputer(add_indicator=True) ...IterativeImputer(add_indicator=True) Simulation: y depends indirectly on missingness censoring [Figure: r² score vs sample size for mean and iterative imputation, with and without the indicator; indicator variants dominate] Notebook: github – @nprost / supervised missing Adding a mask is crucial Iterative imputation can be detrimental G Varoquaux 20
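A minimal sketch of the `add_indicator` option on toy data: the imputer appends one binary column per feature that contained NAs, so missingness survives imputation as a feature:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan]])

# add_indicator=True appends one binary mask column per feature with NAs,
# letting the downstream model use missingness itself as a signal
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out.shape)  # two imputed features plus two indicator columns
```

The indicator columns come after the imputed features, in the original column order.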
- 39. 2 Tree models with missing values MIA (Missing Incorporated Attribute) [Josse... 2019] [Tree diagram: each split routes missing values down one of the two branches, e.g. “x10 < −1.5? Yes/Missing vs No”, down to a leaf prediction] sklearn.ensemble.HistGradientBoostingClassifier The learner readily handles missing values G Varoquaux 21
- 40. 2 Tree models with missing values (MCAR) Simulation: MCAR + Gradient boosting [Figure: r² score vs sample size for handling NAs inside the trees vs mean and iterative imputation] Notebook: github – @nprost / supervised missing G Varoquaux 22
- 41. 2 Tree models with missing values (censored) Simulation: censoring + Gradient boosting [Figure: r² score vs sample size for handling NAs inside the trees vs mean and iterative imputation, with and without the indicator; inside-trees handling performs best] Notebook: github – @nprost / supervised missing G Varoquaux 23
- 42. 2 Neural networks with missing values Gradient-based optimization of continuous models Difficulty: half-discrete input space (R ∪ NA) Y = β*₁X₁ + β*₂X₂ + β*₀, with cor(X₁, X₂) = 0.5. If X₂ is missing, the coefficient of X₁ should compensate for the missingness of X₂. [Figure: up to 2^d sets of slopes, one per missingness pattern; the effect of X₂ is lost unless accounted for by X₁] G Varoquaux 24
- 43. 2 NeuMiss network: adapted neural architecture [Le Morvan... 2020] Neural networks that approximate optimal predictors (functions of Σ⁻¹). Tailored architecture which learns all slopes jointly G Varoquaux 25
- 44. 2 NeuMiss network: adapted neural architecture [Le Morvan... 2020] Neural networks that approximate optimal predictors (functions of Σ⁻¹). Tailored architecture which learns all slopes jointly [Figure: R² score minus Bayes rate vs number of parameters, comparing deep MLP, wide MLP, and NeuMiss on train and test sets] NeuMiss needs less data G Varoquaux 25
- 45. 2 NeuMiss network: adapted neural architecture [Le Morvan... 2020] Neural networks that approximate optimal predictors (functions of Σ⁻¹). Tailored architecture which learns all slopes jointly [Figure: R² score minus Bayes rate vs number of parameters, comparing deep MLP, wide MLP, and NeuMiss on train and test sets] NeuMiss needs less data Also suitable for MNAR settings G Varoquaux 25
- 46. Learning with missing values Imputation is motivated only in MAR settings Rather than a sophisticated imputation, use a powerful supervised learner sklearn’s HistGradientBoostingClassifier readily models missing values Can work in MNAR settings Different regime from standard statistics G Varoquaux 26
- 47. 3 Machine learning on dirty categories [Cerda... 2018, Cerda and Varoquaux 2020] Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I G Varoquaux 27
- 48. 3 Categorical entries in a statistical model Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I [Illustration: one-hot encoding matrix, one binary column per category: Master Police Officer, Social Worker IV, Police Officer II, ...] One-hot encoding X ∈ Rn×p G Varoquaux 28
- 49. 3 Non-normalized categorical entries in a statistical model Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I Breaks OneHotEncoder: Overlapping categories “Master Police Officer”, “Police Officer III”, “Police Officer II”... High cardinality 400 unique entries in 10 000 rows Rare categories Only 1 “Architect III” New categories in test set G Varoquaux 29
- 50. 3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001] High-cardinality categories Represent each category by the average target y Police Officer II → average salary of Police Officer II [Figure: employee salary distributions per position, from Crossing Guard up to Manager III] G Varoquaux 30
- 51. 3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001] High-cardinality categories Represent each category by the average target y Police Officer II → average salary of Police Officer II [Figure: employee salary distributions per position, from Crossing Guard up to Manager III] Embedding nearby categories with similar y can help build a simple decision function. G Varoquaux 30
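The core of target encoding is a one-line groupby (toy data, my own column names; in real use the means should be computed on the train set only, with smoothing for rare categories as in [Micci-Barreca 2001], to avoid target leakage):

```python
import pandas as pd

# Hypothetical salary data with a high-cardinality category
df = pd.DataFrame({
    "position": ["Officer", "Clerk", "Officer", "Manager", "Clerk", "Officer"],
    "salary": [50, 30, 54, 90, 32, 52],
})

# Target encoding: replace each category by the mean target over that category
df["position_encoded"] = df.groupby("position")["salary"].transform("mean")
print(df)
```

Every "Officer" row gets the mean Officer salary, turning an arbitrary string column into a single informative numerical feature.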
- 52. 3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001] High-cardinality categories Represent each category by the average target y Police Officer II → average salary of Police Officer II dirty_cat: dirty-category software: http://dirty-cat.github.io from dirty_cat import TargetEncoder / target_encoder = TargetEncoder() / transformed_values = target_encoder.fit_transform(df) G Varoquaux 30
- 53. 3 Data curation Database normalization Feature engineering Employee Position Title Master Police Officer Social Worker III Police Officer II Social Worker II Police Officer III ⇒ Position Rank Police Officer Master Social Worker III Police Officer II Social Worker II Police Officer III G Varoquaux 31
- 54. 3 Data curation Database normalization Feature engineering Employee Position Title Master Police Officer Social Worker III ... ⇒ Position Rank Police Officer Master Social Worker III ... Merging entities Deduplication & record linkage Output a “clean” database Company name Pfizer Inc. Pfizer Pharmaceuticals LLC Pfizer International LLC Pfizer Limited Pfizer Corporation Hong Kong Limited Pfizer Pharmaceuticals Korea Limited ... Difficult without supervision Potentially suboptimal Pfizer Corporation Hong Kong = ? Pfizer Pharmaceuticals Korea G Varoquaux 31
- 55. 3 Data curation Database normalization Feature engineering Employee Position Title Master Police Officer Social Worker III ... ⇒ Position Rank Police Officer Master Social Worker III ... Merging entities Deduplication & record linkage Output a “clean” database Company name Pfizer Inc. Pfizer Pharmaceuticals LLC ... Hard to make automatic and turn-key Harder than supervised learning G Varoquaux 31
- 56. Our goal: supervised learning on dirty categories The statistical question should inform curation Pfizer Corporation Hong Kong = ? Pfizer Pharmaceuticals Korea G Varoquaux 32
- 57. 3 Adding similarities to one-hot encoding One-hot encoding London Londres Paris Londres 0 1 0 London 1 0 0 Paris 0 0 1 X ∈ Rn×p new categories? link categories? Similarity encoding [Cerda... 2018] London Londres Paris Londres 0.3 1.0 0.0 London 1.0 0.3 0.0 Paris 0.0 0.0 1.0 string distance(Londres, London) G Varoquaux 33
- 58. 3 Some string similarities Levenshtein Number of edits on one string to match the other Jaro d_jaro(s1, s2) = m/(3|s1|) + m/(3|s2|) + (m−t)/(3m) m: number of matching characters t: number of character transpositions n-gram similarity n-gram: group of n consecutive characters, e.g. London → Lon, ond, ndo, ... similarity = #n-grams in common / #n-grams in total G Varoquaux 34
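The n-gram similarity described above fits in a few lines (this sketch uses a Jaccard ratio of 3-gram sets, one plausible reading of "#n-grams in common / #n-grams in total"):

```python
def ngrams(s, n=3):
    """Set of character n-grams of a string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(s1, s2, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    return len(g1 & g2) / len(g1 | g2)

# London -> {Lon, ond, ndo, don}; Londres -> {Lon, ond, ndr, dre, res}
print(ngram_similarity("London", "Londres"))
```

Two shared 3-grams out of seven distinct ones give a similarity of 2/7, the kind of value filled into the similarity-encoding matrix above.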
- 59. 3 Python implementation: dirty_cat dirty_cat: dirty-category software: http://dirty-cat.github.io from dirty_cat import SimilarityEncoder / similarity_encoder = SimilarityEncoder(similarity='ngram') / transformed_values = similarity_encoder.fit_transform(df) G Varoquaux 35
- 60. 3 Dirty categories blow up dimension G Varoquaux 36
- 61. 3 Dirty categories blow up dimension New words in natural language G Varoquaux 36
- 62. 3 Dirty categories blow up dimension New words in natural language X ∈ Rn×p , p is large Statistical problems Computational problems G Varoquaux 36
- 63. 3 Tackling the high cardinality Similarity encoding, one-hot encoding = Prototype methods How to choose a small number of prototypes? G Varoquaux 37
- 64. 3 Tackling the high cardinality Similarity encoding, one-hot encoding = Prototype methods How to choose a small number of prototypes? All training-set ⇒ huge dimensionality Most frequent? Maybe the right prototypes ∉ training set “big cat” “fat cat” “big dog” “fat dog” Estimate prototypes G Varoquaux 37
- 65. 3 Substring information Drug Name alcohol ethyl alcohol isopropyl alcohol polyvinyl alcohol isopropyl alcohol swab 62% ethyl alcohol alcohol 68% alcohol denat benzyl alcohol dehydrated alcohol Employee Position Title Police Aide Master Police Officer Mechanic Technician II Police Officer III Senior Architect Senior Engineer Technician Social Worker III G Varoquaux 38
- 66. 3 Modeling substrings [Cerda and Varoquaux 2020] Model on sub-strings (GaP: Gamma-Poisson factorization) Models strings as a combination of substrings [Illustration: binary substring-occurrence matrix for “police officer”, “pol off”, “polis”, “policeman”, “policier” over character 3-grams such as pol, oli, ice, cer] sklearn.feature_extraction.text CountVectorizer analyzer: 'word', 'char', 'char_wb' HashingVectorizer fast, stateless TfidfVectorizer normalize counts G Varoquaux 39
- 67. 3 Latent category model [Cerda and Varoquaux 2020] Topic model on sub-strings (GaP: Gamma-Poisson factorization) Models strings as a linear combination of substrings [Illustration: the documents × substrings count matrix is factorized into (documents × topics) × (topics × substrings): which substrings form a latent category, and which latent categories make up an entry] G Varoquaux 39
- 68. 3 String models of latent categories [Cerda and Varoquaux 2020] Encodings that extract latent categories [Figure: latent-category activations for job titles from Legislative Analyst II and Bus Operator to Master Police Officer and Police Sergeant] G Varoquaux 40
- 69. 3 String models of latent categories [Cerda and Varoquaux 2020] Inferring plausible feature names [Figure: inferred feature names, word pairs such as “...stant, library”, “...ment, operator”, “...ogram, manager”, “...ction, officer”, for the same job titles] G Varoquaux 40
- 70. 3 Data science with dirty categories [Figure: permutation importances of inferred feature names: Information/Technology/Technologist, Officer/Office/Police, Liquor/Clerk/Store, School/Health/Room, Environmental/Telephone/Capital, Lieutenant/Captain/Chief, Income/Assistance/Compliance, Manager/Management/Property] G Varoquaux 41
- 71. Learning does not require clean entities Model continuous similarities across entries Sub-string models can capture these Requires a powerful statistical model (Gradient-boosted trees) Explainable machine-learning techniques to give insight G Varoquaux 42
- 72. @GaelVaroquaux Machine learning with dirty data What models cannot fit Dirty categories Missing values Understanding and formatting data is unavoidable Master these aspects Powerful machine-learning models can cope with dirtiness - If it is well represented (representing similarities and missingness) - If they have supervision information
- 73. 4 References I P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020. P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, 2018. J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019. M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss networks: differentiable programming for supervised learning with missing values. In Advances in Neural Information Processing Systems 33, 2020. D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27–32, 2001.
- 74. 4 References II M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS, 2020. D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.